Generating UUIDs


#1

I have a few questions regarding UUIDs:

  • Do I need UUID for each part of the quad (subject, predicate, and object) ?
  • Which UUID version is recommended?
  • What Go library is recommended? I heard this one is not a bad choice.

After figuring all the above I’ll be happy to add it to the examples folder. I assume others will have similar questions.


#2

I prefer this:


https://docs.mongodb.com/manual/reference/method/ObjectId/

It’s how MongoDB generates unique identifiers. Tried and tested with the bonus of embedding the creation time inside it.
I even think if Cayley in future automatically provides this functionality, it should use this approach.


#3

No; that defeats a lot of the purpose. UUIDs should be made for unique entities, and even then they might have useful IRIs, eg https://rdf.cayley.io/subgraphA/2347908234af292f...

Schema (the predicates) should be well-formed and well-named IRIs.

Objects are other related concepts (perhaps a UUID) or literals.

Most graph-y sorts of situations look like:

https://rdf.cayley.io/subgraphA/2347908234af292f rdf:name "oren"
https://rdf.cayley.io/subgraphA/2347908234af292f foaf:knows https://rdf.cayley.io/subgraphA/7ab3c34734309673
https://rdf.cayley.io/subgraphA/7ab3c34734309673 rdf:name "barakmich"

– for this simple graph I generated some UUIDs by smashing the keyboard in a vaguely hex fashion :slight_smile: These are perfectly reasonable, as long as subgraphA is willing to maintain their lifecycle.

Eh just crypto/sha1 and encoding/hex ought to do the trick. Honestly any sufficiently entropic hex string can be what we use; a SHA of something useful (datetime + mac address?) ought to do the trick. If you prefer, you can even just make them incrementing numbers.
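
A minimal sketch of that suggestion, using only `crypto/sha1` and `encoding/hex`. The function name `newID`, the seed format, and the placeholder MAC address are assumptions made for illustration. Two calls with the same timestamp and node string yield the same ID, so a real generator would probably mix in a counter or some random bytes as well.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"time"
)

// newID hashes a timestamp together with a node-specific string (for example
// a MAC address) and returns the 40-character hex digest.
// Both inputs are illustrative; any sufficiently unique seed would do.
func newID(unixNano int64, node string) string {
	sum := sha1.Sum([]byte(fmt.Sprintf("%d|%s", unixNano, node)))
	return hex.EncodeToString(sum[:])
}

func main() {
	// "00:11:22:33:44:55" is a placeholder, not a real MAC address.
	id := newID(time.Now().UnixNano(), "00:11:22:33:44:55")
	fmt.Println("https://rdf.cayley.io/subgraphA/" + id)
}
```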

Perhaps UUID is an overzealous standard. It need not be universal, just unique to the graph’s namespace (eg, its IRI should be unique to the subgraph)

I’d rather not do it the Mongo way if Cayley is generating them, even if you append it to the path. There’s nothing, per-se, wrong with it (it’s 12 bytes, less some entropy) but I’d rather not have any comparison. These are entities, not documents.

That said, within your own domain, you can make use of this fact, and have something like:

https://rdf.cayley.io/subgraphA/mongo/502838296713 rdf:name "some-document"

Where all entities under /mongo also utilize the mongo IDs, or something.


#4

As others have said, a UUID for each part would make zero sense. Consider the way you represent data in a table-oriented SQL database:

{ID}, {name}, {birthday}, {favorite_color}

— this allows you to change name, favorite_color or birthday just by referencing the {ID} which will have no reason to change.
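
Carried over to a graph, the same idea might look like the sketch below: one ID per entity, reused as the subject of every quad about it. The `Quad` struct and `describe` function are local stand-ins written for this example, not Cayley's own types, and the birthday and color values are made up.

```go
package main

import "fmt"

// Quad is a local stand-in type for illustration, not Cayley's quad type.
type Quad struct {
	Subject, Predicate, Object string
}

// describe returns one quad per attribute, all sharing the same subject ID.
// Changing an attribute means replacing one quad; the ID never changes.
func describe(id, name, birthday, favoriteColor string) []Quad {
	return []Quad{
		{id, "name", name},
		{id, "birthday", birthday},
		{id, "favorite_color", favoriteColor},
	}
}

func main() {
	id := "https://rdf.cayley.io/subgraphA/2347908234af292f"
	// The birthday and color are placeholder values.
	for _, q := range describe(id, "oren", "1980-01-01", "blue") {
		fmt.Printf("%s %s %q\n", q.Subject, q.Predicate, q.Object)
	}
}
```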

It really doesn’t matter in most cases, unless you are generating very large numbers of UUIDs and the generation shows up as a blocker in your pprof output, which can happen due to various edge cases (randomness exhaustion, etc).


#5

@barakmich I’m not sure why the fact that it is used by MongoDB is a ‘negative’.

I’m just saying their technique works and we can use their ‘approach’. There is nothing ‘proprietary’ about it in the sense that they aren’t claiming their technique is synonymous with their Database, hence Cayley using it, automatically does not suggest that Cayley copied MongoDB.

The fact that Mongo uses “documents” and Cayley uses “entities” is just semantics and is irrelevant. If Cayley does use something similar to Mongo’s ObjectId, their technique is so general, no-one would even ‘assume’ Cayley copied. Cayley doesn’t even need to reference MongoDB or “ObjectId”.

Now there is 1 (semi-)negative. You mentioned that UUID is an overzealous standard and the identifier generated need not be universally unique. MongoDB’s technique is designed to be universal because from the ground-up the database was designed to scale horizontally (via built-in sharding). That means if a particular server goes down temporarily, they still need to create a universally unique id without a clash. They can’t rely on the different servers synchronising their unique ids.
MongoDB’s technique is ‘overkill’ compared to the ‘lighter’ standard of just being unique to the graph’s namespace.
However, it is hardly computationally prohibitive and at the end of the day it typically produces a 24 character hex string. If in the unlikely event Cayley wants to support built-in sharding in the future, then this is a nice candidate for a unique id generator that will make sharding less painful.

Now the main positive is it is tried and tested. Hundreds of thousands of MongoDB databases with billions of unique entries (i.e. ‘documents’) have been created with no clashes.

Another lesser positive is that the Object Id can be created client side OR server side.

An even lesser bonus is that it incorporates the creation time automatically so you don’t need to store the “created_at” time in the graph database as a separate entity. You get it for free (unless you want to query on it)

@oren:
The 12-byte ObjectId value consists of:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id,
and a 3-byte counter, starting with a random value.

This 12 byte Object Id is usually represented by a 24 character hex string.
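
The layout above can be sketched in Go roughly as follows. This is not MongoDB's driver code: `newObjectID` and `createdAt` are hypothetical names, and the 3-byte machine identifier is simply random bytes chosen at startup, which is an assumption made to keep the example self-contained.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"os"
	"sync/atomic"
	"time"
)

var (
	machineID [3]byte // assumption: random per process instead of host-derived
	counter   uint32  // only the low 3 bytes are used
)

func init() {
	var buf [4]byte
	if _, err := rand.Read(machineID[:]); err != nil {
		panic(err)
	}
	if _, err := rand.Read(buf[:]); err != nil {
		panic(err)
	}
	counter = binary.BigEndian.Uint32(buf[:]) // counter starts at a random value
}

// newObjectID builds a 12-byte ID: 4-byte seconds, 3-byte machine id,
// 2-byte process id, 3-byte counter, returned as 24 hex characters.
func newObjectID(unixSeconds uint32) string {
	var b [12]byte
	binary.BigEndian.PutUint32(b[0:4], unixSeconds)
	copy(b[4:7], machineID[:])
	binary.BigEndian.PutUint16(b[7:9], uint16(os.Getpid()))
	c := atomic.AddUint32(&counter, 1)
	b[9], b[10], b[11] = byte(c>>16), byte(c>>8), byte(c)
	return hex.EncodeToString(b[:])
}

// createdAt recovers the creation time (seconds since the Unix epoch)
// embedded in the first 4 bytes of the ID.
func createdAt(id string) (uint32, error) {
	raw, err := hex.DecodeString(id)
	if err != nil || len(raw) != 12 {
		return 0, fmt.Errorf("invalid id %q", id)
	}
	return binary.BigEndian.Uint32(raw[0:4]), nil
}

func main() {
	id := newObjectID(uint32(time.Now().Unix()))
	secs, _ := createdAt(id)
	fmt.Println(id, time.Unix(int64(secs), 0).UTC())
}
```

Reading the timestamp back out with `createdAt` is the "free" creation time mentioned above.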


#6

Perhaps, but I think it makes a sane default. If you have performance or size reasons to avoid it – by all means go in eyes wide open. But, as a general recommendation, as a starting point, I believe it is the safest and sanest one.

I have been working with “middle-sized” datasets for years (high TBs, low PBs)… and nearly every time I didn’t go with a UUID, I grew to regret it as the data became more linked over time. Now, your subgraph example feels “universal enough”, but as a default.


#7

Or you could use an actual standard instead of whatever Mongo came up with. And SHA and RFC4122 have been tried and tested orders of magnitude more than Mongo IDs. I honestly don’t care which standard, but a standard seems like the right choice. And yeah, as @robertmeta suggests, a full UUID is a safe, sane default. One nice thing about the package @oren linked is it generates these per RFC4122. So, yeah, we can do that.
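
For reference, a random (version 4) RFC 4122 UUID can be produced with the standard library alone. The sketch below is a stand-in to show what such a package does under the hood, not the code of the package mentioned above; the function name `newUUIDv4` is made up for this example.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random RFC 4122 version 4 UUID in its canonical
// 36-character form (8-4-4-4-12 hex digits).
func newUUIDv4() (string, error) {
	var u [16]byte
	if _, err := rand.Read(u[:]); err != nil {
		return "", err
	}
	u[6] = (u[6] & 0x0f) | 0x40 // set the version field to 4
	u[8] = (u[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits (10xx)
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println("https://rdf.cayley.io/subgraphA/" + id)
}
```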

If we’re feeling saucy we can do the same trick Freebase did and Base64 encode them (or similar) – this is where http://rdf.freebase.com/ns/m.02mjmr came from (incrementing numbers, Base64-esque encode. This one is Barack Obama’s mid (burned in my memory))
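
A rough sketch of the "incrementing number, compact encoding" idea. This is not Freebase's actual algorithm: the 32-character alphabet, the `m.` prefix handling, and the names `encode` and `mint` are all assumptions picked for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// alphabet is an assumed 32-symbol, URL-safe alphabet (digits, consonants
// and underscore); leaving out vowels avoids spelling accidental words.
const alphabet = "0123456789bcdfghjklmnpqrstvwxyz_"

// encode writes n in base 32 using the alphabet above.
func encode(n uint64) string {
	if n == 0 {
		return "0"
	}
	var buf []byte
	for n > 0 {
		buf = append(buf, alphabet[n%32])
		n /= 32
	}
	// Digits were produced least significant first, so reverse them.
	for i, j := 0, len(buf)-1; i < j; i, j = i+1, j-1 {
		buf[i], buf[j] = buf[j], buf[i]
	}
	return string(buf)
}

var next uint64 // in a real system this counter would need to be persisted

// mint returns the next ID under the given namespace.
func mint(namespace string) string {
	return namespace + "m." + encode(atomic.AddUint64(&next, 1))
}

func main() {
	fmt.Println(mint("https://rdf.cayley.io/subgraphA/"))
}
```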

Are you arguing for the concept of primary keys? If you are, you’re right. An entity is a primary key, as is a Mongo ID. What you call it is indeed irrelevant.

However, Mongo wants people to think in documents, IE, JSON blobs, and that’s not what we’re up to here. Though there is a transformation from a JSON blob to a graph (though, interestingly, not the reverse), it’s not a transform to be taken lightly (for schema reasons).

The major difference is this: these IDs are meant to be shared. You should be able to publish a long-lived, generated, IRI out to the web, and have other people on the web reference that IRI, even if only to say “hey, IRI-A in my dataset is exactly entity IRI-B in this other data set”, which is the goal of linked open data. Whether they have any meaning (eg, timestamps) beyond that is completely irrelevant. You can, in your graph, imbue debug meaning in them, but it’s opaque in the eyes of the web. UUIDs by default make that the norm.


#8

Great discussion! Some of this should probably be part of the official documentation about modeling.


#9

Also



They are short and based on a salt, which could be helpful in graph relations.


#10

hashids look very similar to how MIDs were generated, FWIW

Whatever works. I’m pro-standard. But for creating IDs, we can have it take an interface similar to hash.Hash – and use whatever generator you like. Yes, even Mongo-esque if you wish. It’s your graph, it’s your rules. Just be consistent and do the right things – which is the bigger discussion here, documenting the right things.
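
One way such a pluggable generator could look. Everything here is hypothetical: `IDGenerator`, `RandomHex`, `Sequential`, and `mintIRI` are names invented for this sketch and are not part of Cayley's API.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"strconv"
	"sync/atomic"
)

// IDGenerator is a hypothetical interface: anything that can hand out a
// fresh identifier string can be plugged in.
type IDGenerator interface {
	NewID() string
}

// RandomHex generates IDs from Bytes random bytes, hex encoded.
type RandomHex struct{ Bytes int }

func (g RandomHex) NewID() string {
	b := make([]byte, g.Bytes)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// Sequential generates incrementing decimal numbers, starting at 1.
type Sequential struct{ n uint64 }

func (g *Sequential) NewID() string {
	return strconv.FormatUint(atomic.AddUint64(&g.n, 1), 10)
}

// mintIRI appends a freshly generated ID to the namespace.
func mintIRI(namespace string, g IDGenerator) string {
	return namespace + g.NewID()
}

func main() {
	ns := "https://rdf.cayley.io/subgraphA/"
	fmt.Println(mintIRI(ns, RandomHex{Bytes: 16}))
	fmt.Println(mintIRI(ns, &Sequential{}))
}
```

Whichever generator is chosen, the point from the discussion still applies: use the same one consistently within a namespace.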


#11

hashid is for ‘obfuscating’ an actual id to something else when you need to expose it to the general public. It allows you to obfuscate the unique id you are using in your database to a nice string and then un-obfuscate it back when you want the original id again.

It won’t generate a unique id.


#12

Just one quick note about an alternative UUID approach. The new version is called Universally Unique Lexicographically Sortable Identifier (ULID). It started as a JavaScript library, https://github.com/alizain/ulid, and was ported to Go at https://github.com/oklog/ulid.

A comparison of GUID/UUID and ULID:

A GUID/UUID can be suboptimal for many use-cases because:

  • It isn’t the most character efficient way of encoding 128 bits
  • UUID v1/v2 is impractical in many environments, as it requires access to a unique, stable MAC address
  • UUID v3/v5 requires a unique seed and produces randomly distributed IDs, which can cause fragmentation in many data structures
  • UUID v4 provides no other information than randomness which can cause fragmentation in many data structures

A ULID however:

  • Is compatible with UUID/GUID’s
  • 1.21e+24 unique ULIDs per millisecond (1,208,925,819,614,629,174,706,176 to be exact)
  • Lexicographically sortable
  • Canonically encoded as a 26 character string, as opposed to the 36 character UUID
  • Uses Crockford’s base32 for better efficiency and readability (5 bits per character)
  • Case insensitive
  • No special characters (URL safe)
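
To make the list above concrete, here is a standard-library sketch of the ULID layout: a 48-bit millisecond timestamp followed by 80 random bits (hence the 2^80, about 1.21e+24, IDs per millisecond), encoded as 26 Crockford base32 characters. It is not the oklog/ulid implementation and skips that library's monotonic options; `newULID` and `encode` are names made up for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"time"
)

// crockford is Crockford's base32 alphabet (no I, L, O or U).
const crockford = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

// encode treats the 16 bytes as one 128-bit big-endian number and emits
// 26 base32 characters, peeling off 5 bits at a time from the low end.
func encode(b [16]byte) string {
	hi := binary.BigEndian.Uint64(b[0:8])
	lo := binary.BigEndian.Uint64(b[8:16])
	out := make([]byte, 26)
	for i := 25; i >= 0; i-- {
		out[i] = crockford[lo&31]
		lo = lo>>5 | hi<<59 // shift the 128-bit value right by 5 bits
		hi >>= 5
	}
	return string(out)
}

// newULID combines a 48-bit millisecond timestamp with 80 random bits.
func newULID(unixMilli uint64) (string, error) {
	var b [16]byte
	for i := 0; i < 6; i++ {
		b[i] = byte(unixMilli >> uint(40-8*i)) // big-endian timestamp
	}
	if _, err := rand.Read(b[6:]); err != nil {
		return "", err
	}
	return encode(b), nil
}

func main() {
	ms := uint64(time.Now().UnixNano() / int64(time.Millisecond))
	id, err := newULID(ms)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Because the timestamp occupies the most significant bits and the alphabet is in ascending ASCII order, plain string comparison sorts IDs by creation millisecond.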