Paper Chase: Reification (aka 'putting data about links on links')


#1

Alright, since the list has been a little sleepy, let’s shake things up. I have a mission for you all.

Purpose: It seems fairly obvious that metadata about links (timestamps, sources, authorship, etc.) is a useful thing. However, strictly speaking, this is not in the RDF specs. There are many thoughts on the subject from various sources (academic, other projects, etc.).

This looks and feels like:

  • Metadata on quads
  • Timestamps on quads (eg, github.com/google/badwolf)
  • Hypergraphs (links to links)
  • Neo4j’s notion of properties (KVs) on links
  • Bare nodes, nodes that are triples, etc

It would be good to do something RDF-compatible and, perhaps, interoperable with others. There’s no reason not to share good ideas instead of rolling our own from scratch. While I’m not ruling that out, we should at least get a picture of what exists and adapt or adopt.

Task: Collect links to papers and technical implementations in this thread. Deep understanding need not apply for the moment; let’s just survey the space and what people have done. Then, we can internalize it.

To help, we’re looking for “reification” as it’s called in RDF and academia, as well as hypergraphs, or any keywords you might pull from the examples above.

Goal: Answer this question definitively and with a sound basis for implementing. We hear it all the time. It need not be pretty in RDF, just usefully expressible therein. We can add extensions and filetypes for convenience, if they can be dumped ultimately to RDF.


#2

Interesting paper:

Sourced from Blazegraph: https://wiki.blazegraph.com/wiki/index.php/Reification_Done_Right


#3

While searching for papers, I found that most of them follow the same line of reasoning:

  1. Here is how reification is done in RDF.
  2. This form is too verbose to write.
  3. Let’s invent our own serialization format!

This is not interesting from an implementation perspective. But after reading that part of the RDF spec, I realized that we cannot claim RDF compatibility yet. One example of why:

ex:item10245 ex:weight "2.4"^^xsd:decimal .

# should be interpreted the same way as this:

ex:triple12345   rdf:type        rdf:Statement .
ex:triple12345   rdf:subject     ex:item10245 .
ex:triple12345   rdf:predicate   ex:weight . 
ex:triple12345   rdf:object      "2.4"^^xsd:decimal .
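
The mapping from a plain triple to its reified form can be sketched in a few lines of Go. This is only an illustration: the Triple type and the Reify function are hypothetical names, not part of any existing API, and terms are kept as pre-serialized strings for brevity.

```go
package main

import "fmt"

// Triple is a hypothetical, simplified RDF statement. Terms are stored
// as already-serialized strings to keep the sketch short.
type Triple struct {
	Subject, Predicate, Object string
}

// Reify returns the four statements of the RDF reification vocabulary
// that describe t, using id as the node that stands for the statement.
func Reify(id string, t Triple) []Triple {
	return []Triple{
		{id, "rdf:type", "rdf:Statement"},
		{id, "rdf:subject", t.Subject},
		{id, "rdf:predicate", t.Predicate},
		{id, "rdf:object", t.Object},
	}
}

func main() {
	t := Triple{"ex:item10245", "ex:weight", `"2.4"^^xsd:decimal`}
	for _, s := range Reify("ex:triple12345", t) {
		fmt.Printf("%s %s %s .\n", s.Subject, s.Predicate, s.Object)
	}
}
```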

This also means that queries with these abstract predicates should also work:

// Returns all quads
g.V().Save("<rdf:subject>","s").
   Save("<rdf:predicate>","p").
   Save("<rdf:object>","o").All()

Further, there is an <rdf:value> predicate that works the same way for the node->value relation.
Thus, given a simple quad:

<alice> <follows> <bob> .

we should build this kind of data structure internally (type predicates omitted):

_:n1 <rdf:value> <alice> .
_:n2 <rdf:value> <follows> .
_:n3 <rdf:value> <bob> .

_:q1 <rdf:subject> _:n1 .
_:q1 <rdf:predicate> _:n2 .
_:q1 <rdf:object> _:n3 .
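
A minimal sketch of how such a structure could be built, assuming sequential blank node labels and one shared node per distinct value. The Expander type and its methods are hypothetical and only illustrate the layout above; they are not the existing QuadStore code.

```go
package main

import "fmt"

// Statement is a plain subject/predicate/object record.
type Statement struct{ S, P, O string }

// Expander (hypothetical) rewrites simple quads into the
// node/value/quad layout, reusing one blank node per distinct value.
type Expander struct {
	nodes map[string]string // value -> blank node label
	quads int               // number of quads added so far
	Out   []Statement       // generated statements, in creation order
}

func NewExpander() *Expander {
	return &Expander{nodes: map[string]string{}}
}

// node returns the blank node for v, creating it (and its <rdf:value>
// statement) on first use.
func (e *Expander) node(v string) string {
	if id, ok := e.nodes[v]; ok {
		return id
	}
	id := fmt.Sprintf("_:n%d", len(e.nodes)+1)
	e.nodes[v] = id
	e.Out = append(e.Out, Statement{id, "<rdf:value>", v})
	return id
}

// Add expands one quad and returns the blank node that stands for it.
func (e *Expander) Add(s, p, o string) string {
	sn, pn, on := e.node(s), e.node(p), e.node(o)
	e.quads++
	q := fmt.Sprintf("_:q%d", e.quads)
	e.Out = append(e.Out,
		Statement{q, "<rdf:subject>", sn},
		Statement{q, "<rdf:predicate>", pn},
		Statement{q, "<rdf:object>", on},
	)
	return q
}

func main() {
	e := NewExpander()
	e.Add("<alice>", "<follows>", "<bob>")
	for _, st := range e.Out {
		fmt.Printf("%s %s %s .\n", st.S, st.P, st.O)
	}
}
```

Adding a second quad that reuses <alice> or <follows> would only add the three quad statements, since the value nodes already exist.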

Given all the above, I want to discuss a few design decisions:

  1. Nodes and quads should each have a separate unique ID that can be used to attach additional metadata not related to the actual values.
  2. HasA and LinksTo iterators are in fact special cases of a Traverse iterator with rdf:subject/… as the Via parameter, in the forward or reverse direction.
  3. A new ValueOf iterator can be introduced to replace the NameOf method on QuadStore. The same goes for the Quad method.

IDs for nodes and quads

We already have a kind of ID for these internally: the hash. Let’s assume for now that we don’t want to introduce a unique intermediate blank node, as in RDF. So, what type should the hash have? I think the same blank node concept is a good fit.

Thus, the first proposed change is to make quad.BNode an interface and replace graph.Value with it. This allows these intermediate values (graph.Value) to be used in queries without calling the NameOf or Quad method on QuadStore. They become well-defined first-class objects without losing the flexibility they were made for.
Most implementations use hashes as IDs, so they may be represented as bnodes like _:n-900150983cd24fb0d6963f7d28e17f72, or with a q- prefix for quads.
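
As a rough sketch of deriving such labels: the example label above is 32 hex digits, which is consistent with an MD5 digest, so MD5 is used here purely for illustration. The hash actually used by a given backend, the separator byte, and the nodeLabel/quadLabel helpers are all assumptions.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// nodeLabel derives a blank node label from a value's hash
// (hypothetical helper; MD5 is an assumption of this sketch).
func nodeLabel(value string) string {
	sum := md5.Sum([]byte(value))
	return "_:n-" + hex.EncodeToString(sum[:])
}

// quadLabel derives a blank node label for a quad from its four parts.
func quadLabel(s, p, o, label string) string {
	h := md5.New()
	for _, part := range []string{s, p, o, label} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator, so ("ab","c") and ("a","bc") hash differently
	}
	return "_:q-" + hex.EncodeToString(h.Sum(nil))
}

func main() {
	fmt.Println(nodeLabel("abc"))
	fmt.Println(quadLabel("<alice>", "<follows>", "<bob>", ""))
}
```

Because the label is a pure function of the content, it stays stable across runs, but it also changes whenever the value changes, which is the limitation the list below is about.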

Now, the reasons why we need intermediate nodes for values and quads, in my opinion:

  1. They can be used to attach metadata. This is a problem right now: we cannot add metadata without affecting the value/quad hash.
  2. Values and quads can be updated. Transactions with delete-insert operations are nice, but they cannot replace values or fix typos efficiently.
  3. One True Graph. Any relation, builtin or not, works the same way as any other, at least from the user’s perspective. SPOL indexes are potentially the same as indexes on any other predicate.
  4. SameAs might be easier with this, because we can attach several values to one node. Not sure if it’s a good idea or not.

There are a lot of things to discuss here, so I’ll cut it short to hear other thoughts on this problem.

Traverse iterator

The LinksTo iterator is used to traverse from nodes to links in a certain direction. In effect, it can be replaced with a Traverse iterator that follows the <rdf:subject> link in reverse, for example. The same is true for the HasA iterator, but in the forward direction. This will not affect the code much right now, but at least the path lib would have to know that Out("<rdf:subject>") should be translated into HasA(it, quad.Subject).
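
The translation step could look roughly like the sketch below. Step, Plan and Translate are hypothetical names invented for this example and do not reflect the actual path lib; the point is only that the three structural predicates map onto the existing iterators, while everything else stays a generic traversal.

```go
package main

import "fmt"

// Direction names the position inside a quad.
type Direction int

const (
	Subject Direction = iota
	Predicate
	Object
)

// Step is one hop of a path: Out(via) when Reverse is false, In(via) otherwise.
type Step struct {
	Via     string
	Reverse bool
}

// Plan names the iterator a step would be compiled to.
type Plan struct {
	Iterator string
	Dir      Direction
}

// structural maps the reification predicates to quad directions.
var structural = map[string]Direction{
	"<rdf:subject>":   Subject,
	"<rdf:predicate>": Predicate,
	"<rdf:object>":    Object,
}

// Translate reports whether the step is structural and, if so, which
// specialized iterator covers it: forward is HasA, reverse is LinksTo.
func Translate(s Step) (Plan, bool) {
	dir, ok := structural[s.Via]
	if !ok {
		return Plan{Iterator: "Traverse"}, false
	}
	if s.Reverse {
		return Plan{Iterator: "LinksTo", Dir: dir}, true
	}
	return Plan{Iterator: "HasA", Dir: dir}, true
}

func main() {
	for _, s := range []Step{
		{Via: "<rdf:subject>"},
		{Via: "<rdf:object>", Reverse: true},
		{Via: "<follows>"},
	} {
		p, special := Translate(s)
		fmt.Println(s.Via, s.Reverse, p.Iterator, special)
	}
}
```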

ValueOf iterator

Right now there is no way for the optimizer to know whether the user just wants to enumerate nodes (like most iterators do) or wants to get a value via NameOf later. This is a possible optimization: introduce an iterator that converts values from graph.Value into quad.Value, hiding the details of how they were retrieved. PG might use inline JOINs in this case; other backends might want to batch NameOf (materialize a page from the sub-iterator and resolve multiple names at once). Also, the path lib needs to know that Out("<rdf:value>") should be translated into ValueOf(it). The same goes for the bnode->quad conversion.
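
The batching variant can be sketched as follows. Everything here is an assumption made for illustration: Ref stands in for graph.Value, BatchResolver is a hypothetical interface a backend might expose, and memStore is a toy in-memory backend that counts lookups.

```go
package main

import "fmt"

// Ref stands in for an opaque backend value (graph.Value in the text).
type Ref int

// BatchResolver is a hypothetical interface: resolve many refs at once.
type BatchResolver interface {
	NamesOf(refs []Ref) []string
}

// ValueOf resolves refs page by page, so the backend sees one lookup
// per page instead of one per ref. Result order matches input order.
func ValueOf(refs []Ref, pageSize int, r BatchResolver) []string {
	if pageSize < 1 {
		pageSize = 1
	}
	out := make([]string, 0, len(refs))
	for start := 0; start < len(refs); start += pageSize {
		end := start + pageSize
		if end > len(refs) {
			end = len(refs)
		}
		out = append(out, r.NamesOf(refs[start:end])...)
	}
	return out
}

// memStore is a toy backend that counts how many lookups it served.
type memStore struct {
	names map[Ref]string
	calls int
}

func (m *memStore) NamesOf(refs []Ref) []string {
	m.calls++
	out := make([]string, len(refs))
	for i, ref := range refs {
		out[i] = m.names[ref]
	}
	return out
}

func main() {
	store := &memStore{names: map[Ref]string{1: "<alice>", 2: "<bob>", 3: "<carol>"}}
	values := ValueOf([]Ref{1, 2, 3}, 2, store)
	fmt.Println(values, "lookups:", store.calls)
}
```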


#4

A practical approach that made sense to me when I first read it is the one described in Linked Data Patterns: reified statements as described in the spec have their place in modelling but are generally costly, so graph annotation with a graph per resource (which I guess is equivalent to “IDs for nodes and quads”), a graph per source, or the more elaborate and interesting graph per aspect pattern can be good pragmatic approaches.
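
To make the graph-per-source idea concrete, here is a small sketch in which the quad label names a graph and metadata is attached to that graph as ordinary quads. The Quad type, the MetadataFor helper and the graph and predicate names are all made up for this example.

```go
package main

import "fmt"

// Quad is a simplified quad; Label names the graph the quad belongs to.
type Quad struct {
	Subject, Predicate, Object, Label string
}

// MetadataFor returns the quads that describe the graph q belongs to,
// that is, quads whose subject is q's label.
func MetadataFor(q Quad, all []Quad) []Quad {
	if q.Label == "" {
		return nil
	}
	var out []Quad
	for _, c := range all {
		if c.Subject == q.Label {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	data := []Quad{
		{"<alice>", "<follows>", "<bob>", "<graph:import1>"},
		{"<graph:import1>", "<source>", `"example import"`, ""},
		{"<graph:import1>", "<created>", `"2016-01-01"`, ""},
	}
	for _, m := range MetadataFor(data[0], data) {
		fmt.Println(m.Predicate, m.Object)
	}
}
```

The trade-off is that the metadata applies to every quad in the graph, so per-quad annotations would need one graph per quad.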

I don’t know whether everything that can be expressed with RDF reification can be expressed with the named graph pattern, or whether this would be useful for implementation, but I thought I’d share anyway :)