Temporal data modelling


#1

Hello everyone,

I’d like to raise a topic that is highly relevant to Cayley, especially given the adoption that projects in this space, such as Dgraph, have achieved in the last couple of years. Big thanks to @barakmich for the write-up on query languages, although one thing missing from that great piece is BQL (Badwolf Query Language), which is somewhat related to the Datalog-inspired language that was mentioned. While doing my due diligence, I discovered that it was none other than @barakmich who initially started the conversation with the Badwolf people, and that at some point a full-blown driver for Cayley was going to be built, or at least some abstraction compatible with iterator trees that would make BQL accessible to Cayley. Sadly, the conversation came to a halt.

Now, one could argue that the killer feature of Badwolf, temporal data modelling, is very underappreciated. One of the primary reasons graph databases appeal to people, I believe, is that they provide a much more intuitive way of thinking about linked, highly interconnected real-life data, and change is a very big part of that. One case I would love to make for something like BQL is tracking user actions, so that easy analytics over the whole user base experience becomes possible. But then you would expect more flexibility in how requests are made. For certain queries, you would need to specify a timeframe and maybe some limits (sampling?), and ideally you would want to group different anchor points by their predicate in the output, so that the result is easier to deal with in the application.

Unfortunately, BQL itself does not support anything sophisticated with respect to “anchors”, the optional timestamps associated with triples. For example, it is not currently possible to query all “versions” of a triple in a certain timeframe, or simply to query the latest one, without involving both a sort and a limit. Hopefully, this will change some time soon. That being said, I’m unaware of a graph-oriented project doing any of this substantially better than Badwolf.

BTW, another cool thing they have is the ability to keep multiple graphs, just like you would have multiple tables in a regular database. You can cross-reference them too. To manage them, simply run CREATE GRAPH or DROP GRAPH. This should be possible in Cayley (quads?), but I don’t see a clear way to do it, with path at least, that wouldn’t involve a lot of wrapper code.
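To make the anchor queries described above a bit more concrete, here is a minimal, self-contained Go sketch of the three operations: restricting triples to a timeframe, grouping the anchor points by predicate, and fetching the latest version in a single pass. It is only an illustration under my own assumptions. The `TemporalTriple` type, the helper names, and the sample data are all hypothetical and are not part of the Badwolf or Cayley APIs.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// TemporalTriple is a hypothetical illustration type: a triple plus an
// optional time anchor. It is NOT the Badwolf or Cayley data model.
type TemporalTriple struct {
	Subject, Predicate, Object string
	Anchor                     time.Time
}

// InWindow returns the triples whose anchor lies in the half-open
// interval [from, to).
func InWindow(ts []TemporalTriple, from, to time.Time) []TemporalTriple {
	var out []TemporalTriple
	for _, t := range ts {
		if !t.Anchor.Before(from) && t.Anchor.Before(to) {
			out = append(out, t)
		}
	}
	return out
}

// GroupByPredicate groups triples by predicate; each group is sorted by
// anchor in ascending order.
func GroupByPredicate(ts []TemporalTriple) map[string][]TemporalTriple {
	groups := make(map[string][]TemporalTriple)
	for _, t := range ts {
		groups[t.Predicate] = append(groups[t.Predicate], t)
	}
	for _, g := range groups {
		sort.Slice(g, func(i, j int) bool { return g[i].Anchor.Before(g[j].Anchor) })
	}
	return groups
}

// Latest returns the most recent triple for a subject and predicate in a
// single pass, without a sort followed by a limit.
func Latest(ts []TemporalTriple, subject, predicate string) (TemporalTriple, bool) {
	var best TemporalTriple
	found := false
	for _, t := range ts {
		if t.Subject != subject || t.Predicate != predicate {
			continue
		}
		if !found || t.Anchor.After(best.Anchor) {
			best, found = t, true
		}
	}
	return best, found
}

// sampleTriples builds made-up user-action data, one event per day.
func sampleTriples() []TemporalTriple {
	day := func(d int) time.Time { return time.Date(2020, 1, d, 0, 0, 0, 0, time.UTC) }
	return []TemporalTriple{
		{"/u/alice", "viewed", "/p/1", day(1)},
		{"/u/alice", "viewed", "/p/2", day(2)},
		{"/u/alice", "bought", "/p/2", day(3)},
		{"/u/bob", "viewed", "/p/1", day(4)},
	}
}

func main() {
	ts := sampleTriples()
	// Keep only events anchored on day 2 or day 3.
	window := InWindow(ts, ts[1].Anchor, ts[3].Anchor)
	for pred, group := range GroupByPredicate(window) {
		fmt.Println(pred, len(group))
	}
	if t, ok := Latest(ts, "/u/alice", "viewed"); ok {
		fmt.Println("latest view:", t.Object, t.Anchor.Format("2006-01-02"))
	}
}
```

In a real store these would of course be pushed down to an index on the anchor rather than done over an in-memory slice; the point is only to show the shape of the queries I have in mind.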

In fact, we, as in the programming community, approach the question of time from completely different angles. I mean, look at consensus protocols, the likes of Blockchain, or its early predecessors. They are basically a series of timestamped state changes, a full temporal representation of the automaton. Now, what does all of this have to do with graphs, and with Cayley in particular? I believe we can simply pull much, much more data seemingly out of nowhere by doing more graph theory on already existing linked data. The only real missing piece is efficient temporal modelling. I’m sure somebody already does this somewhere, and yes, I’m staring right at you, Google, but you also have to agree that it would be incredibly hard to first implement and then integrate this without tailored tools and a specialised domain language, such as BQL. The real question is: can Cayley already do what Badwolf does (multiple graphs, triples with indexed timestamps), so as to make a solid case for BQL integration? And even if, with some effort, you made all of that happen, could Cayley do it well enough? It has been known for some time that Cayley performs badly overall, even on some simple recursive queries. I initially thought that the problem was in the SQL optimizer:

a bug in query optimizer - it won’t let SQL to run a full optimization pass for recursive queries… this will require more work to implement properly - need to add a few more optimizers to SQL.

@dennwc in #765

But that turned out not to be the case, as @iddan found that results come back 100x slower even (sic!) for the in-memory storage. Other players in the field, such as Dgraph or Neo4j, don’t seem to be troubled by this, but of course they enjoy the privilege of doing it all for a single backend, while Cayley has to deal with many. In any case, one beautiful thing about running Postgres is that you can do simple relational modelling, pub/sub, hstore, and, with the help of Cayley, even graphs, all from within a single database.

Can we get more input from the maintainers on the future of Cayley, their plans and ambitions? I can see it benefiting greatly from the aforementioned temporal features, even if they are implemented only in some experimental opt-in setting, and I would certainly love to contribute to that myself! That said, after more than 9 months of casually looking into the project, and even though I am very interested, I still hesitate to commit my efforts to Cayley before starting a conversation first.

Hopefully, some of you guys share my vision.

-badt@veritas.icu


#2

Hey @badt. Thank you for sharing your thoughts. A few comments:

  1. At K Health, I tried to use SPARQL for querying RDF data, but with the SPARQL syntax it was too hard to express very simple concepts that were easy to explain in a classic programming language. I find that Cayley, with its Gremlin-like approach to graph traversal, makes it easier to compose complex queries that SPARQL does not support.
  2. BQL is implemented in Go, as is Cayley, and it has been shown that SPARQL can be compiled to Gremlin. It might be possible to implement BQL over Cayley. I would not find it useful, but maybe you would.
  3. We are working on a next-gen query language called LinkedQL that should reflect the ideas of SPARQL, Gremlin, JSON-LD, and GraphQL. The aim is for LinkedQL to feel as native to JSON-LD as SPARQL feels to Turtle. I would love to get your input on the work we have done so far.
  4. In RDF in general, to save metadata about a triple you can either add a few more triples that link to its subject, predicate, and object, or wrap the triple in a graph and give the graph metadata. Cayley supports the first solution well and partially supports the second solution.
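
As a rough illustration of the two options in point 4, here is a small Go sketch. It assumes a simplified `Quad` struct of my own rather than Cayley's actual quad type, and the helper names (`Reify`, `InGraph`, `Annotate`) and the `recordedAt` predicate are hypothetical. The `rdf:Statement`, `rdf:subject`, `rdf:predicate`, and `rdf:object` terms are the standard RDF reification vocabulary.

```go
package main

import "fmt"

// Quad is a simplified, hypothetical stand-in for a quad:
// subject, predicate, object, plus an optional graph label.
type Quad struct {
	Subject, Predicate, Object, Label string
}

// Reify describes the statement (s, p, o) with extra triples hanging
// off a statement node id, using the RDF reification vocabulary.
func Reify(id, s, p, o string) []Quad {
	return []Quad{
		{id, "rdf:type", "rdf:Statement", ""},
		{id, "rdf:subject", s, ""},
		{id, "rdf:predicate", p, ""},
		{id, "rdf:object", o, ""},
	}
}

// InGraph places the statement in a named graph by setting its label.
func InGraph(graph, s, p, o string) Quad {
	return Quad{s, p, o, graph}
}

// Annotate attaches one piece of metadata to a node, which may be a
// statement node (option 1) or a graph name (option 2).
func Annotate(node, key, value string) Quad {
	return Quad{node, key, value, ""}
}

func main() {
	// Option 1: reify the triple, then annotate the statement node.
	quads := Reify("_:stmt1", "/u/alice", "viewed", "/p/1")
	quads = append(quads, Annotate("_:stmt1", "recordedAt", "2020-01-01T00:00:00Z"))

	// Option 2: wrap the triple in a graph, then annotate the graph.
	quads = append(quads, InGraph("graph:g1", "/u/alice", "viewed", "/p/1"))
	quads = append(quads, Annotate("graph:g1", "recordedAt", "2020-01-01T00:00:00Z"))

	for _, q := range quads {
		fmt.Printf("%s %s %s [%s]\n", q.Subject, q.Predicate, q.Object, q.Label)
	}
}
```

The first option costs four extra triples per annotated statement, while the second costs one label per statement plus the metadata on the graph, which is the trade-off to weigh for timestamp-heavy data.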