Query Languages Tour


#1

After discussing in IRC, I figured it might be helpful to walk people through the landscape for query languages. There are more I’m forgetting or combining into some general topics driven by exemplars, but for the moment, think of this as a sort of a one-man’s history instead of an authoritative survey. And we can always extend the thread with more!

When it comes to graph query languages, there is much scholarly writing and very little consensus on what the right answer is. Everyone, it seems, has proposed one at some time or another. A Google search for [graph query languages] even suggests a couple dozen academic papers as well as those from other projects. If you want a weird afternoon, read a couple of them. You’ll find that everybody has slightly different cases they’re optimizing for. That there are many facets shows a power to graphs, but seriously muddies the waters for someone trying to learn how to work with graph stores. So this post is a walk through some things I’ve seen as I’ve worked through Knowledge Graph, along with some commentary and general notions to help sort this mess out.

The standard

There’s exactly one by an actual standards body: SPARQL. The W3C created a complimentary spec for RDF to query property graphs. They based its look and feel on SQL – complete with the awkwardness therein. There are certainly pros: it’s an actual standard, and it feels the most like querying the underlying triples. It’s just not super abstract. The SPARQL committee got it’s charter in late 2008; the basic ideas were already in place, of course (2005?), but it’s good to know where things are in the history.

SELECT ?y
WHERE {
  /en/humphrey/bogart /film/actor/film ?z.
  ?z /film/performance/film ?y.
}

Other graph stores

Gremlin
If you’ve worked with Cayley, you know Gremlin a bit.

Proper Gremlin is written in Groovy. It started in ~2009? There’s no standard syntax, but there is a supposedly standard set of behavior – the docs have a good many cases. However, I’ve yet to find a conformance test; partly because I don’t think there is one? To wit:

Gremlin is written in Java 8. There are various language variants of Gremlin such as Gremlin-Groovy (packaged with TinkerPop3), Gremlin-Scala, Gremlin-JavaScript, Gremlin-Clojure (known as Ogre), etc. It is best to think of Gremlin as a style of graph traversing that is not bound to a particular programming language per se. Within a programming language familiar to the developer, there is a Gremlin variant that they can use that leverages the idioms of that language.

This sounds promising, until you realize all their alternatives wrap or dial the JVM. There’s no grammar, Gremlin is defined by the way Gremlin works.

At minimum, a programming language providing a Gremlin implementation must support function chaining (with lambdas/anonymous functions being a “nice to have” if the variant wishes to offer arbitrary computations beyond the provided Gremlin steps).

Which, hey, we do.

Yet, I rather like the imperative style of Gremlin traversals. It’s why I wrote a Gremlin-inspired embedded set of methods.

Our query, above, is roughly:

g.V("/en/humphrey_bogart").Out("/film/actor/film").Out("/film/performance/film").All()

This also allows us to use various late-binding tricks. Eg, morphisms to evaluate later:

starred_in = g.M().Out("/film/actor/film").Out("/film/performance/film")
g.V("/en/humphrey_bogart").Follow(starred_in).All()

We could do more. We could also go through and try to really align our Gremlin to the Tinkerpop traversal docs. I’ve tried to keep it pretty close though. It’s hard when it’s not testable.

Cypher

From Neo4j, came about in 2012/2013; so it’s still relatively new. They released a grammar and test suite, openCypher sometime in early 2015.

(Personal rant: I mean, yay opening up the grammar, that’s something you’d want from an actual project. But it’s more or less Neo Technologies dumping a “standard” out they’d already built in-house, and expecting others to follow. And the Cypher Language Group consists entirely of Neo employees. Hardly a community effort.)

Still, let’s evaluate our sample query, as close to how the examples have it:

MATCH (hb)-[:/film/actor/film]->(cvt)-[:/film/performance/film]->(y)
WHERE hb.name = "Humphrey Bogart"
RETURN y

One impedence mismatch, Cypher (and Neo4j) assume you have KVs attached to everything (hb.name = "Humphrey Bogart"). This is both helpful and not. It makes it fairly non-RDF, and creates a number of problems for beginners, whose first instinct is to put a lot of data in as KVs – where you then can never join with it.

One plus – everything requires a tag, and the return statement allows multiple explicit tagged outputs. Another plus, the literals can be extracted out from the query shape (though var hb = accomplishes this in Cayley today)

To me, it feels a little bit heavy on the executable ASCII art, but that does make it a little more friendly-looking?

It’s not impossible. If we encode KVs on nodes as edges in their own right, that kind of works. You’ll see MQL wanted a little bit of this feel too.

Metaweb

When Metaweb was getting started, there was graphd. It had a nigh-incomprehensible query language, GQL, that, the theory went, was easy for machines to compile to. As a separate service, we ran MQL, which compiled user and client queries into GQL and parsed the result back out again into . For reference, I joined in early 2008. MQL is an approximately late 2006 into 2007 creation and the query language to graphd predates that a bit.

GQL

GQL was a mess. I can still write it. It was highly tied to the data model, and went something like this:

read (result=(guid) (<-right typeguid=12344934934249999923 left->(guid=0123000234234))

Where the constants are 128 bit hexadecimals written in ASCII. This is what you’d write for the equivalent to:

g.V(id1).Out(pred1).All()

where id1’s generated ID is 0123… and pred1’s is 12344…

To be fair – there are worse things that you could need to generate. Only a few keywords, but debugging it was a pain. And the result structure was nested lists of IDs and s-expressions, and sometimes values. One of the bigger pains was roundtrips and caching to merely translate, eg, id1 into its assigned hexadecimal.

MQL

Anyway, to actually make something for developers, there was the Metaweb Query Language, MQL. The idea was to leverage JSON – an uncertain bet, at the time, when JSON was still not quite a de-facto lingua franca. JSON, combined with judicious semantics around objects, nulls, keys, and values

It ruled and it sucked:

  1. It got documented, sort of. Fairly well, good getting started guides, people could use it and follow the docs, but some of the exact semantics got lost to oblivion. The only true implementation of MQL that could ever exist was MQL itself. Incidentally, the MQL I’ve written for Cayley is a minimal, definitely not full but at least well-defined MQL clone.

  2. It did a lot of schema lookups. This was great for discoverability, but ended up being a pain for productionizing. If you knew the Freebase schema, it would infer things for you. Eg:

[{
  "id": null,
  "name": "Humphrey Bogart",
  "type" : "/film/actor",
  "film": null
}]

Because you specified the type of /film/actor it would add that to the search path when resolving the other keys. id was special (a good idea), /type/object was always at the end of the search path. so the /type/object/name predicate for the key name would get auto-resolved, as would type (the predicate was /type/object/type) but that predicate was given special resolving semantics (IE: add it to the search path) – so you start to see why the exact semantics get lost to time.

Oh, and each of those schema lookups? Need to be resolved to hex under the hood, but fortunately those could be fairly aggressively cached.

Because the semantics were unclear, people tended to just write out the exact predicates to disallow any ambiguity. “Type predicates” vs “No Type predicates” in a graph schema is a bit of an argument that has good arguments on both sides.

  1. Impedence mismatch between queries and returns. Often beginner-friendly, but confusing. The query as written returns a list of things, named Humphrey Bogart, and exactly one film. IE:
[{
  "id": "/en/humphrey_bogart",
  "name": "Humphrey Bogart",
  "type": "/film/actor",
  "film": "The Big Sleep"
}]

The notion of mirroring the output to the input is clever, but it can also be a headache. Consider that:

  • The null didn’t resolve to an id, but rather the /type/object/name attached to the target. Oh, unless it didn’t have a /type/object/name. Then it resolved to the id.
  • I only got one random film, why didn’t I get all the films? You needed to replace "film": null with "film": []. Then you’d get a list of name/ids.
  • What if there are multiple Humphrey Bogarts, who are actors in the dataset? If your query is written (as the example is) inside a top-level list, then you’ll get multiple such actors:
[{
  "id" : "/en/humphrey_bogart",
  "name": "Humphrey Bogart",
  ....
 },
 {
  "id": "/en/other_bogart",
  "name": "Humphrey Bogart",
  ....
 }
]

In any order. Unless you provided a magic key – inside the object! – such as:

[{
  "id": null,
  "name": "Humphrey Bogart",
  "type" : "/film/actor",
  "film": null,
  "sort": "id"
}]

Which would sort by the id returned. How this worked when nested required nested sorts… you can start to imagine.

In order to get any sort of consistent output, everything was wrapped in a list, everything had a "sort": "guid" – the unique ID, and another special key – and you never used null except for in the id or name field. The net result being the above query written so as not to break:

[{
  "id": null,
  "name": "Humphrey Bogart",
  "/film/actor/film": [{
    "id": null,
    "name": null,
    "sort": "guid"
   }],
  "sort": "guid"
}]

Don’t forget your correct commas.

In trying to mirror the output, it actually became less useful.

  1. It was (comparatively) slow as sin. Python, parsing JSON, generating bizarre s-expressions, round-trip to the graph, parsing the resulting s-expressions…

  2. Traversals were written inside out. Because developers are lazy and it’s easy to write for x := range result – and because the cardinality of things was split between branches of the tree, most traversals looked reverse:

[{
  "id": null,
  "!/film/performance/film" : {
    "!/film/actor/film: {
        "id": null,
        "name" : "Humphrey Bogart"
    }
  }
}]

If you have reverse properties (another lookup!) built into your graph, you don’t need to write the ! (reversed) predicates, and can use their mirrors:

[{
  "id": null,
  "/film/film/starring" : {
    "/film/performance/actor: {
        "id": null,
        "name" : "Humphrey Bogart"
    }
  }
}]

But the common thing to see was the start of the traversal at the deepest point of the query. Reading inside-out is an odd side-effect.


I want to call out how insightful it was for its day. It did JSON queries before they were cool. A tree template maps onto a graph (barring a few more advanced cases) in a surprisingly useable way. It was clever and fun. You could do a lot with a little.

There are some clear parallels to what I know about GraphQL, incidentally; a tree-structured template in a language that looks designer-friendly smells a lot like MQL. I’m not sure how to fully reconcile the ideas, though. Open question.

MQL was being rewritten from the ground-up when we were acquired.

Google-era

There were a number of mini languages that came up once inside Google. Many were purpose-based. Explaining all of them would be tiresome and ultimately useless. Let me highlight two more interesting ones.

MLL

Cayley wasn’t the first “let’s make an open-source graphd” idea inside of Google. Someone hacked on something similar before I did. In Haskell, so it was hard to get people into it. As far as I can tell, it never saw the light of day. I want to give credit though; they also designed a query language that may be one of the most pretty. The timeframe in question is early 2011.

The short version is, if you read the scholarly literature as above, those folks love Datalog and variants thereon. Its problems tend to be it being more academic than practical, but MLL (a Metaweb Logic Language, get it?) was the first time I got to play with a custom Datalog atop Freebase data. I can’t remember exactly, but it went something like this:

> /type/object/name(X, "Humphrey Bogart").
X = "/en/humphrey_bogart"
> /starred_in(X, Y) := /film/actor/film(X, Z), /film/performance/film(Z, Y).
> /starred_in("/en/humphrey_bogart", Y).
Y = "/en/the_big_sleep"
Y = "/en/casablanca_1942"
....

Every predicate is a function, and every function a virtual predicate. Triples in a graph are exactly logic assertions. It’s rather cool from a meta-perspective, but then, the rules and queries start to look a bit more like the comparative SPARQL over time:

SELECT ?y
WHERE {
  /en/humphrey/bogart /film/actor/film ?z.
  ?z /film/performance/film ?y.
}

Being aware of the Datalog-inspired side of the world is useful.

PathQuery

There was a design for a full language within the big G, starting circa late 2012. I’m not sure where it ended up, but it was supposed to be a big thing. I had some issues with the design after trying to actually implement it for other graph stores (ie, Cayley), but the freight train was already barreling along and they didn’t want me to throw a spec into the ring (understandably; see my ire for Cypher). I wouldn’t repeat the design exactly, but there’s no doubt it’s inspiration in my head.

Without giving lots away, the key insight to it I want to call out, and the one that I think has legs, is this: Suppose traversals – eg .Out(foo) – are first-class in the language, along with node literals. What language might you design?

"/en/humphrey_bogart".name
// Returns the literal "Humphrey Bogart"
"/en/humphrey_bogart"./film/actor/film./film/performance/film
// Returns the set of literals ["/en/the_big_sleep", "/en/casablanca_1942", ...]

Incidentally, some of these ideas are wrapped around JS semantics in current Cayley GremlinJS-ness. Dealing with what Cayley calls tags, and intermediate values, and functions, and literal-to-traversal conversion, and recursion, and on and on aren’t (and weren’t) 100% clear. But it’s really a nifty thing to talk about

/starred_in = /film/actor/film./film/performance/film

"/en/humphrey_bogart"./starred_in
// Returns the set of literals ["/en/the_big_sleep", "/en/casablanca_1942", ...]

now that you’ve seen a bit of Datalog and Gremlin and MQL.

TLDR

There’s a lot out there?

Personally, I’d like to talk about:
a) Implementing SPARQL. It is the (slow moving, sleepy, and you thought Cayley looked dead) standard. But it has conformance tests and is portable.
b) Long term, creating a true community language designed by engineers with academic references and requests for external comment, from best practices acheived in building and using a graph database? One with a formal grammar and well-defined acceptance testing – load these triples, run these queries, get these results. There are multiple projects we could reach out to to work with us.


Sketching a new graph query language
#2

That was an awesome writeup, I know you’ve talked with this developer of the Badwolf project: https://github.com/google/badwolf

Any more thoughts that might be added to the above on it?


#3

Hello,

Thanks for creating this project. I wonder if you have any plan for SPARQL implementation. If you do, could you suggest some timeline? I would like to use it but I already have some queries that would like to use.

Also, is SPARQL->Gremlin convertible always?

Thanks!


Cayley Query coverage (support Gremlin Table?)
#4

Today I was searching for different graph query languages and I came across this GQL Manifesto. Is Cayley interested in being part of that?