Sketching a new graph query language


#1

I mentioned in Query Languages Tour a little about PathQuery, or at least my memory of it. It certainly didn’t work exactly that way – it did some other things – but I thought I’d sketch up a more formal idea of the query language I’ve kinda been daydreaming about for a bit. Note, this is just a sketch. Think about it as a jam session, a brainstorm, a written-down idea. I’m curious what people think of the concept, and welcome feedback.

Premise

One longstanding problems in graph databases is an effective query language that suits many use cases. SPARQL is the standard, Gremlin optimizes for imperative paths, GraphQL (and MQL in its time) aims to make record-based retrieval fast and easy.

There seem to be a couple general strategies for graph queries, often times combined.

  • Paths
    • These queries tend to want to find paths or factoids starting from a subset and ending at another (“movies starring Keanu Reeves”)
    • eg, What Gremlin is best at; declaring a path between the start and end of a traversal
  • Records/Topics
    • These queries tend to want a given subset of information about an entity or set of entities (“give me these facts about Keanu Reeves”)
    • eg, What GraphQL is best at; known shared schema, ask-and-receive
  • Breadth-First Topics
    • As above, without knowing in advance the facts you’re going to receive (“give me all facts about Keanu Reeves”)
    • eg, Topic Pages for Freebase, Wikidata, et al
  • Returning tables (and quads themselves)
    • eg, What SPARQL is good at

From experience inside Cayley, we know a really powerful construct is the general idea of functions which take a set of nodes or links and return a set of nodes or links. We’ve called them Paths a lot, and that seems to work okay as a name. We also know these work for any query language we like; while tagging is a little odd, it generally works to generate all the above cases.

The idea, then, is this: what language would we design that makes Paths first-class objects in the language?

Unary Functions

If we call the Data datatype the union of all the RDF datatypes (strings, bools, floats, IRIs, even Statements), then Paths are these unary functions which take a Set of Data things and return a Set of Data things [Data] -> [Data] (or if you prefer, func([]Data)[]Data )

This starts to get interesting when we consider that all the basic datatypes – say the integer 3, the string "foo" – have a natural unary function:

func(input []Data) out []Data
	if 3 in input:
		return [3]
	return []

This is equivalent to .Is(3) in Gizmo.

This is simple for a lot of the types, but, since we’re sketching a language, we notice that IRIs are really important – they’re the only valid Data type for Subject and Predicates in standard RDF. It’s through them we traverse a graph. So, in the sketch, I’m going to call <foo> the IRI version of the above, .Is(<foo>) and foo alone as another interesting unary function:

func foo(input []Data) out []Data
	out = []
	for x in input:
		for every triple matching (x, <foo>, ?y) in the graph:
			out.append(y)
	return out

Or, simply, .Out(<foo>). We might add sigils, such as !foo to go in the opposite direction.

Composition

So let’s create an environment for this language. Statements always give results, and every new line (for the moment) assumes a fresh look at the graph – that is, the input to any unary function is always the entire graph; every node, every Data type. So the query:

"Keanu Reeves"

Returns the set ["Keanu Reeves"]

name

Returns the set of all strings at the end of the <name> predicate in the entire graph. For the movie set, this is everyone…

["Keanu Reeves", "Sandra Bullock", "Edward Norton".........]

But now we can start to compose things with our first operator, .

</en/keanu_reeves>.name
["Keanu Reeves"]

. takes one of these unary functions and composes it with another. f.g is equivalent to g(f(input))
If we also add assignment, we can get something like:

/film/film/actor = /film/film/starring./film/performance/actor

"The Net".!name./film/film/actor.name

We’re starting to get the feel for it

Unions

Sometimes we may want to apply multiple functions and union their results. For the sketch, let’s adapt () for that purpose:

("Keanu Reeves", "Sandra Bullock").!name.!/film/film/actor.name

Movies starring either Keanu or Sandra – this works out with the idea of unary functions, and can also work for traversal:

</en/keanu_reeves>.(!/film/film/actor, !/film/film/directed_by)

Movies he either directed or starred in

Variables

It may make sense to add tagging, let’s use a sigil for that:

$movie./film/film/actor.name."Keanu Reeves"

We’re using $ to declare a variable which becomes part of the record moving forward. Something perhaps like:

[{"_id": "Keanu Reeves", "movie": "</en/the_matrix>"}...]

You might argue that all output should always be records, eg, all the first above example being:

[{"_id": "Keanu Reeves"}]

Which may make sense – but again, just brainstorming

Records and Intersections

A lot of the time, though, we want to look at things like records. We want to have objects, constrained by something, returning something. We can use {} for both intersection and records. Each path inside the braces is a constraint.

{/film/film/actor.name."Keanu Reeves", /film/film/actor.name."Sandra Bullock"}.name

Things that must star Keanu /and/ Sandra, then the name of that thing, Returning

["The Lake House", "Speed"]

You can create records manually, like this

{ name."Speed", /film/film/actor.name.$actors} // Everyone who was in Speed

But why not make that equivalent, with a little syntactic sugar, to the following:

{ 
name."Speed",
actors: /film/film/actor.name
}

Now we have something that looks a lot more like GraphQL.

And beyond…

From here we can go a little nuts just playing around


==, ~= for regexs or matching

{ 
name ~= "Matrix",
actors: /film/film/actor.name
}
{ 
name == "The Matrix",
actors: /film/film/actor.name
}

TODOs

  • Optional constraints (likely a ‘?’ sigil)
  • NOTs
  • Finding paths between nodes/BFS
  • Queries for edges
  • Aggregation functions: Count/Sum/etc
  • Label/Subgraph

#2

The proposal is interesting, but it’s a bit confusing to see the dot operation compose functions in this way.

For example:

</en/keanu_reeves>.name

This means "find the name of /en/keanu_reeves", but for it to work /en/keanu_reeves should be explicitly defined as a subject-only IRI in the query language. Because in other cases expressions like:

"The Net".!name./film/film/actor.name

might become confusing very quickly. Is /film/film/actor a subject or an object? How the second dot would know to invert subject-object relation while building the query?