Modified GraphQL aimed at Graph DBs

graphql

#1

Hey guys,

Founder of Dgraph here. We’ve been building Dgraph with GraphQL as the primary query language. I’ve been impressed by how easy GraphQL makes building client-side apps, in particular the type system and retrieving graph data as “documents”, and by the fact that it allows retrieval of subgraphs as opposed to lists of things, which is what Cypher and Gremlin focus on.

However, as we gain more experience with GraphQL, we’ve hit the limits of using it as a language for a graph database. For example, it has no support for filtering, sorting, intersections, etc. In addition, the GraphQL type system is more suited to a relational database, and is a bit verbose for a graph database. To tackle these challenges, our implementation of GraphQL has morphed into a modified version, which adds operations and functions on top, and does away with some features.

We plan to release this version as an independent spec in the next few weeks, possibly calling it GraphQL+- for lack of a better term: “plus-minus” because we’ve added some functionality and removed some verbosity.

I reckon that you guys are working on supporting GraphQL as well. So, I thought I might hit you up, and see if there’s any interest in a possible collaboration to build a modified GraphQL for the purposes of a graph database. Note that this won’t necessarily fit with other GraphQL tools, so this might not be an obvious choice; but I think if GraphQL has to work for Cayley and Dgraph, it needs to be modified, and corresponding tools must be built. We’re already on that path.


#2

Is the “doing away with some” part required? It seems to break compatibility, and if you break compatibility, it seems like you should go with another name (other than GraphQL±).

We have bumped into confusion because we took the Gremlin moniker without being actually 1:1 compatible with it, and it is something we are rectifying (renaming to Gizmo) in the near future… no reason for you to duplicate our prior mistakes. :smile:

Also, as you mentioned, tools will have to be built to target this new query language explicitly, and tools built for GraphQL may well fail due to removals.

Hopefully with a nice open license on it!

Well, we too are looking for the “one true” query language. MQL, Cypher, Gremlin and GraphQL all have issues we would love to see improved on, so we will be curious to see the spec you guys release. Any WIP we can poke at?


#3

Sure, we are interested in collaboration, because, as you mention, we also had to do some hacks to GraphQL for it to work well with graphs and RDF. We’ll be glad to discuss a draft of specification when it’s available.

We do not support the GraphQL type system yet, so it will be interesting to hear why it is verbose and which parts you want to omit in the new spec.

Also, I want to mention that we have an additional convention for representing reverse traversal in GraphQL, as well as reverse fields in our schema. At least in older versions of Dgraph reverse traversals were not supported, so the question is: does Dgraph support reverse traversal now, and do you plan to add it to the spec?


#4

Open to suggestions! Though note that there’s a lot of excitement around GraphQL, which would help with adoption of “GraphQL±” but might not help a significantly different name. Also, it’s possible that the existing tools for GraphQL could be adapted a bit to support our modified spec. With that in mind, I want to stay as close to GraphQL as possible, until the path is clear.

Apache Version 2.0 – and we’re thinking of working on a nice ANTLR grammar to go along with it. Our hand-written GraphQL parser has too much code, so we’re experimenting with moving away from that, and using ANTLR (or something else).

Give it a couple of weeks. We’re making a release this week, and will spend the following weeks on bringing the documentation up to speed. You can view some of the older changes here:
https://wiki.dgraph.io/Queries_and_Mutations

Before I mention the parts we omit, maybe I can explain how Dgraph’s type system works. Feel free to skip this section.

We use it the way Go implements interfaces. Just as a Go struct that implements all the functions defined in a Go interface automatically supports that interface, an entity that has all the valid predicates specified in a predefined GraphQL type is automatically of that type. Thus, an entity can be of multiple types, without explicitly specifying which type it is.

This is powerful in the sense that the type system is dynamic and fluid. But it has a flaw: you can’t just say “give me every node of type X”. So, the type system is a way to filter and convert results into clean “documents”.
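
A rough Python sketch of that idea, not Dgraph's actual code: the SCHEMA contents and the helper names types_of and as_document are made up for illustration. An entity counts as a type when it carries every predicate the type lists, and a type can then be used to project the entity into a “document”.

```python
# Hypothetical schema: type name -> predicates an entity must have.
SCHEMA = {
    "Person": {"name", "birthday"},
    "Director": {"name", "director.film"},
}


def types_of(entity, schema=SCHEMA):
    """Return every type whose predicates are all present on the entity."""
    present = set(entity)
    return sorted(t for t, required in schema.items() if required <= present)


def as_document(entity, type_name, schema=SCHEMA):
    """Project an entity onto one type, keeping only that type's predicates."""
    return {p: entity[p] for p in schema[type_name] if p in entity}


node = {
    "name": "Steven Spielberg",
    "birthday": "1946-12-18",
    "director.film": ["0x1", "0x2"],
}
print(types_of(node))               # ['Director', 'Person']
print(as_document(node, "Person"))  # only name and birthday survive
```

Nothing on the node says which type it is, which is also why “every node of type X” has no cheap answer here: you would have to test every entity.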

Now, what do we NOT use

We haven’t found any use for interfaces; in GraphQL they are used just like yet another type. We haven’t yet implemented unions or enums either – not sure how useful they are yet.

What did we extend

  • Scalars: We extended that to include geo information
  • Filters: We added a @filter directive, which is composed of ANDs and ORs, using round brackets.
  • Functions: We use functions to do substring matching, geo proximity search, region covering, equality, inequality, etc. These functions can be used as filters, or just as result generators.
  • Sorting

What we plan to do later

  • Variable assignment to portion of results, which can then be intersected, or merged later. Sort of like select col as c where ....

These are just a few things off the top of my head. When we work on the spec, we’ll have examples and think more about the areas where we disagree with GraphQL and why. For example, any of the things that we don’t implement are up for debate by an ardent user. In other words, I think this spec should ideally be based on real-world use cases, and be minimalistic.

I think we can do reverses, and we will give it serious thought in the later releases (v0.8 and onwards). It’s a bit annoying for the client to have to specify front and back edges all the time.


#5

Hey all,

Here’s the language spec as promised. Most of it is there, we’re making edits as we speak to add more examples.

https://wiki.dgraph.io/Query_Language_Spec

Note that we are open to suggestions. There are many things still missing, like reverses (as mentioned in this thread), cycle detection, shortest path, etc. So, just send your feedback my way, either here or by email: manish@dgraph.io.


#6

@mrjn Thanks! Will check it from time to time and leave comments.

Mutations and IDs

The first thing I noticed is the UID/XID stuff. These are definitely internal details of how unique IDs are implemented in dgraph, and they shouldn’t be in the spec. Yes, I realize that this is dgraph documentation so they need to be there, but if we want to work on a spec, we should distinguish between the two :slight_smile:

I mention it because of the mutation syntax:

mutation {
 set {
  <_new_:class> <student> <_new_:x> .
  <_new_:x> <name> "alice" .
 }
}

N-Quads syntax allows specifying file/request-local IDs called blank nodes (bnodes). This is a standard approach, and I think it would be better to use them to auto-assign internal IDs:

mutation {
 set {
  _:class <student> _:x .
  _:x <name> "alice" .
 }
}

The same might be done for UIDs: _:uid-0x11168064b01135b. And again, the format of these IDs (after uid-) should be implementation-specific.
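
To make the blank-node proposal concrete, here is a minimal Python sketch of how a server might turn request-local labels into internal IDs. The function name assign_uids, the tuple representation of triples, and the hex ID format are all assumptions for illustration; as said above, the real ID format would be implementation-specific.

```python
import itertools


def assign_uids(triples, start=1):
    """Replace each blank node label (_:name) with a freshly allocated ID.

    The same label always maps to the same ID within one request.
    Literals are assumed to be passed with their quotes, so they never
    start with "_:".
    """
    counter = itertools.count(start)
    assigned = {}

    def resolve(term):
        if term.startswith("_:"):
            if term not in assigned:
                assigned[term] = hex(next(counter))
            return assigned[term]
        return term

    resolved = [(resolve(s), p, resolve(o)) for s, p, o in triples]
    return resolved, assigned


triples = [
    ("_:class", "student", "_:x"),
    ("_:x", "name", '"alice"'),
]
resolved, assigned = assign_uids(triples)
print(resolved)  # [('0x1', 'student', '0x2'), ('0x2', 'name', '"alice"')]
print(assigned)  # {'_:class': '0x1', '_:x': '0x2'}
```

The returned mapping is what a mutation response could hand back to the client so it can learn which internal ID each label received.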


#7

That’s a good point. Filed an issue:

We’ll change the way we parse uid as well, so it’s more in line with RDF. Possibly just using <0xid> (starting with zero-x to represent hex), which I think is valid RDF.

Keep them coming. I think it would be great if we can together build a spec that works for both the projects; and is better than existing languages.


#8

Update: Reverses are now part of the spec.

https://wiki.dgraph.io/Query_Language_Spec#Reverse_Edges


#9

@mrjn Thanks for the update!

The draft is still mostly about how dgraph does things, so I still insist on moving it to a separate document. The current one (on the dgraph page) can reference the new page as a spec draft and specify exactly how certain things are done in dgraph in particular. We’ll do the same for Cayley.

A few more comments about the spec:

Mangling

Dgraph has a particular way of mangling IRIs that cannot express all possible IRIs in the wild. We should define a mangling scheme that handles any character allowed by the RDF IRI spec, even if dgraph decides not to specify namespaces for IRIs and not to allow most characters in names.

Mutations

NQuads syntax is too verbose for inserting nested objects, so I think it shouldn’t be the only way to modify data. I propose adding a parameter to mutations that specifies the format in which the data should be interpreted. I am thinking of JSON-LD in particular, since it maps pretty well to the overall GraphQL approach, and it’s mostly compatible with GraphQL output.

As a side note, maybe we can even extend the spec to state that output will always be JSON-LD instead of plain JSON. This should not be underestimated, since web servers would be able to embed parts of the database query output directly into the HTML page to make search engines happy.

Also, NQuads supports a “label” part in a quad. We should define how it affects queries and mutations.

Languages

As I wrote above, the spec is still about the way dgraph does things. I don’t think that different languages should create different predicates. At least the user should not need to know these details. Thus, I propose adding an optional @lang directive that can be specified on any field of type string. Dgraph would just append the language to the predicate name, and Cayley would do something different. In both cases the user would not need to specify the language as part of the predicate name (and JSON field name in the output). I also think that this directive should be inherited, so the user may specify it on the root and everything below will have the same language (or fall back to the default one).

Auto-assign UID

The old comments still need to be addressed, but I understand why it’s still there: users need valid documentation for the current dgraph release. This is one of the reasons why we should make it separate from the dgraph docs :slight_smile:

And, as I wrote in the issue on the dgraph repo, please do not use UIDs (internal graph IDs) with IRI syntax. This will confuse people from the RDF world, and will make new users think that this is a correct approach in RDF (we suffer from this already). These IDs are not stable, so they should be marked as blank nodes (just _:xxxx instead of <_new_:xxxx>).

Also, the returned UID list might be nested for mutations in any format that allows nested structures.

External IDs

These are considered stable and should be IRIs, as they are now. This section is ok.

Deletes

As in any other mutation, the user should be able to specify some condition before removing nodes and properties. This is just a note. I think we may leave transactions undefined for now.

Queries

Example code starts with specifying "me" as a top-level object. We should clarify that this name will affect only JSON output and has no particular meaning.

Pagination

Looks good, except for the "first: -1" part. It might not be the most intuitive way to specify descending order. Or is it descending at all in this case?

Count

In theory, GraphQL will not allow specifying _count_ inside a scalar field, thus I can’t count how many scalars are associated with a certain predicate. This part needs clarifying. We may say that scalar fields in a query will not be validated that way, but it should be considered carefully.

Functions and filters

It looks really weird to specify all filters on the top-level object instead of an inline filter on the particular field. For example:

{
  me(_xid_: m.06pj8) {
    type.object.name
    film.director.film @filter(anyof("type.object.name", "war spies"))  {
      _uid_
      type.object.name
    }
  }
}

May be transformed to:

{
  me(_xid_: m.06pj8) {
    type.object.name
    film.director.film {
      _uid_
      type.object.name @filter(anyof("war spies"))
    }
  }
}

It’s also a bit weird that the current draft allows logical operators like || and &&, but has separate functions like geq instead of >=. We need some consistency here.

Geo-indexing stuff might be outside of the spec for now. It’s more like an extension, I guess.

Sorting

Sorting keywords might be "sort" and maybe something shorter for "orderdesc". Also, the draft says nothing about sorting by multiple properties.

Schema

Scalar types

Ints should not be limited to int32.
Float precision is undefined.
ID type should be IRI, not just string (UTF8).
Why not a single type for time? Encoding might drop time part if it’s zero.
Again, geo stuff might be outside of standard.
UIDs can be changed to something like a node scalar type. It’s a graph database, so why should the user care about having UIDs? :wink:

Schema file

I don’t think a graph database should have a static schema. The schema might be a part of the graph, which is dynamic. We need to define mutations that affect the schema, or make schema changes work like any other mutation. Dgraph might return an error in this case.

Reverse edges

As you may know, @reverse means nothing for us, since we do it anyway, but it’s still a useful hint. Now, here is the problem with the current approach:

First, it’s not very intuitive to see @reverse and read it as “generate reverse index”. In RDF, an object can define reverse properties, so coming from the RDF world, I would assume this statement defines a reverse property, not an additional reverse index for a normal property.

Second, if user defines a name property inside a Person type to have a reverse index, should any other name predicates on other types be also indexed this way? This is the most confusing part. In RDF properties are defined the same way as objects (instances of class), thus if someone says that particular property should have an index, it will propagate to all classes that have this property. And it makes sense.

Additionally, we should make sure that ~ will not collide with mangling rules. As an alternative, we can have @rev or a similar modifier to query properties in reverse. I know that they may collide by name with normal properties, but it’s worth discussing this approach as well.

RDF Types

Soon schema.org might be a de facto standard for RDF, thus we might want to specify scalar types from there as well. They are mostly a one-to-one replacement, but it’s worth mentioning.

Debug root

I see no difference in how a query is executed in debug mode vs normal mode. This means it can be moved up one level in the results JSON. Something like this:

{
  "data": [....], // actual query result
  "debug": {....}
}

And it may be controlled by HTTP query parameter instead of a separate query that does nothing in particular in GraphQL.


#10

Sounds good. Are you guys alright with using the Dgraph wiki for this new page, or should we use a new Github repo for this? The wiki allows easy editing of pages, templates, and the other niceties of well-designed documentation software. Github would provide neutral territory.

Dgraph team hasn’t paid much attention to that so far. The spec could mention the possible characters and how they’re transformed, if needed.

I’ve been thinking about having an easier way to provide mutations as well, but haven’t come across something convincing. How would JSON-LD mutations look? How would the current JSON output be different under JSON-LD?

Yes. Both labels and language would be used as filters to queries. We can elaborate in the spec.

Agreed. That’s something we plan to remove from Dgraph in v0.8 as well. The way it would be provided in queries is predicate@lang, which would return the result in the language specified. Also, name@en:ru:hi would return the name in the first of the specified languages (English, Russian, Hindi) that is available. https://github.com/dgraph-io/dgraph/issues/362
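
As a rough illustration of that fallback rule (a Python sketch only; resolve_lang_field and the shape of the stored values are hypothetical, not how Dgraph stores languages):

```python
def resolve_lang_field(field, values_by_predicate):
    """Resolve a field such as 'name@en:ru:hi' against stored values.

    values_by_predicate maps predicate -> {language tag: string}.
    Returns the value for the first listed language that exists, or None.
    """
    predicate, _, langs = field.partition("@")
    available = values_by_predicate.get(predicate, {})
    for lang in langs.split(":"):
        if lang in available:
            return available[lang]
    return None


stored = {"name": {"ru": "Алиса", "hi": "ऐलिस"}}
print(resolve_lang_field("name@en:ru:hi", stored))  # Алиса (no English value, so Russian wins)
print(resolve_lang_field("name@en", stored))        # None
```

Either way the client writes only the predicate and the language list; how the server keys the values internally stays an implementation detail.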

Agreed. Once we decide on wiki v/s github, we can get started on getting it in shape. You can then log in and start updating it. For the contentious sections, we can have a process in place to resolve those contentions.

That part would remain. Though note that any IRI which starts with 0x, would be considered a UID by Dgraph with the upcoming changes (so we don’t need to provide _uid_:).

It’s a way to specify the last N. Python does something similar with negative indices.
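
For reference, the Python behaviour being compared to looks like this. The paginate helper is only a sketch of the semantics, assuming a positive first means “the first N” and a negative first means “the last N” of an already ordered result list:

```python
def paginate(uids, first):
    """Positive first: the first N results. Negative first: the last N."""
    if first >= 0:
        return uids[:first]
    return uids[first:]


uids = ["0x1", "0x2", "0x3", "0x4", "0x5"]
print(paginate(uids, 2))   # ['0x1', '0x2']
print(paginate(uids, -1))  # ['0x5']
print(paginate(uids, -2))  # ['0x4', '0x5']
```

Note that the results stay in their original order; only the window moves to the end.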

Typically there’s only one scalar value attached to one subject-predicate combo. For e.g., name. We could have multiple languages attached to them; but I doubt there’s any requirement where we need to retrieve the name in all the languages via a query. Therefore, we kept the counting only for outbound edges.

We’ve been considering switching the _count_ to a directive @count. It would be easier to specify and look better in queries.

Those two aren’t the same. We’re doing the filtering at film.director.film, so we only retrieve the films which have “war” or “spies” in their name. Whether we choose to retrieve type.object.name or not is optional. Also, in the former, the _uid_ results would only contain the movies which match the filter. In the latter, the _uid_ results would be all the movies, and only the names would be filtered. Thus, filtering is specified at outbound edges, not leaf/value edges.
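
The difference between the two queries can be sketched in Python like so. The film data, the word-matching rule, and both function names are invented for the example; the point is only which nodes come back in each case.

```python
# Hypothetical films reachable over film.director.film: uid -> name.
FILMS = {
    "0x1": "War Horse",
    "0x2": "Bridge of Spies",
    "0x3": "Jaws",
}
TERMS = {"war", "spies"}


def matches(name, terms):
    """True if any word of the name is one of the terms."""
    return any(word in terms for word in name.lower().split())


def filter_at_edge(films, terms):
    """@filter on film.director.film: non-matching films are dropped."""
    return [
        {"_uid_": uid, "type.object.name": name}
        for uid, name in films.items()
        if matches(name, terms)
    ]


def filter_at_leaf(films, terms):
    """@filter on type.object.name: every film stays, only names are dropped."""
    result = []
    for uid, name in films.items():
        doc = {"_uid_": uid}
        if matches(name, terms):
            doc["type.object.name"] = name
        result.append(doc)
    return result


print(len(filter_at_edge(FILMS, TERMS)))  # 2
print(len(filter_at_leaf(FILMS, TERMS)))  # 3
```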

Hmm… I can see why this is confusing. Though, we treat geq, leq, eq, ge, le as functions, and operators as operators that attach various filters together. They’re technically operating in different domains. Any function can be used as a filter, and only filters can be combined with AND or OR etc.

If we try to make geq, leq etc. use >=, <=, the function names would look really strange. Alternatively, we could switch &&, || to and and or respectively. But, IMO, using the mathematical symbols for operators is better.

I asked my team the exact same question, and they reminded me that SQL uses the order keyword. Instead of orderdesc, we could add a minus sign in front of the value, like so: order: -film.film.initial_release_date to represent descending. This scheme is used by Google in Datastore.
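
A minimal Python sketch of how such an order value could be interpreted (sort_results is a hypothetical helper, and the rows are made-up data):

```python
def sort_results(rows, order):
    """Sort rows by the predicate named in order; a leading '-' means descending."""
    descending = order.startswith("-")
    predicate = order[1:] if descending else order
    return sorted(rows, key=lambda row: row[predicate], reverse=descending)


rows = [
    {"name": "Jaws", "film.film.initial_release_date": "1975-06-20"},
    {"name": "War Horse", "film.film.initial_release_date": "2011-12-25"},
    {"name": "Bridge of Spies", "film.film.initial_release_date": "2015-10-16"},
]
newest_first = sort_results(rows, "-film.film.initial_release_date")
print([row["name"] for row in newest_first])
# ['Bridge of Spies', 'War Horse', 'Jaws']
```

One thing to watch with this convention is that a predicate name can no longer start with a minus sign itself.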

The scalar types, we can discuss in detail when we start working on the spec.

Hmm… that’s an interesting point. That’d need further thought on our end.

Reverses can only be generated for UIDs, in RDF terms, IRIs. For the schema, a predicate is global. So, one predicate can only contain one type of data; even though it could be referenced from multiple types.

We use a tilde marker to query reverses. That won over using a @reverse, or something similar in our team.

Debug root would return all the _uid_s without necessarily having to ask for it. Note that _uid_s are important to generate further queries. Also, debug returns latency information. We can’t have the differentiation as you mentioned, because the data is tree structured, and hence the uids must be returned along with each node in the tree, not separately.

Thanks for the long review of the spec. Let us know which you prefer for this spec, the wiki or Github, and we can start putting wheels in motion.


#11

Personally, I don’t mind having the draft on the Dgraph wiki. But Github might be better, since it allows filing issues, sending PRs, seeing diffs and doing spec reviews.

I haven’t thought about how updates might work, but insert/upsert is trivial. Just convert the JSON-LD to nquads, as defined by the spec, and insert all these quads into a database.

Straight from official examples:

{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

Will be converted to:

_:b0 <http://schema.org/jobTitle> "Professor" .
_:b0 <http://schema.org/name> "Jane Doe" .
_:b0 <http://schema.org/telephone> "(425) 123-4567" .
_:b0 <http://schema.org/url> <http://www.janedoe.com> .
_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Person> .
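
For a flat object like this one the conversion can be imitated in a few lines of Python. This is only a toy: the real algorithm is the one defined by the JSON-LD spec, which also resolves the remote context. Here @context is treated as a plain vocabulary prefix, and iri_keys is a made-up parameter standing in for the type information the schema.org context would normally supply (it is what makes url an IRI rather than a string).

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def jsonld_to_nquads(doc, iri_keys=("url",), bnode="_:b0"):
    """Convert one flat JSON-LD object into sorted N-Quads lines.

    Assumes string values only and does no literal escaping.
    """
    vocab = doc["@context"]
    lines = []
    for key, value in doc.items():
        if key == "@context":
            continue
        if key == "@type":
            lines.append(f"{bnode} <{RDF_TYPE}> <{vocab}{value}> .")
        elif key in iri_keys:
            lines.append(f"{bnode} <{vocab}{key}> <{value}> .")
        else:
            lines.append(f'{bnode} <{vocab}{key}> "{value}" .')
    return sorted(lines)


person = {
    "@context": "http://schema.org/",
    "@type": "Person",
    "name": "Jane Doe",
    "jobTitle": "Professor",
    "telephone": "(425) 123-4567",
    "url": "http://www.janedoe.com",
}
for line in jsonld_to_nquads(person):
    print(line)
```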

Currently, GraphQL only returns fields that the user asks for. JSON-LD almost always has a context object/field inside that describes the types/namespaces that are used. Thus, it will always be present, even if the user hasn’t specified it in the query.

In this example the @context field is defined inside an object, which might not be expected by GraphQL clients:

{
  "@context": "http://schema.org/",
  "@type": "Person",
  "name": "Jane Doe",
  "jobTitle": "Professor",
  "telephone": "(425) 123-4567",
  "url": "http://www.janedoe.com"
}

In other respects, it’s still just a JSON object. The only difference is that some field names have a special meaning ( @id , @type , @context , @graph ).

The second interesting aspect is that JSON-LD can describe self-references and loops, which are necessary for most graphs. Here is an example of “flattened” JSON-LD that shows how objects can reference each other:

{
  "@context": "http://schema.org/",
  "@graph": [
    {
      "id": "_:b0",
      "name": "Alice",
      "follows": [
        {"id": "_:b1"}
      ]
    },
    {
      "id": "_:b1",
      "name": "Bob",
      "follows": [
        {"id": "_:b0"}
      ]
    }
  ]
}

Thus, this format might be better suited to graphs, while mostly preserving the properties of the current GraphQL JSON output (selective fields, tree structure, types).

As a note, JSON-LD might also solve the mangling issue, since most URL-specific chars will be inside the context, and will be stripped from predicate names.

I’m curious why this is necessary. Results are in ascending order, but only the last N are shown. Why not replace it with a positive limit and descending order (even assuming random uids)?

Does this mean Dgraph forbids multiple scalar values per predicate? I’m not sure I can come up with a use case other than languages, but generally scalar arrays make sense, so I don’t really want to exclude them at the very beginning.

I’ve seen some commits mentioning this change recently. What does the final approach look like?

Yes, I know. This might require defining another keyword, which is not the best way. Or…

The obvious solution is to introduce an @is / @has keyword on the predicate that works like @filter (it filters by object value), but affects the parent (subject) instead of the values on the predicate (object). It might also hide values from the results by default, solving the previous issue.

Currently, @filter on the parent makes the query look unnecessarily verbose.

Sure, but how does it relate to the query language? >= is converted to a gte node in the AST, and higher-level code will not even see the difference.

Also thought about that. It’s a better option, indeed.

Sounds reasonable. RDF actually allows multiple types per predicate, but I’m not sure it’s a good idea at all.

And I still think a schema file with @reverse leads to confusion. Partially because of the naming, and because multiple types might specify the name predicate with/without @reverse, and it’s unclear whether the index will be generated in each case or not.

It’s a lot more concise for sure :slight_smile:

You might have misunderstood me. I’m not saying that latency data is useless, or that _uid_s should not be returned. I mentioned a separate uid list/tree only for mutations.

What I was trying to say is that this latency info might be inserted at the top level of the JSON response (at the same level as errors and data in GraphQL, which are also not affected by the query). And, if debug mode forces the implementation to print _uid_ values, so be it. But debug can still be an HTTP query parameter, and not a separate root. This way any root can be put in debug mode with an HTTP query parameter without adding yet another root.
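
In code, the idea is roughly the following Python sketch. The URL, the debug=true parameter name, the latency_ms field, and the run_query callback are all placeholders, not an existing Dgraph or Cayley API.

```python
import time
from urllib.parse import parse_qs, urlparse


def handle_query(url, query, run_query):
    """Run a query and attach debug info only when ?debug=true is in the URL."""
    params = parse_qs(urlparse(url).query)
    debug = params.get("debug", ["false"])[0] == "true"

    started = time.perf_counter()
    data = run_query(query)
    elapsed_ms = (time.perf_counter() - started) * 1000

    response = {"data": data}
    if debug:
        # Sits next to "data" at the top level, so the query result keeps its shape.
        response["debug"] = {"latency_ms": elapsed_ms}
    return response


def fake_backend(query):
    return [{"name": "alice"}]


plain = handle_query("http://localhost:8080/query", "{ me { name } }", fake_backend)
debugged = handle_query("http://localhost:8080/query?debug=true", "{ me { name } }", fake_backend)
print(sorted(plain))     # ['data']
print(sorted(debugged))  # ['data', 'debug']
```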

Let’s do it! As I mentioned, Github will probably be better because of the functionality it provides. And if you are going to make a repository, let’s add the query language name to the discussion before that. Anything containing “GraphQL” will be confused with the original, so both our projects will have a huge number of users asking why original queries are not working. We already suffer from the same issue with our Gremlin-inspired language. So, let’s pick a unique name, and say that it’s mostly like GraphQL.


#12

Choice of platform

Let’s use Github then. I’m open to suggestions for naming the language. A mutual colleague of mine and Barak’s would have named it Cerebro (or Cerebra) :-).

Mutations

One of the main principles we have for Dgraph core is to keep it minimal. The current implementation of just using RDF NQuads is a bit verbose, but libraries can be used to help convert to RDF from JSON-LD or SQL tables etc. It provides a lot of flexibility, being a simple standard. It also makes it easy to pick up data dumps and backups, so I think we might be better off with RDF itself.

Having said that, the restriction that every predicate must be a full-fledged URL is just not right, I think. It’s best to let users just pass in data the way they want.

Returning extra fields

We had originally done that, but that added a ton of complexity, while also breaking the promise of only returning what’s asked for. So, we got rid of all that to keep it simple and minimal. You get what you ask for.

Multiple scalar values per predicate

We are going to allow that soon. But it would be based on language and label, with only one value per combo.

Count

We’ve made count act like a function, so you could do:

 {
    count(predicate)
    count(second-predicate)
    count(third-predicate)
    other-predicate {
      name
    }
  }

This seemed the cleanest of the approaches we considered.

Filtering

I understand the concern about verbosity, but filtering at the parent has clear advantages.

The benefit of the way filtering is done right now is that we can combine ANDs and ORs in various ways to form very complex filters. Also, it’s very clear which nodes you’re getting at which level. Once you have done filtering at parent P, you know which nodes all the properties received below correspond to. That won’t be possible if we require filtering via the children of a node. Also, how these child filters would AND/OR with each other isn’t clear.

The reason not to use functions directly for filtering is that functions can also act as UID generators when used at the root:

{ directors(anyof(type.object.name, "steven spielberg")) { ... }}

while the same functions can be used as filters inside the @filter directive:

{ films(id: 0xabc) { film.film.directed_by @filter(anyof(type.object.name, "steven spielberg")) { ... } }}

Thus, functions only generate ids, while @filter enforces them. It’s simple and effective.
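
Here is a small Python sketch of that split, using Python sets for the generated ids. The data, the word-matching behaviour of anyof, and the helper apply_filter are assumptions for illustration, not Dgraph internals.

```python
# Hypothetical index for type.object.name: uid -> value.
NAMES = {
    "0x1": "Steven Spielberg",
    "0x2": "Steven Soderbergh",
    "0x3": "Kathryn Bigelow",
}
# Hypothetical film.film.directed_by edges: film uid -> director uids.
DIRECTED_BY = {"0xabc": ["0x3", "0x1"]}


def anyof(index, terms):
    """Function: generate the set of uids whose value shares a word with terms."""
    wanted = set(terms.lower().split())
    return {uid for uid, value in index.items() if wanted & set(value.lower().split())}


def apply_filter(candidates, allowed):
    """@filter: keep only the candidate uids that the function(s) generated."""
    return [uid for uid in candidates if uid in allowed]


# At the root, the function's output is the result set itself.
root = sorted(anyof(NAMES, "steven spielberg"))

# Under an edge, @filter intersects the edge's uids with the function's output.
filtered = apply_filter(DIRECTED_BY["0xabc"], anyof(NAMES, "steven spielberg"))

# AND / OR between filters become set intersection / union.
both = anyof(NAMES, "steven") & anyof(NAMES, "spielberg")
either = anyof(NAMES, "bigelow") | anyof(NAMES, "spielberg")

print(root)      # ['0x1', '0x2']
print(filtered)  # ['0x1']
```

Because every function returns the same kind of thing (a set of ids), combining them never needs function-specific logic.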

Functions

Regarding greater-than-or-equal, these are user-visible functions, not internal ones. The idea behind using ge and not >= is that we don’t want to treat such operators specially. Not all functions can be represented mathematically, but all mathematical functions can be represented by English names.

For example, consider the near function, which gives you the nearest geo-locations. That to us is no different from ge, le, lt, gt. They all take in arguments and return a list of ids.

I think the main argument was the inconsistency with the && and || in filters. We could use and and or respectively to represent these, if that makes the query more consistent, doing away with mathematical operators altogether.

Reverses

Our implementation of reverses would only apply if the predicate points to ids, not values, and it asserts that. If the naming is confusing, maybe we can discuss what sort of naming would be clear. But the basic idea is that we shouldn’t generate reverses for everything automatically.
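
A simplified Python sketch of that rule, not the actual implementation: reverse entries are built only for predicates that were explicitly marked, and a marked predicate pointing at a value is rejected. The triple format, the helper names, and the use of a 0x prefix to recognise ids are assumptions for the example.

```python
from collections import defaultdict


def is_uid(term):
    """Treat anything starting with 0x as an id (an assumption for this sketch)."""
    return term.startswith("0x")


def build_reverse_index(triples, reversed_predicates):
    """Build ~predicate edges for marked predicates only."""
    reverse = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        if predicate not in reversed_predicates:
            continue  # nothing is reversed automatically
        if not is_uid(obj):
            raise ValueError(f"cannot reverse {predicate}: object {obj} is a value")
        reverse[obj]["~" + predicate].append(subject)
    return {uid: dict(edges) for uid, edges in reverse.items()}


triples = [
    ("0x1", "directed_by", "0x9"),
    ("0x2", "directed_by", "0x9"),
    ("0x9", "name", '"Steven Spielberg"'),
]
index = build_reverse_index(triples, {"directed_by"})
print(index)  # {'0x9': {'~directed_by': ['0x1', '0x2']}}
```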

Debug

Debug being a query parameter, sure. That sounds like a good idea.