Beginner's Guide to Schema Design: Working Thread


#1

Hi All,

Per the post at Best Place for a Beginner to Learn about Schema Design, I’m going to take a whack at a “Beginner’s Guide to Graph Database Schema Design”. As I’m a relative beginner myself, when possible I’d like some quick feedback from the experts on this forum regarding the correctness or incorrectness of the below points. Having these solid points nailed down will help me make sure I’m going in the right direction with the guide.

Introduction to Graph Database Schema Design

Triples are the building blocks of graphs

  1. Subject (Must be an IRI)
  2. Predicate (Must be an IRI)
  3. Object (Can be either an IRI or a literal)
  4. Label
    4.1. The addition of a fourth attribute – label – provides a way to indicate that a given triple is part of a subgraph. Triples with a label added are called quads.
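
To make the four attributes above concrete, here is a minimal sketch in Python. The `Quad` type and its field names are my own illustrative assumptions, not Cayley's API.

```python
from typing import NamedTuple, Optional

class Quad(NamedTuple):
    subject: str                  # an IRI
    predicate: str                # an IRI
    obj: str                      # an IRI or a literal
    label: Optional[str] = None   # optional subgraph name; None means a plain triple

    def is_triple(self) -> bool:
        """True when no label is set, i.e. the statement is a plain triple."""
        return self.label is None

plain = Quad("<bob>", "<knows>", "<samantha>")
labeled = Quad("<bob>", "<knows>", "<samantha>", "<friends>")
```

Here `plain` is a triple, while `labeled` is a quad that places the same statement in a `<friends>` subgraph.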

A subject or object IRI should…

  1. Identify a single node/entity/vertex within a graph
  2. Be unique within the graph (only one instance for a given IRI)
  3. Be self-describing
    i. To the extent possible. Avoid using a UUID if a human-readable label can be used.

A predicate IRI should…

  1. Identify a unique type of relationship within the graph
    1.1 IRI can be used for more than one relationship instance
  2. Be self describing
    2.1 Use of a UUID never makes sense for a predicate

#2

“Blank nodes” are also possible (like _:n123 ). But in contrast to IRIs, they are local to a file and are not persistent. Their only purpose is to distinguish one nameless node from the others in the same file. A good example might be a geo-coordinates pair. It has no IRI itself, only a link from some other IRI and links to separate latitude/longitude values:

<home> <at> _:n123 .
_:n123 <lat> "54.32" .
_:n123 <long> "12.34" .

In an ideal world, an IRI should also work as a valid URL that leads to a description of the entity (in machine-readable form at least). Sites like http://schema.org do both at the same time: the page shows a human-readable version, while JSON-LD/Microdata/RDFa markup can be extracted to read the sub-graph. Some use the Accept and Content-Type HTTP headers for this.

Also, it’s good practice to add at least minimal metadata for a predicate IRI or type IRI, like a human-readable name and a short description, or a sameAs reference to similar well-known IRI(s).


#3

I’m not sure I understand the purpose of blank nodes as opposed to creating an IRI. Is the only difference that one is temporary and not subject to the same design considerations as creating an actual well-formed IRI? In this example you gave, the blank node seems like an “object literal” that would look like {“lat”: 54.32, “long”: 12.34}. Are blank nodes sometimes absolutely necessary, or can they always be replaced by using locally-unique IDs or literals? Or is “blank node” just another word for “locally-unique ID”?

I’m thinking this Beginner’s Guide to Schema Design should possibly be re-framed to “beginner’s guide to schema design in Cayley”, and cover maybe 95% of design situations, with a disclaimer that there are other possibilities. For instance, based on Barak’s comment from this post: https://discourse.cayley.io/t/9-12-14-how-to-model-data-in-graph-databases-specifically-cayley/77:

“Some graph stores allow arbitrary data on nodes, as Key/Value pairs. I personally disagree with that decision a lot, as any key you wish to put on a node can be “promoted” to an actual predicate and stored as a full triple.”

In this guide, I’m thinking maybe we should exclude the concept of creating “props” attached to nodes, and instead discuss props, types, and predicates as the same concept. Do you think this is the right approach?

I’m a bit confused about how one would attach meta data to a predicate IRI. In Cayley, it is not possible to create relationships off of predicates themselves, is it? If not, is the approach just to create an object/node/vertex for the predicate IRI and attach relationships to it, such that the “predicate IRI” actually exist in the schema as both an Object IRI and a predicate IRI?

By the way, thanks for your fast and helpful responses!


#4

It’s close, but you can’t access individual fields as triples in JSON approach. And if you convert it to triples, it will have a blank node as identifier, so yes, they are absolutely necessary, I would say :slight_smile:

Just a node without a separate IRI or name. If you want to export it, you should assign a locally-unique ID. Within a graph it also has a unique ID, which may or may not correlate with file-local IDs.

This is not about “no properties for nodes”; it’s about not saving properties as opaque values on nodes (as JSON, for example). Neo4J works this way: you may attach KVs to nodes, but you can’t traverse them as a consequence. A better approach is to define a predicate which describes each property and store KVs as triples. All RDF graphs behave this way; this is not Cayley-specific.
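
As a rough sketch of “store KVs as triples”, assuming a hypothetical helper `props_to_triples` (not part of Cayley or any RDF library) and treating every value as a plain string literal:

```python
def props_to_triples(node_iri, props):
    """Turn each key/value pair on a node into its own (subject, predicate, object) triple."""
    # Each key is promoted to a predicate IRI; each value becomes a quoted literal.
    return [(node_iri, f"<{key}>", f'"{value}"') for key, value in props.items()]

triples = props_to_triples("<bob>", {"firstName": "Bob", "age": 42})
```

Each resulting triple can then be traversed like any other edge, which is not possible while the pairs sit on the node as an opaque blob.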

RDF joins nodes by value, meaning that if you specify the same name in file it will refer to the same node in graph. Thus, you can specify IRI as a predicate in one triple, and an object of another triple. Graph becomes self-describing this way. Consider an example:

<bob> <type> <person> .
<bob> <firstName> "Bob" .

<firstName> <type> <property> .
<firstName> <label> "First name" .
<firstName> <desc> "First name of a person" .

<person> <type> <class> .
<person> <label> "Person" .
<person> <desc> "A real person" .
<person> <hasProperty> <firstName> .

(P.S. this is an example; the real IRIs for these concepts are different, see RDFS)
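
A small sketch of what “self-describing” means in practice, using an in-memory list of tuples rather than a real store. The `describe` helper is a hypothetical name used only for illustration:

```python
graph = [
    ("<bob>", "<type>", "<person>"),
    ("<bob>", "<firstName>", '"Bob"'),
    ("<firstName>", "<type>", "<property>"),
    ("<firstName>", "<label>", '"First name"'),
]

def describe(graph, iri):
    """Collect predicate -> object for every triple whose subject is `iri`."""
    return {p: o for s, p, o in graph if s == iri}

# <firstName> is a predicate in the triple about <bob>,
# and is itself the subject of triples that carry its metadata.
meta = describe(graph, "<firstName>")
```

Because nodes are joined by value, the same name `<firstName>` links the data triple to its own description without any extra machinery.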

This approach attaches metadata to the <firstName> predicate globally. But as you mentioned, we can’t yet attach additional properties or KVs to the predicate of an individual triple. The only possible way to do it is to create a mediator node:

<bob> _:n1 "Bob"
_:n1 <type> <firstName>
_:n1 <timestamp> "2001-02-03 04:05:06"

But currently this will break every query that expects a <bob> <firstName> "Bob" triple. We are working on an alternative solution.


#5

Thanks @dennwc, this helps clear things up a lot. I need to read up much more on RDFs, but I had a couple more questions in the meantime.

Are angle brackets required by RDF? I.e. would it be just as acceptable to have used:

"bob" "type" "person"
or
bob type person
?

RDF as far as I can tell is geared toward resources on the internet. I do think this one interesting use for Cayley, but actually one of my major interests in Cayley is as a graph datastore within a private graph - such as for a corporation. For example, say I wanted to store payroll and organizational structure information in a private graph for a company HR department. I might want to store things like

"Tim" "Reports To" "Ann" .
"Tim" "Has Salary" "$75,000" .

Does the formality of IRI’s change if they are designed to be used in a private database instance like this? If you were to design a schema to store these two relationships…what would the identifiers look like? Maybe something like:

"/employee/Tim Smith" "/organization/reportsTo" "/employee/Ann Nettles"
"/employee/Tim Smith" "/compensation/salary" "75,000"

"/employee/Tim Smith" "type" "person"
"/employee/Ann Nettles" "type" "person"

"person" "type" "class"
"person" "label" "person"
"person" "desc" "Human Being"

"/organization/reportsTo" "type" "property"
"/organization/reportsTo" "label" "Reports To"
"/organization/reportsTo" "desc" "Subject reports to object in organizational structure"

"/compensation/salary" "type" "property"
"/compensation/salary" "label" "Has Salary"
"/compensation/salary" "desc" "Salary in US Dollars"

And then, let’s say the company got acquired by another company. In order to fully specify the location of the resources, could the domain, e.g. “companyA.org/”, be prepended to all of the local “IRIs”?

Thanks,
Jeff


#6

Yes, angle brackets appear in at least the N-Quads and Turtle specs, so they are required. And it makes sense, because IRIs are like URLs (relative or absolute), and they have some limitations on the set of characters that can be used.

As an example, Freebase used relative URLs the same way as you suggest. Further, Turtle syntax has a way to specify a base prefix for IRIs within the file, so you can change domain by editing one line.
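
To illustrate the idea of resolving relative IRIs against a single base, here is a sketch using Python's standard `urllib.parse.urljoin`. The base URL `http://company-a.example/` and the `resolve` helper are made-up placeholders, and this only mimics what a Turtle parser does with a base declaration:

```python
from urllib.parse import urljoin

BASE = "http://company-a.example/"  # hypothetical base; change this one line to move domains

def resolve(relative_iri, base=BASE):
    """Resolve a relative IRI against the base, the way a relative URL is resolved."""
    return urljoin(base, relative_iri)

employee = resolve("/employee/TimSmith")
predicate = resolve("/organization/reportsTo")
```

After an acquisition, only `BASE` would need to change; the relative identifiers in the data stay the same.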

In any case, we believe that sharing schema is not the same as sharing data. You can make a public server that serves full schema of your organization, at the same time storing actual data in other places.
This way you can promote your domain-specific schema, so more organizations can try to use it, making data formats interchangeable. And, most likely, you will have a part of the data that you want to open and share, like price lists, organization contacts and maybe more. And if the schema becomes standard, search engines will take it (and the data you share) into account to find new users or customers for your site/service/organization.

If it still doesn’t make sense for you, you can make this server private, and internal users can still discover data by following IRIs as URLs, using local DNS.


#7

In my learning about RDF over the past few days, I’ve started to understand that there are many ways to express a given RDF dataset. For this beginner’s guide, I am thinking of avoiding going into many of the concrete languages, and simply using N-Quads format for explanation as it seems the simplest. Do you think this makes sense?

I definitely agree. Is there a single best place to search for established vocabularies in specific knowledge domains? For instance, for an upcoming project I will be looking for vocabularies in the project management, enterprise software, and finance domains.


#8

So I finished draft version 1 of this guide. When you get a chance, will you please look it over and let me know what you think? I put it in a git repo here: https://github.com/tamethecomplex/tutorial-documents/blob/master/cayley/BeginnersGraphDatabaseSchemaDesign.md.

Thanks,
Jeff


#9

In a graph, “Bob” and “Samantha” are both vertices, and “knows” is an edge connecting “Bob” to “Samantha”.

It may be worth mentioning that an “edge” may also be a “node” in a graph?

Composed of a subject, object, and predicate which are surrounded by angle brackets and separated by a space

This is true only for IRIs in nquads/turtle. All literals, like human-readable names, dates, int/float values, etc are surrounded by ":

</org/bob> <name> "Bob"

Everything else looks correct. Good job! :slight_smile:


#10

Thanks a lot for the feedback @dennwc. I’ll post the altered guide to this thread in my next post.

Do you think this would make sense to add to the Cayley website or maybe under a ‘guides’ folder in the cayley repo? Hopefully it will be useful to someone else.


#11

Version 1.0 post review:

Beginner’s Guide to Graph Database Schema Design

Overview of this Guide

After you read this guide, you will have an understanding of the following concepts:

  1. What is a graph and how can it store concepts
  2. What is a triple and how can it define a graph
  3. What is RDF
  4. What is an RDF Concrete Syntax
  5. Expressing Your Graph using N-triples
  6. What is an RDF schema
  7. Finding and Understanding Existing Schemas
  8. Extending Existing Schemas
  9. What is a quad
  10. Saving your quads to a file using N-quads

What is a graph?

A graph has two elements: vertices and edges. A vertex is an entity, and an edge is a relationship between two entities.

What is a triple?

A triple is a 3-word statement that specifies a single relationship (edge) between two entities (vertices). As an example: Bob and Samantha are both entities. The 3-word statement: “Bob” “knows” “Samantha” specifies that the entity “Bob” is related to “Samantha” in that he “knows” her. In a graph, “Bob” and “Samantha” are both vertices, and “knows” is an edge connecting “Bob” to “Samantha”. Each term of a triple has its own name. The first term is the “subject”, the second term is the “predicate”, and the third term is the “object”. Groups of triples can describe any graph.

What is RDF?

RDF stands for “Resource Description Framework”. It is a standard maintained by the World Wide Web Consortium (W3C, https://www.w3.org/Consortium/) for describing information on the web. More generally, RDF is a data model for describing graphs (see https://www.w3.org/TR/rdf11-concepts/#data-model).

What is an RDF Concrete Syntax

RDF is not a “language” but rather a set of concepts. There are several concrete syntaxes that can be used to write RDF. These can be found under the W3C RDF-related standards at (https://www.w3.org/TR/#tr_RDF). Common examples are Turtle, JSON-LD, and N-triples. It can be very confusing for a beginner to be presented with several syntaxes, so for this guide, we will use the simplest one of all: N-triples. The N-triples specification can be found here (https://www.w3.org/TR/n-triples/).

Expressing Your Graph Using N-Triples

The N-Triples concrete syntax is a collection of an arbitrary number (N) of triples. These triples are saved in a string, where each triple is:

  1. On its own line
  2. Composed of a subject, predicate, and object, in that order, each of which is separated by a space
    1. Identifiers are surrounded by angle brackets
    2. Literals are surrounded by double quotes
  3. Terminated with a space followed by a period (" .").

Using our example above, the relationship between Bob and Samantha can be expressed in N-triples syntax as:

<Bob> <knows> <Samantha> .

Because a string in N-triples format can be any length and contain any number of triples, it can be used to describe a graph of any size or complexity. For example, let’s say that “Samantha” also knows “Bob”, and “Bob” is the spouse of “Carolyn”. This expanded graph can be described by adding two more triples:

<Bob> <knows> <Samantha> .
<Samantha> <knows> <Bob> .
<Bob> <isTheSpouseOf> <Carolyn> .
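
As a minimal sketch of the rules above, the following Python function writes a list of triples out as N-Triples text. The function name `to_ntriples` is hypothetical, and the sketch assumes each term is already wrapped in angle brackets or double quotes:

```python
def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples: one triple per line, each ending in ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

doc = to_ntriples([
    ("<Bob>", "<knows>", "<Samantha>"),
    ("<Samantha>", "<knows>", "<Bob>"),
    ("<Bob>", "<isTheSpouseOf>", "<Carolyn>"),
])
print(doc)
```

Adding more tuples to the list grows the graph; the format itself places no limit on the number of lines.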

Predicates as Vertices

It is important to note that identifiers used as predicates in a graph can also be vertices. For example, the identifier <knows> may have a description “Indicates that a person knows another person”. <knows> can be associated with this description by adding another triple to the graph:

<knows> <hasDescription> "Indicates that a person knows another person." .

What is an RDF Schema

As explained previously, an RDF graph can be described by a collection of triples - each of which consists of a subject, predicate, and object. An RDF schema is a collection of types of subjects, types of predicates, and types of objects. Because an RDF schema describes the triples in a graph, an RDF schema is metadata.

Note that for a beginner, finding RDF schemas online can be confusing because not all RDF schemas are referred to as “RDF schemas”. They are sometimes referred to as “vocabularies”, “languages”, or “ontologies”. We are going to boil all of that down here and say that a schema is metadata that describes triples in a graph. Period.

There are many published RDF schemas available online. One of the most important is “RDF Schema” (https://www.w3.org/TR/rdf-schema/). Note the confusion starts immediately… “RDF Schema” is an RDF schema, not the only RDF schema. Another RDF schema, “Web Ontology Language (OWL)” (https://www.w3.org/TR/owl-guide/), is an extension of “RDF Schema” and is an RDF schema in its own right.

Components of an RDF Schema

As stated previously, an RDF Schema is metadata used to describe subjects, predicates, and objects in a graph. At the highest level, in RDF, every subject, predicate, and object is referred to as a “Resource”. A resource can be either a specific thing, like “Bob”, or a type of thing, like “Person”. In an RDF schema, a subclass of “Resource” called “Class” is used to define all of the metadata that makes up the schema.

Examples of Schema Classes

The RDF Schema specification (https://www.w3.org/TR/rdf-schema/) defines some basic RDF classes. For instance, the class “rdf:Property” is the class of “something that relates a subject resource to an object resource”. That is, the members of “rdf:Property” are the predicates in a graph. An example instance of “rdf:Property” is “rdf:type”. The predicate “rdf:type” indicates that the subject of a triple is an instance of the class named by the object of the triple. Another example of an instance of “rdf:Property” is “rdfs:label”. The predicate “rdfs:label” indicates that its object is a human-readable name for its subject.

Class prefixes

Note that the discrepancy in the above use of “rdf:” vs “rdfs:” is not a typo. Because of how various RDF schemas evolved over time, prefixes on various RDF terms may differ. A prefix is actually a stand-in for a full base URL that is used to aid readability. In this case, the prefix “rdf:” actually refers to the namespace “http://www.w3.org/1999/02/22-rdf-syntax-ns#”, such that “rdf:type” fully expanded is “http://www.w3.org/1999/02/22-rdf-syntax-ns#type”. Similarly, the prefix “rdfs:” refers to the namespace “http://www.w3.org/2000/01/rdf-schema#”, such that the property “rdfs:label” actually refers to “http://www.w3.org/2000/01/rdf-schema#label”.
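
Prefix expansion can be sketched in a few lines of Python. The two namespace URLs are the ones given above; the `PREFIXES` table and the `expand` function are hypothetical names used only for illustration:

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

def expand(term):
    """Replace the prefix before the first ':' with its full namespace URL."""
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local

full_type = expand("rdf:type")
full_label = expand("rdfs:label")
```

An unknown prefix raises a `KeyError` in this sketch, which is a reasonable signal that the prefix was never declared.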

Web Ontology Language (OWL)

OWL deserves a special mention, and here we will give a brief explanation. OWL is a widely-used extension of the “RDF Schema” specification. The “owl:” prefix refers to the namespace “http://www.w3.org/2002/07/owl#”. An example property from the “owl:” namespace is owl:sameAs. A triple whose predicate is owl:sameAs indicates that its subject and object refer to the same thing. owl:sameAs is useful when reconciling disparate schemas, to indicate that two graph vertices are duplicates of one another.

Finding and Understanding Other Published RDF Schemas

One of the motivations for the creation and standardization of RDF was to add a structure to content on the internet such that it could be more easily searched and organized. Standardizing RDF, and having many different publishers adhere to those standards, makes it easier to link datasets together across the internet and to make sense of the relationships between the data.

There are many RDF schemas (also referred to as vocabularies or ontologies) that have been created and published for this purpose. One site which has aggregated and made available many schemas is “Linked Open Vocabularies” (https://lov.okfn.org/dataset/lov). One example schema which can be found by searching the site for “biology” is “UniProt RDF schema ontology” (http://www.uniprot.org/core). The organization Universal Protein Resource (UniProt) (http://www.uniprot.org/help/about) is a “comprehensive resource for protein sequence and annotation data.” A biologist who is creating an RDF graph representing her research and wants to share that research with others would be better off using this schema from UniProt than creating her own.

Extending Existing Schemas

Because RDF Schema and OWL are the most widely-recognized schema standards for RDF, when designing your project’s schema, you should make use of all rdf: rdfs: and owl: schema terms that apply to your project, rather than duplicating those concepts. In addition, if there are other widely-used schemas such as those which can be found at https://lov.okfn.org/dataset/lov, your project would do well to re-use those schemas as well, instead of defining your own.

Still, with any project in a specific-enough domain, there are likely schema concepts you will need to formalize on your own. In these cases, you can use whatever published schemas are available, and where a given concept does not exist in those schemas, create new schema terms of your own. If those schema terms prove to be widely applicable to your domain, you may consider registering your schema with https://lov.okfn.org/dataset/lov so that others can make use of it.

What is a Quad

Cayley database can be used to store triples, but it can also be used to store an extended form of triple called a quad. The first three elements of the quad are the same as a triple: subject, predicate, object. However, a fourth term, “label” indicates that the specified triple is part of the subgraph indicated in the “label” field. The concept of a quad was created to allow the expression of an “RDF Dataset”, which is like a normal “RDF Graph” except that it can contain one or more subgraphs, also called “named graphs”. For a formal explanation of an “RDF Dataset” and its relationship to the quad, see (https://www.w3.org/TR/rdf11-datasets/#quad-semantics).

Expressing your graph using N-quads

The N-quads concrete language specification (https://www.w3.org/TR/n-quads/) is analogous to the N-triples specification discussed above, but expresses quads rather than triples. As an example of using the N-quads specification, let’s say you want to save an RDF dataset that stores both your friends and your acquaintances. You want to be able to easily filter to the “friends” or “acquaintances” subgraphs for different purposes. This RDF dataset expressed using N-quads may be:

<self> <owl:sameAs> <jeff> .
<self> <talksWith> <david> <friends> .
<self> <goesOutWith> <michelle> <friends> .
<self> <hasMet> <daniel> <acquaintances> .
<self> <hasDoneBusinessWith> <rachelle> <acquaintances> .

Note that a line written in N-Triples format is also a valid line in N-quads. If a fourth term (label) is not specified within an N-quads file, the corresponding triple is assigned to the “default” graph. Thus in the example above, the triple:

<self> <owl:sameAs> <jeff> .

will be assigned to the “default” graph within the dataset.
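
The filtering by label described above can be sketched in Python as follows. This is a deliberately naive reader, not a full N-quads parser: it splits on whitespace, so it only works when no term contains a space (for example, no multi-word literals). The `group_by_graph` name and the `"default"` key are illustrative assumptions:

```python
from collections import defaultdict

NQUADS = """\
<self> <owl:sameAs> <jeff> .
<self> <talksWith> <david> <friends> .
<self> <goesOutWith> <michelle> <friends> .
<self> <hasMet> <daniel> <acquaintances> .
<self> <hasDoneBusinessWith> <rachelle> <acquaintances> .
"""

def group_by_graph(text):
    """Group triples by their label; lines without a label go to the 'default' graph."""
    graphs = defaultdict(list)
    for line in text.splitlines():
        terms = line.split()
        if not terms:
            continue
        terms = terms[:-1]  # drop the terminating "."
        label = terms[3] if len(terms) == 4 else "default"
        graphs[label].append(tuple(terms[:3]))
    return dict(graphs)

graphs = group_by_graph(NQUADS)
```

Looking up `graphs["<friends>"]` then yields only the triples in the “friends” subgraph, and the unlabeled `owl:sameAs` line ends up under `"default"`.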


#12

Sure, we should add it. We had a post with all the terms that we need to define, but no one has done it yet. Maybe we can move some parts there.


#13

I might be wrong, but I think spaces cannot be used in IRIs. <is_the_spouse_of> is safer.


#14

Agreed, I updated the post, thanks for the catch.


#15

Do you think it belongs in the cayleygraph website repo? If so, should I submit a pull request?

I will plan to add some definitions to this post as well, in addition to getting to understand Gizmo syntax a bit better and start putting together some additional materials on it.


#16

We decided to move documentation from GitHub to discourse, so we can have faster feedback (users can ask questions in the same thread).

@robertmeta What do you think? Should we split it into separate posts?


#17

Question about this… I agree that having documentation in the same thread is a good thing so people can ask questions. But shouldn’t the documentation be in some clear, central location and be versioned so that people know they are looking at the most up-to-date document?

I’m concerned that by looking at a thread, a user will be confused as to which post is official and which post is just side conversation.


#18

Yeah, we share your concerns. At the same time, we decided that the additional feedback is worth it. For now we are going to update the first posts, so users will actually see the most up-to-date information. And later on, after some stable release, we will move all documentation to a central place, of course.


#19

Ok, that makes sense. Should I change the title of this thread to “Beginner’s Guide to Schema Design” and edit the first post to be the most up-to-date document?

Thanks,
Jeff


#20

Let’s wait for @robertmeta, he might want to add something as well.

