Many small graphs instead of one big one?


#1

Hi, I’m trying to evaluate which graph database to use after deciding that Neo4j is not a good match for my needs.

My main requirement is to have many small graphs instead of one big one. Think graphs with probably fewer than 200,000 nodes and at most 10 edges per node (but many with only a couple of edges).

Edges need to have direction.

Is Cayley a good match for this? Neo4j isn’t, because it has just one big graph per running Neo4j instance, which would make things prohibitively expensive. If you don’t believe Cayley is a good match for this and you know of a database that is, please point it out. Thanks!


#2

It’s definitely worth exploring. But it would also be good to know about the read/write conditions, and whether it’s for a social network or scientific desktop use. A bit more context will probably get you a better answer.


#3

More like a social network: read quite frequently, written infrequently. There will probably be an initial write for any particular graph, and then updates done in batches at most twice a day.


#4

To clarify the social network part: users upload various types of financial data and have a graph of that data generated, which will be used frequently when writing documents in an online editor. As the user proceeds through the document, queries will be run against an Elasticsearch index and a graph DB so that they can see the connections in their data relevant to what they’re doing. There will also be some common example data that will be referenced to tell the user, “Your data looks a lot like Source X; please read how Source X has structured their statements.” When data gets updated, it will be moved into a batch that runs at most twice a day, but I think for most users data updates will happen every two weeks at best. Data reads, however, could happen dozens of times a day per user (a hundred queries per day is probably an extreme).

Multiple users on a company account can share the same graph and index. The number of users will not increase the graph size or even the number of writes, but it will increase the number of reads.


#5

This use case should be covered pretty well by quad labels, if you interpret them as subgraph names. The good thing is that you will still be able to query the large graph. Or do you need full isolation of the data?

Most queries will work well with a label filter, but some query patterns may require building specific indexes.


#6

Hey, I’m having trouble googling quad labels and finding a useful example relevant to my needs. Do you know of one?

Anyway, when I thought about doing it in one big graph, I felt I would have to put the label ‘user’ on each node. My reasoning was that certain types of nodes may be exactly the same between users, so I need a unique identifier, such as the owner, on these nodes to differentiate them from other identical nodes when updating. However, if nodes are unique per user, then given my estimate of 50,000+ users with some hundreds of thousands of nodes, I worry that the big-graph version would be too big. For that reason I thought it would probably be better to separate things out and have lots of small graphs. What do you think?

Thanks,
Bryan Rasmussen


#7

Labels are part of the N-Quads spec, which we support.

In short, if you have a set of links:

<user1> <knows> <fact1> .
<fact1> <is> <Fact> .
<fact1> <description> "Some useful data" .

You can label them by adding a fourth value to each of them:

<user1> <knows> <fact1> <user1_graph> .
<fact1> <is> <Fact> <user1_graph> .
<fact1> <description> "Some useful data" <user1_graph> .

From the database perspective, this will look like a separate subgraph.
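
To make the subgraph view concrete, here is a minimal Python sketch that treats each statement as a (subject, predicate, object, label) tuple and selects one user's subgraph by its label. It is only an illustration: `parse_quad` and `subgraph` are hypothetical helper names, the parser handles just `<...>` terms and plain quoted strings rather than full N-Quads, and none of this is Cayley's API.

```python
import re

# Simplified term matcher: <iri>-style terms or plain "quoted" strings.
# Assumption: no escapes, datatypes, language tags, or blank nodes.
TERM = re.compile(r'<[^>]*>|"[^"]*"')

def parse_quad(line):
    """Return (subject, predicate, object, label); label is None if absent."""
    terms = TERM.findall(line)
    if len(terms) not in (3, 4):
        raise ValueError(f"expected 3 or 4 terms: {line!r}")
    subject, predicate, obj = terms[:3]
    label = terms[3] if len(terms) == 4 else None
    return (subject, predicate, obj, label)

def subgraph(quads, label):
    """Keep only the links that carry the given label."""
    return [q for q in quads if q[3] == label]

DATA = """
<fact1> <is> <Fact> <user1_graph> .
<fact1> <description> "example text" <user1_graph> .
<fact2> <is> <Fact> <user2_graph> .
<fact2> <is> <Fact> .
"""

quads = [parse_quad(line) for line in DATA.splitlines() if line.strip()]
user1 = subgraph(quads, "<user1_graph>")
print(len(quads))  # 4 links in the whole graph
print(len(user1))  # 2 links in user1's subgraph
```

Filtering on the fourth field is all that "interpreting labels as subgraph names" amounts to in this model; the unfiltered `quads` list is still the one large graph.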

Note that labels are always set on links, not nodes. This means that shared nodes will automatically connect multiple small subgraphs into a global one.

At the same time, links with different labels are considered distinct (they will coexist in the graph):

<fact1> <is> <Fact> <user1_graph> .
<fact2> <is> <Fact> <user2_graph> .
<fact2> <is> <Fact> .
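
One way to picture this rule is to model a link's identity as all four values together. The Python sketch below does that with a set of 4-tuples, using `None` for "no label". The set and the `labels_of` helper are assumptions made for illustration, not how Cayley stores data.

```python
# Each link is modeled as a 4-tuple; None stands for "no label".
links = set()
links.add(("<fact1>", "<is>", "<Fact>", "<user1_graph>"))
links.add(("<fact2>", "<is>", "<Fact>", "<user2_graph>"))
links.add(("<fact2>", "<is>", "<Fact>", None))

def labels_of(links, subject, predicate, obj):
    """Collect every label under which the triple is stored."""
    return {
        label
        for (s, p, o, label) in links
        if (s, p, o) == (subject, predicate, obj)
    }

fact2_labels = labels_of(links, "<fact2>", "<is>", "<Fact>")
print(len(links))         # 3: the two <fact2> links do not collapse
print(len(fact2_labels))  # 2: one labeled copy, one unlabeled copy
```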

Considering your use case, where users are adding some well-known nodes, you can write user-related data with labels (so you can filter these links for a particular user) and write well-known global data without labels (so any user can access it). If you want complete isolation for each user, it’s better to label all data coming from each user. That will still allow you to run queries across all the data and join it from multiple user subgraphs.
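
As a rough sketch of that split, the Python below keeps per-user links labeled and shared links unlabeled, then runs the same lookup once for a single user and once across everything. The node names (`<stmt1>`, `<sourceX_stmt>`, and so on) and the helpers `visible_to` and `subjects` are made up for the example and are not part of Cayley.

```python
GLOBAL = None  # marker for unlabeled, shared data

quads = [
    ("<stmt1>", "<is>", "<Statement>", "<user1_graph>"),
    ("<stmt2>", "<is>", "<Statement>", "<user2_graph>"),
    ("<sourceX_stmt>", "<is>", "<Statement>", GLOBAL),
]

def visible_to(quads, user_label):
    """Links a user may see: their own labeled links plus unlabeled ones."""
    return [q for q in quads if q[3] == user_label or q[3] is GLOBAL]

def subjects(quads, predicate, obj):
    """Sorted subjects of all links matching the predicate and object."""
    return sorted({s for (s, p, o, _label) in quads if p == predicate and o == obj})

user1_view = subjects(visible_to(quads, "<user1_graph>"), "<is>", "<Statement>")
all_view = subjects(quads, "<is>", "<Statement>")
print(user1_view)  # ['<sourceX_stmt>', '<stmt1>']
print(all_view)    # ['<sourceX_stmt>', '<stmt1>', '<stmt2>']
```

The per-user view never sees `<stmt2>`, while the unfiltered query joins data from both users and the shared example data.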

Hope this explains the topic and a possible solution a bit.