Rough Edges


#1

This is an ongoing post I will update with random rough edges, annoyances, or undocumented things as I find them; feel free to comment with your own.

  • If you are given a handle (passed in), you can’t create a transaction from it; you have to know to call cayley.NewTransaction(), then trans.AddQuad, and then handle.ApplyTransaction(trans) (see the sketch below).
  • Why does .Tag take a ...string instead of just a string?

Allow calling AddQuad with 3 parameters (i.e. without a label).
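
For reference, a minimal sketch of the workflow from the first bullet (assuming the github.com/cayleygraph/cayley import paths and the quad.Make helper; NewTransaction, AddQuad and ApplyTransaction are the calls named above):

import (
    "github.com/cayleygraph/cayley"
    "github.com/cayleygraph/cayley/quad"
)

func addFollows(handle *cayley.Handle) error {
    // There is no handle.NewTransaction(); you have to know to use the
    // package-level constructor instead.
    tx := cayley.NewTransaction()
    tx.AddQuad(quad.Make(quad.IRI("alice"), quad.IRI("follows"), quad.IRI("bob"), nil))
    return handle.ApplyTransaction(tx)
}
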
#2

From Slack:

  • Tag+Back should remove the tag from the result set. Or maybe there should be another method to make tags for Back?

#3

Tag+Back should remove the tag from the result set. Or maybe there should be another method to make tags for Back?

I would have to double-check, but I think I have used .Tag and .Back together, taking advantage of the non-clearing behavior.
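
A sketch of the pattern I mean, using the path package as it exists today (the store and the IRIs are placeholders):

// Tag the starting node, walk away from it, and use Back to return;
// this relies on "person" still showing up in the result set.
p := cayley.StartPath(store, quad.IRI("alice")).
    Tag("person").
    Out(quad.IRI("follows")).
    Back("person").
    Out(quad.IRI("email"))
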


#4

I think it’s easier to use something like .Tag("useful").Mark("back") for your case than removing the tag with another path command for the general use case.

Prior to the question on Slack I was porting the path lib to a new optimizer and was also wondering why tags are used for Back at all. Tags are used directly to get results, but Back is only a pointer that helps to construct the iterator tree. I believe Back tags should not affect the query results, because they also affect how the optimizer will interpret the query. In some cases it can push iterators up and down the tree or rewrite them, knowing that nothing will be tagged there. But using Tag prevents the optimizer from touching these iterators.


#5

Agreed, the double use of .Tag boxes you in when it comes to optimization and makes usage confusing.


#6

This was my suggestion, and I would like .Forget("Tag"), or Push(Stack) and Pop(stack).
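
To make it concrete, a hypothetical sketch (neither Forget nor Push/Pop exist in the path package today; start and pred are placeholders, and the no-argument Push/Pop form is just one reading of the idea):

// Tag a node so Back can find it, then drop the tag from the results.
p := start.Tag("back").Out(pred).Back("back").Forget("back")

// Or avoid names entirely: Push saves the current position, Pop returns to it.
p = start.Push().Out(pred).Pop()
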


#7

From Slack:

We need a way to enforce other constraints on predicates without doing Out-Constraint-In or Back. Maybe .HasFilter("<pred>", constr)? The constraint would be an interface similar to this:

// ValueFilter reports whether a given value passes the constraint.
type ValueFilter interface {
    Filter(v quad.Value) bool
}

Has will be a shorthand for HasFilter with an IN [values] constraint.
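
For illustration, a hedged sketch of what such a filter could look like; inFilter is hypothetical, only the ValueFilter interface above is from the proposal:

// inFilter implements ValueFilter as an IN [values] check, which is
// what Has would desugar to.
type inFilter struct {
    values []quad.Value
}

func (f inFilter) Filter(v quad.Value) bool {
    for _, want := range f.values {
        if v == want { // simple equality is enough for this sketch
            return true
        }
    }
    return false
}
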


#8

Tags

Tags seem a bit magic to me. Why tag into a map? Why not a slice? Or maybe a map of maps, a struct, or whatever? This approach could be rewritten into a more generic solution where the caller controls what exactly happens to values and when. This would help to extract a subgraph from the query, for example, or to save tree-like structures natively, etc.
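
As a hedged sketch of what "the caller controls" could mean (nothing here exists in Cayley; all the names are made up):

// Instead of filling a map[string]value per result, the iterator
// could emit tag events to a caller-supplied collector.
type TagCollector interface {
    // Tagged is called once for every tagged value in the current result.
    Tagged(tag string, v quad.Value)
    // EndResult marks the boundary between results, so the caller can
    // build slices, nested maps, structs, or trees as it sees fit.
    EndResult()
}
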


#9

Next/Contains

The Next/Contains pair makes iterators more complex than they should be. Currently, if we have an All iterator in the tree, it might be one of two things: a) a super-slow scan over a whole quad index, or b) a fast but random lookup over the same index. I think we can simplify the code of both the iterators and the optimizer if we divide these iterators, or rather introduce a lower-level layer that defines what these iterators actually do, as separate objects. If an iterator in a particular tree does a scan, it should be represented by some Scan type, and if the optimizer decides that the iterator should seek random values instead, this should be a separate Lookup object as well. This will make it easier to implement more efficient queries, since we will be able to determine that a particular branch only does Contains and can be replaced by a single Walk iterator that checks the whole sub-path as one operation.
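
A hedged sketch of the proposed split; Scan, Lookup and Ref are illustrative, not existing API:

// Ref stands in for whatever reference type a backend uses internally.
type Ref interface{}

// Scan enumerates every value in an index, in index order.
type Scan interface {
    Next() (Ref, bool)
}

// Lookup checks specific values against the same index. Making the
// scan-vs-seek decision explicit means a branch that only ever answers
// Contains questions can be collapsed into a single Walk over the sub-path.
type Lookup interface {
    Contains(v Ref) bool
}
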


#10

IgnoreDuplicates

Currently we have a strange piece of code that alters the behavior of transactions (aka deltas) and makes them strict with regard to quad existence. So you basically say whether the whole Tx should fail if one of the quads already exists, or not. And this flag is global. Why? In any other system you can just state, in the same Tx, that you want to check whether a particular value exists or not, whether it should equal a certain value, etc. If we really want users to build apps on top of Cayley, we should allow these individual constraints to be added per-transaction, explicitly defined by the user. If they ask us to just write a value, we should just write the value, and if they ask us to check existence, we should do exactly that.
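
A hedged sketch of per-operation constraints; Cayley's real deltas don't carry anything like this, so everything beyond Quad and Action is made up:

type Action int

const (
    Add Action = iota
    Delete
)

type Delta struct {
    Quad   quad.Quad
    Action Action
    // Per-operation constraints instead of a store-wide flag:
    MustNotExist bool // fail the whole Tx if the quad already exists
    MustExist    bool // fail the whole Tx if the quad is missing
}
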


#11

ACID and import

We have only one way to put quads into the system: batch some quads, make a Tx and run it. Now the question: should this write be ACID-compliant like any good transaction? Are we doing a batch import job or a critical Tx for an app? There is no way to know. Let's split this interface into separate concepts: QuadWriter/ApplyLog (more on this later) for batch writes and Commit/ApplyDeltas for transactions. All QuadStores should implement both, even if they do the same thing under the hood (which in most cases they don't).
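
A hedged sketch of the split, reusing the Delta sketch from #10; the names follow the suggestion above but are otherwise hypothetical:

// BatchWriter is for imports: high throughput, no atomicity promise.
type BatchWriter interface {
    ApplyLog(ops []Delta) error
}

// TxWriter is for application writes: all deltas apply atomically,
// or none of them do.
type TxWriter interface {
    ApplyDeltas(ops []Delta) error
}
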


#12

Log

I really think the log in the current version of Cayley is a bad idea. It hurts both performance and database size. The most typical example is the delete process, where we double the amount of data instead of removing it. And we still don't have native replication, since the backends do it for us. So why do we need the log in the first place if the backends do all the work? I would rather drop the log completely until we have real replication. Once that's done, we can assume that servers with an old version are compacted and can only stream the last snapshot of data (aka the horizon). Everything is done with no harm to current users. Because right now I have a very hard time explaining why we have this log at all.


#13

Log operations

Currently we have a log/deltas based on quads: each entry either adds or removes a quad. So each time we want to dump or restore the dataset, we have to stream individual quads while chaotically seeking through the node/value index. Ideally we would have a separate operation for inserting a node and another for inserting a quad by node ids. Basically, we just need to switch our log to operate on Primitives instead of quads for this to work. This is what reification will solve eventually.
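
A hedged sketch of primitive-based log entries; all the names are hypothetical:

// A node insert assigns an id to a value exactly once; quad inserts
// then reference node ids instead of repeating full values.
type InsertNode struct {
    ID    uint64
    Value quad.Value
}

type InsertQuad struct {
    ID         uint64
    S, P, O, L uint64 // ids of previously inserted nodes
}
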

The second issue is that we don't have a way to delete quads by pattern. Each time I want to remove a bad supernode from a large dataset, I have to stream a large portion of that dataset back with the quads marked as deleted. We need a single log operation that represents a delete by pattern. Each backend will have to update its indexes anyway, but at least we will not flood our operation log with these deletes. I would also say that updates of links/nodes should be supported as a single operation in the log as well. This is useful for refactoring an existing dataset.
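
And a hedged sketch of the pattern delete, again hypothetical:

// DeleteByPattern removes every quad matching the non-nil positions;
// nil acts as a wildcard. Removing a supernode becomes one log entry
// instead of streaming per-quad deletes back into the log.
type DeleteByPattern struct {
    S, P, O, L quad.Value // nil matches any value
}
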