Graph Shaped Queries


#1

The Problem

I am trying to create an iterator that will use ‘variables’ as its source of values rather than a subiterator or some fixed set of values. The basic problem is; given this graph:

<alice> <follows> <bob> .
<alice> <loves> <bob> .
<bob> <follows> <alice> .
<bob> <loves> <alice> . 
<charlie> <follows> <bob> .
<charlie> <loves> <alice> .
<eve> <follows> <alice> .
<eve> <follows> <bob> .

How do we find all the people who follow and love the same person (Alice and Bob)? Of course, we could use something like:

cayley.StartMorphism().Has(“follows”, “bob”).Has(“loves”, “bob”).Or(cayley.SM().Has(“follows”, “alice”).Has(“loves”, “alice”))

But that presupposes some knowledge of the graph. My original idea was to introduce a syntax that looks something like:

var1 := cayley.NewVariable()
m := cayley.StartMorphism().HasVar(“follows”, var1.Use()).HasVar(“loves”, var1.Bind())

Where the bound variable iterates over all nodes (using qs.NodesAllIterator, like a buildHas), and the value in the first iterator updates along with the HasVar variable.

My Approach

After discussing with @barakmich, I decided to create an iterationContext would be necessary to keeps variable values consistent, and it would be useful for later work:

We need to update the iterator interface to have Next(*iterationContext) bool, Contains(*iterationContext, Value) bool, and NextPath(*iterationContext). It seems that NextResult and Check no longer exist and the other methods don’t require context.

The iterationContext will then be passed down the iterator tree by via those methods. For example, and’s Next() method will look like:

func (it *And) Next(ctx iterationContext) bool {
	graph.NextLogIn(it)
	it.runstats.Next += 1
	for it.primaryIt.Next(ctx) {
		curr := it.primaryIt.Result()
		if it.subItsContain(ctx, curr, nil) {
			it.result = curr
			return graph.NextLogOut(it, true)
		}
	}
	it.err = it.primaryIt.Err()
	return graph.NextLogOut(it, false)
}

The iterationContext will need a map[string]graph.Value field to store the current value of the variable, and some get/set methods (which might need a mutex):

func (c *IterationContext) getValue(key string) graph.Value {
	return values[key]
}

func (c *IterationContext) setValue(key string, val graph.Value) {
	values[key] = val
}

Then we’ll create a new variable iterator that will either use the iterationContext or the a fixed subiterator as the source of its values. Its next method should, I think, look something like this:

func (it *Variable) Next(ctx *IterationContext) bool {
	graph.NextLogIn(it)

	currentValue := ctx.getValue(it.VarName)
	if currentValue == nil {
		it.subIt = it.qs.NodesAllIterator()
	}else {
		if it.subIt != nil {
			if it.PreviousValue == currentValue {
				return graph.NextLogOut(it, false)
			} else {
				it.result = currentValue
				return graph.NextLogOut(it, true)
			}
		}
	}
	
	if it.subIt.Next() {
		it.result = it.subIt.Result()
		ctx.SetValue(it.VarName)
		return graph.NextLogOut(it, true)
	}
	
	return graph.NextLogOut(it, false)
}

##Questions

Does that approach make sense? In particular, do you think the Next() method on the variable iterator works?

Once the variable iterator is implemented, how do we make it available to the end user without messing up the existing path syntax? Ideally I would like the user to be able to pass a variable name to any path method in path.go, like p.Has(“follows”, “”), and then create the variable iterator behind the scenes in buildHas. Or we could create special path methods as in the original idea.

Thanks, I appreciate any feedback!


#2

Regarding context, please use standard golang context package instead of custom type. It will allow us to propagate query cancellation, as well as some debug info.

For path syntax, I think it would be better to declare a new type called Variable, as you proposed, and allow to pass it to Has and Is directly instead of making a new set of functions:

v := NewVar()
p := StartPath().Out().Is(v).Out().Has("<follows>", v)

Also, I don’t see the reason for making Use and Bind calls - ordering can be determined in runtime while executing query. Since we have a global context for the query, first use of variable will became a Use call, changing some state, and the following usages will become Bind calls.

This actually raises a broader question of how queries are executed. In case of variables, we may want to reorder parts of the query in a way that all conditions for a variable are in the same place, so we can find a suitable node more effectively and avoid possible synchronization issues.


#3

Thanks for the feedback, it’s really useful!

The standard context object can be used to retrieve values with the Value method, but it’s immutable, and you can only add values by returning a new context with context.WithValue(). I want to update the value of the variables during iteration, so I don’t see how we can get around using a custom type.

Great point about the path syntax, it’s much cleaner.

Speaking of reordering the query, I have noticed that the many iterators on a the iterator tree never have their Next() methods called. In most of our use cases we found that the variable iterator was a secondary iterator of an and iterator (or the descendant of the secondary iterator of an and iterator). That means the Contains() method is used instead.

What do you think the Contains method should mean on a variable iterator? If it just calls contains on its subiterator it basically does nothing. If it just checks whether the argument is equal to the variable’s current value then there will be syncing issues.


#4

It’s true that context values are immutable once assigned a value, but this doesn’t mean your values should be immutable :slight_smile: You can easily create context via WithValue, pass a pointer to you struct and later retrieve the same pointer and modify data inside this struct.

It probably should check against current value. But I’m not sure if this will work. And syncing is a problem indeed, this is the reason I think reordering might be useful.


#5

I have a work in progress implementation here and it would be good to have some feedback on the robustness of it.

I had already updated the iterator interface with a custom ‘context’ type, we’re currently using this version for demoware, and I’ll update that when the more serious problems are solved.

It probably should check against current value. But I’m not sure if this will work. And syncing is a problem indeed, this is the reason I think reordering might be useful.

The way I ‘solved’ the consistency issue was to reintroduce the idea of binders/users, so that the value of the variable is only ever updated by one iterator. I only define Next() for binders, only define Contains() for users, and reorder the iterator tree so that they only have the appropriate methods called.

The main problem is that there is this assumption that the set of values represented by an iterator won’t change during execution. This allows things like the materialize optimization, and the ‘seen’ map on the recursive iterator, both of which break variables.
Currently I have a bug where I have the binder as a descendant of a recursive iterator and recursiveIterator.Next() will only return True once the binder has exhausted all its possible values.

Do you think there is anything I can do to be certain that these sorts of bugs won’t happen? Is there a way I can fit my changes into the current paradigm that doesn’t run into these sorts of issues?


#6

Let me check the code carefully. It might take some time.

Also, I started to work on new query optimizer that might be easier to work with. Maybe we can define some other approach for building iterator tree with variables? Ideally variable should be represented by a single iterator, but this is not possible with current iterator trees (or I can’t think of a good way at the moment).


#7

OK cool, thanks. Let me know when you have had time to look it over, and ask me anything.

Is there a branch for the optimizer?


#8

Sure: #585


#9

We’ve been using this on our fork for a while, but it would be nice to merge it in, and we have a bit of time to devote to it now.

There are a few problems with the PR that I’m going to work on soon:
• Pass an actual context at iteration (rather than a custom object)
• Create a variable shape that reorders the iterator tree only where necessary
• Only avoid materialise iterator, and clear recursive iterator where necessary

Are there any I’m missing? Is there anything that makes this approach untenable?


#10

Great news! It would be nice to have an implementation that makes reordering on shapes. Although iterator optimizations are still there and might still change order in some cases.

Also I want to explore a different approach here. We might try to rewrite the query to make variable a single iterator, possibly with multiple “outs”.

So if you have time, can you please provide an example of query and how it currently works with variables? Additional details are always good to have.


#11

Yes good point. The variable shape’s builditeratoron method could mark its subiterators as not for reordering. It would require a bit of extra code in each iterator, but it’s probably not as anomalous as having to reorder the fully built iterator.

A single iterator sound cool, it would certainly feel cleaner. I don’t quite know what you mean by “outs” do you mean the out method in path.go?

One possibility is to allow variable iterators to be subiterators of multiple other iterators. Making the the iterator tree more of a DAG.

[quote]
example of query and how it currently works with variables? [/quote]

OK, so a little background on our use case. We have a DAG (but pretty much a tree) that represents stuff about a software project - AST, version control, profiling. And variables help us look for disconnected parts of the DAG that have similar properties.

All our current queries look rather alike and just use Has() to compare properties of two nodes in the DAG. This one will is supposed to find functions with an argument with the same name as a variable:

v := NewVar("varName")
p := StartPath().Out().
    Has("kind", "function").
    Tag("func") .
    Out().
    Has("kind", "argument").
    Has("name", v).
    Back("func").
    Out().
    Has("kind", "variable").
    Has("name", v)

I omitted details for brevity, but I hope this communicates the idea.


#12

Ideally it would be nice to preserve the order that shape optimizer generates and check costs of each iterator/shape there. This will make ordering hacks unnecessary.

This was exactly what I had in mind :slight_smile:

Let me check if this can be done by sharing an iterator, and if optimizer can actually see where is the common subtree for the variable.


#13

Let me check if this can be done by sharing an iterator, and if optimiser can actually see where is the common subtree for the variable.

Great. What do you think of this now? If we had an actual optimiser that walked the iterator tree and shape tree this would be fine right? But as far as I’ve seen we just call the optimise method recursively down the trees. Perhaps we need to add some kind of do-nothing iterator at the top of the variable subtree.


#14

Sorry, for long wait - was on vacation :slight_smile:

First, I want to see if this issue can be solved by some tree manipulation. Here is the tree for your query:

Intersect{
	NodesFrom{
		Dir: quad.Subject,
		Quads: Quads{
			QuadFilter{Dir: quad.Object, Values: Lookup{"variable"}},
			QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
		},
	},
	NodesFrom{
		Dir: quad.Subject,
		Quads: Quads{
			QuadFilter{Dir: quad.Object, Values: Var("VAR")},
			QuadFilter{Dir: quad.Predicate, Values: Lookup{"name"}},
		},
	},
	NodesFrom{
		Dir: quad.Object,
		Quads: Quads{
			QuadFilter{
				Dir: quad.Subject,
				Values: Save{
					Tags: []string{"func"},
					From: Intersect{
						QuadsAction{Result: quad.Object},
						NodesFrom{
							Dir: quad.Subject,
							Quads: Quads{
								QuadFilter{Dir: quad.Object, Values: Lookup{"function"}},
								QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
							},
						},
						NodesFrom{
							Dir: quad.Subject,
							Quads: Quads{
								QuadFilter{
									Dir: quad.Object,
									Values: Intersect{
										NodesFrom{
											Dir: quad.Subject,
											Quads: Quads{
												QuadFilter{Dir: quad.Object, Values: Var("VAR")},
												QuadFilter{Dir: quad.Predicate, Values: Lookup{"name"}},
											},
										},
										NodesFrom{
											Dir: quad.Subject,
											Quads: Quads{
												QuadFilter{Dir: quad.Object, Values: Lookup{"argument"}},
												QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
											},
										},
									},
								},
							},
						},
					},
				},
			},
		},
	},
}

Let’s decompose it a bit.

Variable subtree that can be extracted:

var X = NodesFrom{
	Dir: quad.Subject,
	Quads: Quads{
		QuadFilter{Dir: quad.Object, Values: Var("VAR")},
		QuadFilter{Dir: quad.Predicate, Values: Lookup{"name"}},
	},
}

And two subject branches that depends on it:

var Y = Intersect{
	X,
	NodesFrom{
		Dir: quad.Subject,
		Quads: Quads{
			QuadFilter{Dir: quad.Object, Values: Lookup{"variable"}},
			QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
		},
	},
}

var Z = Intersect{
	X,
	NodesFrom{
		Dir: quad.Subject,
		Quads: Quads{
			QuadFilter{Dir: quad.Object, Values: Lookup{"argument"}},
			QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
		},
	},
}

And the final query with these branches:

Intersect{
	Y,
	NodesFrom{
		Dir: quad.Object,
		Quads: Quads{
			QuadFilter{
				Dir: quad.Subject,
				Values: Save{
					Tags: []string{"func"},
					From: Intersect{
						QuadsAction{Result: quad.Object},
						NodesFrom{
							Dir: quad.Subject,
							Quads: Quads{
								QuadFilter{
									Dir: quad.Object,
									Values: Z,
								},
							},
						},
						NodesFrom{
							Dir: quad.Subject,
							Quads: Quads{
								QuadFilter{Dir: quad.Object, Values: Lookup{"function"}},
								QuadFilter{Dir: quad.Predicate, Values: Lookup{"kind"}},
							},
						},
					},
				},
			},
		},
	},
}

Now, we want to change the way how Var iterator works. It should pick one value in the first branch and use it to check the second one. The same asymmetric behavior is already implemented in And/Intersect iterator - the first branch will use Next to get results, and the second one will use Contains to ensure that these results match other conditions.

Thus, I propose to rewrite the tree as following: find closest common Intersect for a particular variable and replace it with a new IntersectWithVar shape. This shape will set a list of variables to And iterator, which will use them to populate variable values using context passed to Next. The same state will be passed to Contains which will use it to set variables in second branch.

Also, since Var iterator can have no value on it’s own, I propose to make it work like Save:

type Var struct{
  Name string
  From Shape
}

It will work as pass-thru on Next and Intersect{X, Sub} on Contains (it should check if result matches a variable, and if subiterator matches it as well).

Now, I will describe in short the changes to code that will be necessary for this to work.

First, Next and Contains will now accept context.Context argument. This will help us to pass state of variables, as well as cancel queries.

Code in And.Next() and And.Contains() will create a new variables state and pass it to subiterators:

vars := make(map[string]graph.Value)
ctx = context.WithValue(ctx, KeyVars, vars)
primary.Next(ctx)
for _, sub := range it.sub {
  sub.Contains(ctx)
}

Var.Next() is a pass-thru that populates a variable in the query context:

ok := subIt.Next(ctx)
if !ok {
  return false
}
v := subIt.Result()
vars := ctx.Value(KeyVars).(map[string]graph.Value)
vars[name] = v
return true

Var.Contains() uses variables to compare it to argument first, and then checks it with subiterator:

vars := ctx.Value(KeyVars).(map[string]graph.Value)
v, ok := vars[name]
if !ok {
  // variable not defined - return an error
  it.err = fmt.Errorf("variable not defined: %q", name)
  return false
}
if r != v {
  return false
}
return subIt.Contains(ctx, r)

This will require at least one Var to be in Next’able branch of the tree. Instead, we can allow Contains to populate variables instead of throwing errors and Next to use them (and call Contains to check subiterator).

The only “hack” in all of this is that query should be optimized to annotate Intersects, which needs to initialize query context for variables. Alternative may be that user should initialize the context by hand, but it’s easy to forget to do it for every query.