Most efficient way to iterate all quads matching certain properties


#1

Hi, I’m using Cayley in a new linked data project. It’s aimed at libraries that use Library of Congress data. Developers and librarians often need to clean up their data, specifically their categories. They want to make sure their local categories match those of the Library of Congress. In some cases they are using a pre-approved ID and can easily look up the new label via an API, but sometimes all they have is a string, and they need to find the best guess.

That’s where I am right now.

I am using Cayley to load in all the RDF triples and allow for quick lookup via Subject, but now I need to tackle the less clear part, which is basically a full-text search issue.

My current thought is that I would use Bleve to build a secondary full-text index.

So, I would load all the data into Cayley, and then I would iterate over all the data in Cayley and push it into a Bleve index.

I have Cayley working and a Bleve index waiting; now I’m trying to make sure I iterate over the quads efficiently. I’m hoping to do some amount of batching. Here is what I have so far. I can make do with it, but if there is a more efficient route I would love to hear about it. I don’t need to index every quad, just the labels, so being able to select just a portion of the data would be helpful as well.

qs := a.NewQuadstore(...)

iterator := qs.QuadsAllIterator()
defer iterator.Close()

// Walk every quad in the store and keep only the SKOS label predicates.
for iterator.Next(nil) {
	quad := qs.Quad(iterator.Result())
	switch quad.Predicate.String() {
	case "<http://www.w3.org/2004/02/skos/core#altLabel>":
		// ...Send quad to be indexed...
	case "<http://www.w3.org/2004/02/skos/core#prefLabel>":
		log.Print(quad.Object.String() + quad.Subject.String())
		// ...Send quad to be indexed...
	}
}

#2

Not sure if I get what you want but this comes to mind:
Write into a buffered channel and read that channel in a different goroutine, filling up a buffer of size X. When that buffer reaches a certain size or the channel is closed, flush the buffer to Bleve, then either continue the loop or, if the channel is closed, end the goroutine.
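A rough sketch of that batching loop, using only the standard library, might look like the following. The `Label` type, the `batchLabels` helper, and the batch size are made up for illustration; the flush callback is where a real program would presumably write to Bleve, and the producer goroutine stands in for the quad iteration loop.

```go
package main

import "fmt"

// Label is a hypothetical stand-in for the parts of a quad worth indexing.
type Label struct {
	Subject string
	Text    string
}

// batchLabels (a hypothetical helper) reads labels from in, collects them into
// batches of at most size items, and calls flush for every full batch. When in
// is closed, any remaining partial batch is flushed too.
// The batch slice is reused between calls, so flush must not keep a reference to it.
func batchLabels(in <-chan Label, size int, flush func([]Label) error) error {
	buf := make([]Label, 0, size)
	for l := range in {
		buf = append(buf, l)
		if len(buf) == size {
			if err := flush(buf); err != nil {
				// Note: returning early leaves the producer blocked once the
				// channel fills up; a real program should also stop the producer.
				return err
			}
			buf = buf[:0]
		}
	}
	if len(buf) > 0 {
		return flush(buf)
	}
	return nil
}

func main() {
	// Buffered so the producer rarely has to wait on the indexer.
	labels := make(chan Label, 100)

	// Producer: in the real program this would be the quad iteration loop.
	go func() {
		defer close(labels)
		for i := 0; i < 5; i++ {
			labels <- Label{
				Subject: fmt.Sprintf("<subject-%d>", i),
				Text:    fmt.Sprintf("label %d", i),
			}
		}
	}()

	// Consumer: a real flush would add each label to the full-text index here.
	err := batchLabels(labels, 2, func(batch []Label) error {
		fmt.Printf("flushing %d labels\n", len(batch))
		return nil
	})
	if err != nil {
		fmt.Println("indexing failed:", err)
	}
}
```

Here the consumer runs on the main goroutine and the producer on a separate one; swapping the two roles works just as well. The batch size is worth tuning against how long a single flush takes.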

But I know we also have an Elasticsearch storage backend that does full-text search. So maybe it has some extra options, or you can search the quads separately in ES in the same collection as Cayley. @dennwc or (github.com/michaelqiu94) might be able to tell you.


#3

Thanks for the reply. I agree. If there isn’t a native way to iterate in batches then I think a buffered channel would be the way to go.

I don’t think I am going to use a backend like ES right now. I like boltdb and other on-disk/memory backends because of the operational simplicity. Our plan is to distribute data files to app servers on a semi-regular basis. The data changes rarely, so we don’t need to worry about our app servers also writing to the backend. So each app server can have its own copy of the data.


#4

Hi @voidfiles,

You can filter quads by building a query with the shape library (graph/shape). Please check the Quads or QuadsAction types.