Storing and querying RDF data in Neo4J through Sail
This blog post discusses the use of Neo4J as an RDF triple store. Michael Hunger has since informed me that the neo-rdf-sail component is no longer under active development and advised me to have a look at Tinkerpop’s Sail implementation. Read the updated version of this article here.
I was recently asked to implement a storage and querying platform for biological RDF (Resource Description Framework) data. RDF data is a set of statements about resources in the form of subject-predicate-object expressions (also referred to as triples). Let’s have a look at some simple RDF triples that define ‘me’, Davy Suvee:
Each subject is identified by a URI (Uniform Resource Identifier). For instance, I identify myself as http://www.example.org/person/Davy_Suvee. A predicate, also identified by a URI, points either to a literal value or to another resource (which is again identified by a URI). In the example above, the first_name, last_name and age predicates all point to literal values, while the company predicate points to http://www.example.org/company/DataBlend, the company I work for. The DataBlend subject in turn has a number of properties of its own, including name and VAT-number. Today’s triple stores can hold billions of these triples, and information is retrieved through so-called SPARQL queries. For instance, to retrieve my first name and age, I can use the following SPARQL query:
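The triple model and the lookup it enables can also be sketched in plain Java. This is only an illustration, not a real triple store or SPARQL engine: the `TripleSketch` class, the `http://www.example.org/schema/` predicate namespace and the placeholder values for age and VAT-number are all assumptions made for the example. Only the two resource URIs and the predicate names come from the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal in-memory model of RDF triples; a teaching sketch, not a triple store.
class TripleSketch {

    static final String PERSON = "http://www.example.org/person/";
    static final String COMPANY = "http://www.example.org/company/";
    // Hypothetical namespace for the predicates; the real one is not given in the text.
    static final String VOCAB = "http://www.example.org/schema/";

    // One subject-predicate-object statement. The object is a URI or a literal value.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    // Triples shaped like the ones described above. Age and VAT-number are placeholders.
    static List<Triple> sampleTriples() {
        String davy = PERSON + "Davy_Suvee";
        String dataBlend = COMPANY + "DataBlend";
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(davy, VOCAB + "first_name", "Davy"));
        triples.add(new Triple(davy, VOCAB + "last_name", "Suvee"));
        triples.add(new Triple(davy, VOCAB + "age", "AGE_PLACEHOLDER"));
        triples.add(new Triple(davy, VOCAB + "company", dataBlend));
        triples.add(new Triple(dataBlend, VOCAB + "name", "DataBlend"));
        triples.add(new Triple(dataBlend, VOCAB + "VAT-number", "VAT_PLACEHOLDER"));
        return triples;
    }

    // Returns the object of the first triple matching subject and predicate, or null.
    static String objectOf(List<Triple> triples, String subject, String predicate) {
        for (Triple t : triples) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                return t.object;
            }
        }
        return null;
    }

    // Rough equivalent of a SPARQL query selecting first name and age for one subject.
    static String[] firstNameAndAge(List<Triple> triples, String subject) {
        return new String[] {
            objectOf(triples, subject, VOCAB + "first_name"),
            objectOf(triples, subject, VOCAB + "age")
        };
    }

    public static void main(String[] args) {
        String[] row = firstNameAndAge(sampleTriples(), PERSON + "Davy_Suvee");
        System.out.println(row[0] + ", " + row[1]);
    }
}
```

A SPARQL engine generalises this kind of pattern matching: any position in a triple pattern can be a variable, and several patterns can be joined on shared variables.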
2. Neo4J as an RDF data store
Similar to SQL, SPARQL provides a set of powerful querying constructs that allow you to declaratively specify your needs. Calculating shortest paths between arbitrary subjects, in contrast, cannot easily be accomplished through SPARQL (unless one encodes the specific path structure, which rather defeats the point). Being able to quickly calculate shortest paths, which is a requirement for the project I’m implementing, is one of the main selling points of graph databases. As RDF data can be thought of as a graph, it comes as no surprise that many graph databases, including Neo4J, provide native support for storing and querying RDF data. In the case of Neo4J, this is achieved through the neo4j-rdf, neo4j-rdf-sparql and neo-rdf-sail components. Unfortunately, I couldn’t find a recent piece of code that details the various steps for automatically importing RDF triple files into Neo4J. Hence, this article. The complete source code can be found on the Datablend public GitHub repository.
Start by setting up the Neo4J database connection:
An embedded Neo4J graph database (EmbeddedGraphDatabase) is used for importing 5MB of RDF triples containing airline flight information. (This example data set was found at rdfdata.org, a great resource for open RDF data sets.) In order to easily look up flight information, we full-text index our RDF triples (through Lucene). Next, we wrap the embedded Neo4J graph database as a VerboseQuadStore (one of the internal triple store implementations provided by Neo4J). Finally, we expose our triple store through the Sail interface, which is part of the openrdf.org project. By doing so, we can use the entire range of RDF utilities (parsers and query evaluators) that are part of the openrdf.org project. Once we have a Sail connection available, we can import the required RDF triples through the add-method.
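To make the import step more tangible without depending on the Neo4J, Lucene or openrdf.org classes, here is a minimal stand-in written in plain Java. It assumes, purely for illustration, that the triples arrive as N-Triples-style lines with URI terms and plain literals only. The `ImportSketch` class, its `add` and `importLines` methods and the exact-match `literalIndex` are hypothetical names for this sketch and are not part of any of the libraries mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for the import step: parses N-Triples-style lines and adds them to an
// in-memory store. It does not use Neo4J, Lucene or the openrdf.org Sail API.
class ImportSketch {

    // Matches: <subject> <predicate> (<object-uri> | "plain literal") .
    // Language tags, datatypes, blank nodes and escaped quotes are not handled.
    private static final Pattern LINE = Pattern.compile(
        "^<([^>]+)>\\s+<([^>]+)>\\s+(?:<([^>]+)>|\"([^\"]*)\")\\s*\\.$");

    // Each entry is {subject, predicate, object}.
    final List<String[]> triples = new ArrayList<>();
    // Exact-match lookup from literal value to subjects; a crude stand-in for a full-text index.
    final Map<String, List<String>> literalIndex = new HashMap<>();

    // Adds one statement; literal objects are also indexed so subjects can be found by value.
    void add(String subject, String predicate, String object, boolean isLiteral) {
        triples.add(new String[] {subject, predicate, object});
        if (isLiteral) {
            literalIndex.computeIfAbsent(object, k -> new ArrayList<>()).add(subject);
        }
    }

    // Parses the lines and adds every statement that matches; returns the number added.
    int importLines(List<String> lines) {
        int added = 0;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // skip anything this simplified parser does not understand
            }
            boolean isLiteral = m.group(4) != null;
            add(m.group(1), m.group(2), isLiteral ? m.group(4) : m.group(3), isLiteral);
            added++;
        }
        return added;
    }

    public static void main(String[] args) {
        // Hypothetical flight data; the real data set uses its own vocabulary.
        ImportSketch store = new ImportSketch();
        int added = store.importLines(Arrays.asList(
            "<http://www.example.org/flight/F1> <http://www.example.org/schema/duration> \"1:35\" .",
            "<http://www.example.org/flight/F1> <http://www.example.org/schema/destination> <http://www.example.org/city/C1> ."));
        System.out.println(added + " triples imported");
    }
}
```

In the real setup the parsing is done by the openrdf.org parsers and the storage and indexing by Neo4J and Lucene; the sketch only shows the shape of the flow, in which statements are parsed, added one by one, and indexed as they are added.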
That’s it! Once the import is finished, you can query your RDF triples by executing a SPARQL query. The query below, for instance, retrieves the flight number, departure city and destination city of all flights that have a duration of 1 hour and 35 minutes.
3. Shortest path calculation
Through the SimpleFulltextIndex we can easily look up the Neo4J node that corresponds to a particular RDF subject. Once we have the required nodes, we can use the graph algorithms provided in the neo4j-graph-algo component to calculate (shortest) paths. Very cool!
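To illustrate why shortest paths come naturally once triples are viewed as a graph, here is a plain-Java sketch that treats every triple as an undirected edge between its subject and object and runs a breadth-first search. It is not the neo4j-graph-algo API: the `ShortestPathSketch` class and its methods are hypothetical, and the node names in `main` are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Breadth-first shortest path over triples viewed as an undirected graph.
class ShortestPathSketch {

    // Builds an adjacency map from {subject, predicate, object} triples.
    // The predicate is ignored; subject and object become connected nodes.
    static Map<String, List<String>> adjacency(List<String[]> triples) {
        Map<String, List<String>> graph = new HashMap<>();
        for (String[] t : triples) {
            graph.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[2]);
            graph.computeIfAbsent(t[2], k -> new ArrayList<>()).add(t[0]);
        }
        return graph;
    }

    // Returns the nodes on a shortest path from start to goal (both inclusive),
    // or an empty list if the goal cannot be reached.
    static List<String> shortestPath(Map<String, List<String>> graph, String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // node -> node it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(goal)) {
                // Walk the parent links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = goal; n != null; n = parent.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : graph.getOrDefault(node, Collections.emptyList())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, node);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Made-up nodes: two routes from "a" to "c", one of them shorter.
        List<String[]> triples = Arrays.asList(
            new String[] {"a", "connects", "b"},
            new String[] {"b", "connects", "c"},
            new String[] {"a", "connects", "d"},
            new String[] {"d", "connects", "e"},
            new String[] {"e", "connects", "c"});
        System.out.println(shortestPath(adjacency(triples), "a", "c"));
    }
}
```

Note that in this simplified view a shared literal value also links two subjects, which may or may not be what you want; a graph database lets you restrict the traversal to particular relationship types.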