<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; NoSQL</title>
	<atom:link href="https://datablend.be/?cat=20&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>The power of graphs to analyse biological data</title>
		<link>https://datablend.be/?p=344</link>
		<comments>https://datablend.be/?p=344#comments</comments>
		<pubDate>Mon, 02 Dec 2013 07:04:34 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphconnect]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=344</guid>
		<description><![CDATA[Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets. The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.]]></description>
				<content:encoded><![CDATA[<p>Watch Davy Suvee present at <a href="http://www.graphconnect.com/london/" title="GraphConnect London 2013">GraphConnect London 2013</a> on the power of graph databases to analyse biological datasets.</p>
<p><iframe src="//player.vimeo.com/video/80463932" width="500" height="281" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
<p><a href="http://vimeo.com/80463932">The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013</a> from <a href="http://vimeo.com/neo4j">Neo Technology</a> on <a href="https://vimeo.com">Vimeo</a>.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=344</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting triangles smarter (or how to beat Big Data vendors at their own game)</title>
		<link>https://datablend.be/?p=282</link>
		<comments>https://datablend.be/?p=282#comments</comments>
		<pubDate>Mon, 11 Feb 2013 10:02:00 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[exadata]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>
		<category><![CDATA[vertica]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=282</guid>
		<description><![CDATA[A few months ago, I discovered Vertica&#8217;s &#8220;Counting Triangles&#8221;-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends<p><a href="https://datablend.be/?p=282">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">A few months ago, I discovered Vertica&#8217;s <a href="http://www.vertica.com/2011/09/21/counting-triangles">&#8220;Counting Triangles&#8221;-article</a> through <a href="http://getprismatic.com/news/home">Prismatic</a>. The blog post describes a number of benchmarks on counting triangles in large networks. A <span class="highlight"><em>triangle</em></span> is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a <span class="highlight"><em>friendship triangle</em></span>. Counting all triangles within a large network is a rather <span class="highlight"><em>compute-intensive task</em></span>. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we have identified a triangle.</p>
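<p style="text-align: justify;">As a rough illustration of this naive approach (a sketch, not the optimised implementation developed later in this article), assume the network is held as an in-memory adjacency map from vertex id to the set of adjacent vertex ids:</p>
<pre><code>import java.util.*;

public class NaiveTriangles {

    // graph: vertex id -> set of adjacent vertex ids; relationships are stored
    // bi-directionally, so every vertex also appears as a key
    static long countTriangles(Map<Integer, Set<Integer>> graph) {
        long count = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : graph.entrySet()) {
            int origin = entry.getKey();
            for (int adjacent : entry.getValue()) {
                for (int next : graph.get(adjacent)) {
                    // a vertex adjacent to our neighbour's neighbour that is
                    // also adjacent to the origin vertex closes a triangle
                    if (next != origin && graph.get(next).contains(origin)) {
                        count++;
                    }
                }
            }
        }
        return count / 6; // each triangle is discovered six times (twice per corner)
    }
}
</code></pre>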
<p style="text-align: justify;">The Vertica article illustrates how to execute an optimised implementation of the above algorithm through <span class="highlight"><em>Hadoop</em></span> and their own <span class="highlight"><em>Massive Parallel Processing (MPP) Database</em></span> product (both being run on a 4-node cluster). The dataset involves the <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, containing around <em>86 million relationships</em>, resulting in around <em>285 million identified triangles</em>.  As can be expected, the Vertica solution shines in all respects (counting all triangles in <span class="highlight"><em>97 seconds</em></span>), beating the Hadoop solution by a factor of <em>40</em>. A few weeks later, the Oracle guys published a similar <a href="http://structureddata.org/2011/10/17/counting-triangles-faster">blog post</a>, using their <span class="highlight"><em>ExaData</em></span> platform, beating Vertica&#8217;s results by a factor of <em>7</em>, clocking in at <span class="highlight"><em>14 seconds</em></span>.</p>
<p style="text-align: justify;">Although Vertica and Oracle&#8217;s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles  through a <span class="highlight"><em>smarter algorithm</em></span> that is able to deliver similar performance on <span class="highlight"><em>commodity hardware</em></span> (i.e. my MacBook Pro Retina).</p>
<p>&nbsp;</p>
<h3>1. Doing it the smart way</h3>
<p style="text-align: justify;">The <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a <span class="highlight"><em>source</em></span> and <span class="highlight"><em>target vertex</em></span> (where each vertex is identified by an unique id). Relationships are assumed to be <span class="highlight"><em>bi-directional</em></span>: if person 1 knows person 2, person 2 also knows person 1.</p>
<script src="https://gist.github.com/4737460.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s start by creating a <span class="highlight"><em>row-like</em></span> data structure for storing these relationships. The <em>key</em> of each row is the <em>id of the source vertex</em>. The row <em>values</em> are the id&#8217;s of all <em>target vertices</em> associated with the particular source vertex.</p>
<script src="https://gist.github.com/4737525.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let&#8217;s improve our data structure by <span class="highlight"><em>indexing</em></span> each relationship through its <span class="highlight"><em>lowest key</em></span>. So, even though the LiveJournal file declares the relationship as being &#8220;<em>2 0</em>&#8220;, we persist the relationship by assigning the <em>2</em>-value to the <em>0</em>-row. (Order doesn&#8217;t matter as relationships are bi-directional anyway.)</p>
<script src="https://gist.github.com/4737638.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a <span class="highlight"><em>triangle</em></span>, its two <span class="highlight"><em>adjacent vertices</em></span> should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we  need to check whether we can <span class="highlight"><em>find edges amongst the vertices</em></span> contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original <em>source</em>-values. By doing so, we get rid of one expensive <em>for</em>-loop. Nevertheless, the amount of calculations that need to be executed is still close to <span class="highlight"><em>2 billion</em></span>! </p>
<script src="https://gist.github.com/4737835.js"></script>
<p>&nbsp;</p>
<h3>2. Persisting the relationships</h3>
<p style="text-align: justify;">The data structure as described above is persisted in a <span class="highlight"><em>custom datastore</em></span> that we developed at Datablend for powering the <a href="http://www.similr.li">similr</a>-engine (a chemical structure search engine). The datastore is <span class="highlight"><em>fully persistent</em></span> and optimised for quickly performing <span class="highlight"><em>set-based operations</em></span> (intersections, differences, unions, &#8230; ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around <em>20 seconds</em> on my MacBook Pro. An additional <em>4 seconds</em> is required for persisting the entire data structure to the datastore itself. So around <span class="highlight"><em>25 seconds</em></span> in total for effectively storing all 86 million relationships. Vertica nor Oracle mention the time it takes to persist the Livejournal dataset within their respective databases. However, I assume it also requires them a few seconds to execute this load-operation.</p>
<p style="text-align: justify;">What about <span class="highlight"><em>disk usage</em></span>? The custom Datablend datastore takes the <span class="highlight"><em>second place</em></span>, requiring only 37 Mb more compared to Oracle’s Hybrid Columnar Compression version.</p>
<script src="https://gist.github.com/4739843.js"></script>
<p>&nbsp;</p>
<h3>3. Calculating the triangles</h3>
<p style="text-align: justify;">The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 265 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes the first place, clocking in at <span class="highlight"><em>9 seconds</em></span>! The calculation runs fully pararellized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!<br />
<script src="https://gist.github.com/4740331.js"></script>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">Datablend&#8217;s custom datastore is a very specific solution that targets a <span class="highlight"><em>particular range of Big Data computations</em></span>. It is in no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to <a href="http://www.similr.li">similr</a> and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=282</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Similr: blazingly fast chemical similarity searches</title>
		<link>https://datablend.be/?p=280</link>
		<comments>https://datablend.be/?p=280#comments</comments>
		<pubDate>Mon, 04 Feb 2013 10:00:38 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=280</guid>
		<description><![CDATA[Today, Datablend announces Similr to be available for beta sign-up. Similr allows scientists (both academic and enterprise) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms, which, combined, allow matching compounds to be identified in millisecond time. Similr&#8217;s functionalities are available through a flexible and<p><a href="https://datablend.be/?p=280">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Today, Datablend announces <a href="http://similr.li">Similr</a> to be available for <a href="http://similr.li">beta sign-up</a>. Similr allows scientists (both academic and enterprise) to quickly search for compounds that exhibit a particular <span class="highlight"><em>chemical structure</em></span>. It employs a wide range of <span class="highlight"><em>fingerprinting algorithms</em></span>, which, combined, allow matching compounds to be identified in <em>millisecond</em> time. Similr&#8217;s functionalities are available through a flexible and expressive REST API and allow scanning of more than 30 million compounds that have been made publicly available through <a href="http://pubchem.ncbi.nlm.nih.gov">PubChem</a>.</p>
<p style="text-align: justify;">Similr will provide <span class="highlight"><em>unlimited</em></span> API-access to academics. Free commercial access,  limited to a 1000 API-calls a month, will be available. Higher up, customers can choose between a <span class="highlight"><em>pay-as-you-go subscription</em></span> or opt-in for a dedicated installation that allows for the import of (private) compounds.</p>
<p style="text-align: justify;">Similr is being developed by Datablend, a Big Data consultancy company. Datablend&#8217;s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a <span class="highlight"><em>highly optimised chemical similarity search algorithm</em></span> that is able to scan millions of compounds at blazing speeds. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to Similr and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=280</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis and Lua: a NoSQL power-horse</title>
		<link>https://datablend.be/?p=278</link>
		<comments>https://datablend.be/?p=278#comments</comments>
		<pubDate>Tue, 29 Jan 2013 09:59:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=278</guid>
		<description><![CDATA[Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes,<p><a href="https://datablend.be/?p=278">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. <a href="http://redis.io" target="_blank">Redis</a> is frequently referred to as the <span class="highlight"><em>Swiss Army Knife</em></span> of NoSQL databases and rightfully deserves that title. At its core, it is an <span class="highlight"><em>in-memory key-value</em></span> datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various &#8216;Big Data&#8217;-related problems. To illustrate this point, I reimplemented my MongoDB-based <a href="http://datablend.be/?p=962" target="_blank">molecular similarity search</a> through Redis and its integrated Lua support. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/redis-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Redis &#8216;fingerprint&#8217; data model</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called <span class="highlight"><em>fingerprints</em></span> (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto coefficient</a>. This computation involves the calculation of <span class="highlight"><em>intersections</em></span> between sets of fingerprints, an operation that is natively supported by Redis.</p>
<p style="text-align: justify;">Our Redis-based data model for storing fingerprints requires three different data-structures:</p>
<ol>
<li>For each <span class="highlight"><em>compound</em></span>, identified by an <span class="highlight"><em>unique key</em></span>, we store its set of fingerprints (where each fingerprint is again identified by an unique key).</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, identified by an <span class="highlight"><em>unique key</em></span>, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, we store its number of occurrences through a dedicated <span class="highlight"><em>weight</em>-key</span>.</li>
</ol>
<p>&nbsp;</p>
<p style="text-align: justify;">Fingerprints are calculated by using the, 33 and 35) are sufficient to create both the <span class="highlight"><em>inverted indexes</em></span> (compound->fingerprints and fingerprint->compounds) and incrementing the accompanying <span class="highlight"><em>counters</em></span>.</p>
<script src="https://gist.github.com/4666154.js"></script>
<p>&nbsp;</p>
<h3>2. Finding similar chemical compounds</h3>
<p style="text-align: justify;">For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my <a href="http://datablend.be/?p=962" target="_blank">original MongoDB article</a>. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the <span class="highlight"><em>build-in Lua scripting</em></span> support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we  calculate the <span class="highlight"><em>fingerprints of interest</em></span> (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single <span class="highlight"><em>sort</em></span>-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we <span class="highlight"><em>sub-select the top x fingerprints</em></span> of interest. What a <span class="highlight"><em>powerful</em></span> and <span class="highlight"><em>elegant</em></span> command!</p>
<p style="text-align: justify;">We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis <span class="highlight"><em>union</em></span>-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis <span class="highlight"><em>intersect</em></span>-command. Only compounds that satisfy the Tanimoto restriction are returned.</p>
<script src="https://gist.github.com/4667047.js"></script>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">With 25.000 stored compounds, Redis requires less then 20ms to retrieve compounds that are 70% similar to a particular input compound. Snappier compared to my original MongoDB implementation. In addition, Redis requires less then 1GB of RAM to maintain a live index of the <a href="http://cactus.nci.nih.gov/download/roadmap/">460.000 PubChem compounds</a> that have at least one associated assay. This allows scientist to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Circle through your Google Analytics data with Neo4J and Circos</title>
		<link>https://datablend.be/?p=267</link>
		<comments>https://datablend.be/?p=267#comments</comments>
		<pubDate>Sun, 11 Mar 2012 09:51:19 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[circos]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=267</guid>
		<description><![CDATA[Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights, is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including<p><a href="https://datablend.be/?p=267">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Storing <span class="highlight">massive amounts of data</span> in a NoSQL data store is just one side of the <span class="highlight">Big Data</span> equation. Being able to visualize your data in such a way that you can easily gain <span class="highlight">deeper insights</span> is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including <a href="http://circos.ca/" target='_blank'>Circos</a>. Circos is an amazing software package that visualizes your data through a <span class="highlight">circular layout</span>. Although it was originally designed for displaying <span class="highlight">genomic data</span>, it allows you to create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table&#8217;s <span class="highlight">columns</span> and <span class="highlight">rows</span> are represented by <span class="highlight">segments</span> around the circle. Individual <span class="highlight">cells</span> are shown as <span class="highlight">ribbons</span>, which <span class="highlight">connect</span> the corresponding row and column segments. The ribbons themselves are <span class="highlight">proportional</span> in width to the value in the cell.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos-visualize-table.png">
<p align="center"><img width="400" src="http://datablend.be/wp-content/uploads/circos-visualize-table.png" alt="circos" /></p>
<p></a>  </p>
<p>&nbsp;</p>
<p style="text-align: justify;">When visualizing a <span class="highlight">directed graph</span>, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the <span class="highlight">key data points</span> within your table. In my case, I want to better understand the <span class="highlight">flow of visitors</span> to and within the datablend site and blog; where do visitors come from (direct, referral, search, &#8230;) and how do they navigate between pages. The rest of this article details how to 1) retrieve the <span class="highlight">raw visit information</span> through the Google Analytics API, 2) <span class="highlight">persist</span> this information <span class="highlight">as a graph</span> in Neo4J and 3) query and <span class="highlight">preprocess</span> this data for visualization through Circos. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/neo4j-google-analytics">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Retrieving your Google Analytics data</h3>
<p style="text-align: justify;">Let&#8217;s start by retrieving the <span class="highlight">raw Google Analytics data</span>. The Google Analytics data API provides access to all <span class="highlight">dimensions</span> and <span class="highlight">metrics</span> that can be queried through the web application. In my case, I&#8217;m interested in retrieving the <span class="highlight"><em>previous page path</em></span> property for each page view. If a visitor enters through a page outside of the datablend website, the <em>previous page path</em> is marked as <span class="highlight"><em>(entrance)</em></span>. Otherwise, it contains the <span class="highlight">internal path</span>. We will use Google&#8217;s Java Data API to connect and retrieve this information. We are particularly interested in the <span class="highlight"><em>pagePath</em></span>, <span class="highlight"><em>pageTitle</em></span>, <span class="highlight"><em>previousPagePath</em></span> and <span class="highlight"><em>medium</em></span> dimensions, while our metric of choice is the number of <span class="highlight"><em>pageViews</em></span>. After setting the date range, the <span class="highlight">feed of entries</span> that satisfy this criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the datablend website, we store the specific <span class="highlight">medium</span> (direct, referral, search, &#8230;) as previous path.</p>
<script src="https://gist.github.com/2011682.js"></script>
<p>&nbsp;</p>
<h3>2. Storing navigational data as a directed graph in Neo4J</h3>
<p style="text-align: justify;">The set of site navigations can easily be stored as a directed graph in the the <span class="highlight">degree</span> of my nodes is correct if I would perform other types of calculations. For each individual navigation relationship, we also store <span class="highlight">the date of visit</span>.</p>
<script src="https://gist.github.com/2011787.js"></script>
<p>&nbsp;</p>
<h3>3. Creating the Circos tabular data format</h3>
<p style="text-align: justify;">The <span class="highlight">Circos tabular data format</span> is quite easy to construct. It&#8217;s basically a <span class="highlight">tab-delimited file</span> with row and column headers. A cell is interpreted as a <span class="highlight">value</span> that <span class="highlight">flows</span> from the <span class="highlight">row entity</span> to the <span class="highlight">column entity</span>. We will use the <a target='_blank' href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html">Neo4J Cypher query language</a> to retrieve the data of interest, namely all navigations that occurred within a <span class="highlight">certain time period</span>. Doing so allows us to create <span class="highlight">historical visualizations</span> of our navigations and observe how visit flow behaviors are changing over time.</p>
<script src="https://gist.github.com/2011910.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Next, we create the tab delimited file itself. We <span class="highlight">iterate</span> through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the <span class="highlight">two-dimensional array</span> by <span class="highlight">normalizing</span> (i.e. summing) the number of navigations between the source and target paths. At the end, we <span class="highlight">filter</span> this occurrence matrix on the <span class="highlight">minimal number</span> of required navigations. This ensures that we will only create segments for paths that are <span class="highlight">relevant</span> in the total population. As a final step, we <span class="highlight">print</span> the occurrences matrix as a tab-delimited file. For each path, we will use a <span class="highlight">shorthand</span> as the Circos renderer seems to have problem with long string identifiers.</p>
<script src="https://gist.github.com/2011992.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The text below is a sample of the output generated by the <span class="highlight"><em>printCircosData</em></span> method. It first prints the legend (matching shorthands with actual paths). Next it prints the tab-delimited Circos table.</p>
<script src="https://gist.github.com/2012044.js"></script>
<p>&nbsp;</p>
<h3>4. Use the Circos power</h3>
<p style="text-align: justify;">Although Circos can be installed on your local computer, we will use its <a target='_blank' href="http://mkweb.bcgsc.ca/tableviewer/">online version</a> to create the visualization of our data. Upload your tab-delimited file and just wait a few seconds before enjoying the <span class="highlight">beautiful rendering</span> of your site&#8217;s navigation information.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos1.jpg">
<p align="center"><img width="700" src="http://datablend.be/wp-content/uploads/circos1.jpg" alt="circos" /></p>
<p></a>  </p>
<p style="text-align: justify;">With just a glimpse of an eye we can already see that the <span class="highlight">l3-segment</span> (i.e. the referrals) is significantly larger (almost 6000 navigations) compared to the others segments. The <span class="highlight">outer 3 rings</span> visualize the total amounts of navigations that are <span class="highlight">leaving</span> and <span class="highlight">entering</span> this particular path. In case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its <span class="highlight">total segment count</span> (inner ring) is entirely build up out of navigations that have a referral as source. The <span class="highlight">l6-segment</span> seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my <a target='_blank' href="http://datablend.be/?page_id=44&#038;paged=2">&#8220;The joy of algorithms and NoSQL: a MongoDB example&#8221;</a>-article. Most of its traffic is received through referrals, while a decent amount is also generated through <span class="highlight">direct</span> (l17-segment) and <span class="highlight">search</span> (l27-segment) traffic. The <span class="highlight">l15-segment</span> (my blog&#8217;s main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.</p>
<p style="text-align: justify;">With just a few tweaks to the Circos input data, we can easily <span class="highlight">focus</span> on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos2.jpg">
<p align="center"><img width="700" src="http://datablend.be/wp-content/uploads/circos2.jpg" alt="circos" /></p>
<p></a>  </p>
<h3>5. Conclusions</h3>
<p style="text-align: justify;">In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=267</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework</title>
		<link>https://datablend.be/?p=265</link>
		<comments>https://datablend.be/?p=265#comments</comments>
		<pubDate>Wed, 08 Feb 2012 09:50:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=265</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Finally, Part 3 illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the<p><a href="https://datablend.be/?p=265">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of MongoDB’s build-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to improve overall performance. <a href="http://datablend.be/?p=1400" target='_blank'>Part 3</a> finally, illustrates the use of the new MongoDB <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>, which boosts performance beyond the capabilities of the map-reduce implementation.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span> through <span class="highlight">Tanimoto coefficients</span>. When employing a low target Tanimoto coefficient however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, <a href="http://datablend.be/?p=962" target='_blank'>part 2</a> of this article describes the use of MongoDB’s build-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'> map-reduce functionality</a> to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.</p>
<p style="text-align: justify;">Recently, MongoDB introduced its new <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>. This framework provides a more simple solution to calculating <span class="highlight">aggregate values</span> instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison-revisited">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. MongoDB Aggregation Framework</h3>
<p style="text-align: justify;">The MongoDB Aggregation Framework draws on the well-known <span class="highlight">linux pipeline</span> concept, where the output of one command is <span class="highlight">piped</span> or <span class="highlight">redirected</span> to be used as input of the next command.  In case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as <span class="highlight">$match</span>, <span class="highlight">$limit</span> and <span class="highlight">$skip</span> take a document as input and output the same document in case a certain set of criteria’s is met. Other operators, such as <span class="highlight">$project</span> and <span class="highlight">$unwind</span> take a single document as input and reshape that document or emit multiple documents based upon a certain projection. The <span class="highlight">$group</span> operator finally, takes multiple documents as input and groups them into a single document by aggregating the relevant values. <span class="highlight">Expressions</span> can be used within some of these operators to calculate new values or execute string operations.</p>
<p style="text-align: justify;">Multiple operators are combined into a single pipeline that is applied upon a list of documents. The pipeline itself is executed as a MongoDB <span class="highlight">Command</span>, resulting in single MongoDB document that contains an array of all documents that came out at end of the pipeline. The next paragraph details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.</p>
<p>&nbsp;</p>
<h3>2. Molecular Similarity Pipeline</h3>
<p style="text-align: justify;">When applying a pipeline upon a certain collection, <span class="highlight">all</span> documents contained within this collection are given as input to the first operator. It&#8217;s considered best practice to <span class="highlight">filter</span> this list as quickly as possible to limit the number of total documents that are passed through the pipeline. In our case, this means filtering out all document that will never be able to <span class="highlight">satisfy the target Tanimoto coefficient</span>. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the <span class="highlight">$match</span> operator look as follows:</p>
<script src="https://gist.github.com/1767088.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the <span class="highlight">$match operator</span> is able to use the index that we have defined for the <em>fingerprint_count</em> property. For computing the Tanimoto coefficient, we need to calculate the number of <span class="highlight">shared fingerprints</span> between a certain input compound and the compound we are targeting. In order to be able to work at the fingerprint level, we use the <span class="highlight">$unwind operator</span>. $unwind peels off the elements of an array one by one, returning a stream of documents where the specified array is replaced by one of its elements. In our case, we apply the $unwind upon the <em>fingerprints</em> property. Hence, each compound document will result in <span class="highlight"><em>n</em></span> compound documents, where <span class="highlight"><em>n</em></span> is the number of unique fingerprints contained within the compound.</p>
<script src="https://gist.github.com/1767390.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">In order to calculate the number of shared fingerprints, we will start off by filtering out all documents which do not have a fingerprint that is in the list of fingerprints of the target compound. For doing so, we again apply the <span class="highlight">$match operator</span>, this time filtering on the <em>fingerprints</em> property, where only documents that contain a fingerprint that is in the list of target fingerprints are maintained.</p>
<script src="https://gist.github.com/1767484.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we only match fingerprints that are in the list of target fingerprints, the output can be used to <span class="highlight">count the total number of shared fingerprints</span>. For this, we apply the <span class="highlight">$group operator</span> on the <em>compound_cid</em>, though which we create a new type of document, containing <span class="highlight">the number of matching fingerprints</span> (by summating the number of occurrences), the <span class="highlight">total number of fingerprints of the input compound</span> and the <span class="highlight">smiles representation</span>.</p>
<script src="https://gist.github.com/1767565.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">We now have all parameters in place to calculate the Tanimoto coefficient. For this we will use the <span class="highlight">$project operator</span> which, next to copying the compound id and smiles property, also adds a new, computed property named <em>tanimoto</em>.</p>
<script src="https://gist.github.com/1767728.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we are only interested in compounds that have a target Tanimoto coefficient of 0.8, we apply an additional <span class="highlight">$match operator</span> to filter out all the ones that do not reach this coefficient.</p>
<script src="https://gist.github.com/1767791.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">full pipeline command</span> can be found below. </p>
<script src="https://gist.github.com/1767950.js"></script>
<p>&nbsp;</p>
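<p style="text-align: justify;">As a hedged sketch, the first stages of such a pipeline might be assembled with the 2.x MongoDB Java driver as follows (field names follow the article; the bounds assume the 0.8 coefficient and 40-fingerprint example):</p>
<pre><code>import com.mongodb.*;
import java.util.List;

public class TanimotoPipeline {

    public static AggregationOutput run(DBCollection compounds, List<String> targetFingerprints) {
        // stage 1: only compounds whose fingerprint count can satisfy the coefficient;
        // for 0.8 and 40 target fingerprints: ceil(0.8 * 40) = 32, floor(40 / 0.8) = 50
        DBObject matchCount = new BasicDBObject("$match", new BasicDBObject("fingerprint_count",
                new BasicDBObject("$gte", 32).append("$lte", 50)));
        // stage 2: one document per fingerprint
        DBObject unwind = new BasicDBObject("$unwind", "$fingerprints");
        // stage 3: keep only fingerprints shared with the target compound
        DBObject matchShared = new BasicDBObject("$match",
                new BasicDBObject("fingerprints", new BasicDBObject("$in", targetFingerprints)));
        // stage 4: regroup per compound, counting the shared fingerprints
        DBObject group = new BasicDBObject("$group", new BasicDBObject("_id", "$compound_cid")
                .append("fingerprint_matches", new BasicDBObject("$sum", 1))
                .append("total_count", new BasicDBObject("$first", "$fingerprint_count"))
                .append("smiles", new BasicDBObject("$first", "$smiles")));
        // the $project computing the coefficient and the final $match follow the same pattern
        return compounds.aggregate(matchCount, unwind, matchShared, group);
    }
}
</code></pre>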
<p>The <span class="highlight">output</span> of this pipeline contains a list of compounds which have a Tanimoto of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/pipeline.jpg">
<p align="center"><img width="900" src="http://datablend.be/wp-content/uploads/pipeline.jpg" alt="pipeline" /></p>
<p></a>  </p>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce type of algorithms in a more <span class="highlight">concise</span> fashion. The pipeline concept beneath it offers an <span class="highlight">intuitive</span> way of processing data. It is no surprise that this pipeline paradigm is adopted by various NoSQL approaches, including <a href="https://github.com/tinkerpop/gremlin/wiki" target='_blank'>Tinkerpop&#8217;s Gremlin Framework</a> and <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" target='_blank'>Neo4J&#8217;s Cypher</a> implementation.</p>
<p style="text-align: justify;"><span class="highlight">Performance wise</span>, the pipeline solution is a major improvement upon the map-reduce implementation. The employed operators are <span class="highlight">natively supported by the MongoDB platform</span>, which results in a huge performance improvement with respect to interpreted Javascript. As the Aggregation Framework is also able to work in a <span class="highlight">sharded environment</span>, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. <span class="highlight">Great work from the MongoDB team!</span></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=265</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi</title>
		<link>https://datablend.be/?p=260</link>
		<comments>https://datablend.be/?p=260#comments</comments>
		<pubDate>Mon, 21 Nov 2011 09:47:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[sail]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=260</guid>
		<description><![CDATA[Last week, the Neo4J plugin for Gephi was released. Gephi is an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs. The graphs themselves can be loaded through a variety of file formats. Thanks to Martin Škurla, it is now possible to load and lazily explore graphs that are stored<p><a href="https://datablend.be/?p=260">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Last week, the <a target='_blank' href="http://neo4j.org">Neo4J</a> <a target='_blank' href="https://gephi.org/plugins/neo4j-graph-database-support/">plugin</a> for <a target='_blank' href="http://gephi.org/">Gephi</a> was released. Gephi is an open-source <span class="highlight">visualization</span> and <span class="highlight">manipulation</span> tool that allows users to <span class="highlight">interactively browse and explore graphs</span>. The graphs themselves can be loaded through a variety of file formats. Thanks to <a target='_blank' href="http://twitter.com/#!/muzPayne">Martin Škurla</a>, it is now possible to load and lazily explore graphs that are stored in a Neo4J data store.</p>
<p style="text-align: justify;">In <a target='_blank' http://datablend.be/?p=554>one of my previous articles</a>, I explained how Neo4J and the <a target='_blank' href="http://tinkerpop.com/">Tinkerpop</a> framework can be used to load and query RDF triples. The newly released Neo4J plugin now allows to visually browse these RDF triples and perform some more fancy operations such as <span class="highlight">finding patterns</span> and <span class="highlight">executing social network analysis algorithms</span> from within Gephi itself. Tinkerpop&#8217;s <a target='_blank' https://github.com/tinkerpop/blueprints/wiki/Sail-Ouplementation>Sail Ouplementation</a> also supports the notion of <a target='_blank' href="http://www-ksl.stanford.edu/software/jtp/doc/owl-reasoning.html">RDF Schema inferencing</a>. Inferencing is the process where new (RDF) data is <span class="highlight">automatically deducted</span> from existing (RDF) data through <span class="highlight">reasoning</span>. Unfortunately, the Sail reasoner cannot easily be integrated within Gephi, as the Gephi plugin grabs a lock on the Neo4J store and no RDF data can be added, except through the plugin itself.</p>
<p style="text-align: justify;">Being able to visualize the RDF Schema reasoning process and graphically indicate which RDF triples were added manually and which RDF data was automatically inferred would be a nice to have. To implement this feature, we should be able to push graph changes from Tinkerpop and Neo4J to Gephi. Luckily, the <a target='_blank' href="https://gephi.org/plugins/graph-streaming/">Gephi graph streaming plugin</a> allows us to do just that. In the rest of this article, I will detail how to setup the required Gephi environment and how we can stream (inferred) RDF data from Neo4J to Gephi.</p>
<p>&nbsp;</p>
<h3>1. Adding the (inferred) RDF data</h3>
<p style="text-align: justify;">Let&#8217;s start by setting up the required Neo4J/Tinkerpop/Sail environment that we will use to store and infer RDF triples. The setup is similar to the one explained in <a target='_blank' http://datablend.be/?p=554>my previous Tinkerpop article</a>. However, instead of wrapping our <em>GraphSail</em> as a <em>SailRepository</em>, we will wrap it as a <em>ForwardChainingRDFSInferencer</em>. This inferencer will listen for RDF triples that are added and/or removed and will automatically execute RDF Schema inferencing, applying the rules as defined by the <a target='_blank' href="http://www.w3.org/TR/2004/REC-rdf-mt-20040210/">RDF Semantics Recommendation</a>.</p>
<script src="https://gist.github.com/1380489.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">We are now ready to add RDF triples. Let&#8217;s create a simple <span class="highlight">loop</span> that allows us to read-in RDF triples and add them to the Sail store.</p>
<script src="https://gist.github.com/1380494.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The inference method itself is rather simple. We first start by parsing the RDF <span class="highlight">subject</span>, <span class="highlight">predicate</span> and <span class="highlight">object</span>. Next, we start a new transaction, add the statement and commit the transaction. This will not only add the RDF triple to our Neo4J store but will additionally run the RDF Schema inferencing process and automatically add the inferred RDF triples. Pretty easy!</p>
<script src="https://gist.github.com/1380511.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">But how do we retrieve the inferred RDF triples that were added through the inference process? Although the <em>ForwardChainingRDFSInferencer</em> allows us to register a listener that is able to detect changes to the graph, it does not provide the required API to distinct between the <span class="highlight">manually added or inferred RDF triples</span>. Luckily, we can still access the underlying Neo4J store and capture these graph changes by implementing the Neo4J <em>TransactionEventHandler</em> interface. After a transaction is committed, we can fetch the newly created <span class="highlight">relationships</span> (i.e. RDF triples). For each of these relationships, the <span class="highlight">start node</span> (i.e. RDF subject), <span class="highlight">end node</span> (i.e. RFD object) and <span class="highlight">relationship type</span> (i.e. RDF predicate) can be retrieved. In case a RDF triple was added through inference, the value of the boolean property <span class="highlight">&#8220;inferred&#8221;</span> is <span class="highlight">&#8220;true&#8221;</span>. We filter the relationships to the ones that are defined within our domain (as otherwise the full RDFS meta model will be visualized as well). Finally we push the relevant nodes and edges.</p>
<script src="https://gist.github.com/1380584.js"></script>
<p>&nbsp;</p>
<h3>2. Pushing the (inferred) RDF data</h3>
<p style="text-align: justify;">
The streaming plugin for Gephi allows reading and visualizing data that is sent to its master server. This master server is a REST interface that is able to receive graph data through a JSON interface. The <em>PushUtility</em> used in the <em>PushTransactionEventHandler</em> is responsible for generating the <a target='_blank' href="http://wiki.gephi.org/index.php/Specification_-_GSoC_Graph_Streaming_API">required JSON edge and node data format</a> and pushing it to the Gephi master.</p>
<script src="https://gist.github.com/1380606.js"></script>
<p>&nbsp;</p>
<h3>3. Visualizing the (inferred) RDF data</h3>
<p style="text-align: justify;">Start the Gephi Streaming Master server. This will allow Gephi to receive the (inferred) RDF triples that we send it through its REST interface. Let&#8217;s run our Java application and add the following RDF triples:</p>
<script src="https://gist.github.com/1380624.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first two RDF triples above state that a <span class="highlight">teacher</span> <span class="highlight">teaches</span> a <span class="highlight">student</span>. The last RDF triple states that <span class="highlight">Davy</span> <span class="highlight">teaches</span> <span class="highlight">Bob</span>. As a result, the RDF Schema inferencer deducts that <span class="highlight">Davy</span> must be a <span class="highlight">teacher</span> and that <span class="highlight">Bob</span> must be a <span class="highlight">student</span>. Let&#8217;s have a look at what Gephi visualized for us.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi1.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi1.jpg" alt="gephi" /></p>
<p></a></p>
<p style="text-align: justify;">Mmm &#8230; That doesn’t really look impressive <img src='https://datablend.be/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . Let&#8217;s use some formatting. First apply <span class="highlight">Force Atlas lay-outing</span>. Afterwards, scale the edges and enable the labels on both the edges and the nodes. Finally, apply <span class="highlight">partitioning</span> on the edges by coloring the arrows using the <span class="highlight">inferred</span> property on the edges. We can now clearly identify the inferred RDF statements (i.e. Davy being a teacher and Bob being a student).</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi2.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi2.jpg" alt="gephi" /></p>
<p></a></p>
<p>&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s add some additional RDF triples.</p>
<script src="https://gist.github.com/1380683.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Basically, these RDF triples state that both <span class="highlight">teacher</span> and <span class="highlight">student</span> are <span class="highlight">subclasses</span> of <span class="highlight">person</span>. As a result, the RDFS inferencer is able to deduct that both <span class="highlight">Davy</span> and <span class="highlight">Bob</span> must be <span class="highlight">persons</span>. The Gephi visualization is updated accordingly.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi3.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi3.jpg" alt="gephi" /></p>
<p></a> </p>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">With just a few lines of code we are able to stream (inferred) RDF triples to Gephi and make use of its powerful visualization and analysis tools to explore and inspect our datasets. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/blueprints-streaming-sail-inferencing">Datablend public GitHub repository</a>. Make sure to surf the internet to find some other nice Gephi streaming examples, the coolest one probably being <a target='_blank' href="http://www.youtube.com/watch?v=2guKJfvq4uI">the visualization of the Egyptian revolution on Twitter</a>. </p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=260</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The (non-)sense of NoSQL O(R)M frameworks</title>
		<link>https://datablend.be/?p=258</link>
		<comments>https://datablend.be/?p=258#comments</comments>
		<pubDate>Tue, 11 Oct 2011 09:45:53 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[orm]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=258</guid>
		<description><![CDATA[NoSQL seems to be ready for prime time. Several NoSQL companies, including 10gen (MongoDB), DataStax (Cassandra) and Neo Technology (Neo4J), recently received millions in funding to expand their (commercial) NoSQL offerings. Even Oracle is now entering the already crowded NoSQL-space with its very own key-value NoSQL Database 11g. No doubt that this type of publicity<p><a href="https://datablend.be/?p=258">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;"><span class="highlight">NoSQL</span> seems to be ready for <span class="highlight">prime time</span>. Several NoSQL companies, including <a target ="_blank" href="http://www.10gen.com">10gen</a> (MongoDB), <a target ="_blank" href="http://www.datastax.com">DataStax</a> (Cassandra) and <a target ="_blank" href="http://neotechnology.com">Neo Technology</a> (Neo4J), recently received millions in funding to expand their (commercial) NoSQL offerings. Even <span class="highlight">Oracle</span> is now entering the already crowded NoSQL-space with its very own key-value <a target ="_blank" href="http://www.oracle.com/technetwork/database/nosqldb">NoSQL Database 11g</a>. No doubt that this type of publicity will boost the <span class="highlight">enterprise adoption</span> of NoSQL technologies.</p>
<p style="text-align: justify;">At the same time, the rise of <span class="highlight">Object Relational Mapping</span> (ORM) within the NoSQL space is impressive. Multiple approaches and frameworks are competing within the same solution space and I can&#8217;t stop wondering whether they do not enter the market <span class="highlight">too soon</span> &#8230; Don&#8217;t get me wrong. I strongly believe in the advantages of using an ORM. In fact, I can not even remember my last enterprise-type application that is not powered by an ORM. So why am I expressing <span class="highlight">my concerns</span> in case of NoSQL?</p>
<p style="text-align: justify;">Relational databases have been around for the past thirty or more years. We have all done our own share of <span class="highlight">low-level database work</span> and have been exposed to the overal technicalities of a RDBMS. As a result, when advancing to the use of an ORM, people can count on this <span class="highlight">basic knowledge set</span> when problems are encountered. Most of us however, lack this type of in-depth NoSQL expertise. Nevertheless, as the NoSQL hype is growing, people will fall back on their ORM expertise in order to quickly adopt this new technology. This is enforced by the motivation of several NoSQL ORM frameworks: <em>&#8220;Don&#8217;t mind the NoSQL complexities. Just employ an approach you already know!&#8221;</em>. Unfortunately, all abstractions will <span class="highlight">fail</span> you at some point in time &#8230; </p>
<p style="text-align: justify;">But what about the <span class="highlight">mapping-side of things</span>? First of all, various NoSQL approaches, including document-based databases, don&#8217;t exhibit the <span class="highlight">object-oriented impedance mismatch</span>. In approaches that do, an ORM will not always make sense. Several ORMs target a variety of NoSQL technologies by providing a generic mapping framework: <em>&#8220;Map your objects. Don&#8217;t care about the target NoSQL technology. We&#8217;ll do that.&#8221;</em>. Unfortunately, this approach will fail you at achieving the true (performance) promise of NoSQL. An ORM framework would not have been able to help me at attaining the performance gains as described in one of <a target ="_blank" href="http://datablend.be/?p=202">my previous articles</a>. In fact, it would not even make sense to implement the problem domain as such. Hence, I&#8217;m afraid that the use of an ORM framework will disappoint a lot of NoSQL newcomers &#8230;</p>
<p style="text-align: justify;">ORM frameworks in the NoSQL space will however get us closer to the idea of <span class="highlight">Polyglot Persistence</span>. <a target ="_blank" href="http://www.springsource.org/spring-data/neo4j">Spring Data Graph</a> for instance, allows you to map <span class="highlight">Pojos</span> in such a way that parts of it are persisted to a traditional database and parts of it to the Neo4J graph database. This is achieved in a technical transparant way, making it an easy-to-use solution. Nevertheless, I still feel it&#8217;s way too soon for ORMs as the knowledge and best practices on NoSQL are just being developed.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=258</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 2)</title>
		<link>https://datablend.be/?p=256</link>
		<comments>https://datablend.be/?p=256#comments</comments>
		<pubDate>Thu, 11 Aug 2011 09:44:06 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=256</guid>
<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In part 1 of this article, I described the use of MongoDB to solve a specific<p><a href="https://datablend.be/?p=256">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span>. Depending on the target <span class="highlight">Tanimoto coefficient</span>, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query <span class="highlight">increases significantly</span> when the target Tanimoto is <span class="highlight">lowered</span>. The example code on the <a href="https://github.com/datablend/mongo-compound-comparison" target='_blank'>GitHub repository</a> for instance, imports and indexes <span class="highlight">~25000</span> chemical compounds. When a target Tanimoto of <span class="highlight">0.8</span> is employed, the query returns <span class="highlight">~700</span> compounds. When the target Tanimoto is lowered to <span class="highlight">0.6</span>, the number of returned compounds increases to <span class="highlight">~7000</span>. Using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a>!</p>
<p>&nbsp;</p>
<h3>1. MongoDB molecular similarity map-reduce query</h3>
<p style="text-align: justify;"><a href="http://nl.wikipedia.org/wiki/MapReduce" target='_blank'>Map-reduce</a> is a conceptual framework, introduced by Google, to enable the <span class="highlight">processing of huge datasets</span> using a <span class="highlight">large number of processing nodes</span>. The general idea is that a <span class="highlight">larger problem</span> is <span class="highlight">divided</span> in a set of <span class="highlight">smaller subproblems</span> that can be answered (i.e. solved) by an individual processing node (the <span class="highlight">map-step</span>). Afterwards, the <span class="highlight">individual solutions</span> are <span class="highlight">combined</span> again to produce the <span class="highlight">final answer</span> to the <span class="highlight">larger problem</span> (the <span class="highlight">reduce-step</span>). By making sure that the individual <span class="highlight">map and reduce steps</span> can be computed <span class="highlight">independently</span> of each other, this divide-and-conquer technique can be easily <span class="highlight">parallellized</span> on a cluster of processing nodes. Let&#8217;s start by refactoring our solution to use MongoDB&#8217;s map-reduce functionality.</p>
<script src="https://gist.github.com/1139665.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">map-step</span> of a MongoDB&#8217;s map-reduce implementation takes a MongoDB <span class="highlight">document as input</span> and <span class="highlight">emits</span> one (or more) answers (which, in essence, are again MongoDB documents). Executing our <span class="highlight">map-step</span> on all compound documents in the <span class="highlight">compounds</span> collection would not be very efficient. Instead, we would like to limit the execution of our map-step to those documents that can theoretically match the target Tanimoto. Luckily, we already defined this query, namely the <span class="highlight">compound selection query</span> that was described in the part one of this article! By employing this query, only compounds that match this query are pushed through the map-step. A MongoDB map (and reduce) function is expressed through <span class="highlight">JavaScript</span>. In our case, we calculate the number of unique fingerprint patterns that are <span class="highlight">shared</span> by both the target and input compound. In case the <span class="highlight">minimum number of fingerprint patterns</span> is reached, the map-step <span class="highlight">emits</span> a document containing the PubChem identifier (as id) and some essential statistics (as values). A <span class="highlight">reduce-step</span> is employed to <span class="highlight">aggregate</span> answers into the final result. In our case however, we are interested into the individual results for each compound (document). Hence, no reduce function is applied. When this map-reduce function is executed only <span class="highlight">27</span> compounds are returned (which could potentially match), instead of <span class="highlight">7000</span> compounds when employing the previous Java query!</p>
<p style="text-align: justify;">One would expect the <span class="highlight">execution time</span> of the <span class="highlight">map-reduce query</span> to be considerably faster compared to the <span class="highlight">Java solution</span>. Unfortunately, this is <span class="highlight">not</span> the case. First of all, interpreted Javascript is a multitude of times slower compared to Java. Secondly, although map-reduce steps could be <span class="highlight">parallellized</span> when multiple CPU cores are available, the MongoDB map-reduce function always runs <span class="highlight">single-threaded</span>. To <span class="highlight">circumvent this limitation</span>, one can use <a href="http://www.mongodb.org/display/DOCS/Sharding+Introduction" target='_blank'>MongoDB sharding</a>. Simply explained, instead of putting all data on a <span class="highlight">single</span> MongoDB node, <span class="highlight">multiple</span> MongoDB nodes are employed, each responsible for storing <span class="highlight">a part of the total data set</span>. When executing our map-reduce function, each node will execute the map-reduce steps on its part of the data in <span class="highlight">parallel</span>. Hence, when using sharding on a cluster of <span class="highlight">4</span> MongoDB nodes, the map-reduce query  executes almost <span class="highlight">4</span> times faster, already catching up with the performance of the Java solution. With the exception of the MongoDB <span class="highlight">sharding configuration</span>, no changes are required to the map-reduce function itself. Hence, <span class="highlight">scaling horizontally</span> with MongoDB is a breeze &#8230; </p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">MongoDB&#8217;s map-reduce performance is a bit of a disappointment. MongoDB currently advises to only use it for <span class="highlight">near real-time computations</span>. <span class="highlight">Version 2.0</span> of MongoDB should drastically <span class="highlight">improve</span> map-reduce performance, as the JavaScript engine will be replaced by other execution platforms. Nevertheless, map-reduce performance can currently be boosted by splitting the load on multiple MongoDB shards.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 1)</title>
		<link>https://datablend.be/?p=254</link>
		<comments>https://datablend.be/?p=254#comments</comments>
		<pubDate>Sun, 07 Aug 2011 09:42:36 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=254</guid>
<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In one of my previous blog posts, I debated the superficial idea that you should own<p><a href="https://datablend.be/?p=254">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;">Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In one of my <a href="http://datablend.be/?p=437"  target='_blank' >previous</a> blog posts, I debated the superficial idea that you should own billions of data records before you are <em>eligible</em> to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically <a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a>, to solve a specific <a  target='_blank' href="http://en.wikipedia.org/wiki/Cheminformatics">Chemoinformatics</a> problem in a truly elegant and efficient way. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Molecular similarity theory</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that <span class="highlight">similar compounds</span> generally exhibit <span class="highlight">similar biological</span> activities.) Unfortunately, finding substructures in chemical compounds is a <span class="highlight">NP-complete</span> problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of <span class="highlight">structural keys</span> and <span class="highlight">fingerprints</span>.</p>
<p style="text-align: justify;">In case of <span class="highlight">structural keys</span>, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a <em>bitstring</em>. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)<br />
When employing <span class="highlight">fingerprints</span>, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a <span class="highlight">hash</span>. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential <span class="highlight">false positives</span>.</p>
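<p style="text-align: justify;">The following toy example (mine, not from the post) illustrates how folding patterns into a fixed-size bitstring can introduce such false positives:</p>
<pre>
// Illustrative sketch: fold fingerprint patterns into a fixed-size bitstring.
// Two different patterns may hash to the same position, so a candidate can
// pass the bit-level screen without truly containing the query patterns.
import java.util.BitSet;

public class HashedFingerprintSketch {
    private static final int BITS = 1024; // typical fixed fingerprint width

    static BitSet fold(String... patterns) {
        BitSet bits = new BitSet(BITS);
        for (String pattern : patterns) {
            // Different patterns can map to the same bit position (collision).
            bits.set(Math.abs(pattern.hashCode()) % BITS);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet query = fold("C-C-N", "C=O");
        BitSet candidate = fold("C-C-N", "C-O");
        // The candidate passes the screen whenever all query bits are set in
        // it; if "C=O" and "C-O" happened to collide, that match would be a
        // false positive requiring a full verification step afterwards.
        BitSet screen = (BitSet) query.clone();
        screen.and(candidate);
        System.out.println(screen.equals(query));
    }
}
</pre>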
<p style="text-align: justify;">In this article, we will demonstrate the use of <span class="highlight">non-hashed fingerprints</span> to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:</p>
<ol>
<li>We eliminate the chance of false positives</li>
<li>The raw fingerprints can be used in other types of structural compound mining</li>
</ol>
<p>&nbsp;</p>
<h3>2. Molecular similarity practice</h3>
<p style="text-align: justify;">Let&#8217;s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don&#8217;t need to implement these algorithms ourselves. The excellent open-source <a target='_blank' href="http://jcompoundmapper.sourceforge.net">jCompoundMapper</a> library provides us with all the required functionality. It uses <a target='_blank' href="http://en.wikipedia.org/wiki/Chemical_table_file">MDL SD</a> formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the <span class="highlight">2DMol</span> fingerprinter to encode the first molecule in the <em>compounds.sdf</em> file.</p>
<script src="https://gist.github.com/1129153.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first molecule of the input file has the following chemical structure: <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a>. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:</p>
<script src="https://gist.github.com/1129183.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Now the question that remains is how to measure the similarity between the fingerprints of compounds <span class="highlight">A</span> and <span class="highlight">B</span>. Several methods are again available, but in our case we will use the so-called <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto association coefficient</a>, which is defined as:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large T = \frac{N_{AB}}{N_{A} + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;"><span class="highlight">N<sub>A</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">A</span>, while <span class="highlight">N<sub>B</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">B</span>. <span class="highlight">N<sub>AB</sub></span> specifies the number of unique fingerprint patterns found in <span class="highlight">both</span> compound <span class="highlight">A</span> and <span class="highlight">B</span>. As can be observed from the equation, two identical compounds would have a <span class="highlight">Tanimoto coefficient</span> of <span class="highlight">1.0</span>.</p>
<p>&nbsp;</p>
<h3>3. MongoDB datamodel</h3>
<p style="text-align: justify;"><a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a> is a so-called <a href="http://en.wikipedia.org/wiki/Document-oriented_database" target='_blank'>document-oriented database</a>. When using document-oriented databases, you basically group together all related data in a <span class="highlight">single document</span> instead of normalizing your data in various <span class="highlight">RDBMS tables</span> and using <span class="highlight">joins</span> to retrieve the required information later on. In case of MongoDB, documents are stored using <a href="http://bsonspec.org" target='_blank' >BSON</a> (binary JSON) and related documents are stored in <span class="highlight">collections</span>. Let&#8217;s start by describing the <span class="highlight">compounds</span> collection that stores a separate document for each compound. The JSON-document of our <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a> compound looks as follows:</p>
<script src="https://gist.github.com/1129242.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For each compound, we store its unique  easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and <span class="highlight">inserting</span> them in the <span class="highlight">compounds collection</span>.</p>
<script src="https://gist.github.com/1129346.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To complete our MongoDB data model, we will add the <span class="highlight">fingerprintscount</span> collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the <span class="highlight">fingerprintscount</span> collection.</p>
<script src="https://gist.github.com/1129361.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To update this collection, we make use of <span class="highlight">MongoDB&#8217;s increment</span> operator in combination with an <span class="highlight">upsert</span>. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!</p>
<script src="https://gist.github.com/1129357.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For enhancing query performance, we need to define <span class="highlight">indexes</span> for the appropriate document properties. By doing so, the created <span class="highlight">B-Tree</span> indexes are used by MongoDB &#8216;s <span class="highlight">query optimizer</span> to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.</p>
<script src="https://gist.github.com/1129375.js"></script>
<p>&nbsp;</p>
<h3>4. MongoDB molecular similarity query</h3>
<p style="text-align: justify;">It&#8217;s time to bring it all together. Imagine we want to find all compounds that have a <span class="highlight">Tanimoto coefficient</span> of 0.8 for a particular input compound. As MongoDB&#8217;s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let&#8217;s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (<span class="highlight">A</span>) has 40 unique fingerprint patterns. Let&#8217;s fill in some of the parameters in the Tanimoto equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{AB}}{40 + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;">From this equation, we can deduct the <span class="highlight">minimum</span> and <span class="highlight">maximum</span> number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{B}}{40 + N_{B} - N_{B}}"> <img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{40}{40 + N_{B} - 40}"></center></p>
<p style="text-align: justify;">Hence, the compound we are comparing with should only be considered if it has between <span class="highlight">32</span> and <span class="highlight">50</span> unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share <span class="highlight">at least 1 of 9</span> unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be <span class="highlight">31</span>, and</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large \frac{31}{40 + 31 - 31} = 0.77"></center></p>
<p style="text-align: justify;">were we replaced <span class="highlight">N<sub>B</sub></span> by <span class="highlight">31</span> to maximize the resulting Tanimoto coefficient. This <span class="highlight">9</span>-number can be obtained by solving the following equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large N_{query} = (N_{A} - N_{minimum-number-of-fingerprint-patterns}) + 1"></center></p>
<p style="text-align: justify;">Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the <span class="highlight">occurrence in the total compound population is the lowest</span>, as this would narrow our search space even further. (Hence the need for the <span class="highlight">fingerprintscount</span> collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB <span class="highlight">QueryBuilder</span> we create the query object that, using the <span class="highlight">in</span>-operator, specifies that a document from the <span class="highlight">fingerprintscount</span> collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional <span class="highlight">sort</span>-operator, we retrieve the individual fingerprint patterns in ascending <span class="highlight">count</span>-order.</p>
<script src="https://gist.github.com/1129445.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">It&#8217;s time for the final step, namely retrieving all compound documents that <span class="highlight">have the potential</span> to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.</p>
<script src="https://gist.github.com/1129469.js"></script>
<p>&nbsp;</p>
<h3>5. Conclusion</h3>
<p style="text-align: justify;">The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB&#8217;s build-in map-reduce functionality</a> to keep the Tanimoto calculation local to the compound data and hence improve overall performance.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=254</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
