The joy of algorithms and NoSQL: a MongoDB example (part 2)

[information]

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance.

[/information]

In part 1 of this article, I described the use of MongoDB to solve a specific chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in sub-second time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the target Tanimoto coefficient. Even with this optimization in place, the number of compounds returned by the query increases significantly when the target Tanimoto is lowered. The example code on the GitHub repository, for instance, imports and indexes ~25000 chemical compounds. With a target Tanimoto of 0.8, the query returns ~700 compounds; when the target is lowered to 0.6, the number of returned compounds increases to ~7000. Using MongoDB’s explain functionality, one can observe that the internal query execution time increases only slightly; the real overhead lies in transferring the full list of 7000 compounds to the remote Java application. Hence, it makes more sense to perform the calculations locally, where the data is stored. Welcome to MongoDB’s built-in map-reduce functionality!
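To make this concrete, the compound selection query from part 1 could look as follows in the MongoDB shell. This is a sketch: the compounds collection and the fingerprints and fingerprint_count field names are assumptions about the schema, and the fingerprint values are placeholders.

// Fingerprints of the target compound (placeholder values)
var targetFingerprints = [ "fp1", "fp2", "fp3" ];
var targetTanimoto = 0.6;

// A compound can only reach the target Tanimoto if its total number of
// fingerprints lies within these bounds relative to the target compound
var minCount = Math.ceil(targetFingerprints.length * targetTanimoto);
var maxCount = Math.floor(targetFingerprints.length / targetTanimoto);

// Select the compounds that can theoretically match, and inspect the
// internal execution statistics through explain
db.compounds.find({ fingerprint_count: { $gte: minCount, $lte: maxCount },
                    fingerprints: { $in: targetFingerprints } }).explain();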

 

1. MongoDB molecular similarity map-reduce query

Map-reduce is a conceptual framework, introduced by Google, to enable the processing of huge datasets using a large number of processing nodes. The general idea is that a larger problem is divided into a set of smaller subproblems that can each be solved by an individual processing node (the map-step). Afterwards, the individual solutions are combined again to produce the final answer to the larger problem (the reduce-step). By making sure that the individual map and reduce steps can be computed independently of each other, this divide-and-conquer technique can easily be parallelized on a cluster of processing nodes. Let’s start by refactoring our solution to use MongoDB’s map-reduce functionality.
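Before refactoring the similarity computation itself, the general shape of MongoDB’s map-reduce API can be illustrated with a small, unrelated example: counting how often each fingerprint pattern occurs in the compounds collection. This sketch assumes the same schema as above; the output collection name is arbitrary.

// Map-step: emit each fingerprint of the current compound with a count of one
var map = function() {
    this.fingerprints.forEach(function(fp) { emit(fp, 1); });
};

// Reduce-step: sum the partial counts emitted for the same fingerprint
var reduce = function(key, values) {
    return Array.sum(values);
};

// Count how often each fingerprint occurs across the whole collection
db.compounds.mapReduce(map, reduce, { out: "fingerprint_counts" });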

 

The map-step of MongoDB’s map-reduce implementation takes a MongoDB document as input and emits one or more answers (which, in essence, are again MongoDB documents). Executing our map-step on all documents in the compounds collection would not be very efficient. Instead, we want to limit the execution of the map-step to those documents that can theoretically match the target Tanimoto. Luckily, we already have such a query, namely the compound selection query described in part 1 of this article! By employing this query, only compounds that match it are pushed through the map-step.

A MongoDB map (and reduce) function is expressed in JavaScript. In our case, the map function calculates the number of unique fingerprint patterns that are shared by both the target and the input compound. If the minimum number of shared fingerprint patterns is reached, the map-step emits a document containing the PubChem identifier (as key) and some essential statistics (as value). A reduce-step normally aggregates the emitted answers into the final result; in our case, however, we are interested in the individual result for each compound (document), so no real reduction is applied. When this map-reduce function is executed, only 27 compounds (which could potentially match) are returned, instead of the 7000 compounds returned by the previous Java query!
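Putting the pieces together, the refactored similarity query could be sketched as follows. The pubchem_id field name, like the rest of the schema, is an assumption; the scope option is used to make the target data available inside the server-side JavaScript functions.

// Target compound data, as in the selection query above
var targetFingerprints = [ "fp1", "fp2", "fp3" ];
var targetTanimoto = 0.6;
var minCount = Math.ceil(targetFingerprints.length * targetTanimoto);
var maxCount = Math.floor(targetFingerprints.length / targetTanimoto);

// Map-step: count the fingerprints shared with the target compound and,
// once the theoretical minimum is reached, emit the essential statistics
var map = function() {
    var shared = 0;
    this.fingerprints.forEach(function(fp) {
        if (targetFingerprints.indexOf(fp) >= 0) shared++;
    });
    if (shared >= minCount) {
        emit(this.pubchem_id, { shared: shared, total: this.fingerprint_count });
    }
};

// Each compound emits at most once, so the reduce-step never aggregates
// anything; MongoDB still requires a reduce function, hence this no-op
var reduce = function(key, values) {
    return values[0];
};

// Finalize-step: turn the statistics into the actual Tanimoto coefficient
var finalize = function(key, value) {
    value.tanimoto = value.shared / (targetCount + value.total - value.shared);
    return value;
};

db.compounds.mapReduce(map, reduce, {
    // Reuse the selection query so that only compounds that can
    // theoretically match are pushed through the map-step
    query: { fingerprint_count: { $gte: minCount, $lte: maxCount },
             fingerprints: { $in: targetFingerprints } },
    // Expose the target data to the server-side JavaScript functions
    scope: { targetFingerprints: targetFingerprints,
             targetCount: targetFingerprints.length,
             minCount: minCount },
    finalize: finalize,
    out: { inline: 1 }
});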

One would expect the execution time of the map-reduce query to be considerably faster than that of the Java solution. Unfortunately, this is not the case. First of all, interpreted JavaScript is many times slower than Java. Secondly, although the map and reduce steps could in principle be parallelized across multiple CPU cores, MongoDB’s map-reduce implementation always runs single-threaded. To circumvent this limitation, one can use MongoDB sharding. Simply put, instead of storing all data on a single MongoDB node, multiple MongoDB nodes are employed, each responsible for storing a part of the total data set. When executing our map-reduce function, each node executes the map-reduce steps on its part of the data in parallel. Hence, when using sharding on a cluster of 4 MongoDB nodes, the map-reduce query executes almost 4 times faster, already catching up with the performance of the Java solution. With the exception of the MongoDB sharding configuration, no changes are required to the map-reduce function itself. Hence, scaling horizontally with MongoDB is a breeze …
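Setting up such a sharded configuration is mostly an administrative exercise. A minimal sketch, run against a mongos routing instance, could look as follows; the chemdb database name and the choice of pubchem_id as the (indexed) shard key are assumptions.

// Enable sharding for the database and distribute the compounds
// collection over the available shards using the chosen shard key
db.adminCommand({ enablesharding: "chemdb" });
db.adminCommand({ shardcollection: "chemdb.compounds", key: { pubchem_id: 1 } });

The map-reduce function itself stays exactly the same; the mongos instance routes the query to the individual shards and merges their partial results transparently.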

 

2. Conclusion

MongoDB’s map-reduce performance is a bit of a disappointment. MongoDB itself currently advises to use it only in situations where a real-time answer is not required. Version 2.0 of MongoDB should drastically improve map-reduce performance, as the JavaScript engine will be replaced by other execution platforms. Nevertheless, map-reduce performance can already be boosted today by splitting the load over multiple MongoDB shards.
