<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; chemoinformatics</title>
	<atom:link href="http://datablend.be/?cat=26&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>Similr: blazingly fast chemical similarity searches</title>
		<link>http://datablend.be/?p=280</link>
		<comments>http://datablend.be/?p=280#comments</comments>
		<pubDate>Mon, 04 Feb 2013 10:00:38 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=280</guid>
		<description><![CDATA[Today, Datablend announces Similr to be available for beta sign-up. Similr allows scientists (in both academia and industry) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms which, combined, identify matching compounds in millisecond time. Similr&#8217;s functionalities are available through a flexible and<p><a href="http://datablend.be/?p=280">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Today, Datablend announces <a href="http://similr.li">Similr</a> to be available for <a href="http://similr.li">beta sign-up</a>. Similr allows scientists (in both academia and industry) to quickly search for compounds that exhibit a particular <span class="highlight"><em>chemical structure</em></span>. It employs a wide range of <span class="highlight"><em>fingerprinting algorithms</em></span> which, combined, identify matching compounds in <em>millisecond</em> time. Similr&#8217;s functionalities are available through a flexible and expressive REST API and allow scanning of more than 30 million compounds that have been made publicly available through <a href="http://pubchem.ncbi.nlm.nih.gov">PubChem</a>.</p>
<p style="text-align: justify;">Similr will provide <span class="highlight"><em>unlimited</em></span> API access to academics. Free commercial access, limited to 1000 API calls a month, will be available. Beyond that, customers can choose between a <span class="highlight"><em>pay-as-you-go subscription</em></span> or opt in for a dedicated installation that allows for the import of (private) compounds.</p>
<p style="text-align: justify;">Similr is being developed by Datablend, a Big Data consultancy company. Datablend&#8217;s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a <span class="highlight"><em>highly optimised chemical similarity search algorithm</em></span> that is able to scan millions of compounds at blazing speeds. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to Similr and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=280</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis and Lua: a NoSQL power-horse</title>
		<link>http://datablend.be/?p=278</link>
		<comments>http://datablend.be/?p=278#comments</comments>
		<pubDate>Tue, 29 Jan 2013 09:59:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=278</guid>
		<description><![CDATA[Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes,<p><a href="http://datablend.be/?p=278">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. <a href="http://redis.io" target="_blank">Redis</a> is frequently referred to as the <span class="highlight"><em>Swiss Army Knife</em></span> of NoSQL databases and rightfully deserves that title. At its core, it is an <span class="highlight"><em>in-memory key-value</em></span> datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various &#8216;Big Data&#8217;-related problems. To illustrate this point, I reimplemented my MongoDB-based <a href="http://datablend.be/?p=962" target="_blank">molecular similarity search</a> through Redis and its integrated Lua support. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/redis-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Redis &#8216;fingerprint&#8217; data model</h3>
</p>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called <span class="highlight"><em>fingerprints</em></span> (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto coefficient</a>. This computation involves the calculation of <span class="highlight"><em>intersections</em></span> between sets of fingerprints, an operation that is natively supported by Redis.</p>
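<p style="text-align: justify;">Since fingerprints are plain sets, the Tanimoto coefficient boils down to intersection-over-union. A minimal Python sketch (the fingerprint keys below are invented purely for illustration):</p>

```python
# Tanimoto (Jaccard) coefficient between two fingerprint sets:
# |A intersect B| / |A union B|. The fingerprint keys are hypothetical.
def tanimoto(a, b):
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

compound_a = {"fp:1", "fp:2", "fp:3", "fp:4"}
compound_b = {"fp:2", "fp:3", "fp:4", "fp:5"}
print(tanimoto(compound_a, compound_b))  # 3 shared out of 5 distinct -> 0.6
```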
<p style="text-align: justify;">Our Redis-based data model for storing fingerprints requires three different data-structures:</p>
<ol>
<li>For each <span class="highlight"><em>compound</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store its set of fingerprints (where each fingerprint is again identified by a unique key).</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, we store its number of occurrences through a dedicated <span class="highlight"><em>weight</em>-key</span>.</li>
</ol>
<p>&nbsp;</p>
<p style="text-align: justify;">A handful of simple Redis commands are sufficient to create both the <span class="highlight"><em>inverted indexes</em></span> (compound->fingerprints and fingerprint->compounds) and to increment the accompanying <span class="highlight"><em>counters</em></span>.</p>
<script src="https://gist.github.com/4666154.js"></script>
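<p style="text-align: justify;">The same three structures can be mimicked in plain Python to make the model concrete. This is only a sketch with invented compound and fingerprint ids; in Redis, the equivalent updates would be the SADD and INCR commands:</p>

```python
from collections import defaultdict

# Plain-Python stand-ins for the three Redis structures.
compound_fps = defaultdict(set)   # compound key -> set of fingerprint keys
fp_compounds = defaultdict(set)   # fingerprint key -> set of compound keys
fp_weight = defaultdict(int)      # fingerprint key -> number of occurrences

def add_compound(cid, fingerprints):
    for fp in fingerprints:
        compound_fps[cid].add(fp)   # SADD compound:<cid> <fp>
        fp_compounds[fp].add(cid)   # SADD fingerprint:<fp> <cid>  (inverted index)
        fp_weight[fp] += 1          # INCR weight:<fp>

# Invented compound ids and fingerprints, purely for illustration.
add_compound("c1", ["fp:a", "fp:b"])
add_compound("c2", ["fp:b", "fp:c"])
print(sorted(fp_compounds["fp:b"]))  # ['c1', 'c2'] -- both contain fp:b
```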
<p>&nbsp;</p>
<h3>2. Finding similar chemical compounds</h3>
<p style="text-align: justify;">For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my <a href="http://datablend.be/?p=962" target="_blank">original MongoDB article</a>. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the <span class="highlight"><em>built-in Lua scripting</em></span> support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the <span class="highlight"><em>fingerprints of interest</em></span> (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single <span class="highlight"><em>sort</em></span>-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we <span class="highlight"><em>sub-select the top x fingerprints</em></span> of interest. What a <span class="highlight"><em>powerful</em></span> and <span class="highlight"><em>elegant</em></span> command!</p>
<p style="text-align: justify;">We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis <span class="highlight"><em>union</em></span>-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis <span class="highlight"><em>intersect</em></span>-command. Only compounds that satisfy the Tanimoto restriction are returned.</p>
<script src="https://gist.github.com/4667047.js"></script>
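<p style="text-align: justify;">The overall search strategy (rank the target's fingerprints by global rarity, union the compounds behind the rarest ones, then intersect per candidate) can be sketched in pure Python. The data is invented and the min-set bound is my own simplified derivation; the real implementation is the Lua script above:</p>

```python
import math
from collections import defaultdict

# Toy in-memory stand-ins for the Redis structures (hypothetical data).
compound_fps = {
    "c1": {"fp:a", "fp:b", "fp:c"},
    "c2": {"fp:b", "fp:c", "fp:d"},
    "c3": {"fp:x", "fp:y", "fp:z"},
}
fp_compounds = defaultdict(set)   # inverted index: fingerprint -> compounds
fp_weight = defaultdict(int)      # fingerprint -> number of occurrences
for cid, fps in compound_fps.items():
    for fp in fps:
        fp_compounds[fp].add(cid)
        fp_weight[fp] += 1

def similar(target_cid, threshold):
    fps = compound_fps[target_cid]
    # Min-set size: a match must contain at least ceil(t * n) of the target's
    # n fingerprints, so a candidate missing all of the (n - ceil(t * n) + 1)
    # rarest fingerprints can never reach the threshold.
    interest = len(fps) - math.ceil(threshold * len(fps)) + 1
    # SORT-like step: rank the target's fingerprints by global rarity.
    rarest = sorted(fps, key=lambda fp: fp_weight[fp])[:interest]
    # SUNION-like step: all compounds containing any fingerprint of interest.
    candidates = set().union(*(fp_compounds[fp] for fp in rarest)) - {target_cid}
    results = {}
    for cid in candidates:
        shared = len(fps & compound_fps[cid])  # SINTER-like step
        t = shared / (len(fps) + len(compound_fps[cid]) - shared)
        if t >= threshold:
            results[cid] = t
    return results

print(similar("c1", 0.5))  # c2 shares fp:b and fp:c -> Tanimoto 0.5
```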
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">With 25,000 stored compounds, Redis requires less than 20 ms to retrieve compounds that are 70% similar to a particular input compound; noticeably snappier than my original MongoDB implementation. In addition, Redis requires less than 1GB of RAM to maintain a live index of the <a href="http://cactus.nci.nih.gov/download/roadmap/">460,000 PubChem compounds</a> that have at least one associated assay. This allows scientists to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework</title>
		<link>http://datablend.be/?p=265</link>
		<comments>http://datablend.be/?p=265#comments</comments>
		<pubDate>Wed, 08 Feb 2012 09:50:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=265</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the<p><a href="http://datablend.be/?p=265">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of MongoDB’s built-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to improve overall performance. <a href="http://datablend.be/?p=1400" target='_blank'>Part 3</a>, finally, illustrates the use of the new MongoDB <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>, which boosts performance beyond the capabilities of the map-reduce implementation.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span> through <span class="highlight">Tanimoto coefficients</span>. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, <a href="http://datablend.be/?p=968" target='_blank'>part 2</a> of this article describes the use of MongoDB’s built-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.</p>
<p style="text-align: justify;">Recently, MongoDB introduced its new <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>. This framework provides a simpler solution for calculating <span class="highlight">aggregate values</span> than relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison-revisited">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. MongoDB Aggregation Framework</h3>
<p style="text-align: justify;">The MongoDB Aggregation Framework draws on the well-known <span class="highlight">Linux pipeline</span> concept, where the output of one command is <span class="highlight">piped</span> or <span class="highlight">redirected</span> to be used as input of the next command. In case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as <span class="highlight">$match</span>, <span class="highlight">$limit</span> and <span class="highlight">$skip</span> take a document as input and output the same document in case a certain set of criteria is met. Other operators, such as <span class="highlight">$project</span> and <span class="highlight">$unwind</span> take a single document as input and reshape that document or emit multiple documents based upon a certain projection. Finally, the <span class="highlight">$group</span> operator takes multiple documents as input and groups them into a single document by aggregating the relevant values. <span class="highlight">Expressions</span> can be used within some of these operators to calculate new values or execute string operations.</p>
<p style="text-align: justify;">Multiple operators are combined into a single pipeline that is applied upon a list of documents. The pipeline itself is executed as a MongoDB <span class="highlight">Command</span>, resulting in a single MongoDB document that contains an array of all documents that came out at the end of the pipeline. The next paragraph details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.</p>
<p>&nbsp;</p>
<h3>2. Molecular Similarity Pipeline</h3>
<p style="text-align: justify;">When applying a pipeline upon a certain collection, <span class="highlight">all</span> documents contained within this collection are given as input to the first operator. It&#8217;s considered best practice to <span class="highlight">filter</span> this list as quickly as possible to limit the number of total documents that are passed through the pipeline. In our case, this means filtering out all documents that will never be able to <span class="highlight">satisfy the target Tanimoto coefficient</span>. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the <span class="highlight">$match</span> operator looks as follows:</p>
<script src="https://gist.github.com/1767088.js"></script>
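<p style="text-align: justify;">The bounds follow directly from the Tanimoto definition: the coefficient can never exceed min(n_a, n_b) / max(n_a, n_b), so a candidate for a 0.8 target against a 40-fingerprint compound needs between ceil(0.8 * 40) = 32 and floor(40 / 0.8) = 50 fingerprints. A quick check in Python (the helper name is my own):</p>

```python
import math

def fingerprint_count_bounds(n_target, tanimoto):
    # Tanimoto <= min(n_a, n_b) / max(n_a, n_b), so a candidate's fingerprint
    # count must lie within [ceil(t * n), floor(n / t)] of the target's n.
    return math.ceil(tanimoto * n_target), math.floor(n_target / tanimoto)

lo, hi = fingerprint_count_bounds(40, 0.8)
print(lo, hi)  # 32 50
```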
<p>&nbsp;</p>
<p style="text-align: justify;">Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the <span class="highlight">$match operator</span> is able to use the index that we have defined for the <em>fingerprint_count</em> property. For computing the Tanimoto coefficient, we need to calculate the number of <span class="highlight">shared fingerprints</span> between a certain input compound and the compound we are targeting. In order to be able to work at the fingerprint level, we use the <span class="highlight">$unwind operator</span>. $unwind peels off the elements of an array one by one, returning a stream of documents where the specified array is replaced by one of its elements. In our case, we apply the $unwind upon the <em>fingerprints</em> property. Hence, each compound document will result in <span class="highlight"><em>n</em></span> compound documents, where <span class="highlight"><em>n</em></span> is the number of unique fingerprints contained within the compound.</p>
<script src="https://gist.github.com/1767390.js"></script>
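<p style="text-align: justify;">The behaviour of $unwind is easy to emulate: one input document with an <em>n</em>-element array becomes <em>n</em> documents. A small Python illustration (the compound document is fabricated):</p>

```python
def unwind(docs, array_field):
    # Emit one copy of each document per element of the given array field,
    # with the array replaced by that single element (like MongoDB's $unwind).
    for doc in docs:
        for element in doc[array_field]:
            yield {**doc, array_field: element}

compound = {"compound_cid": 1234, "fingerprints": ["fp1", "fp2", "fp3"]}
unwound = list(unwind([compound], "fingerprints"))
print(len(unwound))  # 3 documents, one per fingerprint
```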
<p>&nbsp;</p>
<p style="text-align: justify;">In order to calculate the number of shared fingerprints, we will start off by filtering out all documents which do not have a fingerprint that is in the list of fingerprints of the target compound. For doing so, we again apply the <span class="highlight">$match operator</span>, this time filtering on the <em>fingerprints</em> property, where only documents that contain a fingerprint that is in the list of target fingerprints are maintained.</p>
<script src="https://gist.github.com/1767484.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we only match fingerprints that are in the list of target fingerprints, the output can be used to <span class="highlight">count the total number of shared fingerprints</span>. For this, we apply the <span class="highlight">$group operator</span> on the <em>compound_cid</em>, through which we create a new type of document, containing <span class="highlight">the number of matching fingerprints</span> (by summing the number of occurrences), the <span class="highlight">total number of fingerprints of the input compound</span> and the <span class="highlight">smiles representation</span>.</p>
<script src="https://gist.github.com/1767565.js"></script>
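<p style="text-align: justify;">Conceptually, this grouping step amounts to a dictionary keyed on <em>compound_cid</em>. A Python sketch with invented unwound documents (field names match the ones used above, but the values are made up):</p>

```python
def group_by_compound(docs):
    # Re-aggregate the unwound documents per compound_cid, counting how many
    # matching fingerprint documents each compound contributed ($group-style).
    groups = {}
    for doc in docs:
        group = groups.setdefault(doc["compound_cid"], {
            "_id": doc["compound_cid"],
            "fingerprint_matches": 0,
            "fingerprint_count": doc["fingerprint_count"],
            "smiles": doc["smiles"],
        })
        group["fingerprint_matches"] += 1
    return list(groups.values())

# Invented unwound documents: compound 1 matched two target fingerprints.
docs = [
    {"compound_cid": 1, "fingerprint_count": 4, "smiles": "CCO", "fingerprints": "fp1"},
    {"compound_cid": 1, "fingerprint_count": 4, "smiles": "CCO", "fingerprints": "fp2"},
    {"compound_cid": 2, "fingerprint_count": 5, "smiles": "CCN", "fingerprints": "fp1"},
]
grouped = group_by_compound(docs)
print(grouped[0]["fingerprint_matches"])  # compound 1 contributed 2 matches
```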
<p>&nbsp;</p>
<p style="text-align: justify;">We now have all parameters in place to calculate the Tanimoto coefficient. For this we will use the <span class="highlight">$project operator</span> which, in addition to copying the compound id and smiles property, also adds a new computed property named <em>tanimoto</em>.</p>
<script src="https://gist.github.com/1767728.js"></script>
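<p style="text-align: justify;">The projected value follows the usual set formula: shared / (|target| + |candidate| - shared). A Python sketch of this projection step (the grouped document below is invented):</p>

```python
def project_tanimoto(grouped, target_count):
    # Add a computed 'tanimoto' property, as the $project stage does:
    # shared / (target fingerprint count + candidate count - shared).
    out = []
    for doc in grouped:
        shared = doc["fingerprint_matches"]
        out.append({
            "compound_cid": doc["_id"],
            "smiles": doc["smiles"],
            "tanimoto": shared / (target_count + doc["fingerprint_count"] - shared),
        })
    return out

grouped = [{"_id": 1, "fingerprint_matches": 36, "fingerprint_count": 41, "smiles": "CCO"}]
print(project_tanimoto(grouped, 40))  # tanimoto = 36 / (40 + 41 - 36) = 0.8
```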
<p>&nbsp;</p>
<p style="text-align: justify;">As we are only interested in compounds that have a target Tanimoto coefficient of 0.8, we apply an additional <span class="highlight">$match operator</span> to filter out all the ones that do not reach this coefficient.</p>
<script src="https://gist.github.com/1767791.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">full pipeline command</span> can be found below. </p>
<script src="https://gist.github.com/1767950.js"></script>
<p>&nbsp;</p>
<p>The <span class="highlight">output</span> of this pipeline contains a list of compounds which have a Tanimoto of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/pipeline.jpg"><img width="900" src="http://datablend.be/wp-content/uploads/pipeline.jpg" alt="pipeline" /></a></p>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce type of algorithms in a more <span class="highlight">concise</span> fashion. The pipeline concept beneath it offers an <span class="highlight">intuitive</span> way of processing data. It is no surprise that this pipeline paradigm is adopted by various NoSQL approaches, including <a href="https://github.com/tinkerpop/gremlin/wiki" target='_blank'>Tinkerpop&#8217;s Gremlin Framework</a> and <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" target='_blank'>Neo4J&#8217;s Cypher</a> implementation.</p>
<p style="text-align: justify;"><span class="highlight">Performance-wise</span>, the pipeline solution is a major improvement upon the map-reduce implementation. The employed operators are <span class="highlight">natively supported by the MongoDB platform</span>, which results in a huge performance improvement with respect to interpreted Javascript. As the Aggregation Framework is also able to work in a <span class="highlight">sharded environment</span>, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. <span class="highlight">Great work from the MongoDB team!</span></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=265</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 2)</title>
		<link>http://datablend.be/?p=256</link>
		<comments>http://datablend.be/?p=256#comments</comments>
		<pubDate>Thu, 11 Aug 2011 09:44:06 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=256</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In part 1 of this article, I described the use of MongoDB to solve a specific<p><a href="http://datablend.be/?p=256">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s built-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span>. Depending on the target <span class="highlight">Tanimoto coefficient</span>, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query <span class="highlight">increases significantly</span> when the target Tanimoto is <span class="highlight">lowered</span>. The example code on the <a href="https://github.com/datablend/mongo-compound-comparison" target='_blank'>GitHub repository</a>, for instance, imports and indexes <span class="highlight">~25000</span> chemical compounds. When a target Tanimoto of <span class="highlight">0.8</span> is employed, the query returns <span class="highlight">~700</span> compounds. When the target Tanimoto is lowered to <span class="highlight">0.6</span>, the number of returned compounds increases to <span class="highlight">~7000</span>. Using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one is able to observe that the internal MongoDB query execution time increases only slightly; the real cost lies in transferring the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s built-in map-reduce functionality</a>!</p>
<p>&nbsp;</p>
<h3>1. MongoDB molecular similarity map-reduce query</h3>
<p style="text-align: justify;"><a href="http://nl.wikipedia.org/wiki/MapReduce" target='_blank'>Map-reduce</a> is a conceptual framework, introduced by Google, to enable the <span class="highlight">processing of huge datasets</span> using a <span class="highlight">large number of processing nodes</span>. The general idea is that a <span class="highlight">larger problem</span> is <span class="highlight">divided</span> into a set of <span class="highlight">smaller subproblems</span> that can be answered (i.e. solved) by an individual processing node (the <span class="highlight">map-step</span>). Afterwards, the <span class="highlight">individual solutions</span> are <span class="highlight">combined</span> again to produce the <span class="highlight">final answer</span> to the <span class="highlight">larger problem</span> (the <span class="highlight">reduce-step</span>). By making sure that the individual <span class="highlight">map and reduce steps</span> can be computed <span class="highlight">independently</span> of each other, this divide-and-conquer technique can be easily <span class="highlight">parallelized</span> on a cluster of processing nodes. Let&#8217;s start by refactoring our solution to use MongoDB&#8217;s map-reduce functionality.</p>
<script src="https://gist.github.com/1139665.js"></script>
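<p style="text-align: justify;">The divide-and-conquer idea itself fits in a few lines of Python. This is a generic skeleton over toy data, not the compound algorithm from the gist above:</p>

```python
from collections import defaultdict

# A generic map-reduce skeleton: every input is mapped independently
# (the parallelizable part), then the emitted values are reduced per key.
def map_reduce(inputs, mapper, reducer):
    emitted = defaultdict(list)
    for item in inputs:                 # map-step: one answer set per input
        for key, value in mapper(item):
            emitted[key].append(value)
    # reduce-step: combine the individual answers per key
    return {key: reducer(values) for key, values in emitted.items()}

# Toy example (invented data): count in how many compounds each fingerprint occurs.
compounds = [{"cid": 1, "fps": ["a", "b"]}, {"cid": 2, "fps": ["b", "c"]}]
counts = map_reduce(compounds,
                    mapper=lambda c: [(fp, 1) for fp in c["fps"]],
                    reducer=sum)
print(counts)  # {'a': 1, 'b': 2, 'c': 1}
```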
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">map-step</span> of a MongoDB map-reduce implementation takes a MongoDB <span class="highlight">document as input</span> and <span class="highlight">emits</span> one (or more) answers (which, in essence, are again MongoDB documents). Executing our <span class="highlight">map-step</span> on all compound documents in the <span class="highlight">compounds</span> collection would not be very efficient. Instead, we would like to limit the execution of our map-step to those documents that can theoretically match the target Tanimoto. Luckily, we already defined this query, namely the <span class="highlight">compound selection query</span> that was described in part one of this article! By employing this query, only compounds that match it are pushed through the map-step. A MongoDB map (and reduce) function is expressed through <span class="highlight">JavaScript</span>. In our case, we calculate the number of unique fingerprint patterns that are <span class="highlight">shared</span> by both the target and input compound. In case the <span class="highlight">minimum number of fingerprint patterns</span> is reached, the map-step <span class="highlight">emits</span> a document containing the PubChem identifier (as id) and some essential statistics (as values). A <span class="highlight">reduce-step</span> is employed to <span class="highlight">aggregate</span> answers into the final result. In our case however, we are interested in the individual results for each compound (document). Hence, no reduce function is applied. When this map-reduce function is executed, only <span class="highlight">27</span> compounds are returned (which could potentially match), instead of <span class="highlight">7000</span> compounds when employing the previous Java query!</p>
<p style="text-align: justify;">One would expect the <span class="highlight">execution time</span> of the <span class="highlight">map-reduce query</span> to be considerably faster compared to the <span class="highlight">Java solution</span>. Unfortunately, this is <span class="highlight">not</span> the case. First of all, interpreted Javascript is many times slower than Java. Secondly, although map-reduce steps could be <span class="highlight">parallelized</span> when multiple CPU cores are available, the MongoDB map-reduce function always runs <span class="highlight">single-threaded</span>. To <span class="highlight">circumvent this limitation</span>, one can use <a href="http://www.mongodb.org/display/DOCS/Sharding+Introduction" target='_blank'>MongoDB sharding</a>. Simply explained, instead of putting all data on a <span class="highlight">single</span> MongoDB node, <span class="highlight">multiple</span> MongoDB nodes are employed, each responsible for storing <span class="highlight">a part of the total data set</span>. When executing our map-reduce function, each node will execute the map-reduce steps on its part of the data in <span class="highlight">parallel</span>. Hence, when using sharding on a cluster of <span class="highlight">4</span> MongoDB nodes, the map-reduce query executes almost <span class="highlight">4</span> times faster, already catching up with the performance of the Java solution. With the exception of the MongoDB <span class="highlight">sharding configuration</span>, no changes are required to the map-reduce function itself. Hence, <span class="highlight">scaling horizontally</span> with MongoDB is a breeze &#8230;</p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">MongoDB&#8217;s map-reduce performance is a bit of a disappointment. MongoDB currently advises to only use it for <span class="highlight">near real-time computations</span>. <span class="highlight">Version 2.0</span> of MongoDB should drastically <span class="highlight">improve</span> map-reduce performance, as the JavaScript engine will be replaced by other execution platforms. Nevertheless, map-reduce performance can currently be boosted by splitting the load on multiple MongoDB shards.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 1)</title>
		<link>http://datablend.be/?p=254</link>
		<comments>http://datablend.be/?p=254#comments</comments>
		<pubDate>Sun, 07 Aug 2011 09:42:36 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=254</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In one of my previous blog posts, I debated the superficial idea that you should own<p><a href="http://datablend.be/?p=254">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;">Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In one of my <a href="http://datablend.be/?p=437"  target='_blank' >previous</a> blog posts, I debated the superficial idea that you should own billions of data records before you are <em>eligible</em> to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically <a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a>, to solve a specific <a  target='_blank' href="http://en.wikipedia.org/wiki/Cheminformatics">Chemoinformatics</a> problem in a truly elegant and efficient way. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Molecular similarity theory</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that <span class="highlight">similar compounds</span> generally exhibit <span class="highlight">similar biological</span> activities.) Unfortunately, finding substructures in chemical compounds is a <span class="highlight">NP-complete</span> problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of <span class="highlight">structural keys</span> and <span class="highlight">fingerprints</span>.</p>
<p style="text-align: justify;">In case of <span class="highlight">structural keys</span>, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a <em>bitstring</em>. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)<br />
When employing <span class="highlight">fingerprints</span>, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a <span class="highlight">hash</span>. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential <span class="highlight">false positives</span>.</p>
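<p style="text-align: justify;">The information loss behind these false positives can be illustrated with a few lines of Python. This is a hypothetical sketch: the pattern strings and the CRC32 hash are my own choices, not what jCompoundMapper actually uses. Folding many distinct patterns onto a fixed number of bit positions necessarily makes some of them indistinguishable:</p>

```python
import zlib

def hashed_fingerprint(patterns, num_bits=64):
    """Fold each substructure pattern onto one of num_bits positions.

    Distinct patterns can land on the same bit, which is exactly the
    source of false positives in hashed fingerprints."""
    return {zlib.crc32(p.encode()) % num_bits for p in patterns}

# 200 hypothetical linear substructure patterns...
patterns = ["pattern-%d" % i for i in range(200)]
bits = hashed_fingerprint(patterns)

# ...collapse onto at most 64 bit positions, so patterns must collide.
print(len(patterns), len(bits))
```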
<p style="text-align: justify;">In this article, we will demonstrate the use of <span class="highlight">non-hashed fingerprints</span> to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:</p>
<ol>
<li>We eliminate the chance of false positives</li>
<li>The raw fingerprints can be used in other types of structural compound mining</li>
</ol>
<p>&nbsp;</p>
<h3>2. Molecular similarity practice</h3>
<p style="text-align: justify;">Let&#8217;s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don&#8217;t need to implement these algorithms ourselves. The excellent open-source <a target='_blank' href="http://jcompoundmapper.sourceforge.net">jCompoundMapper</a> library provides us with all the required functionality. It uses <a target='_blank' href="http://en.wikipedia.org/wiki/Chemical_table_file">MDL SD</a> formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the <span class="highlight">2DMol</span> fingerprinter to encode the first molecule in the <em>compounds.sdf</em> file.</p>
<script src="https://gist.github.com/1129153.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first molecule of the input file has the following chemical structure: <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a>. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:</p>
<script src="https://gist.github.com/1129183.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Now the question that remains is how to measure the similarity between the fingerprints of compounds <span class="highlight">A</span> and <span class="highlight">B</span>. Several methods are again available, but in our case we will use the so-called <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto association coefficient</a>, which is defined as:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large T = \frac{N_{AB}}{N_{A} + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;"><span class="highlight">N<sub>A</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">A</span>, while <span class="highlight">N<sub>B</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">B</span>. <span class="highlight">N<sub>AB</sub></span> specifies the number of unique fingerprint patterns found in <span class="highlight">both</span> compound <span class="highlight">A</span> and <span class="highlight">B</span>. As can be observed from the equation, two identical compounds would have a <span class="highlight">Tanimoto coefficient</span> of <span class="highlight">1.0</span>.</p>
<p>&nbsp;</p>
<h3>3. MongoDB datamodel</h3>
<p style="text-align: justify;"><a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a> is a so-called <a href="http://en.wikipedia.org/wiki/Document-oriented_database" target='_blank'>document-oriented database</a>. When using document-oriented databases, you basically group together all related data in a <span class="highlight">single document</span> instead of normalizing your data in various <span class="highlight">RDBMS tables</span> and using <span class="highlight">joins</span> to retrieve the required information later on. In case of MongoDB, documents are stored using <a href="http://bsonspec.org" target='_blank' >BSON</a> (binary JSON) and related documents are stored in <span class="highlight">collections</span>. Let&#8217;s start by describing the <span class="highlight">compounds</span> collection that stores a separate document for each compound. The JSON-document of our <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a> compound looks as follows:</p>
<script src="https://gist.github.com/1129242.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For each compound, we store its unique  easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and <span class="highlight">inserting</span> them in the <span class="highlight">compounds collection</span>.</p>
<script src="https://gist.github.com/1129346.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To complete our MongoDB data model, we will add the <span class="highlight">fingerprintscount</span> collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the <span class="highlight">fingerprintscount</span> collection.</p>
<script src="https://gist.github.com/1129361.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To update this collection, we make use of <span class="highlight">MongoDB&#8217;s increment</span> operator in combination with an <span class="highlight">upsert</span>. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!</p>
<script src="https://gist.github.com/1129357.js"></script>
<p>&nbsp;</p>
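<p style="text-align: justify;">Independent of the driver code above, the increment-with-upsert behaviour can be simulated in a few lines of Python, with a plain in-memory dict standing in for the <span class="highlight">fingerprintscount</span> collection (the helper name is mine, this is not MongoDB API code):</p>

```python
def upsert_increment(collection, pattern):
    """Mimic MongoDB's {"$inc": {"count": 1}} with upsert=True:
    create the document on first sight, increment it afterwards."""
    doc = collection.setdefault(pattern, {"fingerprint": pattern, "count": 0})
    doc["count"] += 1

fingerprintscount = {}
for pattern in ["C-C-N", "C-O", "C-C-N"]:
    upsert_increment(fingerprintscount, pattern)

print(fingerprintscount["C-C-N"]["count"])  # 2: incremented on the repeat
print(fingerprintscount["C-O"]["count"])    # 1: created by the upsert
```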
<p style="text-align: justify;">For enhancing query performance, we need to define <span class="highlight">indexes</span> for the appropriate document properties. By doing so, the created <span class="highlight">B-Tree</span> indexes are used by MongoDB &#8216;s <span class="highlight">query optimizer</span> to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.</p>
<script src="https://gist.github.com/1129375.js"></script>
<p>&nbsp;</p>
<h3>4. MongoDB molecular similarity query</h3>
<p style="text-align: justify;">It&#8217;s time to bring it all together. Imagine we want to find all compounds that have a <span class="highlight">Tanimoto coefficient</span> of 0.8 for a particular input compound. As MongoDB&#8217;s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let&#8217;s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (<span class="highlight">A</span>) has 40 unique fingerprint patterns. Let&#8217;s fill in some of the parameters in the Tanimoto equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{AB}}{40 + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;">From this equation, we can deduct the <span class="highlight">minimum</span> and <span class="highlight">maximum</span> number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{B}}{40 + N_{B} - N_{B}}"> <img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{40}{40 + N_{B} - 40}"></center></p>
<p style="text-align: justify;">Hence, the compound we are comparing with should only be considered if it has between <span class="highlight">32</span> and <span class="highlight">50</span> unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share <span class="highlight">at least 1 of 9</span> unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be <span class="highlight">31</span>, and</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large \frac{31}{40 + 31 - 31} = 0.77"></center></p>
<p style="text-align: justify;">were we replaced <span class="highlight">N<sub>B</sub></span> by <span class="highlight">31</span> to maximize the resulting Tanimoto coefficient. This <span class="highlight">9</span>-number can be obtained by solving the following equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large N_{query} = (N_{A} - N_{minimum-number-of-fingerprint-patterns}) + 1"></center></p>
<p style="text-align: justify;">Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the <span class="highlight">occurrence in the total compound population is the lowest</span>, as this would narrow our search space even further. (Hence the need for the <span class="highlight">fingerprintscount</span> collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB <span class="highlight">QueryBuilder</span> we create the query object that, using the <span class="highlight">in</span>-operator, specifies that a document from the <span class="highlight">fingerprintscount</span> collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional <span class="highlight">sort</span>-operator, we retrieve the individual fingerprint patterns in ascending <span class="highlight">count</span>-order.</p>
<script src="https://gist.github.com/1129445.js"></script>
<p>&nbsp;</p>
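<p style="text-align: justify;">The effect of this query is easy to mimic in Python: sort the input compound&#8217;s patterns by their population counts and keep the rarest ones (toy counts and a hypothetical helper name, not the article&#8217;s Java code):</p>

```python
def rarest_patterns(query_patterns, counts, k):
    """Return the k query patterns with the lowest population counts,
    i.e. the patterns that narrow the search space the most."""
    return sorted(query_patterns, key=lambda p: counts[p])[:k]

counts = {"C-C": 9500, "C-O": 8700, "C-N": 4100, "C-S": 12, "C-Br": 3}
print(rarest_patterns(["C-C", "C-O", "C-N", "C-S", "C-Br"], counts, 2))
# ['C-Br', 'C-S']
```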
<p style="text-align: justify;">It&#8217;s time for the final step, namely retrieving all compound documents that <span class="highlight">have the potential</span> to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.</p>
<script src="https://gist.github.com/1129469.js"></script>
<p>&nbsp;</p>
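<p style="text-align: justify;">Putting the whole pipeline together outside MongoDB, the query logic amounts to the following Python sketch. An in-memory dict stands in for the compounds collection; the names and toy data are my own, and the article&#8217;s actual implementation is the Java code in the gists above:</p>

```python
import math

def similar_compounds(query_fp, compounds, counts, t=0.8):
    """Prefilter on the pattern-count range and on sharing at least
    one rare pattern, then verify the exact Tanimoto coefficient."""
    n_a = len(query_fp)
    n_min, n_max = math.ceil(t * n_a), math.floor(n_a / t)
    n_query = n_a - n_min + 1
    rare = set(sorted(query_fp, key=lambda p: counts.get(p, 0))[:n_query])

    results = []
    for cid, fp in compounds.items():
        if not (n_min <= len(fp) <= n_max):  # count-range filter
            continue
        if not (fp & rare):                  # must share a rare pattern
            continue
        shared = len(query_fp & fp)
        coefficient = shared / (n_a + len(fp) - shared)
        if coefficient >= t:                 # exact verification step
            results.append((cid, coefficient))
    return results

query_fp = {"a", "b", "c", "d", "e"}
counts = {"a": 100, "b": 90, "c": 80, "d": 5, "e": 2}
compounds = {
    "c1": {"a", "b", "c", "d"},            # T = 4/5 = 0.8 -> kept
    "c2": {"a", "b", "c", "d", "e", "f"},  # T = 5/6 -> kept
    "c3": {"a", "b", "c"},                 # too few patterns -> filtered
    "c4": {"x", "y", "z", "w"},            # shares no rare pattern -> filtered
}
print(similar_compounds(query_fp, compounds, counts))
```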
<h3>5. Conclusion</h3>
<p style="text-align: justify;">The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB&#8217;s build-in map-reduce functionality</a> to keep the Tanimoto calculation local to the compound data and hence improve overall performance.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=254</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
