The joy of algorithms and NoSQL: a MongoDB example (part 1)

August 7, 2011 Davy Suvee chemoinformatics, compound comparison, mongodb, NoSQL No Comments

[information]

Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s build-in map-reduce functionality to improve overall performance.

[/information]

In one of my previous blog posts, I debated the superficial idea that you should own billions of data records before you are eligible to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically MongoDB, to solve a specific Chemoinformatics problem in a truly elegant and efficient way. The complete source code can be found on the Datablend public GitHub repository.

1. Molecular similarity theory

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that similar compounds generally exhibit similar biological activities.) Unfortunately, finding substructures in chemical compounds is a NP-complete problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of structural keys and fingerprints.

In case of structural keys, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a bitstring. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)
When employing fingerprints, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a hash. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential false positives.

In this article, we will demonstrate the use of non-hashed fingerprints to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:

We eliminate the chance of false positives
The raw fingerprints can be used in other types of structural compound mining

2. Molecular similarity practice

Let’s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don’t need to implement these algorithms ourselves. The excellent open-source jCompoundMapper library provides us with all the required functionality. It uses MDL SD formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the 2DMol fingerprinter to encode the first molecule in the compounds.sdf file.

The first molecule of the input file has the following chemical structure: C₅₂H₈₇N₃O₁₃. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:

Now the question that remains is how to measure the similarity between the fingerprints of compounds A and B. Several methods are again available, but in our case we will use the so-called Tanimoto association coefficient, which is defined as:

$T = \frac{N_{AB}}{N_{A} + N_{B} - N_{AB}}$

N_A refers to the number of unique fingerprint patterns found in compound A, while N_B refers to the number of unique fingerprint patterns found in compound B. N_AB specifies the number of unique fingerprint patterns found in both compound A and B. As can be observed from the equation, two identical compounds would have a Tanimoto coefficient of 1.0.

3. MongoDB datamodel

MongoDB is a so-called document-oriented database. When using document-oriented databases, you basically group together all related data in a single document instead of normalizing your data in various RDBMS tables and using joins to retrieve the required information later on. In case of MongoDB, documents are stored using BSON (binary JSON) and related documents are stored in collections. Let’s start by describing the compounds collection that stores a separate document for each compound. The JSON-document of our C₅₂H₈₇N₃O₁₃ compound looks as follows:

For each compound, we store its unique easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and inserting them in the compounds collection.

To complete our MongoDB data model, we will add the fingerprintscount collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the fingerprintscount collection.

To update this collection, we make use of MongoDB’s increment operator in combination with an upsert. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!

For enhancing query performance, we need to define indexes for the appropriate document properties. By doing so, the created B-Tree indexes are used by MongoDB ‘s query optimizer to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.

4. MongoDB molecular similarity query

It’s time to bring it all together. Imagine we want to find all compounds that have a Tanimoto coefficient of 0.8 for a particular input compound. As MongoDB’s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let’s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (A) has 40 unique fingerprint patterns. Let’s fill in some of the parameters in the Tanimoto equation:

$0.8 = \frac{N_{AB}}{40 + N_{B} - N_{AB}}$

From this equation, we can deduct the minimum and maximum number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:

$0.8 = \frac{N_{B}}{40 + N_{B} - N_{B}}$ $0.8 = \frac{40}{40 + N_{B} - 40}$

Hence, the compound we are comparing with should only be considered if it has between 32 and 50 unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share at least 1 of 9 unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be 31, and

$\frac{31}{40 + 31 - 31} = 0.77$

were we replaced N_B by 31 to maximize the resulting Tanimoto coefficient. This 9-number can be obtained by solving the following equation:

$N_{query} = (N_{A} - N_{minimum-number-of-fingerprint-patterns}) + 1$

Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the occurrence in the total compound population is the lowest, as this would narrow our search space even further. (Hence the need for the fingerprintscount collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB QueryBuilder we create the query object that, using the in-operator, specifies that a document from the fingerprintscount collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional sort-operator, we retrieve the individual fingerprint patterns in ascending count-order.

It’s time for the final step, namely retrieving all compound documents that have the potential to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.

5. Conclusion

The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the MongoDB explain functionality, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of MongoDB’s build-in map-reduce functionality to keep the Tanimoto calculation local to the compound data and hence improve overall performance.