Redis and Lua: a NoSQL power-horse
Recently, I’ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be ‘structured’ through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various ‘Big Data’-related problems. To illustrate this point, I reimplemented my MongoDB-based molecular similarity search through Redis and its integrated Lua support. As always, the complete source code can be found on the Datablend public GitHub repository.
1. Redis ‘fingerprint’ data model
Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called fingerprints (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the Tanimoto coefficient. This computation involves the calculation of intersections between sets of fingerprints, an operation that is natively supported by Redis.
Our Redis-based data model for storing fingerprints requires three different data-structures:
- For each compound, identified by an unique key, we store its set of fingerprints (where each fingerprint is again identified by an unique key).
- For each fingerprint, identified by an unique key, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.
- For each fingerprint, we store its number of occurrences through a dedicated weight-key.
Fingerprints are calculated by using the jCompoundMapper library. The Jedis (Java Redis Client) library is employed for persisting the data-structures mentioned above. Three simple atomic operations (lines 30, 33 and 35) are sufficient to create both the inverted indexes (compound->fingerprints and fingerprint->compounds) and incrementing the accompanying counters.
2. Finding similar chemical compounds
For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my original MongoDB article. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the build-in Lua scripting support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the fingerprints of interest (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single sort-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we sub-select the top x fingerprints of interest. What a powerful and elegant command!
We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis union-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis intersect-command. Only compounds that satisfy the Tanimoto restriction are returned.
With 25.000 stored compounds, Redis requires less then 20ms to retrieve compounds that are 70% similar to a particular input compound. Snappier compared to my original MongoDB implementation. In addition, Redis requires less then 1GB of RAM to maintain a live index of the 460.000 PubChem compounds that have at least one associated assay. This allows scientist to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.