<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; chemoinformatics</title>
	<atom:link href="http://datablend.be/?cat=26&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>Similr: blazingly fast chemical similarity searches</title>
		<link>http://datablend.be/?p=280</link>
		<comments>http://datablend.be/?p=280#comments</comments>
		<pubDate>Mon, 04 Feb 2013 10:00:38 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=280</guid>
		<description><![CDATA[Today, Datablend announces Similr to be available for beta sign-up. Similr allows scientists (in both academia and industry) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms which, combined, identify matching compounds in millisecond time. Similr&#8217;s functionalities are available through a flexible and<p><a href="http://datablend.be/?p=280">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Today, Datablend announces <a href="http://similr.li">Similr</a> to be available for <a href="http://similr.li">beta sign-up</a>. Similr allows scientists (in both academia and industry) to quickly search for compounds that exhibit a particular <span class="highlight"><em>chemical structure</em></span>. It employs a wide range of <span class="highlight"><em>fingerprinting algorithms</em></span> which, combined, identify matching compounds in <em>millisecond</em> time. Similr&#8217;s functionalities are available through a flexible and expressive REST API and allow scanning of more than 30 million compounds that have been made publicly available through <a href="http://pubchem.ncbi.nlm.nih.gov">PubChem</a>.</p>
<p style="text-align: justify;">Similr will provide <span class="highlight"><em>unlimited</em></span> API access to academics. Free commercial access, limited to 1000 API calls a month, will be available. Beyond that, customers can choose between a <span class="highlight"><em>pay-as-you-go subscription</em></span> or opt in for a dedicated installation that allows for the import of (private) compounds.</p>
<p style="text-align: justify;">Similr is being developed by Datablend, a Big Data consultancy company. Datablend&#8217;s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a <span class="highlight"><em>highly optimised chemical similarity search algorithm</em></span> that is able to scan millions of compounds at blazing speeds. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to Similr and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=280</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis and Lua: a NoSQL power-horse</title>
		<link>http://datablend.be/?p=278</link>
		<comments>http://datablend.be/?p=278#comments</comments>
		<pubDate>Tue, 29 Jan 2013 09:59:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=278</guid>
		<description><![CDATA[Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes,<p><a href="http://datablend.be/?p=278">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. <a href="http://redis.io" target="_blank">Redis</a> is frequently referred to as the <span class="highlight"><em>Swiss Army Knife</em></span> of NoSQL databases and rightfully deserves that title. At its core, it is an <span class="highlight"><em>in-memory key-value</em></span> datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various &#8216;Big Data&#8217;-related problems. To illustrate this point, I reimplemented my MongoDB-based <a href="http://datablend.be/?p=962" target="_blank">molecular similarity search</a> through Redis and its integrated Lua support. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/redis-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Redis &#8216;fingerprint&#8217; data model</h3>
</p>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called <span class="highlight"><em>fingerprints</em></span> (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto coefficient</a>. This computation involves the calculation of <span class="highlight"><em>intersections</em></span> between sets of fingerprints, an operation that is natively supported by Redis.</p>
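<p style="text-align: justify;">Since fingerprints are plain sets, the Tanimoto coefficient boils down to intersection-over-union. A minimal Python sketch (the fingerprint keys below are invented purely for illustration):</p>

```python
# Tanimoto (Jaccard) coefficient between two fingerprint sets:
# |A intersect B| / |A union B|. The fingerprint keys are hypothetical.
def tanimoto(a, b):
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

compound_a = {"fp:1", "fp:2", "fp:3", "fp:4"}
compound_b = {"fp:2", "fp:3", "fp:4", "fp:5"}
print(tanimoto(compound_a, compound_b))  # 3 shared out of 5 distinct -> 0.6
```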
<p style="text-align: justify;">Our Redis-based data model for storing fingerprints requires three different data-structures:</p>
<ol>
<li>For each <span class="highlight"><em>compound</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store its set of fingerprints (where each fingerprint is again identified by a unique key).</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, we store its number of occurrences through a dedicated <span class="highlight"><em>weight</em>-key</span>.</li>
</ol>
<p>&nbsp;</p>
<p style="text-align: justify;">A handful of simple Redis commands are sufficient to create both the <span class="highlight"><em>inverted indexes</em></span> (compound->fingerprints and fingerprint->compounds) and to increment the accompanying <span class="highlight"><em>counters</em></span>.</p>
<script src="https://gist.github.com/4666154.js"></script>
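<p style="text-align: justify;">The same three structures can be mimicked in plain Python to make the model concrete. This is only a sketch with invented compound and fingerprint ids; in Redis, the equivalent updates would be the SADD and INCR commands:</p>

```python
from collections import defaultdict

# Plain-Python stand-ins for the three Redis structures.
compound_fps = defaultdict(set)   # compound key -> set of fingerprint keys
fp_compounds = defaultdict(set)   # fingerprint key -> set of compound keys
fp_weight = defaultdict(int)      # fingerprint key -> number of occurrences

def add_compound(cid, fingerprints):
    for fp in fingerprints:
        compound_fps[cid].add(fp)   # SADD compound:<cid> <fp>
        fp_compounds[fp].add(cid)   # SADD fingerprint:<fp> <cid>  (inverted index)
        fp_weight[fp] += 1          # INCR weight:<fp>

# Invented compound ids and fingerprints, purely for illustration.
add_compound("c1", ["fp:a", "fp:b"])
add_compound("c2", ["fp:b", "fp:c"])
print(sorted(fp_compounds["fp:b"]))  # ['c1', 'c2'] -- both contain fp:b
```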
<p>&nbsp;</p>
<h3>2. Finding similar chemical compounds</h3>
<p style="text-align: justify;">For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my <a href="http://datablend.be/?p=962" target="_blank">original MongoDB article</a>. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the <span class="highlight"><em>built-in Lua scripting</em></span> support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the <span class="highlight"><em>fingerprints of interest</em></span> (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single <span class="highlight"><em>sort</em></span>-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we <span class="highlight"><em>sub-select the top x fingerprints</em></span> of interest. What a <span class="highlight"><em>powerful</em></span> and <span class="highlight"><em>elegant</em></span> command!</p>
<p style="text-align: justify;">We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis <span class="highlight"><em>union</em></span>-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis <span class="highlight"><em>intersect</em></span>-command. Only compounds that satisfy the Tanimoto restriction are returned.</p>
<script src="https://gist.github.com/4667047.js"></script>
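<p style="text-align: justify;">The overall search strategy (rank the target's fingerprints by global rarity, union the compounds behind the rarest ones, then intersect per candidate) can be sketched in pure Python. The data is invented and the min-set bound is my own simplified derivation; the real implementation is the Lua script above:</p>

```python
import math
from collections import defaultdict

# Toy in-memory stand-ins for the Redis structures (hypothetical data).
compound_fps = {
    "c1": {"fp:a", "fp:b", "fp:c"},
    "c2": {"fp:b", "fp:c", "fp:d"},
    "c3": {"fp:x", "fp:y", "fp:z"},
}
fp_compounds = defaultdict(set)   # inverted index: fingerprint -> compounds
fp_weight = defaultdict(int)      # fingerprint -> number of occurrences
for cid, fps in compound_fps.items():
    for fp in fps:
        fp_compounds[fp].add(cid)
        fp_weight[fp] += 1

def similar(target_cid, threshold):
    fps = compound_fps[target_cid]
    # Min-set size: a match must contain at least ceil(t * n) of the target's
    # n fingerprints, so a candidate missing all of the (n - ceil(t * n) + 1)
    # rarest fingerprints can never reach the threshold.
    interest = len(fps) - math.ceil(threshold * len(fps)) + 1
    # SORT-like step: rank the target's fingerprints by global rarity.
    rarest = sorted(fps, key=lambda fp: fp_weight[fp])[:interest]
    # SUNION-like step: all compounds containing any fingerprint of interest.
    candidates = set().union(*(fp_compounds[fp] for fp in rarest)) - {target_cid}
    results = {}
    for cid in candidates:
        shared = len(fps & compound_fps[cid])  # SINTER-like step
        t = shared / (len(fps) + len(compound_fps[cid]) - shared)
        if t >= threshold:
            results[cid] = t
    return results

print(similar("c1", 0.5))  # c2 shares fp:b and fp:c -> Tanimoto 0.5
```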
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">With 25,000 stored compounds, Redis requires less than 20 ms to retrieve compounds that are 70% similar to a particular input compound; noticeably snappier than my original MongoDB implementation. In addition, Redis requires less than 1GB of RAM to maintain a live index of the <a href="http://cactus.nci.nih.gov/download/roadmap/">460,000 PubChem compounds</a> that have at least one associated assay. This allows scientists to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework</title>
		<link>http://datablend.be/?p=265</link>
		<comments>http://datablend.be/?p=265#comments</comments>
		<pubDate>Wed, 08 Feb 2012 09:50:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=265</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Part 3, finally, illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the<p><a href="http://datablend.be/?p=265">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of MongoDB’s built-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to improve overall performance. <a href="http://datablend.be/?p=1400" target='_blank'>Part 3</a>, finally, illustrates the use of the new MongoDB <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>, which boosts performance beyond the capabilities of the map-reduce implementation.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span> through <span class="highlight">Tanimoto coefficients</span>. When employing a low target Tanimoto coefficient, however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, <a href="http://datablend.be/?p=968" target='_blank'>part 2</a> of this article describes the use of MongoDB’s built-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.</p>
<p style="text-align: justify;">Recently, MongoDB introduced its new <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>. This framework provides a simpler solution for calculating <span class="highlight">aggregate values</span> than relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison-revisited">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. MongoDB Aggregation Framework</h3>
<p style="text-align: justify;">The MongoDB Aggregation Framework draws on the well-known <span class="highlight">Linux pipeline</span> concept, where the output of one command is <span class="highlight">piped</span> or <span class="highlight">redirected</span> to be used as input of the next command. In case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as <span class="highlight">$match</span>, <span class="highlight">$limit</span> and <span class="highlight">$skip</span> take a document as input and output the same document in case a certain set of criteria is met. Other operators, such as <span class="highlight">$project</span> and <span class="highlight">$unwind</span> take a single document as input and reshape that document or emit multiple documents based upon a certain projection. Finally, the <span class="highlight">$group</span> operator takes multiple documents as input and groups them into a single document by aggregating the relevant values. <span class="highlight">Expressions</span> can be used within some of these operators to calculate new values or execute string operations.</p>
<p style="text-align: justify;">Multiple operators are combined into a single pipeline that is applied upon a list of documents. The pipeline itself is executed as a MongoDB <span class="highlight">Command</span>, resulting in a single MongoDB document that contains an array of all documents that came out at the end of the pipeline. The next paragraph details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.</p>
<p>&nbsp;</p>
<h3>2. Molecular Similarity Pipeline</h3>
<p style="text-align: justify;">When applying a pipeline upon a certain collection, <span class="highlight">all</span> documents contained within this collection are given as input to the first operator. It&#8217;s considered best practice to <span class="highlight">filter</span> this list as quickly as possible to limit the number of total documents that are passed through the pipeline. In our case, this means filtering out all documents that will never be able to <span class="highlight">satisfy the target Tanimoto coefficient</span>. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the <span class="highlight">$match</span> operator looks as follows:</p>
<script src="https://gist.github.com/1767088.js"></script>
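<p style="text-align: justify;">The bounds follow directly from the Tanimoto definition: the coefficient can never exceed min(n_a, n_b) / max(n_a, n_b), so a candidate for a 0.8 target against a 40-fingerprint compound needs between ceil(0.8 * 40) = 32 and floor(40 / 0.8) = 50 fingerprints. A quick check in Python (the helper name is my own):</p>

```python
import math

def fingerprint_count_bounds(n_target, tanimoto):
    # Tanimoto <= min(n_a, n_b) / max(n_a, n_b), so a candidate's fingerprint
    # count must lie within [ceil(t * n), floor(n / t)] of the target's n.
    return math.ceil(tanimoto * n_target), math.floor(n_target / tanimoto)

lo, hi = fingerprint_count_bounds(40, 0.8)
print(lo, hi)  # 32 50
```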
<p>&nbsp;</p>
<p style="text-align: justify;">Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the <span class="highlight">$match operator</span> is able to use the index that we have defined for the <em>fingerprint_count</em> property. For computing the Tanimoto coefficient, we need to calculate the number of <span class="highlight">shared fingerprints</span> between a certain input compound and the compound we are targeting. In order to be able to work at the fingerprint level, we use the <span class="highlight">$unwind operator</span>. $unwind peels off the elements of an array one by one, returning a stream of documents where the specified array is replaced by one of its elements. In our case, we apply the $unwind upon the <em>fingerprints</em> property. Hence, each compound document will result in <span class="highlight"><em>n</em></span> compound documents, where <span class="highlight"><em>n</em></span> is the number of unique fingerprints contained within the compound.</p>
<script src="https://gist.github.com/1767390.js"></script>
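<p style="text-align: justify;">The behaviour of $unwind is easy to emulate: one input document with an <em>n</em>-element array becomes <em>n</em> documents. A small Python illustration (the compound document is fabricated):</p>

```python
def unwind(docs, array_field):
    # Emit one copy of each document per element of the given array field,
    # with the array replaced by that single element (like MongoDB's $unwind).
    for doc in docs:
        for element in doc[array_field]:
            yield {**doc, array_field: element}

compound = {"compound_cid": 1234, "fingerprints": ["fp1", "fp2", "fp3"]}
unwound = list(unwind([compound], "fingerprints"))
print(len(unwound))  # 3 documents, one per fingerprint
```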
<p>&nbsp;</p>
<p style="text-align: justify;">In order to calculate the number of shared fingerprints, we will start off by filtering out all documents which do not have a fingerprint that is in the list of fingerprints of the target compound. For doing so, we again apply the <span class="highlight">$match operator</span>, this time filtering on the <em>fingerprints</em> property, where only documents that contain a fingerprint that is in the list of target fingerprints are maintained.</p>
<script src="https://gist.github.com/1767484.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we only match fingerprints that are in the list of target fingerprints, the output can be used to <span class="highlight">count the total number of shared fingerprints</span>. For this, we apply the <span class="highlight">$group operator</span> on the <em>compound_cid</em>, through which we create a new type of document, containing <span class="highlight">the number of matching fingerprints</span> (by summing the number of occurrences), the <span class="highlight">total number of fingerprints of the input compound</span> and the <span class="highlight">smiles representation</span>.</p>
<script src="https://gist.github.com/1767565.js"></script>
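<p style="text-align: justify;">Conceptually, this grouping step amounts to a dictionary keyed on <em>compound_cid</em>. A Python sketch with invented unwound documents (field names match the ones used above, but the values are made up):</p>

```python
def group_by_compound(docs):
    # Re-aggregate the unwound documents per compound_cid, counting how many
    # matching fingerprint documents each compound contributed ($group-style).
    groups = {}
    for doc in docs:
        group = groups.setdefault(doc["compound_cid"], {
            "_id": doc["compound_cid"],
            "fingerprint_matches": 0,
            "fingerprint_count": doc["fingerprint_count"],
            "smiles": doc["smiles"],
        })
        group["fingerprint_matches"] += 1
    return list(groups.values())

# Invented unwound documents: compound 1 matched two target fingerprints.
docs = [
    {"compound_cid": 1, "fingerprint_count": 4, "smiles": "CCO", "fingerprints": "fp1"},
    {"compound_cid": 1, "fingerprint_count": 4, "smiles": "CCO", "fingerprints": "fp2"},
    {"compound_cid": 2, "fingerprint_count": 5, "smiles": "CCN", "fingerprints": "fp1"},
]
grouped = group_by_compound(docs)
print(grouped[0]["fingerprint_matches"])  # compound 1 contributed 2 matches
```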
<p>&nbsp;</p>
<p style="text-align: justify;">We now have all parameters in place to calculate the Tanimoto coefficient. For this we will use the <span class="highlight">$project operator</span> which, in addition to copying the compound id and smiles property, also adds a new computed property named <em>tanimoto</em>.</p>
<script src="https://gist.github.com/1767728.js"></script>
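<p style="text-align: justify;">The projected value follows the usual set formula: shared / (|target| + |candidate| - shared). A Python sketch of this projection step (the grouped document below is invented):</p>

```python
def project_tanimoto(grouped, target_count):
    # Add a computed 'tanimoto' property, as the $project stage does:
    # shared / (target fingerprint count + candidate count - shared).
    out = []
    for doc in grouped:
        shared = doc["fingerprint_matches"]
        out.append({
            "compound_cid": doc["_id"],
            "smiles": doc["smiles"],
            "tanimoto": shared / (target_count + doc["fingerprint_count"] - shared),
        })
    return out

grouped = [{"_id": 1, "fingerprint_matches": 36, "fingerprint_count": 41, "smiles": "CCO"}]
print(project_tanimoto(grouped, 40))  # tanimoto = 36 / (40 + 41 - 36) = 0.8
```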
<p>&nbsp;</p>
<p style="text-align: justify;">As we are only interested in compounds that have a target Tanimoto coefficient of 0.8, we apply an additional <span class="highlight">$match operator</span> to filter out all the ones that do not reach this coefficient.</p>
<script src="https://gist.github.com/1767791.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">full pipeline command</span> can be found below. </p>
<script src="https://gist.github.com/1767950.js"></script>
<p>&nbsp;</p>
<p>The <span class="highlight">output</span> of this pipeline contains a list of compounds which have a Tanimoto of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/pipeline.jpg"><img width="900" src="http://datablend.be/wp-content/uploads/pipeline.jpg" alt="pipeline" /></a></p>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce type of algorithms in a more <span class="highlight">concise</span> fashion. The pipeline concept beneath it offers an <span class="highlight">intuitive</span> way of processing data. It is no surprise that this pipeline paradigm is adopted by various NoSQL approaches, including <a href="https://github.com/tinkerpop/gremlin/wiki" target='_blank'>Tinkerpop&#8217;s Gremlin Framework</a> and <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" target='_blank'>Neo4J&#8217;s Cypher</a> implementation.</p>
<p style="text-align: justify;"><span class="highlight">Performance-wise</span>, the pipeline solution is a major improvement upon the map-reduce implementation. The employed operators are <span class="highlight">natively supported by the MongoDB platform</span>, which results in a huge performance improvement with respect to interpreted Javascript. As the Aggregation Framework is also able to work in a <span class="highlight">sharded environment</span>, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. <span class="highlight">Great work from the MongoDB team!</span></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=265</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 2)</title>
		<link>http://datablend.be/?p=256</link>
		<comments>http://datablend.be/?p=256#comments</comments>
		<pubDate>Thu, 11 Aug 2011 09:44:06 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=256</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In part 1 of this article, I described the use of MongoDB to solve a specific<p><a href="http://datablend.be/?p=256">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s built-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span>. Depending on the target <span class="highlight">Tanimoto coefficient</span>, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query <span class="highlight">increases significantly</span> when the target Tanimoto is <span class="highlight">lowered</span>. The example code on the <a href="https://github.com/datablend/mongo-compound-comparison" target='_blank'>GitHub repository</a>, for instance, imports and indexes <span class="highlight">~25000</span> chemical compounds. When a target Tanimoto of <span class="highlight">0.8</span> is employed, the query returns <span class="highlight">~700</span> compounds. When the target Tanimoto is lowered to <span class="highlight">0.6</span>, the number of returned compounds increases to <span class="highlight">~7000</span>. Using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one is able to observe that the internal MongoDB query execution time increases only slightly; the real cost lies in transferring the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s built-in map-reduce functionality</a>!</p>
<p>&nbsp;</p>
<h3>1. MongoDB molecular similarity map-reduce query</h3>
<p style="text-align: justify;"><a href="http://nl.wikipedia.org/wiki/MapReduce" target='_blank'>Map-reduce</a> is a conceptual framework, introduced by Google, to enable the <span class="highlight">processing of huge datasets</span> using a <span class="highlight">large number of processing nodes</span>. The general idea is that a <span class="highlight">larger problem</span> is <span class="highlight">divided</span> into a set of <span class="highlight">smaller subproblems</span> that can be answered (i.e. solved) by an individual processing node (the <span class="highlight">map-step</span>). Afterwards, the <span class="highlight">individual solutions</span> are <span class="highlight">combined</span> again to produce the <span class="highlight">final answer</span> to the <span class="highlight">larger problem</span> (the <span class="highlight">reduce-step</span>). By making sure that the individual <span class="highlight">map and reduce steps</span> can be computed <span class="highlight">independently</span> of each other, this divide-and-conquer technique can be easily <span class="highlight">parallelized</span> on a cluster of processing nodes. Let&#8217;s start by refactoring our solution to use MongoDB&#8217;s map-reduce functionality.</p>
<script src="https://gist.github.com/1139665.js"></script>
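<p style="text-align: justify;">The divide-and-conquer idea itself fits in a few lines of Python. This is a generic skeleton over toy data, not the compound algorithm from the gist above:</p>

```python
from collections import defaultdict

# A generic map-reduce skeleton: every input is mapped independently
# (the parallelizable part), then the emitted values are reduced per key.
def map_reduce(inputs, mapper, reducer):
    emitted = defaultdict(list)
    for item in inputs:                 # map-step: one answer set per input
        for key, value in mapper(item):
            emitted[key].append(value)
    # reduce-step: combine the individual answers per key
    return {key: reducer(values) for key, values in emitted.items()}

# Toy example (invented data): count in how many compounds each fingerprint occurs.
compounds = [{"cid": 1, "fps": ["a", "b"]}, {"cid": 2, "fps": ["b", "c"]}]
counts = map_reduce(compounds,
                    mapper=lambda c: [(fp, 1) for fp in c["fps"]],
                    reducer=sum)
print(counts)  # {'a': 1, 'b': 2, 'c': 1}
```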
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">map-step</span> of a MongoDB map-reduce implementation takes a MongoDB <span class="highlight">document as input</span> and <span class="highlight">emits</span> one (or more) answers (which, in essence, are again MongoDB documents). Executing our <span class="highlight">map-step</span> on all compound documents in the <span class="highlight">compounds</span> collection would not be very efficient. Instead, we would like to limit the execution of our map-step to those documents that can theoretically match the target Tanimoto. Luckily, we already defined this query, namely the <span class="highlight">compound selection query</span> that was described in part one of this article! By employing this query, only compounds that match it are pushed through the map-step. A MongoDB map (and reduce) function is expressed through <span class="highlight">JavaScript</span>. In our case, we calculate the number of unique fingerprint patterns that are <span class="highlight">shared</span> by both the target and input compound. In case the <span class="highlight">minimum number of fingerprint patterns</span> is reached, the map-step <span class="highlight">emits</span> a document containing the PubChem identifier (as id) and some essential statistics (as values). A <span class="highlight">reduce-step</span> is employed to <span class="highlight">aggregate</span> answers into the final result. In our case however, we are interested in the individual results for each compound (document). Hence, no reduce function is applied. When this map-reduce function is executed, only <span class="highlight">27</span> compounds are returned (which could potentially match), instead of <span class="highlight">7000</span> compounds when employing the previous Java query!</p>
<p style="text-align: justify;">One would expect the <span class="highlight">execution time</span> of the <span class="highlight">map-reduce query</span> to be considerably faster compared to the <span class="highlight">Java solution</span>. Unfortunately, this is <span class="highlight">not</span> the case. First of all, interpreted Javascript is many times slower than Java. Secondly, although map-reduce steps could be <span class="highlight">parallelized</span> when multiple CPU cores are available, the MongoDB map-reduce function always runs <span class="highlight">single-threaded</span>. To <span class="highlight">circumvent this limitation</span>, one can use <a href="http://www.mongodb.org/display/DOCS/Sharding+Introduction" target='_blank'>MongoDB sharding</a>. Simply explained, instead of putting all data on a <span class="highlight">single</span> MongoDB node, <span class="highlight">multiple</span> MongoDB nodes are employed, each responsible for storing <span class="highlight">a part of the total data set</span>. When executing our map-reduce function, each node will execute the map-reduce steps on its part of the data in <span class="highlight">parallel</span>. Hence, when using sharding on a cluster of <span class="highlight">4</span> MongoDB nodes, the map-reduce query executes almost <span class="highlight">4</span> times faster, already catching up with the performance of the Java solution. With the exception of the MongoDB <span class="highlight">sharding configuration</span>, no changes are required to the map-reduce function itself. Hence, <span class="highlight">scaling horizontally</span> with MongoDB is a breeze &#8230;</p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">MongoDB&#8217;s map-reduce performance is a bit of a disappointment. MongoDB currently advises to only use it for <span class="highlight">near real-time computations</span>. <span class="highlight">Version 2.0</span> of MongoDB should drastically <span class="highlight">improve</span> map-reduce performance, as the JavaScript engine will be replaced by other execution platforms. Nevertheless, map-reduce performance can currently be boosted by splitting the load on multiple MongoDB shards.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 1)</title>
		<link>http://datablend.be/?p=254</link>
		<comments>http://datablend.be/?p=254#comments</comments>
		<pubDate>Sun, 07 Aug 2011 09:42:36 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=254</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In one of my previous blog posts, I debated the superficial idea that you should own<p><a href="http://datablend.be/?p=254">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;">Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In one of my <a href="http://datablend.be/?p=437"  target='_blank' >previous</a> blog posts, I debated the superficial idea that you should own billions of data records before you are <em>eligible</em> to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically <a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a>, to solve a specific <a  target='_blank' href="http://en.wikipedia.org/wiki/Cheminformatics">Chemoinformatics</a> problem in a truly elegant and efficient way. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Molecular similarity theory</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that <span class="highlight">similar compounds</span> generally exhibit <span class="highlight">similar biological</span> activities.) Unfortunately, finding substructures in chemical compounds is a <span class="highlight">NP-complete</span> problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of <span class="highlight">structural keys</span> and <span class="highlight">fingerprints</span>.</p>
<p style="text-align: justify;">In case of <span class="highlight">structural keys</span>, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a <em>bitstring</em>. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)<br />
When employing <span class="highlight">fingerprints</span>, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a <span class="highlight">hash</span>. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential <span class="highlight">false positives</span>.</p>
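<p style="text-align: justify;">The information loss behind these false positives can be illustrated with a few lines of Python. This is a hypothetical sketch: the pattern strings and the CRC32 hash are my own choices, not what jCompoundMapper actually uses. Folding many distinct patterns onto a fixed number of bit positions necessarily makes some of them indistinguishable:</p>

```python
import zlib

def hashed_fingerprint(patterns, num_bits=64):
    """Fold each substructure pattern onto one of num_bits positions.

    Distinct patterns can land on the same bit, which is exactly the
    source of false positives in hashed fingerprints."""
    return {zlib.crc32(p.encode()) % num_bits for p in patterns}

# 200 hypothetical linear substructure patterns...
patterns = ["pattern-%d" % i for i in range(200)]
bits = hashed_fingerprint(patterns)

# ...collapse onto at most 64 bit positions, so patterns must collide.
print(len(patterns), len(bits))
```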
<p style="text-align: justify;">In this article, we will demonstrate the use of <span class="highlight">non-hashed fingerprints</span> to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:</p>
<ol>
<li>We eliminate the chance of false positives</li>
<li>The raw fingerprints can be used in other types of structural compound mining</li>
</ol>
<p>&nbsp;</p>
<h3>2. Molecular similarity practice</h3>
<p style="text-align: justify;">Let&#8217;s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don&#8217;t need to implement these algorithms ourselves. The excellent open-source <a target='_blank' href="http://jcompoundmapper.sourceforge.net">jCompoundMapper</a> library provides us with all the required functionality. It uses <a target='_blank' href="http://en.wikipedia.org/wiki/Chemical_table_file">MDL SD</a> formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the <span class="highlight">2DMol</span> fingerprinter to encode the first molecule in the <em>compounds.sdf</em> file.</p>
<script src="https://gist.github.com/1129153.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first molecule of the input file has the following chemical structure: <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a>. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:</p>
<script src="https://gist.github.com/1129183.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Now the question that remains is how to measure the similarity between the fingerprints of compounds <span class="highlight">A</span> and <span class="highlight">B</span>. Several methods are again available, but in our case we will use the so-called <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto association coefficient</a>, which is defined as:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large T = \frac{N_{AB}}{N_{A} + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;"><span class="highlight">N<sub>A</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">A</span>, while <span class="highlight">N<sub>B</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">B</span>. <span class="highlight">N<sub>AB</sub></span> specifies the number of unique fingerprint patterns found in <span class="highlight">both</span> compound <span class="highlight">A</span> and <span class="highlight">B</span>. As can be observed from the equation, two identical compounds would have a <span class="highlight">Tanimoto coefficient</span> of <span class="highlight">1.0</span>.</p>
<p>&nbsp;</p>
<h3>3. MongoDB datamodel</h3>
<p style="text-align: justify;"><a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a> is a so-called <a href="http://en.wikipedia.org/wiki/Document-oriented_database" target='_blank'>document-oriented database</a>. When using document-oriented databases, you basically group together all related data in a <span class="highlight">single document</span> instead of normalizing your data in various <span class="highlight">RDBMS tables</span> and using <span class="highlight">joins</span> to retrieve the required information later on. In case of MongoDB, documents are stored using <a href="http://bsonspec.org" target='_blank' >BSON</a> (binary JSON) and related documents are stored in <span class="highlight">collections</span>. Let&#8217;s start by describing the <span class="highlight">compounds</span> collection that stores a separate document for each compound. The JSON-document of our <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a> compound looks as follows:</p>
<script src="https://gist.github.com/1129242.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For each compound, we store its unique  easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and <span class="highlight">inserting</span> them in the <span class="highlight">compounds collection</span>.</p>
<script src="https://gist.github.com/1129346.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To complete our MongoDB data model, we will add the <span class="highlight">fingerprintscount</span> collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the <span class="highlight">fingerprintscount</span> collection.</p>
<script src="https://gist.github.com/1129361.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To update this collection, we make use of <span class="highlight">MongoDB&#8217;s increment</span> operator in combination with an <span class="highlight">upsert</span>. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!</p>
<script src="https://gist.github.com/1129357.js"></script>
<p>&nbsp;</p>
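<p style="text-align: justify;">Independent of the driver code above, the increment-with-upsert behaviour can be simulated in a few lines of Python, with a plain in-memory dict standing in for the <span class="highlight">fingerprintscount</span> collection (the helper name is mine, this is not MongoDB API code):</p>

```python
def upsert_increment(collection, pattern):
    """Mimic MongoDB's {"$inc": {"count": 1}} with upsert=True:
    create the document on first sight, increment it afterwards."""
    doc = collection.setdefault(pattern, {"fingerprint": pattern, "count": 0})
    doc["count"] += 1

fingerprintscount = {}
for pattern in ["C-C-N", "C-O", "C-C-N"]:
    upsert_increment(fingerprintscount, pattern)

print(fingerprintscount["C-C-N"]["count"])  # 2: incremented on the repeat
print(fingerprintscount["C-O"]["count"])    # 1: created by the upsert
```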
<p style="text-align: justify;">For enhancing query performance, we need to define <span class="highlight">indexes</span> for the appropriate document properties. By doing so, the created <span class="highlight">B-Tree</span> indexes are used by MongoDB &#8216;s <span class="highlight">query optimizer</span> to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.</p>
<script src="https://gist.github.com/1129375.js"></script>
<p>&nbsp;</p>
<h3>4. MongoDB molecular similarity query</h3>
<p style="text-align: justify;">It&#8217;s time to bring it all together. Imagine we want to find all compounds that have a <span class="highlight">Tanimoto coefficient</span> of 0.8 for a particular input compound. As MongoDB&#8217;s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let&#8217;s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (<span class="highlight">A</span>) has 40 unique fingerprint patterns. Let&#8217;s fill in some of the parameters in the Tanimoto equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{AB}}{40 + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;">From this equation, we can deduct the <span class="highlight">minimum</span> and <span class="highlight">maximum</span> number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{B}}{40 + N_{B} - N_{B}}"> <img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{40}{40 + N_{B} - 40}"></center></p>
<p style="text-align: justify;">Hence, the compound we are comparing with should only be considered if it has between <span class="highlight">32</span> and <span class="highlight">50</span> unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share <span class="highlight">at least 1 of 9</span> unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be <span class="highlight">31</span>, and</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large \frac{31}{40 + 31 - 31} = 0.77"></center></p>
<p style="text-align: justify;">were we replaced <span class="highlight">N<sub>B</sub></span> by <span class="highlight">31</span> to maximize the resulting Tanimoto coefficient. This <span class="highlight">9</span>-number can be obtained by solving the following equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large N_{query} = (N_{A} - N_{minimum-number-of-fingerprint-patterns}) + 1"></center></p>
<p style="text-align: justify;">Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the <span class="highlight">occurrence in the total compound population is the lowest</span>, as this would narrow our search space even further. (Hence the need for the <span class="highlight">fingerprintscount</span> collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB <span class="highlight">QueryBuilder</span> we create the query object that, using the <span class="highlight">in</span>-operator, specifies that a document from the <span class="highlight">fingerprintscount</span> collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional <span class="highlight">sort</span>-operator, we retrieve the individual fingerprint patterns in ascending <span class="highlight">count</span>-order.</p>
<script src="https://gist.github.com/1129445.js"></script>
<p>&nbsp;</p>
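<p style="text-align: justify;">The effect of this query is easy to mimic in Python: sort the input compound&#8217;s patterns by their population counts and keep the rarest ones (toy counts and a hypothetical helper name, not the article&#8217;s Java code):</p>

```python
def rarest_patterns(query_patterns, counts, k):
    """Return the k query patterns with the lowest population counts,
    i.e. the patterns that narrow the search space the most."""
    return sorted(query_patterns, key=lambda p: counts[p])[:k]

counts = {"C-C": 9500, "C-O": 8700, "C-N": 4100, "C-S": 12, "C-Br": 3}
print(rarest_patterns(["C-C", "C-O", "C-N", "C-S", "C-Br"], counts, 2))
# ['C-Br', 'C-S']
```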
<p style="text-align: justify;">It&#8217;s time for the final step, namely retrieving all compound documents that <span class="highlight">have the potential</span> to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.</p>
<script src="https://gist.github.com/1129469.js"></script>
<p>&nbsp;</p>
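<p style="text-align: justify;">Putting the whole pipeline together outside MongoDB, the query logic amounts to the following Python sketch. An in-memory dict stands in for the compounds collection; the names and toy data are my own, and the article&#8217;s actual implementation is the Java code in the gists above:</p>

```python
import math

def similar_compounds(query_fp, compounds, counts, t=0.8):
    """Prefilter on the pattern-count range and on sharing at least
    one rare pattern, then verify the exact Tanimoto coefficient."""
    n_a = len(query_fp)
    n_min, n_max = math.ceil(t * n_a), math.floor(n_a / t)
    n_query = n_a - n_min + 1
    rare = set(sorted(query_fp, key=lambda p: counts.get(p, 0))[:n_query])

    results = []
    for cid, fp in compounds.items():
        if not (n_min <= len(fp) <= n_max):  # count-range filter
            continue
        if not (fp & rare):                  # must share a rare pattern
            continue
        shared = len(query_fp & fp)
        coefficient = shared / (n_a + len(fp) - shared)
        if coefficient >= t:                 # exact verification step
            results.append((cid, coefficient))
    return results

query_fp = {"a", "b", "c", "d", "e"}
counts = {"a": 100, "b": 90, "c": 80, "d": 5, "e": 2}
compounds = {
    "c1": {"a", "b", "c", "d"},            # T = 4/5 = 0.8 -> kept
    "c2": {"a", "b", "c", "d", "e", "f"},  # T = 5/6 -> kept
    "c3": {"a", "b", "c"},                 # too few patterns -> filtered
    "c4": {"x", "y", "z", "w"},            # shares no rare pattern -> filtered
}
print(similar_compounds(query_fp, compounds, counts))
```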
<h3>5. Conclusion</h3>
<p style="text-align: justify;">The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB&#8217;s build-in map-reduce functionality</a> to keep the Tanimoto calculation local to the compound data and hence improve overall performance.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=254</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
