<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; NoSQL</title>
	<atom:link href="https://datablend.be/?cat=20&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>The power of graphs to analyse biological data</title>
		<link>https://datablend.be/?p=344</link>
		<comments>https://datablend.be/?p=344#comments</comments>
		<pubDate>Mon, 02 Dec 2013 07:04:34 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphconnect]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=344</guid>
		<description><![CDATA[Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets. The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.]]></description>
				<content:encoded><![CDATA[<p>Watch Davy Suvee present at <a href="http://www.graphconnect.com/london/" title="GraphConnect London 2013">GraphConnect London 2013</a> on the power of graph databases to analyse biological datasets.</p>
<p><iframe src="//player.vimeo.com/video/80463932" width="500" height="281" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
<p><a href="http://vimeo.com/80463932">The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013</a> from <a href="http://vimeo.com/neo4j">Neo Technology</a> on <a href="https://vimeo.com">Vimeo</a>.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=344</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting triangles smarter (or how to beat Big Data vendors at their own game)</title>
		<link>https://datablend.be/?p=282</link>
		<comments>https://datablend.be/?p=282#comments</comments>
		<pubDate>Mon, 11 Feb 2013 10:02:00 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[exadata]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>
		<category><![CDATA[vertica]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=282</guid>
		<description><![CDATA[A few months ago, I discovered Vertica&#8217;s &#8220;Counting Triangles&#8221;-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends<p><a href="https://datablend.be/?p=282">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">A few months ago, I discovered Vertica&#8217;s <a href="http://www.vertica.com/2011/09/21/counting-triangles">&#8220;Counting Triangles&#8221;-article</a> through <a href="http://getprismatic.com/news/home">Prismatic</a>. The blog post describes a number of benchmarks on counting triangles in large networks. A <span class="highlight"><em>triangle</em></span> is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a <span class="highlight"><em>friendship triangle</em></span>. Counting all triangles within a large network is a rather <span class="highlight"><em>compute-intensive task</em></span>. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we have identified a triangle.</p>
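<p style="text-align: justify;">As a rough illustration of this naive approach (a sketch, not the optimised implementation developed later in this article), assume the network is held as an in-memory adjacency map from vertex id to the set of adjacent vertex ids:</p>
<pre><code>import java.util.*;

public class NaiveTriangles {

    // graph: vertex id -> set of adjacent vertex ids; relationships are stored
    // bi-directionally, so every vertex also appears as a key
    static long countTriangles(Map<Integer, Set<Integer>> graph) {
        long count = 0;
        for (Map.Entry<Integer, Set<Integer>> entry : graph.entrySet()) {
            int origin = entry.getKey();
            for (int adjacent : entry.getValue()) {
                for (int next : graph.get(adjacent)) {
                    // a vertex adjacent to our neighbour's neighbour that is
                    // also adjacent to the origin vertex closes a triangle
                    if (next != origin && graph.get(next).contains(origin)) {
                        count++;
                    }
                }
            }
        }
        return count / 6; // each triangle is discovered six times (twice per corner)
    }
}
</code></pre>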
<p style="text-align: justify;">The Vertica article illustrates how to execute an optimised implementation of the above algorithm through <span class="highlight"><em>Hadoop</em></span> and their own <span class="highlight"><em>Massive Parallel Processing (MPP) Database</em></span> product (both being run on a 4-node cluster). The dataset involves the <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, containing around <em>86 million relationships</em>, resulting in around <em>285 million identified triangles</em>.  As can be expected, the Vertica solution shines in all respects (counting all triangles in <span class="highlight"><em>97 seconds</em></span>), beating the Hadoop solution by a factor of <em>40</em>. A few weeks later, the Oracle guys published a similar <a href="http://structureddata.org/2011/10/17/counting-triangles-faster">blog post</a>, using their <span class="highlight"><em>ExaData</em></span> platform, beating Vertica&#8217;s results by a factor of <em>7</em>, clocking in at <span class="highlight"><em>14 seconds</em></span>.</p>
<p style="text-align: justify;">Although Vertica and Oracle&#8217;s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles  through a <span class="highlight"><em>smarter algorithm</em></span> that is able to deliver similar performance on <span class="highlight"><em>commodity hardware</em></span> (i.e. my MacBook Pro Retina).</p>
<p>&nbsp;</p>
<h3>1. Doing it the smart way</h3>
<p style="text-align: justify;">The <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a <span class="highlight"><em>source</em></span> and <span class="highlight"><em>target vertex</em></span> (where each vertex is identified by an unique id). Relationships are assumed to be <span class="highlight"><em>bi-directional</em></span>: if person 1 knows person 2, person 2 also knows person 1.</p>
<script src="https://gist.github.com/4737460.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s start by creating a <span class="highlight"><em>row-like</em></span> data structure for storing these relationships. The <em>key</em> of each row is the <em>id of the source vertex</em>. The row <em>values</em> are the id&#8217;s of all <em>target vertices</em> associated with the particular source vertex.</p>
<script src="https://gist.github.com/4737525.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let&#8217;s improve our data structure by <span class="highlight"><em>indexing</em></span> each relationship through its <span class="highlight"><em>lowest key</em></span>. So, even though the LiveJournal file declares the relationship as being &#8220;<em>2 0</em>&#8220;, we persist the relationship by assigning the <em>2</em>-value to the <em>0</em>-row. (Order doesn&#8217;t matter as relationships are bi-directional anyway.)</p>
<script src="https://gist.github.com/4737638.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a <span class="highlight"><em>triangle</em></span>, its two <span class="highlight"><em>adjacent vertices</em></span> should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we  need to check whether we can <span class="highlight"><em>find edges amongst the vertices</em></span> contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original <em>source</em>-values. By doing so, we get rid of one expensive <em>for</em>-loop. Nevertheless, the amount of calculations that need to be executed is still close to <span class="highlight"><em>2 billion</em></span>! </p>
<script src="https://gist.github.com/4737835.js"></script>
<p>&nbsp;</p>
<h3>2. Persisting the relationships</h3>
<p style="text-align: justify;">The data structure as described above is persisted in a <span class="highlight"><em>custom datastore</em></span> that we developed at Datablend for powering the <a href="http://www.similr.li">similr</a>-engine (a chemical structure search engine). The datastore is <span class="highlight"><em>fully persistent</em></span> and optimised for quickly performing <span class="highlight"><em>set-based operations</em></span> (intersections, differences, unions, &#8230; ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around <em>20 seconds</em> on my MacBook Pro. An additional <em>4 seconds</em> is required for persisting the entire data structure to the datastore itself. So around <span class="highlight"><em>25 seconds</em></span> in total for effectively storing all 86 million relationships. Vertica nor Oracle mention the time it takes to persist the Livejournal dataset within their respective databases. However, I assume it also requires them a few seconds to execute this load-operation.</p>
<p style="text-align: justify;">What about <span class="highlight"><em>disk usage</em></span>? The custom Datablend datastore takes the <span class="highlight"><em>second place</em></span>, requiring only 37 Mb more compared to Oracle’s Hybrid Columnar Compression version.</p>
<script src="https://gist.github.com/4739843.js"></script>
<p>&nbsp;</p>
<h3>3. Calculating the triangles</h3>
<p style="text-align: justify;">The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 265 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes the first place, clocking in at <span class="highlight"><em>9 seconds</em></span>! The calculation runs fully pararellized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!<br />
<script src="https://gist.github.com/4740331.js"></script>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">Datablend&#8217;s custom datastore is a very specific solution that targets a <span class="highlight"><em>particular range of Big Data computations</em></span>. It is in no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to <a href="http://www.similr.li">similr</a> and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=282</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Similr: blazingly fast chemical similarity searches</title>
		<link>https://datablend.be/?p=280</link>
		<comments>https://datablend.be/?p=280#comments</comments>
		<pubDate>Mon, 04 Feb 2013 10:00:38 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=280</guid>
		<description><![CDATA[Today, Datablend announces Similr to be available for beta sign-up. Similr allows scientists (both academic and enterprise) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms, which, combined, allow matching compounds to be identified in millisecond time. Similr&#8217;s functionalities are available through a flexible and<p><a href="https://datablend.be/?p=280">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Today, Datablend announces <a href="http://similr.li">Similr</a> to be available for <a href="http://similr.li">beta sign-up</a>. Similr allows scientists (both academic and enterprise) to quickly search for compounds that exhibit a particular <span class="highlight"><em>chemical structure</em></span>. It employs a wide range of <span class="highlight"><em>fingerprinting algorithms</em></span>, which, combined, allow matching compounds to be identified in <em>millisecond</em> time. Similr&#8217;s functionalities are available through a flexible and expressive REST API and allow scanning of more than 30 million compounds that have been made publicly available through <a href="http://pubchem.ncbi.nlm.nih.gov">PubChem</a>.</p>
<p style="text-align: justify;">Similr will provide <span class="highlight"><em>unlimited</em></span> API-access to academics. Free commercial access,  limited to a 1000 API-calls a month, will be available. Higher up, customers can choose between a <span class="highlight"><em>pay-as-you-go subscription</em></span> or opt-in for a dedicated installation that allows for the import of (private) compounds.</p>
<p style="text-align: justify;">Similr is being developed by Datablend, a Big Data consultancy company. Datablend&#8217;s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a <span class="highlight"><em>highly optimised chemical similarity search algorithm</em></span> that is able to scan millions of compounds at blazing speeds. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to Similr and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=280</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis and Lua: a NoSQL power-horse</title>
		<link>https://datablend.be/?p=278</link>
		<comments>https://datablend.be/?p=278#comments</comments>
		<pubDate>Tue, 29 Jan 2013 09:59:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=278</guid>
		<description><![CDATA[Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes,<p><a href="https://datablend.be/?p=278">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. <a href="http://redis.io" target="_blank">Redis</a> is frequently referred to as the <span class="highlight"><em>Swiss Army Knife</em></span> of NoSQL databases and rightfully deserves that title. At its core, it is an <span class="highlight"><em>in-memory key-value</em></span> datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various &#8216;Big Data&#8217;-related problems. To illustrate this point, I reimplemented my MongoDB-based <a href="http://datablend.be/?p=962" target="_blank">molecular similarity search</a> through Redis and its integrated Lua support. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/redis-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Redis &#8216;fingerprint&#8217; data model</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called <span class="highlight"><em>fingerprints</em></span> (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto coefficient</a>. This computation involves the calculation of <span class="highlight"><em>intersections</em></span> between sets of fingerprints, an operation that is natively supported by Redis.</p>
<p style="text-align: justify;">Our Redis-based data model for storing fingerprints requires three different data-structures:</p>
<ol>
<li>For each <span class="highlight"><em>compound</em></span>, identified by an <span class="highlight"><em>unique key</em></span>, we store its set of fingerprints (where each fingerprint is again identified by an unique key).</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, identified by an <span class="highlight"><em>unique key</em></span>, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, we store its number of occurrences through a dedicated <span class="highlight"><em>weight</em>-key</span>.</li>
</ol>
<p>&nbsp;</p>
<p style="text-align: justify;">Fingerprints are calculated by using the, 33 and 35) are sufficient to create both the <span class="highlight"><em>inverted indexes</em></span> (compound->fingerprints and fingerprint->compounds) and incrementing the accompanying <span class="highlight"><em>counters</em></span>.</p>
<script src="https://gist.github.com/4666154.js"></script>
<p>&nbsp;</p>
<h3>2. Finding similar chemical compounds</h3>
<p style="text-align: justify;">For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my <a href="http://datablend.be/?p=962" target="_blank">original MongoDB article</a>. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the <span class="highlight"><em>build-in Lua scripting</em></span> support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we  calculate the <span class="highlight"><em>fingerprints of interest</em></span> (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single <span class="highlight"><em>sort</em></span>-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we <span class="highlight"><em>sub-select the top x fingerprints</em></span> of interest. What a <span class="highlight"><em>powerful</em></span> and <span class="highlight"><em>elegant</em></span> command!</p>
<p style="text-align: justify;">We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis <span class="highlight"><em>union</em></span>-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis <span class="highlight"><em>intersect</em></span>-command. Only compounds that satisfy the Tanimoto restriction are returned.</p>
<script src="https://gist.github.com/4667047.js"></script>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">With 25.000 stored compounds, Redis requires less then 20ms to retrieve compounds that are 70% similar to a particular input compound. Snappier compared to my original MongoDB implementation. In addition, Redis requires less then 1GB of RAM to maintain a live index of the <a href="http://cactus.nci.nih.gov/download/roadmap/">460.000 PubChem compounds</a> that have at least one associated assay. This allows scientist to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Circle through your Google Analytics data with Neo4J and Circos</title>
		<link>https://datablend.be/?p=267</link>
		<comments>https://datablend.be/?p=267#comments</comments>
		<pubDate>Sun, 11 Mar 2012 09:51:19 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[circos]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=267</guid>
		<description><![CDATA[Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights, is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including<p><a href="https://datablend.be/?p=267">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Storing <span class="highlight">massive amounts of data</span> in a NoSQL data store is just one side of the <span class="highlight">Big Data</span> equation. Being able to visualize your data in such a way that you can easily gain <span class="highlight">deeper insights</span> is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including <a href="http://circos.ca/" target='_blank'>Circos</a>. Circos is an amazing software package that visualizes your data through a <span class="highlight">circular layout</span>. Although it was originally designed for displaying <span class="highlight">genomic data</span>, it allows you to create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table&#8217;s <span class="highlight">columns</span> and <span class="highlight">rows</span> are represented by <span class="highlight">segments</span> around the circle. Individual <span class="highlight">cells</span> are shown as <span class="highlight">ribbons</span>, which <span class="highlight">connect</span> the corresponding row and column segments. The ribbons themselves are <span class="highlight">proportional</span> in width to the value in the cell.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos-visualize-table.png">
<p align="center"><img width="400" src="http://datablend.be/wp-content/uploads/circos-visualize-table.png" alt="circos" /></p>
<p></a>  </p>
<p>&nbsp;</p>
<p style="text-align: justify;">When visualizing a <span class="highlight">directed graph</span>, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the <span class="highlight">key data points</span> within your table. In my case, I want to better understand the <span class="highlight">flow of visitors</span> to and within the datablend site and blog; where do visitors come from (direct, referral, search, &#8230;) and how do they navigate between pages. The rest of this article details how to 1) retrieve the <span class="highlight">raw visit information</span> through the Google Analytics API, 2) <span class="highlight">persist</span> this information <span class="highlight">as a graph</span> in Neo4J and 3) query and <span class="highlight">preprocess</span> this data for visualization through Circos. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/neo4j-google-analytics">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Retrieving your Google Analytics data</h3>
<p style="text-align: justify;">Let&#8217;s start by retrieving the <span class="highlight">raw Google Analytics data</span>. The Google Analytics data API provides access to all <span class="highlight">dimensions</span> and <span class="highlight">metrics</span> that can be queried through the web application. In my case, I&#8217;m interested in retrieving the <span class="highlight"><em>previous page path</em></span> property for each page view. If a visitor enters through a page outside of the datablend website, the <em>previous page path</em> is marked as <span class="highlight"><em>(entrance)</em></span>. Otherwise, it contains the <span class="highlight">internal path</span>. We will use Google&#8217;s Java Data API to connect and retrieve this information. We are particularly interested in the <span class="highlight"><em>pagePath</em></span>, <span class="highlight"><em>pageTitle</em></span>, <span class="highlight"><em>previousPagePath</em></span> and <span class="highlight"><em>medium</em></span> dimensions, while our metric of choice is the number of <span class="highlight"><em>pageViews</em></span>. After setting the date range, the <span class="highlight">feed of entries</span> that satisfy this criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the datablend website, we store the specific <span class="highlight">medium</span> (direct, referral, search, &#8230;) as previous path.</p>
<script src="https://gist.github.com/2011682.js"></script>
<p>&nbsp;</p>
<h3>2. Storing navigational data as a directed graph in Neo4J</h3>
<p style="text-align: justify;">The set of site navigations can easily be stored as a directed graph in the the <span class="highlight">degree</span> of my nodes is correct if I would perform other types of calculations. For each individual navigation relationship, we also store <span class="highlight">the date of visit</span>.</p>
<script src="https://gist.github.com/2011787.js"></script>
<p>&nbsp;</p>
<h3>3. Creating the Circos tabular data format</h3>
<p style="text-align: justify;">The <span class="highlight">Circos tabular data format</span> is quite easy to construct. It&#8217;s basically a <span class="highlight">tab-delimited file</span> with row and column headers. A cell is interpreted as a <span class="highlight">value</span> that <span class="highlight">flows</span> from the <span class="highlight">row entity</span> to the <span class="highlight">column entity</span>. We will use the <a target='_blank' href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html">Neo4J Cypher query language</a> to retrieve the data of interest, namely all navigations that occurred within a <span class="highlight">certain time period</span>. Doing so allows us to create <span class="highlight">historical visualizations</span> of our navigations and observe how visit flow behaviors are changing over time.</p>
<script src="https://gist.github.com/2011910.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Next, we create the tab delimited file itself. We <span class="highlight">iterate</span> through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the <span class="highlight">two-dimensional array</span> by <span class="highlight">normalizing</span> (i.e. summing) the number of navigations between the source and target paths. At the end, we <span class="highlight">filter</span> this occurrence matrix on the <span class="highlight">minimal number</span> of required navigations. This ensures that we will only create segments for paths that are <span class="highlight">relevant</span> in the total population. As a final step, we <span class="highlight">print</span> the occurrences matrix as a tab-delimited file. For each path, we will use a <span class="highlight">shorthand</span> as the Circos renderer seems to have problem with long string identifiers.</p>
<script src="https://gist.github.com/2011992.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The text below is a sample of the output generated by the <span class="highlight"><em>printCircosData</em></span> method. It first prints the legend (matching shorthands with actual paths). Next it prints the tab-delimited Circos table.</p>
<script src="https://gist.github.com/2012044.js"></script>
<p>&nbsp;</p>
<h3>4. Use the Circos power</h3>
<p style="text-align: justify;">Although Circos can be installed on your local computer, we will use its <a target='_blank' href="http://mkweb.bcgsc.ca/tableviewer/">online version</a> to create the visualization of our data. Upload your tab-delimited file and just wait a few seconds before enjoying the <span class="highlight">beautiful rendering</span> of your site&#8217;s navigation information.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos1.jpg">
<p align="center"><img width="700" src="http://datablend.be/wp-content/uploads/circos1.jpg" alt="circos" /></p>
<p></a>  </p>
<p style="text-align: justify;">With just a glimpse of an eye we can already see that the <span class="highlight">l3-segment</span> (i.e. the referrals) is significantly larger (almost 6000 navigations) compared to the others segments. The <span class="highlight">outer 3 rings</span> visualize the total amounts of navigations that are <span class="highlight">leaving</span> and <span class="highlight">entering</span> this particular path. In case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its <span class="highlight">total segment count</span> (inner ring) is entirely build up out of navigations that have a referral as source. The <span class="highlight">l6-segment</span> seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my <a target='_blank' href="http://datablend.be/?page_id=44&#038;paged=2">&#8220;The joy of algorithms and NoSQL: a MongoDB example&#8221;</a>-article. Most of its traffic is received through referrals, while a decent amount is also generated through <span class="highlight">direct</span> (l17-segment) and <span class="highlight">search</span> (l27-segment) traffic. The <span class="highlight">l15-segment</span> (my blog&#8217;s main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.</p>
<p style="text-align: justify;">With just a few tweaks to the Circos input data, we can easily <span class="highlight">focus</span> on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos2.jpg">
<p align="center"><img width="700" src="http://datablend.be/wp-content/uploads/circos2.jpg" alt="circos" /></p>
<p></a>  </p>
<h3>5. Conclusions</h3>
<p style="text-align: justify;">In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=267</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework</title>
		<link>https://datablend.be/?p=265</link>
		<comments>https://datablend.be/?p=265#comments</comments>
		<pubDate>Wed, 08 Feb 2012 09:50:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=265</guid>
		<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. Finally, Part 3 illustrates the use of the new MongoDB Aggregation Framework, which boosts performance beyond the<p><a href="https://datablend.be/?p=265">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of MongoDB’s build-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>map-reduce functionality</a> to improve overall performance. <a href="http://datablend.be/?p=1400" target='_blank'>Part 3</a> finally, illustrates the use of the new MongoDB <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>, which boosts performance beyond the capabilities of the map-reduce implementation.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span> through <span class="highlight">Tanimoto coefficients</span>. When employing a low target Tanimoto coefficient however, the number of returned compounds increases exponentially, resulting in a noticeable data transfer overhead. To circumvent this problem, <a href="http://datablend.be/?p=962" target='_blank'>part 2</a> of this article describes the use of MongoDB’s build-in <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'> map-reduce functionality</a> to perform the Tanimoto coefficient calculation local to where the compound data is stored. Unfortunately, the execution of these map-reduce algorithms through Javascript is rather slow and a performance improvement can only be achieved when multiple shards are employed within the same MongoDB cluster.</p>
<p style="text-align: justify;">Recently, MongoDB introduced its new <a href="http://www.mongodb.org/display/DOCS/Aggregation+Framework" target='_blank'>Aggregation Framework</a>. This framework provides a more simple solution to calculating <span class="highlight">aggregate values</span> instead of relying upon the powerful map-reduce constructs. With just a few simple primitives, it allows you to compute, group, reshape and project documents that are contained within a certain MongoDB collection. The remainder of this article describes the refactoring of the map-reduce algorithm to make optimal use of the new MongoDB Aggregation Framework. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison-revisited">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. MongoDB Aggregation Framework</h3>
<p style="text-align: justify;">The MongoDB Aggregation Framework draws on the well-known <span class="highlight">linux pipeline</span> concept, where the output of one command is <span class="highlight">piped</span> or <span class="highlight">redirected</span> to be used as input of the next command.  In case of MongoDB, multiple operators are combined into a single pipeline that is responsible for processing a stream of documents. Some operators, such as <span class="highlight">$match</span>, <span class="highlight">$limit</span> and <span class="highlight">$skip</span> take a document as input and output the same document in case a certain set of criteria’s is met. Other operators, such as <span class="highlight">$project</span> and <span class="highlight">$unwind</span> take a single document as input and reshape that document or emit multiple documents based upon a certain projection. The <span class="highlight">$group</span> operator finally, takes multiple documents as input and groups them into a single document by aggregating the relevant values. <span class="highlight">Expressions</span> can be used within some of these operators to calculate new values or execute string operations.</p>
<p style="text-align: justify;">Multiple operators are combined into a single pipeline that is applied upon a list of documents. The pipeline itself is executed as a MongoDB <span class="highlight">Command</span>, resulting in single MongoDB document that contains an array of all documents that came out at end of the pipeline. The next paragraph details the refactoring of the molecular similarities algorithm as a pipeline of operators. Make sure to (re)read the previous two articles to fully grasp the implementation logic.</p>
<p>&nbsp;</p>
<h3>2. Molecular Similarity Pipeline</h3>
<p style="text-align: justify;">When applying a pipeline upon a certain collection, <span class="highlight">all</span> documents contained within this collection are given as input to the first operator. It&#8217;s considered best practice to <span class="highlight">filter</span> this list as quickly as possible to limit the number of total documents that are passed through the pipeline. In our case, this means filtering out all document that will never be able to <span class="highlight">satisfy the target Tanimoto coefficient</span>. Hence, as a first step, we match all documents for which the fingerprint count is within a certain threshold. If we target a Tanimoto coefficient of 0.8 with a target compound containing 40 unique fingerprints, the <span class="highlight">$match</span> operator look as follows:</p>
<script src="https://gist.github.com/1767088.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Only compounds that have a fingerprint count between 32 and 50 will be streamed to the next pipeline operator. To perform this filtering, the <span class="highlight">$match operator</span> is able to use the index that we have defined for the <em>fingerprint_count</em> property. For computing the Tanimoto coefficient, we need to calculate the number of <span class="highlight">shared fingerprints</span> between a certain input compound and the compound we are targeting. In order to be able to work at the fingerprint level, we use the <span class="highlight">$unwind operator</span>. $unwind peels off the elements of an array one by one, returning a stream of documents where the specified array is replaced by one of its elements. In our case, we apply the $unwind upon the <em>fingerprints</em> property. Hence, each compound document will result in <span class="highlight"><em>n</em></span> compound documents, where <span class="highlight"><em>n</em></span> is the number of unique fingerprints contained within the compound.</p>
<script src="https://gist.github.com/1767390.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">In order to calculate the number of shared fingerprints, we will start off by filtering out all documents which do not have a fingerprint that is in the list of fingerprints of the target compound. For doing so, we again apply the <span class="highlight">$match operator</span>, this time filtering on the <em>fingerprints</em> property, where only documents that contain a fingerprint that is in the list of target fingerprints are maintained.</p>
<script src="https://gist.github.com/1767484.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we only match fingerprints that are in the list of target fingerprints, the output can be used to <span class="highlight">count the total number of shared fingerprints</span>. For this, we apply the <span class="highlight">$group operator</span> on the <em>compound_cid</em>, though which we create a new type of document, containing <span class="highlight">the number of matching fingerprints</span> (by summating the number of occurrences), the <span class="highlight">total number of fingerprints of the input compound</span> and the <span class="highlight">smiles representation</span>.</p>
<script src="https://gist.github.com/1767565.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">We now have all parameters in place to calculate the Tanimoto coefficient. For this we will use the <span class="highlight">$project operator</span> which, next to copying the compound id and smiles property, also adds a new, computed property named <em>tanimoto</em>.</p>
<script src="https://gist.github.com/1767728.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">As we are only interested in compounds that have a target Tanimoto coefficient of 0.8, we apply an additional <span class="highlight">$match operator</span> to filter out all the ones that do not reach this coefficient.</p>
<script src="https://gist.github.com/1767791.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">full pipeline command</span> can be found below. </p>
<script src="https://gist.github.com/1767950.js"></script>
<p>&nbsp;</p>
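<p style="text-align: justify;">As a hedged sketch, the first stages of such a pipeline might be assembled with the 2.x MongoDB Java driver as follows (field names follow the article; the bounds assume the 0.8 coefficient and 40-fingerprint example):</p>
<pre><code>import com.mongodb.*;
import java.util.List;

public class TanimotoPipeline {

    public static AggregationOutput run(DBCollection compounds, List<String> targetFingerprints) {
        // stage 1: only compounds whose fingerprint count can satisfy the coefficient;
        // for 0.8 and 40 target fingerprints: ceil(0.8 * 40) = 32, floor(40 / 0.8) = 50
        DBObject matchCount = new BasicDBObject("$match", new BasicDBObject("fingerprint_count",
                new BasicDBObject("$gte", 32).append("$lte", 50)));
        // stage 2: one document per fingerprint
        DBObject unwind = new BasicDBObject("$unwind", "$fingerprints");
        // stage 3: keep only fingerprints shared with the target compound
        DBObject matchShared = new BasicDBObject("$match",
                new BasicDBObject("fingerprints", new BasicDBObject("$in", targetFingerprints)));
        // stage 4: regroup per compound, counting the shared fingerprints
        DBObject group = new BasicDBObject("$group", new BasicDBObject("_id", "$compound_cid")
                .append("fingerprint_matches", new BasicDBObject("$sum", 1))
                .append("total_count", new BasicDBObject("$first", "$fingerprint_count"))
                .append("smiles", new BasicDBObject("$first", "$smiles")));
        // the $project computing the coefficient and the final $match follow the same pattern
        return compounds.aggregate(matchCount, unwind, matchShared, group);
    }
}
</code></pre>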
<p>The <span class="highlight">output</span> of this pipeline contains a list of compounds which have a Tanimoto of 0.8 or higher with respect to a particular target compound. A visual representation of this pipeline can be found below:</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/pipeline.jpg">
<p align="center"><img width="900" src="http://datablend.be/wp-content/uploads/pipeline.jpg" alt="pipeline" /></p>
<p></a>  </p>
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The new MongoDB Aggregation Framework provides a set of easy-to-use operators that allow users to express map-reduce type of algorithms in a more <span class="highlight">concise</span> fashion. The pipeline concept beneath it offers an <span class="highlight">intuitive</span> way of processing data. It is no surprise that this pipeline paradigm is adopted by various NoSQL approaches, including <a href="https://github.com/tinkerpop/gremlin/wiki" target='_blank'>Tinkerpop&#8217;s Gremlin Framework</a> and <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" target='_blank'>Neo4J&#8217;s Cypher</a> implementation.</p>
<p style="text-align: justify;"><span class="highlight">Performance wise</span>, the pipeline solution is a major improvement upon the map-reduce implementation. The employed operators are <span class="highlight">natively supported by the MongoDB platform</span>, which results in a huge performance improvement with respect to interpreted Javascript. As the Aggregation Framework is also able to work in a <span class="highlight">sharded environment</span>, it easily beats the performance of my initial implementation, especially when the number of input compounds is high and the target Tanimoto coefficient is low. <span class="highlight">Great work from the MongoDB team!</span></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=265</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing RDF Schema inferencing through Neo4J, Tinkerpop, Sail and Gephi</title>
		<link>https://datablend.be/?p=260</link>
		<comments>https://datablend.be/?p=260#comments</comments>
		<pubDate>Mon, 21 Nov 2011 09:47:11 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[sail]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=260</guid>
		<description><![CDATA[Last week, the Neo4J plugin for Gephi was released. Gephi is an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs. The graphs themselves can be loaded through a variety of file formats. Thanks to Martin Škurla, it is now possible to load and lazily explore graphs that are stored<p><a href="https://datablend.be/?p=260">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Last week, the <a target='_blank' href="http://neo4j.org">Neo4J</a> <a target='_blank' href="https://gephi.org/plugins/neo4j-graph-database-support/">plugin</a> for <a target='_blank' href="http://gephi.org/">Gephi</a> was released. Gephi is an open-source <span class="highlight">visualization</span> and <span class="highlight">manipulation</span> tool that allows users to <span class="highlight">interactively browse and explore graphs</span>. The graphs themselves can be loaded through a variety of file formats. Thanks to <a target='_blank' href="http://twitter.com/#!/muzPayne">Martin Škurla</a>, it is now possible to load and lazily explore graphs that are stored in a Neo4J data store.</p>
<p style="text-align: justify;">In <a target='_blank' http://datablend.be/?p=554>one of my previous articles</a>, I explained how Neo4J and the <a target='_blank' href="http://tinkerpop.com/">Tinkerpop</a> framework can be used to load and query RDF triples. The newly released Neo4J plugin now allows to visually browse these RDF triples and perform some more fancy operations such as <span class="highlight">finding patterns</span> and <span class="highlight">executing social network analysis algorithms</span> from within Gephi itself. Tinkerpop&#8217;s <a target='_blank' https://github.com/tinkerpop/blueprints/wiki/Sail-Ouplementation>Sail Ouplementation</a> also supports the notion of <a target='_blank' href="http://www-ksl.stanford.edu/software/jtp/doc/owl-reasoning.html">RDF Schema inferencing</a>. Inferencing is the process where new (RDF) data is <span class="highlight">automatically deducted</span> from existing (RDF) data through <span class="highlight">reasoning</span>. Unfortunately, the Sail reasoner cannot easily be integrated within Gephi, as the Gephi plugin grabs a lock on the Neo4J store and no RDF data can be added, except through the plugin itself.</p>
<p style="text-align: justify;">Being able to visualize the RDF Schema reasoning process and graphically indicate which RDF triples were added manually and which RDF data was automatically inferred would be a nice to have. To implement this feature, we should be able to push graph changes from Tinkerpop and Neo4J to Gephi. Luckily, the <a target='_blank' href="https://gephi.org/plugins/graph-streaming/">Gephi graph streaming plugin</a> allows us to do just that. In the rest of this article, I will detail how to setup the required Gephi environment and how we can stream (inferred) RDF data from Neo4J to Gephi.</p>
<p>&nbsp;</p>
<h3>1. Adding the (inferred) RDF data</h3>
<p style="text-align: justify;">Let&#8217;s start by setting up the required Neo4J/Tinkerpop/Sail environment that we will use to store and infer RDF triples. The setup is similar to the one explained in <a target='_blank' http://datablend.be/?p=554>my previous Tinkerpop article</a>. However, instead of wrapping our <em>GraphSail</em> as a <em>SailRepository</em>, we will wrap it as a <em>ForwardChainingRDFSInferencer</em>. This inferencer will listen for RDF triples that are added and/or removed and will automatically execute RDF Schema inferencing, applying the rules as defined by the <a target='_blank' href="http://www.w3.org/TR/2004/REC-rdf-mt-20040210/">RDF Semantics Recommendation</a>.</p>
<script src="https://gist.github.com/1380489.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">We are now ready to add RDF triples. Let&#8217;s create a simple <span class="highlight">loop</span> that allows us to read-in RDF triples and add them to the Sail store.</p>
<script src="https://gist.github.com/1380494.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The inference method itself is rather simple. We first start by parsing the RDF <span class="highlight">subject</span>, <span class="highlight">predicate</span> and <span class="highlight">object</span>. Next, we start a new transaction, add the statement and commit the transaction. This will not only add the RDF triple to our Neo4J store but will additionally run the RDF Schema inferencing process and automatically add the inferred RDF triples. Pretty easy!</p>
<script src="https://gist.github.com/1380511.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">But how do we retrieve the inferred RDF triples that were added through the inference process? Although the <em>ForwardChainingRDFSInferencer</em> allows us to register a listener that is able to detect changes to the graph, it does not provide the required API to distinct between the <span class="highlight">manually added or inferred RDF triples</span>. Luckily, we can still access the underlying Neo4J store and capture these graph changes by implementing the Neo4J <em>TransactionEventHandler</em> interface. After a transaction is committed, we can fetch the newly created <span class="highlight">relationships</span> (i.e. RDF triples). For each of these relationships, the <span class="highlight">start node</span> (i.e. RDF subject), <span class="highlight">end node</span> (i.e. RFD object) and <span class="highlight">relationship type</span> (i.e. RDF predicate) can be retrieved. In case a RDF triple was added through inference, the value of the boolean property <span class="highlight">&#8220;inferred&#8221;</span> is <span class="highlight">&#8220;true&#8221;</span>. We filter the relationships to the ones that are defined within our domain (as otherwise the full RDFS meta model will be visualized as well). Finally we push the relevant nodes and edges.</p>
<script src="https://gist.github.com/1380584.js"></script>
<p>&nbsp;</p>
<h3>2. Pushing the (inferred) RDF data</h3>
<p style="text-align: justify;">
The streaming plugin for Gephi allows reading and visualizing data that is sent to its master server. This master server is a REST interface that is able to receive graph data through a JSON interface. The <em>PushUtility</em> used in the <em>PushTransactionEventHandler</em> is responsible for generating the <a target='_blank' href="http://wiki.gephi.org/index.php/Specification_-_GSoC_Graph_Streaming_API">required JSON edge and node data format</a> and pushing it to the Gephi master.</p>
<script src="https://gist.github.com/1380606.js"></script>
<p>&nbsp;</p>
<h3>3. Visualizing the (inferred) RDF data</h3>
<p style="text-align: justify;">Start the Gephi Streaming Master server. This will allow Gephi to receive the (inferred) RDF triples that we send it through its REST interface. Let&#8217;s run our Java application and add the following RDF triples:</p>
<script src="https://gist.github.com/1380624.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first two RDF triples above state that a <span class="highlight">teacher</span> <span class="highlight">teaches</span> a <span class="highlight">student</span>. The last RDF triple states that <span class="highlight">Davy</span> <span class="highlight">teaches</span> <span class="highlight">Bob</span>. As a result, the RDF Schema inferencer deducts that <span class="highlight">Davy</span> must be a <span class="highlight">teacher</span> and that <span class="highlight">Bob</span> must be a <span class="highlight">student</span>. Let&#8217;s have a look at what Gephi visualized for us.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi1.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi1.jpg" alt="gephi" /></p>
<p></a></p>
<p style="text-align: justify;">Mmm &#8230; That doesn’t really look impressive <img src='https://datablend.be/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . Let&#8217;s use some formatting. First apply <span class="highlight">Force Atlas lay-outing</span>. Afterwards, scale the edges and enable the labels on both the edges and the nodes. Finally, apply <span class="highlight">partitioning</span> on the edges by coloring the arrows using the <span class="highlight">inferred</span> property on the edges. We can now clearly identify the inferred RDF statements (i.e. Davy being a teacher and Bob being a student).</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi2.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi2.jpg" alt="gephi" /></p>
<p></a></p>
<p>&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s add some additional RDF triples.</p>
<script src="https://gist.github.com/1380683.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Basically, these RDF triples state that both <span class="highlight">teacher</span> and <span class="highlight">student</span> are <span class="highlight">subclasses</span> of <span class="highlight">person</span>. As a result, the RDFS inferencer is able to deduct that both <span class="highlight">Davy</span> and <span class="highlight">Bob</span> must be <span class="highlight">persons</span>. The Gephi visualization is updated accordingly.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/gephi3.jpg">
<p align="center"><img width="550" src="http://datablend.be/wp-content/uploads/gephi3.jpg" alt="gephi" /></p>
<p></a> </p>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">With just a few lines of code we are able to stream (inferred) RDF triples to Gephi and make use of its powerful visualization and analysis tools to explore and inspect our datasets. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/blueprints-streaming-sail-inferencing">Datablend public GitHub repository</a>. Make sure to surf the internet to find some other nice Gephi streaming examples, the coolest one probably being <a target='_blank' href="http://www.youtube.com/watch?v=2guKJfvq4uI">the visualization of the Egyptian revolution on Twitter</a>. </p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=260</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The (non-)sense of NoSQL O(R)M frameworks</title>
		<link>https://datablend.be/?p=258</link>
		<comments>https://datablend.be/?p=258#comments</comments>
		<pubDate>Tue, 11 Oct 2011 09:45:53 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[general]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[orm]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=258</guid>
		<description><![CDATA[NoSQL seems to be ready for prime time. Several NoSQL companies, including 10gen (MongoDB), DataStax (Cassandra) and Neo Technology (Neo4J), recently received millions in funding to expand their (commercial) NoSQL offerings. Even Oracle is now entering the already crowded NoSQL-space with its very own key-value NoSQL Database 11g. No doubt that this type of publicity<p><a href="https://datablend.be/?p=258">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;"><span class="highlight">NoSQL</span> seems to be ready for <span class="highlight">prime time</span>. Several NoSQL companies, including <a target ="_blank" href="http://www.10gen.com">10gen</a> (MongoDB), <a target ="_blank" href="http://www.datastax.com">DataStax</a> (Cassandra) and <a target ="_blank" href="http://neotechnology.com">Neo Technology</a> (Neo4J), recently received millions in funding to expand their (commercial) NoSQL offerings. Even <span class="highlight">Oracle</span> is now entering the already crowded NoSQL-space with its very own key-value <a target ="_blank" href="http://www.oracle.com/technetwork/database/nosqldb">NoSQL Database 11g</a>. No doubt that this type of publicity will boost the <span class="highlight">enterprise adoption</span> of NoSQL technologies.</p>
<p style="text-align: justify;">At the same time, the rise of <span class="highlight">Object Relational Mapping</span> (ORM) within the NoSQL space is impressive. Multiple approaches and frameworks are competing within the same solution space and I can&#8217;t stop wondering whether they do not enter the market <span class="highlight">too soon</span> &#8230; Don&#8217;t get me wrong. I strongly believe in the advantages of using an ORM. In fact, I can not even remember my last enterprise-type application that is not powered by an ORM. So why am I expressing <span class="highlight">my concerns</span> in case of NoSQL?</p>
<p style="text-align: justify;">Relational databases have been around for the past thirty or more years. We have all done our own share of <span class="highlight">low-level database work</span> and have been exposed to the overal technicalities of a RDBMS. As a result, when advancing to the use of an ORM, people can count on this <span class="highlight">basic knowledge set</span> when problems are encountered. Most of us however, lack this type of in-depth NoSQL expertise. Nevertheless, as the NoSQL hype is growing, people will fall back on their ORM expertise in order to quickly adopt this new technology. This is enforced by the motivation of several NoSQL ORM frameworks: <em>&#8220;Don&#8217;t mind the NoSQL complexities. Just employ an approach you already know!&#8221;</em>. Unfortunately, all abstractions will <span class="highlight">fail</span> you at some point in time &#8230; </p>
<p style="text-align: justify;">But what about the <span class="highlight">mapping-side of things</span>? First of all, various NoSQL approaches, including document-based databases, don&#8217;t exhibit the <span class="highlight">object-oriented impedance mismatch</span>. In approaches that do, an ORM will not always make sense. Several ORMs target a variety of NoSQL technologies by providing a generic mapping framework: <em>&#8220;Map your objects. Don&#8217;t care about the target NoSQL technology. We&#8217;ll do that.&#8221;</em>. Unfortunately, this approach will fail you at achieving the true (performance) promise of NoSQL. An ORM framework would not have been able to help me at attaining the performance gains as described in one of <a target ="_blank" href="http://datablend.be/?p=202">my previous articles</a>. In fact, it would not even make sense to implement the problem domain as such. Hence, I&#8217;m afraid that the use of an ORM framework will disappoint a lot of NoSQL newcomers &#8230;</p>
<p style="text-align: justify;">ORM frameworks in the NoSQL space will however get us closer to the idea of <span class="highlight">Polyglot Persistence</span>. <a target ="_blank" href="http://www.springsource.org/spring-data/neo4j">Spring Data Graph</a> for instance, allows you to map <span class="highlight">Pojos</span> in such a way that parts of it are persisted to a traditional database and parts of it to the Neo4J graph database. This is achieved in a technical transparant way, making it an easy-to-use solution. Nevertheless, I still feel it&#8217;s way too soon for ORMs as the knowledge and best practices on NoSQL are just being developed.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=258</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 2)</title>
		<link>https://datablend.be/?p=256</link>
		<comments>https://datablend.be/?p=256#comments</comments>
		<pubDate>Thu, 11 Aug 2011 09:44:06 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=256</guid>
<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In part 1 of this article, I described the use of MongoDB to solve a specific<p><a href="https://datablend.be/?p=256">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;"><a href="http://datablend.be/?p=962" target='_blank'>Part 1</a> of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In <a href="http://datablend.be/?p=962" target='_blank'>part 1</a> of this article, I described the use of <a href="http://www.mongodb.org" target='_blank'>MongoDB</a> to solve a specific <a href="http://en.wikipedia.org/wiki/Cheminformatics" target='_blank'>Chemoinformatics</a> problem, namely the computation of <span class="highlight">molecular similarities</span>. Depending on the target <span class="highlight">Tanimoto coefficient</span>, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query <span class="highlight">increases significantly</span> when the target Tanimoto is <span class="highlight">lowered</span>. The example code on the <a href="https://github.com/datablend/mongo-compound-comparison" target='_blank'>GitHub repository</a> for instance, imports and indexes <span class="highlight">~25000</span> chemical compounds. When a target Tanimoto of <span class="highlight">0.8</span> is employed, the query returns <span class="highlight">~700</span> compounds. When the target Tanimoto is lowered to <span class="highlight">0.6</span>, the number of returned compounds increases to <span class="highlight">~7000</span>. Using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a>!</p>
<p>&nbsp;</p>
<h3>1. MongoDB molecular similarity map-reduce query</h3>
<p style="text-align: justify;"><a href="http://nl.wikipedia.org/wiki/MapReduce" target='_blank'>Map-reduce</a> is a conceptual framework, introduced by Google, to enable the <span class="highlight">processing of huge datasets</span> using a <span class="highlight">large number of processing nodes</span>. The general idea is that a <span class="highlight">larger problem</span> is <span class="highlight">divided</span> in a set of <span class="highlight">smaller subproblems</span> that can be answered (i.e. solved) by an individual processing node (the <span class="highlight">map-step</span>). Afterwards, the <span class="highlight">individual solutions</span> are <span class="highlight">combined</span> again to produce the <span class="highlight">final answer</span> to the <span class="highlight">larger problem</span> (the <span class="highlight">reduce-step</span>). By making sure that the individual <span class="highlight">map and reduce steps</span> can be computed <span class="highlight">independently</span> of each other, this divide-and-conquer technique can be easily <span class="highlight">parallellized</span> on a cluster of processing nodes. Let&#8217;s start by refactoring our solution to use MongoDB&#8217;s map-reduce functionality.</p>
<script src="https://gist.github.com/1139665.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The <span class="highlight">map-step</span> of a MongoDB&#8217;s map-reduce implementation takes a MongoDB <span class="highlight">document as input</span> and <span class="highlight">emits</span> one (or more) answers (which, in essence, are again MongoDB documents). Executing our <span class="highlight">map-step</span> on all compound documents in the <span class="highlight">compounds</span> collection would not be very efficient. Instead, we would like to limit the execution of our map-step to those documents that can theoretically match the target Tanimoto. Luckily, we already defined this query, namely the <span class="highlight">compound selection query</span> that was described in the part one of this article! By employing this query, only compounds that match this query are pushed through the map-step. A MongoDB map (and reduce) function is expressed through <span class="highlight">JavaScript</span>. In our case, we calculate the number of unique fingerprint patterns that are <span class="highlight">shared</span> by both the target and input compound. In case the <span class="highlight">minimum number of fingerprint patterns</span> is reached, the map-step <span class="highlight">emits</span> a document containing the PubChem identifier (as id) and some essential statistics (as values). A <span class="highlight">reduce-step</span> is employed to <span class="highlight">aggregate</span> answers into the final result. In our case however, we are interested into the individual results for each compound (document). Hence, no reduce function is applied. When this map-reduce function is executed only <span class="highlight">27</span> compounds are returned (which could potentially match), instead of <span class="highlight">7000</span> compounds when employing the previous Java query!</p>
<p style="text-align: justify;">One would expect the <span class="highlight">execution time</span> of the <span class="highlight">map-reduce query</span> to be considerably faster compared to the <span class="highlight">Java solution</span>. Unfortunately, this is <span class="highlight">not</span> the case. First of all, interpreted Javascript is a multitude of times slower compared to Java. Secondly, although map-reduce steps could be <span class="highlight">parallellized</span> when multiple CPU cores are available, the MongoDB map-reduce function always runs <span class="highlight">single-threaded</span>. To <span class="highlight">circumvent this limitation</span>, one can use <a href="http://www.mongodb.org/display/DOCS/Sharding+Introduction" target='_blank'>MongoDB sharding</a>. Simply explained, instead of putting all data on a <span class="highlight">single</span> MongoDB node, <span class="highlight">multiple</span> MongoDB nodes are employed, each responsible for storing <span class="highlight">a part of the total data set</span>. When executing our map-reduce function, each node will execute the map-reduce steps on its part of the data in <span class="highlight">parallel</span>. Hence, when using sharding on a cluster of <span class="highlight">4</span> MongoDB nodes, the map-reduce query  executes almost <span class="highlight">4</span> times faster, already catching up with the performance of the Java solution. With the exception of the MongoDB <span class="highlight">sharding configuration</span>, no changes are required to the map-reduce function itself. Hence, <span class="highlight">scaling horizontally</span> with MongoDB is a breeze &#8230; </p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">MongoDB&#8217;s map-reduce performance is a bit of a disappointment. MongoDB currently advises to only use it for <span class="highlight">near real-time computations</span>. <span class="highlight">Version 2.0</span> of MongoDB should drastically <span class="highlight">improve</span> map-reduce performance, as the JavaScript engine will be replaced by other execution platforms. Nevertheless, map-reduce performance can currently be boosted by splitting the load on multiple MongoDB shards.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=256</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The joy of algorithms and NoSQL: a MongoDB example (part 1)</title>
		<link>https://datablend.be/?p=254</link>
		<comments>https://datablend.be/?p=254#comments</comments>
		<pubDate>Sun, 07 Aug 2011 09:42:36 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[NoSQL]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=254</guid>
<description><![CDATA[[information] Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. Part 2 discusses the refactoring of this solution by making use of MongoDB’s built-in map-reduce functionality to improve overall performance. [/information] In one of my previous blog posts, I debated the superficial idea that you should own<p><a href="https://datablend.be/?p=254">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[[information]
<p style="text-align: justify;">Part 1 of this article describes the use of MongoDB to implement the computation of molecular similarities. <a href="http://datablend.be/?p=968" target='_blank'>Part 2</a> discusses the refactoring of this solution by making use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB’s build-in map-reduce functionality</a> to improve overall performance.</p>
[/information]
<p style="text-align: justify;">In one of my <a href="http://datablend.be/?p=437"  target='_blank' >previous</a> blog posts, I debated the superficial idea that you should own billions of data records before you are <em>eligible</em> to use NoSQL/Big Data technologies. In this article, I try to illustrate my point, by employing NoSQL, and more specifically <a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a>, to solve a specific <a  target='_blank' href="http://en.wikipedia.org/wiki/Cheminformatics">Chemoinformatics</a> problem in a truly elegant and efficient way. The complete source code can be found on the <a target='_blank' href="https://github.com/datablend/mongo-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Molecular similarity theory</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Chemoinformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. (This by applying the hypothesis that <span class="highlight">similar compounds</span> generally exhibit <span class="highlight">similar biological</span> activities.) Unfortunately, finding substructures in chemical compounds is a <span class="highlight">NP-complete</span> problem. Hence, calculating similarities for a particular target compound can take a very long time when considering millions of input compounds. Scientist solved this problem by introducing the notion of <span class="highlight">structural keys</span> and <span class="highlight">fingerprints</span>.</p>
<p style="text-align: justify;">In case of <span class="highlight">structural keys</span>, we precompute the answers on a couple of specific questions that try to capture the essential characteristics of a compound. Each answer is assigned a fixed location within a <em>bitstring</em>. At query time, a lot of time is saved by only executing substructure searches for compounds that have compatible structural keys. (Compatibility being computed by making use of efficient bit operators.)<br />
When employing <span class="highlight">fingerprints</span>, all linear substructure patterns of a certain length are calculated. As the number of potential patterns is huge, it is not possible to assign an individual bit position to each possible pattern (as is done with structural keys). Instead, the fingerprints patterns are used in a <span class="highlight">hash</span>. The downside of this approach is that, depending of the size of the hash, multiple fingerprint patterns share the same bit position, giving lead to potential <span class="highlight">false positives</span>.</p>
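<p style="text-align: justify;">The following toy example (mine, not from the post) illustrates how folding patterns into a fixed-size bitstring can introduce such false positives:</p>
<pre>
// Illustrative sketch: fold fingerprint patterns into a fixed-size bitstring.
// Two different patterns may hash to the same position, so a candidate can
// pass the bit-level screen without truly containing the query patterns.
import java.util.BitSet;

public class HashedFingerprintSketch {
    private static final int BITS = 1024; // typical fixed fingerprint width

    static BitSet fold(String... patterns) {
        BitSet bits = new BitSet(BITS);
        for (String pattern : patterns) {
            // Different patterns can map to the same bit position (collision).
            bits.set(Math.abs(pattern.hashCode()) % BITS);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet query = fold("C-C-N", "C=O");
        BitSet candidate = fold("C-C-N", "C-O");
        // The candidate passes the screen whenever all query bits are set in
        // it; if "C=O" and "C-O" happened to collide, that match would be a
        // false positive requiring a full verification step afterwards.
        BitSet screen = (BitSet) query.clone();
        screen.and(candidate);
        System.out.println(screen.equals(query));
    }
}
</pre>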
<p style="text-align: justify;">In this article, we will demonstrate the use of <span class="highlight">non-hashed fingerprints</span> to calculate compound similarities (i.e. using the raw fingerprints). This approach has two advantages:</p>
<ol>
<li>We eliminate the chance of false positives</li>
<li>The raw fingerprints can be used in other types of structural compound mining</li>
</ol>
<p>&nbsp;</p>
<h3>2. Molecular similarity practice</h3>
<p style="text-align: justify;">Let&#8217;s start by showing how to calculate the fingerprints of a chemical compound. Various fingerprinting algorithms are available today. Luckily, we don&#8217;t need to implement these algorithms ourselves. The excellent open-source <a target='_blank' href="http://jcompoundmapper.sourceforge.net">jCompoundMapper</a> library provides us with all the required functionality. It uses <a target='_blank' href="http://en.wikipedia.org/wiki/Chemical_table_file">MDL SD</a> formatted compounds as input and is able to output a variety of fingerprints. We start by creating a reader for MDL SD files. Afterwards, we use the <span class="highlight">2DMol</span> fingerprinter to encode the first molecule in the <em>compounds.sdf</em> file.</p>
<script src="https://gist.github.com/1129153.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">The first molecule of the input file has the following chemical structure: <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a>. Its 2DMol fingerprint contains 120 unique fingerprint patterns, a selection of them shown below:</p>
<script src="https://gist.github.com/1129183.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Now the question that remains is how to measure the similarity between the fingerprints of compounds <span class="highlight">A</span> and <span class="highlight">B</span>. Several methods are again available, but in our case we will use the so-called <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto association coefficient</a>, which is defined as:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large T = \frac{N_{AB}}{N_{A} + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;"><span class="highlight">N<sub>A</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">A</span>, while <span class="highlight">N<sub>B</sub></span> refers to the number of unique fingerprint patterns found in compound <span class="highlight">B</span>. <span class="highlight">N<sub>AB</sub></span> specifies the number of unique fingerprint patterns found in <span class="highlight">both</span> compound <span class="highlight">A</span> and <span class="highlight">B</span>. As can be observed from the equation, two identical compounds would have a <span class="highlight">Tanimoto coefficient</span> of <span class="highlight">1.0</span>.</p>
<p>&nbsp;</p>
<h3>3. MongoDB datamodel</h3>
<p style="text-align: justify;"><a  target='_blank'  href="http://www.mongodb.org/">MongoDB</a> is a so-called <a href="http://en.wikipedia.org/wiki/Document-oriented_database" target='_blank'>document-oriented database</a>. When using document-oriented databases, you basically group together all related data in a <span class="highlight">single document</span> instead of normalizing your data in various <span class="highlight">RDBMS tables</span> and using <span class="highlight">joins</span> to retrieve the required information later on. In case of MongoDB, documents are stored using <a href="http://bsonspec.org" target='_blank' >BSON</a> (binary JSON) and related documents are stored in <span class="highlight">collections</span>. Let&#8217;s start by describing the <span class="highlight">compounds</span> collection that stores a separate document for each compound. The JSON-document of our <a href="http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=46200001" target='_blank'>C<sub>52</sub>H<sub>87</sub>N<sub>3</sub>O<sub>13</sub></a> compound looks as follows:</p>
<script src="https://gist.github.com/1129242.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For each compound, we store its unique  easy to create this JSON document for each compound. Once we retrieved a molecule through the MDL file reader, it is just a matter of creating the necessary document objects and <span class="highlight">inserting</span> them in the <span class="highlight">compounds collection</span>.</p>
<script src="https://gist.github.com/1129346.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To complete our MongoDB data model, we will add the <span class="highlight">fingerprintscount</span> collection. The rationale for this collection will be explained in the next section of this article. For now, just consider it to be a collection that stores the number of times a particular fingerprint pattern was encountered when importing the compound data. The listing below shows an extract of the <span class="highlight">fingerprintscount</span> collection.</p>
<script src="https://gist.github.com/1129361.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">To update this collection, we make use of <span class="highlight">MongoDB&#8217;s increment</span> operator in combination with an <span class="highlight">upsert</span>. This way, whenever a fingerprint pattern is encountered for the first time, a document is automatically created for the fingerprint and its count is put on 1. If document already exists for this fingerprint pattern, its associated count is incremented by 1. Pretty easy!</p>
<script src="https://gist.github.com/1129357.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">For enhancing query performance, we need to define <span class="highlight">indexes</span> for the appropriate document properties. By doing so, the created <span class="highlight">B-Tree</span> indexes are used by MongoDB &#8216;s <span class="highlight">query optimizer</span> to quickly find back the required documents instead of needing to scan through all of them. Creating these indexes is very easy as illustrated below.</p>
<script src="https://gist.github.com/1129375.js"></script>
<p>&nbsp;</p>
<h3>4. MongoDB molecular similarity query</h3>
<p style="text-align: justify;">It&#8217;s time to bring it all together. Imagine we want to find all compounds that have a <span class="highlight">Tanimoto coefficient</span> of 0.8 for a particular input compound. As MongoDB&#8217;s querying speed is quite fast we could try to brute-force it and basically compute the Tanimoto coefficient for each compound document that is stored in the compounds collection. But let&#8217;s try to do it a bit smarter. By looking at the Tanimoto coefficient equation, we can already narrow the search space significantly. Imagine our input compound (<span class="highlight">A</span>) has 40 unique fingerprint patterns. Let&#8217;s fill in some of the parameters in the Tanimoto equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{AB}}{40 + N_{B} - N_{AB}}"></center></p>
<p style="text-align: justify;">From this equation, we can deduct the <span class="highlight">minimum</span> and <span class="highlight">maximum</span> number of unique fingerprint patterns another compound should have in order to (in the best case) satisfy the equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{N_{B}}{40 + N_{B} - N_{B}}"> <img src="http://tools.jcisio.com/tex/mathtex/?\large 0.8 = \frac{40}{40 + N_{B} - 40}"></center></p>
<p style="text-align: justify;">Hence, the compound we are comparing with should only be considered if it has between <span class="highlight">32</span> and <span class="highlight">50</span> unique fingerprint patterns. By taking this into account in our search query, we can narrow the search space significantly. But we can optimize our query a bit further. Imagine we would query for all documents that share <span class="highlight">at least 1 of 9</span> unique fingerprints patterns with the input compound. All documents that are not part of the resultset of this query will never be able to reach the Tanimoto coefficient of 0.8, as the maximum of possibly remaining shared fingerprint patterns would be <span class="highlight">31</span>, and</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large \frac{31}{40 + 31 - 31} = 0.77"></center></p>
<p style="text-align: justify;">were we replaced <span class="highlight">N<sub>B</sub></span> by <span class="highlight">31</span> to maximize the resulting Tanimoto coefficient. This <span class="highlight">9</span>-number can be obtained by solving the following equation:</p>
<p><center><img src="http://tools.jcisio.com/tex/mathtex/?\large N_{query} = (N_{A} - N_{minimum-number-of-fingerprint-patterns}) + 1"></center></p>
<p style="text-align: justify;">Which 9 fingerprint patterns of the input compound should we consider in this query? Common sense tells us the pick the 9 fingerprint patterns for which the <span class="highlight">occurrence in the total compound population is the lowest</span>, as this would narrow our search space even further. (Hence the need for the <span class="highlight">fingerprintscount</span> collection to be able to quickly retrieve the counts of the individual fingerprint patterns.) The code listing below illustrates how to retrieve the counts for the individual patterns. Using the MongoDB <span class="highlight">QueryBuilder</span> we create the query object that, using the <span class="highlight">in</span>-operator, specifies that a document from the <span class="highlight">fingerprintscount</span> collection should only be returned if its fingerprint pattern is part of the fingerprint patterns we are interested in. Using the additional <span class="highlight">sort</span>-operator, we retrieve the individual fingerprint patterns in ascending <span class="highlight">count</span>-order.</p>
<script src="https://gist.github.com/1129445.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">It&#8217;s time for the final step, namely retrieving all compound documents that <span class="highlight">have the potential</span> to satisfy a Tanimoto coefficient of for instance 0.8. For doing this, we build a query that retrieves all compounds that have at least one pattern out of the sublist of fingerprint patterns we calculated in the previous step. Additionally, only compounds that have a total count of fingerprint patterns that is within the minimum and maximum range should be returned. Although this optimized query drastically narrows the search space, we still need to compute the Tanimoto coefficient afterwards to ensure that it has a value of 0.8 or above.</p>
<script src="https://gist.github.com/1129469.js"></script>
<p>&nbsp;</p>
<h3>5. Conclusion</h3>
<p style="text-align: justify;">The MongoDB implementation is able to easily screen a database of a million compounds in subsecond time. The required calculation time however, largely depends on the target Tanimoto: a target Tanimoto of 0.6 will result in significantly more results compared to a target Tanimoto of 0.8. When using the <a href="http://www.mongodb.org/display/DOCS/Explain" target='_blank'>MongoDB explain functionality</a>, one can observe that the query time is rather short, compared to the time that is required to transfer the data to the Java application and execute the Tanimoto calculations. In my next article, I will discuss the use of <a href="http://www.mongodb.org/display/DOCS/MapReduce" target='_blank'>MongoDB&#8217;s build-in map-reduce functionality</a> to keep the Tanimoto calculation local to the compound data and hence improve overall performance.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=254</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
