<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend</title>
	<atom:link href="https://datablend.be/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>https://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>The power of Manchester City: a data analysis</title>
		<link>https://datablend.be/?p=491</link>
		<comments>https://datablend.be/?p=491#comments</comments>
		<pubDate>Sun, 17 Aug 2014 11:55:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[graph]]></category>
		<category><![CDATA[infographics]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=491</guid>
		<description><![CDATA[What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the Opta dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data has been loaded into the neo4j graph database<p><a href="https://datablend.be/?p=491">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the <a href="http://www.optasports.com" title="Opta" target="_blank">Opta</a> dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data has been loaded into the <a href="http://neo4j.org" title="neo4j" target="_blank">neo4j</a> graph database, and the <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" title="Cypher Query Language" target="_blank">Cypher Query Language</a> has been used to extract the passing statistics.</p>
<p style="text-align: justify;">For each of the teams, we calculated the average position of their players on the pitch (based upon their individual actions). The thickness of the red edges visualises the number of successful passes between two individual players. Finally, we visualised the number of successful passes of each player (see shirt number) and the distribution of the length of his passes. Have fun exploring this infographic!</p>
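<p style="text-align: justify;">The two aggregations described above can be sketched in a few lines of Python. This is only an illustration, not the Cypher queries used for the infographic: the record layouts (<em>shirt number, x, y</em> for an action and <em>passer, receiver</em> for a successful pass) and the sample values are assumptions made for the example.</p>

```python
from collections import Counter, defaultdict

# Hypothetical action records: (shirt number, x, y) for each on-ball action.
actions = [(10, 60.0, 30.0), (10, 70.0, 40.0), (4, 20.0, 50.0)]
# Hypothetical successful passes: (passer, receiver).
passes = [(4, 10), (10, 4), (4, 10)]


def average_positions(actions):
    """Return {player: (mean x, mean y)} over all recorded actions."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for player, x, y in actions:
        entry = sums[player]
        entry[0] += x
        entry[1] += y
        entry[2] += 1
    return {p: (sx / n, sy / n) for p, (sx, sy, n) in sums.items()}


def pass_counts(passes):
    """Count successful passes per unordered player pair (the edge thickness)."""
    return Counter(frozenset(pair) for pair in passes)


positions = average_positions(actions)
edges = pass_counts(passes)
```

<p style="text-align: justify;">Counting per unordered pair means a pass from 4 to 10 and a pass from 10 to 4 both thicken the same edge.</p>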
<p> </p>
<p><center><a href="http://datablend.be/wp-content/uploads/2014/08/VoetbalInfographic-2.jpg"><img src="http://datablend.be/wp-content/uploads/2014/08/VoetbalInfographic-2.jpg" alt="VoetbalInfographic-2" width="400" height="300" class="alignnone size-medium wp-image-498" /></a></center> </p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=491</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Coalition-Cocktail &#8211; Hacking the Elections @ Engagor</title>
		<link>https://datablend.be/?p=440</link>
		<comments>https://datablend.be/?p=440#comments</comments>
		<pubDate>Tue, 27 May 2014 14:57:42 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[circos]]></category>
		<category><![CDATA[engagor]]></category>
		<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hackaton]]></category>
		<category><![CDATA[vk14]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=440</guid>
		<description><![CDATA[Last weekend, Engagor organised their hacktheelections hackathon. The Datablend team (Quentin, Stijn and Davy) was joined by Marc Broos, Tim Coene and Josbert van de Zande with one goal in mind: trying to visualise the (pre-arranged?) political coalition and, if possible, also predict the formation period. Technically, we extracted over 160K tweets through the Engagor API.<p><a href="https://datablend.be/?p=440">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: left;">Last weekend, <a title="Engagor" href="https://engagor.com" target="_blank">Engagor</a> organised their <a title="hacktheelections" href="http://www.hacktheelections.com" target="_blank">hacktheelections</a> hackathon. The Datablend team (Quentin, Stijn and Davy) was joined by Marc Broos, Tim Coene and Josbert van de Zande with one goal in mind: trying to visualise the (pre-arranged?) political coalition and, if possible, also predict the formation period.</p>
<p style="text-align: left;">Technically, we extracted over 160K tweets through the Engagor API. Next, a &#8220;sentiment&#8221;-based political graph was built and stored in the <a href="http://www.neo4j.org" title="Neo4J" target="_blank">Neo4J</a> graph database. A <a href="http://www.gephi.org" title="Gephi" target="_blank">Gephi</a>-based visualisation, based upon community detection, revealed the &#8220;truth&#8221;, which was, to be honest, a bit disappointing and at the same time somewhat expected: the political parties from Wallonia form a solid community, while the Flemish parties are interconnected through many clusters. A warning sign for the upcoming formation process?</p>
<p style="text-align: left;">The slidedeck below provides an overview of the single day of hacking. Our results ranked third (out of 8 teams). Although our results were very informative, it&#8217;s quite hard to compete with a &#8220;pokemon&#8221;-themed fighting animation between politicians *wink*. Many thanks to <a title="Engagor" href="https://engagor.com" target="_blank">Engagor</a> for the spotless organisation!</p>
<p><center><iframe src="http://www.slideshare.net/slideshow/embed_code/35157124" width="600" height="489" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe><br/><br/></center></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=440</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datablend launches vk14-bingo.be</title>
		<link>https://datablend.be/?p=403</link>
		<comments>https://datablend.be/?p=403#comments</comments>
		<pubDate>Mon, 12 May 2014 11:26:55 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[visualisation]]></category>
		<category><![CDATA[vk14]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=403</guid>
		<description><![CDATA[Are you also being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows at a glance what each party stands for? Look no further and use vk14-bingo.be. We have analysed the various party programmes word for<p><a href="https://datablend.be/?p=403">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p>Are you also being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows at a glance what each party stands for? Look no further and use vk14-bingo.be. We have analysed the various party programmes word for word and plotted their core ideas against each other. You can now easily check which themes each party focuses on and compare the parties with one another.</p>
<p>But there is more. Our politicians have meanwhile found their way to Twitter and diligently use this new medium to promote themselves, their party and their positions. But how faithful are they to their own party programme? vk14-bingo.be analyses in real time the tweets of more than 850 Flemish politicians with one goal: finding out which party is the first to virtually complete its programme and win #vk14-bingo!</p>
<p>Do you have questions or remarks about this analysis? Contact us via <a title="vk14-bingo@datablend.be" href="mailto:vk14-bingo@datablend.be" target="_blank">vk14-bingo@datablend.be</a>. Interested in an analysis of your own data, large or small? Visit <a title="datablend.be" href="http://www.datablend.be" target="_blank">datablend.be</a> or contact us via <a title="vk14-bingo@datablend.be" href="mailto:vk14-bingo@datablend.be" target="_blank">info@datablend.be</a> and we will gladly come and visit you.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=403</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datanews &#8211; Even with little data you can do useful things!</title>
		<link>https://datablend.be/?p=353</link>
		<comments>https://datablend.be/?p=353#comments</comments>
		<pubDate>Fri, 24 Jan 2014 07:48:21 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[datablend]]></category>
		<category><![CDATA[datanews]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=353</guid>
		<description><![CDATA[]]></description>
				<content:encoded><![CDATA[<p><a href="http://datablend.be/wp-content/uploads/2014/01/datanews.jpg"><img src="http://datablend.be/wp-content/uploads/2014/01/datanews.jpg" alt="datanews" width="800" class="alignnone size-medium wp-image-354" /></a></p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=353</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The power of graphs to analyse biological data</title>
		<link>https://datablend.be/?p=344</link>
		<comments>https://datablend.be/?p=344#comments</comments>
		<pubDate>Mon, 02 Dec 2013 07:04:34 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphconnect]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=344</guid>
		<description><![CDATA[Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets. The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.]]></description>
				<content:encoded><![CDATA[<p>Watch Davy Suvee present at <a href="http://www.graphconnect.com/london/" title="GraphConnect London 2013">GraphConnect London 2013</a> on the power of graph databases to analyse biological datasets.</p>
<p><iframe src="//player.vimeo.com/video/80463932" width="500" height="281" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe></p>
<p><a href="http://vimeo.com/80463932">The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013</a> from <a href="http://vimeo.com/neo4j">Neo Technology</a> on <a href="https://vimeo.com">Vimeo</a>.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=344</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yelp graph: checkin-based business clustering</title>
		<link>https://datablend.be/?p=308</link>
		<comments>https://datablend.be/?p=308#comments</comments>
		<pubDate>Sun, 01 Dec 2013 11:51:58 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[yelp]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=308</guid>
		<description><![CDATA[Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin-sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible<p><a href="https://datablend.be/?p=308">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, <a title="yelp" href="http://yelp.com" target="_blank">Yelp</a> made available <a title="a sample dataset" href="http://www.yelp.co.uk/dataset_challenge" target="_blank">a sample dataset</a> from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin-sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible to visually cluster businesses by category, based purely on their checkin data. The checkin data itself is available on a day-hour level: for each business, it is possible to retrieve, for example, the number of checkins on a Sunday afternoon between 3 and 4. So, with only this data in mind, are we able to cluster businesses as being restaurants or fashion stores, based purely on the correlations calculated between their checkin data? For this experiment, we use the <a title="Neo4J" href="http://neo4j.org" target="_blank">Neo4J</a> graph database for storing our checkin-based correlation graph and employ the <a title="Gephi" href="http://gephi.org" target="_blank">Gephi</a> graph visualisation platform for interpreting the identified business communities/clusters. As always, the <a title="full source code" href="https://github.com/datablend/yelp-graph" target="_blank">full source code</a> of this article can be found on the Datablend public GitHub repository (although you will need to acquire the dataset yourself through the <a title="Yelp Dataset Challenge" href="http://www.yelp.co.uk/dataset_challenge" target="_blank">Yelp Dataset Challenge</a> portal).</p>
<h3>1. Building the Neo4J checkin correlation graph</h3>
<p style="text-align: justify;">We start by parsing both the business and checkin JSON files from the Yelp Dataset Challenge. Unfortunately, checkin data is available for only 8,282 out of the 11,537 supplied businesses. In addition, many of these have only a limited set of associated checkins. Hence, in order to make sure that only relevant correlations are calculated, we ignore the ones that have fewer than 100 checkins, resulting in around 1920 remaining businesses.</p>
<p style="text-align: justify;">Next, we try to identify the correlation between two businesses by using the Pearson Correlation Coefficient (read <a title="this site" href="https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php" target="_blank">this site</a> for a nice introduction). Simply put, we try to identify whether a linear association exists between the checkins of two individual businesses. In our case, the calculation is based upon 168 data points (24 hours x 7 days), the idea being that two breakfast restaurants will get most of their checkins from the morning till noon, while two bars will get most of their checkins during the evening and at night. Hence, we expect the correlation between two businesses of the same type to be quite high, while different types of businesses (e.g. a breakfast restaurant and a bar) will result in little or no correlation.</p>
<p style="text-align: justify;">Time to get our hands dirty. After parsing the data files, we use the existing Apache Commons Math library to calculate the pairwise correlation between the checkin datasets of the 1920 businesses. If the resulting coefficient is 0.8 or higher, we consider both businesses to be correlated. We create a unique node for each business within the Neo4J graph and connect them via a &#8220;correlated&#8221;-relationship.</p>
<script src="https://gist.github.com/7704584.js"></script>
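<p style="text-align: justify;">For readers who want to experiment without Java or Neo4J, the correlation step can be approximated in plain Python. The sketch below is not the code from the gist above: the function names and the three toy businesses are made up for illustration, and the profiles have only 4 data points instead of the 168 used in the article. Only the 0.8 threshold is taken from the text.</p>

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(profiles, threshold=0.8):
    """Yield (business_a, business_b, r) for each pair with r at or above the threshold."""
    for (a, xs), (b, ys) in combinations(sorted(profiles.items()), 2):
        r = pearson(xs, ys)
        if r >= threshold:
            yield a, b, r


# Hypothetical checkin counts per time slot (the article uses 168 slots).
profiles = {
    "diner_a": [9, 7, 1, 0],
    "diner_b": [8, 6, 2, 0],
    "bar": [0, 1, 6, 9],
}
edges = list(correlated_pairs(profiles))
```

<p style="text-align: justify;">Each yielded pair corresponds to one &#8220;correlated&#8221;-relationship in the graph. In this toy input only the two diners end up connected.</p>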
<p style="text-align: justify;">The generated graph contains 606 unique nodes (i.e. businesses that are correlated to at least one other business) and 2585 edges (i.e. actual correlations).</p>
<h3>2. Gephi interpretation</h3>
<p style="text-align: justify;">Our next task is to observe whether groups of businesses exist that are highly correlated (i.e. highly interconnected) and to identify whether these correlations make sense. In order to do so, we import our Neo4J correlation graph into Gephi through the <a title="Gephi Neo4J plugin" href="https://marketplace.gephi.org/plugin/neo4j-graph-database-support/" target="_blank">Gephi Neo4J plugin</a>. Once loaded, we run the <a title="modularity" href="http://wiki.gephi.org/index.php/Modularity" target="_blank">modularity</a>-function to identify meaningful communities. These computed communities are then used to partition (i.e. color) the nodes (and their related edges) so that clusters can easily be observed. Next, we apply K-core filtering, in our case 3-core, to keep the subgraph in which all nodes have a degree of at least 3 (i.e. 3 relationships with other nodes). The size of the nodes (and their associated labels) is configured to be proportional to their degree. Finally, we apply the <a title="Fruchterman-Reingold" href="http://wiki.gephi.org/index.php/Fruchterman-Reingold" target="_blank">Fruchterman-Reingold</a> layout in order to clearly visualise the various clusters.</p>
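<p style="text-align: justify;">To make the K-core step more concrete, here is a minimal Python sketch of the idea: keep removing nodes with fewer than k neighbours until none are left. It is an illustration of the general technique, not Gephi's implementation, and the function name and toy graph are assumptions.</p>

```python
def k_core(adjacency, k):
    """Return the k-core of an undirected graph given as {node: set(neighbours)}."""
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(neighbours) for node, neighbours in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, nbrs in graph.items() if len(nbrs) < k]:
            # Drop the low-degree node and detach it from its remaining neighbours.
            for nbr in graph.pop(node):
                if nbr in graph:
                    graph[nbr].discard(node)
            changed = True
    return graph


# Toy graph: a fully connected group of four plus one loosely attached node "e".
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a"},
}
core = k_core(toy, 3)
```

<p style="text-align: justify;">Removing a node can push its neighbours below the threshold, which is why the loop repeats until a full pass removes nothing.</p>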
<p style="text-align: justify;"><a href="http://datablend.be/wp-content/uploads/2013/12/yelp-graph.jpg" target="_blank"><img class="alignnone size-medium wp-image-334" alt="yelp-graph" src="http://datablend.be/wp-content/uploads/2013/12/yelp-graph.jpg"/></a></p>
<p style="text-align: justify;">We can easily observe 8 communities, but are these clusters meaningful? The pink cluster on the right-hand side is highly interconnected (i.e. all nodes of the cluster have mutual correlations). Most of them can be identified as being breakfast diners (e.g. <a href="http://www.yelp.co.uk/search?find_desc=the+good+egg&#038;find_loc=Phoenix%2C+AZ%2C+USA" title="The Good Egg">The good egg</a>, <a href="http://www.yelp.co.uk/biz/the-breakfast-joynt-scottsdale-2" title="The breakfast joynt">The breakfast joynt</a> and <a href="http://www.yelp.co.uk/biz/orange-table-scottsdale" title="Orange table">Orange table</a>). Cool. This certainly makes sense, as most of these businesses have checkins from early morning until early afternoon. The yellow cluster on the top contains various department stores (including <a href="http://www.yelp.co.uk/biz/costco-phoenix-4" title="Costco">Costco</a>, <a href="http://www.yelp.co.uk/biz/nordstrom-rack-phoenix-2#query:nordstroms%20rack" title="Nordstrom Rack" target="_blank">Nordstrom Rack</a> and <a href="http://www.yelp.co.uk/biz/ikea-tempe" title="IKEA" target="_blank">IKEA</a>). Again meaningful, as most of them open their doors somewhere around 10AM and close around 7PM. At first sight, it seems strange that the coffee places are correlated into two separate groups (yellow cluster at the bottom and pink cluster on the top). The reason however is simple: some of them close in the late afternoon while others are open until midnight.</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The Neo4J/Gephi solution works remarkably well to visually identify the various business clusters from the Yelp dataset. In a next blog article, we will show how to use the <a href="http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" title="k-nearest neighbours algorithm">k-nearest neighbours algorithm</a> to automatically predict the type of business based solely upon the checkin information.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=308</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Counting triangles smarter (or how to beat Big Data vendors at their own game)</title>
		<link>https://datablend.be/?p=282</link>
		<comments>https://datablend.be/?p=282#comments</comments>
		<pubDate>Mon, 11 Feb 2013 10:02:00 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[exadata]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>
		<category><![CDATA[vertica]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=282</guid>
		<description><![CDATA[A few months ago, I discovered Vertica&#8217;s &#8220;Counting Triangles&#8221;-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends<p><a href="https://datablend.be/?p=282">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">A few months ago, I discovered Vertica&#8217;s <a href="http://www.vertica.com/2011/09/21/counting-triangles">&#8220;Counting Triangles&#8221;-article</a> through <a href="http://getprismatic.com/news/home">Prismatic</a>. The blog post describes a number of benchmarks on counting triangles in large networks. A <span class="highlight"><em>triangle</em></span> is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a <span class="highlight"><em>friendship triangle</em></span>. Counting all triangles within a large network is a rather <span class="highlight"><em>compute-intensive task</em></span>. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we have identified a triangle.</p>
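<p style="text-align: justify;">The naive approach can be written down as four nested loops. The Python sketch below is only meant to illustrate the idea on a toy graph. It assumes an undirected graph without self-loops, stored as a dictionary of neighbour sets, and the function name is made up for this example.</p>

```python
def count_triangles_naive(adjacency):
    """Count triangles in an undirected graph given as {vertex: set(neighbours)}.

    Assumes every edge is stored in both directions and there are no self-loops.
    """
    closed_walks = 0
    for origin in adjacency:
        for second in adjacency[origin]:
            for third in adjacency[second]:
                for fourth in adjacency[third]:
                    # Walking origin, second, third and landing back on origin
                    # means the three vertices are pairwise adjacent.
                    if fourth == origin:
                        closed_walks += 1
    # Every triangle is found from 3 starting vertices, in 2 directions each.
    return closed_walks // 6


# Toy graph: one triangle (1, 2, 3) plus vertex 4 that is only connected to 3.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangles = count_triangles_naive(toy)
```

<p style="text-align: justify;">The division by 6 is needed because the loops rediscover the same triangle from every corner and in both directions, which is part of what makes this form so expensive.</p>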
<p style="text-align: justify;">The Vertica article illustrates how to execute an optimised implementation of the above algorithm through <span class="highlight"><em>Hadoop</em></span> and their own <span class="highlight"><em>Massively Parallel Processing (MPP) Database</em></span> product (both being run on a 4-node cluster). The dataset involves the <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, containing around <em>86 million relationships</em>, resulting in around <em>285 million identified triangles</em>. As can be expected, the Vertica solution shines in all respects (counting all triangles in <span class="highlight"><em>97 seconds</em></span>), beating the Hadoop solution by a factor of <em>40</em>. A few weeks later, the Oracle guys published a similar <a href="http://structureddata.org/2011/10/17/counting-triangles-faster">blog post</a>, using their <span class="highlight"><em>ExaData</em></span> platform, beating Vertica&#8217;s results by a factor of <em>7</em>, clocking in at <span class="highlight"><em>14 seconds</em></span>.</p>
<p style="text-align: justify;">Although Vertica&#8217;s and Oracle&#8217;s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles through a <span class="highlight"><em>smarter algorithm</em></span> that is able to deliver similar performance on <span class="highlight"><em>commodity hardware</em></span> (i.e. my MacBook Pro Retina).</p>
<p>&nbsp;</p>
<h3>1. Doing it the smart way</h3>
<p style="text-align: justify;">The <a href="http://snap.stanford.edu/data/soc-LiveJournal1.html">LiveJournal social network graph</a>, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a <span class="highlight"><em>source</em></span> and <span class="highlight"><em>target vertex</em></span> (where each vertex is identified by a unique id). Relationships are assumed to be <span class="highlight"><em>bi-directional</em></span>: if person 1 knows person 2, person 2 also knows person 1.</p>
<script src="https://gist.github.com/4737460.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Let&#8217;s start by creating a <span class="highlight"><em>row-like</em></span> data structure for storing these relationships. The <em>key</em> of each row is the <em>id of the source vertex</em>. The row <em>values</em> are the id&#8217;s of all <em>target vertices</em> associated with the particular source vertex.</p>
<script src="https://gist.github.com/4737525.js"></script>
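<p style="text-align: justify;">As a rough in-memory equivalent of this row-like structure, one could use a Python dictionary of sets. This is a sketch and not the Datablend datastore itself. It assumes whitespace-separated <em>source target</em> lines and skips lines starting with <em>#</em>, which is how the SNAP download marks its header. The function name and the sample lines are made up.</p>

```python
from collections import defaultdict


def build_rows(lines):
    """Map each source vertex id to the set of its target vertex ids."""
    rows = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        source, target = map(int, line.split())
        rows[source].add(target)
    return rows


# A few made-up relationship lines in the "source target" format.
sample = ["# FromNodeId ToNodeId", "0 1", "0 2", "1 2", "2 0"]
rows = build_rows(sample)
```

<p style="text-align: justify;">For the real file, the same function can be fed an open file handle instead of a list, so the 1.3GB input is read line by line.</p>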
<p>&nbsp;</p>
<p style="text-align: justify;">With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let&#8217;s improve our data structure by <span class="highlight"><em>indexing</em></span> each relationship through its <span class="highlight"><em>lowest key</em></span>. So, even though the LiveJournal file declares the relationship as being &#8220;<em>2 0</em>&#8221;, we persist the relationship by assigning the <em>2</em>-value to the <em>0</em>-row. (Order doesn&#8217;t matter as relationships are bi-directional anyway.)</p>
<script src="https://gist.github.com/4737638.js"></script>
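<p style="text-align: justify;">In the same hedged spirit, the lowest-key indexing could look as follows in Python. Again, this is an illustrative sketch with made-up names, not the actual datastore code. Dropping self-loops is an extra assumption of the sketch, since a vertex related to itself can never be part of a triangle.</p>

```python
from collections import defaultdict


def build_lowest_key_rows(lines):
    """Store every relationship once, in the row of its smallest vertex id."""
    rows = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        a, b = map(int, line.split())
        if a == b:
            continue  # assumption: self-loops are ignored
        rows[min(a, b)].add(max(a, b))
    return rows


# "2 0" and "0 2" describe the same relationship and end up as one value in row 0.
sample = ["2 0", "0 1", "1 2", "0 2"]
rows = build_lowest_key_rows(sample)
```

<p style="text-align: justify;">Because sets ignore duplicates, a relationship that is declared in both directions in the input is stored only once.</p>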
<p>&nbsp;</p>
<p style="text-align: justify;">Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a <span class="highlight"><em>triangle</em></span>, its two <span class="highlight"><em>adjacent vertices</em></span> should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we need to check whether we can <span class="highlight"><em>find edges amongst the vertices</em></span> contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original <em>source</em>-values. By doing so, we get rid of one expensive <em>for</em>-loop. Nevertheless, the number of calculations that need to be executed is still close to <span class="highlight"><em>2 billion</em></span>! </p>
<script src="https://gist.github.com/4737835.js"></script>
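<p style="text-align: justify;">Expressed with Python sets, the check described above boils down to one set intersection per stored relationship. The sketch below assumes the rows are indexed by lowest key as explained earlier. The function name and the toy rows are hypothetical, and the sketch says nothing about how the custom datastore performs the same operation internally.</p>

```python
def count_triangles(rows):
    """Count triangles, where rows maps a vertex to its neighbours with a larger id."""
    empty = frozenset()
    total = 0
    for low, higher in rows.items():
        for mid in higher:
            # A vertex present in both rows is adjacent to low and to mid,
            # so it closes a triangle whose smallest vertex id is low.
            total += len(higher & rows.get(mid, empty))
    return total


# Toy rows for the edges 0-1, 0-2, 0-3, 1-2 and 2-3.
rows = {0: {1, 2, 3}, 1: {2}, 2: {3}}
triangles = count_triangles(rows)
```

<p style="text-align: justify;">Since every triangle is only discovered from its smallest vertex id, no correction for double counting is needed. Using <em>rows.get</em> avoids adding new keys while iterating when the rows are kept in a <em>defaultdict</em>.</p>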
<p>&nbsp;</p>
<h3>2. Persisting the relationships</h3>
<p style="text-align: justify;">The data structure as described above is persisted in a <span class="highlight"><em>custom datastore</em></span> that we developed at Datablend for powering the <a href="http://www.similr.li">similr</a>-engine (a chemical structure search engine). The datastore is <span class="highlight"><em>fully persistent</em></span> and optimised for quickly performing <span class="highlight"><em>set-based operations</em></span> (intersections, differences, unions, &#8230; ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around <em>20 seconds</em> on my MacBook Pro. An additional <em>4 seconds</em> is required for persisting the entire data structure to the datastore itself. So around <span class="highlight"><em>25 seconds</em></span> in total for effectively storing all 86 million relationships. Neither Vertica nor Oracle mentions the time it takes to persist the LiveJournal dataset within their respective databases. However, I assume it also takes them a few seconds to execute this load operation.</p>
<p style="text-align: justify;">What about <span class="highlight"><em>disk usage</em></span>? The custom Datablend datastore takes <span class="highlight"><em>second place</em></span>, requiring only 37 MB more than Oracle’s Hybrid Columnar Compression version.</p>
<script src="https://gist.github.com/4739843.js"></script>
<p>&nbsp;</p>
<h3>3. Calculating the triangles</h3>
<p style="text-align: justify;">The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 285 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes first place, clocking in at <span class="highlight"><em>9 seconds</em></span>! The calculation runs fully parallelized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!</p>
<script src="https://gist.github.com/4740331.js"></script>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">Datablend&#8217;s custom datastore is a very specific solution that targets a <span class="highlight"><em>particular range of Big Data computations</em></span>. It is in no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to <a href="http://www.similr.li">similr</a> and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=282</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Similr: blazingly fast chemical similarity searches</title>
		<link>https://datablend.be/?p=280</link>
		<comments>https://datablend.be/?p=280#comments</comments>
		<pubDate>Mon, 04 Feb 2013 10:00:38 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[similr]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=280</guid>
		<description><![CDATA[Today, Datablend announces that Similr is available for beta sign-up. Similr allows scientists (from both academia and industry) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms, which, combined, make it possible to identify matching compounds in millisecond time. Similr&#8217;s functionalities are available through a flexible and<p><a href="https://datablend.be/?p=280">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Today, Datablend announces that <a href="http://similr.li">Similr</a> is available for <a href="http://similr.li">beta sign-up</a>. Similr allows scientists (from both academia and industry) to quickly search for compounds that exhibit a particular <span class="highlight"><em>chemical structure</em></span>. It employs a wide range of <span class="highlight"><em>fingerprinting algorithms</em></span>, which, combined, make it possible to identify matching compounds in <em>millisecond</em> time. Similr&#8217;s functionalities are available through a flexible and expressive REST API and allow scanning more than 30 million compounds that have been made publicly available through <a href="http://pubchem.ncbi.nlm.nih.gov">PubChem</a>.</p>
<p style="text-align: justify;">Similr will provide <span class="highlight"><em>unlimited</em></span> API access to academics. Free commercial access, limited to 1,000 API calls a month, will also be available. Beyond that limit, customers can choose between a <span class="highlight"><em>pay-as-you-go subscription</em></span> and a dedicated installation that allows for the import of (private) compounds.</p>
<p style="text-align: justify;">Similr is being developed by Datablend, a Big Data consultancy company. Datablend&#8217;s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a <span class="highlight"><em>highly optimised chemical similarity search algorithm</em></span> that is able to scan millions of compounds at blazing speeds. Don&#8217;t hesitate to <a href="http://datablend.be/?page_id=37">contact us</a> if you have any questions related to Similr and/or Datablend.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=280</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Redis and Lua: a NoSQL power-horse</title>
		<link>https://datablend.be/?p=278</link>
		<comments>https://datablend.be/?p=278#comments</comments>
		<pubDate>Tue, 29 Jan 2013 09:59:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[chemoinformatics]]></category>
		<category><![CDATA[compound comparison]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[redis]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=278</guid>
		<description><![CDATA[Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes,<p><a href="https://datablend.be/?p=278">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, I&#8217;ve started implementing a number of Redis-based solutions for a Datablend customer. <a href="http://redis.io" target="_blank">Redis</a> is frequently referred to as the <span class="highlight"><em>Swiss Army Knife</em></span> of NoSQL databases and rightfully deserves that title. At its core, it is an <span class="highlight"><em>in-memory key-value</em></span> datastore. Values that are assigned to keys can be &#8216;structured&#8217; through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various &#8216;Big Data&#8217;-related problems. To illustrate this point, I reimplemented my MongoDB-based <a href="http://datablend.be/?p=962" target="_blank">molecular similarity search</a> through Redis and its integrated Lua support. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/redis-compound-comparison">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Redis &#8216;fingerprint&#8217; data model</h3>
<p style="text-align: justify;"><span class="highlight">Molecular similarity</span> refers to the <em>similarity</em> of <em>chemical compounds</em> with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called <span class="highlight"><em>fingerprints</em></span> (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the <a href="http://en.wikipedia.org/wiki/Jaccard_index" target='_blank'>Tanimoto coefficient</a>. This computation involves the calculation of <span class="highlight"><em>intersections</em></span> between sets of fingerprints, an operation that is natively supported by Redis.</p>
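<p style="text-align: justify;">As a minimal sketch of that computation (plain Python sets stand in for Redis sets here, and the fingerprint names are made up for illustration), the Tanimoto coefficient is the number of shared fingerprints divided by the number of distinct fingerprints across both compounds:</p>

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two fingerprint sets."""
    union_size = len(a.union(b))
    if union_size == 0:
        # Two empty fingerprint sets: treat them as not similar.
        return 0.0
    return len(a.intersection(b)) / union_size


# Hypothetical fingerprint identifiers, for illustration only.
query = {"fp1", "fp2", "fp3", "fp4"}
candidate = {"fp2", "fp3", "fp4", "fp5"}

score = tanimoto(query, candidate)
print(score)  # 3 shared fingerprints out of 5 distinct ones
```

<p style="text-align: justify;">In Redis itself, the corresponding set operations are available server-side through commands such as SINTER, SUNION and SCARD.</p>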
<p style="text-align: justify;">Our Redis-based data model for storing fingerprints requires three different data-structures:</p>
<ol>
<li>For each <span class="highlight"><em>compound</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store its set of fingerprints (where each fingerprint is again identified by a unique key).</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, identified by a <span class="highlight"><em>unique key</em></span>, we store the set of compounds containing this fingerprint. These fingerprint sets can be thought of as the inverted indexes of the compound sets mentioned above.</li>
<li>For each <span class="highlight"><em>fingerprint</em></span>, we store its number of occurrences through a dedicated <span class="highlight"><em>weight</em>-key</span>.</li>
</ol>
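<p style="text-align: justify;">The three structures above can be sketched with in-memory Python dictionaries. This is only an illustration: the key prefixes (c:, f: and w:) and the compound names are assumptions, not the names used in the actual implementation. In Redis, the two sets would typically be filled with SADD and the counter incremented with INCR.</p>

```python
from collections import defaultdict

# In-memory stand-ins for the three Redis structures (key names are hypothetical).
compound_fps = defaultdict(set)  # "c:<compound>"    -> set of fingerprints (Redis: SADD)
fp_compounds = defaultdict(set)  # "f:<fingerprint>" -> set of compounds    (Redis: SADD)
fp_weight = defaultdict(int)     # "w:<fingerprint>" -> occurrence counter  (Redis: INCR)


def store(compound, fingerprints):
    """Index one compound; assumes each compound is stored only once."""
    for fp in set(fingerprints):
        compound_fps["c:" + compound].add(fp)
        fp_compounds["f:" + fp].add(compound)
        fp_weight["w:" + fp] += 1


store("cmp1", ["a", "b", "c"])
store("cmp2", ["b", "c", "d"])

print(sorted(fp_compounds["f:b"]))  # compounds that contain fingerprint "b"
print(fp_weight["w:b"])             # number of compounds containing "b"
```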
<p>&nbsp;</p>
<p style="text-align: justify;">Once the fingerprints have been calculated, a few lines of code (among them lines 33 and 35 of the listing below) are sufficient to create both <span class="highlight"><em>inverted indexes</em></span> (compound->fingerprints and fingerprint->compounds) and to increment the accompanying <span class="highlight"><em>counters</em></span>.</p>
<script src="https://gist.github.com/4666154.js"></script>
<p>&nbsp;</p>
<h3>2. Finding similar chemical compounds</h3>
<p style="text-align: justify;">For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my <a href="http://datablend.be/?p=962" target="_blank">original MongoDB article</a>. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the <span class="highlight"><em>built-in Lua scripting</em></span> support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the <span class="highlight"><em>fingerprints of interest</em></span> (i.e. the minimal set of fingerprints that leads us to all compounds able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single <span class="highlight"><em>sort</em></span>-command; it takes the compound-key as input and sorts the contained fingerprints by the value of the external fingerprint weight keys. Out of this sorted list of fingerprints, we <span class="highlight"><em>sub-select the top x fingerprints</em></span> of interest. What a <span class="highlight"><em>powerful</em></span> and <span class="highlight"><em>elegant</em></span> command!</p>
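<p style="text-align: justify;">One way to size the set of fingerprints of interest is sketched below; it is an illustration under stated assumptions rather than the exact code of the Lua script. If the input compound has n fingerprints and the threshold is t, a matching compound must share at least ceil(t * n) of them, so it can miss at most n - ceil(t * n). Any n - ceil(t * n) + 1 of the input fingerprints therefore include at least one that every matching compound contains, and picking the rarest ones keeps the candidate set small. The function name and the sample weights are hypothetical. In Redis, the sorting step corresponds to SORT with the BY and LIMIT options.</p>

```python
import math


def fingerprints_of_interest(fingerprints, weights, threshold):
    """Return the rarest fingerprints that are guaranteed to hit every match."""
    n = len(fingerprints)
    # Minimum number of fingerprints a match must share with the input compound.
    # The small epsilon guards against floating point overshoot in threshold * n.
    min_shared = math.ceil(threshold * n - 1e-9)
    k = n - min_shared + 1
    # Rarest first; ties are broken by name to keep the result deterministic.
    rarest_first = sorted(fingerprints, key=lambda fp: (weights[fp], fp))
    return rarest_first[:k]


# Hypothetical occurrence counts per fingerprint.
weights = {"a": 50, "b": 3, "c": 7, "d": 120, "e": 1}
picked = fingerprints_of_interest({"a", "b", "c", "d", "e"}, weights, 0.7)
print(picked)
```

<p style="text-align: justify;">With five fingerprints and a threshold of 0.7, a match must share at least four of them, so the two rarest fingerprints suffice.</p>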
<p style="text-align: justify;">We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis <span class="highlight"><em>union</em></span>-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis <span class="highlight"><em>intersect</em></span>-command. Only compounds that satisfy the Tanimoto restriction are returned.</p>
<script src="https://gist.github.com/4667047.js"></script>
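<p style="text-align: justify;">The union and intersect steps can also be sketched outside Redis. The snippet below uses plain Python sets and a tiny made-up dataset; all compound and fingerprint names are hypothetical, and the fingerprints of interest are passed in as already computed. The union corresponds to the Redis SUNION command and the intersection to SINTER.</p>

```python
def similar_compounds(query, compound_fps, fp_compounds, interest, threshold):
    """Return {compound: tanimoto} for candidates that reach the threshold."""
    query_fps = compound_fps[query]
    # Union of the inverted-index sets of the fingerprints of interest (Redis: SUNION).
    candidates = set().union(*(fp_compounds[fp] for fp in interest))
    candidates.discard(query)
    results = {}
    for other in candidates:
        # Shared fingerprints between query and candidate (Redis: SINTER).
        shared = len(query_fps.intersection(compound_fps[other]))
        total = len(query_fps) + len(compound_fps[other]) - shared
        score = shared / total
        if score >= threshold:
            results[other] = score
    return results


# Hypothetical compounds and fingerprints.
compound_fps = {
    "q": {"a", "b", "c", "d"},
    "x": {"a", "b", "c", "e"},
    "y": {"a", "z"},
}
# Build the inverted index (fingerprint -> compounds).
fp_compounds = {}
for compound, fps in compound_fps.items():
    for fp in fps:
        fp_compounds.setdefault(fp, set()).add(compound)

matches = similar_compounds("q", compound_fps, fp_compounds, ["c", "d"], 0.6)
print(matches)
```

<p style="text-align: justify;">Compound y is never scored in this example, because it contains none of the fingerprints of interest; that is exactly the saving the inverted index provides.</p>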
<p>&nbsp;</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">With 25,000 stored compounds, Redis requires less than 20ms to retrieve compounds that are 70% similar to a particular input compound. That is snappier than my original MongoDB implementation. In addition, Redis requires less than 1GB of RAM to maintain a live index of the <a href="http://cactus.nci.nih.gov/download/roadmap/">460,000 PubChem compounds</a> that have at least one associated assay. This allows scientists to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=278</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hubway Data Visualization Challenge Entry: the flow of bikers</title>
		<link>https://datablend.be/?p=276</link>
		<comments>https://datablend.be/?p=276#comments</comments>
		<pubDate>Tue, 16 Oct 2012 09:57:49 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[hubway]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=276</guid>
		<description><![CDATA[Last week, Hubway announced its Data Visualization Challenge. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the closest station near your destination. For this challenge, Hubway released a CSV-file, containing over half a million rides. Each entry<p><a href="https://datablend.be/?p=276">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Last week, <a href="http://www.thehubway.com" target='_blank'>Hubway</a> announced its <a href="http://hubwaydatachallenge.org" target='_blank'>Data Visualization Challenge</a>. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the station closest to your destination. For this challenge, Hubway released a CSV file containing over half a million rides. Each entry contains the origin and destination station, as well as timing information and some anonymised demographic information. The purpose of the challenge is to create <span class="highlight">appealing visualizations</span> that provide Hubway with <span class="highlight">cool insights</span> into how customers are using their bikes. As I had 8 hours to spare on a flight to New York, I decided to give it a go.</p>
<p>&nbsp;</p>
<h3>1. Flow of bikers</h3>
<p style="text-align: justify;">The goal of my visualization is to depict how bikers <span class="highlight">flow</span> through the city of Boston, namely: &#8220;taking a specific station as starting point, to which other stations are people biking&#8221;. A <span class="highlight">classical, graph-based visualization</span> would show this flow, but would also be quite cluttered, as each origin-destination tuple would have its own edge, and would therefore fail to provide the grand overview. The use of a <a href="http://en.wikipedia.org/wiki/Flow_map" target='_blank'>flow map</a>, however, would make the visualization both appealing and insightful. Cartographers use flow maps to show the movement of objects from one location to another, such as the number of people in a migration, the amount of goods being traded, or the number of packets in a network. Flow maps reduce visual clutter by merging edges where possible.</p>
<p style="text-align: justify;">Having played around with the library in the past, I remembered that somebody had released a <a href="http://graphics.stanford.edu/papers/flow_map_layout/" target='_blank'>flow map layout implementation</a>. Taking their implementation as a starting point, I applied some modifications (related to the Mercator layout) and supplied it with my pre-processed Hubway biking data. For each station, I can now generate a separate map that visualises the flow of bikers towards other stations, where each station is mapped at its geographically correct location. As can be expected, most people bike to nearby stations, but others seem to enjoy biking to far-off locations. Let&#8217;s have a look at a few examples. The image below displays the flow map for the <span class="highlight">Boston University Central station located at 725 Commonwealth Avenue</span> (A32003). As this station is quite central to the city, we see that people bike off in almost all directions, although most of them keep close to the Charles River.</p>
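<p style="text-align: justify;">The input for such a flow map is, per origin station, a count of rides to every destination station. A minimal sketch of that aggregation is shown here; the column names (start_station and end_station), the sample rows and the station code B32006 are assumptions for illustration and may not match the actual Hubway CSV file.</p>

```python
import csv
import io
from collections import Counter

# Tiny made-up sample in the assumed CSV layout.
SAMPLE = """start_station,end_station
A32003,B32006
A32003,B32006
A32003,C32012
C32012,A32003
"""


def destination_counts(csv_text, origin):
    """Count rides from one origin station to each destination station."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        row["end_station"] for row in reader if row["start_station"] == origin
    )


flows = destination_counts(SAMPLE, "A32003")
print(flows.most_common())
```

<p style="text-align: justify;">Each resulting count can then serve as the weight of the edge from the origin to that destination before the flow map layout merges edges.</p>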
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/flow1.jpg">
<p align="center"><img width="600" src="http://datablend.be/wp-content/uploads/flow1.jpg" alt="flow1" /></p>
<p></a></p>
<p>&nbsp;</p>
<p>If we generate the flow map for a biking station near the edge of the city, such as <span class="highlight">Andrew Station on Dorchester Avenue</span> (C32012), an entirely different flow pattern can be observed, as biking destinations are concentrated on the east side of Boston.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/flow21.jpg">
<p align="center"><img width="600" src="http://datablend.be/wp-content/uploads/flow21.jpg" alt="flow2" /></p>
<p></a></p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">The current application could easily be extended to filter trips on demographics and/or timing information. One could also overlay various flow maps in order to detect similarities between flows of bikers. If people are interested in extending my implementation, I am willing to upload my &#8220;code-hacking&#8221; to GitHub so that the project can be forked. Just let me know.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>https://datablend.be/?feed=rss2&#038;p=276</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
