<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Datablend &#187; visualisation</title>
	<atom:link href="http://datablend.be/?cat=32&#038;feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://datablend.be</link>
	<description>Big Data Simplified</description>
	<lastBuildDate>Mon, 07 Sep 2015 09:04:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6.1</generator>
		<item>
		<title>The power of Manchester City: a data analysis</title>
		<link>http://datablend.be/?p=491</link>
		<comments>http://datablend.be/?p=491#comments</comments>
		<pubDate>Sun, 17 Aug 2014 11:55:16 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[graph]]></category>
		<category><![CDATA[infographics]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=491</guid>
		<description><![CDATA[What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the Opta dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data has been loaded in the neo4j graph database<p><a href="http://datablend.be/?p=491">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the <a href="http://www.optasports.com" title="Opta" target="_blank">Opta</a> dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data has been loaded in the <a href="http://neo4j.org" title="neo4j" target="_blank">neo4j</a> graph database and the <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html" title="Cypher Query Language" target="_blank">Cypher Query Language</a> has been used to extract the passing statistics.</p>
<p style="text-align: justify;">For each of the teams, we calculated the average position of their players on the pitch (based upon their individual actions). The thickness of the red edges visualises the number of successful passes between two individual players. Finally, we visualised the number of successful passes of each player (see shirt number) and the distribution of the length of his passes. Have fun exploring this infographic!</p>
<p> </p>
<p><center><a href="http://datablend.be/wp-content/uploads/2014/08/VoetbalInfographic-2.jpg"><img src="http://datablend.be/wp-content/uploads/2014/08/VoetbalInfographic-2.jpg" alt="VoetbalInfographic-2" width="400" height="300" class="alignnone size-medium wp-image-498" /></a></center> </p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=491</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Datablend launches vk14-bingo.be</title>
		<link>http://datablend.be/?p=403</link>
		<comments>http://datablend.be/?p=403#comments</comments>
		<pubDate>Mon, 12 May 2014 11:26:55 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[visualisation]]></category>
		<category><![CDATA[vk14]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=403</guid>
		<description><![CDATA[Are you, too, being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows you at a glance what each party stands for? Look no further and use vk14-bingo.be. We have analysed the various party programmes word for<p><a href="http://datablend.be/?p=403">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p>Are you, too, being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows you at a glance what each party stands for? Look no further and use vk14-bingo.be. We have analysed the various party programmes word for word and plotted their core ideas against each other. You can now easily check which themes each party focuses on and compare the parties with one another.</p>
<p>But there is more. Our politicians have meanwhile found their way to Twitter and make diligent use of this new medium to promote themselves, their party and their positions. But how faithful are they to their own party programme? vk14-bingo.be analyses in real time the tweets of more than 850 Flemish politicians with a single goal: which party is the first to virtually complete its programme and win #vk14-bingo!</p>
<p>Do you have questions or remarks about this analysis? Contact us via <a title="vk14-bingo@datablend.be" href="mailto:vk14-bingo@datablend.be" target="_blank">vk14-bingo@datablend.be</a>. Interested in an analysis of your own data, large or small? Visit <a title="datablend.be" href="http://www.datablend.be" target="_blank">datablend.be</a> or contact us via <a title="vk14-bingo@datablend.be" href="mailto:vk14-bingo@datablend.be" target="_blank">info@datablend.be</a> and we will gladly come by.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=403</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The power of graphs to analyse biological data</title>
		<link>http://datablend.be/?p=344</link>
		<comments>http://datablend.be/?p=344#comments</comments>
		<pubDate>Mon, 02 Dec 2013 07:04:34 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphconnect]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=344</guid>
		<description><![CDATA[Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets. The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.]]></description>
				<content:encoded><![CDATA[<p>Watch Davy Suvee present at <a href="http://www.graphconnect.com/london/" title="GraphConnect London 2013">GraphConnect London 2013</a> on the power of graph databases to analyse biological datasets.</p>
<p><iframe src="//player.vimeo.com/video/80463932" width="500" height="281" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
<p><a href="http://vimeo.com/80463932">The Power of Graphs to Analyze Biological Data &#8211; Davy Suvee @ GraphConnect London 2013</a> from <a href="http://vimeo.com/neo4j">Neo Technology</a> on <a href="https://vimeo.com">Vimeo</a>.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=344</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yelp graph: checkin-based business clustering</title>
		<link>http://datablend.be/?p=308</link>
		<comments>http://datablend.be/?p=308#comments</comments>
		<pubDate>Sun, 01 Dec 2013 11:51:58 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>
		<category><![CDATA[yelp]]></category>

		<guid isPermaLink="false">http://datablend.be/?p=308</guid>
		<description><![CDATA[Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin-sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible<p><a href="http://datablend.be/?p=308">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Recently, <a title="yelp" href="http://yelp.com" target="_blank">Yelp</a> made available <a title="a sample dataset" href="http://www.yelp.co.uk/dataset_challenge" target="_blank">a sample dataset</a> from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin-sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible to visually cluster businesses by category, based purely on their checkin data. The checkin data itself is available on a day-hour level: for each business, it is possible to retrieve the number of checkins on, for example, a Sunday afternoon between 3 and 4. So, with only this data in mind, are we able to cluster businesses as being restaurants or fashion stores, based purely on the correlations calculated amongst their checkin data? For this experiment, we use the <a title="Neo4J" href="http://neo4j.org" target="_blank">Neo4J</a> graph database for storing our checkin-based correlation graph and employ the <a title="Gephi" href="http://gephi.org" target="_blank">Gephi</a> graph visualisation platform for interpreting the identified business communities/clusters. As always, the <a title="full source code" href="https://github.com/datablend/yelp-graph" target="_blank">full source code</a> of this article can be found on the Datablend public GitHub repository (although you will need to acquire the dataset yourself through the <a title="Yelp Dataset Challenge" href="http://www.yelp.co.uk/dataset_challenge" target="_blank">Yelp Dataset Challenge</a> portal).</p>
<h3>1. Building the Neo4J checkin correlation graph</h3>
<p style="text-align: justify;">We start by parsing both the business and checkin JSON files from the Yelp Dataset Challenge. Unfortunately, checkin data is available for only 8,282 out of the 11,537 supplied businesses. In addition, many of these have only a limited set of associated checkins. Hence, in order to make sure that only relevant correlations are calculated, we ignore the ones that have fewer than 100 checkins, resulting in around 1920 remaining businesses.</p>
<p style="text-align: justify;">Next, we try to identify the correlation between two businesses by using the Pearson Correlation Coefficient (read <a title="this site" href="https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php" target="_blank">this site</a> for a nice introduction). Simply put, we try to identify whether a linear association exists between the checkins of two individual businesses. In our case, the calculation is based upon 168 data points (24 hours x 7 days), the idea being that two breakfast restaurants will get most of their checkins from the morning till noon, while two bars will get most of their checkins during the evening and at night. Hence, we expect the correlation between two businesses of the same type to be quite high, while different types of businesses (e.g. a breakfast restaurant and a bar) will result in little or no correlation.</p>
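<p style="text-align: justify;">To make the idea concrete, here is a minimal sketch of the coefficient over 168 hourly slots. It is written in plain Python as a stand-in, not the code used for this article, and the <em>weekly_profile</em> helper and its opening hours are made-up illustration data rather than values from the Yelp dataset.</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant profile has no defined correlation; this sketch returns 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def weekly_profile(open_hours, per_hour):
    """Hypothetical 168-slot profile: per_hour checkins in every open hour, 7 days."""
    return [per_hour if hour in open_hours else 0
            for _day in range(7) for hour in range(24)]

# Made-up businesses: two morning places and one evening place.
breakfast_a = weekly_profile(range(6, 12), 5)
breakfast_b = weekly_profile(range(7, 13), 8)
bar = weekly_profile(range(20, 24), 6)

print(pearson(breakfast_a, breakfast_b))  # clearly positive
print(pearson(breakfast_a, bar))          # negative: the busy hours do not overlap
```

<p style="text-align: justify;">Note that the coefficient ignores scale: a place with twice the checkins but the same daily rhythm correlates just as strongly.</p>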
<p style="text-align: justify;">Time to get our hands dirty. After parsing the data files, we use the existing Apache Commons Math library to calculate the pairwise correlation between the checkin datasets of the 1920 businesses. If the resulting coefficient is 0.8 or higher, we consider both businesses to be correlated. We create a unique node for each business within the Neo4J graph and connect them via a &#8220;correlated&#8221;-relationship.</p>
<script src="https://gist.github.com/7704584.js"></script>
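<p style="text-align: justify;">For readers who prefer to see the selection logic in one place, the steps above can be sketched roughly as follows. This is plain Python rather than the Java/Neo4J code of the gist, the edge list stands in for the &#8220;correlated&#8221;-relationships, and the business ids and profiles are invented for illustration.</p>

```python
import math
from itertools import combinations

MIN_CHECKINS = 100  # businesses with fewer total checkins are ignored
THRESHOLD = 0.8     # coefficient at or above which two businesses count as correlated

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant sequence."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def correlated_edges(profiles):
    """profiles maps a business id to its list of 168 hourly checkin counts."""
    active = {biz: p for biz, p in profiles.items() if sum(p) >= MIN_CHECKINS}
    edges = []
    for a, b in combinations(sorted(active), 2):
        r = pearson(active[a], active[b])
        if r >= THRESHOLD:
            edges.append((a, b, r))
    return edges

def profile(hours, count):
    """Invented weekly profile: count checkins in each listed hour, every day."""
    return [count if h in hours else 0 for _day in range(7) for h in range(24)]

profiles = {
    "diner_1": profile(range(6, 12), 5),
    "diner_2": profile(range(6, 12), 9),
    "bar_1": profile(range(20, 24), 6),
    "tiny": profile(range(6, 12), 1),  # only 42 checkins, so it is filtered out
}
edges = correlated_edges(profiles)
print([(a, b) for a, b, _r in edges])
```

<p style="text-align: justify;">Only businesses that end up in at least one edge would become nodes in the graph, which is why the node count is far lower than the number of businesses that pass the checkin filter.</p>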
<p style="text-align: justify;">The generated graph contains 606 unique nodes (i.e. businesses that are correlated to at least one other business) and 2585 edges (i.e. actual correlations).</p>
<h3>2. Gephi interpretation</h3>
<p style="text-align: justify;">Our next task is to observe whether groups of businesses exist that are highly correlated (i.e. highly interconnected) and identify whether these correlations make sense. In order to do so, we import our Neo4J correlation graph in Gephi through the <a title="Gephi Neo4J plugin" href="https://marketplace.gephi.org/plugin/neo4j-graph-database-support/" target="_blank">Gephi Neo4J plugin</a>. Once loaded, we run the <a title="modularity" href="http://wiki.gephi.org/index.php/Modularity" target="_blank">modularity</a>-function to identify meaningful communities. These computed communities are then used to partition (i.e. color) the nodes (and their related edges) so that clusters can easily be observed. Next, we apply K-core filtering, in our case 3-core, to keep the subgraph in which all nodes have a degree of at least 3 (i.e. 3 relationships with other nodes). The size of the nodes (and their associated labels) is configured to be proportional to their degree. Finally, we apply the <a title="Fruchterman-Reingold" href="http://wiki.gephi.org/index.php/Fruchterman-Reingold" target="_blank">Fruchterman-Reingold</a> layout in order to clearly visualise the various clusters.</p>
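<p style="text-align: justify;">Gephi performs the K-core filtering for us, but the idea is simple enough to sketch. The snippet below is an illustrative plain-Python version, not Gephi's implementation: it keeps removing nodes with fewer than k neighbours until none are left to remove. The toy graph is invented.</p>

```python
def k_core(adjacency, k):
    """Return the k-core of an undirected graph given as {node: set_of_neighbours}."""
    graph = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(graph):
            if k > len(graph[node]):
                # Detach the node from its neighbours before dropping it,
                # which may push those neighbours below k in a later pass.
                for other in graph[node]:
                    graph[other].discard(node)
                del graph[node]
                changed = True
    return graph

# Invented example: a fully connected group of four, plus a chain hanging off "a".
adjacency = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
core = k_core(adjacency, 3)
print(sorted(core))
```

<p style="text-align: justify;">The removal has to be repeated because dropping one weakly connected node can lower the degree of its neighbours below the threshold as well.</p>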
<p style="text-align: justify;"><a href="http://datablend.be/wp-content/uploads/2013/12/yelp-graph.jpg" target="_blank"><img class="alignnone size-medium wp-image-334" alt="yelp-graph" src="http://datablend.be/wp-content/uploads/2013/12/yelp-graph.jpg"/></a></p>
<p style="text-align: justify;">We can easily observe 8 communities, but are these clusters meaningful? The pink cluster on the right-hand side is highly interconnected (i.e. all nodes of the cluster have mutual correlations). Most of them can be identified as being breakfast diners (e.g. <a href="http://www.yelp.co.uk/search?find_desc=the+good+egg&#038;find_loc=Phoenix%2C+AZ%2C+USA" title="The Good Egg">The good egg</a>, <a href="http://www.yelp.co.uk/biz/the-breakfast-joynt-scottsdale-2" title="The breakfast joynt">The breakfast joynt</a> and <a href="http://www.yelp.co.uk/biz/orange-table-scottsdale" title="Orange table">Orange table</a>). Cool. This certainly makes sense, as most of these businesses have checkins from early morning until early afternoon. The yellow cluster on the top contains various department stores (including <a href="http://www.yelp.co.uk/biz/costco-phoenix-4" title="Costco">Costco</a>, <a href="http://www.yelp.co.uk/biz/nordstrom-rack-phoenix-2#query:nordstroms%20rack" title="Nordstrom Rack" target="_blank">Nordstrom Rack</a> and <a href="http://www.yelp.co.uk/biz/ikea-tempe" title="IKEA" target="_blank">IKEA</a>). Again meaningful, as most of them open their doors somewhere around 10AM and close around 7PM. At first sight, it seems strange that the coffee places are correlated into two separate groups (yellow cluster at the bottom and pink cluster on the top). The reason however is simple: some of them close in the late afternoon while others are open until midnight.</p>
<h3>3. Conclusion</h3>
<p style="text-align: justify;">The Neo4J/Gephi solution works remarkably well to visually identify the various business clusters from the Yelp dataset. In a next blog article, we will show how to use the <a href="http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm" title="k-nearest neighbours algorithm">k-nearest neighbours algorithm</a> to automatically predict the type of business based solely upon the checkin information.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=308</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hubway Data Visualization Challenge Entry: the flow of bikers</title>
		<link>http://datablend.be/?p=276</link>
		<comments>http://datablend.be/?p=276#comments</comments>
		<pubDate>Tue, 16 Oct 2012 09:57:49 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[hubway]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=276</guid>
		<description><![CDATA[Last week, Hubway announced its Data Visualization Challenge. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the closest station near your destination. For this challenge, Hubway released a CSV-file, containing over half a million rides. Each entry<p><a href="http://datablend.be/?p=276">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Last week, <a href="http://www.thehubway.com" target='_blank'>Hubway</a> announced its <a href="http://hubwaydatachallenge.org" target='_blank'>Data Visualization Challenge</a>. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the closest station near your destination. For this challenge, Hubway released a CSV-file, containing over half a million rides. Each entry contains the origin and destination station as well as the timing-information and some anonymized demographic information. The purpose of the challenge is to create <span class="highlight">appealing visualizations</span> that provide Hubway with <span class="highlight">cool insights</span> in how customers are using their bikes. As I had 8 hours to spare on a flight to New York, I decided to give it a go.</p>
<p>&nbsp;</p>
<h3>1. Flow of bikers</h3>
<p style="text-align: justify;">The goal of my visualization is to depict how bikers <span class="highlight">flow</span> through the city of Boston, namely: &#8220;taking a specific station as starting point, to which other stations are people biking&#8221;. A <span class="highlight">classical, graph-based visualization</span> would show this flow, but would also be quite cluttered as each origin-destination tuple would have its own edge, thereby failing to provide the grand overview. The use of a <a href="http://en.wikipedia.org/wiki/Flow_map" target='_blank'>flow map</a> however, would make the visualization both appealing and insightful. Cartographers use flow maps to show the movement of objects from one location to another, such as the number of people in a migration, the amount of goods being traded, or the number of packets in a network. Flow maps reduce visual clutter by merging edges where possible.</p>
<p style="text-align: justify;">Having played around with the library in the past, I remembered that somebody had released a <a href="http://graphics.stanford.edu/papers/flow_map_layout/" target='_blank'>flow map layout implementation</a>. Taking their implementation as a starting point, I applied some modifications (related to the Mercator layout) and supplied it with my pre-processed Hubway biking data. For each station, I can now generate a separate map that visualises the flow of bikers towards other stations, where each station is mapped at its geographically correct location. As can be expected, most people bike to close-by stations, but others seem to enjoy biking to far-off locations. Let&#8217;s have a look at a few examples. The image below displays the flow map for the <span class="highlight">Boston University Central station located at 725 Commonwealth Avenue</span> (A32003). As this station is quite central to the city, we see that people bike off in almost all directions, although most of them keep close to the Charles River.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/flow1.jpg">
<p align="center"><img width="600" src="http://datablend.be/wp-content/uploads/flow1.jpg" alt="flow1" /></p>
<p></a></p>
<p>&nbsp;</p>
<p>If we generate the flow map for a biking station near the edge of the city, such as <span class="highlight">Andrew Station on Dorchester Avenue</span> (C32012), an entirely different flow pattern can be observed, as biking destinations are concentrated on the east side of Boston.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/flow21.jpg">
<p align="center"><img width="600" src="http://datablend.be/wp-content/uploads/flow21.jpg" alt="flow2" /></p>
<p></a></p>
<p>&nbsp;</p>
<h3>2. Conclusion</h3>
<p style="text-align: justify;">The current application could easily be extended to filter trips on demographics and/or timing information. One could also overlay various flow maps in order to detect similarities between flows of bikers. If people are interested in extending my implementation, I am willing to upload my &#8220;code-hacking&#8221; to GitHub so that the project can be forked. Just let me know.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=276</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Circle through your Google Analytics data with Neo4J and Circos</title>
		<link>http://datablend.be/?p=267</link>
		<comments>http://datablend.be/?p=267#comments</comments>
		<pubDate>Sun, 11 Mar 2012 09:51:19 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[circos]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[NoSQL]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=267</guid>
		<description><![CDATA[Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights, is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including<p><a href="http://datablend.be/?p=267">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">Storing <span class="highlight">massive amounts of data</span> in a NoSQL data store is just one side of the <span class="highlight">Big Data</span> equation. Being able to visualize your data in such a way that you can easily gain <span class="highlight">deeper insights</span> is where things really start to get interesting. Lately, I&#8217;ve been exploring various options for visualizing (directed) graphs, including <a href="http://circos.ca/" target='_blank'>Circos</a>. Circos is an amazing software package that visualizes your data through a <span class="highlight">circular layout</span>. Although it was originally designed for displaying <span class="highlight">genomic data</span>, it allows you to create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table&#8217;s <span class="highlight">columns</span> and <span class="highlight">rows</span> are represented by <span class="highlight">segments</span> around the circle. Individual <span class="highlight">cells</span> are shown as <span class="highlight">ribbons</span>, which <span class="highlight">connect</span> the corresponding row and column segments. The ribbons themselves are <span class="highlight">proportional</span> in width to the value in the cell.</p>
<p>&nbsp;</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos-visualize-table.png">
<p align="center"><img width="400" src="http://datablend.be/wp-content/uploads/circos-visualize-table.png" alt="circos" /></p>
<p></a>  </p>
<p>&nbsp;</p>
<p style="text-align: justify;">When visualizing a <span class="highlight">directed graph</span>, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the <span class="highlight">key data points</span> within your table. In my case, I want to better understand the <span class="highlight">flow of visitors</span> to and within the datablend site and blog; where do visitors come from (direct, referral, search, &#8230;) and how do they navigate between pages. The rest of this article details how to 1) retrieve the <span class="highlight">raw visit information</span> through the Google Analytics API, 2) <span class="highlight">persist</span> this information <span class="highlight">as a graph</span> in Neo4J and 3) query and <span class="highlight">preprocess</span> this data for visualization through Circos. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/neo4j-google-analytics">Datablend public GitHub repository</a>.</p>
<p>&nbsp;</p>
<h3>1. Retrieving your Google Analytics data</h3>
<p style="text-align: justify;">Let&#8217;s start by retrieving the <span class="highlight">raw Google Analytics data</span>. The Google Analytics data API provides access to all <span class="highlight">dimensions</span> and <span class="highlight">metrics</span> that can be queried through the web application. In my case, I&#8217;m interested in retrieving the <span class="highlight"><em>previous page path</em></span> property for each page view. If a visitor enters through a page outside of the datablend website, the <em>previous page path</em> is marked as <span class="highlight"><em>(entrance)</em></span>. Otherwise, it contains the <span class="highlight">internal path</span>. We will use Google&#8217;s Java Data API to connect and retrieve this information. We are particularly interested in the <span class="highlight"><em>pagePath</em></span>, <span class="highlight"><em>pageTitle</em></span>, <span class="highlight"><em>previousPagePath</em></span> and <span class="highlight"><em>medium</em></span> dimensions, while our metric of choice is the number of <span class="highlight"><em>pageViews</em></span>. After setting the date range, the <span class="highlight">feed of entries</span> that satisfy these criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the datablend website, we store the specific <span class="highlight">medium</span> (direct, referral, search, &#8230;) as previous path.</p>
<script src="https://gist.github.com/2011682.js"></script>
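<p style="text-align: justify;">The cleaning step described above can be sketched in a few lines. This is a plain-Python illustration rather than the Java code of the gist: it assumes each feed entry has already been turned into a dictionary keyed by the dimension and metric names mentioned in the text, and the sample entries are invented.</p>

```python
ENTRANCE = "(entrance)"  # previous page path reported for visits that start outside the site

def to_navigations(entries):
    """Turn raw entries into source/target navigations, using the medium for entrances."""
    navigations = []
    for entry in entries:
        source = entry["previousPagePath"]
        if source == ENTRANCE:
            # The visit came from outside the site: store the medium as previous path.
            source = entry["medium"]
        navigations.append({
            "source": source,
            "target": entry["pagePath"],
            "views": int(entry["pageViews"]),
        })
    return navigations

# Invented sample entries, assumed to be pre-parsed into dictionaries.
entries = [
    {"pagePath": "/blog", "pageTitle": "Blog", "previousPagePath": "(entrance)",
     "medium": "referral", "pageViews": "12"},
    {"pagePath": "/about", "pageTitle": "About", "previousPagePath": "/blog",
     "medium": "referral", "pageViews": "3"},
]
navigations = to_navigations(entries)
print(navigations)
```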
<p>&nbsp;</p>
<h3>2. Storing navigational data as a directed graph in Neo4J</h3>
<p style="text-align: justify;">The set of site navigations can easily be stored as a directed graph in the the <span class="highlight">degree</span> of my nodes is correct if I would perform other types of calculations. For each individual navigation relationship, we also store <span class="highlight">the date of visit</span>.</p>
<script src="https://gist.github.com/2011787.js"></script>
<p>&nbsp;</p>
<h3>3. Creating the Circos tabular data format</h3>
<p style="text-align: justify;">The <span class="highlight">Circos tabular data format</span> is quite easy to construct. It&#8217;s basically a <span class="highlight">tab-delimited file</span> with row and column headers. A cell is interpreted as a <span class="highlight">value</span> that <span class="highlight">flows</span> from the <span class="highlight">row entity</span> to the <span class="highlight">column entity</span>. We will use the <a target='_blank' href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html">Neo4J Cypher query language</a> to retrieve the data of interest, namely all navigations that occurred within a <span class="highlight">certain time period</span>. Doing so allows us to create <span class="highlight">historical visualizations</span> of our navigations and observe how visit flow behaviors are changing over time.</p>
<script src="https://gist.github.com/2011910.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Next, we create the tab-delimited file itself. We <span class="highlight">iterate</span> through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the <span class="highlight">two-dimensional array</span> by <span class="highlight">normalizing</span> (i.e. summing) the number of navigations between the source and target paths. At the end, we <span class="highlight">filter</span> this occurrence matrix on the <span class="highlight">minimal number</span> of required navigations. This ensures that we will only create segments for paths that are <span class="highlight">relevant</span> in the total population. As a final step, we <span class="highlight">print</span> the occurrence matrix as a tab-delimited file. For each path, we will use a <span class="highlight">shorthand</span> as the Circos renderer seems to have problems with long string identifiers.</p>
<script src="https://gist.github.com/2011992.js"></script>
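<p style="text-align: justify;">As a rough plain-Python illustration of these steps (not the Java code of the gist), the sketch below sums navigations per source/target pair, drops pairs below a minimum count, assigns shorthands and prints a tab-delimited table. Filtering per cell rather than per path, the <em>data</em> label in the header corner and the sample navigations are all assumptions made for this example.</p>

```python
from collections import defaultdict

def circos_table(navigations, min_count):
    """navigations: iterable of (source, target) pairs. Returns (legend, table_text)."""
    counts = defaultdict(int)
    for source, target in navigations:
        counts[(source, target)] += 1
    # Keep only the pairs that occur often enough to be worth a ribbon.
    kept = {pair: n for pair, n in counts.items() if n >= min_count}
    paths = sorted({path for pair in kept for path in pair})
    # Short identifiers instead of long paths.
    legend = {path: "l%d" % i for i, path in enumerate(paths)}
    lines = ["\t".join(["data"] + [legend[p] for p in paths])]
    for row in paths:
        cells = [str(kept.get((row, col), 0)) for col in paths]
        lines.append("\t".join([legend[row]] + cells))
    return legend, "\n".join(lines)

# Invented navigations: three referrals to the blog, two blog-to-about clicks, one search hit.
navigations = ([("referral", "/blog")] * 3
               + [("/blog", "/about")] * 2
               + [("search", "/about")])
legend, table = circos_table(navigations, 2)
for path, short in legend.items():
    print(short, path)
print(table)
```

<p style="text-align: justify;">Each row is a source and each column a target, so a cell holds the number of navigations that flow from the row entity to the column entity.</p>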
<p>&nbsp;</p>
<p style="text-align: justify;">The text below is a sample of the output generated by the <span class="highlight"><em>printCircosData</em></span> method. It first prints the legend (matching shorthands with actual paths). Next it prints the tab-delimited Circos table.</p>
<script src="https://gist.github.com/2012044.js"></script>
<p>&nbsp;</p>
<h3>4. Use the Circos power</h3>
<p style="text-align: justify;">Although Circos can be installed on your local computer, we will use its <a target='_blank' href="http://mkweb.bcgsc.ca/tableviewer/">online version</a> to create the visualization of our data. Upload your tab-delimited file and just wait a few seconds before enjoying the <span class="highlight">beautiful rendering</span> of your site&#8217;s navigation information.</p>
<p><a target='_blank' href="http://datablend.be/wp-content/uploads/circos1.jpg">
<p align="center"><img width="700" src="http://datablend.be/wp-content/uploads/circos1.jpg" alt="circos" /></p>
<p></a>  </p>
<p style="text-align: justify;">At a glance, we can already see that the <span class="highlight">l3-segment</span> (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The <span class="highlight">outer 3 rings</span> visualize the total number of navigations that are <span class="highlight">leaving</span> and <span class="highlight">entering</span> this particular path. In the case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its <span class="highlight">total segment count</span> (inner ring) is built up entirely out of navigations that have a referral as source. The <span class="highlight">l6-segment</span> seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my <a target='_blank' href="http://datablend.be/?page_id=44&#038;paged=2">&#8220;The joy of algorithms and NoSQL: a MongoDB example&#8221;</a>-article. Most of its traffic is received through referrals, while a decent amount is also generated through <span class="highlight">direct</span> (l17-segment) and <span class="highlight">search</span> (l27-segment) traffic. The <span class="highlight">l15-segment</span> (my blog&#8217;s main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.</p>
<p style="text-align: justify;">With just a few tweaks to the Circos input data, we can easily <span class="highlight">focus</span> on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/circos2.jpg"><img width="700" src="http://datablend.be/wp-content/uploads/circos2.jpg" alt="circos" /></a></p>
<h3>5. Conclusions</h3>
<p style="text-align: justify;">In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=267</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Running along the graph using Neo4J Spatial and Gephi</title>
		<link>http://datablend.be/?p=262</link>
		<comments>http://datablend.be/?p=262#comments</comments>
		<pubDate>Wed, 04 Jan 2012 09:48:13 +0000</pubDate>
		<dc:creator>Davy Suvee</dc:creator>
				<category><![CDATA[gephi]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[spatial]]></category>
		<category><![CDATA[visualisation]]></category>

		<guid isPermaLink="false">http://datablend22.lin3.nucleus.be/?p=262</guid>
		<description><![CDATA[When I started running some years ago, I bought a Garmin Forerunner 405. It&#8217;s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website<p><a href="http://datablend.be/?p=262">Continue Reading →</a></p>]]></description>
				<content:encoded><![CDATA[<p style="text-align: justify;">When I started running some years ago, I bought a <a target='_blank' href="https://buy.garmin.com/shop/shop.do?pID=11039&#038;ra=true#owners">Garmin Forerunner 405</a>. It&#8217;s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the <a target='_blank' href="http://connect.garmin.com">Garmin Connect website</a>. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including <span class="highlight"><em>distance</em></span>, <span class="highlight"><em>average pace</em></span>, <span class="highlight"><em>elevation loss/gain</em></span> and <span class="highlight"><em>lap splits</em></span>. It also visualizes your run by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my most recent runs can be found <a target='_blank' href="http://connect.garmin.com/activity/138373187">here</a>.</p>
<p style="text-align: justify;">Apart from <span class="highlight"><em>simple aggregations</em></span> such as total distance and average speed, the Garmin Connect website provides little or no support for gaining deeper insights into all of my runs. As I often run the same course, it would be interesting to calculate my <span class="highlight"><em>average pace at specific locations</em></span>. By combining the data of all of my courses, I could deduce <span class="highlight"><em>frequently encountered locations</em></span>. Finally, could there be a <span class="highlight"><em>correlation</em></span> between my <span class="highlight"><em>average pace</em></span> and my <span class="highlight"><em>distance from home?</em></span> In order to come up with answers to these questions, I will import my running data into a <a target='_blank' href="https://github.com/neo4j/spatial">Neo4J Spatial</a> datastore. Neo4J Spatial extends the <a target='_blank' href="http://neo4j.org/">Neo4J Graph Database</a> with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of <a target='_blank' href="http://gephi.org/">Gephi</a>, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.</p>
<p>&nbsp;</p>
<h3>1. Extracting GPX data</h3>
<p style="text-align: justify;">The Garmin Connect website allows you to download running data in various formats, including <span class="highlight"><em>KML</em></span>, <span class="highlight"><em>TCX</em></span> and <span class="highlight"><em>GPX</em></span>. <a target='_blank' href="http://topografix.com/gpx.asp">GPX</a> (the GPS Exchange Format) is a light-weight XML data format that is used for interchanging GPS data (waypoints, routes, and tracks) between applications and web services. Below, you can find a GPX extract enumerating several tracked points. Each of these points contains the <span class="highlight"><em>GPS location</em></span>, the <span class="highlight"><em>elevation</em></span> and the corresponding <span class="highlight"><em>timestamp</em></span>.</p>
<script src="https://gist.github.com/1559458.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Based upon this data, one is able to calculate various metrics, including <span class="highlight"><em>pace</em></span>. For this, we will use <a target='_blank' href="http://gpstools.sourceforge.net/">GPSdings</a>, a Java library that provides the required functionality to extract and analyze GPX data. We start by reading in a GPX file. Afterwards, we <span class="highlight"><em>analyze</em></span> the content using the GPSdings <em>TrackAnalyzer</em> which, amongst other metrics, calculates the pace for each point that was tracked during a run. The information we need is stored in the first segment of the first track.</p>
<script src="https://gist.github.com/1559808.js"></script>
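<p style="text-align: justify;">To make the pace calculation concrete, here is a minimal sketch in Python. It is not the GPSdings API and may differ from what <em>TrackAnalyzer</em> computes internally; it simply assumes that the speed at a point is the great-circle (haversine) distance to the previous point divided by the elapsed time. The function names and the tuple layout are my own choices for illustration.</p>

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speeds_kmh(points):
    """Speed in km/h for every point after the first.

    `points` is a list of (lat, lon, datetime) tuples in tracking order.
    """
    result = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(points, points[1:]):
        seconds = (t2 - t1).total_seconds()
        if seconds > 0:
            meters = haversine_m(lat1, lon1, lat2, lon2)
            result.append(meters / seconds * 3.6)  # m/s to km/h
        else:
            result.append(0.0)  # repeated timestamp: no speed can be derived
    return result


# Made-up sample: two points on the equator, 40 seconds apart.
track = [
    (0.0, 0.0, datetime(2012, 1, 1, 10, 0, 0)),
    (0.0, 0.001, datetime(2012, 1, 1, 10, 0, 40)),
]
print(speeds_kmh(track))
```

<p style="text-align: justify;">Elevation is ignored here, which is a reasonable simplification on a mostly flat course but underestimates the distance covered on steep terrain.</p>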
<p>&nbsp;</p>
<h3>2. Importing GPS data in Neo4J Spatial</h3>
<p style="text-align: justify;"><span class="highlight"><em>Neo4J Spatial</em></span> is built on top of <span class="highlight"><em>Neo4J</em></span> and provides support for <span class="highlight"><em>spatial data</em></span>. Once your data is stored, <span class="highlight"><em>spatial operations</em></span> can be executed, which, for instance, allow you to search for data within specified regions or within a specified distance of a particular point of interest. We start by setting up a Neo4J <em>EmbeddedGraphDatabase</em>. We then wrap it as a <em>SpatialDatabaseService</em>, which allows us to create an <em>EditableLayer</em>. <em>EditableLayer</em> is Neo4J Spatial&#8217;s main abstraction, which is used to define a <span class="highlight"><em>collection of geometries</em></span>. Each layer needs to be initialized with a specific <em>GeometryEncoder</em>, which acts as a kind of adapter to map from the graph to the geometries and vice versa. In our case, we will employ the <em>SimplePointEncoder</em>.</p>
<script src="https://gist.github.com/1559893.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">Adding spatial data to the running layer is very easy. We start by creating a <em>Coordinate</em> for each point that is parsed by GPSdings. Next, we add this new coordinate to the running layer. This operation returns a <em>SpatialDatabaseRecord</em> which, under the hood, is just a <span class="highlight"><em>regular Neo4J node</em></span>. Hence, we can add any property we want to this node. In our case, we will add two properties: one named <span class="highlight"><em>speed</em></span>, indicating the (average) pace, and one named <span class="highlight"><em>occurrences</em></span>, indicating the number of times this particular coordinate was encountered in the overall data set. Once the new coordinate is created, we connect the previous node with the newly created node through the <span class="highlight"><em>NEXT</em></span> relationship type. Hence, our graph is an <span class="highlight"><em>enumeration</em></span> of the encountered coordinates, <span class="highlight"><em>interlinked</em></span> through NEXT edges.</p>
<script src="https://gist.github.com/1559954.js"></script>
<p>&nbsp;</p>
<p style="text-align: justify;">In case a coordinate is encountered multiple times, we <span class="highlight"><em>recalculate the average speed</em></span> and <span class="highlight"><em>increment the number of encounters</em></span>.</p>
<script src="https://gist.github.com/1560142.js"></script>
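<p style="text-align: justify;">The recalculation itself is a plain incremental mean. The sketch below, in Python rather than the Java of the gist above, shows the arithmetic under the assumption that the <em>speed</em> property holds the mean of all observations so far and <em>occurrences</em> holds their count. The function name is hypothetical.</p>

```python
def update_average(current_avg, occurrences, new_speed):
    """Fold one more observation into a running mean.

    Returns the new average and the incremented occurrence count.
    """
    new_occurrences = occurrences + 1
    new_avg = (current_avg * occurrences + new_speed) / new_occurrences
    return new_avg, new_occurrences


# A coordinate seen once at 10 km/h is encountered again at 12 km/h.
print(update_average(10.0, 1, 12.0))
```

<p style="text-align: justify;">Storing the count next to the average is what makes this possible: the previous observations never have to be kept around.</p>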
<p>&nbsp;</p>
<p style="text-align: justify;">Unfortunately, the chances of encountering an already existing coordinate are low, as coordinates in a GPX file have a 15-digit precision right of the decimal point. Instead of trying to <span class="highlight"><em>round</em></span> these coordinates ourselves, we will use the <span class="highlight"><em>Neo4J Spatial querying API</em></span>. A simple <span class="highlight"><em>nearest neighbor</em></span>-search limited to 20 meters allows us to find matching coordinates. (I chose 20 meters, as this is slightly above the average distance between two consecutive coordinates.) In case we find a coordinate within this 20-meter range, we will <span class="highlight"><em>reuse</em></span> it. Otherwise, we just create a <span class="highlight"><em>new coordinate</em></span>. The full algorithm for importing multiple GPX datasets can be found below.</p>
<script src="https://gist.github.com/1560218.js"></script>
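<p style="text-align: justify;">For readers who want to see the matching logic without a Neo4J Spatial installation, here is a self-contained Python sketch of the same idea. It is an approximation under stated assumptions, not the code from the gist: a linear scan stands in for the indexed nearest-neighbor query, plain dictionaries stand in for nodes, and skipping self-referencing NEXT edges is my own choice. All names are hypothetical.</p>

```python
import math

EARTH_RADIUS_M = 6371000.0
MATCH_RADIUS_M = 20.0  # the 20-meter threshold described in the text


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def add_point(nodes, lat, lon, speed):
    """Reuse the nearest node within MATCH_RADIUS_M, or append a new one.

    Returns the index of the node that now represents this point.
    """
    best, best_dist = None, MATCH_RADIUS_M
    for index, node in enumerate(nodes):
        dist = haversine_m(lat, lon, node["lat"], node["lon"])
        if best_dist >= dist:
            best, best_dist = index, dist
    if best is None:
        nodes.append({"lat": lat, "lon": lon, "speed": speed, "occurrences": 1})
        return len(nodes) - 1
    node = nodes[best]
    node["speed"] = (node["speed"] * node["occurrences"] + speed) / (node["occurrences"] + 1)
    node["occurrences"] += 1
    return best


def import_run(nodes, edges, run):
    """Add one run, given as (lat, lon, speed) tuples, and link its points with NEXT edges."""
    previous = None
    for lat, lon, speed in run:
        current = add_point(nodes, lat, lon, speed)
        if previous is not None and previous != current:
            edges.append((previous, current))  # a NEXT edge from previous to current
        previous = current


# Two made-up runs along the equator; the second passes within about 11 meters of the first.
nodes, edges = [], []
import_run(nodes, edges, [(0.0, 0.0, 10.0), (0.0, 0.001, 10.0)])
import_run(nodes, edges, [(0.0, 0.0001, 12.0), (0.0, 0.0011, 14.0)])
print(len(nodes), edges)
```

<p style="text-align: justify;">The linear scan costs time proportional to the number of stored coordinates for every imported point, which is fine for a handful of runs. A spatial index such as the RTree maintained by Neo4J Spatial avoids that cost on larger data sets.</p>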
<p>&nbsp;</p>
<h3>3. Visualizing running data</h3>
<p style="text-align: justify;">By using the <span class="highlight"><em>Neo4J Spatial querying API</em></span>, we are able to retrieve the set of coordinates that satisfy a particular condition. However, raw coordinates are somewhat <span class="highlight"><em>abstract</em></span> and hard to interpret. Instead, we will use the excellent Gephi graph visualization and exploration tool. By installing the <a target='_blank' href="http://gephi.org/tag/neo4j/">Gephi Neo4J plugin</a>, we are able to load and explore graphs that are stored in a Neo4J (Spatial) datastore. Let&#8217;s start by <span class="highlight"><em>importing</em></span> our dataset in Gephi.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/geo1.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/geo1.jpg" alt="gephi" /></a></p>
<p style="text-align: justify;">The displayed graph contains other types of nodes and edges (i.e. <em>Layer </em>and <em>RTree </em>index information), in addition to the coordinates and NEXT edges that we added ourselves. Let&#8217;s get rid of those by <span class="highlight"><em>filtering our graph</em></span> on the NEXT relationship-type.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/geo2.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/geo2.jpg" alt="gephi" /></a></p>
<p style="text-align: justify;">Only half of the edges remain &#8230; However, we will still not gain novel insights from this mess. Let&#8217;s lay out our graph by using the <a target='_blank' href="http://gephi.org/plugins/geolayout/">Gephi GeoLayout plugin</a>. This layouter takes <span class="highlight"><em>geocoded graphs</em></span> as input and lays them out according to their geocoded attributes. Make sure to increase the scaling, as our coordinates are located close together. Cool! This view clearly outlines the courses I&#8217;m running.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/geo3.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/geo3.jpg" alt="gephi" /></a></p>
<p style="text-align: justify;">Let&#8217;s visualize the coordinates that were <span class="highlight"><em>frequently encountered</em></span> during the 4 runs that are imported in the Neo4J Spatial datastore. For this, we will use the <span class="highlight"><em>InDegree</em></span> node property, which indicates <span class="highlight"><em>the number of incoming edges</em></span> for each coordinate. We rank <span class="highlight"><em>node weight</em></span> (i.e. node size) through this property. Hence, frequently encountered nodes will show up bigger. In my case, frequently encountered coordinates are found around the place where I live (and hence start my runs) and on street intersections.</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/geo4.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/geo4.jpg" alt="gephi" /></a></p>
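<p style="text-align: justify;">Gephi computes the in-degree for us, but it helps to see what the number means. The short Python sketch below counts incoming NEXT edges per node from a made-up edge list of (source, target) pairs; parallel edges between the same two nodes are counted separately, which I assume matches how repeated NEXT relationships show up after several runs.</p>

```python
from collections import Counter


def in_degree(edges):
    """Number of incoming edges per node, given (source, target) pairs."""
    return Counter(target for _source, target in edges)


# Made-up NEXT edges: node 1 is reached twice, node 2 once.
next_edges = [(0, 1), (0, 1), (1, 2)]
print(in_degree(next_edges))
```

<p style="text-align: justify;">A node that is passed on every run therefore ends up with an in-degree close to the number of runs, which is why the start location and street intersections stand out.</p>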
<p style="text-align: justify;">Let&#8217;s do one final analysis, namely a visualization that illustrates the <span class="highlight"><em>average pace throughout all runs</em></span>. For this, we rank both <span class="highlight"><em>node weight</em></span> and <span class="highlight"><em>node color</em></span> through the <span class="highlight"><em>speed</em></span> property. Hence, coordinates with a high average pace are colored green and show up bigger. Coordinates with a low average pace are colored red and show up smaller. In the blink of an eye, I can now interpret my average pace, taking into account my overall running data set!</p>
<p align="center"><a target='_blank' href="http://datablend.be/wp-content/uploads/geo5.jpg"><img width="550" src="http://datablend.be/wp-content/uploads/geo5.jpg" alt="gephi" /></a></p>
<p>&nbsp;</p>
<h3>4. Conclusion</h3>
<p style="text-align: justify;">This article describes the use of the <span class="highlight"><em>Neo4J Spatial datastore</em></span> and <span class="highlight"><em>Gephi</em></span> to analyze Garmin running data. As always, the complete source code can be found on the <a target='_blank' href="https://github.com/datablend/neo4j-spatial-running">Datablend public GitHub repository</a>. Any ideas for other types of analysis that could be performed on the dataset?</p>
<p></p>]]></content:encoded>
			<wfw:commentRss>http://datablend.be/?feed=rss2&#038;p=262</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
