Datablend » visualisation

The power of Manchester City: a data analysis

Davy Suvee — Sun, 17 Aug 2014 11:55:16 +0000

What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the Opta dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data has been loaded into the Neo4J graph database and the Cypher Query Language has been used to extract the passing statistics.

For each of the teams, we calculated the average position of their players on the pitch (based upon their individual actions). The thickness of the red edges visualises the number of successful passes between two individual players. Finally, we visualised the number of successful passes of each player (see shirt number) and the distribution of the length of his passes. Have fun exploring this infographic!

Datablend launches vk14-bingo.be

Davy Suvee — Mon, 12 May 2014 11:26:55 +0000

Are you also being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows you at a glance what each party stands for? Look no further and use vk14-bingo.be. We analysed the various party programmes word by word and plotted their core ideas against one another. You can now easily check which themes each party focuses on and compare the parties with each other.

But there is more. Our politicians have by now found their way to Twitter and make diligent use of this new medium to promote themselves, their party and their positions. But how faithful are they to their own party programme? vk14-bingo.be analyses in real time the tweets of more than 850 Flemish politicians with one goal: which party is the first to virtually complete its programme and win #vk14-bingo!

Do you have questions or remarks about this analysis? Contact us via vk14-bingo@datablend.be. Interested in an analysis of your own data, large or small? Visit datablend.be or contact us via info@datablend.be and we will gladly pay you a visit.

The power of graphs to analyse biological data

Davy Suvee — Mon, 02 Dec 2013 07:04:34 +0000

Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets.

The Power of Graphs to Analyze Biological Data – Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.

Yelp graph: checkin-based business clustering

Davy Suvee — Sun, 01 Dec 2013 11:51:58 +0000

Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible to visually cluster businesses by category, based purely on their checkin data. The checkin data itself is available on a day-hour level: for each business, it is possible to retrieve, for example, the number of checkins on a Sunday afternoon between 3 and 4. So, with only this data in mind, are we able to cluster businesses as being restaurants or fashion stores, based purely on the correlations calculated amongst their checkin data? For this experiment, we use the Neo4J graph database for storing our checkin-based correlation graph and employ the Gephi graph visualisation platform for interpreting the identified business communities/clusters. As always, the full source code of this article can be found on the Datablend public GitHub repository (although you will need to acquire the dataset yourself through the Yelp Dataset Challenge portal).

1. Building the Neo4J checkin correlation graph

We start by parsing both the business and checkin JSON files from the Yelp Dataset Challenge. Unfortunately, checkin data is available for only 8,282 out of the 11,537 supplied businesses. In addition, many of these have only a limited set of associated checkins. Hence, in order to make sure that only relevant correlations are calculated, we ignore the businesses that have fewer than 100 checkins, leaving around 1,920 businesses.
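The filtering step can be sketched in a few lines of Python (the original code is Java, so this is only an illustration). It assumes that each line of the checkin file is a JSON object with a `business_id` and a `checkin_info` map keyed by `"hour-day"` strings; that layout, the function names and the sample records are assumptions made for this sketch.

```python
import json

HOURS, DAYS = 24, 7

def checkin_vector(checkin_info):
    """Flatten an {"hour-day": count} mapping into a 168-slot list (day-major)."""
    vector = [0] * (HOURS * DAYS)
    for key, count in checkin_info.items():
        hour, day = (int(part) for part in key.split("-"))
        vector[day * HOURS + hour] = count
    return vector

def load_busy_businesses(lines, min_checkins=100):
    """Return {business_id: vector} for businesses with at least min_checkins."""
    busy = {}
    for line in lines:
        record = json.loads(line)
        vector = checkin_vector(record["checkin_info"])
        if sum(vector) >= min_checkins:
            busy[record["business_id"]] = vector
    return busy

# Two made-up records: only the first one reaches 100 checkins.
sample = [
    '{"business_id": "diner", "checkin_info": {"8-1": 70, "9-1": 45}}',
    '{"business_id": "kiosk", "checkin_info": {"12-3": 5}}',
]
busy = load_busy_businesses(sample)
```

Every remaining business ends up with a vector of exactly 168 values, which keeps the later pairwise comparison simple: slot `i` always refers to the same hour of the same weekday.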

Next, we try to identify the correlation between two businesses by using the Pearson Correlation Coefficient (read this site for a nice introduction). Simply put, we try to identify whether a linear association exists between the checkins of two individual businesses. In our case, the calculation is based upon 168 data points (24 hours x 7 days), the idea being that two breakfast restaurants will get most of their checkins from the morning till noon, while two bars will get most of their checkins during the evening and at night. Hence, we expect the correlation between two businesses of the same type to be quite high, while different types of businesses (e.g. a breakfast restaurant and a bar) will result in little or no correlation.

Time to get our hands dirty. After parsing the data files, we use the existing Apache Commons Math library to calculate the pairwise correlation between the checkin datasets of the 1,920 businesses. If the resulting coefficient is 0.8 or higher, we consider both businesses to be correlated. We create a unique node for each business within the Neo4J graph and connect them via a “correlated”-relationship.
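To make the pairwise step concrete, here is a minimal Python sketch (the original implementation uses Java and Apache Commons Math; this version computes the coefficient by hand). The business names and the six-slot vectors are invented for illustration; the real vectors have 168 slots. Treating a constant series as uncorrelated is a choice made for this sketch.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant series: correlation undefined, treated as none
    return cov / sqrt(var_x * var_y)

def correlated_pairs(vectors, threshold=0.8):
    """Return (business_a, business_b, r) for every pair with r >= threshold."""
    edges = []
    for (a, xs), (b, ys) in combinations(sorted(vectors.items()), 2):
        r = pearson(xs, ys)
        if r >= threshold:
            edges.append((a, b, r))
    return edges

# Made-up checkin profiles: two morning businesses and one evening business.
vectors = {
    "diner_a": [9, 8, 7, 1, 0, 0],
    "diner_b": [8, 9, 6, 2, 0, 1],
    "bar": [0, 0, 1, 6, 9, 8],
}
edges = correlated_pairs(vectors)
```

Each tuple in `edges` corresponds to one “correlated”-relationship in the graph. Note that the comparison is quadratic in the number of businesses, which is still manageable for roughly 1,920 of them.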

The generated graph contains 606 unique nodes (i.e. businesses that are correlated to at least one other business) and 2585 edges (i.e. actual correlations).

2. Gephi interpretation

Our next task is to observe whether groups of businesses exist that are highly correlated (i.e. highly interconnected) and identify whether these correlations make sense. In order to do so, we import our Neo4J correlation graph in Gephi through the Gephi Neo4J plugin. Once loaded, we run the modularity function to identify meaningful communities. These computed communities are then used to partition (i.e. color) the nodes (and their related edges) so that clusters can easily be observed. Next, we apply K-core filtering, in our case 3-core, to keep the subgraph in which all nodes have a degree of at least 3 (i.e. 3 relationships with other nodes). The size of the nodes (and their associated labels) is configured to be proportional to their degree. Finally, we apply the Fruchterman-Reingold layout in order to clearly visualise the various clusters.

We can easily observe 8 communities, but are these clusters meaningful? The pink cluster on the right-hand side is highly interconnected (i.e. all nodes of the cluster have mutual correlations). Most of them can be identified as being breakfast diners (e.g. The good egg, The breakfast joynt and Orange table). Cool. This certainly makes sense, as most of these businesses have checkins from early morning until early afternoon. The yellow cluster on the top contains various department stores (including Costco, Nordstrom Rack and IKEA). Again meaningful, as most of them open their doors somewhere around 10AM and close around 7PM. At first sight, it seems strange that the coffee places are correlated into two separate groups (yellow cluster at the bottom and pink cluster on the top). The reason however is simple: some of them close late afternoon while others are open until midnight.
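Gephi performs the K-core filtering for us, but the idea is easy to sketch in Python: keep removing nodes with fewer than k neighbours until none are left. The function name and the toy graph below are made up for illustration, and the adjacency map is assumed to be symmetric.

```python
def k_core(adjacency, k):
    """Return the k-core: repeatedly drop nodes with fewer than k neighbours."""
    core = {node: set(neighbours) for node, neighbours in adjacency.items()}
    queue = [node for node, neighbours in core.items() if len(neighbours) < k]
    while queue:
        node = queue.pop()
        if node not in core:
            continue  # already removed through an earlier queue entry
        for neighbour in core.pop(node):
            core[neighbour].discard(node)
            if len(core[neighbour]) < k:
                queue.append(neighbour)
    return core

# a, b, c and d are fully interconnected; e hangs on to a and b; f only to e.
adjacency = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d", "e"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "b", "f"},
    "f": {"e"},
}
core = k_core(adjacency, 3)
```

The example shows why the removal has to be repeated: `e` starts with three neighbours, but drops below the limit once `f` is gone, so only the four fully interconnected nodes survive.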

3. Conclusion

The Neo4J/Gephi solution works remarkably well to visually identify the various business clusters from the Yelp dataset. In a next blog article, we will show how to use the k-nearest neighbours algorithm to automatically predict the type of business based solely upon the checkin information.

Hubway Data Visualization Challenge Entry: the flow of bikers

Davy Suvee — Tue, 16 Oct 2012 09:57:49 +0000

Last week, Hubway announced its Data Visualization Challenge. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the station closest to your destination. For this challenge, Hubway released a CSV file containing over half a million rides. Each entry contains the origin and destination station as well as the timing information and some anonymised demographic information. The purpose of the challenge is to create appealing visualizations that provide Hubway with cool insights into how customers are using their bikes. As I had 8 hours to spare on a flight to New York, I decided to give it a go.

1. Flow of bikers

The goal of my visualization is to depict how bikers flow through the city of Boston, namely: “taking a specific station as starting point, to which other stations are people biking”. A classical, graph-based visualization would show this flow, but would also be quite cluttered as each origin-destination tuple would have its own edge, and would therefore fail to provide the grand overview. The use of a flow map, however, would make the visualization both appealing and insightful. Cartographers use flow maps to show the movement of objects from one location to another, such as the number of people in a migration, the amount of goods being traded, or the number of packets in a network. Flow maps reduce visual clutter by merging edges where possible.

Playing around with the library in the past, I remembered somebody releasing a flow map layout implementation. Taking their implementation as a starting point, I applied some modifications (related to the Mercator layout) and supplied it with my pre-processed Hubway biking data. For each station, I can now generate a separate map that visualises the flow of bikers towards other stations, where each station is mapped at its geographically correct location. As can be expected, most people bike to close-by stations, but others seem to enjoy biking to far-off locations. Let’s have a look at a few examples. The image below displays the flow map for the Boston University Central station located at 725 Commonwealth Avenue (A32003). As this station is quite central to the city, we see that people bike off in almost all directions, although most of them keep close to the Charles River.

If we generate the flow map for a biking station near the edge of the city, such as Andrew Station on Dorchester Avenue (C32012), an entirely different flow pattern can be observed, as biking destinations are concentrated on the east side of Boston.
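The pre-processing behind such a per-station map boils down to counting rides per origin-destination pair. A minimal Python sketch of that step follows; the column names `start_station` and `end_station`, the sample rows and the station code A32001 are placeholders invented for this example and do not necessarily match the Hubway file.

```python
import csv
import io
from collections import Counter, defaultdict

def flows_by_origin(rows, origin_field="start_station", target_field="end_station"):
    """Count rides per destination, grouped by origin station."""
    flows = defaultdict(Counter)
    for row in rows:
        origin, target = row[origin_field], row[target_field]
        if origin and target:  # skip rides with a missing station
            flows[origin][target] += 1
    return flows

# Tiny made-up extract; the real file holds over half a million rides.
sample_csv = io.StringIO(
    "start_station,end_station\n"
    "A32003,A32001\n"
    "A32003,A32001\n"
    "A32003,C32012\n"
    "C32012,A32003\n"
)
flows = flows_by_origin(csv.DictReader(sample_csv))
```

`flows["A32003"]` then holds the weighted destinations for one station, which is the kind of input a flow map layout needs: one root and a set of weighted targets.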

2. Conclusion

The current application could easily be extended to filter trips on demographics and/or timing information. One could also overlay various flow maps in order to detect similarities between flows of bikers. If people are interested in extending my implementation, I am willing to upload my “code-hacking” to GitHub so that the project can be forked. Just let me know.

Circle through your Google Analytics data with Neo4J and Circos

Davy Suvee — Sun, 11 Mar 2012 09:51:19 +0000

Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. Lately, I’ve been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it can be used to create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table’s columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The ribbons themselves are proportional in width to the value in the cell.

When visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. In my case, I want to better understand the flow of visitors to and within the datablend site and blog; where do visitors come from (direct, referral, search, …) and how do they navigate between pages. The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4J and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Retrieving your Google Analytics data

Let’s start by retrieving the raw Google Analytics data. The Google Analytics data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I’m interested in retrieving the previous page path property for each page view. If a visitor enters through a page outside of the datablend website, the previous page path is marked as (entrance). Otherwise, it contains the internal path. We will use Google’s Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath and medium dimensions, while our metric of choice is the number of pageViews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the datablend website, we store the specific medium (direct, referral, search, …) as previous path.
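The cleaning rule is small enough to show in full. The sketch below is Python rather than the Java used in the original code, and it works on plain dictionaries instead of the Google Analytics feed objects; the `Navigation` class, the `to_navigation` function and the sample rows are hypothetical names and values chosen for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Navigation:
    source: str      # previous path, or the medium for external entrances
    target: str      # the page that was viewed
    page_views: int

def to_navigation(row):
    """Map one raw row to a Navigation, replacing (entrance) by the medium."""
    previous = row["previousPagePath"]
    source = row["medium"] if previous == "(entrance)" else previous
    return Navigation(source, row["pagePath"], int(row["pageviews"]))

# Made-up rows: one external entrance and one internal navigation.
rows = [
    {"pagePath": "/blog", "previousPagePath": "(entrance)",
     "medium": "referral", "pageviews": "12"},
    {"pagePath": "/about", "previousPagePath": "/blog",
     "medium": "referral", "pageviews": "3"},
]
navigations = [to_navigation(row) for row in rows]
```

After this step every navigation has a usable source, so external traffic shows up in the graph as a handful of medium nodes (direct, referral, search, …) instead of one anonymous entrance marker.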

2. Storing navigational data as a directed graph in Neo4J

The set of site navigations can easily be stored as a directed graph in Neo4J. […] the degree of my nodes is correct if I would perform other types of calculations. For each individual navigation relationship, we also store the date of visit.

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It’s basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity. We will use the Neo4J Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors are changing over time.

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the two-dimensional array by aggregating (i.e. summing) the number of navigations between the source and target paths. At the end, we filter this occurrence matrix on the minimal number of required navigations. This ensures that we will only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we will use a shorthand, as the Circos renderer seems to have problems with long string identifiers.
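A compact Python sketch of these steps is shown here. It is not the printCircosData method from the repository: the function name `circos_table`, the `data` label in the top-left cell and the exact filter (dropping paths whose total number of incoming plus outgoing navigations is below the minimum) are assumptions made for this illustration.

```python
from collections import Counter

def circos_table(navigations, min_navigations=1):
    """Build a legend and a tab-delimited occurrence matrix from
    (source, target, views) tuples."""
    # Sum the navigations per (source, target) pair.
    counts = Counter()
    for source, target, views in navigations:
        counts[(source, target)] += views

    # Keep only paths that are involved in enough navigations overall.
    totals = Counter()
    for (source, target), views in counts.items():
        totals[source] += views
        totals[target] += views
    paths = sorted(p for p, total in totals.items() if total >= min_navigations)

    # Shorthand labels (l0, l1, ...) stand in for the long path strings.
    labels = {path: f"l{i}" for i, path in enumerate(paths)}
    legend = [f"{labels[p]}\t{p}" for p in paths]
    header = "\t".join(["data"] + [labels[p] for p in paths])
    rows = [
        "\t".join([labels[r]] + [str(counts[(r, c)]) for c in paths])
        for r in paths
    ]
    return legend, [header] + rows

# Made-up navigations; the search -> /old pair falls below the minimum of 2.
navs = [
    ("referral", "/blog", 5),
    ("referral", "/blog", 3),
    ("/blog", "/about", 2),
    ("search", "/old", 1),
]
legend, table = circos_table(navs, min_navigations=2)
print("\n".join(legend + table))
```

Rows are sources and columns are targets, so a cell reads as "this many navigations flowed from the row path to the column path", which matches how Circos interprets the table.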

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths). Next it prints the tab-delimited Circos table.

4. Use the Circos power

Although Circos can be installed on your local computer, we will use its online version to create the visualization of our data. Upload your tab-delimited file and just wait a few seconds before enjoying the beautiful rendering of your site’s navigation information.

At a glance we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The outer 3 rings visualize the total number of navigations that are leaving and entering this particular path. In the case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its total segment count (inner ring) is built up entirely from navigations that have a referral as source. The l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my “The joy of algorithms and NoSQL: a MongoDB example”-article. Most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. The l15-segment (my blog’s main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.

With just a few tweaks to the Circos input data, we can easily focus on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.

5. Conclusions

In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for certain patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.

Running along the graph using Neo4J Spatial and Gephi

Davy Suvee — Wed, 04 Jan 2012 09:48:13 +0000

When I started running some years ago, I bought a Garmin Forerunner 405. It’s a nifty little device that tracks GPS coordinates while you are running. After a run, the device can be synchronized by uploading your data to the Garmin Connect website. Based upon the tracked time and GPS coordinates, the Garmin Connect website provides you with a detailed overview of your run, including distance, average pace, elevation loss/gain and lap splits. It also visualizes your run, by overlaying the tracked course on Bing and/or Google maps. Pretty cool! One of my last runs can be found here.

Apart from simple aggregations such as total distance and average speed, the Garmin Connect website provides little or no support to gain deeper insights into all of my runs. As I often run the same course, it would be interesting to calculate my average pace at specific locations. When combining the data of all of my courses, I could deduce frequently encountered locations. Finally, could there be a correlation between my average pace and my distance from home? In order to come up with answers to these questions, I will import my running data into a Neo4J Spatial datastore. Neo4J Spatial extends the Neo4J Graph Database with the necessary tools and utilities to store and query spatial data in your graph models. For visualizing my running data, I will make use of Gephi, an open-source visualization and manipulation tool that allows users to interactively browse and explore graphs.

1. Extracting GPX data

The Garmin Connect website allows you to download running data in various formats, including KML, TCX and GPX. GPX (the GPS Exchange Format) is a light-weight XML data format that is used for interchanging GPS data (waypoints, routes, and tracks) between applications and web services. Below, you can find a GPX extract enumerating several tracked points. Each of these points contains the GPS location, the elevation and the corresponding timestamp.

Based upon this data, one is able to calculate various metrics, including pace. For this, we will use GPSdings, a Java library that provides the required functionality to extract and analyze GPX data. We start by reading in a GPX file. Afterwards, we analyze the content using the GPSdings TrackAnalyzer which, amongst other metrics, calculates the pace for each point that was tracked during a run. The information we need is stored in the first segment of the first track.
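To give an idea of what such an analysis involves, the following Python sketch reads the first segment of the first track with the standard library and derives the speed between consecutive points from the haversine distance. It is not GPSdings and it is not the original Java code; the sample GPX fragment, its coordinates and the timestamp format without fractional seconds are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 coordinates."""
    d_lat, d_lon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def track_points(gpx_text):
    """Return (lat, lon, time) for the first segment of the first track."""
    segment = ET.fromstring(gpx_text).find("gpx:trk/gpx:trkseg", GPX_NS)
    points = []
    for pt in segment.findall("gpx:trkpt", GPX_NS):
        stamp = pt.find("gpx:time", GPX_NS).text
        time = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        points.append((float(pt.get("lat")), float(pt.get("lon")), time))
    return points

def speeds_kmh(points):
    """Speed between each pair of consecutive points, in km/h."""
    speeds = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(points, points[1:]):
        seconds = (t2 - t1).total_seconds()
        meters = haversine_m(lat1, lon1, lat2, lon2)
        speeds.append(meters / seconds * 3.6 if seconds > 0 else 0.0)
    return speeds

# Invented three-point track, 30 seconds between points.
SAMPLE = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="51.000" lon="4.000"><ele>10</ele><time>2012-01-01T10:00:00Z</time></trkpt>
    <trkpt lat="51.001" lon="4.000"><ele>11</ele><time>2012-01-01T10:00:30Z</time></trkpt>
    <trkpt lat="51.002" lon="4.000"><ele>12</ele><time>2012-01-01T10:01:00Z</time></trkpt>
  </trkseg></trk>
</gpx>"""
points = track_points(SAMPLE)
speeds = speeds_kmh(points)
```

Speed and pace carry the same information (pace is simply minutes per kilometre, i.e. 60 divided by the speed in km/h), so either can be stored on the nodes later on.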

2. Importing GPS data in Neo4J Spatial

Neo4J Spatial is built on top of Neo4J and provides support for spatial data. Once your data is stored, spatial operations can be executed, which for instance allow you to search for data within specified regions or within a specified distance of a particular point of interest. We start by setting up a Neo4J EmbeddedGraphDatabase. We then wrap it as a SpatialDatabaseService, which allows us to create an EditableLayer. EditableLayer is Neo4J Spatial’s main abstraction, which is used to define a collection of geometries. Each layer needs to be initialized with a specific GeometryEncoder, which acts as a kind of adapter to map from the graph to the geometries and vice versa. In our case, we will employ the SimplePointEncoder.

Adding spatial data to the running layer is very easy. We start by creating a Coordinate for each point that is parsed by GPSdings. Next, we add this new coordinate to the running layer. This operation returns a SpatialDatabaseRecord which, under the hood, is just a regular Neo4J node. Hence, we can add any property we want to this node. In our case, we will add two properties. One property, named speed, indicating the (average) pace. One property, named occurrences, indicating the number of times this particular coordinate was encountered in the overall data set. Once the new coordinate is created, we connect the previous node with the newly created node through the NEXT relationship type. Hence, our graph is an enumeration of the encountered coordinates, interlinked through NEXT edges.

In case a coordinate is encountered multiple times, we recalculate the average speed and increment the number of encounters.

Unfortunately, chances are low to encounter an already existing coordinate, as coordinates in a GPX file have a 15-digit precision right of the decimal point. Instead of trying to round these coordinates ourselves, we will use the Neo4J Spatial querying API. A simple nearest-neighbor search limited to 20 meters allows us to find matching coordinates. (I chose 20 meters, as 20 is a little above the average distance between two coordinates.) In case we find a coordinate within this 20-meter range, we will reuse it. Otherwise, we just create a new coordinate. The full algorithm for importing multiple GPX datasets can be found below.
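As a rough illustration of the merge logic (a simplified Python sketch, not the Neo4J Spatial code from the repository), the version below keeps all nodes in a plain list and replaces the spatial index by a linear scan. The `Node` class, the function names and the sample runs are invented for this example, and skipping a NEXT link from a node to itself is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

@dataclass
class Node:
    lat: float
    lon: float
    speed: float                 # running average over all encounters
    occurrences: int = 1
    next_nodes: list = field(default_factory=list)  # stands in for NEXT edges

def distance_m(lat1, lon1, lat2, lon2):
    """Haversine distance in meters."""
    d_lat, d_lon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def nearest_within(nodes, lat, lon, max_m=20.0):
    """Closest existing node within max_m meters, or None (linear scan)."""
    best, best_d = None, max_m
    for node in nodes:
        d = distance_m(node.lat, node.lon, lat, lon)
        if d <= best_d:
            best, best_d = node, d
    return best

def import_run(nodes, run):
    """Add one run, given as (lat, lon, speed) tuples, to the node list."""
    previous = None
    for lat, lon, speed in run:
        node = nearest_within(nodes, lat, lon)
        if node is None:
            node = Node(lat, lon, speed)
            nodes.append(node)
        else:
            # Reuse the coordinate: update the average and the counter.
            node.speed = (node.speed * node.occurrences + speed) / (node.occurrences + 1)
            node.occurrences += 1
        if previous is not None and previous is not node:  # assumption: no self-links
            previous.next_nodes.append(node)
        previous = node

# Two made-up runs over (almost) the same two locations.
nodes = []
import_run(nodes, [(51.0, 4.0, 10.0), (51.0005, 4.0, 12.0)])
import_run(nodes, [(51.00005, 4.0, 14.0), (51.0005, 4.0001, 12.0)])
```

The second run lands within a few meters of the first, so no new nodes are created; the first node ends up with two occurrences and an average speed of 12. A real spatial index avoids the linear scan, which matters once several runs have been imported.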

3. Visualizing running data

By using the Neo4J Spatial querying API, we are able to retrieve the set of coordinates that satisfy a particular condition. However, coordinates are somewhat abstract to interpret. Instead, we will use the excellent Gephi Graph visualization and exploration tool. By installing the Gephi Neo4J plugin, we are able to load and explore graphs that are stored in a Neo4J (Spatial) datastore. Let’s start by importing our dataset in Gephi.

The displayed graph contains other types of nodes and edges (i.e. Layer and RTree index information), in addition to the coordinates and NEXT edges that we added ourselves. Let’s get rid of those by filtering our graph on the NEXT relationship-type.

Only half of the edges remain … However, we will still not gain novel insights from this mess. Let’s lay out our graph by using the Gephi GeoLayout plugin. This layout takes geocoded graphs as input and positions the nodes according to their geocoded attributes. Make sure to increase scaling, as our coordinates are located closely together. Cool! This view clearly outlines the courses I’m running.

Let’s visualize the coordinates that were frequently encountered during the 4 runs that are imported in the Neo4J Spatial datastore. For this, we will use the InDegree node property, which indicates the number of incoming edges for each coordinate. We rank node weight (i.e. node size) through this property. Hence, frequently encountered nodes will show up bigger. In my case, frequently encountered coordinates are found around the place where I live (and hence start my runs) and on street intersections.

Let’s do one final analysis, namely a visualization that illustrates the average pace throughout all runs. For this, we rank both node weight and node color through the speed property. Hence, coordinates with a high average pace are colored green and show up bigger. Coordinates with a low average pace are colored red and show up smaller. In the blink of an eye, I can now interpret my average pace, taking into account my overall running data set!

4. Conclusion

This article describes the use of the Neo4J Spatial datastore and Gephi to analyze Garmin running data. As always, the complete source code can be found on the Datablend public GitHub repository. Any ideas for other types of analysis that could be performed on the dataset?