Datablend

The power of Manchester City: a data analysis

Davy Suvee — Sun, 17 Aug 2014 11:55:16 +0000

What makes Manchester City such a great team? The infographic below illustrates one of the team's most powerful characteristics: its successful passing capability. The visualisation is based upon the Opta dataset released in August 2011, containing the highly detailed Bolton vs Manchester City match statistics. The data was loaded into the Neo4J graph database, and the Cypher query language was used to extract the passing statistics.

For each team, we calculated the average position of its players on the pitch (based upon their individual actions). The thickness of the red edges visualises the number of successful passes between two individual players. Finally, we visualised the number of successful passes of each player (see shirt number) and the distribution of the length of his passes. Have fun exploring this infographic!

Coalition-Cocktail – Hacking the Elections @ Engagor

Davy Suvee — Tue, 27 May 2014 14:57:42 +0000

Last weekend, Engagor organised their hacktheelections hackathon. The Datablend team (Quentin, Stijn and Davy) was joined by Marc Broos, Tim Coene and Josbert van de Zande with one goal in mind: trying to visualise the (pre-arranged?) political coalition and, if possible, also predict the formation period.

Technically, we extracted over 160K tweets through the Engagor API. Next, a “sentiment”-based political graph was built and stored in the Neo4J graph database. A Gephi-based visualisation, based upon community detection, revealed the “truth”, which was, to be honest, a bit disappointing and at the same time somewhat expected: the political parties from Wallonia form a solid community while the Flemish parties are interconnected through many clusters. A warning sign for the upcoming formation process?

The slide deck below provides an overview of the single day of hacking. Our results ranked third (out of 8 teams). Although our analysis was very informative, it’s quite hard to compete with a “pokemon”-themed fighting animation between politicians *wink*. Many thanks to Engagor for the spotless organisation!

Datablend launches vk14-bingo.be

Davy Suvee — Mon, 12 May 2014 11:26:55 +0000

Are you also being flooded with information about the upcoming elections? Are you, like so many other citizens, looking for a simple alternative that shows at a glance what each party stands for? Look no further and use vk14-bingo.be. We analysed the various party programmes word by word and plotted their key ideas against each other. You can now easily check which themes each party focuses on and compare the parties with one another.

But there is more. Our politicians have meanwhile found their way to Twitter and diligently use this new medium to promote themselves, their party and their positions. But how faithful are they to their own party programme? vk14-bingo.be analyses, in real time, the tweets of more than 850 Flemish politicians with a single goal: which party is the first to virtually complete its programme and win #vk14-bingo!

Do you have questions or remarks about this analysis? Contact us via vk14-bingo@datablend.be. Interested in an analysis of your own data, large or small? Visit datablend.be or contact us via info@datablend.be and we will gladly come by.

Datanews – Even with little data you can do useful things!

Davy Suvee — Fri, 24 Jan 2014 07:48:21 +0000

The power of graphs to analyse biological data

Davy Suvee — Mon, 02 Dec 2013 07:04:34 +0000

Watch Davy Suvee present at GraphConnect London 2013 on the power of graph databases to analyse biological datasets.

The Power of Graphs to Analyze Biological Data – Davy Suvee @ GraphConnect London 2013 from Neo Technology on Vimeo.

Yelp graph: checkin-based business clustering

Davy Suvee — Sun, 01 Dec 2013 11:51:58 +0000

Recently, Yelp made available a sample dataset from the greater Phoenix metropolitan area including around 11,000 businesses, 8,000 checkin sets, 43,000 users and 230,000 user reviews. With the help of this data, data scientists can execute real-life experiments with various data mining/machine learning algorithms. In our case, we are interested in finding out whether it is possible to visually cluster businesses by category, based purely on their checkin data. The checkin data itself is available on a day-hour level: for each business, it is possible to retrieve, for example, the number of checkins on a Sunday afternoon between 3 and 4. So, with only this data in mind, are we able to cluster businesses as being restaurants or fashion stores, based purely on the correlations calculated amongst their checkin data? For this experiment, we use the Neo4J graph database for storing our checkin-based correlation graph and employ the Gephi graph visualisation platform for interpreting the identified business communities/clusters. As always, the full source code of this article can be found on the Datablend public GitHub repository (although you will need to acquire the dataset yourself through the Yelp Dataset Challenge portal).

1. Building the Neo4J checkin correlation graph

We start by parsing both the business and checkin JSON files from the Yelp Dataset Challenge. Unfortunately, checkin data is available for only 8,282 out of the 11,537 supplied businesses. In addition, many of these have only a limited set of associated checkins. Hence, in order to make sure that only relevant correlations are calculated, we ignore businesses that have fewer than 100 checkins, resulting in around 1920 remaining businesses.

Next, we try to identify the correlation between two businesses by using the Pearson Correlation Coefficient (read this site for a nice introduction). Simply put, we try to identify whether a linear association exists between the checkins of two individual businesses. In our case, the calculation is based upon 168 data points (24 hours x 7 days), the idea being that two breakfast restaurants will get most of their checkins from the morning till noon, while two bars will get most of their checkins during the evening and at night. Hence, we expect the correlation between two businesses of the same type to be quite high, while different types of businesses (e.g. a breakfast restaurant and a bar) will result in little or no correlation.
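For reference, this is the standard sample Pearson coefficient. The symbols are ours, not taken from the original code: x_i and y_i denote the checkin counts of the two businesses in hour slot i, and the bars denote their means over all n = 168 slots.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad n = 168
```

The value ranges from -1 (perfectly opposite checkin patterns) to 1 (identical patterns up to scale and offset).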

Time to get our hands dirty. After parsing the data files, we use the existing Apache Commons Math library to calculate the pairwise correlation between the checkin datasets of the 1920 businesses. If the resulting coefficient is 0.8 or higher, we consider both businesses to be correlated. We create a unique node for each business within the Neo4J graph and connect them via a “correlated”-relationship.
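The actual implementation relies on Apache Commons Math and Neo4J. As a rough, dependency-free illustration of the same idea, the Python sketch below computes the pairwise coefficients and keeps the pairs at or above the 0.8 threshold. The function names and the toy data (6 time slots instead of 168) are our own assumptions, not part of the original code.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlated_pairs(checkins, threshold=0.8):
    """Return (business_a, business_b, r) for every pair with r >= threshold."""
    edges = []
    for (a, xs), (b, ys) in combinations(sorted(checkins.items()), 2):
        r = pearson(xs, ys)
        if r >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical toy data: checkins per time slot (the article uses 168 slots).
checkins = {
    "diner_a": [9, 8, 7, 1, 0, 0],
    "diner_b": [8, 9, 6, 2, 0, 1],
    "bar": [0, 0, 1, 6, 9, 8],
}
edges = correlated_pairs(checkins)
```

Each returned tuple corresponds to one “correlated”-relationship in the graph. Only the two diners end up connected in this toy example.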

The generated graph contains 606 unique nodes (i.e. businesses that are correlated to at least one other business) and 2585 edges (i.e. actual correlations).

2. Gephi interpretation

Our next task is to observe whether groups of businesses exist that are highly correlated (i.e. highly interconnected) and identify whether these correlations make sense. In order to do so, we import our Neo4J correlation graph in Gephi through the Gephi Neo4J plugin. Once loaded, we run the modularity function to identify meaningful communities. These computed communities are then used to partition (i.e. color) the nodes (and their related edges) so that clusters can easily be observed. Next, we apply K-core filtering, in our case 3-core, to keep the subgraph in which all nodes have a degree of at least 3 (i.e. 3 relationships with other nodes). The size of the nodes (and their associated labels) is configured to be proportional to their degree. Finally, we apply the Fruchterman-Reingold layout in order to clearly visualise the various clusters.
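Gephi performs the K-core filtering for us, but the idea is simple enough to sketch. The Python function below (our own illustration, not Gephi's code) repeatedly removes nodes with fewer than k remaining neighbours until every surviving node has a degree of at least k.

```python
from collections import defaultdict


def k_core(edges, k):
    """Return {node: neighbours} for the k-core of an undirected graph."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)

    # Start with every node whose degree is already too low.
    queue = [node for node in adj if len(adj[node]) < k]
    removed = set()
    while queue:
        node = queue.pop()
        if node in removed:
            continue
        removed.add(node)
        for neighbour in adj[node]:
            adj[neighbour].discard(node)
            # Removing a node may push its neighbours below the threshold.
            if neighbour not in removed and len(adj[neighbour]) < k:
                queue.append(neighbour)
        adj[node] = set()

    return {node: nbs for node, nbs in adj.items() if node not in removed}


# Hypothetical example: a fully connected group a-b-c-d plus a dangling chain.
example_edges = [
    ("a", "b"), ("a", "c"), ("a", "d"),
    ("b", "c"), ("b", "d"), ("c", "d"),
    ("a", "e"), ("e", "f"),
]
core = k_core(example_edges, 3)
```

In this example the chain e-f is pruned, leaving only the four mutually connected nodes.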

We can easily observe 8 communities, but are these clusters meaningful? The pink cluster on the right-hand side is highly interconnected (i.e. all nodes of the cluster have mutual correlations). Most of them can be identified as being breakfast diners (e.g. The good egg, The breakfast joynt and Orange table). Cool. This certainly makes sense, as most of these businesses have checkins from early morning until early afternoon. The yellow cluster on the top contains various department stores (including Costco, Nordstrom Rack and IKEA). Again meaningful, as most of them open their doors somewhere around 10AM and close around 7PM. At first sight, it seems strange that the coffee places are correlated into two separate groups (yellow cluster at the bottom and pink cluster on the top). The reason however is simple: some of them close in the late afternoon while others are open until midnight.

3. Conclusion

The Neo4J/Gephi solution works remarkably well to visually identify the various business clusters from the Yelp dataset. In a next blog article, we will show how to use the k-nearest neighbours algorithm to automatically predict the type of business based solely upon the checkin information.

Counting triangles smarter (or how to beat Big Data vendors at their own game)

Davy Suvee — Mon, 11 Feb 2013 10:02:00 +0000

A few months ago, I discovered Vertica’s “Counting Triangles”-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a friendship triangle. Counting all triangles within a large network is a rather compute-intensive task. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we have identified a triangle.
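To make the naive approach concrete, here is a small Python sketch of it. This is our own illustration, not the Vertica or Hadoop code, and the function names are assumptions. Each triangle is discovered six times (three possible origin vertices, two directions each), hence the final division.

```python
from collections import defaultdict


def build_neighbours(edges):
    """Store every undirected relationship under both of its endpoints."""
    neighbours = defaultdict(set)
    for source, target in edges:
        if source != target:  # ignore self-loops
            neighbours[source].add(target)
            neighbours[target].add(source)
    return neighbours


def count_triangles_naive(neighbours):
    """Walk origin -> friend -> friend-of-friend and test for the closing edge."""
    hits = 0
    for origin in neighbours:
        for friend in neighbours[origin]:
            for friend_of_friend in neighbours[friend]:
                if origin in neighbours[friend_of_friend]:
                    hits += 1
    # Every triangle is found once per origin vertex and per direction.
    return hits // 6


# Hypothetical toy network with the triangles {0, 1, 2} and {1, 2, 3}.
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3), (3, 4)]
naive_count = count_triangles_naive(build_neighbours(toy_edges))
```

The nested loops are what make this variant expensive on a graph with tens of millions of relationships.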

The Vertica article illustrates how to execute an optimised implementation of the above algorithm through Hadoop and their own Massively Parallel Processing (MPP) Database product (both being run on a 4-node cluster). The dataset involves the LiveJournal social network graph, containing around 86 million relationships, resulting in around 285 million identified triangles. As can be expected, the Vertica solution shines in all respects (counting all triangles in 97 seconds), beating the Hadoop solution by a factor of 40. A few weeks later, the Oracle guys published a similar blog post, using their Exadata platform, beating Vertica’s results by a factor of 7, clocking in at 14 seconds.

Although Vertica and Oracle’s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles through a smarter algorithm that is able to deliver similar performance on commodity hardware (i.e. my MacBook Pro Retina).

1. Doing it the smart way

The LiveJournal social network graph, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a source and target vertex (where each vertex is identified by a unique id). Relationships are assumed to be bi-directional: if person 1 knows person 2, person 2 also knows person 1.

Let’s start by creating a row-like data structure for storing these relationships. The key of each row is the id of the source vertex. The row values are the ids of all target vertices associated with the particular source vertex.

With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let’s improve our data structure by indexing each relationship through its lowest key. So, even though the LiveJournal file declares the relationship as being “2 0”, we persist the relationship by assigning the 2-value to the 0-row. (Order doesn’t matter as relationships are bi-directional anyway.)

Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a triangle, its two adjacent vertices should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we need to check whether we can find edges amongst the vertices contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original source-values. By doing so, we get rid of one expensive for-loop. Nevertheless, the number of calculations that need to be executed is still close to 2 billion!
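The lowest-key trick can be sketched in a few lines of Python. This is an in-memory illustration under our own naming, not the custom Datablend datastore. Every relationship is stored only in the row of its smallest vertex id, and a triangle is counted exactly once: in the row of its smallest vertex, when visiting its middle vertex.

```python
from collections import defaultdict


def build_lowest_key_rows(edges):
    """Index each relationship under the smaller of its two vertex ids."""
    rows = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            rows[min(a, b)].add(max(a, b))
    return rows


def count_triangles(rows):
    """For each row, intersect its values with the rows of those values."""
    total = 0
    for key, values in rows.items():
        for value in values:
            # Shared members close a triangle (key, value, shared member).
            total += len(values & rows.get(value, set()))
    return total


# Same hypothetical toy network as before; note that "2 0" lands in row 0.
toy_edges = [(2, 0), (0, 1), (1, 2), (2, 3), (3, 1), (3, 4)]
rows = build_lowest_key_rows(toy_edges)
triangle_count = count_triangles(rows)
```

The innermost loop of the naive version is replaced by a single set intersection, which is exactly the kind of operation a set-oriented datastore can execute quickly.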

2. Persisting the relationships

The data structure as described above is persisted in a custom datastore that we developed at Datablend for powering the similr-engine (a chemical structure search engine). The datastore is fully persistent and optimised for quickly performing set-based operations (intersections, differences, unions, … ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around 20 seconds on my MacBook Pro. An additional 4 seconds is required for persisting the entire data structure to the datastore itself. So around 25 seconds in total for effectively storing all 86 million relationships. Neither Vertica nor Oracle mentions the time it takes to persist the LiveJournal dataset within their respective databases. However, I assume that this load operation also takes them at least a few seconds.

What about disk usage? The custom Datablend datastore takes second place, requiring only 37 MB more than Oracle’s Hybrid Columnar Compression version.

3. Calculating the triangles

The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 285 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes first place, clocking in at 9 seconds! The calculation runs fully parallelized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!

4. Conclusion

Datablend’s custom datastore is a very specific solution that targets a particular range of Big Data computations. It is by no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don’t hesitate to contact us if you have any questions related to similr and/or Datablend.

Similr: blazingly fast chemical similarity searches

Davy Suvee — Mon, 04 Feb 2013 10:00:38 +0000

Today, Datablend announces that Similr is available for beta sign-up. Similr allows scientists (both from academia and industry) to quickly search for compounds that exhibit a particular chemical structure. It employs a wide range of fingerprinting algorithms which, combined, make it possible to identify matching compounds in milliseconds. Similr’s functionality is available through a flexible and expressive REST API and makes it possible to scan more than 30 million compounds that have been made publicly available through PubChem.

Similr will provide unlimited API access to academics. Free commercial access, limited to 1,000 API calls a month, will be available. Beyond that, customers can choose between a pay-as-you-go subscription or opt for a dedicated installation that allows for the import of (private) compounds.

Similr is being developed by Datablend, a Big Data consultancy company. Datablend’s expertise in Pharma, combined with proficient knowledge of NoSQL technologies, allowed for the development of a highly optimised chemical similarity search algorithm that is able to scan millions of compounds at blazing speeds. Don’t hesitate to contact us if you have any questions related to Similr and/or Datablend.

Redis and Lua: a NoSQL power-horse

Davy Suvee — Tue, 29 Jan 2013 09:59:16 +0000

Recently, I’ve started implementing a number of Redis-based solutions for a Datablend customer. Redis is frequently referred to as the Swiss Army Knife of NoSQL databases and rightfully deserves that title. At its core, it is an in-memory key-value datastore. Values that are assigned to keys can be ‘structured’ through the use of strings, hashes, lists, sets and sorted sets. The power of these simple data structures, combined with its intuitive API, makes Redis a true power-horse for solving various ‘Big Data’-related problems. To illustrate this point, I reimplemented my MongoDB-based molecular similarity search through Redis and its integrated Lua support. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Redis ‘fingerprint’ data model

Molecular similarity refers to the similarity of chemical compounds with respect to their structural and/or functional qualities. By calculating molecular similarities, Cheminformatics is able to help in the design of new drugs by screening large databases for potentially interesting chemical compounds. Chemical similarity can be determined through the use of so-called fingerprints (i.e. linear, substructure chemical patterns of a certain length). Similarity between compounds is identified by calculating the Tanimoto coefficient. This computation involves the calculation of intersections between sets of fingerprints, an operation that is natively supported by Redis.
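As a quick illustration, the Tanimoto coefficient of two fingerprint sets is the size of their intersection divided by the size of their union. The Python helper below is our own minimal sketch, with made-up fingerprint keys:

```python
def tanimoto(fingerprints_a, fingerprints_b):
    """Tanimoto coefficient: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    a, b = set(fingerprints_a), set(fingerprints_b)
    union_size = len(a | b)
    return len(a & b) / union_size if union_size else 0.0


# Two hypothetical compounds sharing 2 of their 4 distinct fingerprints.
score = tanimoto({"f1", "f2", "f3"}, {"f2", "f3", "f4"})
```

A coefficient of 1 means both compounds have identical fingerprint sets, while 0 means they share none.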

Our Redis-based data model for storing fingerprints requires three different data-structures:

For each compound, identified by a unique key, we store its set of fingerprints (where each fingerprint is again identified by a unique key).
For each fingerprint, identified by a unique key, we store the set of compounds containing this fingerprint. These fingerprint sets can be conceived as the inverted indexes of the compound sets mentioned above.
For each fingerprint, we store its number of occurrences through a dedicated weight-key.

Once the fingerprints of a compound have been calculated, only a few Redis commands are needed to fill both indexes (compound->fingerprints and fingerprint->compounds) and to increment the accompanying counters.
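The three structures can be mimicked with plain Python dictionaries. The sketch below is an in-memory stand-in under our own naming, not the actual Redis code. In Redis itself, the two set updates would presumably map onto SADD calls and the counter onto INCR.

```python
from collections import defaultdict


class FingerprintIndex:
    """In-memory stand-in for the three Redis structures described above."""

    def __init__(self):
        self.compound_fps = defaultdict(set)  # compound key -> fingerprint keys
        self.fp_compounds = defaultdict(set)  # fingerprint key -> compound keys
        self.fp_weight = defaultdict(int)     # fingerprint key -> occurrences

    def add(self, compound, fingerprints):
        """Register a compound and update the inverted index and weights."""
        for fp in set(fingerprints):
            if fp in self.compound_fps[compound]:
                continue  # already indexed; do not count it twice
            self.compound_fps[compound].add(fp)
            self.fp_compounds[fp].add(compound)
            self.fp_weight[fp] += 1


# Hypothetical compounds and fingerprint keys.
index = FingerprintIndex()
index.add("c1", ["f1", "f2"])
index.add("c2", ["f2", "f3"])
```

After these two calls, fingerprint "f2" points at both compounds and has a weight of 2.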

2. Finding similar chemical compounds

For retrieving compounds that satisfy a particular Tanimoto coefficient, we reuse the same principles as outlined in my original MongoDB article. The number of round-trips to the Redis datastore is minimised by implementing the algorithm via the built-in Lua scripting support. We start by retrieving the number of fingerprints of the particular input compound. Based upon that cardinality, we calculate the fingerprints of interest (i.e. the min-set of fingerprints that lead us to compounds that are able to satisfy the Tanimoto coefficient). For this, we need to identify the subset of compound fingerprints that occur the least throughout the entire dataset. Redis allows us to perform this query via a single sort-command; it takes the compound-key as input and sorts the contained fingerprints by employing the value of the external fingerprint weight keys. Out of this sorted set of fingerprints, we sub-select the top x fingerprints of interest. What a powerful and elegant command!

We use the inverted index (fingerprint->compounds) to identify those compounds that are able to satisfy the particular input Tanimoto coefficient. Applying the Redis union-command upon the calculated set of fingerprint keys returns the set of potential compounds. Once this set has been identified, we calculate similarity by making use of the Redis intersect-command. Only compounds that satisfy the Tanimoto restriction are returned.
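The whole search can be sketched in plain Python, with dictionaries standing in for the Redis sets and weight keys. The pruning bound used here is our reading of the principle, not code from the original Lua script: a compound with Tanimoto coefficient t or higher must share at least ceil(t * n) of the n query fingerprints, so any n - ceil(t * n) + 1 query fingerprints are guaranteed to hit every valid candidate. Picking the rarest ones keeps the candidate set small. All names below are hypothetical.

```python
from math import ceil


def find_similar(query, compound_fps, fp_compounds, fp_weight, threshold):
    """Return {compound: tanimoto} for compounds at or above the threshold."""
    query_fps = compound_fps[query]
    n = len(query_fps)
    # Minimum number of shared fingerprints; the epsilon guards against
    # floating-point noise pushing ceil() one step too high.
    min_common = ceil(threshold * n - 1e-9)
    probe_count = n - min_common + 1

    # Rarest fingerprints first (plays the role of the Redis sort-command).
    rarest_first = sorted(query_fps, key=lambda fp: (fp_weight[fp], fp))
    probes = rarest_first[:probe_count]

    # Union of the inverted-index sets gives the candidate compounds.
    candidates = set().union(*(fp_compounds[fp] for fp in probes))

    results = {}
    for candidate in candidates:
        other = compound_fps[candidate]
        common = len(query_fps & other)  # the intersect step
        score = common / (n + len(other) - common)
        if score >= threshold:
            results[candidate] = score
    return results


# Hypothetical data: compound "q" is the query.
compound_fps = {
    "q": {"a", "b", "c", "d"},
    "w": {"b", "c", "d"},
    "x": {"a", "b", "c", "e"},
    "y": {"a", "z"},
}
fp_compounds = {}
for compound, fps in compound_fps.items():
    for fp in fps:
        fp_compounds.setdefault(fp, set()).add(compound)
fp_weight = {fp: len(compounds) for fp, compounds in fp_compounds.items()}

matches = find_similar("q", compound_fps, fp_compounds, fp_weight, 0.7)
```

Note that the query compound itself is part of the result (with a score of 1.0). A caller that only wants other compounds would filter it out.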

3. Conclusion

With 25,000 stored compounds, Redis requires less than 20ms to retrieve compounds that are 70% similar to a particular input compound, which is snappier than my original MongoDB implementation. In addition, Redis requires less than 1GB of RAM to maintain a live index of the 460,000 PubChem compounds that have at least one associated assay. This allows scientists to host a local instance of the compound datastore, effectively eliminating the need for a dedicated (and expensive) compound database setup.

Hubway Data Visualization Challenge Entry: the flow of bikers

Davy Suvee — Tue, 16 Oct 2012 09:57:49 +0000

Last week, Hubway announced its Data Visualization Challenge. Hubway is a bike sharing system located in the Boston area: you simply pick up a bike at a particular station and drop it off at the station closest to your destination. For this challenge, Hubway released a CSV file containing over half a million rides. Each entry contains the origin and destination station as well as the timing information and some anonymised demographic information. The purpose of the challenge is to create appealing visualizations that provide Hubway with cool insights into how customers are using their bikes. As I had 8 hours to spare on a flight to New York, I decided to give it a go.

1. Flow of bikers

The goal of my visualization is to depict how bikers flow through the city of Boston, namely: “taking a specific station as starting point, to which other stations are people biking”. A classical, graph-based visualization would show this flow, but would also be quite cluttered, as each origin-destination tuple would have its own edge, thereby failing to provide the grand overview. The use of a flow map, however, would make the visualization both appealing and insightful. Cartographers use flow maps to show the movement of objects from one location to another, such as the number of people in a migration, the amount of goods being traded, or the number of packets in a network. Flow maps reduce visual clutter by merging edges where possible.

Having played around with the library in the past, I remembered somebody releasing a flow map layout implementation. Taking their implementation as a starting point, I applied some modifications (related to the Mercator layout) and supplied it with my pre-processed Hubway biking data. For each station, I can now generate a separate map that visualises the flow of bikers towards other stations, where each station is mapped at its geographically correct location. As can be expected, most people bike to close-by stations, but others seem to enjoy biking to far-off locations. Let’s have a look at a few examples. The image below displays the flow map for the Boston University Central station located at 725 Commonwealth Avenue (A32003). As this station is quite central to the city, we see that people bike off in almost all directions, although most of them keep close to the Charles River.

If we generate the flow map for a biking station near the edge of the city, such as Andrew Station on Dorchester Avenue (C32012), an entirely different flow pattern can be observed, as biking destinations are concentrated on the east side of Boston.
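The pre-processing behind these maps essentially boils down to counting rides per origin-destination pair. The Python sketch below shows one way to do this. The column names (start_station, end_station) and the sample rows are assumptions for illustration and do not necessarily match the real Hubway CSV header.

```python
import csv
from collections import Counter, defaultdict
from io import StringIO


def flows_by_origin(rows):
    """Count rides per destination, grouped by origin station."""
    flows = defaultdict(Counter)
    for row in rows:
        origin = row["start_station"]
        destination = row["end_station"]
        if origin and destination:  # skip rides with a missing station
            flows[origin][destination] += 1
    return flows


# Hypothetical sample in place of the real rides file.
sample = StringIO(
    "start_station,end_station\n"
    "A32003,C32012\n"
    "A32003,C32012\n"
    "A32003,X1\n"
    "C32012,A32003\n"
    "C32012,\n"
)
flows = flows_by_origin(csv.DictReader(sample))
```

The counter of a single station, for instance `flows["A32003"]`, is the input a flow map layout needs for that station: a list of destinations with the number of bikers heading there.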

2. Conclusion

The current application could easily be extended to filter trips on demographics and/or timing information. One could also overlay various flow maps in order to detect similarities between flows of bikers. If people are interested in extending my implementation, I am willing to upload my “code-hacking” to GitHub so that the project can be forked. Just let me know.