Counting triangles smarter (or how to beat Big Data vendors at their own game)
A few months ago, I discovered Vertica’s “Counting Triangles”-article through Prismatic. The blog post describes a number of benchmarks on counting triangles in large networks. A triangle is detected whenever a vertex has two adjacent vertices that are also adjacent to each other. Imagine your social network; if two of your friends are also friends with each other, the three of you define a friendship triangle. Counting all triangles within a large network is a rather compute-intensive task. In its most naive form, an algorithm iterates through all vertices in the network, retrieving the adjacent vertices of their adjacent vertices. If one of the vertices adjacent to the latter vertices is identical to the origin vertex, we identified a triangle.
The Vertica article illustrates how to execute an optimised implementation of the above algorithm through Hadoop and their own Massive Parallel Processing (MPP) Database product (both being run on a 4-node cluster). The dataset involves the LiveJournal social network graph, containing around 86 million relationships, resulting in around 285 million identified triangles. As can be expected, the Vertica solution shines in all respects (counting all triangles in 97 seconds), beating the Hadoop solution by a factor of 40. A few weeks later, the Oracle guys published a similar blog post, using their ExaData platform, beating Vertica’s results by a factor of 7, clocking in at 14 seconds.
Although Vertica and Oracle’s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles through a smarter algorithm that is able to deliver similar performance on commodity hardware (i.e. my MacBook Pro Retina).
1. Doing it the smart way
The LiveJournal social network graph, about 1.3GB in raw size, contains around 86 million relationships. Each line in this file declares a relationship between a source and target vertex (where each vertex is identified by an unique id). Relationships are assumed to be bi-directional: if person 1 knows person 2, person 2 also knows person 1.
Let’s start by creating a row-like data structure for storing these relationships. The key of each row is the id of the source vertex. The row values are the id’s of all target vertices associated with the particular source vertex.
With this structure in place, one can execute the naive algorithm as described above. Unfortunately, iterating four levels deep will result in mediocre performance. Let’s improve our data structure by indexing each relationship through its lowest key. So, even though the LiveJournal file declares the relationship as being “2 0“, we persist the relationship by assigning the 2-value to the 0-row. (Order doesn’t matter as relationships are bi-directional anyway.)
Calculating triangles becomes a lot easier (and faster) now. If the key of a row is part of a triangle, its two adjacent vertices should be in its list of values (as by definition, the row key is the smallest vertex id of the three of them). Hence, we need to check whether we can find edges amongst the vertices contained within each row. So, for each row, we iterate through its list of values. For each of these values, we retrieve the associated row and verify whether one of its values is part of the original source-values. By doing so, we get rid of one expensive for-loop. Nevertheless, the amount of calculations that need to be executed is still close to 2 billion!
2. Persisting the relationships
The data structure as described above is persisted in a custom datastore that we developed at Datablend for powering the similr-engine (a chemical structure search engine). The datastore is fully persistent and optimised for quickly performing set-based operations (intersections, differences, unions, … ). Parsing the 86 million relationships and creating the appropriate in-memory data structure takes around 20 seconds on my MacBook Pro. An additional 4 seconds is required for persisting the entire data structure to the datastore itself. So around 25 seconds in total for effectively storing all 86 million relationships. Vertica nor Oracle mention the time it takes to persist the Livejournal dataset within their respective databases. However, I assume it also requires them a few seconds to execute this load-operation.
What about disk usage? The custom Datablend datastore takes the second place, requiring only 37 Mb more compared to Oracle’s Hybrid Columnar Compression version.
3. Calculating the triangles
The Oracle setup (on a cluster of 4 nodes, each with 96GB of RAM and 12 cores) is able to calculate the 265 million triangles in 14 seconds. The optimised algorithm described above, running on the custom Datablend datastore, takes the first place, clocking in at 9 seconds! The calculation runs fully pararellized on my MacBook Pro Retina and has a peak use of only 2.11 GB of RAM!
Datablend’s custom datastore is a very specific solution that targets a particular range of Big Data computations. It is in no means as generic and versatile as the MPP database solutions offered by both Vertica and Oracle. Nevertheless, the article tries to illustrate that one does not require a large computing cluster to execute particular Big Data computations. Just use the most appropriate/smart solution to solve the problem in an elegant and fast way. Don’t hesitate to contact us if you have any questions related to similr and/or Datablend.