Wednesday, February 15, 2017

Graph Data Representation for Trillion Edge Graphs




In this blog post, I will summarize the graph formats that are widely used in graph-processing research. For a graph G = (V, E), where V is the set of vertices and E the set of edges, there are a number of popular representations. We describe each of them for the sample graph shown in Figure (a) below.

Edge List representation is a collection of tuples of vertices, where each tuple represents a single edge. If two vertices vi and vj are connected, the edge is represented as the tuple (vi, vj). The size of the edge list representation equals the edge count times twice the size of a vertex ID. Figure (b) shows the edge list representation of Figure (a).


Graph engines such as GraphChi and X-Stream use this format.
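As a minimal sketch (the edges below are invented for illustration, not the exact graph of Figure (a)), an edge list is simply an array of (source, destination) tuples, and its storage cost follows directly:

```python
# Edge list: one (source, destination) tuple per edge.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]

# Storage cost: |E| edges * 2 vertex IDs per edge.
vertex_size_bytes = 4  # assuming 4-byte vertex IDs
size_bytes = len(edges) * 2 * vertex_size_bytes
print(size_bytes)  # 32 bytes for this tiny graph
```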


Compressed Sparse Row (CSR) groups the edges of each vertex together and stores them in an adjacency list array (adj-list). A separate beginning position array (beg-pos) holds the index of the first edge of each vertex. The size of the CSR representation of a graph equals the size of the adjacency list (|E|) plus the size of the beginning position array (|V|). Figure (c) shows the CSR representation of the example graph.



CSR is one of the most widely used formats, and there have been a number of suggestions for improving it. One can see the Ligra+ paper for details on how compression can squeeze the adj-list, which has been shown to perform better for many graph algorithms. One can also look into the CSR5 format.
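A minimal sketch of building CSR from an edge list (the edges are invented for illustration) shows the roles of beg-pos and adj-list:

```python
# Build CSR (beg-pos + adj-list) from an edge list.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
num_vertices = 4

# Count the out-degree of each vertex.
degree = [0] * num_vertices
for src, _ in edges:
    degree[src] += 1

# beg_pos[v] = index of v's first edge in adj; |V|+1 entries in practice
# so that beg_pos[v+1] marks the end of v's neighbor range.
beg_pos = [0] * (num_vertices + 1)
for v in range(num_vertices):
    beg_pos[v + 1] = beg_pos[v] + degree[v]

# Fill the adjacency array.
adj = [0] * len(edges)
fill = beg_pos[:num_vertices]  # next free slot per vertex
for src, dst in edges:
    adj[fill[src]] = dst
    fill[src] += 1

# Neighbors of vertex 0 are adj[beg_pos[0]:beg_pos[1]].
print(adj[beg_pos[0]:beg_pos[1]])  # [1, 2]
```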

Matrix format consists of rows that represent the source-vertex IDs of the edges and columns that represent the destination vertices. It is generally stored in bitwise form, where each row is a sequence of zeros and ones; a one represents an edge from the source to the destination vertex. Figure (d) shows the same graph in the bitwise matrix format.

This format is best for dense graphs, as each edge needs just one bit. Symmetry can also be exploited easily, providing further space efficiency. However, most real-world graphs are sparse, so you will hardly find any paper using this format.
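A minimal bitwise-matrix sketch (edges invented for illustration): each row is packed into an integer whose j-th bit marks an edge to vertex j.

```python
# Bitwise adjacency matrix: row i is an integer whose j-th bit is set
# iff there is an edge from vertex i to vertex j.
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
num_vertices = 4

rows = [0] * num_vertices
for src, dst in edges:
    rows[src] |= 1 << dst

# Test for edge (0, 2): check bit 2 of row 0.
print(bool(rows[0] & (1 << 2)))  # True
```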

2D Partitioned Edge List is a two-level classification of the edge list format, where edge tuples are partitioned first by the ID of the source vertex and then by the ID of the destination vertex. Figure (e) shows a 2x2 partitioning of the sample graph, where the edges are separated into partitions based on the vertex-ID ranges 0-3 and 4-7.


The GridGraph work used the 2D partitioned edge-list format in a single-machine setup, while various earlier works used 2D partitioning of CSR for BFS on BlueGene supercomputers.
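The partitioning step can be sketched as follows (the edges are invented for illustration; the tile boundaries follow the 0-3 / 4-7 ranges from Figure (e)):

```python
# 2x2 partitioning of an edge list over an 8-vertex graph:
# vertex-ID ranges 0-3 and 4-7 along each dimension.
edges = [(0, 5), (1, 2), (4, 6), (6, 1), (3, 7)]
tile_size = 4  # vertices per partition along each dimension

tiles = {}  # (row, col) -> list of edges falling in that tile
for src, dst in edges:
    key = (src // tile_size, dst // tile_size)
    tiles.setdefault(key, []).append((src, dst))

# Tile (0, 1): source in 0-3, destination in 4-7.
print(tiles[(0, 1)])  # [(0, 5), (3, 7)]
```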

2D Partitioned Edge List using symmetry is the same as Figure (e) but, for undirected graphs, stores only the upper half (or lower half) of the graph data, as shown in Figure (f). For directed graphs, one can decide to store only the in-edges (or only the out-edges).


For undirected graphs, enabling symmetry in the CSR format would incur a high I/O cost for some algorithms (e.g., BFS), which is likely the reason why no existing graph engine supports it. For directed graphs, exploiting symmetry is not possible for many algorithms (e.g., SCC)
that need both in-edges and out-edges.

SNB (Smallest Number of Bits) representation is based on the observation that another type of redundancy exists in the graph data: in a 2D partitioned graph, for all the edge tuples within a partition, the most significant bits (MSBs) of the source and destination vertex IDs are identical and hence redundant. The G-Store graph engine is the first work to exploit this redundancy.

For the example graph in Figure (a), three bits would normally be required to represent any vertex ID from zero to seven, thus six bits in total for an edge tuple. Interestingly, fewer bits are required if one separates the representation of a partition from the edges within it. For example, for each edge tuple in tile[1,1], the MSBs of the source and destination vertices are always [1,1]. Similarly, the MSBs of the source and destination vertices of each edge tuple in tile[1,0] are always [1,0]. In the example shown in Figure (g), one may use two bits per vertex and thus four bits per edge tuple, saving two bits per edge tuple. Of course, additional storage (two bits in this case) is needed to represent the IDs of the tiles, but the total storage cost is much smaller.

To recover the original edge tuple from this compact representation, one must know which tile the edge tuple belongs to. This is easily done by concatenating the tile ID with the compact vertex ID. In the example, tile[1,1] has the offset (4,4), so the edge tuple (0,1) in this tile represents the edge (4,5) in the original graph. G-Store calculates and caches the offset pointers of algorithmic metadata for each tile, so that any access to the metadata for the whole tile can simply be indexed with the compact vertex IDs. For each metadata array, two operations are needed per tile: one serves the compact source vertices and another the compact destinations.
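The encode/decode step can be sketched for the eight-vertex example (the helper names are ours, not from the paper): the tile ID carries the MSBs, and each stored vertex keeps only its low bits.

```python
# SNB encoding for a 2x2-partitioned 8-vertex graph: within a tile,
# store only the low bits of each vertex ID; the tile ID supplies the MSBs.
TILE_BITS = 2               # 2 low bits kept per vertex (a tile spans 4 IDs)
MASK = (1 << TILE_BITS) - 1

def encode(src, dst):
    """Split an edge into its tile ID and compact in-tile tuple."""
    tile = (src >> TILE_BITS, dst >> TILE_BITS)
    return tile, (src & MASK, dst & MASK)

def decode(tile, compact):
    """Concatenate tile ID (MSBs) with compact vertex IDs (LSBs)."""
    src = (tile[0] << TILE_BITS) | compact[0]
    dst = (tile[1] << TILE_BITS) | compact[1]
    return src, dst

# Edge (4, 5) lands in tile [1, 1] and is stored as compact tuple (0, 1).
tile, compact = encode(4, 5)
print(tile, compact)          # (1, 1) (0, 1)
print(decode(tile, compact))  # (4, 5)
```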


In the G-Store paper, the authors allocate two bytes to represent each vertex and four bytes for an edge tuple, which can provide up to an 8x space improvement over the edge-list format and a 4x improvement over CSR.
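One way these ratios can come about (this accounting is our reconstruction, not taken verbatim from the paper) is to assume 8-byte vertex IDs, which large graphs need once vertex counts exceed the 32-bit range, and that symmetry halves the number of edges G-Store must store:

```python
# Back-of-the-envelope space accounting (our reconstruction, hedged):
# assume 8-byte vertex IDs for a very large graph, and that storing
# only the upper triangle (symmetry) halves the edges G-Store keeps.
vertex_id_bytes = 8

edge_list_per_edge = 2 * vertex_id_bytes  # 16 bytes: (src, dst) tuple
csr_per_edge = vertex_id_bytes            # 8 bytes: one adj-list entry
gstore_per_edge = 4 / 2                   # 4-byte tuple, halved by symmetry

print(edge_list_per_edge / gstore_per_edge)  # 8.0x over edge list
print(csr_per_edge / gstore_per_edge)        # 4.0x over CSR
```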

Note: All the diagrams and text are mostly from the following SC'16 paper: G-Store: High-Performance Graph Store for Trillion-Edge Processing. [PDF] [PPT] [Code].

Thursday, August 11, 2016

G-Store: High-Performance Graph Store for Trillion-Edge Processing

Graphs have become a very hot topic in the research community due to their applications in social networking, bio-sciences, recommendation systems, the world-wide web, etc. The size of such graphs is exploding, reaching billions of vertices and trillions of edges. A simple Google search will turn up a number of recently proposed graph frameworks. In this blog post we will cover some of the challenges that make it extremely difficult to process large graphs (billions of vertices and trillions of edges) on a single machine.

A recent Supercomputing (SC'16) paper titled "G-Store: High-Performance Graph Store for Trillion-Edge Processing" takes on these challenges for trillion-edge graphs, and proposes a number of techniques such as a space-efficient representation of graph data, a hardware-cache-friendly on-disk data layout, and a proactive caching policy designed specifically to enable trillion-edge graph processing on a single server machine. On top of these, a slide-cache-rewind technique helps to overlap I/O, processing, and caching analysis. This strategy also ensures that no data is discarded without being analyzed by the proactive caching policy.

Watch this space for more updates.

Keywords: Trillion edge graph,  external graph engine, semi-external graph processing, extreme graphs.