Memory and IO Lab: Graph Data Representation for Trillion Edge Graphs

In this blog post, I will try to summarize the general graph formats that are widely used in various research works. For a graph G = (V, E), where V represents the set of the vertices and E the edges, there are a number of popular graph representations. We describes each of them for the sample graph shown in Figure (a) below.

Edge List representation is defined as a collection of tuples of vertices and each tuple represents a single edge. If two vertices vi and vj are connected then the edge tuple is represented as

(vi, vj). The size of the edge list representation of a graph equals to the product of the edge count and twice of the size of a vertex. Figure (b) shows the edge list representation of Figure (a).

Papers like graphchi and X-Stream graph engine use this format

Compressed Sparse Row (CSR) groups the edges of a vertex together and are stored in the adjacency list array (adj-list). There is a separate data structure of the beginning position array (beg-pos) that contains the index of the first edge for each vertex. The size of CSR representation of a graph equals the size of adjacency list (|E|), plus the size of beginning position array (|V |). Figure (c) shows the CSR representation of the example graph.

CSR format is one of most used format, and there has been a number of suggestion on improving this format. Once can see ligra++ paper for details on how to use compression to squeeze the adj-list which has shown to perform better for many graph algorithms. One can also look into CSR5 format as well.

Matrix format consists of rows that represent the IDs of the source vertices of the edges and columns that represent the destination vertices. This is generally represented as the bitwise format where each row is a sequence of zeros and ones. The value of one represents an edge from the source to the destination vertex. Figure 1(d) shows the same graph in the bitwise matrix format.

This format is best for dense graphs as it just need one bit. Also, symmetry could be utilized easily, providing space efficiency as well. However, most real world graph are sparse thereby, you will hardly find any paper using this format.

2D Partitioned Edge List is a 2-level classification of the edge list format, where edge tuples are first partitioned using the ID of the source vertex followed by the ID of the destination vertex. Figure (e) shows the 2x2 partition of the sample graph, where the edges are separated into different partitions based on vertex ID range 0-3 and 4-7.

Gridgraph work used the 2D partitioned edge-list format in a single-machine setup, however, various previous works used 2-D partitioning in CSR for BFS in Bluegene supercomputers.

2D Partitioned Edge-list using symmetry is same as figure (e) but store only upper half (or lower half) for un-directed graphs of graph data as shown in figure (F). For directed graphs, one can decide to just store the in-edges (or out-edges) only.

For undirected graphs, enabling symmetry in the CSR format would incur high I/O cost for some algorithms (e.g., BFS) which is likely the reason why no existing graph engine supports it, and for directed graphs, the utilization of symmetry is not possible for many algorithms (e.g., SCC)

which need both in-edges and out-edges.

SNB (Smallest Number of Bits) representation. This format is based on the observation that another type of redundancy exists in the graph data, that is, in a 2D partitioned graph, for all the edge tuples within a partition, the most-significant-bits (MSBs) of IDs of source and destination vertices are identical and hence redundant. G-Store graph engine is the first paper to exploit this redundancy.

For the example graph in Figure (a), normally a three-bit storage would be required to represent any vertex whose ID is from zero to seven, thus six bits in total for an edge tuple. Interestingly, fewer number of bits will be required, if one separates the representation of a partition and the edges in this partition. For example, for each edge tuple in tile[1,1], the MSBs of source and destination vertices are always [1,1]. Similarly, the MSBs of source and destination vertices of each edge tuple in tile[1,0] are always [1,0]. In this example shown in Figure (g), one may use two bits to represent one vertex and thus four bits for an edge tuple, saving two bits per edge tuple. Of course, there is additional storage (two bits in this case) needed to represent the IDs of the tiles, but the total storage cost is much smaller.

To calculate the original edge tuple from this compact representation, one must know to which tile an edge tuple belongs. This can be easily done by concatenating the tile ID to the vertex ID. In the example, tile[1,1] has the offset of (4,4), and the edge tuple (0,1) in this tile will represent the edge (4,5) in the original graph data. We calculate and cache the offset pointers of algorithmic metadata for each tile, so that any accesses to the metadata for the whole tile can be simply indexed with the compact vertex IDs. For each metadata, two operations will be needed per tile: one serves the compact source vertices and another compact destination.

In the G-Store paper, authors allocate two bytes to represent each vertex and four bytes for an edge tuple which can provide upto 8x space improvement over edge-list and 4x improvement over CSR format.

Note: All the diagrams and texts are mostly from the following SC'16 paper: G-Store: High-Performance Graph Store for Trillion-Edge Processing.[PDF] [PPT] [Code].

Memory and IO Lab

Wednesday, February 15, 2017

Graph Data Representation for Trillion Edge Graphs

No comments:

Post a Comment