Work efficient parallel algorithms for large graph exploration on emerging heterogeneous architectures

https://doi.org/10.1016/j.jpdc.2014.11.006

Highlights

  • Processing real-world graphs efficiently through input pruning.

  • Two different pruning strategies, based on degree-1 nodes and articulation points.

  • Improvements of up to 35%, or 1.57x, over the current best known results.

  • Experimental evaluation of the proposed algorithms on several real-world graphs.

  • A heterogeneous multicore implementation provides better performance and efficiency.

Abstract

Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of these problems on a variety of modern multiprocessor architectures.

The size of the graphs that correspond to real-world datasets has been increasing in recent times. Parallelism offers only limited succor in this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also getting very sparse. This calls for particular solution strategies aimed at processing large, sparse graphs on modern parallel architectures.

In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained on the pruned graph, the solution is extended to the entire graph. Towards this end, we investigate two pruning strategies whose use is justified by the structure of current real-world graphs.

We apply the above technique to three fundamental graph algorithms: Breadth First Search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). For our experiments, we use real-world graphs from three different sources. To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average improvement of 35% compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.

Introduction

Graph algorithms find a large number of applications in engineering and scientific domains. Prominent examples include solving problems arising in VLSI layouts, phylogeny reconstruction, data mining, image processing, and the like. Some of the most commonly used graph algorithms are graph exploration algorithms such as Breadth First Search (BFS), computing connected components, and finding shortest paths. As current real-life problems often involve the analysis of massive graphs, parallel solutions often provide an acceptable recourse.

Parallel computing on graphs, however, is often very challenging because of the irregular nature of the memory accesses involved. This irregularity stresses the I/O systems of most modern parallel architectures. It is therefore not surprising that most of the recent progress in scalable parallel graph algorithms is aimed at addressing these challenges via innovative use of data structures, memory layouts, and SIMD optimizations [36], [20], [39]. Recent results have made efficient use of modern parallel architectures such as the Cell BE [39], GPUs [36], [21], [20], Intel multi-core architectures [12], [49], [1], and the like. Algorithms running on GPUs have shown standout performance among these because of their massive parallelism.

Another recent development in parallel computing is the design and engineering of heterogeneous algorithms aimed at heterogeneous computing platforms. Such platforms consist of tightly coupled heterogeneous devices, including CPUs and accelerator(s). One such example is a CPU coupled with a graphics accelerator (GPU). Heterogeneous algorithms for CPU+GPU computational platforms have also been designed for graph breadth-first exploration [21], [36], [18]. All of the above-cited works show an average of 2x improvement over pure GPU algorithms.

Most of the above works in general aim at data structure and memory layout optimizations but largely run classical algorithms on the entire input graph. These algorithms are designed for general graphs, whereas current generation graphs possess markedly distinguishable features: they are large and sparse, with large deviations in vertex degrees. In Fig. 1, we show some of the real-world graphs taken from [45]. As can be seen from Fig. 1, these graphs have several vertices of very low degree, often as low as 1. For instance, in the case of the graph web-Google, 14% of the vertices have degree 1. Table 1 lists other properties of a few real-world graphs from [45].

Current parallel algorithms and their implementations [36], [18], [39], [21], [47] do not take advantage of the above properties. For instance, in a typical implementation of the breadth-first search algorithm, one uses a queue to store the vertices that have to be explored next. But a vertex v of degree 1 that is in the queue will not lead to the discovery of any yet undiscovered vertices. So the actions of BFS with respect to v, such as adding it to the queue, dequeuing it, and then realizing that there are no new vertices that can be discovered through vertex v, are all unnecessary. These actions can unfortunately be quite expensive on most modern parallel architectures, as one has to take into account the fact that the queue is accessed concurrently. Similarly, other operations, such as checking the status of a vertex, may be quite dispensable.

In light of the above, we posit that new algorithms and implementation strategies are required for efficient processing of current generation graphs on modern multicore architectures. Such strategies should help algorithms and their implementations benefit from the properties of the graphs. In this paper, we propose input pruning as a technique in this direction. Input pruning aims to reduce the size of the graph by pruning away certain elements of the graph. The required computation is then performed on the remaining graph. The result of this computation is then extended to the pruned elements, if necessary.

In this paper, we apply the input pruning technique to three important graph algorithms: breadth-first search, connected components, and all-pairs shortest paths (APSP). In each case, we show that iteratively pruning degree-one (or pendant) nodes can reduce the size of real-world graphs, by as much as 25% in some cases. This reduction in size helps us achieve remarkable improvements in speed for the above three workloads, by an average of 35%. In addition, we also perform pruning based on articulation points, where we identify the biconnected components of the graph and the suitable articulation points at which the smaller components can be pruned. We use this strategy to then perform APSP on each of the smaller components. The results are then merged to provide the final shortest paths of the original graph. This approach gives a 1.57x speedup over state-of-the-art results.

Algorithm and implementation decisions based on the nature of the graph are an emerging area of research. In [16], the authors propose a Distributed Leaf Pruning (DLP) strategy that helps achieve a significant speedup in distributed communication networks. In that work, the authors observed that in many real-life networks, such as CAIDA, the number of edges of a graph with n nodes is very close to n, and the number of nodes with degree 1 is typically high. Pruning these nodes from the graph thus provided much better performance for packet forwarding strategies over the entire network.

In [38], Pattabiraman et al. present novel pruning techniques that solve the maximum clique problem on large sparse graphs. Their main idea is to prune vertices that have strictly fewer neighbors than the size of the largest clique computed so far. Such vertices can be exempted from the computation, since any clique containing them cannot be larger than the maximum one already computed.

In another work, Cong and Bader [14] explore an experimental technique for the computation of biconnected components on symmetric multiprocessors. Towards this, the authors propose a modification of the well-known Tarjan's algorithm [44] for computing biconnected components. The authors show how to find and remove non-essential edges that do not affect the biconnected components of the graph. By removing such non-essential edges, the time taken to find the biconnected components can be reduced vastly, as shown by Cong and Bader [14]. In a more recent work [22], the authors show a pruning-based strategy for identifying the strongly connected components (SCCs) of a directed graph. They propose a trimming technique for small-world graphs that exploits their structural properties. It is observed that such graphs tend to possess one giant SCC, of size Θ(n), and many SCCs of very small size, including SCCs of size 1 and size 2. The authors of [22] essentially identify the small-sized SCCs quickly and “trim” the input graph accordingly. The actual algorithm is then run on the remaining graph, thereby increasing the overall performance.

Input pruning has been used as a technique in the design of work-optimal parallel algorithms in the PRAM model. Popular examples include the list ranking algorithm of Anderson and Miller [2], the optimal merging algorithm [13], the optimal range minima algorithm [40], and so on. In all of these cases, the size of the input is reduced by a non-constant factor, after which a slightly non-optimal algorithm is employed. In a post-processing phase, the results on the reduced input are extended to obtain a result for the entire input.

Many recent works in parallel computing have focused on graph algorithms. A few among them include [12], [21], [48], [36], [9]. The work of Scarpazza et al. [39] demonstrates the use of an all-to-all exchange of visited-node information in a BFS execution across the eight SPUs of a Cell BE. One of the first results on BFS using GPUs is the work of Harish et al. [20]. Subsequent improvements to [20] centered around the use of heterogeneous computing. In [21], Hong et al. use a CPU+GPU platform where the levels of the BFS with fewer discovered nodes are processed on the CPU and levels with a large number of discovered nodes are processed on the GPU. Using such a heterogeneous strategy, they achieve a throughput of 0.4 Beps (billion edges per second) on Erdos–Renyi random graphs. These results are improved further by Bader et al. [36]. Another work on parallel BFS was presented by Beamer et al. [9], where the authors show an improvement of up to 4.8x on real-world graphs using their direction-optimizing approach. Some of the prominent works on multicore CPUs include [12], where the primary goal is to map the data structures to the cache hierarchy so as to improve cache hit rates. A recent work [18] partitions the graph so that low-degree vertices are processed on the GPU and high-degree vertices are processed on the CPU.

Finding the connected components of a graph is also an important primitive and has hence attracted a lot of attention within the parallel computing community. Popular parallel algorithms in the PRAM model include the algorithm of Shiloach and Vishkin [41] and its variants by Greiner [19]. On GPUs, a variant of the Shiloach–Vishkin algorithm [41] is used by Soman et al. [42]. A heterogeneous execution of this algorithm on a CPU+GPU platform, with an improvement of 35% on average, is shown in [6].

The all-pairs shortest paths (APSP) problem is yet another fundamental graph problem with several applications. One of the earliest works on the parallel shortest paths problem was proposed by Micikevicius in [34]. In this work, the author proposed a parallel implementation of the popular Floyd–Warshall algorithm on an early-generation FX5900 GPU, which showed speedups of up to 3x over sequential CPU code. In a more recent work [46], Venkataraman et al. proposed a blocked parallel implementation of the APSP problem. Here, the authors presented a more cache-efficient algorithm that utilizes the cache hierarchies present in CPUs to provide a 1.6x to 1.9x speedup. In [25], the authors show a new APSP implementation that utilizes the available shared memory efficiently on a G80 GPU. They employ a transitive-closure-based technique to adapt the Floyd–Warshall algorithm to single and multiple GPUs. This is the current best known result for the APSP problem on GPUs. Matsumoto et al. proposed a hybrid APSP algorithm based on the work in [46], [33]. Here, the authors used a block-based structure to minimize the communication overhead between the CPU and the GPU, making the approach more efficient. However, the work does not experiment with massive graphs.

More recent works in this direction are summarized below. Djidjev et al. [17] use graph decomposition via ParMETIS [24], compute shortest paths within the partitions, and extend these to paths across partitions. The success of their approach depends on two factors: the ability to find a good partition, and the ability to find paths across partitions quickly. They work mostly with planar graphs to ensure a good partition.

In this paper, we focus on graph BFS, connected components, and all-pairs shortest paths. For these three graph algorithms, we first show that a similar preprocessing phase can help reduce the size of the graph by an average of 35% on a wide variety of real-world graphs. This helps us obtain an average 40% speed-up compared to the best known implementations of the above problems on similar platforms.

Our preprocessing simply involves removing pendant nodes from the graph. This is done iteratively, so that nodes on pendant paths are also removed during preprocessing. In the post-processing phase, we show that extending the output of the computation on the smaller graph can be done in a very straightforward and quick manner.
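To make the preprocessing phase concrete, the sketch below shows one way to remove pendant nodes iteratively on an adjacency-list representation while recording the order of removal for later use. This is only an illustrative sequential sketch in C++; the function name prune_pendants and the bookkeeping arrays are our own choices and not the paper's (parallel) implementation.

    #include <queue>
    #include <vector>

    // Illustrative sketch: iteratively remove degree-1 (pendant) vertices.
    // Returns the order in which vertices were pruned, so that results can
    // later be extended to them in reverse order during post-processing.
    std::vector<int> prune_pendants(const std::vector<std::vector<int>>& adj,
                                    std::vector<bool>& removed) {
        const int n = static_cast<int>(adj.size());
        std::vector<int> degree(n);
        std::queue<int> pendants;
        removed.assign(n, false);

        for (int v = 0; v < n; ++v) {
            degree[v] = static_cast<int>(adj[v].size());
            if (degree[v] == 1) pendants.push(v);
        }

        std::vector<int> order;  // vertices in the order they were pruned
        while (!pendants.empty()) {
            int v = pendants.front();
            pendants.pop();
            if (removed[v]) continue;
            removed[v] = true;
            order.push_back(v);
            for (int u : adj[v]) {
                if (removed[u]) continue;
                if (--degree[u] == 1) pendants.push(u);  // pendant paths peel off iteratively
            }
        }
        return order;
    }

Processing the recorded order in reverse during post-processing lets each pruned vertex read the result of its unique surviving neighbor, which is what makes the extension step straightforward.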

In Fig. 2, we show the overall improvements achieved by our implementations on graphs from Table 1. Further results on more graphs from the UFL Sparse Matrix Collection [45] and on random graphs generated using the R-MAT synthetic graph generator [11] are shown in our previous work [7].

Some of our specific contributions are as follows:

  • Our results improve the state-of-the-art for graph BFS by 35%. We achieve an average throughput of 2 billion edges per second on a wide range of datasets including graphs from the University of Florida collection  [45], and graphs generated using the Recursive Matrix Model (R-MAT).

  • On the connected components problem, we get an average 20% improvement over the best known result on an identical platform [6]. A small change to the algorithm can also build a spanning tree of the graph in very little extra time.

  • For computing the shortest path between all pairs of nodes, we achieve an average of 44% improvement compared to the best known result of  [25] on a similar platform.

  • Using the second pruning strategy, based on biconnected components, we achieve a speedup of 1.57x on average over similar graphs.

Section snippets

A brief overview of our experimental platform

In this section, we briefly describe our hybrid computing platform. Our hybrid platform is a coupling of the two devices described above, the Intel i7 980 CPU and the Nvidia GTX 580 GPU. The CPU and the GPU are connected via a PCI Express version 2.0 link. This link supports a data transfer bandwidth of 8 GB/s between the CPU and the GPU. To program the GPU, we use the CUDA API Version 4.1. The CUDA API Version 4.1 supports an asynchronous concurrent execution model so that a GPU kernel call does not

Our approach

In this section, we present a three-phase technique, outlined in Algorithm 1, for scalable parallel graph algorithms on real-world graphs. In the first phase, called the preprocessing phase, we reduce the size of the input graph by removing redundant elements of the graph. Once the graph size is reduced, the second phase involves using existing algorithms to perform the computation on the smaller graph. In a final phase, we then extend the result of the computation to the entire original graph via
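As a rough outline of how the three phases compose (not the paper's Algorithm 1 itself), the following hypothetical C++ driver shows the control flow; the names run_pruned, preprocess, compute, and postprocess are placeholders of our own.

    #include <functional>
    #include <vector>

    struct Graph { std::vector<std::vector<int>> adj; };

    // Hypothetical driver for the three-phase technique:
    //  1) preprocess:  shrink the input graph and record what was pruned,
    //  2) compute:     run an existing algorithm on the smaller graph,
    //  3) postprocess: extend the result to the pruned elements.
    template <typename Result>
    Result run_pruned(
        const Graph& g,
        std::function<Graph(const Graph&, std::vector<int>&)> preprocess,
        std::function<Result(const Graph&)> compute,
        std::function<Result(const Result&, const Graph&, const std::vector<int>&)> postprocess) {
        std::vector<int> pruned;                 // bookkeeping about removed elements
        Graph smaller = preprocess(g, pruned);   // phase 1: input pruning
        Result partial = compute(smaller);       // phase 2: computation on the smaller graph
        return postprocess(partial, g, pruned);  // phase 3: extension to the original graph
    }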

Breadth first search

Breadth First Search (BFS) is one of the most widely used graph algorithms and finds numerous applications in the domains of state space partitioning, graph partitioning, theorem proving, and networks. The problem statement of BFS is: given an undirected, unweighted graph G(V,E) and a source vertex S, compute the minimum number of edges needed to reach every vertex of G from S. The optimal sequential solution to this problem runs in O(V + E) time [15].

The well known sequential
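For reference, the sketch below pairs a plain sequential BFS on the pruned graph with the post-processing step that assigns levels to the pruned pendant vertices in reverse pruning order. It assumes the source vertex was not pruned and reuses the bookkeeping from the pruning sketch shown earlier; it is our own illustration, not the authors' CUDA implementation.

    #include <algorithm>
    #include <climits>
    #include <queue>
    #include <vector>

    // Sequential BFS levels on the pruned graph; removed[] marks pruned vertices
    // and the source is assumed to survive pruning.
    std::vector<int> bfs_levels(const std::vector<std::vector<int>>& adj,
                                const std::vector<bool>& removed, int source) {
        std::vector<int> level(adj.size(), INT_MAX);
        std::queue<int> q;
        level[source] = 0;
        q.push(source);
        while (!q.empty()) {
            int v = q.front(); q.pop();
            for (int u : adj[v]) {
                if (removed[u] || level[u] != INT_MAX) continue;
                level[u] = level[v] + 1;
                q.push(u);
            }
        }
        return level;
    }

    // Post-processing: a pruned pendant vertex sits one level below its unique
    // surviving neighbor, so reinstate vertices in reverse pruning order.
    void extend_levels(const std::vector<std::vector<int>>& adj,
                       const std::vector<int>& prune_order,
                       std::vector<bool>& removed, std::vector<int>& level) {
        for (auto it = prune_order.rbegin(); it != prune_order.rend(); ++it) {
            int v = *it;
            removed[v] = false;  // reinstate v
            for (int u : adj[v])
                if (!removed[u] && level[u] != INT_MAX)
                    level[v] = std::min(level[v], level[u] + 1);
        }
    }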

Connected components

Finding the connected components of a graph is of fundamental importance to graph algorithms. Given a graph G=(V,E), the problem is to find a partitioning of V into disjoint sets V1, V2, …, so that vertices u and v are in the same set if and only if there is a path between u and v in G. Well known sequential algorithms such as the Depth First Search algorithm (DFS) [15] run in O(n+m) time. Several efficient parallel algorithms in the PRAM model have been proposed. Popular among them are the
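A minimal sequential illustration of the same pruning pattern for connected components is given below using union-find; the paper's GPU algorithm is a Shiloach–Vishkin variant, so this sketch only shows how labels of pruned pendant vertices are recovered from their unique neighbors. The helper names are our own.

    #include <numeric>
    #include <vector>

    // Simple union-find; the paper's GPU code is a Shiloach-Vishkin variant,
    // so this sequential structure is only for illustration.
    struct UnionFind {
        std::vector<int> parent;
        explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
        int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
        void unite(int a, int b) { parent[find(a)] = find(b); }
    };

    // Label components of the pruned graph, then attach each pruned pendant
    // vertex to its unique surviving neighbor in reverse pruning order.
    std::vector<int> components_with_pruning(const std::vector<std::vector<int>>& adj,
                                             std::vector<bool>& removed,
                                             const std::vector<int>& prune_order) {
        const int n = static_cast<int>(adj.size());
        UnionFind uf(n);
        for (int v = 0; v < n; ++v) {
            if (removed[v]) continue;
            for (int u : adj[v])
                if (!removed[u]) uf.unite(v, u);
        }
        for (auto it = prune_order.rbegin(); it != prune_order.rend(); ++it) {
            int v = *it;
            removed[v] = false;  // reinstate v and inherit its neighbor's component
            for (int u : adj[v])
                if (!removed[u]) { uf.unite(v, u); break; }
        }
        std::vector<int> label(n);
        for (int v = 0; v < n; ++v) label[v] = uf.find(v);
        return label;
    }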

All pairs shortest paths

In graph theory, finding shortest paths in a weighted graph is a fundamental and well-researched problem. The problem seeks to find the shortest path between any two vertices of the graph such that the sum of the weights of the constituent edges is minimized. The All-Pairs-Shortest-Paths (APSP) problem is a generalization where one seeks to find the shortest path between every pair of vertices in the graph. The most popular solution of the APSP problem is the Floyd–Warshall algorithm which has
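For completeness, the classical Floyd–Warshall recurrence, dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j]) over all intermediate vertices k, is sketched below in its textbook sequential form; it runs in O(n^3) time on an n-vertex graph, and the GPU-adapted variants discussed above restructure exactly this computation. The sketch is ours, not the implementation evaluated in the paper.

    #include <vector>

    // Textbook Floyd-Warshall over an adjacency matrix of edge weights, with
    // INF marking absent edges; dist[i][j] ends up as the shortest i-to-j distance.
    void floyd_warshall(std::vector<std::vector<long long>>& dist, long long INF) {
        const int n = static_cast<int>(dist.size());
        for (int k = 0; k < n; ++k)
            for (int i = 0; i < n; ++i) {
                if (dist[i][k] == INF) continue;  // no path from i through k
                for (int j = 0; j < n; ++j)
                    if (dist[k][j] != INF && dist[i][k] + dist[k][j] < dist[i][j])
                        dist[i][j] = dist[i][k] + dist[k][j];
            }
    }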

Pruning based on bridges and articulation points

In this section, we introduce a pruning technique that prunes the bridges of a graph. Recall that a bridge in a graph G is an edge whose removal disconnects G. Similarly, an articulation point in a graph G is a vertex whose removal disconnects the graph. Bridges and articulation points can be used to partition the edges of G into maximal 2-connected subgraphs, also called the biconnected components (BCCs) of G.

For the case of All-Pairs-Shortest-Paths, such a decomposition into
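As a reminder of how articulation points are identified sequentially, the low-link DFS sketch below follows the Hopcroft–Tarjan approach; the paper's heterogeneous biconnected-components computation differs, so this is only meant to illustrate the structure that the pruning strategy relies on. The class name ArticulationFinder is our own.

    #include <algorithm>
    #include <vector>

    // Sequential low-link DFS (Hopcroft-Tarjan style) that marks articulation
    // points; illustrative only, not the paper's heterogeneous BCC computation.
    struct ArticulationFinder {
        const std::vector<std::vector<int>>& adj;
        std::vector<int> disc, low;
        std::vector<bool> is_articulation;
        int timer = 0;

        explicit ArticulationFinder(const std::vector<std::vector<int>>& g)
            : adj(g), disc(g.size(), -1), low(g.size(), 0), is_articulation(g.size(), false) {}

        void dfs(int v, int parent) {
            disc[v] = low[v] = timer++;
            int children = 0;
            for (int u : adj[v]) {
                if (u == parent) continue;
                if (disc[u] != -1) {
                    low[v] = std::min(low[v], disc[u]);   // back edge
                } else {
                    ++children;
                    dfs(u, v);
                    low[v] = std::min(low[v], low[u]);
                    // v separates u's subtree if no back edge climbs above v
                    if (parent != -1 && low[u] >= disc[v]) is_articulation[v] = true;
                }
            }
            if (parent == -1 && children > 1) is_articulation[v] = true;  // root rule
        }

        void run() {
            for (int v = 0; v < static_cast<int>(adj.size()); ++v)
                if (disc[v] == -1) dfs(v, -1);
        }
    };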

Conclusions

In this paper, we have proposed graph pruning as a technique to speed up large graph algorithms on modern parallel architectures. We applied the technique to three important problems on graphs. Our results indicate that the technique is quite useful, especially for large sparse graphs. We obtained good speedups in all the workloads we experimented with. These results clearly demonstrate that data-centric preprocessing and modifications can lead to large benefits.

In the

Acknowledgment

The first author was supported by Tata Consultancy Services Ltd. through the TCS Research Scholarship Program.

References (50)

  • R.J. Anderson et al., A simple randomized parallel algorithm for list-ranking, Inform. Process. Lett. (1990).
  • D.A. Bader et al., A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs), J. Parallel Distrib. Comput. (2005).
  • Y. Shiloach et al., An O(log n) parallel connectivity algorithm, J. Algorithms (1982).
  • V. Agarwal, F. Petrini, D. Pasetto, D.A. Bader, Scalable graph exploration on multicore processors, in: Proc. of ACM SC, 10,...
  • S. Auer et al., DBpedia: a nucleus for a web of open data.
  • L. Backstrom et al., Group formation in large social networks: membership, growth, and evolution.
  • D.S. Banerjee, K. Kothapalli, Hybrid algorithms for list ranking and graph connected components, in: Proc. of 18th...
  • D. Banerjee, S. Sharma, K. Kothapalli, Work efficient parallel algorithms for large graph exploration, in: 20th...
  • V. Batagelj, A. Mrvar, Pajek network: connectivity of internet...
  • S. Beamer et al., Direction-optimizing Breadth-first Search.
  • L.S. Blackford et al., ScaLAPACK User’s Guide (1997).
  • D. Chakrabarti, Y. Zhan, C. Faloutsos, R-MAT: a recursive model for graph mining, in: Proceedings of 2004 SIAM...
  • J. Chhugani, N. Satish, C. Kim, J. Sewall, P. Dubey, Fast and efficient graph traversal algorithm for CPUs: maximizing...
  • R. Cole, Parallel merge sort, SIAM J. Comput. (1988).
  • G. Cong et al., An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs).
  • T. Cormen, C. Leiserson, R. Rivest, C. Stein, Introduction to Algorithms,...
  • G. D'Angelo, M. D'Emidio, D. Frigioni, V. Maurizio, A speed-up technique for distributed shortest paths computation, in:...
  • H. Djidjev, S. Thulasidasan, G. Chapuis, R. Andonov, D. Lavenier, Efficient multi-GPU computation of all-pairs shortest...
  • A. Gharaibeh et al., On graphs, GPUs, and blind dating: a workload to processor matchmaking quest.
  • J. Greiner, A comparison of parallel algorithms for connected components.
  • P. Harish, P.J. Narayanan, Accelerating large graph algorithms on the GPU using CUDA, in: Proc. of HiPC...
  • S. Hong, T. Oguntebi, K. Olukotun, Efficient parallel graph exploration for multi-core CPU and GPU, in: IEEE Parallel...
  • S. Hong et al., On fast parallel detection of strongly connected components (SCC) in small-world graphs.
  • D.B. Johnson, Efficient algorithms for shortest paths in sparse networks, J. ACM (1977).
  • G. Karypis et al., Parallel multilevel k-way partitioning scheme for irregular graphs.

Dip Sankar Banerjee is presently a Ph.D. student at the International Institute of Information Technology, Hyderabad, India, where he is affiliated with the Center for Security, Theory and Algorithmic Research. Prior to joining the doctoral program, he pursued his undergraduate studies in Computer Science Engineering at the West Bengal University of Technology, India. His research interests are broadly in parallel algorithms, massive graph analysis, multicore and manycore computing, and high performance computing.

Ashutosh Kumar did his undergraduate studies in Computer Science and Engineering at IIIT Hyderabad and is currently working as a Software Development Engineer at Microsoft. His research interests are in secure communication protocols in adversarial distributed networks and in parallel algorithms.

Meher Chaitanya is currently an M.S. by Research student at the International Institute of Information Technology, Hyderabad, India, where he is affiliated with the Center for Security, Theory and Algorithmic Research. Prior to this, he was a software engineer at Nvidia. He completed his M.Tech in Computer Science Engineering at the same institute. His current areas of research are studying various ways of separating graphs and their cost-benefit analysis with respect to problems such as shortest paths and betweenness centrality.

Shashank Sharma is currently a Masters student at the International Institute of Information Technology, Hyderabad, India, where he is affiliated with the Center for Security, Theory and Algorithmic Research. Prior to this, he completed his undergraduate degree in Computer Science Engineering at the same institute. His major research interests are in parallel computing, GPU computing, and graph theory.

Kishore Kothapalli is presently an Associate Professor at the International Institute of Information Technology, Hyderabad, where he has been working since 2006. Prior to that, he obtained his doctoral degree in Computer Science from Johns Hopkins University, USA, and his Master’s degree in Computer Science from the Indian Institute of Technology, Kanpur. His current research interests are in parallel algorithms for problems on graphs, sparse matrices, and the like. He is also interested in data structures for geometric problems.

A part of this work has appeared previously in the Proceedings of the 20th IEEE International Conference on High Performance Computing (HiPC), 2013.
