Parallel processing of filtered queries in attributed semantic graphs☆
Introduction
Large-scale graph analytics is a central requirement of bioinformatics, finance, social network analysis, national security, and many other fields that deal with “big data”. Going beyond simple searches, analysts use high-performance computing systems to execute complex graph algorithms on large corpora of data. Often, a large semantic graph is built up over time, with the graph vertices representing entities of interest and the edges representing relationships of various kinds—for example, social network connections, financial transactions, or interpersonal contacts.
In a semantic graph, edges and/or vertices are labeled with attributes that might represent a timestamp, a type of relationship, or a mode of communication. An analyst (i.e. a user of graph analytics) may want to run a complex workflow over a large graph, but use only those graph edges whose attributes pass a filter defined by the analyst.
The Knowledge Discovery Toolbox (KDT) [30] is a flexible, Python-based, open-source toolbox for implementing complex graph algorithms and executing them on high-performance parallel computers. KDT achieves high performance by invoking linear-algebraic computational primitives supplied by a parallel C++/MPI backend, the Combinatorial BLAS [10]. Combinatorial BLAS uses broad definitions of matrix and vector operations. The user can define custom callbacks to override the semiring scalar multiplications and additions that correspond to operations between edges and vertices.
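For intuition, this semiring customization can be pictured as a sparse matrix–vector product whose scalar "multiply" and "add" are user callbacks. The sketch below is a plain-Python illustration of that contract with hypothetical names and data layout; it is not the Combinatorial BLAS API.

```python
# Sketch: a sparse matrix-vector product parameterized by user-defined
# semiring "add" and "multiply" callbacks, in the spirit of Combinatorial
# BLAS. All names and data structures here are illustrative.

def spmv(edges, frontier, add, multiply):
    """edges: dict src -> list of (dst, attr); frontier: dict vertex -> value."""
    result = {}
    for src, value in frontier.items():
        for dst, attr in edges.get(src, []):
            term = multiply(attr, value)              # edge "times" vertex
            if dst in result:
                result[dst] = add(result[dst], term)  # combine contributions
            else:
                result[dst] = term
    return result

# A BFS-flavored semiring: multiply propagates the parent's id, add
# tie-breaks between multiple candidate parents.
parents = spmv(
    {0: [(1, "follows"), (2, "retweets")], 1: [(2, "follows")]},
    {0: 0},                                   # frontier: vertex 0 carries its id
    add=lambda a, b: min(a, b),               # pick one parent deterministically
    multiply=lambda attr, vertex: vertex,     # "select2nd"-style multiply
)
```

Overriding `add` and `multiply` is how one algorithm skeleton (the matrix–vector product) is repurposed for different graph computations.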
Filters act to enable or disable KDT’s action (the semiring operations) based on the attributes that label individual edges or vertices. The programmer’s ability to specify custom filters and semirings directly in a high-level language like Python is crucial to the high productivity and customizability of graph analysis software. This paper presents new work that allows KDT users to define filters and semirings in Python without paying the performance penalty of upcalls to Python.
Filters raise performance issues for large-scale graph analysis. In many applications it is prohibitively expensive to run a filter across an entire graph data corpus and produce (“materialize”) a new filtered graph as a temporary object for analysis. In addition to the obvious storage problems with materialization, the time spent during materialization is typically not amortized over many graph queries, because the user modifies the query (or just the filter) during interactive data analysis. The alternative is to filter edges and vertices “on the fly” during execution of the complex graph algorithm. A graph algorithms expert can implement an efficient on-the-fly filter as a set of primitive Combinatorial BLAS operations coded in C/C++, but doing so incurs a significant productivity hit. Conversely, filters written at the KDT level, as predicate callbacks in Python, are productive, but incur a significant performance penalty.
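On-the-fly filtering amounts to evaluating an analyst-supplied predicate per edge during the traversal itself, so no filtered copy of the graph is ever materialized. The sketch below illustrates the idea in plain Python; the edge layout and names are hypothetical, not KDT’s API.

```python
# Sketch of on-the-fly filtering: the analyst's predicate runs per edge
# during traversal, so no filtered graph is ever materialized. The Edge
# layout below is illustrative, not KDT's actual representation.
from collections import namedtuple

Edge = namedtuple("Edge", ["src", "dst", "kind", "count"])

edges = [
    Edge(0, 1, "follows", 0),
    Edge(0, 2, "retweets", 3),
    Edge(1, 2, "retweets", 1),
]

# Analyst-defined filter: keep only "retweets" edges seen at least twice.
passes = lambda e: e.kind == "retweets" and e.count >= 2

# One traversal step consults the predicate per edge instead of first
# building a filtered copy of the edge list.
reachable = {e.dst for e in edges if passes(e)}
```

Because the predicate is just a closure over edge attributes, the analyst can change it between queries without touching the stored graph.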
Our solution to this challenge is to apply Selective Just-In-Time Specialization techniques from the SEJITS approach [12]. We define two semantic-graph-specific domain-specific languages (DSLs): one for filters and one for the user-defined scalar semiring operations that flexibly implement custom graph algorithms. Both DSLs are subsets of Python, and they use SEJITS to implement the specialization necessary for filters and semirings written in that subset to execute efficiently as low-level C++ code. Unlike writing a compiler for the full Python language, implementing our DSLs requires much less effort due to their domain-specific nature. At the same time, our use of existing SEJITS infrastructure preserves the high-level nature of expressing computations in Python without forcing users to write C++ code.
We demonstrate that SEJITS technology significantly accelerates Python graph analytics codes written in KDT, running on clusters and multicore CPUs. An overview of our approach is shown in Fig. 1. SEJITS specialization allows our graph analytics system to bridge the gap between the performance-oriented Combinatorial BLAS and usability-oriented KDT.
The primary new contributions of this paper are:
- 1.
A domain-specific language implementation that enables flexible filtering and customization of graph algorithms without sacrificing performance, using SEJITS selective compilation techniques.
- 2.
A new Roofline performance model [41] for high-performance graph exploration, suitable for evaluating the performance of filtered semantic graph operations.
- 3.
Experimental demonstration of excellent performance scaling to graphs with tens of millions of vertices and hundreds of millions of edges.
- 4.
Demonstration of the generality of our approach by specializing two different graph algorithms: breadth-first search (BFS) and maximal independent set (MIS). In particular, the MIS algorithm requires multiple programmer-defined semiring operations beyond the defaults that are provided by KDT.
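For intuition about the MIS computation in item 4, one common parallel approach is Luby-style: each vertex draws a random value and joins the independent set when its draw beats those of all still-active neighbors. The sketch below is a plain-Python rendering of that idea, not the KDT/Combinatorial BLAS formulation, which expresses the analogous steps as semiring operations.

```python
import random

def luby_mis(adj, seed=0):
    """Maximal independent set via a Luby-style algorithm.

    adj maps each vertex to its neighbor list (undirected graph)."""
    rng = random.Random(seed)
    active = set(adj)
    mis = set()
    while active:
        draw = {v: rng.random() for v in active}
        # A vertex joins the set if its draw beats every active neighbor's.
        winners = {v for v in active
                   if all(draw[v] < draw[u] for u in adj[v] if u in active)}
        mis |= winners
        # Winners and their neighbors drop out of further rounds.
        active -= winners
        for v in winners:
            active -= set(adj[v])
    return mis

# A 4-vertex path graph: 0-1-2-3.
mis = luby_mis({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]})
```

Each round needs per-vertex random draws and a neighborhood minimum, which is why an MIS implementation requires several programmer-defined semiring operations rather than a single default.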
Fig. 2 summarizes the work implemented in this paper by comparing the performance of three on-the-fly filtering implementations on a breadth-first search query in a graph with 4 million vertices and 64 million edges. The chart shows the time to perform the query as we synthetically increase the portion of the graph that passes the filter on an input R-MAT [28] graph of scale 22. The top, red, line is the method implemented in the current release v0.2 of KDT [23], with filters and semiring operations implemented as Python callbacks. The second, blue, line is our new KDT+SEJITS implementation, in which filters and semiring operations implemented in our DSLs are specialized using SEJITS. This new implementation shows minimal overhead and comes very close to the performance of native Combinatorial BLAS, shown as the third, gold, line.
The rest of the paper is organized as follows. Section 2 gives background on the graph-analytical systems our work targets and builds upon. Section 3 is the technical heart of the paper; it describes how we meet performance challenges by using selective, embedded, just-in-time specialization. Section 4 presents Python-defined objects that enable the user to declare their attribute types directly in Python, enabling a much broader set of applications. Section 5 gives details about the experimental setting. Section 6 proposes a theoretical model that can be used to evaluate the performance of our implementations, giving “Roofline” bounds on the performance of breadth-first search in terms of the architectural parameters of a parallel machine and the permeability of the filter (that is, the percentage of edges that pass the filter). Section 7 presents our experimental results. In Section 8, we precisely analyze the performance implications of selective just-in-time translation using hardware performance counters. We survey related work in Section 9. Section 10 gives our conclusions and some remarks on future directions and open problems. This paper expands on the work first published as a conference paper at IPDPS [8].
Section snippets
Background
Running example: Throughout the paper, we will use a running example query to show how different implementations of filters and semiring operations express the query and compare their performance executing it. We consider a graph whose vertices are Twitter users, and whose edges represent two different types of relationships between users. In the first type, one user “follows” another; in the second type, one user “retweets” another user’s tweet. Each retweet edge carries as attributes a
SEJITS translation of filters and semiring operations
Defining semirings and filters in Python results in one or more serialized upcalls from the low-level Combinatorial BLAS into Python for both semiring operations and filtering. In order to mitigate this slowdown, we use the Selective Embedded Just-In-Time Specialization (SEJITS) approach [12]. We define embedded DSLs for semiring and filter operations which are subsets of Python. As shown in Fig. 6, callbacks written in these DSLs are translated at runtime to C++ to eliminate performance
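The first step of such a runtime translation can be sketched in plain Python: capture the user callback’s source, parse it into an abstract syntax tree, and hand that tree to a code generator that emits C++ (elided here). This is an illustration of the mechanism only, not KDT’s actual specializer code.

```python
import ast

# Source of a user-written filter in the Python-subset DSL; parsing it
# into an AST is the starting point for runtime translation to C++.
src = """
def edge_filter(e):
    return e.kind == 2 and e.count >= 1
"""

tree = ast.parse(src)
func = tree.body[0]
# A specializer would now walk this FunctionDef node and emit equivalent
# C++ source, compile it, and install the result as a native callback.
```

Restricting the DSL to a Python subset is what makes this walk tractable: every node kind the translator can encounter has a known C++ rendering.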
Motivation
The attribute types of vertices and edges should ideally be declared in Python, especially when the application requires several graphs with different edge and/or vertex datatypes. Consider the analysis of multi-modal brain networks (also known as connectomes). In this application, data from multiple modalities, such as fMRI, DTI, EEG, and PET, are collected for the patient’s brain. Representing these data sources as graphs and using graph analysis has been instrumental in characterizing
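A minimal sketch of what declaring such an attribute type in Python might look like is given below; the class layout and names are purely illustrative, and KDT’s actual type-declaration mechanism differs in detail.

```python
# Sketch: an edge attribute type declared directly in Python, so each
# imaging modality can carry its own payload. Illustrative only.

class ConnectomeEdge:
    """Edge payload for a multi-modal brain network."""
    def __init__(self, modality, weight):
        self.modality = modality   # e.g. "fMRI", "DTI", "EEG", "PET"
        self.weight = weight       # connection strength in that modality

edges = [ConnectomeEdge("fMRI", 0.4), ConnectomeEdge("DTI", 0.9)]

# A filter over the declared type: keep only strong DTI connections.
strong_dti = [e for e in edges if e.modality == "DTI" and e.weight > 0.5]
```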
Experimental design
This section describes graph algorithms used in our experiments, the benchmark matrices we used to test the algorithms, and the machines on which we ran our tests. KDT version 0.3 is enabled with the SEJITS techniques described in this paper, and is freely available at http://kdt.sourceforge.net.
A Roofline model of BFS
The Roofline model [41] is a visually intuitive representation of the performance characteristics of a kernel on a specific machine. It uses bound and bottleneck analysis to delineate performance limits arising from bandwidth or compute constraints, and it has been used to show that the performance of many HPC kernels is well correlated with STREAM bandwidth. Unfortunately, the traditional HPC application characteristics (massive parallelism, streaming memory access) and even metrics (flops per
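At its core, the model caps attainable performance at the minimum of the machine’s peak compute rate and its memory bandwidth times the kernel’s arithmetic intensity. The numbers below are illustrative only, not measurements from our experimental platforms.

```python
# The basic Roofline bound: attainable performance is capped by either
# peak compute or bandwidth times arithmetic intensity (ops per byte).

def roofline(peak_gflops, stream_gbps, intensity_flops_per_byte):
    return min(peak_gflops, stream_gbps * intensity_flops_per_byte)

# A graph-traversal kernel has very low arithmetic intensity, so it sits
# far out on the bandwidth-limited slope of the roofline; a dense HPC
# kernel with high intensity hits the compute ceiling instead.
bfs_bound = roofline(peak_gflops=100.0, stream_gbps=50.0,
                     intensity_flops_per_byte=0.05)   # bandwidth-bound
hpc_bound = roofline(peak_gflops=100.0, stream_gbps=50.0,
                     intensity_flops_per_byte=8.0)    # compute-bound
```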
Experimental results
In this section we use [semiring implementation]/[filter implementation] notation to describe the various implementation combinations we compare. For example, Python/SEJITS means that only the filter is specialized with SEJITS but the semiring is in pure Python (not specialized).
Performance results introspection via hardware performance counters
The Performance Application Programming Interface (PAPI) [36] library provides direct access to low-level performance counters. These counters can measure performance attributes of a particular program execution. For example, PAPI counters can be used to measure the total number of instructions executed, or the total number of cache misses (L1 or L2, data or instruction).
Our study incorporates several PAPI performance counters to gain a detailed analysis of the performance benefits of KDT
Related work
Graph algorithm packages
Pegasus [22] is a graph-analysis package that uses MapReduce [14] in a distributed-computing setting. Other cloud-based graph analysis systems include GPS [40], Apache Hama [3], and Giraph [2]. Redekopp et al. [39] recently studied performance optimizations for such cloud-based graph platforms.
Pegasus [22] uses a generalized matrix–vector multiplication primitive called GIM-V, much like KDT’s SpMV, to express vertex-centered computations that combine data
Conclusion
The KDT graph analytics system achieves customizability through user-defined filters, high performance through the use of a scalable parallel library, and conceptual simplicity through appropriate graph abstractions expressed in a high-level language.
We have shown that the performance impact of expressing filters in a high-level language like Python can be mitigated by Selective Embedded Just-in-Time Specialization. In particular, we have shown that our embedded DSLs for filters and semirings
Acknowledgments
This work is supported by the Computer Science Program, the Applied Mathematics Program, and the Early-Career Research Program within the Office of Science Advanced Scientific Computing Research of the US Department of Energy under contract No. DE-AC02-05CH11231. This work was supported in part by National Science Foundation grant CNS-0709385. Portions of this work were performed at the UC Berkeley Parallel Computing Laboratory (Par Lab), supported by DARPA (contract #FA8750-10-1-0191) and by
References (42)
- et al., Patterns of temporal variation in online media.
- Active Record - Object-Relation Mapping Put on Rails, 2012.
- Apache Giraph, 2013.
- Apache Hama, 2013.
- D.A. Bader, K. Madduri, SNAP, small-world network analysis and partitioning: an open-source parallel graph framework.
- et al., Direction-optimizing breadth-first search, Sci. Program. (2013).
- et al., Distributed memory breadth-first search revisited: enabling bottom-up search.
- et al., Software and algorithms for graph queries on multithreaded architectures.
- et al., High-productivity and high-performance analysis of filtered semantic graphs.
- A. Buluç, J.R. Gilbert, On the representation and multiplication of hypersparse matrices, in: Proc. IPDPS, April...
- The Combinatorial BLAS: design, implementation, and applications, Int. J. High Perform. Comput. Appl.
- MapReduce: simplified data processing on large clusters.
- On random graphs, Publ. Mat.
- Domain Specific Languages.
- Sparse matrices in MATLAB: design and implementation, SIAM J. Matrix Anal. Appl.
- Green-Marl: a DSL for easy and efficient graph analysis.
Adam Lugowski is a Ph.D. student in Computer Science at the University of California, Santa Barbara. His research interests include high-performance graph analysis, library design for domain scientists, and next-generation parallel sparse matrix linear algebraic techniques. He is studying under John Gilbert.
Shoaib Kamil is a Research Scientist at the Computer Science and Artificial Intelligence Laboratory at MIT in the Commit (Compilers at MIT) and CAP (Computer-Aided Programming) groups. Formerly, he was the lead student on the SEJITS project, which works to enable highly productive parallel programming, a central piece of the Parallel Computing Laboratory (Parlab) at Berkeley, where he obtained his Ph.D. Prior to that, he was a researcher at Lawrence Berkeley National Laboratory where he worked on auto-tuning, power efficiency in supercomputing, highly-parallel applications, and much-cited research into structured grid (stencil) algorithms.
Aydın Buluç is a computational research scientist at the Lawrence Berkeley National Laboratory. His research interests include parallel computing, combinatorial scientific computing, high performance graph analysis and sparse matrix computations. Previously, he was a Luis W. Alvarez postdoctoral fellow. He received his Ph.D. in Computer Science from the University of California, Santa Barbara in 2010 and his B.S. in Computer Science and Engineering from Sabanci University, Turkey in 2005. Dr. Buluç is the recipient of a DOE Early Career Award in 2013. He is also a founding associate editor of the ACM Transactions on Parallel Computing.
Samuel Williams is a Staff Scientist in the Future Technologies Group at LBNL. He received both his Ph.D. and M.Sc. degrees in Computer Science from the University of California at Berkeley. He received B.Sc. degrees in Electrical Engineering, Mathematics, and Physics from Southern Methodist University. His current research interests include performance optimization and modeling for multi- and many-core architectures running high-performance distributed numerical algorithms.
Erika Duriakova received the B.Sc. Degree in Computer Science from University College Dublin, Ireland in 2013 with 1st Class Honours. She is currently pursuing her Ph.D. Degree in Computer Science at the Insight Centre for Data Analytics at University College Dublin. Her current research interests include high performance computing and scalable community finding in large networks.
Leonid Oliker is a Senior Computer Scientist in the Future Technologies Group at Lawrence Berkeley National Laboratory. He received bachelor degrees in Computer Engineering and Finance from the University of Pennsylvania, and performed both his doctoral and postdoctoral work at NASA Ames research center. Lenny has co-authored over 100 technical articles, five of which received best paper awards. His research interests include HPC optimization, multi-core auto-tuning, and power-efficient computing.
Armando Fox is a Professor-in-Residence in the UC Berkeley EECS Department and a co-founder of the Berkeley RAD Lab. Prior to that he was an Assistant Professor of Computer Science at Stanford. He is the recipient of an NSF CAREER award and teaching awards from Stanford University, the Society of Women Engineers, and Tau Beta Pi. In previous lives he helped design the Intel Pentium Pro microprocessor and founded a small company to commercialize his UC Berkeley dissertation research on mobile computing. He received his other degrees in electrical engineering and computer science from MIT and the University of Illinois.
John R. Gilbert directs the Combinatorial Scientific Computing Laboratory at UC Santa Barbara, where he is a Professor of Computer Science. He has done fundamental work in algorithms and software for sparse matrix computation, including Matlab’s original sparse matrix capabilities and the SuperLU solver library. Prof. Gilbert received his Ph.D. from Stanford in 1981. He has served on the Computer Science faculty at Cornell and as a Principal Scientist and research manager at Xerox PARC, and is a Fellow of the Society for Industrial and Applied Mathematics.
☆ This paper is the extended version of the conference paper “High-productivity and high-performance analysis of filtered semantic graphs”, presented at the 2013 IEEE International Parallel & Distributed Processing Symposium.