ABSTRACT
How can we enumerate triangles from an enormous graph with billions of vertices and edges? Triangle enumeration is an important task for graph data analysis with many applications, including identifying suspicious users in social networks, detecting web spam, and finding communities. However, recent networks are so large that most previous algorithms fail to process them. Several MapReduce algorithms have recently been proposed to handle such large networks; however, they suffer from massive amounts of shuffled data, which results in very long processing times. In this paper, we propose PTE (Pre-partitioned Triangle Enumeration), a new distributed algorithm for enumerating triangles in enormous graphs that resolves the structural inefficiency of the previous MapReduce algorithms. PTE enumerates trillions of triangles in a billion-scale graph by decreasing three factors: the amount of shuffled data, total work, and network read.
Experimental results show that PTE is up to 47 times faster than recent distributed algorithms on real-world graphs, and succeeds in enumerating more than 3 trillion triangles on the ClueWeb12 graph with 6.3 billion vertices and 72 billion edges, which no previous triangle computation algorithm has been able to process.
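To make the task concrete, the following is a minimal single-machine sketch of triangle enumeration on an undirected graph. It is not PTE and involves no partitioning or MapReduce; it uses the standard degree-ordering technique, in which each edge is oriented from its lower-degree endpoint to its higher-degree endpoint so that every triangle is reported exactly once. The function name `enumerate_triangles` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple

Edge = Tuple[int, int]
Triangle = Tuple[int, int, int]


def enumerate_triangles(edges: Iterable[Edge]) -> Iterator[Triangle]:
    """Yield each triangle of an undirected graph exactly once.

    Illustrative in-memory sketch (hypothetical helper, not the PTE algorithm).
    """
    # Build undirected adjacency sets; self-loops and duplicate edges are ignored.
    adj: Dict[int, Set[int]] = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Total order on vertices: by degree, ties broken by vertex id.
    rank = {v: (len(neighbors), v) for v, neighbors in adj.items()}

    # Orient every edge from the lower-ranked to the higher-ranked endpoint.
    out = {v: {w for w in adj[v] if rank[v] < rank[w]} for v in adj}

    # A triangle with ranks a < b < c is found only from (u, v) = (a, b),
    # where c lies in the intersection of the two out-neighbor sets.
    for u, out_u in out.items():
        for v in out_u:
            for w in out_u & out[v]:
                yield tuple(sorted((u, v, w)))


if __name__ == "__main__":
    sample = [(1, 2), (2, 3), (1, 3), (3, 4)]
    print(sorted(enumerate_triangles(sample)))
```

A sketch like this needs the whole graph in the memory of one machine, which is exactly what becomes infeasible at the scale of billions of edges that the abstract targets.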
Index Terms
- PTE: Enumerating Trillion Triangles On Distributed Systems