Abstract
Traditional database operators such as joins are relevant not only in database engines but also as building blocks in many computational and machine learning algorithms. With the advent of big data, there is an increasing demand for efficient join algorithms that scale with the input data size and the available hardware resources.
In this paper, we explore the implementation of distributed join algorithms in systems with several thousand cores connected by a low-latency network, such as those used in high performance computing systems or data centers. We compare radix hash join to sort-merge join algorithms and discuss their implementation at this scale. We explain how to use MPI to implement joins, show the impact and advantages of RDMA, discuss the importance of network scheduling, and study the relative performance of sorting vs. hashing. The experimental results show that the algorithms we present scale well with the number of cores, reaching a throughput of 48.7 billion input tuples per second on 4,096 cores.