Abstract
This paper presents a general methodology for efficiently parallelizing existing data cube construction algorithms. We describe two partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both assign subcubes to individual processors so that the processor loads are balanced. Our methods reduce inter-processor communication overhead by partitioning the load in advance rather than computing each individual group-by in parallel, as previous parallel approaches do. In fact, after the initial load distribution phase, each processor can compute its assigned subcube without communicating with the other processors. Our methods also enable code reuse: existing sequential (external-memory) data cube algorithms can be used for the subcube computations on each processor, which supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single-attribute external-memory sorts performed by each processor. The top-down strategy partitions a weighted tree whose weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric, with access to a shared parallel disk array. Our experimental results show that the partitioning strategies produce a close-to-optimal load balance across processors.
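To make the idea of partitioning the load in advance concrete, the following Python sketch assigns the group-bys of a small cube to p processors before any aggregation runs. It is only an illustration under stated assumptions, not the partitioning algorithm of the paper: the paper's top-down strategy partitions a weighted tree, whereas this sketch ignores the tree structure and uses a plain greedy heuristic (heaviest item first, to the currently least-loaded processor). The function names, the attribute cardinalities, and the size estimate min(product of cardinalities, number of rows) are all hypothetical choices made for the example.

```python
import heapq
from itertools import combinations
from math import prod


def estimate_groupby_sizes(cardinalities, num_rows):
    """Crude size estimate per group-by (an assumption for this sketch):
    min(product of attribute cardinalities, number of input rows)."""
    attrs = sorted(cardinalities)
    sizes = {}
    for k in range(len(attrs) + 1):
        for combo in combinations(attrs, k):
            # prod of an empty sequence is 1, which models the "ALL" group-by.
            sizes[combo] = min(prod(cardinalities[a] for a in combo), num_rows)
    return sizes


def assign_greedy(weights, p):
    """Greedy stand-in for load balancing (not the paper's tree partitioning):
    place the heaviest remaining item on the least-loaded processor."""
    heap = [(0, proc) for proc in range(p)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {proc: [] for proc in range(p)}
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for item, weight in ordered:
        load, proc = heapq.heappop(heap)
        assignment[proc].append(item)
        heapq.heappush(heap, (load + weight, proc))
    loads = {
        proc: sum(weights[item] for item in items)
        for proc, items in assignment.items()
    }
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical 4-attribute relation with 1000 rows, split over 4 processors.
    sizes = estimate_groupby_sizes({"A": 10, "B": 20, "C": 5, "D": 8}, 1000)
    assignment, loads = assign_greedy(sizes, 4)
    for proc in sorted(assignment):
        print(proc, loads[proc], assignment[proc])
```

Once such an assignment is fixed, each processor could run an unmodified sequential cube algorithm on its own share, which is the property the abstract emphasizes. A tree-aware partitioner would additionally keep each processor's group-bys connected in the schedule tree so that sorts and partial aggregates can be shared among them; this sketch does not attempt that.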
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Dehne, F., Eavis, T., Hambrusch, S., Rau-Chaplin, A. (2001). Parallelizing the Data Cube. In: Van den Bussche, J., Vianu, V. (eds) Database Theory — ICDT 2001. ICDT 2001. Lecture Notes in Computer Science, vol 1973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44503-X_9
DOI: https://doi.org/10.1007/3-540-44503-X_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41456-8
Online ISBN: 978-3-540-44503-6
eBook Packages: Springer Book Archive