Parallelizing the Data Cube

Dehne, Frank; Eavis, Todd; Hambrusch, Susanne; Rau-Chaplin, Andrew

doi:10.1007/3-540-44503-X_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1973)

Included in the following conference series:

International Conference on Database Theory

Abstract

This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce inter-processor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel, as is done in previous parallel approaches. In fact, after the initial load distribution phase, each processor can compute its assigned subcube without any communication with the other processors. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. Experimental results show that our partitioning strategies generate a close to optimal load balance between processors.
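To make the idea of partitioning the load in advance concrete, the following minimal Python sketch enumerates the 2^d group-bys of a small cube, assigns each an estimated weight, and distributes them over p processors before any computation starts. It is an illustration only and not the authors' method: the paper partitions a weighted tree (top-down) or balances single-attribute sorts (bottom-up), whereas this sketch swaps in a simple greedy largest-first heuristic over independent group-bys. The function names, the dimension cardinalities, the row count, and the size estimate (product of cardinalities capped at the row count) are all assumptions made for the example.

```python
from itertools import combinations
from math import prod
import heapq


def group_bys(dimensions):
    """Enumerate all 2^d group-bys (attribute subsets), largest first."""
    return [combo
            for k in range(len(dimensions), -1, -1)
            for combo in combinations(dimensions, k)]


def estimate_size(group_by, cardinality, n_rows):
    """Hypothetical cost measure: product of cardinalities, capped at n_rows."""
    return min(prod(cardinality[d] for d in group_by), n_rows)


def balance(weights, p):
    """Greedy largest-first assignment of weighted tasks to p processors.

    This is a stand-in heuristic, not the tree partitioning used in the paper.
    Returns (loads, buckets), both indexed by processor.
    """
    heap = [(0, i) for i in range(p)]  # (current load, processor index)
    heapq.heapify(heap)
    loads = [0] * p
    buckets = [[] for _ in range(p)]
    # Place heavy tasks first, always on the currently lightest processor.
    for task, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)
        buckets[i].append(task)
        loads[i] = load + w
        heapq.heappush(heap, (loads[i], i))
    return loads, buckets


# Assumed example data: three dimensions, 100 input rows, two processors.
cardinality = {"A": 10, "B": 20, "C": 5}
n_rows = 100
weights = {g: estimate_size(g, cardinality, n_rows)
           for g in group_bys(list(cardinality))}
loads, buckets = balance(weights, p=2)
print(loads)
print(buckets)
```

Once the assignment is fixed, each processor could hand its bucket to an unmodified sequential cube routine and run without further communication, which is the property the abstract emphasizes. A per-group-by assignment like this one ignores the sharing of sorts and parent views that the paper's subcube partitioning is designed to preserve.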

Author information

Authors and Affiliations

Carleton University, Ottawa, Canada
Frank Dehne
Dalhousie University, Halifax, Canada
Todd Eavis & Andrew Rau-Chaplin
Purdue University, West Lafayette, Indiana, USA
Susanne Hambrusch

Editor information

Editors and Affiliations

Limburg University (LUC), 3590, Diepenbeek, Belgium
Jan Van den Bussche
Department of Computer Science and Engineering, University of California, 92093-0114, La Jolla, CA, USA
Victor Vianu

About this paper

Cite this paper

Dehne, F., Eavis, T., Hambrusch, S., Rau-Chaplin, A. (2001). Parallelizing the Data Cube. In: Van den Bussche, J., Vianu, V. (eds) Database Theory — ICDT 2001. ICDT 2001. Lecture Notes in Computer Science, vol 1973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44503-X_9

DOI: https://doi.org/10.1007/3-540-44503-X_9
Published: 12 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41456-8
Online ISBN: 978-3-540-44503-6
eBook Packages: Springer Book Archive
