ABSTRACT
The last decade has seen a surge of interest in large-scale data-parallel processing engines. While these engines share many features in common with parallel databases, they make a set of different trade-offs. In consequence many of the lessons learned for programming parallel databases have to be re-learned in the new environment. In this paper we show a case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ). We focus on the challenges raised by the nature of large data sets and data skew and show how they can be addressed effectively within this computation framework by optimizing the computation to adapt to the nature of the data. In particular we describe three different strategies for performing distributed joins and show how the platform language allows us to implement optimization strategies at the application level, without system support. We show that this flexibility in the programming model allows for a highly effective system, providing a measured speedup of more than 100 on 64 machines (256 cores), and an estimated speedup of 200 on 1280 machines (5120 cores)of matching 4 million offers.
- Apache Hadoop.Google Scholar
- P. B. R. R. Arnab Nandi, Cong Yu. Distributed cube materialization on holistic measures. Hanover, Germany, 2011.Google Scholar
- M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intel ligent Systems, 18:16--23, 2003. Google ScholarDigital Library
- R. Chaiken, P.-øA. L. Bob Jenkins, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. February 2008.Google Scholar
- C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and Nathan. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. Google ScholarDigital Library
- L. Chu, H. Tang, T. Yang, and K. Shen. Optimizing data aggregation for cluster-based internet services. In Symposium on Principles and practice of paral lel programming (PPoPP), pages 119--130. ACM, 2003. Google ScholarDigital Library
- W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. pages 201--212, 1998. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. San Francisco, CA, December 2004. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proceedings of the 18th International Conference on Very Large Data Bases, VLDB '92, pages 27--40, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- I. Forrester Research. http://techcrunch.com/2010/03/08/forrester-forecast-online-retail-sales-will-grow-to-250-billion-by-2014/, 2011.Google Scholar
- G. Fulgoni. The 2010 u.s. digital year in review. Technical report, Comscore, 2010.Google Scholar
- Y. Gu and R. Grossman. Sector and Sphere: The design and implementation of a high performance data cloud. Philosophical Transactions of the Royal Society, 367(1897):2429--2445, June 2009.Google ScholarCross Ref
- K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 525--535, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), pages 59--72, Lisbon, Portugal, March 21-23 2007. also as MSR-TR-2006-140. Google ScholarDigital Library
- M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In J. N. Matthews and T. E. Anderson, editors, ACM Symposium on Operating Systems Principles (SOSP), pages 261--276, Big Sky, Montana, USA, 2009. Google ScholarDigital Library
- A. Kannan, I. E.Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product descriptions. Technical Report MSR-TR-2010-172, Microsoft. Engineering (ICDE), pages 996--1005, Long Beach, California, March 1--6, 2010 2010.Google Scholar
- Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, pages 75--86, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- A. Monge and C. Elkan. The field matching problem: Algorithms and applications. In In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267--270, 1996.Google Scholar
- Nielsen. Global trends in online shopping: A nielsen global consumer report. Technical report, June 2010.Google Scholar
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (Industrial Track) (SIGMOD), Vancouver, Canada, June 2008. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In International Conference on Data Engineering (ICDE), pages 996--1005, Long Beach, California, March 1-6, 2010 2010.Google ScholarCross Ref
- C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 537--548, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- W. E. Winkler, W. E. Winkler, and N. P. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.Google Scholar
- J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Washington, DC, USA, 1991. IEEE Computer Society. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. page 14, San Diego, CA, December 8-10 2008. Google ScholarDigital Library
- Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum. Botgraph: large scale spamming botnet detection. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 321--334, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarDigital Library
Index Terms
- Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching
Recommendations
Intermediate Data Placement Strategy for Different Data Skew Levels Based on Random Sampling in Spark
ICBDC '19: Proceedings of the 4th International Conference on Big Data and ComputingIn recent years, the Apache Spark had been widely used in processing large-scale data. However, when the input data onto MapReduce was skewed, the default intermediate data placement algorithm of the Apache Spark could not efficiently handle the skewed ...
Comments