research-article

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

Authors:
Ekaterina Gonina

University of California, Berkeley, Berkeley, CA, USA

University of California, Berkeley, Berkeley, CA, USA
View Profile

,
Anitha Kannan

Microsoft Research, Mountain View, CA, USA

Microsoft Research, Mountain View, CA, USA
View Profile

,
John Shafer

Microsoft Research, Mountain View, CA, USA

Microsoft Research, Mountain View, CA, USA
View Profile

,
Mihai Budiu

Microsoft Research, Mountain View, CA, USA

Microsoft Research, Mountain View, CA, USA
View Profile

MapReduce '11: Proceedings of the second international workshop on MapReduce and its applicationsJune 2011Pages 35–42https://doi.org/10.1145/1996092.1996101

Published:08 June 2011Publication History

MapReduce '11: Proceedings of the second international workshop on MapReduce and its applications

Pages 35–42

ABSTRACT

The last decade has seen a surge of interest in large-scale data-parallel processing engines. While these engines share many features in common with parallel databases, they make a set of different trade-offs. In consequence many of the lessons learned for programming parallel databases have to be re-learned in the new environment. In this paper we show a case study of parallelizing an example large-scale application (offer matching, a core part of online shopping) on an example MapReduce-based distributed computation engine (DryadLINQ). We focus on the challenges raised by the nature of large data sets and data skew and show how they can be addressed effectively within this computation framework by optimizing the computation to adapt to the nature of the data. In particular we describe three different strategies for performing distributed joins and show how the platform language allows us to implement optimization strategies at the application level, without system support. We show that this flexibility in the programming model allows for a highly effective system, providing a measured speedup of more than 100 on 64 machines (256 cores), and an estimated speedup of 200 on 1280 machines (5120 cores)of matching 4 million offers.

References

Apache Hadoop.Google Scholar
P. B. R. R. Arnab Nandi, Cong Yu. Distributed cube materialization on holistic measures. Hanover, Germany, 2011.Google Scholar
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name matching in information integration. IEEE Intel ligent Systems, 18:16--23, 2003. Google ScholarDigital Library
R. Chaiken, P.-øA. L. Bob Jenkins, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. February 2008.Google Scholar
C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and Nathan. FlumeJava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2010. Google ScholarDigital Library
L. Chu, H. Tang, T. Yang, and K. Shen. Optimizing data aggregation for cluster-based internet services. In Symposium on Principles and practice of paral lel programming (PPoPP), pages 119--130. ACM, 2003. Google ScholarDigital Library
W. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. pages 201--212, 1998. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. San Francisco, CA, December 2004. Google ScholarDigital Library
D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proceedings of the 18th International Conference on Very Large Data Bases, VLDB '92, pages 27--40, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
I. Forrester Research. http://techcrunch.com/2010/03/08/forrester-forecast-online-retail-sales-will-grow-to-250-billion-by-2014/, 2011.Google Scholar
G. Fulgoni. The 2010 u.s. digital year in review. Technical report, Comscore, 2010.Google Scholar
Y. Gu and R. Grossman. Sector and Sphere: The design and implementation of a high performance data cloud. Philosophical Transactions of the Royal Society, 367(1897):2429--2445, June 2009.Google ScholarCross Ref
K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 525--535, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In European Conference on Computer Systems (EuroSys), pages 59--72, Lisbon, Portugal, March 21-23 2007. also as MSR-TR-2006-140. Google ScholarDigital Library
M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: fair scheduling for distributed computing clusters. In J. N. Matthews and T. E. Anderson, editors, ACM Symposium on Operating Systems Principles (SOSP), pages 261--276, Big Sky, Montana, USA, 2009. Google ScholarDigital Library
A. Kannan, I. E.Givoni, R. Agrawal, and A. Fuxman. Matching unstructured product offers to structured product descriptions. Technical Report MSR-TR-2010-172, Microsoft. Engineering (ICDE), pages 996--1005, Long Beach, California, March 1--6, 2010 2010.Google Scholar
Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, pages 75--86, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '00, pages 169--178, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
A. Monge and C. Elkan. The field matching problem: Algorithms and applications. In In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 267--270, 1996.Google Scholar
Nielsen. Global trends in online shopping: A nielsen global consumer report. Technical report, June 2010.Google Scholar
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (Industrial Track) (SIGMOD), Vancouver, Canada, June 2008. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In International Conference on Data Engineering (ICDE), pages 996--1005, Long Beach, California, March 1-6, 2010 2010.Google ScholarCross Ref
C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In Proceedings of the 17th International Conference on Very Large Data Bases, VLDB '91, pages 537--548, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
W. E. Winkler, W. E. Winkler, and N. P. Overview of record linkage and current research directions. Technical report, Bureau of the Census, 2006.Google Scholar
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proceedings of the Seventh International Conference on Data Engineering, pages 200--209, Washington, DC, USA, 1991. IEEE Computer Society. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. page 14, San Diego, CA, December 8-10 2008. Google ScholarDigital Library
Y. Zhao, Y. Xie, F. Yu, Q. Ke, Y. Yu, Y. Chen, and E. Gillum. Botgraph: large scale spamming botnet detection. In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, pages 321--334, Berkeley, CA, USA, 2009. USENIX Association. Google ScholarDigital Library

Index Terms

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching
1. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services

Recommendations

Processing Big Data with Azure HDInsight: Building Real-World Big Data Systems on Azure HDInsight Using the Hadoop Ecosystem
Read More
Large scale data processing using mapreduce
Read More
Intermediate Data Placement Strategy for Different Data Skew Levels Based on Random Sampling in Spark
ICBDC '19: Proceedings of the 4th International Conference on Big Data and Computing

In recent years, the Apache Spark had been widely used in processing large-scale data. However, when the input data onto MapReduce was skewed, the default intermediate data placement algorithm of the Apache Spark could not efficiently handle the skewed ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
MapReduce '11: Proceedings of the second international workshop on MapReduce and its applications
June 2011
82 pages
ISBN:9781450307000
DOI:10.1145/1996092
General Chairs:
Martin Arlitt
HP Labs, USA, and University of Calgary, Canada
,
Gilles Fedak
INRIA/LIP, France
,
Geoffrey Fox
Indiana University, USA
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 June 2011
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
DryadLINQ
data skew
offer matching
Qualifiers
- research-article
Conference
Upcoming Conference
HPDC '24

Sponsor:

sigarch

The 33rd International Symposium on High-Performance Parallel and Distributed Computing

June 3 - 7, 2024

Pisa , Italy
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 3
  Total Citations
  View Citations
- 213
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Parallelizing large-scale data processing applications with data skew: a case study in product-offer matching

MapReduce '11: Proceedings of the second international workshop on MapReduce and its applications

ABSTRACT

References

Cited By

Index Terms

Recommendations

Processing Big Data with Azure HDInsight: Building Real-World Big Data Systems on Azure HDInsight Using the Hadoop Ecosystem

Large scale data processing using mapreduce

Intermediate Data Placement Strategy for Different Data Skew Levels Based on Random Sampling in Spark