ABSTRACT
MapReduce has emerged as one of the most popular programming models for data-parallel enterprise applications. Despite advances in runtime systems, the opportunities for optimizing MapReduce applications remain largely unexplored. In this paper, we present a framework for performing holistic compiler optimizations on legacy MapReduce applications. We have identified and implemented two optimizations and evaluated them with a set of Hadoop applications on a cluster of Xeon servers. Our experiments show that performance gains of more than 3X can be achieved without user involvement.