Tuning Hadoop Map Slot Value Using CPU Metric

Kc, Kamal; Freeh, Vincent W.

doi:10.1007/978-3-319-13021-7_11

Conference paper
First Online: 11 November 2014

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8807)

Abstract

Hadoop is a widely used open source MapReduce framework. Its performance is critical because it increases the usefulness of products and services for the large number of companies that have adopted Hadoop for their business purposes. One of the configuration parameters that influences resource allocation, and thus the performance of a Hadoop application, is the map slot value (MSV). The MSV determines the number of map tasks that run concurrently on a node. For a given architecture, a Hadoop application has an MSV for which its performance is best, but no single map slot value is best for all applications. A Hadoop application's performance suffers when the MSV is not set to this best value, so knowing the best MSV for an application is important. In this work, we present a low-overhead method to predict the best MSV using a new Hadoop counter that measures per-map-task CPU utilization. Our experiments on a variety of Hadoop applications show that using a single MSV for all applications results in performance degradation of up to 132% compared with using the best MSV for each application.
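The comparison in the last sentence of the abstract can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes that degradation is measured as the percentage slowdown in job runtime relative to the runtime at the application's best MSV, and the application names and runtimes are hypothetical placeholders, not measurements from the paper.

```python
def best_msv(runtimes):
    """Return the MSV with the smallest runtime for one application.

    `runtimes` maps an MSV (int) to a job runtime in seconds.
    """
    return min(runtimes, key=runtimes.get)


def degradation_percent(runtimes, msv):
    """Percent slowdown at `msv` relative to the application's best MSV.

    Assumed definition: 100 * (T_msv - T_best) / T_best.
    """
    best_time = runtimes[best_msv(runtimes)]
    return 100.0 * (runtimes[msv] - best_time) / best_time


def worst_case_degradation(all_runtimes, msv):
    """Largest degradation across applications when all of them share one MSV."""
    return max(degradation_percent(r, msv) for r in all_runtimes.values())


# Hypothetical runtimes in seconds, keyed by MSV. NOT data from the paper.
example_runtimes = {
    "app_a": {2: 400.0, 4: 250.0, 8: 200.0, 16: 260.0},
    "app_b": {2: 300.0, 4: 330.0, 8: 420.0, 16: 600.0},
}

if __name__ == "__main__":
    for name, runtimes in example_runtimes.items():
        print(name, "best MSV:", best_msv(runtimes))
    for msv in (2, 4, 8, 16):
        print("shared MSV", msv, "worst-case degradation (%):",
              worst_case_degradation(example_runtimes, msv))
```

In this made-up data the two applications prefer different MSVs, so every shared setting penalizes at least one of them, which is the effect the abstract describes.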

Notes

1. The machines used in our experiments have hyperthreading enabled, so the number of CPU schedulable contexts seen by the operating system is greater than the number of cores. In a hyperthreaded system, the CPU_UTIL metric measures the utilization of the hardware threads, and 100% utilization occurs when all threads are busy rather than when all cores are busy.
2. A node in the x86 cluster has a single SAS hard disk, whereas a PowerPC node has five SAS hard disks in a RAID-5 configuration.
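
The distinction drawn in note 1 can be illustrated with a short sketch. It assumes a simplified model in which utilization is the fraction of schedulable hardware threads that are busy; the function name, its parameters, and the core counts in the usage lines are hypothetical and are not taken from the paper or from Hadoop.

```python
def cpu_util_percent(busy_contexts, cores, threads_per_core=2):
    """Utilization measured over hardware threads rather than cores.

    Assumed model: each core exposes `threads_per_core` schedulable
    contexts, and utilization is the busy share of all contexts.
    """
    total_contexts = cores * threads_per_core
    if not 0 <= busy_contexts <= total_contexts:
        raise ValueError("busy_contexts must be between 0 and the context count")
    return 100.0 * busy_contexts / total_contexts


if __name__ == "__main__":
    # Hypothetical node: 8 cores with 2 hardware threads each.
    print(cpu_util_percent(8, cores=8))   # one busy thread per core
    print(cpu_util_percent(16, cores=8))  # every hardware thread busy
```

Under this model, a node with one busy thread on every core reports only half of full utilization, which is why the note specifies that 100% refers to all threads being busy.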

Author information

Authors and Affiliations

North Carolina State University, Raleigh, USA
Kamal Kc & Vincent W. Freeh

Corresponding author

Correspondence to Kamal Kc.

Editor information

Editors and Affiliations

ICT, Chinese Academy of Sciences, Beijing, China
Jianfeng Zhan
ICT, Chinese Academy of Sciences, Beijing, China
Rui Han
Shannon (IT) Lab., Huawei, China
Chuliang Weng

About this paper

Cite this paper

Kc, K., Freeh, V.W. (2014). Tuning Hadoop Map Slot Value Using CPU Metric. In: Zhan, J., Han, R., Weng, C. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2014. Lecture Notes in Computer Science, vol 8807. Springer, Cham. https://doi.org/10.1007/978-3-319-13021-7_11

DOI: https://doi.org/10.1007/978-3-319-13021-7_11
Published: 11 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13020-0
Online ISBN: 978-3-319-13021-7
eBook Packages: Computer Science, Computer Science (R0)
