Abstract
Ad-hoc analysis implies processing data in near real-time. Thus, raw data (i.e., neither normalized nor transformed) is typically dumped into a distributed engine, where it is generally stored in a hybrid layout. Hybrid layouts divide data into horizontal partitions, and inside each partition the data is stored vertically. They keep statistics for each horizontal partition and also support encoding (e.g., dictionary encoding) and compression to reduce the size of the data. Their built-in support for many ad-hoc operations (e.g., selection, projection, and aggregation) makes hybrid layouts the best choice for most operations.
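The structure of a hybrid layout described above can be illustrated with a minimal in-memory sketch. It is not Parquet's format or API: the names `RowGroup`, `build_layout`, and `select_project` are hypothetical, the data is integer-only, and only min/max statistics are kept. It shows rows split into horizontal partitions, each partition stored column by column, and per-partition statistics used to skip partitions during a selection.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """One horizontal partition: columns stored separately, plus min/max stats."""
    columns: Dict[str, List[int]]
    stats: Dict[str, Tuple[int, int]]


def build_layout(rows: List[Dict[str, int]], rows_per_group: int) -> List[RowGroup]:
    """Split rows into horizontal partitions and pivot each one into columns."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append(RowGroup(columns, stats))
    return groups


def select_project(layout, project, where_col, low, high):
    """Return the projected values where low <= where_col <= high,
    together with the number of partitions that had to be read."""
    out, groups_read = [], 0
    for g in layout:
        lo, hi = g.stats[where_col]
        if hi < low or lo > high:
            continue  # statistics prove no row can match: skip this partition
        groups_read += 1
        for i, v in enumerate(g.columns[where_col]):
            if low <= v <= high:
                out.append(g.columns[project][i])  # touch only the needed column
    return out, groups_read


# Toy data (assumed for illustration): 12 rows, 4 rows per partition.
rows = [{"id": i, "price": i * 10} for i in range(12)]
layout = build_layout(rows, rows_per_group=4)
values, groups_read = select_project(layout, "price", "id", 5, 6)
```

In this toy run, only the middle partition overlaps the predicate range, so the other two are skipped using their statistics alone.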
The horizontal partition size and dictionary size of hybrid layouts are configurable and can directly impact the performance of analytical queries. Hence, their default configuration cannot be expected to be optimal for all scenarios. In this paper, we present ATUN-HL (Auto TUNing Hybrid Layouts), which, based on a cost model and given the workload and the characteristics of the data, finds the best values for these parameters. We prototyped ATUN-HL for Apache Parquet, an open-source implementation of hybrid layouts in the Hadoop Distributed File System, to show its effectiveness. Our experimental evaluation shows that ATUN-HL provides on average 85% of all the potential performance improvement and a 1.2x average speedup over the default configuration.
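To make the idea of workload-driven tuning concrete, the following toy sketch picks a horizontal partition size by evaluating a made-up cost function over a few candidates. It is not ATUN-HL's cost model, which is not given here: `toy_cost`, `pick_rows_per_group`, the per-partition overhead, and the sample workload are all assumptions chosen only to show the trade-off between reading fewer rows (small partitions) and paying less per-partition overhead (large partitions).

```python
from typing import List, Tuple


def toy_cost(values: List[int], rows_per_group: int,
             queries: List[Tuple[int, int]], per_group_overhead: int) -> int:
    """Hypothetical cost: for each range query, pay a fixed overhead for every
    partition (its statistics are inspected) plus one unit per row in every
    partition whose min/max range overlaps the query."""
    groups = [values[i:i + rows_per_group]
              for i in range(0, len(values), rows_per_group)]
    cost = 0
    for low, high in queries:
        cost += per_group_overhead * len(groups)
        for g in groups:
            if max(g) >= low and min(g) <= high:
                cost += len(g)  # partition cannot be skipped, so it is read
    return cost


def pick_rows_per_group(values, candidates, queries, per_group_overhead):
    """Return the candidate partition size with the lowest toy cost."""
    return min(candidates,
               key=lambda c: toy_cost(values, c, queries, per_group_overhead))


# Assumed toy inputs: sorted data and a workload of two range selections.
values = list(range(100))
queries = [(10, 14), (50, 59)]
candidates = [5, 10, 25, 50, 100]
best = pick_rows_per_group(values, candidates, queries, per_group_overhead=2)
```

With these inputs, very small partitions are penalized by the overhead term and very large ones by the extra rows read, so an intermediate size wins.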
Acknowledgement
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC), and the GENESIS project, funded by the Spanish Ministerio de Ciencia e Innovación under project TIN2016-79269-R.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Munir, R.F., Abelló, A., Romero, O., Thiele, M., Lehner, W. (2018). ATUN-HL: Auto Tuning of Hybrid Layouts Using Workload and Data Characteristics. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_14
DOI: https://doi.org/10.1007/978-3-319-98398-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)