ABSTRACT
Columnar data formats such as Apache Parquet are increasingly popular for scalable storage and querying in data lakes, thanks to compressed storage and efficient data access via data skipping. However, spatial and spatio-temporal data require more advanced solutions that go beyond pruning on single attributes and support multidimensional pruning. Although solutions exist for geospatial data, such as GeoParquet and SpatialParquet, they fall short when applied to trajectory data (sequences of spatio-temporal positions). In this paper, we propose TrajParquet, a format for columnar storage of trajectory data, which is highly efficient and scalable. We also present a query processing algorithm that supports spatio-temporal range queries over TrajParquet. We evaluate TrajParquet on real-world data sets and compare it with extensions of GeoParquet and SpatialParquet that are suitable for handling spatio-temporal data.
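To make the idea of multidimensional pruning concrete, the following is a minimal sketch of data skipping for a spatio-temporal range query. It is not the TrajParquet algorithm, which the abstract does not detail. It assumes a hypothetical layout in which each group of positions (standing in for a Parquet row group) carries min/max statistics on x, y, and t, and all names (`Box`, `bounds`, `range_query`) are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t): hypothetical position record


@dataclass(frozen=True)
class Box:
    """Axis-aligned spatio-temporal box with closed intervals."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    t_min: float
    t_max: float

    def intersects(self, other: "Box") -> bool:
        # Two boxes overlap only if their intervals overlap in every dimension.
        return (self.x_min <= other.x_max and other.x_min <= self.x_max
                and self.y_min <= other.y_max and other.y_min <= self.y_max
                and self.t_min <= other.t_max and other.t_min <= self.t_max)

    def contains(self, p: Point) -> bool:
        x, y, t = p
        return (self.x_min <= x <= self.x_max
                and self.y_min <= y <= self.y_max
                and self.t_min <= t <= self.t_max)


def bounds(points: List[Point]) -> Box:
    """Compute the min/max statistics a writer would store for one group."""
    xs, ys, ts = zip(*points)
    return Box(min(xs), max(xs), min(ys), max(ys), min(ts), max(ts))


def range_query(groups: List[Tuple[Box, List[Point]]],
                query: Box) -> Tuple[List[Point], int]:
    """Return the matching points and the number of groups actually scanned."""
    hits: List[Point] = []
    scanned = 0
    for stats, points in groups:
        if not stats.intersects(query):
            continue  # skipped using statistics alone, no data read
        scanned += 1
        hits.extend(p for p in points if query.contains(p))
    return hits, scanned


if __name__ == "__main__":
    raw = [
        [(0.0, 0.0, 0.0), (1.0, 1.0, 10.0)],      # near the origin, early
        [(5.0, 5.0, 0.0), (6.0, 6.0, 10.0)],      # far away, early
        [(0.0, 0.0, 100.0), (1.0, 1.0, 110.0)],   # near the origin, late
    ]
    groups = [(bounds(g), g) for g in raw]
    query = Box(0.0, 2.0, 0.0, 2.0, 0.0, 20.0)
    hits, scanned = range_query(groups, query)
    print(hits, scanned)
```

In this toy input, the third group overlaps the query spatially but not temporally, so a filter on the spatial attributes alone would still have to scan it; testing all three dimensions together lets it be skipped.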
Index Terms
- TrajParquet: A Trajectory-Oriented Column File Format for Mobility Data Lakes