Abstract
Sliding-window aggregation is a foundational stream processing primitive that efficiently summarizes recent data. The state-of-the-art algorithms for sliding-window aggregation are highly efficient when stream data items are evicted or inserted one at a time, even when some of the insertions occur out-of-order. However, real-world streams are often not only out-of-order but also bursty, causing data items to be evicted or inserted in larger bulks. This paper introduces a new algorithm for sliding-window aggregation with bulk eviction and bulk insertion. For the special case of single insert and evict, our algorithm matches the theoretical complexity of the best previous out-of-order algorithms. For the case of bulk evict, our algorithm improves upon the theoretical complexity of the best previous algorithm for that case and also outperforms it in practice. For the case of bulk insert, there are no prior algorithms, and our algorithm improves upon the naive approach of emulating bulk insert with a loop over single inserts, both in theory and in practice. Overall, this paper makes high-performance algorithms for sliding-window aggregation more broadly applicable by efficiently handling the ubiquitous cases of out-of-order data and bursts.
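To make the operations named in the abstract concrete, the following is a minimal sketch of the naive baseline only, not the algorithm this paper introduces. It keeps the window as a timestamp-sorted list, emulates bulk insert with a loop over single inserts, and recomputes the aggregate on every query. The class name `NaiveWindow`, its method names, and the choice to combine values that share a timestamp are all assumptions made for illustration.

```python
import bisect
from functools import reduce


class NaiveWindow:
    """Hypothetical baseline window; NOT the paper's data structure."""

    def __init__(self, combine, identity):
        self.combine = combine    # associative binary operator
        self.identity = identity  # neutral element for combine
        self.times = []           # timestamps, kept sorted
        self.values = []          # values aligned with self.times

    def insert(self, t, v):
        """Single insert; t may be older than existing entries (out-of-order)."""
        i = bisect.bisect_left(self.times, t)
        if i < len(self.times) and self.times[i] == t:
            # Assumption: entries with equal timestamps are combined.
            self.values[i] = self.combine(self.values[i], v)
        else:
            self.times.insert(i, t)
            self.values.insert(i, v)

    def bulk_insert(self, pairs):
        """Naive bulk insert: a loop over single inserts."""
        for t, v in pairs:
            self.insert(t, v)

    def bulk_evict(self, t):
        """Remove every entry whose timestamp is <= t."""
        i = bisect.bisect_right(self.times, t)
        del self.times[:i]
        del self.values[:i]

    def query(self):
        """Aggregate all values in timestamp order (linear time here)."""
        return reduce(self.combine, self.values, self.identity)


w = NaiveWindow(max, float("-inf"))
w.bulk_insert([(5, 2.0), (1, 7.0), (3, 4.0)])  # a burst that arrives out of order
print(w.query())   # 7.0
w.bulk_evict(2)    # drops the entry at timestamp 1
print(w.query())   # 4.0
```

Every step of this sketch pays list-shifting or full-recomputation costs, which is the kind of overhead that motivates dedicated bulk operations.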