Abstract.
In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on runtime memory usage as well as on output latency. Our aim is to design a scheduling strategy that minimizes the maximum runtime memory usage while keeping output latency within prespecified bounds. We first present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing runtime memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams and for multiple queries of the above types. However, during bursts in input streams, when there is a buildup of unprocessed tuples, Chain scheduling may lead to high output latency. We therefore study the online problem of minimizing maximum runtime memory subject to a constraint on maximum latency, and we present preliminary observations, negative results, and heuristics for this problem. A thorough experimental evaluation demonstrates the potential benefits of Chain scheduling and its variants, compares it with competing scheduling strategies, and validates our analytical conclusions.
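The claim that the scheduling strategy affects runtime memory can be illustrated with a small simulation. The sketch below is not the Chain algorithm itself; it compares two simple policies on a hypothetical two-operator pipeline. All numbers are assumptions chosen for illustration: one unit-size tuple arrives per time unit, each operator invocation takes one time unit, the first operator (O1) shrinks a tuple to 0.2 of its size, and the second operator (O2) outputs it and frees the remaining memory. The function name `simulate` and the policy labels are likewise hypothetical.

```python
def simulate(policy, arrivals=7, selectivity=0.2):
    """Return queued memory observed at t = 0, 1, ..., arrivals - 1.

    Assumed setup (illustrative only): one unit-size tuple arrives at the
    start of every time unit, and exactly one operator runs per time unit.
    """
    q1 = 0  # unit-size tuples waiting at O1
    q2 = 0  # tuples of size `selectivity` waiting at O2
    usage = []
    for _ in range(arrivals):
        q1 += 1  # a new tuple arrives
        usage.append(round(q1 + q2 * selectivity, 2))

        if policy == "fifo":
            # FIFO finishes the oldest tuple first; any tuple waiting at O2
            # arrived before every tuple still waiting at O1.
            run_o2 = q2 > 0
        elif policy == "greedy":
            # Greedy runs the operator that frees the most memory per time
            # unit: O1 frees 1 - selectivity, O2 frees only selectivity.
            run_o2 = q1 == 0
        else:
            raise ValueError(f"unknown policy: {policy}")

        if run_o2 and q2 > 0:
            q2 -= 1  # O2 outputs the tuple, freeing its memory
        elif q1 > 0:
            q1 -= 1  # O1 shrinks a tuple and passes it on to O2
            q2 += 1
    return usage


if __name__ == "__main__":
    print("fifo  ", simulate("fifo"))
    print("greedy", simulate("greedy"))
```

Under these assumptions the FIFO policy holds noticeably more memory than the greedy policy by the end of the burst, even though both process the same input with the same operators. The greedy policy pays for this by delaying output, which is the memory versus latency tension that the abstract describes.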
Additional information
Received: 18 October 2003, Accepted: 16 April 2004, Published online: 14 September 2004
Edited by: J. Gehrke and J. Hellerstein
Brian Babcock: Supported in part by a Rambus Corporation Stanford Graduate Fellowship and NSF Grant IIS-0118173.
Shivnath Babu: Supported in part by NSF Grants IIS-0118173 and IIS-9817799.
Mayur Datar: Supported in part by Siebel Scholarship and NSF Grant IIS-0118173.
Rajeev Motwani: Supported in part by NSF Grant IIS-0118173, an Okawa Foundation Research Grant, an SNRC grant, and grants from Microsoft and Veritas.
Dilys Thomas: Supported by NSF Grant EIA-0137761 and NSF ITR Award Number 0331640.
Cite this article
Babcock, B., Babu, S., Datar, M. et al. Operator scheduling in data stream systems. VLDB 13, 333–353 (2004). https://doi.org/10.1007/s00778-004-0132-6
DOI: https://doi.org/10.1007/s00778-004-0132-6