Operator scheduling in data stream systems

The VLDB Journal

Abstract.

In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on runtime system memory usage as well as on output latency. Our aim is to design a scheduling strategy that minimizes the maximum runtime system memory while keeping output latency within prespecified bounds. We first present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing runtime memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams and for multiple queries of the above types. However, during bursts in input streams, when there is a buildup of unprocessed tuples, Chain scheduling may lead to high output latency. We therefore study the online problem of minimizing maximum runtime memory subject to a constraint on maximum latency, and we present preliminary observations, negative results, and heuristics for this problem. We provide a thorough experimental evaluation in which we demonstrate the potential benefits of Chain scheduling and its variants, compare them with competing scheduling strategies, and validate our analytical conclusions.
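The claim that the scheduling strategy affects peak memory can be illustrated with a small simulation. The sketch below is not the Chain algorithm; it only compares a FIFO policy with a simple greedy rule that runs the operator freeing the most memory per time unit. The two-operator query, the selectivities, the unit processing cost, and the burst of seven arrivals are all assumptions chosen for illustration, as are the function names.

```python
from collections import deque

# Assumed two-operator query: operator 0 shrinks a tuple to 0.2 of its size,
# operator 1 emits it (nothing stays in memory). Each operator is assumed to
# take exactly one time unit per tuple.
SELECTIVITY = [0.2, 0.0]

def fifo(queues):
    """Run the operator holding the oldest tuple still in the system."""
    candidates = [(q[0][0], i) for i, q in enumerate(queues) if q]
    return min(candidates)[1] if candidates else None

def greedy(queues):
    """Run the operator whose next tuple frees the most memory in one step."""
    candidates = [(q[0][1] * (1.0 - SELECTIVITY[i]), i)
                  for i, q in enumerate(queues) if q]
    return max(candidates)[1] if candidates else None

def simulate(policy, burst_length, steps):
    """Return peak queued memory when one unit-size tuple arrives per step
    during the first `burst_length` steps and one operator runs per step."""
    queues = [deque() for _ in SELECTIVITY]  # entries are (arrival_time, size)
    peak = 0.0
    for t in range(steps):
        if t < burst_length:
            queues[0].append((t, 1.0))
        memory = sum(size for q in queues for _, size in q)
        peak = max(peak, memory)
        op = policy(queues)
        if op is None:
            continue  # nothing to process in this step
        arrived, size = queues[op].popleft()
        remaining = size * SELECTIVITY[op]
        if op + 1 < len(queues) and remaining > 0:
            queues[op + 1].append((arrived, remaining))
    return peak

if __name__ == "__main__":
    for name, policy in (("fifo", fifo), ("greedy", greedy)):
        print(name, round(simulate(policy, burst_length=7, steps=20), 2))
```

Under these assumptions the greedy rule peaks at about 2.2 units of memory while FIFO peaks at 4.0, because FIFO spends half of its time slots on the second operator, which frees little memory while full-size tuples keep accumulating at the first.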

Author information

Corresponding author

Correspondence to Brian Babcock.

Additional information

Received: 18 October 2003, Accepted: 16 April 2004, Published online: 14 September 2004

Edited by: J. Gehrke and J. Hellerstein

Brian Babcock: Supported in part by a Rambus Corporation Stanford Graduate Fellowship and NSF Grant IIS-0118173.

Shivnath Babu: Supported in part by NSF Grants IIS-0118173 and IIS-9817799.

Mayur Datar: Supported in part by a Siebel Scholarship and NSF Grant IIS-0118173.

Rajeev Motwani: Supported in part by NSF Grant IIS-0118173, an Okawa Foundation Research Grant, an SNRC grant, and grants from Microsoft and Veritas.

Dilys Thomas: Supported by NSF Grant EIA-0137761 and NSF ITR Award Number 0331640.

About this article

Cite this article

Babcock, B., Babu, S., Datar, M. et al. Operator scheduling in data stream systems. VLDB 13, 333–353 (2004). https://doi.org/10.1007/s00778-004-0132-6

  • DOI: https://doi.org/10.1007/s00778-004-0132-6
