Large-Scale Data Stream Processing Systems

Chapter in: Handbook of Big Data Technologies

Abstract

In our data-centric society, online services, decision making, and many other activities increasingly depend on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data with low latency and high throughput. Large-scale data stream processing systems treat data as infinite streams and are designed to satisfy these requirements. They have evolved substantially, both in their support for expressive programming models and in efficient, durable runtime execution on commodity clusters. Expressive programming models offer convenient ways to declare continuous data properties and the computations applied to them, while hiding the details of how data streams are physically processed and orchestrated in a distributed environment. Execution engines provide a runtime for such models, allowing scalable yet durable execution of any declared computation. In this chapter we introduce the major design aspects of large-scale data stream processing systems, covering programming model abstraction levels and runtime concerns. We then present a detailed case study on stateful stream processing with Apache Flink, an open-source stream processor that is used for a wide variety of processing tasks. Finally, we address, from a systemic point of view, the main challenges of the disruptive applications that large-scale data streaming enables.

Notes

  1. Generally in functional programming, higher-order functions might also produce functions as their outputs, but this does not appear in stream processing.

  2. Note that sum in this example is a predefined aggregation function; however, a UDF can typically also be provided to declare an incremental computation.
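
The second note can be illustrated with a minimal Python sketch. This is not the Flink API: the helper `rolling_reduce`, the wrapper `rolling_sum`, and the sample `readings` are hypothetical names introduced only for illustration. The sketch assumes an incremental computation is declared as a binary function that combines the current aggregate with the next record, and that a predefined aggregation such as sum is one fixed choice of that function.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def rolling_reduce(stream: Iterable[T], udf: Callable[[T, T], T]) -> Iterator[T]:
    """Yield the running aggregate after each record (hypothetical helper).

    `udf` combines the aggregate so far with the next record, so the
    result is updated incrementally instead of being recomputed.
    """
    it = iter(stream)
    try:
        acc = next(it)  # the first record seeds the aggregate
    except StopIteration:
        return  # empty stream: nothing to emit
    yield acc
    for record in it:
        acc = udf(acc, record)
        yield acc


def rolling_sum(stream: Iterable[int]) -> Iterator[int]:
    """A 'predefined' aggregation: rolling_reduce with addition fixed as the UDF."""
    return rolling_reduce(stream, lambda a, b: a + b)


readings = [3, 1, 4, 1, 5]  # illustrative data, not from the chapter
running_sum = list(rolling_sum(readings))
running_max = list(rolling_reduce(readings, max))  # user-supplied UDF
```

Because the helper is a generator, it also accepts an unbounded iterator and emits one updated aggregate per incoming record.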

Corresponding author

Correspondence to Paris Carbone.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Carbone, P. et al. (2017). Large-Scale Data Stream Processing Systems. In: Zomaya, A., Sakr, S. (eds) Handbook of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-49340-4_7

  • DOI: https://doi.org/10.1007/978-3-319-49340-4_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49339-8

  • Online ISBN: 978-3-319-49340-4

  • eBook Packages: Computer Science, Computer Science (R0)
