DOI: 10.1145/2463676.2465282
research-article

Integrating scale out and fault tolerance in stream processing using operator state management

Published: 22 June 2013

ABSTRACT

As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs; systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
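
The three mechanisms named above (checkpointing externalised state, partitioning a checkpoint for scale out, and restoring plus replaying for recovery) can be illustrated with a small sketch. This is not the paper's API: the names `Checkpoint`, `CountingOperator`, `partition`, `route_by_key` and `recover` are hypothetical, the operator is a toy per-key counter, and the upstream buffer is assumed to be a plain list of `(timestamp, key)` tuples with integer keys.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Snapshot of operator state and the timestamp of the last tuple it reflects."""
    state: dict
    last_ts: int


@dataclass
class CountingOperator:
    """Toy stateful operator (hypothetical): counts tuples per key.

    Its state is an explicit dict, so the surrounding system can copy it.
    """
    state: dict = field(default_factory=dict)
    last_ts: int = -1

    def process(self, ts, key):
        # Update the per-key count and remember how far we have processed.
        self.state[key] = self.state.get(key, 0) + 1
        self.last_ts = ts

    def checkpoint(self):
        # Deep copy so later processing does not mutate the backup.
        return Checkpoint(copy.deepcopy(self.state), self.last_ts)

    @classmethod
    def restore(cls, cp):
        # Build a fresh operator instance from a checkpoint.
        return cls(copy.deepcopy(cp.state), cp.last_ts)


def route_by_key(key, n):
    # Assumed routing rule for integer keys; a real system would pick its own.
    return key % n


def partition(cp, n, route=route_by_key):
    """Split one checkpoint into n checkpoints, one per new operator instance."""
    parts = [Checkpoint({}, cp.last_ts) for _ in range(n)]
    for key, value in cp.state.items():
        parts[route(key, n)].state[key] = value
    return parts


def recover(cp, upstream_buffer):
    """Restore from a checkpoint, then replay tuples the checkpoint does not cover."""
    op = CountingOperator.restore(cp)
    for ts, key in upstream_buffer:
        if ts > cp.last_ts:
            op.process(ts, key)
    return op
```

In this sketch, recovering from a checkpoint taken mid-stream and replaying the buffered tuples yields the same counts as an operator that never failed, and `partition` splits the checkpointed counts so that each key lands on exactly one of the new instances.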

    • Published in

      SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
      June 2013
      1322 pages
      ISBN:9781450320375
      DOI:10.1145/2463676

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      SIGMOD '13 paper acceptance rate: 76 of 372 submissions (20%). Overall acceptance rate: 785 of 4,003 submissions (20%).
