research-article

Integrating scale out and fault tolerance in stream processing using operator state management

Authors:
Raul Castro Fernandez, Imperial College London, London, United Kingdom
Matteo Migliavacca, University of Kent, Canterbury, United Kingdom
Evangelia Kalyvianaki, Imperial College London, London, United Kingdom
Peter Pietzuch, Imperial College London, London, United Kingdom

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, June 2013, Pages 725–736
https://doi.org/10.1145/2463676.2465282

Published: 22 June 2013

ABSTRACT

As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common in deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
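
The interplay of checkpointing, upstream backup, state partitioning and tuple replay described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `CountOperator`, `Upstream`, `Checkpoint` and `partition`, the per-key counting state, and the use of integer timestamps to skip already-processed tuples are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

StreamTuple = Tuple[int, str]  # (timestamp, key); hypothetical tuple layout


@dataclass
class Checkpoint:
    counts: Dict[str, int]  # externalised operator state
    last_ts: int            # timestamp of the last tuple reflected in the state


class CountOperator:
    """Hypothetical stateful operator: counts tuples per key."""

    def __init__(self, counts=None, last_ts=-1):
        self.counts = dict(counts or {})
        self.last_ts = last_ts

    def process(self, tup: StreamTuple) -> None:
        ts, key = tup
        if ts <= self.last_ts:
            return  # already reflected in the state; ignore on replay
        self.counts[key] = self.counts.get(key, 0) + 1
        self.last_ts = ts

    def checkpoint(self) -> Checkpoint:
        # Copy the state so later processing does not mutate the checkpoint.
        return Checkpoint(dict(self.counts), self.last_ts)


class Upstream:
    """Hypothetical upstream node: buffers sent tuples and holds the backup."""

    def __init__(self):
        self.buffer: List[StreamTuple] = []
        self.backup = Checkpoint({}, -1)

    def send(self, tup: StreamTuple, downstream: CountOperator) -> None:
        self.buffer.append(tup)
        downstream.process(tup)

    def store_backup(self, cp: Checkpoint) -> None:
        # Keep the checkpoint and trim tuples it already covers.
        self.backup = cp
        self.buffer = [t for t in self.buffer if t[0] > cp.last_ts]

    def recover(self) -> CountOperator:
        # Restore the checkpointed state on a "new VM", then replay the rest.
        op = CountOperator(self.backup.counts, self.backup.last_ts)
        for tup in self.buffer:
            op.process(tup)
        return op


def partition(cp: Checkpoint, n: int) -> List[Checkpoint]:
    """Split checkpointed state by key hash for scale out to n operators."""
    parts = [Checkpoint({}, cp.last_ts) for _ in range(n)]
    for key, value in cp.counts.items():
        # crc32 gives a hash that is stable across processes (assumed choice).
        parts[zlib.crc32(key.encode()) % n].counts[key] = value
    return parts


if __name__ == "__main__":
    up, op = Upstream(), CountOperator()
    for t in [(0, "a"), (1, "b"), (2, "a")]:
        up.send(t, op)
    up.store_backup(op.checkpoint())
    for t in [(3, "c"), (4, "a")]:
        up.send(t, op)
    print(up.recover().counts == op.counts)
```

In this sketch the upstream node only needs to retain tuples newer than the latest backed-up checkpoint, which is what keeps the buffer bounded; the same checkpoint serves both recovery (`recover`) and scale out (`partition`).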

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines

Published in
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676
General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)
Program Chair: Dimitris Papadias (HKUST)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
fault tolerance
scalability
stateful stream processing
Qualifiers
- research-article
Acceptance Rates
SIGMOD '13 Paper Acceptance Rate: 76 of 372 submissions, 20%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
Article Metrics
- Total Citations: 196
- Total Downloads: 1,844
- Downloads (last 12 months): 127
- Downloads (last 6 weeks): 6