ABSTRACT
As users of "big data" applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the "pay-as-you-go" model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.
Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures.
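The interplay of checkpointing, upstream backup, partitioning and replay described above can be illustrated with a minimal sketch. Everything named here is hypothetical and is not the paper's API: the CountingOperator class, the Upstream class, the route and partition helpers, the integer timestamps and the CRC-based key hashing are all assumptions made for illustration, and both "VMs" are simulated as in-process objects.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Snapshot of operator state and the timestamp of the last tuple it reflects."""
    state: dict
    last_ts: int


def route(key, n):
    """Assumed routing rule: deterministic hash of the key modulo n partitions."""
    return zlib.crc32(key.encode()) % n


@dataclass
class CountingOperator:
    """Hypothetical stateful operator that counts tuples per key."""
    state: dict = field(default_factory=dict)
    last_ts: int = 0

    def process(self, ts, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.last_ts = ts

    def checkpoint(self):
        # Externalise the state as a copy, so later processing cannot alter it.
        return Checkpoint(dict(self.state), self.last_ts)

    @classmethod
    def restore(cls, cp):
        return cls(dict(cp.state), cp.last_ts)


def partition(cp, n):
    """Split one checkpoint into n checkpoints with disjoint key sets."""
    parts = [Checkpoint({}, cp.last_ts) for _ in range(n)]
    for key, value in cp.state.items():
        parts[route(key, n)].state[key] = value
    return parts


class Upstream:
    """Hypothetical upstream node: buffers sent tuples and holds the backup."""

    def __init__(self):
        self.buffer = []                 # tuples not yet covered by a backup
        self.backup = Checkpoint({}, 0)  # empty backup until the first checkpoint

    def send(self, op, ts, key):
        self.buffer.append((ts, key))
        op.process(ts, key)

    def store_backup(self, cp):
        # Keep the new backup and trim tuples it already reflects.
        self.backup = cp
        self.buffer = [(ts, k) for ts, k in self.buffer if ts > cp.last_ts]

    def recover(self):
        # Restore the backup on a "new VM" and replay the unprocessed tuples.
        op = CountingOperator.restore(self.backup)
        for ts, key in self.buffer:
            op.process(ts, key)
        return op


# Usage: checkpoint after three tuples, process one more, then recover.
upstream = Upstream()
operator = CountingOperator()
for ts, key in enumerate(["a", "b", "a"], start=1):
    upstream.send(operator, ts, key)
upstream.store_backup(operator.checkpoint())
upstream.send(operator, 4, "b")
recovered = upstream.recover()

# Scale out: split the latest checkpoint across two new operator instances.
scaled = [CountingOperator.restore(p) for p in partition(operator.checkpoint(), 2)]
```

In this sketch, recovery and scale out share the same building blocks: both start from a checkpoint, and they differ only in whether it is restored whole or partitioned first.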
Index Terms
- Integrating scale out and fault tolerance in stream processing using operator state management