ABSTRACT
Bing's monetization pipeline is one of the largest and most critical streaming workloads deployed in Microsoft's internal data lake. The pipeline runs 24/7 across 3,500 YARN containers and must meet a Service Level Objective (SLO) of low tail latency. In this paper, we highlight some of the unique challenges imposed by this scale of operation: other concurrent workloads sharing the cluster may cause random performance deterioration; unavailability of external dependencies may temporarily stall the pipeline; and scarcity in the underlying resource manager may cause arbitrarily long delays or rejections of container allocation requests. Weathering these challenges requires specially tailored dynamic control policies that react to issues as they arise. We focus on reducing tail latency, i.e., the 99th percentile (p99), by detecting and mitigating slow instances through speculative replication. We show that widely used approaches do not satisfactorily solve this problem at our scale. A conservative approach is hesitant to acquire additional resources, reacts too slowly to changes in the environment, and therefore achieves little improvement in p99 latency. An aggressive approach, on the other hand, overwhelms the underlying resource manager with unnecessary resource requests and paradoxically worsens p99 latency. Our proposed approach, Spur, is designed for this challenging environment: it combines aggressive detection of slow instances with smart pruning of false positives to achieve a far better trade-off between these conflicting objectives. Using only 0.5% additional resources (similar to the conservative approach), we demonstrate a 10%-38% improvement in tail latency over both conservative and aggressive approaches.
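The core idea described in the abstract — aggressive detection of slow instances followed by pruning of false positives, under a small resource budget — can be illustrated with a minimal sketch. This is a hypothetical illustration only, not Spur's actual algorithm: the thresholding rule, the trend-based pruning check, and all names (`pick_replication_candidates`, `factor`, `budget`) are assumptions made for exposition.

```python
from statistics import median

def pick_replication_candidates(latencies, history, factor=1.5, budget=2):
    """Hypothetical sketch: aggressively flag slow instances, then prune
    likely false positives before requesting speculative replicas.

    latencies: {instance_id: latest processing latency}
    history:   {instance_id: recent latency samples, oldest first}
    """
    med = median(latencies.values())
    # Aggressive detection: anything noticeably above the median is suspect.
    suspects = [i for i, lat in latencies.items() if lat > factor * med]

    # Pruning: drop suspects whose latency is already recovering,
    # i.e. whose most recent samples trend downward (likely a transient blip).
    def still_degrading(i):
        recent = history.get(i, [])[-3:]
        return len(recent) < 2 or recent[-1] >= recent[0]

    candidates = [i for i in suspects if still_degrading(i)]
    # Cap requests so the underlying resource manager is not flooded
    # with unnecessary container allocations.
    return sorted(candidates, key=lambda i: latencies[i], reverse=True)[:budget]
```

Under this sketch, an instance that is slow but recovering is pruned rather than replicated, which captures the trade-off the abstract describes: detection stays aggressive, yet replica requests stay within a small budget.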