ABSTRACT
Application performance management (APM) tools are useful for observing the performance properties of an application in production. However, APM is normally purely reactive; that is, it can only report current or past performance degradation. Although some approaches capable of predictive application monitoring have been proposed, they can only report a predicted degradation and cannot explain its root cause, making it hard to prevent the expected degradation.
In this paper, we present SuanMing, a framework for predicting performance degradation of microservice applications running in cloud environments. SuanMing predicts the root causes of anticipated performance degradations and thereby aims to prevent them before they actually occur. We evaluate SuanMing on two realistic microservice applications, TeaStore and TrainTicket, and show that our approach is able to predict and pinpoint performance degradations with an accuracy of over 90%.