short-paper

A Predictive Control Approach for Fault Management of Computing Systems

Authors:
Rui Jia, Mississippi State University, Mississippi State, MS
Sherif Abdelwahed, Mississippi State University, Mississippi State, MS
Abdelkarim Erradi, Qatar University, Doha, Qatar
ACM SIGMETRICS Performance Evaluation Review, Volume 43, Issue 3, December 2015, pp. 16–20. https://doi.org/10.1145/2847220.2847225

Published: 19 November 2015

Abstract

In this paper, a model-based predictive control approach for fault management in computing systems is presented. The proposed approach can incorporate existing fault diagnosis methods and fault recovery actions to facilitate the recovery process. When a fault is identified, the proposed algorithm uses utility cost functions to compute the optimal recovery solution that minimizes fault impacts on the system's Quality of Service. The proposed approach has been demonstrated on a Web service testbed under various faults.
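
The abstract does not give the system model or the exact form of the utility cost functions, so the following is only a minimal sketch of the general idea: predict the Quality of Service under each candidate recovery action over a finite horizon, score each prediction with a cost function, and pick the cheapest action. The names `RecoveryAction`, `predicted_response_times`, `utility_cost`, and `select_recovery`, the toy response-time model, and all numeric values are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RecoveryAction:
    """Hypothetical recovery action; fields are illustrative assumptions."""
    name: str
    action_cost: float    # one-time cost of applying the action
    recovery_steps: int   # time steps until the fault is assumed cleared


def predicted_response_times(action: RecoveryAction, faulty_rt: float,
                             healthy_rt: float, horizon: int) -> List[float]:
    """Toy predictive model (assumption): the response time stays at the
    faulty level until the action completes, then returns to healthy."""
    return [faulty_rt if k < action.recovery_steps else healthy_rt
            for k in range(horizon)]


def utility_cost(action: RecoveryAction, faulty_rt: float, healthy_rt: float,
                 target_rt: float, horizon: int,
                 qos_weight: float = 1.0) -> float:
    """Cost = weighted QoS violation over the horizon + cost of the action."""
    trajectory = predicted_response_times(action, faulty_rt, healthy_rt, horizon)
    qos_penalty = sum(max(0.0, rt - target_rt) for rt in trajectory)
    return qos_weight * qos_penalty + action.action_cost


def select_recovery(actions: Sequence[RecoveryAction], faulty_rt: float,
                    healthy_rt: float, target_rt: float,
                    horizon: int) -> RecoveryAction:
    """Exhaustively evaluate the candidates and return the cheapest one."""
    return min(actions, key=lambda a: utility_cost(
        a, faulty_rt, healthy_rt, target_rt, horizon))


# Illustrative candidates for a diagnosed fault (all values are made up).
HORIZON = 10
ACTIONS = [
    RecoveryAction("do_nothing", action_cost=0.0, recovery_steps=HORIZON),
    RecoveryAction("restart_service", action_cost=2.0, recovery_steps=3),
    RecoveryAction("failover_to_replica", action_cost=6.0, recovery_steps=1),
]

if __name__ == "__main__":
    best = select_recovery(ACTIONS, faulty_rt=2.0, healthy_rt=0.2,
                           target_rt=0.5, horizon=HORIZON)
    print(best.name)
```

In this toy setting the slower but cheaper action can win over the fastest one, which is the kind of trade-off a utility-based selection is meant to expose. A real controller would replace the toy model with a learned or analytical system model and would re-evaluate the choice at each control step.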

Published in
ACM SIGMETRICS Performance Evaluation Review Volume 43, Issue 3
December 2015
89 pages
ISSN:0163-5999
DOI:10.1145/2847220
Editor:
Nidhi Hegde
Bell Labs France, Alcatel-Lucent Centre de Villarceaux Route de Villejust, Nozay, France
Copyright © 2015 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 November 2015
Author Tags
Autonomic Computing
Fault Tolerance
Model-based Control
Predictive Control
Self-healing
Qualifiers
- short-paper