Abstract
This paper presents a model-based predictive control approach to fault management in computing systems. The approach can incorporate existing fault diagnosis methods and fault recovery actions to facilitate the recovery process. When a fault is identified, the algorithm uses utility cost functions to compute the recovery solution that minimizes the fault's impact on the system's Quality of Service. The approach is demonstrated on a Web service testbed under various faults.
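The idea of choosing a recovery solution by minimizing a utility cost over a prediction horizon can be illustrated with a minimal sketch. This is not the authors' algorithm: the action names, their costs, the scalar "severity" state, and the `qos_weight` parameter are all hypothetical and chosen only to show the shape of a limited-lookahead search.

```python
from itertools import product

# Hypothetical recovery actions (assumptions, not from the paper):
#   cost     - one-off cost of executing the action
#   residual - fraction of fault severity that remains after the action
ACTIONS = {
    "none":            {"cost": 0.0, "residual": 1.0},
    "restart_service": {"cost": 2.0, "residual": 0.3},
    "reboot_vm":       {"cost": 5.0, "residual": 0.0},
}


def predict(severity, action):
    """Toy system model: predicted fault severity after applying an action."""
    return severity * ACTIONS[action]["residual"]


def plan(severity, horizon=3, qos_weight=10.0):
    """Enumerate all action sequences over the horizon and return the
    cheapest one together with its total cost.

    Total cost = sum over steps of (action cost + qos_weight * predicted severity),
    where the second term stands in for the QoS loss caused by the fault.
    """
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, total = severity, 0.0
        for action in seq:
            s = predict(s, action)
            total += ACTIONS[action]["cost"] + qos_weight * s
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq, best_cost


if __name__ == "__main__":
    sequence, cost = plan(severity=1.0)
    # In a receding-horizon controller only the first action would be applied,
    # after which the system is observed again and the plan is recomputed.
    print("first action:", sequence[0], "| predicted cost:", cost)
```

Exhaustive enumeration is used here only because the toy action set is tiny; a realistic controller would need a bounded or heuristic search, and a system model derived from measurements rather than fixed constants.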
Index Terms
- A Predictive Control Approach for Fault Management of Computing Systems