short-paper

A Predictive Control Approach for Fault Management of Computing Systems

Published: 19 November 2015

Abstract

In this paper, a model-based predictive control approach for fault management in computing systems is presented. The proposed approach can incorporate existing fault diagnosis methods and fault recovery actions to facilitate the recovery process. When a fault is identified, the proposed algorithm uses utility cost functions to compute the optimal recovery solution that minimizes fault impacts on the system's Quality of Service. The proposed approach has been demonstrated on a Web service testbed under various faults.
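The selection step described in the abstract, choosing the recovery action with the lowest predicted utility cost, can be illustrated with a minimal sketch. This is not the authors' algorithm: the action fields, the penalty values, and the names `RecoveryAction`, `predicted_cost`, and `choose_recovery` are all hypothetical, and the system model is reduced to a fixed per-step QoS penalty over a finite prediction horizon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryAction:
    """Hypothetical recovery action; all fields are illustrative assumptions."""
    name: str
    downtime_steps: int   # steps during which the service is unavailable
    action_cost: float    # one-off cost of performing the action
    fixes_fault: bool     # whether the fault is cleared once the action completes


def predicted_cost(action, horizon, degraded_penalty, outage_penalty):
    """Sum a per-step QoS penalty over the horizon, plus the action's own cost."""
    cost = action.action_cost
    for step in range(horizon):
        if step < action.downtime_steps:
            cost += outage_penalty      # service is down while recovering
        elif not action.fixes_fault:
            cost += degraded_penalty    # fault persists, QoS stays degraded
        # otherwise the service is healthy and adds no penalty
    return cost


def choose_recovery(actions, horizon, degraded_penalty, outage_penalty):
    """Return the action with the lowest predicted cost over the horizon."""
    return min(
        actions,
        key=lambda a: predicted_cost(a, horizon, degraded_penalty, outage_penalty),
    )


# Assumed candidate actions and penalties, chosen only for illustration.
ACTIONS = [
    RecoveryAction("do_nothing", downtime_steps=0, action_cost=0.0, fixes_fault=False),
    RecoveryAction("restart_service", downtime_steps=2, action_cost=1.0, fixes_fault=True),
    RecoveryAction("reboot_host", downtime_steps=5, action_cost=2.0, fixes_fault=True),
]

if __name__ == "__main__":
    for horizon in (3, 10):
        best = choose_recovery(ACTIONS, horizon, degraded_penalty=3.0, outage_penalty=10.0)
        print(horizon, best.name)
```

With these assumed numbers, a short horizon favors leaving the fault in place because the outage caused by recovery outweighs the degraded service, while a longer horizon favors restarting the service. A realistic controller would replace the fixed penalties with a predictive model of the system's response to each action.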

