DOI: 10.1145/1755913.1755933
research-article

Otherworld: giving applications a chance to survive OS kernel crashes

Published: 13 April 2010

ABSTRACT

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications with an attendant loss of "work in progress" that is nonpersistent.

Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore.
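
The resurrection flow described above can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the names `Process`, `resurrect`, and `reopen_missing` are hypothetical and do not come from Otherworld, and plain dictionaries and lists stand in for real kernel state such as address spaces and open files.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Process:
    """Toy stand-in for a process that survives a kernel microreboot."""
    pid: int
    memory: Dict[str, str]        # stand-in for the application memory space
    open_files: List[str]         # stand-in for open files
    # Optional user-level recovery procedure (hypothetical signature).
    recovery: Optional[Callable[["Process", List[str]], None]] = None
    log: List[str] = field(default_factory=list)


def resurrect(old: Process, restorable: Callable[[str], bool]) -> Process:
    """Rebuild a process after the simulated microreboot.

    Memory is copied as-is; only files accepted by `restorable` are
    reopened. If a recovery procedure is registered, it receives the list
    of resources that could not be restored; otherwise the process simply
    continues from where it was interrupted.
    """
    restored = [f for f in old.open_files if restorable(f)]
    missing = [f for f in old.open_files if f not in restored]
    new = Process(old.pid, dict(old.memory), restored, old.recovery)
    if new.recovery is not None:
        new.recovery(new, missing)
        new.log.append("recovery procedure ran")
    else:
        new.log.append("resumed at interruption point")
    return new


def reopen_missing(proc: Process, missing: List[str]) -> None:
    """Example recovery procedure: restore what the kernel could not."""
    proc.open_files.extend(missing)
```

In this model, a process without a registered procedure resumes with whatever was restored, while one that registers `reopen_missing` gets a chance to repair its own state first.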

We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with an overhead of between 4% and 12%.


Published in

EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010
388 pages
ISBN: 9781605585772
DOI: 10.1145/1755913

Copyright © 2010 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 241 of 1,308 submissions, 18%
