Otherworld: giving applications a chance to survive OS kernel crashes

Authors:
Alex Depoutovitch, University of Toronto, Toronto, ON, Canada
Michael Stumm, University of Toronto, Toronto, ON, Canada

EuroSys '10: Proceedings of the 5th European conference on Computer systems, April 2010, Pages 181–194. https://doi.org/10.1145/1755913.1755933

Published: 13 April 2010

ABSTRACT

The default behavior of all commodity operating systems today is to restart the system when a critical error is encountered in the kernel. This terminates all running applications with an attendant loss of "work in progress" that is nonpersistent.

Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore.
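
The control flow described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Otherworld's implementation: the names `Process`, `resurrect`, and `editor_recovery` are hypothetical, and plain Python objects stand in for kernel-managed state. It shows the two paths after a microreboot: resume at the interruption point by default, or hand control to a registered recovery procedure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Process:
    """Hypothetical stand-in for the state of one application process."""
    name: str
    memory: dict        # stands in for the application memory space
    open_files: list    # stands in for open files and other resources
    # Optional user-level recovery procedure registered before the crash.
    recovery: Optional[Callable] = None
    log: list = field(default_factory=list)


def resurrect(proc, restorable):
    """Restore what can be restored, then resume or call the recovery procedure."""
    # Resources that could not be restored are handed to the recovery procedure.
    missing = [f for f in proc.open_files if f not in restorable]
    proc.open_files = [f for f in proc.open_files if f in restorable]
    if proc.recovery is not None:
        proc.log.append("recovery procedure")
        proc.recovery(proc, missing)
    else:
        # Default case: continue from the point of interruption.
        proc.log.append("resumed at interruption point")
    return proc


def editor_recovery(proc, missing):
    """Example recovery procedure: check data, then re-acquire lost resources."""
    if "buffer" not in proc.memory:
        proc.memory["buffer"] = ""    # repair application data
    for name in missing:
        proc.open_files.append(name)  # reopen what was not restored
```

In this model a process without a registered procedure simply continues, while one with a procedure gets the chance to validate its data and re-establish anything that was lost.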

We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld is capable of successfully microrebooting the kernel and restoring the applications in over 97% of the cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection with overhead of between 4% and 12%.

Index Terms

Software and its engineering > Software organization and properties > Extra-functional properties > Software fault tolerance

Published in
EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010, 388 pages
ISBN: 9781605585772
DOI: 10.1145/1755913
General Chair: Christine Morin, INRIA Rennes, France
Program Chair: Gilles Muller, INRIA/LIP6, France
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: crash kernel, kernel, microreboot, recovery

Acceptance Rates
Overall Acceptance Rate: 241 of 1,308 submissions, 18%