ABSTRACT
By default, all commodity operating systems today restart the system when the kernel encounters a critical error. The restart terminates all running applications, and any nonpersistent "work in progress" is lost.
Otherworld is a mechanism that microreboots the operating system kernel when a critical error is encountered in the kernel, and it does so without clobbering the state of the running applications. After the kernel microreboot, Otherworld attempts to resurrect the applications that were running at the time of failure. It does so by restoring the application memory spaces, open files and other resources. In the default case it then continues executing the processes from the point at which they were interrupted by the failure. Optionally, applications can have user-level recovery procedures registered with the kernel, in which case Otherworld passes control to these procedures after having restored their process state. Recovery procedures might check the integrity of application data and restore resources Otherworld was not able to restore.
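The resurrection flow described above can be illustrated with a small user-space simulation. This is only a sketch of the control flow, not Otherworld's kernel code or interface: the names `SavedProcess`, `resurrect`, and `editor_recovery`, the dictionary standing in for a memory space, and the string return values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SavedProcess:
    """Hypothetical record of one process found after the kernel failure."""
    pid: int
    memory: dict        # stands in for the application memory space
    open_files: list    # stands in for open files and other resources
    resume_point: str   # where the failure interrupted the process
    # Optional user-level recovery procedure registered by the application.
    recovery_proc: Optional[Callable[["SavedProcess", list], str]] = None


def resurrect(proc, restorable):
    """Restore what can be restored, then choose how execution continues."""
    unrestored = [f for f in proc.open_files if f not in restorable]
    proc.open_files = [f for f in proc.open_files if f in restorable]
    if proc.recovery_proc is not None:
        # Registered case: hand control to the application's own procedure,
        # telling it which resources could not be restored.
        return proc.recovery_proc(proc, unrestored)
    # Default case: continue from the point of interruption.
    return f"resume at {proc.resume_point}"


def editor_recovery(proc, unrestored):
    """Example recovery procedure (hypothetical): check data, reopen resources."""
    if "buffer" not in proc.memory:
        return "abort: application data failed integrity check"
    proc.open_files.extend(unrestored)  # application re-creates what was lost
    return "recovered via user-level procedure"
```

A process without a registered procedure takes the default path and simply resumes, while one that registered `editor_recovery` first gets the chance to validate its data and re-create the resources that could not be restored automatically.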
We implemented Otherworld in Linux, but we believe that the technique can be applied to all commodity operating systems. In an extensive set of experiments on real-world applications (MySQL, Apache/PHP, Joe, vi), we show that Otherworld successfully microreboots the kernel and restores the applications in over 97% of cases. In the default case, Otherworld adds zero overhead to normal execution. In an enhanced mode, Otherworld can provide extra application memory protection at an overhead of between 4% and 12%.