System-level virtualization research at Oak Ridge National Laboratory

https://doi.org/10.1016/j.future.2009.07.001

Abstract

System-level virtualization is today enjoying a rebirth. It began as a technique for effectively sharing what were then considered large computing resources, then faded from the spotlight as individual workstations gained popularity with a “one machine–one user” approach. One reason for the resurgence is that the simple workstation has grown in capability to rival the large shared systems of the past. Thus, computing centers are again looking at the price/performance benefit of sharing a single computing box via server consolidation.

However, industry is concentrating only on the benefits of using virtualization for server consolidation (enterprise computing), whereas our interest is in leveraging virtualization to advance high-performance computing (HPC). These two interests may appear orthogonal: one consolidates multiple applications and users on a single machine, while the other dedicates the full power of many machines to a single purpose. Nevertheless, we propose that virtualization provides attractive capabilities that may be exploited for the benefit of HPC. This raises two fundamental questions: is the concept of virtualization (a machine “sharing” technology) really suitable for HPC, and if so, how does one leverage these virtualization capabilities for the benefit of HPC?

To address these questions, this document presents ongoing studies on the usage of system-level virtualization in an HPC context. These studies include an analysis of the benefits of system-level virtualization for HPC, a presentation of research efforts based on virtualization for system availability, and a presentation of research efforts for the management of virtual systems. The basis for this document was the material presented by Stephen L. Scott at the Collaborative and Grid Computing Technologies meeting held in Cancun, Mexico, on April 12–14, 2007.

Section snippets

Introduction to system-level virtualization

System-level virtualization is used for a number of reasons, but the three major justifications are [1], [2], [3]: (i) isolation, (ii) consolidation, and (iii) migration. We describe these points in the following sections after a brief description of the terminology used in this article.

Terminology. The execution of a virtual machine (VM) implies that one or more virtual systems are running concurrently on top of the same hardware, each having its own view of available resources. The operating

Why system-level virtualization for high-performance computing?

Today, high-performance computing (HPC) centers need to support multiple execution platforms. For example, at Oak Ridge National Laboratory (ORNL), massively parallel processing (MPP) platforms, such as the Cray XT4, and Beowulf-type clusters, like the ORNL Institutional Cluster (OIC), are available to users. Each of these systems targets a specific OS, requiring users to port their application before execution. On the other hand, a user’s requirements may also differ. For instance, some users

System-level virtualization and system availability

Modern high-performance computing platforms are composed of thousands or even hundreds of thousands of nodes. Because each node can fail, the availability of the system as a whole decreases rapidly as the node count grows. Therefore, applications for this environment must be fault tolerant or resilient, i.e., able to operate successfully in the face of failure, and the systems themselves should exhibit high-availability traits.
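This scaling effect can be made concrete with a back-of-envelope model. The sketch below assumes independent, identically distributed node failures (an illustrative simplification, not an analysis from the article): if a single node is up with probability a, a job that needs all n nodes simultaneously sees availability a^n.

```python
# Back-of-envelope sketch of whole-system availability under the
# (simplifying) assumption of independent node failures.

def system_availability(node_availability: float, n_nodes: int) -> float:
    """Probability that all n nodes are simultaneously available."""
    return node_availability ** n_nodes

# A node that is up 99.9% of the time looks excellent in isolation,
# but a job spanning 100,000 such nodes rarely finds them all up at once.
for n in (1, 1_000, 100_000):
    print(f"{n:>7} nodes: {system_availability(0.999, n):.6f}")
```

Even with nodes that are individually up 99.9% of the time, a 100,000-node job almost never finds the full machine available, which is why fault tolerance must be addressed at the system level rather than per node.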

System-level virtualization provides three interesting capabilities

System-level virtualization and system management

The usage of VMs creates several challenges, including: (i) how can we support multiple virtualization solutions? (ii) how can we easily manage both the host OS and the VMs? and (iii) is it possible to abstract the complexity of virtualization?

Several recent studies address these issues; they led to the implementation of OSCAR-V [9], an extension for the management of VMs using the OSCAR system installation/management suite. It integrates several prototypes developed by
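To illustrate the kind of abstraction challenge (iii) calls for, the sketch below hides the choice of virtualization solution behind a common interface, so that a management layer can treat Xen, QEMU/KVM, and other back ends uniformly. All class and method names here are hypothetical; this is not OSCAR-V's actual API, only a minimal illustration of the design idea.

```python
# Illustrative sketch (hypothetical API, not OSCAR-V): one management
# interface over multiple virtualization back ends.

from abc import ABC, abstractmethod


class VMBackend(ABC):
    """Common interface every virtualization solution must implement."""

    @abstractmethod
    def create(self, name: str, memory_mb: int) -> str: ...

    @abstractmethod
    def destroy(self, name: str) -> None: ...


class XenBackend(VMBackend):
    def create(self, name: str, memory_mb: int) -> str:
        # Real code would talk to the Xen toolstack here.
        return f"xen:{name}:{memory_mb}MB"

    def destroy(self, name: str) -> None:
        pass


class QemuBackend(VMBackend):
    def create(self, name: str, memory_mb: int) -> str:
        # Real code would launch a QEMU process here.
        return f"qemu:{name}:{memory_mb}MB"

    def destroy(self, name: str) -> None:
        pass


def provision(backend: VMBackend, names: list[str], memory_mb: int = 1024) -> list[str]:
    """Management layer: the same call path regardless of back end."""
    return [backend.create(n, memory_mb) for n in names]
```

The management tooling depends only on `VMBackend`, so adding support for a new virtualization solution means writing one new subclass rather than touching the installation and management logic.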

Conclusion

System-level virtualization provides several advantages for HPC that may change the way modern HPC systems are used: plug-and-play computing, system environment customization, computing on demand, and transparent application resilience through system-provided fault tolerance.

However, the usage of virtual machines also creates several challenges including: (i) the development of a virtualization solution suitable for HPC, (ii) the development of tools and methods for the


References (9)

  • J. Liu et al., High performance VMM-bypass I/O in virtual machines
  • P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of virtualization
  • C. Clark, K. Fraser, S. Hand, J.G. Hansen, E. Jul, C. Limpach, I. Pratt, A. Warfield, Live migration of virtual machines
  • A. Whitaker, M. Shaw, S.D. Gribble, Denali: Lightweight virtual machines for distributed and networked applications


Stephen L. Scott is a Senior Research Scientist in the Computer Science Group of the Computer Science and Mathematics Division at Oak Ridge National Laboratory (ORNL), Oak Ridge, USA. Dr. Scott’s research interest is in experimental systems with a focus on high-performance distributed, heterogeneous, and parallel computing. He is a founding member of the Open Cluster Group (OCG) and of Open Source Cluster Application Resources (OSCAR). Within this organization, he has served as the OCG steering committee chair, as the OSCAR release manager, and as working group chair. Dr. Scott is the lead principal investigator for the Modular Linux and Adaptive Runtime support for HEC OS/R (MOLAR) research team. This multi-institution research effort, funded by the Department of Energy Office of Science, concentrates on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing (HEC) as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS). Dr. Scott is also principal investigator of a project investigating techniques in virtualized system environments for petascale computing and is involved with a related effort investigating the advantages of storage virtualization in petascale computing environments. Dr. Scott chairs the international Scientific Advisory Committee for the European Commission’s XtreemOS project. Stephen has published numerous papers on cluster and distributed computing and holds both a Ph.D. and an M.S. in computer science. He is also a member of the ACM, the IEEE Computer Society, and the IEEE Task Force on Cluster Computing.

Geoffroy Vallée is an R&D Associate in the Network and Cluster Computing Group of the Computer Science and Mathematics Division of Oak Ridge National Laboratory (ORNL), USA. He received his Ph.D. from the University of Rennes, France, in the framework of a French industrial collaboration between the University of Rennes, INRIA, and EDF. Geoffroy completed his master’s degree at the University of Saint-Quentin-en-Yvelines, France. His research interests include operating systems and high availability. Geoffroy is one of the initial developers and designers of the Kerrighed Single System Image (http://www.kerrighed.org/) for clusters. He is also one of the core team members of the Open Source Cluster Application Resources (OSCAR) software (http://www.openclustergroup.org/). Geoffroy is currently doing research on operating systems for petascale computing, focusing on high performance and high availability.

Thomas Naughton is an R&D associate working in the area of high-performance system software. He has been involved in the Open Source Cluster Application Resources (OSCAR) project for several years, serving as a developer, working group chair, and co-chair of the annual OSCAR Symposium. His current efforts are focused on system-level virtualization and system resilience. Prior to starting at Oak Ridge National Laboratory, Thomas received an M.S. degree in Computer Science from Middle Tennessee State University and a B.S. in Computer Science and a B.A. in Philosophy from the University of Tennessee at Martin. He is currently pursuing a Ph.D. at the University of Reading, England.

Anand Tikotekar is currently working as a Post Master’s research associate at Oak Ridge National Laboratory. His research interests include fault-tolerant computing, OS-level virtualization, and cluster survivability. He received his Master’s in Computer Science from Louisiana Tech University and his B.E. in Computer Science from Pune University, India.

Christian Engelmann is an R&D Staff Member at ORNL. He holds an M.Sc. from the University of Reading and a German engineering diploma from the Technical College for Engineering and Economics (FHTW) Berlin. As part of his research at ORNL, Christian is currently pursuing a Ph.D. at the University of Reading. His research aims at high-level reliability, availability, and serviceability for next-generation supercomputers, improving their resiliency (and ultimately efficiency) with novel high-availability and fault-tolerance system software solutions. Another research area concentrates on “plug-and-play” supercomputing, where transparent portability eliminates most of the software modifications caused by diverse platforms and system upgrades. His past research included a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, the successor of the Parallel Virtual Machine (PVM), and a new generation of superscalable scientific algorithms that address the challenges in scalability and fault tolerance for extreme-scale supercomputers.

Hong Ong is a research staff member in the Computer Science and Mathematics Division at Oak Ridge National Laboratory (ORNL). He earned his Ph.D. from the University of Portsmouth, UK, in 2004, under the supervision of Professor Mark Baker. His research interests are in the areas of operating systems, middleware for parallel and distributed systems, and system-level performance evaluation. Hong has in-depth working knowledge of technologies for clusters and the Grid. Prior to joining ORNL, he worked on machine evaluation, studying the factors that affect the performance of large-scale scientific applications and analyzing the interaction between network protocols and applications. He has additionally worked on a number of grid-related projects, including the UK e-Science OGSA Testbed project and work evaluating security and firewall issues. Hong currently focuses on several Department of Energy (DOE) Office of Science projects, including scalable operating systems, system virtualization, and dependability middleware. Hong also serves on various program committees of international conferences.

Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, which is managed by UT-Battelle, LLC for the US Department of Energy under Contract No. DE-AC05-00OR22725.
