A GridWay-based autonomic network-aware metascheduler

https://doi.org/10.1016/j.future.2011.08.019

Abstract

One of the key motivations of computational and data grids is the ability to make coordinated use of heterogeneous computing resources which are geographically dispersed. Consequently, the performance of the network linking all the resources present in a grid has a significant impact on the performance of an application. It is therefore essential to consider network characteristics when carrying out tasks such as scheduling, migration or monitoring of jobs. This work focuses on an implementation of an autonomic network-aware meta-scheduling architecture that is capable of adapting its behavior to the current status of the environment, so that jobs can be efficiently mapped to computing resources. The implementation extends the widely used GridWay meta-scheduler and relies on exponential smoothing to predict the execution and transfer times of jobs. An autonomic control loop (which takes account of CPU use and network capability) is used to alter job admission and resource selection criteria to improve overall job completion times and throughput. The implementation has been tested using a real testbed involving heterogeneous computing resources distributed across different national organizations.

Highlights

► The proposed metascheduler maps jobs to resources by adapting itself to system status.
► Predictions on use of resources are computed and tuned using exponential smoothing.
► The term Tolerance reflects how reliable the predictions for each resource are.
► The framework is implemented as an extension to the GridWay metascheduler.
► Several workloads have been used to illustrate the usefulness of the approach.

Introduction

Computational and data grids allow the coordinated use of heterogeneous computing resources within large-scale parallel applications in science, engineering and commerce [1]. Since organizations sharing their resources in such a context still keep their independence and autonomy [2], grids are highly variable systems in which resources may join/leave the system at any time. This variability makes Quality of Service (QoS) highly desirable, though often very difficult to achieve in practice. One reason for this limitation is the lack of a central entity that orchestrates the entire system. This is especially true in the case of the network that connects the various components of a grid system.

Achieving end-to-end QoS is often difficult, because without resource reservation QoS guarantees are hard to provide. Furthermore, in a real grid system, reservations may not always be feasible, since not all Local Resource Management Systems (LRMS) permit them. Other resource properties, such as bandwidth, lack a global management entity, which makes their reservation impossible.

However, for applications that need a timely response (e.g., distributed engine diagnostics [3] or collaborative visualization [4]), the grid must provide users with some assurance about the use of resources, a non-trivial subject when viewed in the context of network QoS. In a grid, entities communicate with each other over an interconnection network, so the network plays an essential role in grid systems [5].

In a previous contribution [6], the authors proposed an autonomic network-aware grid meta-scheduling architecture as a possible solution. This architecture takes into account the status of the system in order to make meta-scheduling decisions, paying special attention to network capability. It is a modular architecture in which each module works independently of the others, so it can be adapted easily to new requirements. In this paper, the aforementioned architecture has been implemented as an extension to the GridWay meta-scheduler [7], with case studies and performance results provided to demonstrate how it can be used. A scheduling technique that makes use of Exponential Smoothing (ES) [8] to calculate predictions on the completion times of jobs is also provided. Thus, the main contributions of this paper are: (1) an implementation of an architecture to perform autonomic network-aware meta-scheduling based on the widely used GridWay system; (2) a scheduling technique that relies on ES to predict the completion times of jobs; (3) a performance evaluation carried out using a testbed involving workloads and heterogeneous resources from several organizations.

The structure of the paper is as follows: Section 2 reviews existing approaches for supporting QoS in grids, along with grid meta-schedulers. Section 3 discusses a scenario in which an autonomic scheduler can be used and harnessed. Section 4 contains details about the implementation based on an extension to the GridWay meta-scheduler. Section 5 presents a performance evaluation of our approach, with conclusions and suggestions for further work identified in Section 6.

Related work

The provision of QoS in a grid system has been explored by a number of research projects, such as GARA [5], G-QoSM [9], GNRB [10], [11], [12], among others. The proposals which provide scheduling of users’ jobs to computing resources are GARA and G-QoSM, and the schedulers used are DSRT [13] and PBS [14] in GARA, whilst G-QoSM uses DSRT. These schedulers (DSRT and PBS) only pay attention to the load of the computing resource, thus a powerful unloaded computing resource with an overloaded

Autonomic Network-aware Meta-scheduling (ANM)

The availability of resources within a grid environment may vary over time—some resources may fail whereas others may join or leave the system at any time. Additionally, each grid resource must execute a workload that combines locally generated tasks with those that have been submitted from external (remote) user applications. Hence, each new task influences the execution of existing applications, requiring a resource selection strategy that can account for this dynamism within the system. Our

Implementation of ANM

The GNB has been implemented as an extension to the GridWay meta-scheduler. To achieve this, it was first necessary to make GridWay network-aware and to perform the adaptations needed to develop a scalable solution suitable for real grid environments. Details of how this was done, along with a description of the predictions of network and CPU performance, are provided in this section.
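
The idea of predicting job times with exponential smoothing can be illustrated with a minimal sketch. It assumes simple (single-parameter) exponential smoothing; the text shown here does not say which ES variant, smoothing factor, or initialisation the implementation uses, so the function name `smoothed_forecast`, the parameter `alpha`, and the sample values are all hypothetical and are not part of GridWay.

```python
from typing import Iterable, Optional


def smoothed_forecast(observations: Iterable[float], alpha: float = 0.5) -> Optional[float]:
    """One-step-ahead forecast by simple exponential smoothing.

    Illustrative only: alpha and the seeding rule are assumptions,
    not values taken from the paper.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    forecast: Optional[float] = None
    for value in observations:
        if forecast is None:
            # Seed the forecast with the first observation.
            forecast = float(value)
        else:
            # Blend the newest observation with the previous forecast.
            forecast = alpha * value + (1.0 - alpha) * forecast
    return forecast


# Hypothetical completion times (in seconds) of past jobs on one resource.
past_times = [100.0, 120.0, 80.0]
print(smoothed_forecast(past_times, alpha=0.5))  # prints 95.0
```

A larger `alpha` makes the forecast follow recent observations more closely, while a smaller one smooths out short-lived fluctuations in resource or network load.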

Experiments and results

This section describes the experiments conducted to test the usefulness of this work, along with the results obtained.

Conclusions and future work

This paper presents a working implementation of an architecture which combines concepts from grid scheduling with autonomic computing, in order to provide users with a more adaptive job management system. The architecture involves consideration of the status of the network when reacting to changes in the system, taking into account the load on computing resources and the network links when making a scheduling decision. This architecture was originally presented and tested by means of simulations

Acknowledgments

This work was supported by the Spanish MEC and MICINN, as well as the European Commission FEDER funds, under Grants “CSD2006-00046” and “TIN2009-14475-C04”. It was also partly supported by the JCCM under Grants “PBI08-0055-2800” and “PII1C09-0101-9476”.

Luis Tomas is a Ph.D. student of Computer Science at the University of Castilla-La Mancha, Spain. Luis received his B.E. and M.E. degrees in Computer Science from the University of Castilla-La Mancha (Albacete, Spain) in 2007 and 2009. He has been working with resource management and job scheduling for Grid environments since 2007. Luis’ current research effort is on efficient meta-scheduling in advance in Grids to provide QoS to users.

References (50)

  • M. Dobber et al., A prediction method for job runtimes on shared processors: survey, statistical analysis and new avenues, Performance Evaluation (2007)
  • C. Vázquez et al., Federation of TeraGrid, EGEE and OSG infrastructures through a metascheduler, Future Generation Computer Systems (2010)
  • I. Foster et al., The Grid 2: Blueprint for a New Computing Infrastructure (2003)
  • J. Austin et al., Predictive maintenance: distributed aircraft engine diagnostics
  • F.T. Marchese, N. Brajkovska, Fostering asynchronous collaborative visualization, in: Proc. of the 11th Intl....
  • A. Caminero et al., Performance evaluation of an autonomic network-aware metascheduler for Grids, Concurrency and Computation: Practice and Experience (2009)
  • P.S. Kalekar, Time series forecasting using Holt–Winters exponential smoothing, Tech. Rep., Kanwal Rekhi School of...
  • R.A. Ali et al., A model for quality-of-service provision in service oriented architectures, International Journal of Grid and Utility Computing (2005)
  • D. Adami et al., Design and implementation of a Grid network-aware resource broker, in: Proc. of the Intl. Conference...
  • D. Guan, Z. Cai, Z. Kong, Provision and analysis of QoS for distributed Grid applications, in: Proc. of the 5th Intl....
  • H.-H. Chu, K. Nahrstedt, CPU service classes for multimedia applications, in: Proc. of Intl. Conference on Multimedia...
  • G. Mateescu, Extending the portable batch system with preemptive job scheduling, in: SC2000: High Performance...
  • K. Kurowski et al., Dynamic Grid scheduling with job migration and rescheduling in the GridLab resource management system, Scientific Programming (2004)
  • X. Wei, Z. Ding, S. Yuan, C. Hou, H. Li, CSF4: a WSRF compliant meta-scheduler, in: Proc. of the Intl. Conference on...
  • O. Waldrich, P. Wieder, W. Ziegler, A meta-scheduling service for co-allocating arbitrary types of resources, in: Proc....

    Agustin Caminero is an Assistant Professor of Computer Science in the Dept. of Communication and Control Systems at The National University of Distance Education (Madrid, Spain). He obtained a Ph.D. degree in Computer Science with European Mention from the University of Castilla-La Mancha (Albacete, Spain) in 2009. His interests include meta-scheduling in Grids and Clouds, Quality of Service (QoS), simulation, and e-learning.

    Omer Rana is a Professor of Computer Science at Cardiff University, and the Deputy Director of the Welsh eScience Center. He holds a Ph.D. degree in “Parallel Architectures and Neural Computing” from Imperial College (London University).

    Carmen Carrion is an Associate Professor of Computer Architecture and Technology at the Computing Systems Department at the University of Castilla-La Mancha. She holds a Ph.D. Degree in Physics from the University of Cantabria, and her interests include architecture of interconnecting devices, meta-scheduling and QoS in Grids and Clouds.

    Blanca Caminero is an Associate Professor of Computer Architecture and Technology at the Computing Systems Department (University of Castilla-La Mancha). She holds a Ph.D. Degree in Computer Science, and her current research interests are QoS support and metascheduling in Grids and Clouds.
