Sm@rtConfig: A context-aware runtime and tuning system using an aspect-oriented approach for data intensive engineering applications

https://doi.org/10.1016/j.conengprac.2012.10.001

Abstract

Distributing the workload across all available Processing Units (PUs) of a high-performance heterogeneous platform (e.g., PCs composed of CPUs and GPUs) is a challenging task, since the execution cost of a task on distinct PUs is non-deterministic and affected by parameters not known a priori. This paper presents Sm@rtConfig, a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of scheduling tasks on CPU–GPU platforms. Using Model-Driven Engineering and Aspect-Oriented Software Development, a high-level specification and implementation of Sm@rtConfig has been created, aiming at improving modularization and reuse in different applications. As a case study, the simulation subsystem of a CFD application has been developed using the proposed approach. This system's tasks were designed considering only their functional concerns, whereas scheduling and other non-functional concerns are handled by Sm@rtConfig aspects, improving task modularity. Although Sm@rtConfig supports multiple PUs, in this case study the tasks were scheduled to execute on a platform composed of one CPU and one GPU. Experimental results show an overall performance gain of 21.77% in comparison to the static assignment of all tasks to the GPU only.

Introduction

High-performance platforms are commonly required for scientific and engineering algorithms to deal appropriately with timing constraints. Desktop-based co-processors, such as many-core Graphics Processing Units (GPUs), have become a cost-effective alternative execution platform for improving performance. As an example, Nvidia's GTX285 GPU provides a peak performance of 1062 Gflop/s for single-precision and 89 Gflop/s for double-precision floating-point operations (Nvidia, 2010).

As a consequence, heterogeneous platforms with several types of PUs act in essence as powerful asymmetric multi-core clusters and can handle multiple applications and tasks. This is further intensified by multi-core CPUs, such as the Intel Core2Quad, which provides around 100 Gflop/s (Intel, 2010). Therefore, efficiently using all available resources of the PUs is a significant challenge when programming applications.

Another challenge is the design of applications that use such a heterogeneous platform. On one hand, distinct PUs may require a specific programming technology, which, on the other hand, may not be supported by all available PUs. To transform the same application's source code into binary code for multiple PUs, an experienced specialist may be required, in addition to having development tools available. Virtualization is an example of a technique that solves the compatibility problem by adding a layer (i.e., a virtual machine) between the application binary code and the real PU, e.g., the Java Virtual Machine. However, this solution leads to a performance penalty, which sometimes cannot be accepted by the target application due to its constraints. Such a situation demands new techniques to raise the abstraction level of the design of such applications.

Using high-level representations of the application's structure and behavior allows the application's requirements to be refined until its native implementation is achieved in the technology (or technologies) supported by the different target PUs. Model-Driven Engineering (MDE) approaches advocate that designers should use models instead of source code as the main artifact of the design. The system implementation is generated automatically from these models. In this sense, Aspect-Oriented Model-Driven Engineering for Real-Time (AMoDE-RT) systems (Wehrmeister, Freitas, Wagner, & Pereira, 2007) is an MDE approach that uses the Unified Modeling Language (UML), along with its MARTE profile, for specifying real-time systems, together with concepts of the Aspect-Oriented Software Development (AOSD) approach. Modularization and reuse are the main goals of the AMoDE-RT approach, since implementations for distinct execution platforms can be obtained from the same application's UML model.

This work extends AMoDE-RT, more specifically its aspects framework, by including a modular and reusable runtime system for dynamic scheduling and tuning. To take full advantage of the available computing power, a strategy for distributing the application tasks over the available PUs is important. The strategy relies on dynamic scheduling, instead of the static scheduling used by OpenCL (Khronos, 2010) or, more specifically, by CUDA (Nvidia, 2010) for Nvidia GPUs (see also Göddeke et al., 2009). This need becomes even more essential when dealing with desktop applications with timing constraints, such as the real-time 3D Computational Fluid Dynamics (CFD) system used as the case study; such systems are applied in several complex engineering applications, such as the design of modern cars or airplanes.

The task scheduling problem is considered NP-complete (Garey & Johnson, 1990), and several heuristics have been developed to achieve a good schedule with little overhead, for example the distinct approaches used by Topcuoglu, Hariri, and Wu (2002), Ahmadinia, Bobda, Koch, Majer, and Teich (2004), and Götz and Dittmann (2006) for heterogeneous PUs. However, only very recently have some techniques started to be applied directly to platforms containing a CPU (possibly multi-core) and multiple GPUs. This paper additionally presents a new strategy to distribute the workload over the CPU and the GPU, which is sufficiently generic to consider other PUs coupled in a desktop. The dynamic scheduling method is oriented towards a set of high-level tasks, such as algorithms. It combines a first assignment phase – based on a pre-processing benchmark for acquiring initial task performance samples – with a runtime phase that obtains real performance measurements of tasks and feeds a performance database. This way, after the first assignment, the system considers the history stored in the database to perform further assignments for every task, maximizing the application's performance with little overhead.

In this work, three iterative solvers for Systems of Linear Equations (SLEs) – Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient – are used by the CFD application and represent the high-level tasks for the scheduling strategy. The solvers have different implementations for the CPU and the GPU (using shared memory and with memory coalescing), as presented in previous work (Binotto et al., 2010).
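To make the notion of a high-level task concrete, the following is a minimal pure-Python sketch of the Jacobi iteration, the simplest of the three solvers named above. It is a textbook illustration, not the authors' CPU or GPU implementation; the function name `jacobi` and its parameters are assumptions made for this example.

```python
def jacobi(a, b, iterations=100, tol=1e-10):
    """Approximate the solution of a*x = b with Jacobi iteration.

    `a` is a square matrix given as a list of rows, `b` the right-hand side.
    Convergence is guaranteed, for example, when `a` is strictly
    diagonally dominant.
    """
    n = len(b)
    x = [0.0] * n  # initial guess
    for _ in range(iterations):
        # Each component is computed from the previous iterate only,
        # so all n updates are independent of one another.
        x_new = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop once successive iterates differ by less than `tol`.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x


# Small diagonally dominant system: 4x + y = 1, 2x + 3y = 2.
solution = jacobi([[4.0, 1.0], [2.0, 3.0]], [1.0, 2.0])
```

Because every component update within one sweep depends only on the previous iterate, the sweep is data-parallel, which is what makes this kind of solver a natural candidate for both CPU and GPU implementations.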

It is important to mention that, although the GPU is more powerful for those kinds of data-intensive tasks, there are many scenarios in which the CPU provides better performance, e.g., when working with multiple applications and tasks with different problem size domains (based on the amount of data to be processed, which is not known before application execution). The paper presents an example of such a scenario. In a CFD application, a gain of 21.77% in comparison to the static assignment of all tasks to the GPU is achieved, while the scheduling error remains negligible.

The main contribution of this paper is the Sm@rtConfig framework, which provides a dynamic scheduling method for a desktop CPU–GPU platform that is composed of:

  • (i) a first assignment phase,

  • (ii) a runtime profiler that feeds a timing performance database, and

  • (iii) the runtime assignment phase that performs assignments based on the performance history stored in the database.
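The interplay of these three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `PerformanceDB` and `Scheduler`, the callback signatures, and the use of a plain mean of past timings as the cost estimate are all hypothetical and are not taken from Sm@rtConfig, whose actual cost model is not described in this excerpt.

```python
import time
from collections import defaultdict
from statistics import mean


class PerformanceDB:
    """Timing history per (task, PU) pair (hypothetical structure)."""

    def __init__(self):
        self._samples = defaultdict(list)  # (task, pu) -> [seconds, ...]

    def record(self, task, pu, seconds):
        self._samples[(task, pu)].append(seconds)

    def estimate(self, task, pu):
        # Assumption: the estimate is simply the mean of all past samples.
        samples = self._samples.get((task, pu))
        return mean(samples) if samples else None


class Scheduler:
    def __init__(self, pus, db):
        self.pus = list(pus)
        self.db = db

    def first_assignment(self, task, benchmark):
        """Phase (i): seed the database from a pre-processing benchmark."""
        for pu in self.pus:
            self.db.record(task, pu, benchmark(task, pu))
        return self.assign(task)

    def assign(self, task):
        """Phase (iii): pick the PU with the lowest estimated time."""
        estimates = {pu: self.db.estimate(task, pu) for pu in self.pus}
        known = {pu: t for pu, t in estimates.items() if t is not None}
        if not known:
            raise LookupError(f"no performance samples for task {task!r}")
        return min(known, key=known.get)

    def run(self, task, implementations):
        """Phase (ii): execute on the chosen PU and profile the run."""
        pu = self.assign(task)
        start = time.perf_counter()
        result = implementations[pu]()
        self.db.record(task, pu, time.perf_counter() - start)
        return pu, result


db = PerformanceDB()
scheduler = Scheduler(["cpu", "gpu"], db)
# Made-up benchmark timings in seconds, for illustration only.
fake_benchmark = {("jacobi", "cpu"): 0.5, ("jacobi", "gpu"): 0.2}
first = scheduler.first_assignment("jacobi", lambda t, p: fake_benchmark[(t, p)])
```

One limitation of this simplified version is that a PU which is never selected never receives fresh measurements, so its estimate can become stale; a practical system would need some policy for refreshing such entries.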

The paper is structured as follows: Section 2 presents the design approach based on aspect orientation used for the application specification and implementation phases. Next, Section 3 describes the new scheduling strategies that implement the aspects for one CPU and one GPU, and their generalization to multiple PUs. Section 4 presents the real-time CFD application used as the case study. Its requirements and specification are discussed along with the experimental results obtained from a performance analysis on the CPU–GPU platform. Related work, focused on both design methods and scheduling on distributed platforms that use the GPU, is then presented in Section 5. Finally, conclusions and future research are addressed in Section 6.

Section snippets

Overview

Aspect-Oriented Model-Driven Engineering for Real-Time systems – AMoDE-RT (Wehrmeister et al., 2007) – allows a smooth transition from initial specification phases to implementation phases of the design of real-time systems. Using MDE techniques combined with AOSD concepts, AMoDE-RT increases the abstraction level during design to address the increasing complexity of real-time systems. Fig. 1 shows an overview of AMoDE-RT.

The first step in AMoDE-RT is gathering requirements and constraints of

Runtime scheduling and tuning system

In a broad sense, the scheduling strategy aims to automatically assign Units of Allocation (UAs) over a CPU–co-processor execution platform. The term UA was defined generically since the proposed framework is intended to deal with different granularities (the granularity is designed to change in accordance with the platform to be used) and different types of decomposition (task or data decomposition, according to application characteristics). However, in the context of this paper, a UA

Overview of system's requirements

As a motivation for designing an asymmetric CPU–GPU platform approach, a Computational Fluid Dynamics application is briefly described. For this application, large computations are needed to solve the velocity field and local pressure for objects like planes and cars. Clearly, both computation time and performance need to be optimized, while several instances of varying geometries for the objects are evaluated. In industrial prototyping, commonly default flow simulation is used, while in later

Related work

Aspect-oriented software development. Applying AOSD in software development for “traditional” information systems has led to important improvements in productivity and complexity management, mainly due to the separation and modularization of crosscutting concerns, which lead to improved reuse of previously developed components. In order to obtain the same benefits, engineers and researchers of the embedded and real-time systems communities are increasingly using AOSD's concepts in their

Conclusions and future work

A context-aware runtime and tuning system, named Sm@rtConfig, was presented based on a compromise between reducing the execution time of applications due to appropriate dynamic scheduling and the cost of computing such scheduling applied on a platform composed of CPU and GPUs. The system is integrated into the AMoDE-RT approach, by means of including two new aspects in DERAF and also their implementation in the code generation scripts used by the GenERTiCA tool towards Sm@rtConfig. This way,

Acknowledgments

We would like to thank the reviewers for detailed suggestions and comments. Alécio Binotto thanks the support given by DAAD fellowship no. A/07/70158, Programme Alβan scholarship no. E07D402961BR, and CNPq scholarship no. 150860/2011-0. Marco Wehrmeister is grateful to CNPq (Brazilian National Council for Scientific and Technological Development) for the grant no. 480321/2011-6.

References (35)

  • P. Arpaia et al. An aspect-oriented programming-based approach to software development for fault detection in measurement systems. Computer Standards & Interfaces (2010)
  • A. Gokhale et al. Model driven middleware: A new paradigm for deploying and provisioning distributed real-time and embedded applications. Science of Computer Programming (2008)
  • Ahmadinia, A., Bobda, C., Koch, D., Majer, M., & Teich, J. (2004). Task scheduling for heterogeneous reconfigurable...
  • ATI. 2010. ATI stream SDK with OpenCL 〈http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx〉. Stand...
  • Augonnet, C., Thibault, S., Namyst, R., & Wacrenier, P.-A. (2009). StarPU: A unified platform for task scheduling on...
  • R. Barrett et al. Templates for the solution of linear systems: Building blocks for iterative methods (1994)
  • Bell, N., & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In...
  • Binotto, A. P. D., Daniel, C., Weber, D., Kuijper, A., Stork, A., Pereira, C. E., et al. (2010). Iterative sle solvers...
  • A.P.D. Binotto et al. Towards task dynamic reconfiguration over asymmetric computing platforms for UAVs surveillance systems. Scalable Computing: Practice and Experience (2009)
  • Binotto, A. P. D., Pedras, B. M., Goetz, M., Kuijper, A., Pereira, C. E., Stork, A., et al. (2010). Effective dynamic...
  • Binotto, A. P. D., Pereira, C. E., & Fellner, D. W. (2010). Towards dynamic reconfigurable load-balancing for hybrid...
  • Binotto, A. P. D., Pereira, C. E., Kuijper, A., Stork, A., & Fellner, D. (2011). An effective dynamic scheduling...
  • de Freitas, E., Binotto, A. P. D., Pereira, C. E., Stork, A., & Larsson, T. (2008). Dynamic reconfiguration of tasks...
  • Diamos, G. F., & Yalamanchili, S. (2008). Harmony: An execution model and runtime for heterogeneous many core systems....
  • C. Driver et al. Managing embedded systems complexity with aspect-oriented model-driven engineering. ACM Transactions on Embedded Computing Systems (2011)
  • Freitas, E. P., Wehrmeister, M. A., Silva, E., Carvalho, F., Pereira, C., & Wagner, F. (2007). DERAF: A high-level...
  • M.R. Garey et al. Computers and intractability: A guide to the theory of NP-completeness (1990)

Special section on Advanced Software Engineering in Industrial Automation.