Sm@rtConfig: A context-aware runtime and tuning system using an aspect-oriented approach for data intensive engineering applications

doi:10.1016/j.conengprac.2012.10.001

Control Engineering Practice

Volume 21, Issue 2, February 2013, Pages 204-217

https://doi.org/10.1016/j.conengprac.2012.10.001

Abstract

Distributing the workload across all available Processing Units (PUs) of a high-performance heterogeneous platform (e.g., PCs composed of CPUs and GPUs) is a challenging task, since the execution cost of a task on distinct PUs is non-deterministic and affected by parameters not known a priori. This paper presents Sm@rtConfig, a context-aware runtime and tuning system based on a compromise between reducing the execution time of engineering applications and the cost of scheduling tasks on CPU–GPU platforms. Using Model-Driven Engineering and Aspect-Oriented Software Development, a high-level specification and implementation of Sm@rtConfig has been created, aiming at improving modularization and reuse in different applications. As a case study, the simulation subsystem of a CFD application has been developed using the proposed approach. This system's tasks were designed considering only their functional concerns, whereas scheduling and other non-functional concerns are handled by Sm@rtConfig aspects, improving task modularity. Although Sm@rtConfig supports multiple PUs, in this case study the tasks have been scheduled to execute on a platform composed of one CPU and one GPU. Experimental results show an overall performance gain of 21.77% in comparison to the static assignment of all tasks only to the GPU.

Introduction

High-performance platforms are commonly required for scientific and engineering algorithms to deal appropriately with timing constraints. Desktop-based co-processors, such as many-core Graphics Processing Units (GPUs), have become a cost-effective alternative execution platform for improving performance. As an example, Nvidia has presented its GTX285 GPU, which provides a peak performance of 1062 Gflop/s for single-precision and 89 Gflop/s for double-precision floating-point operations (Nvidia, 2010).

As a consequence, heterogeneous platforms with several types of PUs act in essence as powerful asymmetric multi-core clusters and can handle multiple applications and tasks. This is further intensified by multi-core CPUs, such as the Intel Core2Quad, which provides around 100 Gflop/s (Intel, 2010). Therefore, efficiently using all available resources of the PUs is a significant challenge when programming applications.

Another challenge is the design of applications that use such a heterogeneous platform. On one hand, distinct PUs may each require a specific programming technology, which, on the other hand, may not be supported by all available PUs. Transforming the same application's source code into binary code for multiple PUs may require an experienced specialist in addition to the appropriate development tools. Virtualization is an example of a technique that solves the compatibility problem by adding a layer (i.e., a virtual machine) between the application binary code and the real PU, e.g., the Java Virtual Machine. However, this solution leads to a performance penalty, which sometimes cannot be accepted by the target application due to its constraints. Such a situation demands new techniques to raise the abstraction level of the design of such applications.

Using high-level representations of the application's structure and behavior allows the application's requirements to be refined until its native implementation is achieved in the technology (or technologies) supported by the different target PUs. Model-Driven Engineering (MDE) approaches advocate that designers use models instead of source code as the main artifact of the design. The system implementation is generated automatically from these models. In this sense, Aspect-Oriented Model-Driven Engineering for Real-Time (AMoDE-RT) systems (Wehrmeister, Freitas, Wagner, & Pereira, 2007) is an MDE approach that uses the Unified Modeling Language (UML), along with its MARTE profile for specifying real-time systems, and concepts of the Aspect-Oriented Software Development (AOSD) approach. Modularization and reuse are the main goals of the AMoDE-RT approach, since implementations for distinct execution platforms can be obtained from the same UML model of the application.

This work extends AMoDE-RT, more specifically its aspects framework, by including a modular and reusable runtime system for dynamic scheduling and tuning. In order to take full advantage of the available computing power, a strategy to distribute the application tasks over the available PUs is important. The strategy relies on dynamic scheduling, instead of the static scheduling used by OpenCL (Khronos, 2010) or, more specifically, by CUDA (Nvidia, 2010) for Nvidia GPUs (see also Göddeke et al., 2009). This need becomes even more essential when dealing with desktop applications with timing constraints, like the real-time 3D Computational Fluid Dynamics (CFD) system used as the case study. Such systems are applied in several complex engineering applications, such as the design of modern cars or airplanes.

The task scheduling problem is considered NP-complete (Garey & Johnson, 1990), and several heuristics have been developed to obtain a good schedule with little overhead, for example the distinct approaches used by Topcuoglu, Hariri, and Wu (2002), Ahmadinia, Bobda, Koch, Majer, and Teich (2004), and Götz and Dittmann (2006) for heterogeneous PUs. However, only very recently have some techniques started to be applied directly to platforms containing a CPU (possibly multi-core) and multiple GPUs. This paper additionally presents a new strategy to distribute the workload over the CPU and the GPU, which is sufficiently generic to consider other PUs coupled in a desktop. The dynamic scheduling method is oriented to a set of high-level tasks, such as algorithms. It combines a first assignment phase (based on a pre-processing benchmark that acquires initial performance samples of the tasks) with a runtime phase that obtains real performance measurements of the tasks and feeds a performance database. This way, after the first assignment, the system considers the history stored in the database to perform further assignments for every task, maximizing the application's performance with little overhead.

In this work, three iterative solvers for Systems of Linear Equations (SLEs) – Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient – are used by the CFD application and represent the high-level tasks for the scheduling strategy. The solvers have different implementations for the CPU and the GPU (using shared memory and with memory coalescing), as presented in previous work (Binotto et al., 2010).
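To make the notion of a high-level task concrete, the following is a minimal CPU-only sketch of the Jacobi iteration, one of the three solver families named above. It is not the authors' implementation (which has distinct CPU and GPU versions); the function name `jacobi`, its parameters, and the dense list-of-rows matrix layout are assumptions chosen for illustration.

```python
def jacobi(a, b, iterations=100, tol=1e-10):
    """Approximate the solution of a x = b with the Jacobi iteration.

    Hypothetical helper for illustration: `a` is a square matrix given as a
    list of rows and is assumed to be strictly diagonally dominant, which
    guarantees convergence of the method.
    """
    n = len(b)
    x = [0.0] * n  # initial guess
    for _ in range(iterations):
        # Every component is updated from the previous iterate only, which is
        # why the method maps naturally onto data-parallel hardware.
        new_x = [
            (b[i] - sum(a[i][j] * x[j] for j in range(n) if j != i)) / a[i][i]
            for i in range(n)
        ]
        # Stop once successive iterates differ by less than the tolerance.
        if max(abs(new_x[i] - x[i]) for i in range(n)) < tol:
            return new_x
        x = new_x
    return x
```

Calling `jacobi([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])` approximates the solution of a small diagonally dominant system. In a scheduling setting, one such call (for a given problem size) would be the unit whose execution time is measured on each PU.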

It is important to mention that, although the GPU is more powerful for this kind of data-intensive task, there are many scenarios in which the CPU provides better performance, e.g., when working with multiple applications and tasks with different problem size domains (based on the amount of data to be processed, which is not known before application execution). The paper presents an example of such a scenario. In a CFD application, a gain of 21.77% in comparison to the static assignment of all tasks to the GPU is achieved, while the scheduling error remains negligible.

The main contribution of this paper is the Sm@rtConfig framework, which provides a dynamic scheduling method for a desktop CPU–GPU platform that is composed of:

(i) a first assignment phase,
(ii) a runtime profiler that feeds a timing performance database, and
(iii) the runtime assignment phase that performs assignments based on the performance history stored on the database.
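The three components listed above can be sketched as a small history-based scheduler. This is only an illustration under stated assumptions, not the Sm@rtConfig algorithm itself: the class name `HistoryScheduler`, its methods, and the rule "choose the PU with the lowest mean recorded time" are hypothetical stand-ins for the strategy the paper describes in its Section 3.

```python
from collections import defaultdict
from statistics import mean


class HistoryScheduler:
    """Hypothetical sketch: assign tasks to PUs from measured execution times."""

    def __init__(self, pus):
        self.pus = list(pus)
        # Performance database: (task, pu) -> list of measured times in seconds.
        self.history = defaultdict(list)

    def record(self, task, pu, seconds):
        """Profiler hook (ii): store one runtime measurement in the database."""
        self.history[(task, pu)].append(seconds)

    def first_assignment(self, task, benchmark):
        """Phase (i): seed the database with one benchmark sample per PU.

        `benchmark` is an assumed callable (task, pu) -> seconds.
        """
        for pu in self.pus:
            self.record(task, pu, benchmark(task, pu))
        return self.assign(task)

    def assign(self, task):
        """Phase (iii): pick the PU with the lowest mean recorded time."""
        known = [pu for pu in self.pus if self.history.get((task, pu))]
        if not known:
            raise LookupError(f"no performance samples for task {task!r}")
        return min(known, key=lambda pu: mean(self.history[(task, pu)]))
```

A caller would run `first_assignment` once per task, then call `record` after each execution and `assign` before the next one, so that a PU whose measured times degrade (for instance because another application is using it) gradually loses the task. Averaging the whole history is the simplest possible policy; weighting recent samples more heavily would be a natural variation.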

The paper is structured as follows: Section 2 presents the design approach based on aspect orientation used for the application specification and implementation phase. Section 3 then describes the new scheduling strategies that implement the aspects for one CPU and one GPU, and their generalization for multiple PUs. Section 4 presents the real-time CFD application used as a case study. Its requirements and specification are discussed along with the experimental results obtained from a performance analysis on the CPU–GPU platform. Related work, focused on both design methods and scheduling on distributed platforms using the GPU, is then presented in Section 5. Finally, conclusions and future research are addressed in Section 6.

Section snippets

Overview

Aspect-Oriented Model-Driven Engineering for Real-Time systems – AMoDE-RT (Wehrmeister et al., 2007) – allows a smooth transition from initial specification phases to implementation phases of the design of real-time systems. Using MDE techniques combined with AOSD concepts, AMoDE-RT increases the abstraction level during design to address the increasing complexity of real-time systems. Fig. 1 shows an overview of AMoDE-RT.

The first step in AMoDE-RT is gathering requirements and constraints of

Runtime scheduling and tuning system

In a broad sense, the scheduling strategy aims to automatically assign Units of Allocation (UAs) over a CPU–co-processor execution platform. The term UA was defined generically because the proposed framework is intended to deal with different granularities (the granularity is designed to change according to the platform to be used) and different types of decomposition (task or data decomposition, according to application characteristics). However, in the context of this paper, a UA

Overview of system's requirements

As a motivation for designing an asymmetric CPU–GPU platform approach, a Computational Fluid Dynamics application is briefly described. For this application, large computations are needed to solve the velocity field and local pressure for objects like planes and cars. Clearly, both computation time and performance need to be optimized, while several instances of varying geometries for the objects are evaluated. In industrial prototyping, commonly default flow simulation is used, while in later

Related work

Aspect-oriented software development. Applying AOSD in software development for “traditional” information systems has led to important improvements in productivity and complexity management, mainly due to the separation and modularization of crosscutting concerns, which lead to improved reuse of previously developed components. In order to obtain the same benefits, engineers and researchers of the embedded and real-time systems communities are increasingly using AOSD's concepts in their

Conclusions and future work

A context-aware runtime and tuning system, named Sm@rtConfig, was presented based on a compromise between reducing the execution time of applications due to appropriate dynamic scheduling and the cost of computing such scheduling applied on a platform composed of CPU and GPUs. The system is integrated into the AMoDE-RT approach, by means of including two new aspects in DERAF and also their implementation in the code generation scripts used by the GenERTiCA tool towards Sm@rtConfig. This way,

Acknowledgments

We would like to thank the reviewers for detailed suggestions and comments. Alécio Binotto thanks the support given by DAAD fellowship no. A/07/70158, Programme Alβan scholarship no. E07D402961BR, and CNPq scholarship no. 150860/2011-0. Marco Wehrmeister is grateful to CNPq (Brazilian National Council for Scientific and Technological Development) for the grant no. 480321/2011-6.

References (35)

Arpaia, P., et al. (2010). An aspect-oriented programming-based approach to software development for fault detection in measurement systems. Computer Standards & Interfaces.
Gokhale, A., et al. (2008). Model driven middleware: A new paradigm for deploying and provisioning distributed real-time and embedded applications. Science of Computer Programming.
Ahmadinia, A., Bobda, C., Koch, D., Majer, M., & Teich, J. (2004). Task scheduling for heterogeneous reconfigurable...
ATI. 2010. ATI stream SDK with OpenCL 〈http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx〉. Stand...
Augonnet, C., Thibault, S., Namyst, R., & Wacrenier, P.-A. (2009). StarPU: A unified platform for task scheduling on...
Barrett, R., et al. (1994). Templates for the solution of linear systems: Building blocks for iterative methods.
Bell, N., & Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors. In...
Binotto, A. P. D., Daniel, C., Weber, D., Kuijper, A., Stork, A., Pereira, C. E., et al. (2010). Iterative sle solvers...
Binotto, A. P. D., et al. (2009). Towards task dynamic reconfiguration over asymmetric computing platforms for UAVs surveillance systems. Scalable Computing: Practice and Experience.
Binotto, A. P. D., Pedras, B. M., Goetz, M., Kuijper, A., Pereira, C. E., Stork, A., et al. (2010). Effective dynamic...

Binotto, A. P. D., Pereira, C. E., & Fellner, D. W. (2010). Towards dynamic reconfigurable load-balancing for hybrid...

Binotto, A. P. D., Pereira, C. E., Kuijper, A., Stork, A., & Fellner, D. (2011). An effective dynamic scheduling...

de Freitas, E., Binotto, A. P. D., Pereira, C. E, Stork, A., & Larsson, T. (2008). Dynamic reconfiguration of tasks...

Diamos, G.F., & Yalamanchili, S. (2008). Harmony: An execution model and runtime for heterogeneous many core systems....

Driver, C., et al. (2011). Managing embedded systems complexity with aspect-oriented model-driven engineering. ACM Transactions on Embedded Computing Systems.

Freitas, E. P., Wehrmeister, M. A., Silva, E, Carvalho, F., Pereira, C., & Wagner, F. (2007). DERAF: A high-level...

Garey, M. R., et al. (1990). Computers and intractability: A guide to the theory of NP-completeness.


^☆: Special section on Advanced Software Engineering in Industrial Automation.

