Elsevier

Future Generation Computer Systems

Volume 110, September 2020, Pages 422-439

Adding domain data to code profiling tools to debug workflow parallel execution

https://doi.org/10.1016/j.future.2018.05.078

Highlights

  • A database-oriented approach to extract, at runtime, fine-grained performance data associated with workflow information, provenance, and domain-specific data.

  • Development of the PerfMetricEval component to capture performance and resource consumption data using the TAU and SAR profiling tools.

  • Specialization of our provenance data model, named PROV-Df, to include typical data from code-profiling tools.

  • Integration of PerfMetricEval with Chiron, a parallel SWMS.

  • Parallel execution of two workflows in the astronomy and bioinformatics domains, showing the analytical potential of our approach.

Abstract

Computer simulations may be composed of several scientific programs chained in a coherent flow running in High Performance Computing and cloud environments. These runs may present different execution behavior associated with the parallel flow of data among programs. Gathering insight into the parallel flow of data is important for several applications. The usual way of getting insight into code performance is by means of a code profiler. Several parallel code-profiling tools already support performance analysis, such as Tuning and Analysis Utilities (TAU), or provide fine-grained performance statistics, e.g., System Activity Report (SAR). These tools are effective for code profiling, but they are not connected to the concept of IO-intensive workflows. Analyzing the workflow execution with domain and performance data is important for users because they can identify anomalies, choose suitable machines to run their workflows, etc. This type of analysis may be performed by capturing execution data enriched with fine-grained domain data during the long-term run of a computer simulation. In this paper, we propose a monitoring data capture approach as a component that couples code-profiling tools to domain data from workflow executions. The goal is to profile and debug parallel executions of workflows through queries to a database that integrates, at runtime, performance, resource consumption, provenance, and domain data from the flow of simulation programs. We show how querying this database with domain-aware data at runtime allows users to identify performance anomalies not detected by code-profiling tools. We evaluate our approach using the astronomy Montage workflow in a cluster environment and the SciPhy bioinformatics workflow on the Amazon cloud. In both cases, the computing time overhead imposed by our approach for gathering fine-grained domain, performance, and resource consumption data is negligible.

Introduction

Oden et al. [1] define models as mathematical constructions based on physical principles that attempt to characterize abstractions of reality. Computer simulations commonly use highly complex mathematical models to solve problems of relevance in science and engineering. Commonly, these simulations involve the execution of complex programs chained in a coherent flow. Each program in the flow represents a computational model execution with data dependencies on a previous simulation program in this flow. These simulations can be modeled as a scientific workflow (for the sake of simplicity, henceforth called a workflow) [2].

A workflow is an abstraction that defines a set of activities and a dataflow among them [2]. Each activity is associated with a simulation program, which is responsible for consuming an input dataset and producing an output dataset. Many workflows process a large volume of data, requiring the effective use of High Performance Computing (HPC) and High-Throughput Computing (HTC) [3] environments combined with parallelization techniques such as data parallelism or parameter sweep [4]. While HPC environments are commonly employed to perform large amounts of computing, HTC environments usually process applications where the concern is how many tasks can be completed over a long period of time rather than how fast each task completes.

To support the modeling and execution of workflows in HPC/HTC environments, parallel Scientific Workflow Management Systems (SWMS) were developed, such as Swift/T [5], Pegasus [6], ASKALON [7], Triana [8], and Chiron [9], as well as SWMS embedded in science gateways, such as WorkWays [10]. To foster data parallelism, workflow activities can be instantiated as tasks, each processing one input data element; these tasks are known as activations [9]. Each activation executes a specific program or computational service in parallel, consuming a set of parameter values and input data, and producing output data. In addition to managing activations, a parallel SWMS controls the data dependencies among activities. Scheduling, parallel processing, and provenance support are some of the advantages of using a SWMS as opposed to executing workflows with Python scripts or Spark [11].
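As an illustration only, the following minimal Python sketch captures the notion of an activation described above; the class and field names are hypothetical and do not correspond to the API of Chiron or any other SWMS.

```python
# Minimal sketch of the activation concept; names are illustrative only,
# not part of Chiron's or any other SWMS's actual API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Activation:
    """One parallel task: a program invocation over one input data element."""
    activity: str                # workflow activity this task instantiates
    command: str                 # program or service executed in parallel
    parameters: Dict[str, str]   # parameter values consumed by the task
    input_files: List[str]       # input data consumed
    output_files: List[str] = field(default_factory=list)  # data produced


# Data parallelism: one activation per input data element of the same activity.
activations = [
    Activation(activity="mProject",
               command="mProject in_1.fits out_1.fits",
               parameters={"scale": "1.0"},
               input_files=["in_1.fits"]),
]
```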

Workflow execution in these SWMS may consume large amounts of resources, and as workflows become increasingly IO- and compute-intensive (measured by the number of activations or the volume of processed data), they tend to execute for days, or even weeks, depending on the amount of input data and the availability of computational resources. Thus, it is important to be aware of the execution status and to analyze partial results to check whether the current execution complies with pre-defined quality, performance, or financial criteria. Solutions based on offline analysis cannot support this kind of monitoring, since users can only investigate data after the workflow execution ends and are not aware of partial results at runtime. With online analysis [12] and data steering support [13], users may decide whether they have to interfere in the execution (also known as dynamic workflows). Data steering at runtime allows for the analysis of partial results and the identification of failures and problems as early as they happen. Data steering is a common requirement in scientific simulations. In fact, several visualization tools, such as ParaView, already provide plug-ins so that users can analyze partial results using filters and do in situ visualization [14].

Most parallel SWMS have addressed the need for performance and resource consumption monitoring facilities by adding new components to their workflow engines, or by loading data into databases (at runtime or after the workflow execution) to be further queried by users [13]. In 2008, Swift [15] adopted Kickstart [16] to successfully improve workflow monitoring and debugging using execution data combined with provenance data. Kickstart has also been coupled to the Pegasus SWMS [6] to monitor the execution of workflows in HPC environments at runtime. These monitoring mechanisms can provide information such as the execution time of each activity or activation of the workflow, how many activations were executed, and resource usage. This type of information is very useful to help users debug an execution or understand the performance of a workflow in a specific environment. However, this coarse-grained provenance information has limitations for monitoring and debugging. In scientific workflows, particularly IO-intensive ones, data values of interest are inside files, and provenance registers only the inputs and outputs of activities, which in this case would be the file names [16]. This is useful for understanding the data derivation behavior (i.e., the dataflow file path), but the user still has to spend considerable additional effort parsing the files to trace and analyze data elements inside datasets.

In previous papers, we have experienced the benefits of domain-data steering with execution data when running workflows in different domains, such as engineering and biology. For example, in [17], the user was able to analyze partial results of the workflow, allowing for stopping it or changing parameter values before continuing the execution. However, one was not able to check whether some problem, error, or low-quality result was influenced by environment issues such as low memory or bandwidth. This prevented users from fine-tuning parameters to avoid these environment problems. In [18], the user was able to verify that a task was taking longer than average and to identify which parameters and files corresponded to the anomaly while running in the cloud environment. However, the user could not be sure whether the long execution time was related to CPU or IO. This limitation of our previous approach also occurs in Kickstart-based solutions. For example, if the long execution time is due to CPU, it may be consistent with the input data; however, if the problem is IO-related, the user should abort the execution. These scenarios motivated us to improve our runtime steering support with HPC/HTC monitoring and debugging tools, complementing the existing analytical power with code-profiling tools. When users are left with separate data analysis tools, each with its own data, it is difficult to relate them. It does not make sense to reproduce in a SWMS everything that HPC/HTC debugging tools already do. Therefore, a solution that connects the two can benefit from both the SWMS monitoring and debugging capabilities and the code-profiling tools.

In this work, we developed a SWMS component for capturing performance and resource consumption metrics, designed on top of the TAU [19] and SAR [20] tools, which we named PerfMetricEval. Without PerfMetricEval, we would need to browse the generated files to extract and analyze performance data from TAU and SAR log files. That is, we would have to gather those data from log files and develop programs to relate the performance and resource consumption data captured from the logs with domain data extracted from raw data files [16]. In contrast to this manual approach, our component enables online query processing over data obtained from log and raw data files, taking advantage of existing code-profiling tools such as TAU and SAR. In addition, we coupled PerfMetricEval to the Chiron SWMS to provide a relational database that integrates provenance, domain, execution, performance, and resource consumption data. This integrated database enables users to perform their analyses by submitting SQL queries to the RDBMS (a minimal sketch of the capture idea follows the contribution list below). Our contributions are:

  • The development of the PerfMetricEval component to capture performance and resource consumption data using the TAU and SAR profiling tools;

  • A specialization of the PROV-Df provenance database schema to include relevant data from the code-profiling tools; and

  • The integration of PerfMetricEval with an existing SWMS; in this case, the Chiron SWMS.
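As a concrete illustration of the first contribution, the sketch below samples CPU consumption with SAR and stores it, keyed by activation, in a relational table that can be queried at runtime. It is a minimal sketch only: the table layout is a simplified stand-in for the PROV-Df extension described in Section 4, the function name is hypothetical, and the parsing assumes the sysstat column order of `sar -u`.

```python
# Minimal sketch of the capture idea behind PerfMetricEval: sample resource
# consumption with SAR and persist it per activation for runtime SQL queries.
# Requires the sysstat package (the `sar` command) on Linux.
import sqlite3
import subprocess
import time

db = sqlite3.connect("provenance.db")  # simplified stand-in for the PROV-Df database
db.execute("""CREATE TABLE IF NOT EXISTS resource_consumption (
                  activation_id  INTEGER,
                  captured_at    REAL,
                  cpu_user_pct   REAL,
                  cpu_iowait_pct REAL,
                  cpu_idle_pct   REAL)""")


def sample_cpu(activation_id: int) -> None:
    """Take one 1-second CPU sample with `sar -u 1 1` and persist it."""
    out = subprocess.run(["sar", "-u", "1", "1"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        # Average line: "Average: all %user %nice %system %iowait %steal %idle"
        if line.startswith("Average:"):
            f = line.split()
            db.execute("INSERT INTO resource_consumption VALUES (?, ?, ?, ?, ?)",
                       (activation_id, time.time(),
                        float(f[2]), float(f[5]), float(f[7])))
            db.commit()
```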

We first presented this database-oriented approach in a paper at the Workflows in Support of Large-Scale Science (WORKS) 2016 Workshop [21]. In that paper, we presented workflow elapsed time and CPU usage metrics and performed experiments in a cluster environment. In this paper, we describe the components of PerfMetricEval in more depth; in particular, we detail memory consumption and disk operations. We also extend [21] with new experiments that show the performance and analytical power of PerfMetricEval using the Amazon EC2 cloud. These new results show how the gathered data can improve virtual machine provisioning in a cloud environment.

The remainder of this paper is organized as follows. Section 2 presents a motivating example. Section 3 discusses related work. Section 4 describes our approach for performance and resource consumption monitoring; the extended provenance database schema that integrates provenance, domain, performance, and resource consumption data; and the integration of PerfMetricEval with the Chiron SWMS. Section 5 shows the evaluation of the proposed approach using the Montage workflow in a cluster environment and the SciPhy workflow in a cloud environment. Finally, we conclude the paper in Section 6.


Debugging workflows: A motivating example

Workflows usually involve several users with different skills. Broadly speaking, we can consider four types of users in the process:

  • i. Domain specialist;

  • ii. Domain/computer specialist, e.g., bioinformatician;

  • iii. SWMS specialist; and

  • iv. HPC/HTC specialist.

The domain specialist has the knowledge to interpret the results of the simulation. The domain/computer specialist models computer simulations according to the domain requirements and suitability of programs, and is therefore someone who knows how to chain the

Related work

Considering the efficiency and high-quality support of code-profiling tools, this paper presents an approach to integrate these existing tools into the monitoring and debugging support of SWMS, enhanced with domain data from workflow execution. It is far from trivial to monitor and steer the performance and resource consumption associated with domain data during the parallel execution of workflows in HPC and cloud environments [23]. For example, the same workflow specification can be executed in several

The PerfMetricEval component

Several existing tools already support debugging and profiling of HPC scientific applications for computer experts, such as TAU, which instruments the application code to capture performance data and presents these data using a graphical representation. However, domain users may need to analyze performance and resource consumption together with the domain data, as well as to be aware of all data transformations that have occurred during the workflow's parallel execution. In this section, we show
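To illustrate the kind of integrated analysis this enables, the sketch below joins performance and domain data to flag long-running activations dominated by IO wait, the scenario from Section 1. The `activation` and `resource_consumption` tables and their columns are hypothetical stand-ins for the specialized PROV-Df schema, not its actual definition.

```python
# Sketch of an integrated runtime query: relate domain data (input files) to
# resource consumption to find long activations whose time goes to IO wait.
# Table and column names are illustrative, not the actual PROV-Df schema.
import sqlite3

db = sqlite3.connect("provenance.db")
anomalies = db.execute("""
    SELECT a.activation_id, a.input_file, r.cpu_iowait_pct
    FROM activation a
    JOIN resource_consumption r ON r.activation_id = a.activation_id
    WHERE a.elapsed_seconds > 2 * (SELECT AVG(elapsed_seconds) FROM activation)
      AND r.cpu_iowait_pct > 50.0  -- long tasks dominated by IO wait
""").fetchall()
for activation_id, input_file, iowait in anomalies:
    print(f"activation {activation_id} on {input_file}: iowait {iowait:.1f}%")
```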

Experimental evaluation

We use two case studies, with different execution behaviors, to evaluate PerfMetricEval in different environments. The first case study is the well-known Montage scientific workflow from the astronomy domain, executed in a cluster. Montage represents an IO-intensive application [30], where several gigabytes are produced in each execution and stored in several files. On the other hand, we have chosen SciPhy [18] as the second case study since it is

Conclusions

Performing analytical performance queries on workflows in distributed environments is an open, yet important, issue. It is fundamental to follow the status of the workflow execution, especially when workflows execute for weeks or even months. Being aware of bottlenecks, resource consumption, and other performance issues is mandatory. Most SWMS already provide some level of monitoring capability. However, the type of information provided by these monitoring mechanisms is limited to the

Acknowledgments

We thank Kary Ocaña for her help in configuring the SciPhy programs, executing them, and analyzing the experimental results. The authors would also like to thank CNPq, FAPERJ, HPC4E (EU H2020, Brazil Programme and MCTI/RNP-Brazil, grant no. 689772), and Intel for partially funding this work. This research made use of Montage, which is funded by the National Science Foundation (NSF), USA. Leonardo Neves is currently at the Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA.


References (34)

  • J.M. Wozniak, T.G. Armstrong, M. Wilde, D.S. Katz, E. Lusk, I.T. Foster, Swift/T: Large-scale application composition...
  • R. Prodan, S. Ostermann, K. Plankensteiner, Performance analysis of grid applications in the ASKALON environment, in:...
  • I. Taylor et al., The Triana workflow environment: architecture and applications, in: Workflows for e-Science, 2007
  • E. Ogasawara et al., An algebraic approach for data-centric scientific workflows, PVLDB, 2011
  • H.A. Nguyen et al., WorkWays: interacting with scientific workflows, CCPE, 2015
  • M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, in: USENIX...
  • A. Ailamaki, Managing scientific data: lessons, challenges, and opportunities, in: SIGMOD, 2011, pp....
Vítor Silva is a D.Sc. candidate at the Department of Computer Science in the Federal University of Rio de Janeiro (UFRJ). He received a B.Sc. (2013) and an M.Sc. (2014) degree in Computer and Information Engineering from UFRJ. His main interests include workflow management, raw data file analysis, and high performance computing.

Leonardo Neves is a Research Engineer at Snapchat Research. He holds a master’s degree (2016) in Intelligent Information Systems from Carnegie Mellon University and a bachelor’s degree (2015) in Computer Engineering from the Federal University of Rio de Janeiro (UFRJ). He spent a school year at Cornell University and did summer internships at Pivotal and Yelp. His main interests include machine learning, natural language processing, and information retrieval.

Renan Souza is a Computer Science D.Sc. student at the Department of Computer Science in the Federal University of Rio de Janeiro (UFRJ) and a Research Engineer at IBM Research Brazil. He holds a master’s (2015) and a bachelor’s (2013) degree in Computer Science from UFRJ. He spent a school year at Missouri State University and did a summer internship at Stanford University. His main interests include parallel and distributed data processing, high performance computing, and big data management and analytics.

Alvaro L.G.A. Coutinho is the Director of the High Performance Computing Center and a Professor at the Department of Civil Engineering in the Alberto Luiz Coimbra Institute for Graduate Studies and Research in Engineering (COPPE), Federal University of Rio de Janeiro, Brazil. He has coordinated and participated in over 80 industry projects. Recipient of the IBM Faculty Partnership Award, 2001; the Giulio Massarani Academic Award, COPPE, 2007; and the IACM Fellow Award, 2012. Organizer of national and international conferences, training workshops, and short courses. Editorial Advisory Board, International Journal for Numerical Methods in Fluids; Associate Editor, Revista Internacional de Métodos Numéricos para Cálculo y Diseño en Ingeniería. Dissertations directed: 24 Ph.D. and 25 M.Sc.; 92 journal papers; 250 conference papers. Cited in Web of Science, November 11, 2013: total citations 443, h-index 13; Scopus, November 11, 2013: total citations 496, h-index 11; Google Scholar, November 11, 2013: total citations 1145, h-index 19. Recent projects: “Research on Simulation of Geological Processes on High Performance Computers”, Network on Basin Modeling, Brazilian Petroleum Agency and PETROBRAS, 2011–2013, US$ 1,592,649.00; “Hoscar: High Performance Computing and Scientific Data Management Driven by Highly Demanding Applications”, CNPq, INRIA; “Finite Element Simulator for Complex Free-Surface Problems: Extensions and New Engineering Challenges”, PETROBRAS, 2011–2013, US$ 544,026.00; “High Performance Computing Infrastructure for the GRADE-BR Node at COPPE/UFRJ”, Thematic Network on Scientific Computing and Visualization, Brazilian Petroleum Agency and PETROBRAS, 2008–2012, US$ 8,188,623.00.

Daniel de Oliveira has been a Professor at the Institute of Computing of the Fluminense Federal University (UFF) since 2013. He received the Doctor of Science degree from the Federal University of Rio de Janeiro (UFRJ) in 2012. His current research interests include scientific workflows, provenance, cloud computing, high performance computing, raw data analysis, and distributed and parallel databases. He coordinates research projects in those areas, with funding from several Brazilian government agencies, including CNPq and FAPERJ. He has served on several program committees of national and international conferences (VLDB17, SBBD16) and workshops (IPAW16, WORKS16), and is a regular reviewer for several international journals (Transactions on Services Computing, Concurrency and Computation, Journal of Supercomputing). He is a member of IEEE, ACM, and the Brazilian Computer Society. He has published over 50 refereed international journal articles and conference papers.

Marta Mattoso has been a Professor in the Department of Computer Science at the COPPE Institute of the Federal University of Rio de Janeiro (UFRJ) since 1994, where she leads the Distributed Database Research Group. She received the Doctor of Science degree from UFRJ. She has been active in the database research community for more than fifteen years, and her current research interests include distributed and parallel databases and data management aspects of scientific workflows. She is the principal investigator of research projects in those areas, with funding from several Brazilian government agencies, including CNPq, CAPES, FAPERJ, FINEP and INRIA. She has published over 100 refereed international journal articles and conference papers. She is a member of ACM, IEEE, and SBC (Brazilian Computer Society). She has served on program committees of international conferences and is a regular reviewer for several international journals.
