SunwayMR: A distributed parallel computing framework with convenient data-intensive applications programming

https://doi.org/10.1016/j.future.2017.01.018

Highlights

  • We analyze the drawbacks of the existing distributed parallel computing frameworks.

  • We propose a distributed parallel computing framework, SunwayMR.

  • We present the framework design in terms of task scheduling, message communication, etc.

  • SunwayMR outperforms the existing Spark framework in performance.

  • The framework scales well in speed with varying data sizes, node counts and thread counts.

Abstract

Managing server integration to realize a distributed data computing framework is an important concern. Regardless of the underlying architecture and the actual complexity of the distributed system, such a framework gives programmers an abstract view of the system for implementing various data-intensive applications. However, some state-of-the-art frameworks require too many library dependencies and too much parameter configuration, or lack extensibility in application programming. Moreover, the precise design of a general framework is nontrivial work, fraught with challenges in task scheduling, message communication, computing efficiency, etc. To address these problems, we present a general, scalable and programmable parallel computing framework called SunwayMR, which only needs a GCC/G++ environment. We describe it from the following aspects: (1) Distributed data partitioning, message communication and task organization are provided to support transparent application execution on parallel hardware. By searching each node's thread table, a task obtains an idle thread (with a preferred node IP address) for executing a data partition. A novel communication component, SunwayMRHelper, is employed to merge periodical results synchronously. By identifying whether the current node is the master or a slave, SunwayMR handles periodical task results differently. (2) As for optimizations, a simple fault-tolerance mechanism is provided to resume data-parallel applications, and thread-level stringstream is utilized to boost computing. To ensure ease of use, open Application Programming Interface (API) excerpts can be invoked by various applications with less handwritten code than OpenMPI/MPI. We conduct extensive experimental studies to evaluate the performance of SunwayMR over real-world datasets. The results indicate that SunwayMR (running on 16 computational nodes) outperforms Spark in various applications, and scales well with data sizes, node counts and thread counts.

Introduction

In the last several years, the amount of data has grown explosively and its scale has evolved vastly [1], [2], [3]. Organizations have been using data-intensive applications to extract valuable information from the huge datasets they manage. Meanwhile, commercial computational processors offer impressive performance advantages and dominate the general-purpose processor market. They catalyze high performance computing (HPC) and keep pace with increasing computation requirements [4], [5]. In fact, network-connected computing nodes at the server level are critical to modern HPC applications. Nevertheless, the complexity of low-level instruction operations confines the use of computing facilities for processing data. Hence, how to develop a programmable parallel computing framework that benefits the most from general or custom-made computational devices is a critical and promising issue [6], [7], [8], [9].

Numerous researchers have devoted fundamental work to developing parallel computing frameworks and implementation techniques [10], [11], [12], [13], leading to a proliferation of such frameworks for data analytics. We argue that they have inherent limitations: some lack ease of use and flexibility, and some cannot be extended in application programming. As a result, the design of a distributed system remains nontrivial and challenging. Mechanisms for data distribution, job scheduling, information communication, fault tolerance, etc., should be designed in detail for a parallel computing framework.

A good distributed computing framework emerges for analyzing enormous quantities of data. Usually, such frameworks share some key features: (1) Efficient and scalable data processing management. Data analysis is performed on locally stored data, greatly increasing throughput and performance. (2) Easy-to-use programming abstraction. Most parallel computing frameworks provide some degree of abstraction over computing nodes, which benefits programmers by shortening the learning curve. (3) Insensitivity to underlying hardware. Typically, developers need not be aware of the details of the underlying architecture. (4) Fault tolerance for reliability, easy configuration and fewer library dependencies.

Apache Hadoop [14] has been designed for distributed computing, from a single node to a potentially huge number of nodes, and it is resilient to node failures. However, Hadoop may not handle data variety well, since its programming interfaces and associated data processing models are inconvenient and inefficient for handling a variety of data, e.g., structural data and graph data. The key idea of Apache Spark [15], another distributed computing framework, is based on the important concept of immutable Resilient Distributed Datasets (RDDs), with transformation and action operations. Data analytics in Spark is performed via a sequence of RDD transformations, whereas a MapReduce job consists of a map phase and a reduce phase. Compared with Hadoop MapReduce, Spark has a clear performance advantage. Note that, to some extent, both Spark and Hadoop suffer from fussy parameter configuration.

When a new hardware architecture is introduced, many existing distributed frameworks cannot meet the requirement of few library dependencies. For example, it is necessary to install and configure the Java SDK for Spark and Hadoop. Thus, a new overarching system with fewer library dependencies needs to be developed from scratch, so that special environment requirements can be met. For instance, although Sunway processors (provided by the State Key Laboratory of Mathematic Engineering and Advance Computing (MEAC-SKL)) have supercomputing capacity, Spark and Hadoop cannot run on them, because Spark and Hadoop strongly require dependent libraries, such as the Scala SDK or Java SDK. So far, Spark supports several programming languages, including Scala, Java and Python, and both Spark and Hadoop are JVM-based.

Motivated by these observations, the goal of our work is to explore an effective solution for managing computational devices while obtaining high performance. Like Spark and Hadoop, the proposed SunwayMR utilizes devices directly, which lowers the barrier to entry for average users. In this paper, SunwayMR involves a careful architecture design to obtain parallel capacity. The advantage is that SunwayMR copes with the challenge of data variety over multi-structural datasets and offers easier configuration. Our framework makes the deployment of specific data-intensive applications on Infrastructure-as-a-Service platforms lighter and faster. Programmers can use this ongoing framework, which targets applicability and generality, to develop data-intensive applications.

To summarize, the main contributions of our work include:

  • We first present and discuss the framework’s design. Based on the clustering system’s two-level (master–slave) hierarchical architecture, a distributed dataset managing mechanism organizes data into partitions as data computing unit sets (DCUS). More critically, task organization, job/task scheduling and message communication are presented subsequently.

  • We present systematic optimizations: thread-level stringstream to substantially accelerate information communication between nodes, and lightweight fault tolerance to address the reliability problem.

  • We implement SunwayMR, which provides both ease of use and extensibility. Data-intensive applications can be developed quickly by invoking public high-level APIs over the lower layers of the framework, so as to write less low-level code (the sketch after this list illustrates the underlying data-parallel pattern).

  • We conduct extensive empirical studies to evaluate the performance of SunwayMR using various applications and real datasets. Experimental results demonstrate that our solution achieves better efficiency, speedup and execution time, compared with the Spark framework.

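As a concrete illustration of the data-parallel pattern behind these APIs, the following plain C++ sketch maps a per-partition computation onto worker threads and then reduces the partial results, the two steps that the high-level APIs are meant to hide. It does not use SunwayMR's actual interfaces; the partitions and the counting task are made up for the example.

// Plain C++11 sketch of the per-partition map-then-reduce pattern that
// high-level data-parallel APIs typically wrap; it does not call SunwayMR.
#include <iostream>
#include <numeric>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Input already split into logical partitions, one per worker thread.
    std::vector<std::vector<std::string>> partitions = {
        {"spark", "hadoop", "sunwaymr"},
        {"mpi", "openmp"},
        {"rdd", "dcus", "api", "gcc"}};

    std::vector<long> partial(partitions.size(), 0);
    std::vector<std::thread> workers;

    // Map phase: each thread counts the words of its own partition.
    for (std::size_t i = 0; i < partitions.size(); ++i) {
        workers.emplace_back([i, &partitions, &partial] {
            partial[i] = static_cast<long>(partitions[i].size());
        });
    }
    for (auto& w : workers) w.join();

    // Reduce phase: merge the per-partition results on the "master" side.
    long total = std::accumulate(partial.begin(), partial.end(), 0L);
    std::cout << "total words: " << total << std::endl;
    return 0;
}

In a framework such as SunwayMR, the partitioning, thread management and result merging above would be supplied by the runtime, leaving only the per-partition function to be written by the application programmer.
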
SunwayMR is written in C++ in more than 8000 LOC, and the software resources (including Linux shell compilation scripts) are available from a GitHub repository.1 We hope our research can play a guiding role for researchers aiming to build autonomous parallel computing frameworks more quickly.

The rest of the paper is organized as follows. Section 2 contains related work and background. Section 3 provides an overview of the SunwayMR framework. Section 4 introduces some preliminary knowledge. The main design principle is described in Section 5. Some optimizations are discussed in Section 6, and we introduce the framework’s ease of use in Section 7. We evaluate our system in Section 8. Finally, we conclude the paper in Section 9.

Section snippets

Related work and background

In this section, we discuss the key enablers of our study, namely, computational devices, distributed parallel programming techniques and the related performance requirements.

Commoditization of computational accelerators is driving their widespread use.

Currently, it is common practice for large-scale data centers to employ commodity off-the-shelf components to yield a cost-efficient setup. Large-scale data centers (e.g., Google, Amazon’s EC2) follow this commonly accepted approach to

SunwayMR overview

The HPC clustering system for running SunwayMR typically includes several computing nodes, desktops, etc., as depicted in Fig. 1. Parallel machines are networked together through a high-speed network (e.g., GigaNet, InfiniBand). This design forms the foundation of the framework’s running environment. To provide remote access and control of the cluster, the heterogeneous clustering system is organized in a master–slave paradigm, as shown in Fig. 2. The master node
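
To make the scheduling idea from the abstract more concrete, the hypothetical C++ sketch below models a per-node thread table and the lookup that hands a task an idle thread, preferring the node whose IP address is requested (e.g., the node holding the task's data partition). The structures and names (NodeThreads, pickIdleThread) are illustrative assumptions, not SunwayMR's code.

// Hypothetical sketch of a thread-table lookup: prefer an idle thread on the
// requested node for data locality, otherwise take any idle thread.
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct NodeThreads {
    std::string ip;          // node address in the master-slave cluster
    std::vector<bool> busy;  // per-thread busy flags on that node
};

// Returns {node ip, thread index}, or {"", -1} if every thread is busy.
std::pair<std::string, int> pickIdleThread(
        const std::vector<NodeThreads>& table, const std::string& preferredIp) {
    // First pass: honor data locality by checking the preferred node.
    for (const auto& node : table) {
        if (node.ip != preferredIp) continue;
        for (std::size_t t = 0; t < node.busy.size(); ++t)
            if (!node.busy[t]) return {node.ip, static_cast<int>(t)};
    }
    // Second pass: fall back to any idle thread in the cluster.
    for (const auto& node : table) {
        for (std::size_t t = 0; t < node.busy.size(); ++t)
            if (!node.busy[t]) return {node.ip, static_cast<int>(t)};
    }
    return {"", -1};
}

int main() {
    std::vector<NodeThreads> table = {
        {"192.168.0.1", {true, true}},    // e.g., a fully busy node
        {"192.168.0.2", {true, false}}};  // a node with one idle thread
    auto pick = pickIdleThread(table, "192.168.0.2");
    std::cout << pick.first << " thread " << pick.second << std::endl;
    return 0;
}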

Preliminaries

Typically, the sources from which data is loaded for parallel computing can be distributed file systems, shared/local file systems, memory or a data stream. The essential thing is to partition the data logically and to process the data partitions on different physical servers. Therefore, in this section, we first introduce some important preliminaries, i.e., distributed dataset management and the related job and task management, etc., so as to realize collaboration, inter-operability,
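
As a minimal illustration of this logical partitioning step, the sketch below splits a loaded dataset round-robin into a fixed number of partitions that could then be dispatched to different servers or threads. It is a stand-in written for this discussion, not the framework's DCUS implementation.

// Minimal sketch: split a dataset into logical partitions; the partition
// count would normally follow the numbers of nodes and threads.
#include <iostream>
#include <vector>

template <typename T>
std::vector<std::vector<T>> partitionData(const std::vector<T>& data,
                                          std::size_t numPartitions) {
    std::vector<std::vector<T>> parts(numPartitions);
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Round-robin assignment keeps partition sizes balanced.
        parts[i % numPartitions].push_back(data[i]);
    }
    return parts;
}

int main() {
    std::vector<int> data(10);
    for (int i = 0; i < 10; ++i) data[i] = i;
    auto parts = partitionData(data, 3);  // e.g., 3 partitions for 3 workers
    for (std::size_t p = 0; p < parts.size(); ++p)
        std::cout << "partition " << p << " holds "
                  << parts[p].size() << " elements\n";
    return 0;
}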

The main design principle of SunwayMR

The main design of SunwayMR covers several main aspects: the data processing mechanism, coarse-grained and fine-grained parallelism, and the SunwayMRHelper communication component, as explained next.
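
The in-process C++ sketch below models only the control flow suggested by the abstract: a slave queues its periodical partial result for shipment to the master, while the master merges whatever it receives. The real SunwayMRHelper communicates over the network, and every name here (PartialResult, handleResult, mergeOnMaster) is a hypothetical placeholder rather than the framework's API.

// Conceptual, single-process model of role-dependent result handling.
#include <iostream>
#include <map>
#include <string>
#include <vector>

enum class Role { Master, Slave };

struct PartialResult {
    std::string taskId;
    long value;
};

// Stand-in for the master's merge of periodical results, keyed by task id.
void mergeOnMaster(std::map<std::string, long>& merged, const PartialResult& r) {
    merged[r.taskId] += r.value;
}

void handleResult(Role role, const PartialResult& r,
                  std::vector<PartialResult>& outbox,
                  std::map<std::string, long>& merged) {
    if (role == Role::Slave) {
        outbox.push_back(r);       // would be sent to the master node
    } else {
        mergeOnMaster(merged, r);  // master folds it into the global view
    }
}

int main() {
    std::map<std::string, long> merged;
    std::vector<PartialResult> slaveOutbox;
    handleResult(Role::Slave,  {"task-0", 40}, slaveOutbox, merged);
    handleResult(Role::Master, {"task-0", 2},  slaveOutbox, merged);
    // Simulate the master later receiving what the slave queued.
    for (const auto& r : slaveOutbox) mergeOnMaster(merged, r);
    std::cout << "task-0 merged value: " << merged["task-0"] << std::endl;
    return 0;
}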

Some system optimizations

Although the main design of SunwayMR has been presented, some important optimizations can be applied further. One concerns inter-node communication efficiency (the thread-level stringstream optimization); the other concerns reliability (a detect/resume model for fault tolerance), as explained next.
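
The thread-level stringstream idea can be illustrated as follows: each worker thread serializes its output into a private std::ostringstream, so no locking is needed while partitions are being processed, and the buffers are concatenated afterwards in a single-threaded merge. This is a generic sketch of the technique, not SunwayMR's exact implementation.

// Each thread owns a private string buffer; contention-free serialization.
#include <iostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main() {
    const std::size_t numThreads = 4;
    std::vector<std::ostringstream> buffers(numThreads);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < numThreads; ++t) {
        workers.emplace_back([t, &buffers] {
            // Each thread writes only to its own buffer: no locking needed.
            for (int i = 0; i < 3; ++i)
                buffers[t] << "thread " << t << " record " << i << '\n';
        });
    }
    for (auto& w : workers) w.join();

    // Single-threaded merge of the per-thread buffers before sending/printing.
    std::string message;
    for (auto& b : buffers) message += b.str();
    std::cout << message;
    return 0;
}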

The layered software architecture

Today’s popular software architectures generally follow a loosely coupled, layered manner. Likewise, SunwayMR is a layered software architecture stack. From the perspective of software engineering, the framework’s code can be divided into three abstraction layers: (1) the upper code layer provides interfaces for application programming; (2) the bottom code layer manages the computing hardware resources; (3) the middle code layer mainly connects the preceding and
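
A tiny, illustrative sketch of this three-layer separation is given below; the class names (UpperApiLayer, MiddleSchedulingLayer, BottomResourceLayer) are placeholders invented for the example and do not correspond to SunwayMR's real components.

// Upper layer: what applications program against; middle layer: organizes
// the work; bottom layer: touches the hardware/OS resources.
#include <iostream>
#include <string>

class BottomResourceLayer {
public:
    void runOnHardware(const std::string& task) {
        std::cout << "bottom layer executes: " << task << std::endl;
    }
};

class MiddleSchedulingLayer {
public:
    explicit MiddleSchedulingLayer(BottomResourceLayer& b) : bottom_(b) {}
    void schedule(const std::string& task) { bottom_.runOnHardware(task); }
private:
    BottomResourceLayer& bottom_;
};

class UpperApiLayer {
public:
    explicit UpperApiLayer(MiddleSchedulingLayer& m) : middle_(m) {}
    void submit(const std::string& task) { middle_.schedule(task); }
private:
    MiddleSchedulingLayer& middle_;
};

int main() {
    BottomResourceLayer bottom;
    MiddleSchedulingLayer middle(bottom);
    UpperApiLayer api(middle);
    api.submit("word-count job");  // application code only sees the upper API
    return 0;
}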

Evaluation

In this section, we discuss two categories of research questions. One is whether the framework performs well. The other is what effects we obtain when applying our integrated optimizations and when varying the numbers of nodes and threads and the data sizes. To answer the above questions, we conduct the following analysis.

Conclusion

A programmable parallel computing framework, SunwayMR, is proposed and implemented for data-intensive applications on distributed clustering systems. It addresses the data volume challenge through parallelization and alleviates the data variety challenge. Some systematic design alternatives and mechanisms for distributed environments are evaluated. The experiments show that SunwayMR can effectively utilize cluster resources, while preserving its transparency, simplicity, and portability as a programming

Acknowledgments

This work is partially supported by the National High Technology Research and Development Program of China under Grant No. 2014AA01A301, and the National Natural Science Foundation of China (NSFC) under Grant No. 61472241.

References (29)

  • M.H. Wu, P.C. Wang, C.Y. Fu, R.S. Tsay, A high-parallelism distributed scheduling mechanism for multi-core...
  • Y. Cao, et al., A parallel computing framework for large-scale air traffic flow optimization, ITS (2012)
  • R. Baraglia, P. Dazzi, G. Capannini, G. Pagano, A multi-criteria job scheduling framework for large computing farms,...
  • Apache Hadoop,...

Renke Wu is currently working toward the Ph.D. degree in the department of computer science and engineering at Shanghai Jiao Tong University. His research focuses on parallel computing and software engineering.

Linpeng Huang received his M.S. and Ph.D. degrees in computer science from Shanghai Jiao Tong University in 1989 and 1992, respectively. He is a professor of computer science in the department of computer science and engineering, Shanghai Jiao Tong University. His research interests lie in the area of distributed systems, architecture-driven software development, parallel computing, big data analysis and in-memory computing.

Peng Yu received his B.S. degree in software engineering from Nankai University (NKU) in 2014. He is currently working toward the M.S. degree in the school of software at Shanghai Jiao Tong University. His research interests lie in the area of distributed systems, architecture-driven software development, parallel computing and big data analysis.

Haojie Zhou received his M.S. degree in computer science from the Chinese Academy of Sciences. He works in the State Key Laboratory of Mathematic Engineering and Advance Computing, Jiangnan Institute of Computing Technology. His research interests lie in the area of distributed systems, parallel computing and data analysis.
