Article

Online Task Scheduling of Big Data Applications in the Cloud Environment

1 National School of Computer Science and Systems Analysis, Mohammed V University in Rabat, Rabat 10112, Morocco
2 Mines ParisTech-PSL Centre de Recherche en Informatique (CRI), 77305 Paris, France
* Author to whom correspondence should be addressed.
Information 2023, 14(5), 292; https://doi.org/10.3390/info14050292
Submission received: 9 March 2023 / Revised: 8 April 2023 / Accepted: 11 May 2023 / Published: 15 May 2023
(This article belongs to the Special Issue Internet of Things and Cloud-Fog-Edge Computing)

Abstract:
The development of big data has generated data-intensive tasks that are usually time-consuming, with a high demand on cloud data centers for hosting big data applications. It becomes necessary to consider both data and task management to find the optimal resource allocation scheme, which is a challenging research issue. In this paper, we address the problem of online task scheduling combined with data migration and replication in order to reduce the overall response time as well as ensure that the available resources are efficiently used. We introduce a new scheduling technique, named Online Task Scheduling algorithm based on Data Migration and Data Replication (OTS-DMDR). The main objective is to efficiently assign online incoming tasks to the available servers while considering the access time of the required datasets and their replicas, the execution time of the task in different machines, and the computational power of each machine. The core idea is to achieve better data locality by performing an effective data migration while handling replicas. As a result, the overall response time of the online tasks is reduced, and the throughput is improved with enhanced machine resource utilization. To validate the performance of the proposed scheduling method, we run in-depth simulations with various scenarios and the results show that our proposed strategy performs better than the other existing approaches. In fact, it reduces the response time by 78% when compared to the First Come First Served scheduler (FCFS), by 58% compared to the Delay Scheduling, and by 46% compared to the technique of Li et al. Consequently, the present OTS-DMDR method is very effective and convenient for the problem of online task scheduling.

1. Introduction

Big Data analytics is essential to many applications and supports a variety of user services. Advances in internet technology have driven the growth of big data analytics and, consequently, of big data analytics tasks [1]. As a result, managing big data tasks and supporting data-intensive applications is now possible using cloud data centers [2]. Most big data applications [3,4,5,6,7] take the form of online task processing. However, these tasks are both computation- and data-intensive [8]; hence, it becomes a challenge to handle them efficiently.
Furthermore, in a dynamic cloud environment, resources such as virtual machines, storage, and networking components are provisioned and deprovisioned as needed to meet changing demands [9]. This increases the complexity of the task scheduling problem, as response time is a crucial decision-making parameter for data-intensive tasks. Thus, scheduling methods should not only aim to reduce task response time but also consider data migration and replication management to improve response time, throughput, and resource utilization [10]. In order to cope with dynamic cloud environments, researchers proposed several task scheduling strategies [11,12,13,14] to find a trade-off between different goals and achieve efficient task planning.
Data migration in cloud environments involves the process of transferring data from one storage or computing system to another within the same cloud infrastructure. The goal of data migration is to ensure that data are available in the right location at the right time to meet task needs. Managing data is a crucial factor to consider when dealing with data-intensive tasks, and there are two cases to consider: local data and remote data. Data locality occurs when the task and its required data are on the same server, while remote data involves accessing required data stored on different servers than those hosting the consumer tasks. Accessing remote data involves additional time [15] due to the migration process that occurs when moving the datasets over the network and writing them to the disks. There are several challenges associated with data migration in cloud environments, such as ensuring data security, maintaining data integrity, handling placement and storage, and managing costs [16,17,18].
Therefore, to reduce the task response time, it is preferable to schedule the task in the server where all or most of its required datasets are stored. Otherwise, the task has to be scheduled at least in the server ensuring an optimal data migration time. The scheduling process is also related to other metrics such as the heterogeneity of the configuration of the servers [19] in terms of CPU frequency, number of CPUs, size of available memory, etc., as well as the load of each server—to avoid both overloaded and underloaded nodes [20].
Data replication in cloud computing refers to the process of creating multiple copies of data and storing them across different physical locations or servers within a cloud computing environment [21]. This is completed to ensure that data are highly available, resilient, and can be accessed quickly in case of a failure or outage [22]. When it comes to data replication in cloud environments, there are several key challenges that need to be addressed. These include network bandwidth, data consistency, replication latency, and cost [23,24,25,26]. It is important to mention that, beside the replicas created during the initial placement of data, in this paper, the data migration process generates duplicated data that should be managed efficiently for better data locality.
Due to the dynamic provisioning of resources for online tasks, there is a constant queue of tasks waiting to be processed. However, since servers have limited storage capacity, not all incoming tasks can be scheduled to run locally, making it challenging to efficiently utilize the available resources for improved response time and throughput. As a result, servers may become either underloaded or overloaded, depending on whether the demand is lower or higher than their processing capacity [27].
To deal with the above issues, we address the task scheduling problem by proposing an Online Task Scheduling strategy based on Data Migration and Data Replication (OTS-DMDR) with the main focus of selecting the most suitable tasks to be executed on each server.
In our proposed algorithm OTS-DMDR, we first establish a model to estimate the response times of tasks in different servers. We then decide between the following three actions: (1) achieve data locality by scheduling the task on the server storing the required datasets; (2) delay the task execution so it is scheduled on another server for better data locality; (3) schedule the task on a remote server that gives an optimal response time, including the migration process. Additionally, in the task response time, we consider the replicated datasets, the computational capacity, and the load of each server to prevent underloaded and overloaded machines. Finally, after comparing our online task scheduling OTS-DMDR with other existing algorithms in the literature, the corresponding results show that the proposed OTS-DMDR can guarantee better average response time by 46% compared to Li et al. [28], by 58% compared to the Delay Scheduling method, and acceptable load balancing between machines, improving the overall system efficiency.
In summary, our contributions can be organized as follows:
  • Formalize the OTS-DMDR problem considering both the heterogeneity and the processing capacity of the servers, together with locality, movement, and replication of datasets.
  • Propose algorithms to estimate the different costs and measure the tasks’ adequacy to the servers to seek a better task-to-server allocation.
  • Conduct extensive simulation experiments to evaluate the efficiency of our algorithm, OTS-DMDR.
The rest of this paper is organized as follows. Section 2 describes the related work on various scheduling methods and frameworks. Section 3 presents how our online task model was established. The proposed OTS-DMDR is outlined with different implemented algorithms in Section 4. We conduct various experiments and assess the algorithm’s effectiveness in Section 5. Finally, Section 6 draws conclusions and some perspectives.

2. Related Work

In this section, we describe some common metrics studied in different task scheduling methods. Then, we review the two main types of scheduling techniques: single-objective and multi-objective. Finally, we highlight our motivation.

2.1. Commonly Used Metrics

Big data processing requires a lot of computing resources; thus, effectively managing these resources is essential due to the heterogeneity and dynamism of the environments. Scheduling algorithms are a set of policies, procedures, and rules implemented to assign the best resource for task execution, with the aim of accomplishing the service provider’s and the cloud user’s objectives. Each of the existing scheduling methods [29,30,31,32] takes several performance metrics into consideration. The most common metrics are mentioned below:
  • Throughput [33,34];
  • Execution time [35,36,37];
  • Response time [38,39];
  • Execution cost [32,40];
  • Deadline and Budget constraints [41,42,43];
  • Load balancing [20,44,45];
  • Fault tolerance [46,47];
  • SLA violation [41,48];
  • Energy consumption [49,50,51];
  • Data transfer [28].

2.2. Single-Objective Scheduling Techniques

Some of the earliest scheduling algorithms that have been studied in the literature are [30,52,53].
The First Come First Served (FCFS) scheduling algorithm is the most traditional one. Its idea is that later-arriving tasks have to wait until the end of the execution of earlier ones [52]. Only after a task ends will the next task in the queue be considered. The FCFS method is also the main scheduler used in the Hadoop framework [54]. The disadvantages of this strategy are that the waiting time for tasks is increased and that it does not consider task size. Moreover, it fails to balance the workload among machines and decreases data locality.
The Shortest Job First (SJF) method [30] chooses the shortest task to be executed first in order to reduce the execution time. However, due to uneven load distribution on the servers, the algorithm may fail to respect the SLA.
The Round Robin (RR) algorithm [53] distributes tasks in a circular fashion, giving every task an equal amount of CPU time. However, the round-robin strategy results in a higher average waiting time.
The traditional scheduling algorithms (mentioned above) did not find the best solution to the multi-dimensional scheduling problem since the scheduling algorithm should simultaneously optimize various parameters [11,31] such as response time with resource utilization, makespan, cost, energy consumption, etc.

2.3. Multi-Objective Scheduling Techniques

To address the issue mentioned above, many scheduling techniques have been proposed, with focus on enhancing multiple parameters simultaneously [32,55,56,57,58].
Shyam and Manvi [55] suggested a resource allocation technique that maximizes resource usage while minimizing time and budget. The method relies on VM migration to improve the placement ratio of VMs, which is advantageous for both cloud users and providers.
Wang et al. [32] proposed a dynamic resource provisioning algorithm that is ideal in terms of service availability, migration, and leasing costs. The study considers resources such as CPU, memory, and storage.
For data-intensive applications, Zhao et al. [56] proposed an energy-efficient scheduling technique where datasets and tasks are treated as a binary tree using a data correlation clustering algorithm. By decreasing the number of active VMs and data transmission time, the proposed strategy is used to minimize the energy usage of cloud data centers. However, the online scheduling is not considered.
To minimize the execution time while increasing resource usage, the work in [57] proposed a scheduling algorithm based on the IBA (Improved Backfill Algorithm) that takes task priority into account. Priority is an important metric for users who want to pay more for a quicker answer (VIP request). The limitation of this technique is that its performance decreases as the number of tasks grows.
The term Online Potential Finish Time was coined in [58] to improve execution time and cost in cloud computing. Tasks are distributed onto powerful virtual machines, which can execute tasks with the least amount of delay.
Reddy G. Narendrababu et al. [59] introduced a modified version of the ant colony optimization algorithm (MACO) that is tailored to multi-objective task scheduling in cloud environments. MACO improves upon the original ACO algorithm by assigning pheromone values to virtual machines (VMs) based on their RAM, bandwidth, storage, processing speed, makespan, and other factors. This approach facilitates the efficient allocation of tasks to VMs that are best suited for the task, resulting in better resource utilization and reduced degree of imbalance. The MACO algorithm outperforms basic ACO, PSO, and GA algorithms in terms of makespan, system load balance, and task assignment efficiency.
A dynamic round robin scheduling algorithm is proposed in [60]. The authors dynamically calculate the time quantum for each round by taking into account the differences among the maximum burst times of the three tasks in the ready queue. One potential issue with this method is that it does not efficiently handle the starvation challenge. Despite this concern, the proposed method offers significant benefits such as reducing the average turnaround time, decreasing the average waiting time, and minimizing the number of context switches.
The research in [61] employed a genetic meta-heuristic algorithm to enhance performance by investigating the environment. The fitness function combined throughput, response time, and cost criteria, producing overall enhancements. To ensure that all parameters were given equal consideration, normalization was employed, resulting in relative optimization. The suggested method improved waiting time, makespan, and utility while slightly reducing costs, resulting in superior service for both providers and users. The main limitation of this work is that it does not address the topic of data-intensive online tasks. Moreover, it could be hard to adapt such a solution for data-intensive online tasks.
The authors in [62] introduce the Hard Disk Drive and CPU Scheduling (HCS) algorithm for devices with multiple cores and hard disks, aiming to optimize execution time and energy consumption while minimizing missed tasks. It considers scheduling multiple parallel tasks with individual deadlines and utilizes multiple stages to execute sorted tasks. However, this study does not consider memory effects, network bandwidth, and latency of multi-core systems.

2.4. Our Motivation

It is obvious from the research works referenced above that the majority of authors focus primarily on resources, especially computing resources, since the main activity of the task is on CPUs. However, the frequent I/O operations required for big data analytics tasks make data locality more crucial, as local I/O can minimize task execution time more effectively than remote ones [1]. The most fundamental scheduling technique used in big data systems is called DLB (Data Location Based) [12]. For that, a delay algorithm or a matchmaking algorithm may be used.
The delay algorithm [63] resolves locality through waiting. The goal of this technique is to assign tasks to servers based on the location of their input data; that is, when a node in the cluster is free and asks for a task from the queue, the data required by the selected task may not be stored on that available node. In that case, the delay scheduling technique delays the task until a node containing the required data becomes available (achieving data locality). However, to prevent starvation, a task that has been waiting too long is eventually executed regardless of the locality of its input data.
The matchmaking algorithm [64] implies that every node has an equal opportunity to take advantage of the local tasks before a new task is assigned to the nodes. A local task’s input data are kept at the relevant node.
Generally, the DLB approach attempts to reduce the amount of time spent transferring data and provide fairness by achieving data locality. The problem is that when the data are not spread equally across the nodes, the servers’ load may be unbalanced, and thus the execution time may be longer. Yet, hot data spots may affect both the matchmaking and the delay algorithms, meaning that some nodes may be overloaded with tasks as a result of their data storage while others are left idle. The tasks are mostly scheduled on the servers where the majority of their input data are stored. Therefore, a few servers are always used, which makes them overloaded. As a result, the task execution time is larger and the throughput is lower.
The aforementioned scheduling frameworks prioritize the task scheduling problem while ignoring the deployment of incoming data. Because handling resources and tasks is seen as the most expensive part, the majority of prior works focused on managing them. However, as scientific applications become more and more data-intensive, handling storage, data management, and computing resources is increasingly critical [65]. The most closely related scheduling strategy to our work is presented by Li et al. in [28]. They proposed an online job scheduling method based on data migration that selects a proper task to be scheduled when a server becomes available. The authors make a trade-off between two costs: (1) the task is assigned to a remote server with a data transfer cost, or (2) the task waits a certain amount of time for a server that ensures data locality, with a waiting time cost.
The approach in [28] schedules tasks sequentially, one task after the other, which increases the waiting time of the tasks in the queue. Moreover, when migrating data, several characteristics were not considered, such as machine performance, the network between machines, storage space, and task requirements in terms of CPU, RAM, size, and volume of required data per task. Consequently, this does not guarantee an optimal result. Furthermore, the process of handling data replication is not discussed in [28].
In summary, the specificity of our OTS-DMDR approach is that we use a different concept to assign tasks to nodes. In fact, we choose a set of tasks from the incoming tasks from the queue and assign them to the nodes with potentially the best response time. To calculate the response time, we take into account the data migration time, including the data replication process and the computing power of each server (e.g., CPU usage, RAM availability, and storage capacity). Moreover, we anticipate the execution of a task in a better node by considering the delay time and balancing the load between the servers.
It is important to note that data placement and replication techniques were investigated in many papers, such as [66,67,68]. These papers cannot be compared to the scheduling algorithms reviewed in our related work section, but we mention them because they study the importance of data availability in the cloud computing environment. Table 1 summarizes the above analysis.
For further detail on task scheduling techniques, we refer the reader to the following reviews: [11,12,13,20]. These papers primarily focus on examining various scheduling techniques used in cloud computing and presenting a new classification scheme for scheduling algorithms, along with a detailed review of resource scheduling techniques. Additionally, they aim to highlight the advantages and limitations of heuristic, meta-heuristic, and hybrid scheduling algorithms.

3. System Model and Problem Formulation

3.1. System Model

The target system of this study is represented by a set of heterogeneous machines that support two processing types: task processing and data processing.
The machine characteristics and their notations are listed as follows:
  • M = { m i } a set of machines, where m i designates the ith machine.
  • s c i is an integer value that represents the storage capacity of machine m i (MB).
  • r i is a float value that represents the read speed of machine m i (MB/second).
  • w i is a float value that represents the write speed of machine m i (MB/second).
  • R A M [ m i ] is an integer value that represents the available memory capacity of machine m i (MB).
  • N _ C P U [ m i ] is the number of cores of machine m i .
  • P _ C P U [ m i ] is an integer value that represents the CPU performance of each core of machine m i (Million Instructions per second— MIPS).
  • b i j is an integer value that represents the bandwidth of the connection between machines m i and m j (MB/second).
  • β i j is the elementary data transfer time [68] between machines m i and m j ; it is defined by:
    $$\beta_{ij} = \begin{cases} 0 & \text{if } i = j \\ \dfrac{1}{b_{ij}} & \text{otherwise} \end{cases}$$
  • P P [ m i ] is an integer value that represents the processing power of machine m i (in Million Instructions per second—MIPS). P P [ m i ] is the overall CPU amount of m i and is calculated as follows:
    P P [ m i ] = N _ C P U [ m i ] × P _ C P U [ m i ]
    where N _ C P U [ m i ] is the number of cores of m i and P _ C P U [ m i ] is the CPU performance of every core in m i .
  • T P i defines the list of tasks in progress in m i .
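As a simple illustration of the notation above, the machine model can be sketched as follows in Python; the class and function names are ours and purely illustrative, not taken from any implementation of this paper.

from dataclasses import dataclass, field

@dataclass
class Machine:
    """Illustrative machine record following the notation above."""
    sc: int            # storage capacity (MB)
    r: float           # read speed (MB/s)
    w: float           # write speed (MB/s)
    ram: int           # available memory (MB)
    n_cpu: int         # number of cores
    p_cpu: int         # per-core performance (MIPS)
    tp: list = field(default_factory=list)   # TP_i: tasks in progress on this machine

    def pp(self) -> int:
        """Overall processing power PP[m_i] = N_CPU[m_i] x P_CPU[m_i] (MIPS)."""
        return self.n_cpu * self.p_cpu

def beta(i, j, bandwidth):
    """Elementary data transfer time between m_i and m_j: 0 if i = j, otherwise 1 / b_ij."""
    return 0.0 if i == j else 1.0 / bandwidth[i][j]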
Many independent users submit tasks for execution. In this paper, we consider that tasks arrive in an online manner to the servers of the different cloud data centers. All the online tasks share resources and data over the servers. Since the tasks we are handling are data-intensive, two important factors are associated with each task: required data and resources. Tasks are executed in a non-pre-emptive way. Each task is defined as follows:
  • T = { t i } a set of tasks, where t i is the ith task;
  • l i an integer value that designates the length of ith task (in Million Instructions—MI);
  • R A M [ t i ] is an integer value that represents the memory capacity required by task t i (in MB);
  • C P U [ t i ] is an integer value that represents the quantity of MIPS required by task t i ;
  • V [ t i ] is an integer value that represents the total size of all the required datasets by task t i ;
  • α i is the index of the final machine assignment ( m α i ) of task t i ;
  • ω i is a decimal value that represents the arrival time of t i ;
  • U R i j is the CPU utilization ratio to determine whether a machine m j has a sufficient amount of resources to support a task t i or not.
As mentioned before, load balancing is a critical aspect to take into consideration when designing any task scheduling algorithm in a way that optimizes resource utilization, maximizes throughput, and minimizes response time. For this, we define the workload of each server as follows:
$$Load[m_i] = \frac{\sum_{t_j \in TP_i} l[t_j]}{PP[m_i]}$$
where L o a d [ m i ] is a percentage rate that indicates whether m i is overloaded or underloaded. L o a d [ m i ] is computed by dividing the total length of all tasks running in m i by the processing power P P [ m i ] .
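As a minimal sketch (argument names are illustrative, not the paper's implementation), the load of a machine could be computed from the lengths of its in-progress tasks as follows:

def machine_load(tasks_in_progress, pp):
    """Load[m_i]: total length (MI) of the tasks running on the machine divided by
    its processing power PP[m_i] (MIPS)."""
    return sum(t["length"] for t in tasks_in_progress) / pp

# e.g., tasks of 20,000 MI and 50,000 MI on a 100,000 MIPS machine give a load of 0.7 (70%)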
In our work, we assume that a fixed number of datasets are initially stored on the servers. Each dataset is defined as follows:
  • D = { d i } a set of datasets where d i is the ith dataset;
  • v i an integer value that designates the volume of ith dataset (in MB);
  • Ψ = { ψ i j } is the datasets to machines assignment matrix. Equation (4) describes the computation of matrix Ψ .
    $$\psi_{ij} = \begin{cases} 1 & \text{if } d_i \text{ is stored in } m_j \\ 0 & \text{otherwise} \end{cases}$$
  • F = { f i j } is the assignment of the datasets to tasks matrix. We set matrix F because a task may require one or multiple datasets for its execution and many tasks may use the same dataset. Matrix F is generated following Equation (5).
    $$f_{ij} = \begin{cases} 1 & \text{if } d_i \text{ is required by } t_j \\ 0 & \text{otherwise} \end{cases}$$
For a given dataset d i , there could be two options of use. (1) The local use is when the dataset and its consumer task are on the same node, in that case, the dataset is locally accessed. (2) The remote use is when the required dataset is stored in a different node than the one hosting the task; in that case, data migration is needed from a distant source. We can clearly see that due to the migration process, the execution time of the consumer task is affected by adding a data migration time D M T , where D M T i j is the time to migrate all the datasets required by t i from their locations to m j ( m j is also where t i is assigned) [68].
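For illustration, assuming the matrices F and Ψ are stored as 0/1 lists of lists, the datasets required by a task can be split into local and remote ones for a candidate machine as in the following sketch (the function name is ours):

def split_local_remote(i, j, F, Psi):
    """Partition the datasets required by task t_i into those already stored on the
    candidate machine m_j (local) and those that must be migrated (remote).
    F[k][i] = 1 if dataset d_k is required by t_i; Psi[k][j] = 1 if d_k is stored on m_j."""
    local, remote = [], []
    for k in range(len(F)):
        if F[k][i] == 1:
            (local if Psi[k][j] == 1 else remote).append(k)
    return local, remote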
Our proposed Online Task Scheduling strategy based on Data Migration and Data Replication (OTS-DMDR) aims to select online tasks from the queue and schedule them to the appropriate server to ensure better response time as well as a load-balanced system. The task response time includes two main factors: the performance of the resources and, more critically, the management of data regarding their location, movement, and replication within the system.
In addition, the proposed (OTS-DMDR) technique is a generic algorithm that could be easily extended to handle different types of data. Mainly, we can use a data adapter component to integrate heterogeneous data types (text, images, logs, videos, etc.) that could be generated by different devices, such as the one used in IoT, financial institutions, and healthcare areas [69]. The next section explains in detail and illustrates the benefit of our approach.

3.2. Problem Formulation

The challenge is how to distribute incoming tasks among servers so as to reduce task response time while avoiding overloaded or underloaded servers. Since data migration requires time, it is obvious that we should seek data locality for tasks as much as possible in order to decrease the response time [1]. When data are migrated to new locations, new copies of data are generated over the system, called replicated data. In general, data replication increases the availability of data, thereby achieving more data locality and reducing the response time of the next incoming tasks.
Re-examining the issue, the tasks that must be chosen from the queue are the ones that can be carried out on a suitable server with the best response time. In our algorithm, OTS-DMDR, the scheduling result combines the data locality method, the data migration method, and delay scheduling. In other words, the result is one of the following: data locality, i.e., the task is placed directly on the server containing all its required data; the task is placed on a remote server that yields a minimal data migration time; or the task is delayed until another server offering the best response time, via data locality or data migration, becomes available. Simultaneously, the machine load is also taken into account in the OTS-DMDR technique to increase the effectiveness of the entire system.
To better illustrate the OTS-DMDR technique, we give an example in Figure 1. In Figure 1a, we depict the system configuration. Q is the queue of online tasks. F is the matrix of the assignment of the datasets to the task and Ψ is the matrix of the assignment of datasets to machines.
According to the OTS-DMDR method, machine m 1 is determined to be the optimal choice for task t 1 as shown in Figure 1b, since it achieves perfect data locality with the required datasets d 1 and d 4 already stored on m 1 . Similarly, for task t 2 , the OTS-DMDR method selects machine m 2 as the most efficient solution, as in Figure 1b. Therefore, executing t 2 on m 2 will result in the shortest response time due to the locally stored required data d 2 and the minimal migration time to migrate the required data d 3 from m 3 to m 2 . As a result, tasks t 1 and t 2 can be executed at the same time (in parallel). In addition, a replication of d 3 is created in m 2 .
Finally, the OTS-DMDR algorithm estimates the response time of task t 3 on all machines. Using this approach, the algorithm suggests that it is preferable to delay the execution of t 3 until machine m 2 becomes available. This delay is represented by a time interval denoted as Δ . Despite the delay, executing t 3 on m 2 is expected to result in a lower response time compared to assigning t 3 to other machines that would require greater data migration times. This is shown in Figure 1c.
It is important to note that in the case of limited computational power of a machine due to different causes (lack of memory, lack of storage, CPU overload, etc.), the proposed OTS-DMDR algorithm, as we will see in Section 4.1.1, proceeds by either skipping that machine in favor of another one that could host the current task, or delaying the task’s execution until that machine becomes available again.

3.3. Objective Function

In this section, we design a mathematical formulation for our proposed algorithm OTS-DMDR. Our objective function seeks an efficient task scheduling that minimizes the task response time while maintaining a balanced load of the nodes. The response time is the time required for each task to complete its execution from the moment it arrives in the queue. The value is a combination of the following metrics (see Figure 2):
  • Scheduling Time (ST): the time between the arrival of the task in the queue and its scheduling.
  • Delay Time ( Δ ): the time that a task can wait for the availability of a given machine.
  • Waiting Time (WT): the sum of scheduling time (ST) and delay time ( Δ ).
  • Data Migration Time (DMT): the time a task needs to locally gather all its remote required datasets.
  • Data Access Time (DAT): the time it takes for a task to read all its local required datasets.
  • Execution Time (ET): the time to execute the task.
  • Total Execution Time (TET): the sum of data migration time (DMT), data access time (DAT), and execution time (ET).
  • Response Time (RT): the sum of waiting time (WT) and total execution time (TET).
The problem of reducing the response time of a task t i when scheduled in m j can be formulated as:
$$\min RT_{ij} = \min(WT_{ij} + TET_{ij}) = \min(ST_{ij} + \Delta_{ij} + DMT_{ij} + DAT_{ij} + ET_{ij})$$
The constraints related to our objective function are shown in Equations (7)–(9).
$$\text{s.t.} \quad RAM[t_i] \le RAM[m_j] - \sum_{t_k \in TP_j} RAM[t_k]$$
$$\sum_{l=1}^{|D|} v_l \times f_{li} \le sc_j - \sum_{t_k \in TP_j} V[t_k], \quad \text{if } \psi_{lj} = 0$$
$$Load_{min} \le Load_j \le Load_{max}$$
The constraint from Equation (7) guarantees that the remaining amount of RAM in m j exceeds the amount of RAM requested by task t i ( T P j is the list of tasks running in m j ). The constraint from Equation (8) ensures there is enough storage in m j to store the required datasets of t i in the case of data migration ( t k is in progress in m j and d l is a remote dataset). Finally, the constraint from Equation (9) assures the load balancing of the system in such a way that the load of machine m j is comprised between two thresholds ( L o a d m i n and L o a d m a x ) in order to avoid underloaded and overloaded nodes, respectively.
For simplicity, we will set the default value of L o a d m a x at 70% CPU utilization and the default L o a d m i n value at 20% CPU usage [70] for the remainder of the paper.
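As an illustrative sketch only (plain dictionaries with hypothetical keys, not the paper's implementation), the feasibility constraints of Equations (7)–(9) could be checked as follows:

def satisfies_constraints(task, machine, tasks_in_progress, remote_volume,
                          load_min=0.20, load_max=0.70):
    """Check Equations (7)-(9) for placing a task on a machine; the 20%/70% load
    thresholds follow the defaults stated above."""
    ram_left = machine["ram"] - sum(t["ram"] for t in tasks_in_progress)
    storage_left = machine["sc"] - sum(t["data_volume"] for t in tasks_in_progress)
    load = sum(t["length"] for t in tasks_in_progress) / machine["pp"]
    return (task["ram"] <= ram_left                # Eq. (7): enough remaining RAM
            and remote_volume <= storage_left      # Eq. (8): room for the migrated datasets
            and load_min <= load <= load_max)      # Eq. (9): balanced load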

4. Proposed Approach

In this section, we explain the main steps of our suggested task scheduling strategy OTS-DMDR, which selects a set of tasks from the queue and schedules them on the machines offering the optimal response time.
Our approach consists of the following four steps (see Figure 3):
  • Estimate the response time matrix for the incoming tasks in the queue for all machines;
  • Generate a preference list for task-to-machine assignment;
  • Perform task selection and assignment;
  • Update system state (the availability of the machines and the tasks in the waiting queue Q).
The steps above are repeated for the tasks in the queue. Based on the waiting time of the task, the data migration time, the overall execution time, and the load of the machines, a set of tasks will be selected from the queue and assigned to the servers that fit them best.

4.1. Response Time Matrix

The idea of our proposed task scheduling strategy is to choose a set of tasks from the waiting queue and assign each of them to the most appropriate server. In other words, we select the appropriate hosting tasks for each server. This method not only allows us to efficiently use all available servers but also to schedule multiple tasks simultaneously instead of scheduling them one by one.
In order to select tasks and assign them to the most suitable servers, we first have to compute the response time matrix R T for each task in the queue for all the machines. The R T matrix contains the response time R T i j of each task t i if it is assigned to machine m j .
Figure 4 and Algorithm 1 show how the response time matrix is computed in detail. For each task t i in the queue Q, we go through the set of servers in order to estimate R T i j , the response time of the task t i if assigned to m j . First, we check if m j can host t i by computing the fitness value using the method MachineFitTask, as shown in line 9 in Algorithm 1 and step 1 in Figure 4. Then, we compute the response time R T i j based on four main costs:
  • Waiting time W T i j , which includes both the delay time Δ i j and the scheduling time S T i j (line 20).
    Scheduling time S T i j is how much time t i waits in the queue to be scheduled in m j .
    Delay time Δ i j is how much time t i can wait for m j to be available. It is measured using the ComputeDelayTime function (line 13). Δ i j is computed only if m j will suit t i later (see also step 2 in Figure 4). More details are explained in Section 4.1.4.
  • Time to migrate required remote data D M T i j is computed when there is no data locality for a given required dataset (line 17 and step 3). Function ComputeDataMigrationTime is detailed in Section 4.1.2. Otherwise, if all the required datasets are locally available, D M T i j = 0 .
  • Time to access required data locally D A T i j (step 4 in Figure 4) is computed by calling the ComputeDataAccessTime function in line 18. This step aims to measure the needed time to consume the datasets already available locally and the ones that have just been gathered via the migration process. More information is depicted in Section 4.1.3.
  • Time to execute t i in m j defined by E T i j as shown at line 19 in Algorithm 1 and step 5 in Figure 4.
Algorithm 1 Compute Total Response Time Matrix for Each Task in the Queue Q
Input:
  1: Q = (t1, t2, …, tn): Queue of arrived tasks
  2: M = (m1, m2, …, mp): Set of machines
Output:
  3: R T i j : Response time of t i if placed in m j , where 1 ≤ i ≤ | Q | and 1 ≤ j ≤ | M |
  4: if Q is empty then
  5:     Wait for tasks to arrive to Q (see Figure 3)
  6: else
  7:     for ( i Q ) do
  8:         for ( j M ) do    // assume t i will be placed in m j
  9:             ϕ i j MachineFitTask( i , j )
  10:            if  ( ϕ i j = −1 )  then
  11:                exit;    // m j can’t host t i , move to the next machine
  12:            else if ( ϕ i j = 0 ) then
  13:                 Δ i j ComputeDelayTime( i , j )
  14:            else if ( ϕ i j = 1 ) then
  15:                 Δ i j 0
  16:            end if
  17:             D M T i j ComputeDataMigrationTime( i , j )
  18:             D A T i j ComputeDataAccessTime( i , j )
  19:             E T i j ← l i / P P j + D A T i j
  20:             W T i j S T i j + Δ i j
  21:             R T i j D M T i j + D A T i j + E T i j + W T i j
  22:         end for
  23:     end for
  24:      P L GeneratePreferenceList( R T )
  25:      ( Q , α ) SelectTasks( M , Q , P L )
  26: end if
Afterward, we have a matrix of response time R T (step 6, Figure 4) of all the tasks in the queue, assuming that they are executed in all the machines of the system. The R T matrix will be the basis for our scheduling scheme. Based on matrix R T , the OTS-DMDR algorithm generates a preference list P L for the task-to-machine assignment (line 24, Algorithm 1) and is better detailed in Section 4.2. Therefore, a set of tasks is selected to be scheduled in the appropriate servers using method SelectTasks (in line 25). The tasks’ selection process is described in Section 4.3.

4.1.1. Fitness

The fitness calculation algorithm defines whether the chosen machine m j is adequate and fit for the execution of task t i (as shown in Figure 4, step 1). If m j cannot host t i ( m j does not fit t i ), m j is directly discarded. In our work, we consider several metrics to say that t i can be assigned to m j or that the fitness of m j to t i is achieved. The fitness metrics are: the amount of RAM, storage capacities, and CPU utilization rate ( U R ). U R i j is the CPU usage rate of t i in m j and is computed using Equation (10). P P [ m j ] is the processing power of m j and is calculated using Formula (2).
$$UR_{ij} = \frac{CPU[t_i]}{PP[m_j]}$$
The last metric is the machine load, which determines if the machine is overloaded or underloaded. The load is calculated by Equation (11).
$$Load_j = \frac{\sum_{t_i \in TP_j} l_i}{PP[m_j]}$$
Algorithm 2 depicts how the fitness of task t i in machine m j is computed. There are three response states:
  • ϕ i j = −1 , if the utilization ratio is more than 1 (lines 5 and 6).
  • ϕ i j = 1 , if and only if the utilization rate does not exceed 1, the remaining storage capacity in m j can accommodate the total amount of data required by t i (lines 9 to 15), the load of m j is between the L o a d m i n and L o a d m a x thresholds (lines 16 to 25), and the amount of remaining RAM in m j is greater than R A M [ t i ] .
  • ϕ i j = 0 , if one or more of the conditions above are not verified, i.e., m j does not have enough CPU or/and not enough RAM to host t i , or/and the storage capacity of m j cannot store the remote required datasets of t i or/and m j is overloaded or underloaded.
The fitness status obtained from Algorithm 2 is returned to the main Algorithm 1 for processing the three different cases of compatibility (fitness):
  • ϕ i j = −1 , the machine m j cannot host the task t i due to the lack of CPU, and no action can be taken. We start by checking this first case, so we can know from the beginning if we can continue the process of calculating the response time. In this case, the scheduler moves to the next machine.
  • ϕ i j = 0 , the machine m j cannot host the task t i due to insufficient RAM or/and storage or/and m j being overloaded or underloaded. The peculiarity here is that the task can be delayed and wait for these conditions to be verified and accomplish the fitness on m j . In this case, we talk about delay scheduling technique. Task t i can wait for a delay Δ i j so that the resources of m j become available again to host t i . The measurement of the delay time Δ will be explained in Section 4.1.4.
  • ϕ i j = 1 , the machine m j can host the task t i without constraint violations and delay time.
We would like to mention that in the case where the storage of m j is not enough to store the required datasets of t i , we select a set of datasets to delete from m j . For that, we use our previous work [68], based on data replication, for data selection and deletion processes. The idea is based on two factors:
  • Dependency between tasks and datasets ( d e p e n d k ): this factor seeks to define how many duplicated datasets in m j are required for the uncompleted tasks in the queue. In other words, we compute how many tasks in Q are using every replicated dataset in m j .
  • Number of existing replicas of the dataset ( r e p l k ): this factor attempts to define how many replicas of each dataset d k are currently available in the whole system. Therefore, we check if each machine m j stores d k as a replica copy. The value of r e p l k is raised by one each time a replica of d k is identified.
Due to the possibility of multiple replications, only datasets with more than m a x R e p replicas (here, m a x R e p = 3) are qualified for deletion.
We select the dataset d k with the lowest d e p e n d k (the dataset least used by the unfinished tasks). If there are multiple datasets with the same d e p e n d factor, we take the one with the highest r e p l into consideration. Based on this, we delete the datasets one after another until the freed space is greater than the requested size, liberating the storage needed by the datasets that will be migrated for the execution of t i .
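The selection just described can be sketched as follows; the function and argument names are hypothetical and given for illustration only:

def select_datasets_to_delete(replicas_on_mj, depend, repl, sizes, needed_space, max_rep=3):
    """Choose replicated datasets to remove from m_j until needed_space MB is freed.
    Only datasets with more than max_rep replicas system-wide are candidates; they are
    deleted in order of lowest dependency (depend) and, on ties, highest replica count (repl)."""
    candidates = [d for d in replicas_on_mj if repl[d] > max_rep]
    candidates.sort(key=lambda d: (depend[d], -repl[d]))
    freed, to_delete = 0, []
    for d in candidates:
        if freed >= needed_space:
            break
        to_delete.append(d)
        freed += sizes[d]
    return to_delete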
Algorithm 2 Compute Fitness of mj to host ti
Input:
  1: i: index of task for which fitness is checked
  2: j: index of machine whom we check the placement fitness
Output: ϕ i j : fitness status (1,0,−1)
  3: function MachineFitTask( i , j )
         // CPU Utilization Ratio measurement
  4:      U R i j ← C P U [ t i ] / P P [ m j ]
  5:     if  ( U R i j > 1 )  then
  6:          ϕ i j ← −1
  7:         exit;
  8:     else
          // Storage capacity verification
  9:          V [ t i ] 0
  10:         for ( k D ) do
  11:             V [ t i ] V [ t i ] + f k i × v k // total remote data size required by t i
  12:         end for
  13:         if  ( s c j < V [ t i ] )  then
  14:             s e l e c t e d D a t a s e t s SelectDatasetsToDelete( i , V [ t i ] )
  15:         end if
            // Load measurement
  16:          L o a d j ← 0
  17:         for ( k ∈ T P j ) do // search tasks in progress in m j
  18:             L o a d j ← L o a d j + l k / P P [ m j ]
  19:         end for
  20:         if  ( L o a d j ≥ L o a d m a x ) then// m j is overloaded
  21:             m j . o v e r l o a d e d = 1
  22:         end if
  23:         if  ( L o a d j ≤ L o a d m i n ) then // m j is underloaded
  24:             m j . u n d e r l o a d e d = 1
  25:         end if
        // Fitness Measurement
  26:         if ( R A M [ m j ] − Σ t k ∈ T P j R A M [ t k ] ≥ R A M [ t i ] && m j .overloaded = 0 && m j .underloaded = 0)
    then
  27:             ϕ i j 1
  28:         else
  29:             ϕ i j 0
  30:         end if
  31:     end if
    return ϕ i j
  32: end function

4.1.2. Migration Time

Once the fitness of scheduling a proper task t i in a proper machine m j is calculated, we can now start computing the migration time D M T i j in order to estimate the response time of t i in m j .
Algorithm 3 is used to compute the time needed to migrate the remote required datasets of task t i from their remote locations to m j . There are two potential issues in calculating the migration time. The first is that t i may need one or more datasets to migrate. The second is that multiple replicas may exist for a single dataset. In Algorithm 3, the block between line 5 and line 16 describes how to solve these two issues.
Algorithm 3 Compute required remote Datasets Migration Time of ti placed in mj
Input:
  1: i: index of task for which we will estimate the needed time to migrate its required data
  2: j: index of machine we assumed ti will be scheduled and data will be migrated to
Output: D M T i j : Data Migration Time of the remote required datasets of t i from their distant locations to the local node m j where t i is scheduled
  3: function ComputeDataMigrationTime( i , j )
  4:    D M T i j 0
  5:     for ( k D ) do
  6:         if  ( f k i = 1 ) then    // d k is required by t i
  7:             τ i j k j 0
  8:            if ( ψ k j = 0 ) then    // d k is a remote data
  9:                 l 0
  10:                for ( l M { m j } ) do
  11:                     τ i j k l 0
  12:                    if  ( ψ k l = 1 ) then    // d k is stored in m l
  13:                         τ i j k l ← ( 1 / r l + 1 / w j + 1 / b l j ) × v k    // time to migrate d k required by t i from distant m l to local m j
  14:                    end if
  15:                end for
  16:            end if
    // Sort migration times of all machines of each d k τ i j ( k , : ) in ascending order
  17:             [ σ i j ( k , : ) ] s o r t ( τ i j ( k , : ) ) // σ i j ( k , q ) = l, i.e., d k is migrated from m l with time of τ i j k l
  18:             s σ i j ( k , 0 )    // m s is the machine source with least migration time to move d k to m j
  19:             D M T i j D M T i j + τ i j k s    // data migration time
  20:         end if
  21:     end for
    return D M T i j
  22: end function
For each dataset, we check if d k is required by t i (line 6) and if d k is not stored locally in m j (line 8). In this case, the migration of d k is required by finding all its locations, calculating the time needed to migrate d k from each of its locations to m j , and finally selecting the location m l with the smallest migration time. Line 13 shows how to calculate the time to migrate d k from one of the found locations m l to the local node m j . This migration time is denoted by τ i j k l .
In fact, the migration depends on the size of the data ( v k ) and consists of three processes: (1) reading d k from the remote node m l with a read speed of r l , (2) writing d k to the local node m j with a write speed of w j , and (3) transferring d k from m l to m j via a bandwidth with a transfer rate of b l j .
For now, for each data d k required by t i not achieving the data locality, we have its migration time τ i j ( k , : ) from all its existing locations to the local machine m j . The next step is to select the best location from which d k will be migrated. To do this, we sort the vector τ i j ( k , : ) (line 17) into ascending order and pick the first element σ i j ( k , 0 ) , which gives the best machine m s providing d k with the lowest migration time τ i j k s (line 18).
Before moving on to the next dataset, the value τ i j k s is added to the D M T i j value (line 19), where D M T i j is the total time needed to migrate all the required remote datasets of t i affected to m j .
Finally, after browsing all the remote data and computing their migration time from their best location, the total value of D M T i j is detained for use in Algorithm 1 at line 17.
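The computation of Algorithm 3 can be summarized by the following sketch (illustrative names; r and w are the per-machine read/write speeds, b the bandwidth matrix, and v the dataset volumes):

def data_migration_time(i, j, F, Psi, r, w, b, v):
    """Estimate DMT_ij: for every dataset required by t_i that is not stored on m_j,
    migrate it from the source machine m_l minimizing read + write + transfer time."""
    dmt = 0.0
    for k in range(len(F)):
        if F[k][i] == 1 and Psi[k][j] == 0:                 # d_k is required and remote
            times = [(1 / r[l] + 1 / w[j] + 1 / b[l][j]) * v[k]
                     for l in range(len(Psi[k])) if l != j and Psi[k][l] == 1]
            dmt += min(times)                               # cheapest replica location
    return dmt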

4.1.3. Data Access Time

It is mandatory that a task accesses and consumes its required data in order to complete its execution; otherwise, the task fails. The access time for the local consumption of all the data is denoted by D A T i j , as indicated in Algorithm 4. Once D A T i j has been calculated, its value is returned to Algorithm 1 so that it is taken into account in the response time R T i j .
Algorithm 4 Compute Data Access Time of ti placed in mj
Input:
  1: i: index of task for which we will estimate time to locally access its required data
  2: j: index of machine we assumed ti will be scheduled at
Output: D A T i j : Data Access Time of all the required datasets of t i in the local node m j
  3: function ComputeDataAccessTime( i , j )
  4:      D A T i j 0
  5:     for ( k D ) do
  6:         if  ( f k i = 1 ) then    // d k required by t i
  7:             D A T i j ← D A T i j + v k / r j
  8:         end if
  9:     end for
    return D A T i j
  10: end function

4.1.4. Delay Time

As mentioned previously, it is possible that a given machine m j does not fit t i due to insufficient storage space, RAM, or load of the machine. This incompatibility might be solved if the execution of the task is postponed. This type of scheduling is called Delay Scheduling.
The proposed OTS-DMDR technique is based on the delay method, which could lead to a better response time. Algorithm 5 employs the delay scheduling, which will allow the computation of the delay time ( Δ i j ) for the task t i until the resources of machine m j are available again.
The measurement of Δ i j is conducted as follows. First, we sort the tasks in the machine m j by their estimated finish time in ascending order. The sorting result is a sorted queue designated by Q j (line 4, Algorithm 5). Then, we go through each task in Q j to verify when the fitness of t i will be achieved (line 9). For each task t k in Q j , we obtain its remaining execution time ( R E T k ). R E T k is added to the delay time Δ i j (line 11); then, the RAM, storage capacity, and load of m j are updated in order to add back the value consumed by t k (lines 12 to 14). The goal is to check if this updated state will free enough RAM and/or storage and/or load on the machine m j so that it can receive the concerned task t i .
The process is repeated until the fitness of m j and t i is achieved (line 10). Finally, we receive the exact delay time Δ i j , which will be considered subsequently in the response time of task t i in m j in the main Algorithm 1 at line 13.
Algorithm 5 Compute Delay Time
Input:
  1: i: index of task for which fitness is checked
  2: j: index of machine whom we check the placement fitness
Output: Δ i j : Delay time so that m j is available to host t i
  3: function ComputeDelayTime( i , j )
  4:      Q j s o r t ( T P j ) // sort in ascending order by estimated finish time the tasks in m j // or by arrival time of assigned tasks to m j
  5:      Δ i j 0
  6:      n e w R A M [ m j ] R A M [ m j ]
  7:      n e w S c j s c j
  8:      n e w L o a d j L o a d j
  9:     for ( k Q j ) do
  10:         while  ( m j .overloaded = 1 || m j .underloaded = 1 || n e w R A M [ m j ] < R A M [ t i ] || n e w S c [ m j ] < V [ t i ] )  do
  11:             Δ i j ← Δ i j + R E T k
  12:             n e w R A M [ m j ] ← n e w R A M [ m j ] + R A M [ t k ]
  13:             n e w S c j ← n e w S c j + V [ t k ]
  14:             n e w L o a d j ← n e w L o a d j + l k / P P [ m j ]
  15:            if  ( n e w L o a d j ≤ L o a d m a x ) && ( n e w L o a d j ≥ L o a d m i n )  then
  16:                 m j . o v e r l o a d e d = 0
  17:                 m j . u n d e r l o a d e d = 0
  18:            end if
  19:         end while
  20:     end for
    return  Δ i j
  21: end function

4.2. Task to Machine Preference List

So far, we have been able to compute the response time matrix R T . In the current work, we aim to efficiently select a set of incoming tasks and assign them to the appropriate servers. Hence, we propose a preference list P L that captures the potential associations between tasks and the available machines.
To generate the preference list P L , we sort the elements of the matrix R T in ascending order. The elements of P L are represented by a triplet of task t i , machine m j , and their corresponding response time R T i j , as follows:
$$PL = \{pl_k\} = \{\langle t_i, m_j, RT_{ij} \rangle\}$$
where the first element ( p l 1 ) of the list P L corresponds to the lowest response time, obtained if we assign t i to m j . To better understand the process, we give an example in Figure 5.
The matrix R T gives the P L where the best assignment is represented by the lowest value, R T 23 = 1, obtained when t 2 is scheduled in m 3 , followed by placing t 5 in m 4 . The worst assignment is the highest value, 9, which happens if t 3 is assigned to m 3 for execution. Hence, in the following subsection, we present an efficient technique to select an optimal task-to-machine assignment based on P L .
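As a rough illustration (assuming the matrix R T is stored as a list of lists, with None marking machines that cannot host a task), the preference list could be generated as follows:

def generate_preference_list(RT):
    """Build PL: triplets (t_i, m_j, RT_ij) sorted by ascending response time."""
    triplets = [(i, j, RT[i][j])
                for i in range(len(RT)) for j in range(len(RT[i]))
                if RT[i][j] is not None]
    return sorted(triplets, key=lambda pl: pl[2])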

4.3. Tasks Selection

In this section, we select the set of tasks that must be scheduled on each of the machines. Since our work deals with online tasks, we always have a queue containing tasks that must be executed as soon as possible. For this reason, we have opted for the idea of selecting a set of tasks. Task selection allows us not only to choose the tasks with the best response times but also to take advantage of all the available machines. In this way, we are sure to achieve our goal of minimizing the response time and using the resources efficiently. The selection process of tasks is described in Algorithm 6 and illustrated by Figure 6.
For the task selection procedure, as a first step, we need the available machines M, the arrival tasks in the queue Q, and the preference list P L as input. Then, by going through the preference list P L (step 3 in Figure 6), we select the first element p l 1 (step 4), which has the lowest response time R T i j (line 9, Algorithm 6) and that happens when assigning t i to m j . After assigning t i to m j (step 5), m j is marked as the best assignment for t i as indicated in line 10.
The vector α is used to describe the indices of the final task assignments, i.e., α = ( α 1 , α 2 , … , α j ) , where α i is the index of the machine where t i is assigned. In other words, the best machine to host t i is m α i . After that, we perform four update operations:
  • The available machines M are updated by removing m j ;
  • The characteristics of m j are updated, i.e., the RAM occupied by t i is subtracted from the total RAM of m α i (line 12), the storage capacity of m α i is modified by deducting the volume of migrated data required by t i (line 13), and the load used by t i is added to the total load of m α i (line 14);
  • The queue Q of incoming tasks is updated by removing the assigned task t i ;
  • The preference list P L is updated (step 8, lines between 18 and 22) by removing all the triplets concerning the task t i or the machine m j .
Updating P L is required to avoid rescheduling an already assigned task and not to use a machine to which we have already assigned a task.
The whole process is repeated until the preference list is empty, which means either no available tasks are in the queue or all the machines were used for the tasks in the queue. In that case, we run the main Algorithm 1 to re-check the queue and repeat the computation of the response time matrix and so on.
To help understand how task selection operates, we illustrate it using the example in Figure 7 and Figure 8.
Algorithm 6 Tasks Selection
Input:
  1: M: Available machines in the system
  2: Q: Arrival tasks in the queue
  3: PL: Preference list issued by sorting RT
Output:
  4: Q: Updated Q
  5: α : Vector of the final assignment of selected tasks

  6: function SelectTasks( M , Q , P L )
  7:     while ( P L is not empty) do
  8:         for ( k P L ) do
  9:             s e l e c t e d P L ← p l 1 // p l 1 = ( t i , m j , R T i j ) is lowest response time
  10:             α i ← j    // the best placement of t i is m j
  11:             M ← M − { m j }    // update M
  12:             R A M [ m j ] ← R A M [ m j ] − R A M [ t i ]
  13:             s c j ← s c j − V [ t i ]
  14:             L o a d j ← L o a d j + l i / P P [ m j ]
  15:             T P j . a d d ( t i )
  16:             Q ← Q − { t i }    // update Q
  17:             P L ← P L − { p l 1 }    // update P L by removing the 1st element
  18:            for ( l P L ) do
  19:                if ( p l l . c o n t a i n s ( t i ) || p l l . c o n t a i n s ( m j ) ) then
  20:                     P L ← P L − { p l l }    // Update P L by removing elements with t i or m j in triplet
  21:                end if
  22:            end for
  23:         end for
  24:     end while
    return ( Q , α )
  25: end function
The example begins with the input of four available machines M, five incoming tasks in the queue Q, and the preference list P L generated in Figure 5. A first iteration takes effect to assign one of the tasks to the adequate server (see Iteration 1 in Figure 7). First, we select the first element of P L , which is the triplet ( t 2 , m 3 , 1 ) . This triplet provides the lowest response time in matrix R T and allows us to assign t 2 to m 3 with R T 23 = 1 . As a result, m 3 is removed from the available machines M, t 2 is deleted from the queue Q, and P L is updated by removing all the triplets containing either t 2 or m 3 . All updates are marked by red cross marks in the output box. A second iteration is conducted by taking as input the updated values of the available machines M, Q, and P L from Iteration 1. After selecting the first triplet ( t 5 , m 4 , 1 ) in P L , t 5 is assigned to m 4 with a response time of 1. The updates are completed by removing m 4 from M, t 5 from Q, and all the triplets concerning t 5 and m 4 from P L . In our example, the process is repeated until Iteration 4 (see Figure 8), where all the machines have been used ( M = { } ) and P L is empty. In this case, task t 4 remains unassigned and will be handled when the main process is repeated, starting from the computation of the matrix R T , once new tasks have been added to the arrival queue. Therefore, we can say that our proposed strategy OTS-DMDR assigned four incoming tasks out of five while minimizing their response time and maximizing the use of the system resources.
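The iterations described above follow the greedy loop sketched below, with the machines and the queue held in sets; this is an illustrative reading of Algorithm 6, not the paper's code:

def select_tasks(machines, queue, pl):
    """Repeatedly take the triplet with the lowest response time, record the assignment,
    and drop every remaining triplet mentioning the chosen task or machine."""
    alpha = {}
    while pl:
        t, m, _ = pl[0]                  # pl is sorted, so pl[0] has the lowest RT
        alpha[t] = m
        machines.discard(m)
        queue.discard(t)
        pl = [p for p in pl[1:] if p[0] != t and p[1] != m]
    return alpha, queue

# With the preference list of the example, the first two iterations assign t2 to m3 and
# t5 to m4; once every machine is used or no compatible pair remains, any leftover task
# (t4 here) stays in the queue for the next scheduling round.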

5. Simulation Setup and Result Analysis

In this section, we present the experiments performed to assess the effectiveness of the proposed scheduling algorithm OTS-DMDR. The following subsections present the performance metrics, the used benchmarks, the experimental setup, and a discussion of the obtained results.

5.1. Simulation Setup

Since the target system is a cloud computing environment, the evaluation of scheduling algorithms is crucial. However, experiments on real cloud platforms would be costly and challenging, especially when repeating experiments under the same circumstances in order to compare algorithms. As a result, a simulator is required to measure the performance of the proposed algorithms. To model and simulate cloud-based systems, we used CloudSim 3.0.3 [71,72], an extensible toolkit. Nevertheless, the CloudSim framework does not support data management features such as data storage, data migration, data replication, and remote data consumption. Due to these limitations, we extended CloudSim in our previous work [73] so that it can effectively address those needs.
For our experiments, we vary the number of machines between 5 and 100. Each machine is characterized by its CPU, RAM, storage capacity, and read/write speed. We also consider between 30 and 2000 tasks, where every task requires at least 1 and at most 10 datasets. The sizes of the datasets are uniformly distributed within the range [1–100 GB]. The overall configuration is depicted in Table 2.
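As an illustration of how such a configuration can be instantiated, the hypothetical Python helper below draws machines, datasets, and tasks from the ranges of Table 2. The field names are ours and the generator is only a sketch of the simulated workload, not the CloudSim extension itself.

```python
import random

def generate_workload(n_machines=100, n_tasks=2000, n_datasets=300):
    """Draw a synthetic configuration from the ranges listed in Table 2."""
    machines = [{"cpu_mips":   random.uniform(1000, 5000),
                 "ram_gb":     random.uniform(64, 2048),
                 "storage_tb": random.uniform(1, 25)}
                for _ in range(n_machines)]
    dataset_sizes_gb = [random.uniform(1, 100) for _ in range(n_datasets)]
    tasks = [{"length_mi": random.uniform(1000, 4000),
              "required":  random.sample(range(n_datasets), random.randint(1, 10))}
             for _ in range(n_tasks)]
    return machines, dataset_sizes_gb, tasks
```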
CloudSim offers both time-sharing and space-sharing techniques for resource allocation to tasks [71,72]. The appropriate technique can be selected by users depending on their specific requirements, such as performance, cost, and resource utilization, which significantly affects the overall efficiency and performance of cloud computing applications. Our proposed algorithm uses the time-sharing technique provided by CloudSim, allowing tasks to be executed in parallel: in time-shared mode, multiple task units (Cloudlets) can run concurrently within a machine.
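The snippet below gives a simplified processor-sharing view of this time-shared mode for tasks started together on one machine: each running task receives an equal share of the machine's MIPS, so a task's execution stretches with the number of concurrent tasks. This is an illustrative model only, not CloudSim's actual scheduler code.

```python
def time_shared_finish_times(lengths_mi, mips):
    """Finish times of tasks started simultaneously on one machine whose
    MIPS are shared equally among the tasks still running (processor sharing)."""
    finish, elapsed, done = {}, 0.0, 0.0
    ordered = sorted(enumerate(lengths_mi), key=lambda x: x[1])
    for rank, (task_id, length) in enumerate(ordered):
        still_running = len(ordered) - rank     # tasks not yet finished
        elapsed += (length - done) * still_running / mips
        finish[task_id] = elapsed
        done = length
    return finish

# Two tasks of 1000 and 4000 MI on a 2000-MIPS machine:
# the shorter finishes at 1.0 s, the longer at 2.5 s.
print(time_shared_finish_times([1000, 4000], 2000))
```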
We generate various scenarios, run 100 executions for each, and use the average as our final measurement.
We would like to mention that the initial data placement is conducted based on the max–max algorithm [74]: the dataset with the largest size is placed in the storage with the maximum remaining capacity.
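For illustration, a minimal sketch of this placement rule is given below: datasets are taken from largest to smallest and each one is stored on the machine with the most remaining capacity. It reflects only our reading of the max–max rule described above, not the algorithm of [74] itself.

```python
def max_max_placement(dataset_sizes_gb, storage_capacities_gb):
    """Assign each dataset, largest first, to the machine that currently
    has the most remaining storage (max-max initial placement rule)."""
    remaining = list(storage_capacities_gb)
    placement = {}
    for ds_id, size in sorted(enumerate(dataset_sizes_gb),
                              key=lambda item: item[1], reverse=True):
        host = max(range(len(remaining)), key=lambda j: remaining[j])
        placement[ds_id] = host
        remaining[host] -= size
    return placement

# Example: datasets of 80, 10 and 50 GB over machines with 100 and 120 GB free.
print(max_max_placement([80, 10, 50], [100, 120]))  # -> {0: 1, 2: 0, 1: 0}
```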
We compare our proposed task scheduling strategy (OTS-DMDR) with four other scheduling algorithms: FCFS [52], the traditional scheduling algorithm that schedules tasks based on their arrival time, i.e., the first to arrive is the first to be executed; Delay Scheduling [63], which delays the execution of a task in order to assign it to the server achieving data locality; and the method of Li et al. [28], which compromises between waiting time and data migration costs. Finally, we compare OTS-DMDR with a variant of our algorithm that does not consider data replication, which we name Online Task Scheduling based on Data Migration (OTS-DM).

5.2. Performance Metrics

To quantitatively evaluate the performance of the OTS-DMDR algorithm and compare its effectiveness with other algorithms in the literature, we need to use a variety of metrics, which are listed below.

5.2.1. Response Time (RT)

The time needed for a task to finish its execution. The response time comprises the following stages (see also Figure 2):
  • Scheduling Time (ST);
  • Delay Time ( Δ );
  • Waiting Time (WT);
  • Data Migration Time (DMT);
  • Data Access Time (DAT);
  • Execution Time (ET);
  • Total Execution Time (TET).

5.2.2. Throughput

The number of tasks that can be processed by the whole system within a time slot.

5.2.3. Degree of Imbalance (DI)

Calculates the imbalance across all of the machines using Equation (13):
DI = ( |M| × ( VET_max − VET_min ) ) / OET    (13)
where |M| is the total number of machines, VET_max and VET_min are, respectively, the maximum and minimum total execution times among all machines, and OET is the overall execution time of all machines, calculated as follows:
OET = Σ_{j ∈ M} VET_j
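As a small illustration of how this load-balancing figure is obtained, the sketch below computes DI directly from the per-machine total execution times; the function name and input format are ours.

```python
def degree_of_imbalance(vet):
    """Degree of Imbalance of Equation (13): |M| * (VET_max - VET_min) / OET,
    where OET is the sum of the per-machine total execution times VET."""
    oet = sum(vet)
    return len(vet) * (max(vet) - min(vet)) / oet

# Four machines with total execution times of 90, 100, 110 and 100 time units:
print(degree_of_imbalance([90, 100, 110, 100]))  # 4 * 20 / 400 = 0.2
```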

5.3. Result Analysis

5.3.1. Experiment 1: Task Variation

In the first experiment, we aim to measure the impact of varying the number of incoming tasks arriving in the queue within the same time slot on the aforementioned time metrics (response time, migration time, waiting time, etc.). For that, we fix the number of machines M to 100, let the number of required datasets RD vary within the range [1–10], and let the number of incoming tasks T take the values 500, 1000, 1500, and 2000.
Figure 9 presents a comparison between the different scheduling algorithms in terms of the average response time RT, where the x-axis represents the number of tasks and the y-axis the measured average response time for a given number of tasks.
We can see from the results in Figure 9 that the proposed algorithms OTS-DMDR and OTS-DM outperform the other scheduling strategies in all test cases, showing a considerable reduction in average response time, particularly for larger numbers of tasks (1500 and 2000 tasks). The Li et al. [28] and Delay Scheduling methods are competitive only for 500 and 1000 tasks, while the FCFS method performs poorly in all cases.
To investigate the performance of each method in more detail, we chose the test case of 2000 tasks and computed the time spent in each stage (namely, ST, Δ, DMT, DAT, and ET). Figure 10 gives the percentages of each stage for each tested method.
The FCFS method dedicates more than 41% of the response time to data migration. This is explained by the fact that FCFS considers neither data locality nor machine performance when scheduling tasks. Moreover, since FCFS assigns incoming tasks on a first come, first served basis, the time between scheduling and starting the execution of every task is very low and can be considered 0 (Δ = 0). In contrast, the scheduling time is quite high (ST = 7%) because arriving tasks may not be immediately scheduled due to the unavailability of machines.
Since the Delay Scheduling method is based on delaying tasks in order to achieve data locality and does not plan any data migration, we can clearly see in Figure 10 that the percentage of waiting time is significant (21%) and helps to gain local data accesses (representing only 18% of the response time). In addition, when migration is eventually triggered to avoid starvation, remote data is not gathered efficiently; thus, the response time is dominated by data migration, at 31%.
The response time of Li et al. [28] is dominated by data migration time and execution time, with rates of 36% and 25%, respectively. For the OTS-DM method, the data migration rate decreases to 31%, while the execution time slightly increases to 27%. The reduction of data migration in the proposed OTS-DM method is due to the strategy of choosing the best location from which to pull the data. For OTS-DMDR, 41% of the total time is consumed by execution time (ET), while only 25% is consumed by data migration. One can note a substantial decrease in DMT in comparison with all of the existing strategies, which is explained by replicating datasets across the machines, as explained in Section 4.1.1.
Finally, from this experiment, we can conclude that the proposed strategy OTS-DMDR presents a significant improvement in terms of response time and can be very useful for online task scheduling for big data applications that involve both small and large numbers of tasks. In addition, the decision to schedule some tasks on the most appropriate machine can be determined based on a compromise between data locality and data migration cost, while considering data replication and delay scheduling cost, thus yielding an optimal response time with lower data transfer.
One major advantage of data replication is that it helps to address data accessibility and availability. By having multiple replicas of data across multiple machines, it is possible to leverage multi-task processing and accelerate the execution process. This can be particularly useful for large-scale machine learning problems or for training on big datasets [6].

5.3.2. Experiment 2: Machine Variation

In contrast to the previous experiment, in this scenario, we fix the number of tasks T to 2000, while the number of machines M takes values 25, 50, 100, 250, and 500. The purpose of this experiment is to examine the scheduling behavior of the algorithms under different system configurations.
Figure 11 plots the average response time for various numbers of machines and the different scheduling methods. The results demonstrate that our proposed algorithm OTS-DMDR outperforms the other scheduling techniques in all test scenarios, providing a significant reduction of the average response time, particularly for a higher number of machines (500 machines). The OTS-DM, Li et al. [28], and Delay Scheduling algorithms perform comparably well, whereas the FCFS approach yields consistently poor results.
To further compare the effectiveness of each technique, a detailed analysis is performed in terms of the percentage rate of each metric (WT, DMT, DAT, and ET) relative to the total response time RT. For this, we select the case of 100 machines. In this respect, Figure 12 shows the percentage rates for all the tested algorithms.
As can be observed from the results of the OTS-DMDR strategy, the execution time takes 45% of the total response time, while migration time consumes only 25% of it. Since OTS-DMDR is based on a trade-off between optimizing data locality, delay scheduling, and data migration relying on data replication, it achieves this low data migration rate of 25% as well as a small waiting time (5%) compared to the other techniques.
For the OTS-DM strategy, the migration time dominates (36%). This is a significant difference compared to OTS-DMDR because OTS-DM does not consider the replicated data while moving it. The method of Li et al. [28] shows results similar to the OTS-DM strategy for all metrics.
In the Delay Scheduling strategy, the waiting time is greater than in the other strategies (17%) since the method is based on delaying tasks in order to achieve better data locality (22%). The DMT rate still has a noticeable value (30%) in comparison with OTS-DMDR.
Finally, for the FCFS method, we can clearly see that the data migration time again dominates the response time, with a percentage of 45%. The reason is that FCFS assigns tasks without considering either data locality or data movement.
From the results of Figure 11 and Figure 12, we can notice a strong relationship between the average response time of tasks and the number of machines: as the number of machines increases, the average response time decreases. Moreover, OTS-DMDR performs competitively with the OTS-DM, Li et al. [28], and Delay Scheduling methods for a higher number of machines, while for a lower number of machines, the proposed OTS-DMDR gives significantly better results than all of the existing algorithms. Overall, the proposed OTS-DMDR algorithm shows sufficient performance to be used as an alternative task scheduling algorithm for big data systems.

5.3.3. Experiment 3: Datasets Variation

In this scenario, we vary the number of required datasets to see how it impacts the total response time. We use three different scales: (a) a small number of required datasets [1–5], (b) a medium number of required datasets [5–10], and (c) a large number of required datasets [10–20]. Figure 13 depicts the average response time for the three scales and each scheduling strategy.
We can see that our proposed algorithm OTS-DMDR performs best throughout the three experiments. Furthermore, one can observe a competitive performance between OTS-DM and Li et al. [28], with a slight advantage for the proposed OTS-DM. However, both the Delay Scheduling and FCFS methods give higher response times for all scales.
The corresponding percentage response times of this experiment are reported in Figure 14a–c. We observe that, even though the number of required datasets changes, our method OTS-DMDR consistently gives the lowest DMT percentage rate, while a comparable performance is observed between OTS-DM and Li et al. [28] across the three scales. For Delay Scheduling, the WT percentage is higher than for the other methods. Similar to previous results, the FCFS response time is mostly spent migrating data.
Our method, OTS-DMDR, consistently produces the lowest DMT percentage rate regardless of the changing data requirements. This is mainly due to data replication, which creates new copies of data in the system. As a result, scheduling tasks based on data migration and reusing replicated data offers benefits such as enhanced data availability, improved data locality, and decreased response time for incoming tasks.
Finally, we conclude that the results obtained by our proposed algorithm OTS-DMDR are more stable than those generated by the other scheduling algorithms. Furthermore, OTS-DMDR is applicable to different scales of required datasets, which confirms the usefulness of the proposed task scheduling and validates the theoretical algorithm design developed in this paper.

5.3.4. Experiment 4: Tasks Arriving over 100 Time Slots

In this scenario, we evaluate the throughput metric, i.e., the percentage of tasks completed within a given time slot, and we also analyze the load balancing of our system using the Degree of Imbalance (DI) metric. For this, 2000 tasks are executed on 100 machines, with tasks arriving over 100 time slots. The throughput achieved at each time slot is indicated for each point in Figure 15.
As expected, FCFS did not perform well. Meanwhile, the Delay Scheduling, Li et al., and OTS-DM methods were comparable and gave acceptable results. OTS-DMDR achieved the best performance due to the efficient management of data replication throughout task scheduling.
Figure 16 shows the degree of imbalance for the FCFS, Delay Scheduling, Li et al., OTS-DM, and OTS-DMDR algorithms, where a lower value of DI indicates a higher load-balancing performance.
According to the reported results, the degree of imbalance of the proposed OTS-DMDR algorithm has the smallest value. Since the OTS-DMDR strategy considers the load of each machine when assigning tasks, it avoids imbalanced workload situations.
A promising direction for future work is to combine the proposed algorithm with meta-heuristic algorithms in order to further enhance the scheduling results by choosing optimal weights for the parameters involved in the objective function, as discussed in [59,61].

6. Conclusions

Big data analytics tasks are now feasible due to advances in internet technology and the use of cloud data centers. However, managing these data-intensive tasks is challenging, especially in dynamic cloud environments. To address this challenge, it is essential to consider data aspects when designing task scheduling algorithms. This paper introduces a new method named Online Task Scheduling based on Data Migration and Data Replication (OTS-DMDR). It considers various metrics to select the appropriate task for the appropriate machine, including data access time, data migration time, task requirements, and the computational power and load of the machines. By combining data migration and data replication features with delay scheduling, the OTS-DMDR method achieves better data locality, minimizes the response time, and improves the task throughput.
Accordingly, extensive simulations are carried out to demonstrate the validity of our proposed OTS-DMDR method. The results show that the proposed OTS-DMDR method outperforms existing scheduling techniques, reducing response time by 78% when compared to the First Come First Served (FCFS) scheduler, by 58% compared to the Delay Scheduling, and by 46% compared to the technique of Li et al.—all of this while ensuring a balanced load over the machines. Consequently, this demonstrates the effectiveness and convenience of the proposed approach for the problem of online task scheduling.
The study on online task scheduling combined with data migration and replication in the cloud presents an important research implication for the development of efficient task scheduling algorithms for data-intensive applications. The study’s findings indicate the importance of considering data locality in task scheduling, which can be further explored in future research. Furthermore, as future work, it will be important to investigate how to dynamically place the initial datasets and handle data replicas in order to enhance the performance of the system. In conclusion, it can be inferred that the performance of the proposed OTS-DMDR algorithm is adequate to be utilized as an alternative online task scheduling algorithm for big data systems.

Author Contributions

All authors contributed equally to this work. L.B., M.Z. and C.T. designed and performed the experiments and prepared the manuscript. L.B., M.Z. and C.T. supervised the work and contributed to the writing of the paper. All authors read and approved the final manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

The authors thankfully acknowledge the Smart Systems Laboratory (SSLab), ENSIAS, for its support in achieving this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Barika, M.; Garg, S.; Zomaya, A.Y.; Wang, L.; Moorsel, A.V.; Ranjan, R. Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions. ACM Comput. Surv. 2019, 52, 1–41. [Google Scholar] [CrossRef] [Green Version]
  2. Rjoub, G.; Bentahar, J.; Wahab, O.A. BigTrustScheduling: Trust-aware big data task scheduling approach in cloud computing environments. Future Gener. Comput. Syst. 2020, 110, 1079–1097. [Google Scholar] [CrossRef]
  3. Cao, K.; Liu, Y.; Meng, G.; Sun, Q. An Overview on Edge Computing Research. IEEE Access 2020, 8, 85714–85728. [Google Scholar] [CrossRef]
  4. Petrolo, R.; Loscrì, V.; Mitton, N. Towards a smart city based on cloud of things, a survey on the smart city vision and paradigms. Trans. Emerg. Telecommun. Technol. 2017, 28, e2931. [Google Scholar] [CrossRef] [Green Version]
  5. Fedushko, S.; Ustyianovych, T.; Syerov, Y.; Peracek, T. User-Engagement Score and SLIs/SLOs/SLAs Measurements Correlation of E-Business Projects Through Big Data Analysis. Appl. Sci. 2020, 10, 9112. [Google Scholar] [CrossRef]
  6. Zhang, C.; Li, M.; Wu, D. Federated Multidomain Learning With Graph Ensemble Autoencoder GMM for Emotion Recognition. IEEE Trans. Intell. Transp. Syst. 2022, 1–11. [Google Scholar] [CrossRef]
  7. Luo, X.; Zhang, C.; Bai, L. A fixed clustering protocol based on random relay strategy for EHWSN. Digit. Commun. Netw. 2023, 9, 90–100. [Google Scholar] [CrossRef]
  8. Chen, H.; Wen, J.; Pedrycz, W.; Wu, G. Big Data Processing Workflows Oriented Real-Time Scheduling Algorithm using Task-Duplication in Geo-Distributed Clouds. IEEE Trans. Big Data 2020, 6, 131–144. [Google Scholar] [CrossRef]
  9. Arunarani, A.; Manjula, D.; Sugumaran, V. Task scheduling techniques in cloud computing: A literature survey. Future Gener. Comput. Syst. 2019, 91, 407–415. [Google Scholar] [CrossRef]
  10. Amini Motlagh, A.; Movaghar, A.; Rahmani, A.M. Task scheduling mechanisms in cloud computing: A systematic review. Int. J. Commun. Syst. 2020, 33, e4302. [Google Scholar] [CrossRef]
  11. Kumar, M.; Sharma, S.; Goel, A.; Singh, S. A comprehensive survey for scheduling techniques in cloud computing. J. Netw. Comput. Appl. 2019, 143, 1–33. [Google Scholar] [CrossRef]
  12. Liu, J.; Pacitti, E.; Valduriez, P. A Survey of Scheduling Frameworks in Big Data Systems. Int. J. Cloud Comput. 2018, 7, 103–128. [Google Scholar] [CrossRef]
  13. Gautam, J.V.; Prajapati, H.B.; Dabhi, V.K.; Chaudhary, S. A survey on job scheduling algorithms in Big data processing. In Proceedings of the 2015 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 5–7 March 2015; pp. 1–11. [Google Scholar] [CrossRef]
  14. Mishra, S.K.; Puthal, D.; Sahoo, B.; Jena, S.K.; Obaidat, M.S. An adaptive task allocation technique for green cloud computing. J. Supercomput. 2017, 74, 370–385. [Google Scholar] [CrossRef]
  15. Stavrinides, G.L.; Karatza, H.D. Scheduling Data-Intensive Workloads in Large-Scale Distributed Systems: Trends and Challenges. In Modeling and Simulation in HPC and Cloud Systems; Kołodziej, J., Pop, F., Dobre, C., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 19–43. [Google Scholar] [CrossRef]
  16. Yang, C.; Huang, Q.; Li, Z.; Liu, K.; Hu, F. Big Data and cloud computing: Innovation opportunities and challenges. Int. J. Digit. Earth 2017, 10, 13–53. [Google Scholar] [CrossRef] [Green Version]
  17. Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Ullah Khan, S. The rise of “big data” on cloud computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115. [Google Scholar] [CrossRef]
  18. Mazumdar, S.; Seybold, D.; Kritikos, K.; Verginadis, Y. A survey on data storage and placement methodologies for Cloud-Big Data ecosystem. J. Big Data 2019, 6, 1–37. [Google Scholar] [CrossRef] [Green Version]
  19. Natesan, G.; Chokkalingam, A. Task scheduling in heterogeneous cloud environment using mean grey wolf optimization algorithm. ICT Express 2019, 5, 110–114. [Google Scholar] [CrossRef]
  20. Jafarnejad Ghomi, E.; Masoud Rahmani, A.; Nasih Qader, N. Load-balancing algorithms in cloud computing: A survey. J. Netw. Comput. Appl. 2017, 88, 50–71. [Google Scholar] [CrossRef]
  21. Alami Milani, B.; Jafari Navimipour, N. A comprehensive review of the data replication techniques in the cloud environments: Major trends and future directions. J. Netw. Comput. Appl. 2016, 64, 229–238. [Google Scholar] [CrossRef]
  22. Ahmad, N.; Che Fauzi, A.A.; Sidek, R.; Zin, N.; Beg, A. Lowest Data Replication Storage of Binary Vote Assignment Data Grid. Commun. Comput. Inf. Sci. 2010, 88, 466–473. [Google Scholar] [CrossRef]
  23. Mohammadi, B.; Navimipour, N.J. Data replication mechanisms in the peer-to-peer networks. Int. J. Commun. Syst. 2019, 32, e3996. [Google Scholar] [CrossRef]
  24. Campêlo, R.A.; Casanova, M.A.; Guedes, D.O.; Laender, A.H.F. A Brief Survey on Replica Consistency in Cloud Environments. J. Internet Serv. Appl. 2020, 11, 1. [Google Scholar] [CrossRef] [Green Version]
  25. Long, S.Q.; Zhao, Y.L.; Chen, W. MORM: A Multi-objective Optimized Replication Management strategy for cloud storage cluster. J. Syst. Archit. 2014, 60, 234–244. [Google Scholar] [CrossRef]
  26. Mokadem, R.; Hameurlain, A. A data replication strategy with tenant performance and provider economic profit guarantees in Cloud data centers. J. Syst. Softw. 2020, 159, 110447. [Google Scholar] [CrossRef]
  27. Wang, D.; Chen, J.; Zhao, W. A Task Scheduling Algorithm for Hadoop Platform. J. Comput. 2013, 8, 929–936. [Google Scholar] [CrossRef]
  28. Li, X.; Wang, L.; Lian, Z.; Qin, X. Migration-based Online CPSCN Big Data Analysis in Data Centers. IEEE Access 2018, 6, 19270–19277. [Google Scholar] [CrossRef]
  29. Dubey, K.; Kumar, M.; Sharma, S. Modified HEFT Algorithm for Task Scheduling in Cloud Environment. Procedia Comput. Sci. 2018, 125, 725–732. [Google Scholar] [CrossRef]
  30. Mondal, R.; Nandi, E.; Sarddar, D. Load Balancing Scheduling with Shortest Load First. Int. J. Grid Distrib. Comput. 2015, 8, 171–178. [Google Scholar] [CrossRef]
  31. Lakra, A.V.; Yadav, D.K. Multi-Objective Tasks Scheduling Algorithm for Cloud Computing Throughput Optimization. Procedia Comput. Sci. 2015, 48, 107–113. [Google Scholar] [CrossRef] [Green Version]
  32. Wang, H.; Wang, F.; Liu, J.; Wang, D.; Groen, J. Enabling customer-provided resources for cloud computing: Potentials, challenges, and implementation. IEEE Trans. Parallel Distrib. Syst. 2015, 26, 1874–1886. [Google Scholar] [CrossRef]
  33. Gill, S.S.; Chana, I.; Singh, M.; Buyya, R. CHOPPER: An intelligent QoS-aware autonomic resource management approach for cloud computing. Clust. Comput. 2018, 21, 1203–1241. [Google Scholar] [CrossRef]
  34. Thomas, A.; Krishnalal, G.; Raj, P.V. Credit Based Scheduling Algorithm in Cloud Computing Environment. Procedia Comput. Sci. 2015, 46, 913–920. [Google Scholar] [CrossRef] [Green Version]
  35. Sajid, M.; Raza, Z. Turnaround Time Minimization-Based Static Scheduling Model Using Task Duplication for Fine-Grained Parallel Applications onto Hybrid Cloud Environment. IETE J. Res. 2015, 62, 402–414. [Google Scholar] [CrossRef]
  36. Hadji, M.; Zeghlache, D. Minimum Cost Maximum Flow Algorithm for Dynamic Resource Allocation in Clouds. In Proceedings of the 2012 IEEE Fifth International Conference on Cloud Computing, Honolulu, HI, USA, 24–29 June 2012; pp. 876–882. [Google Scholar] [CrossRef]
  37. Elzeki, O.; Reshad, M.; Abu Elsoud, M. Improved Max-Min Algorithm in Cloud Computing. Int. J. Comput. Appl. 2012, 50, 22–27. [Google Scholar] [CrossRef]
  38. Fernández Cerero, D.; Fernández-Montes, A.; Jakóbik, A.; Kołodziej, J.; Toro, M. SCORE: Simulator for cloud optimization of resources and energy consumption. Simul. Model. Pract. Theory 2018, 82, 160–173. [Google Scholar] [CrossRef]
  39. Ma, T.; Chu, Y.; Zhao, L.; Otgonbayar, A. Resource Allocation and Scheduling in Cloud Computing: Policy and Algorithm. IETE Tech. Rev. 2014, 31, 4–16. [Google Scholar] [CrossRef]
  40. Carrasco, R.; Iyengar, G.; Stein, C. Resource Cost Aware Scheduling. Eur. J. Oper. Res. 2018, 269, 621–632. [Google Scholar] [CrossRef] [Green Version]
  41. Coninck, E.; Verbelen, T.; Vankeirsbilck, B.; Bohez, S.; Simoens, P.; Dhoedt, B. Dynamic Auto-scaling and Scheduling of Deadline Constrained Service Workloads on IaaS Clouds. J. Syst. Softw. 2016, 118, 101–114. [Google Scholar] [CrossRef]
  42. Yi, P.; Ding, H.; Ramamurthy, B. Budget-Minimized Resource Allocation and Task Scheduling in Distributed Grid/Clouds. In Proceedings of the 2013 22nd International Conference on Computer Communication and Networks (ICCCN), Nassau, Bahamas, 30 July–2 August 2013; pp. 1–8. [Google Scholar] [CrossRef]
  43. Reddy, G. A Deadline and Budget Constrained Cost and Time Optimization Algorithm for Cloud Computing. Commun. Comput. Inf. Sci. 2011, 193, 455–462. [Google Scholar] [CrossRef]
  44. Xin, Y.; Xie, Z.Q.; Yang, J. A load balance oriented cost efficient scheduling method for parallel tasks. J. Netw. Comput. Appl. 2017, 81, 37–46. [Google Scholar] [CrossRef]
  45. Yang, S.J.; Chen, Y.R. Design adaptive task allocation scheduler to improve MapReduce performance in heterogeneous Clouds. J. Netw. Comput. Appl. 2015, 57, 61–70. [Google Scholar] [CrossRef]
  46. Smara, M.; Aliouat, M.; Pathan, A.S.; Aliouat, Z. Acceptance Test for Fault Detection in Component-based Cloud Computing and Systems. Future Gener. Comput. Syst. 2016, 70, 74–93. [Google Scholar] [CrossRef]
  47. Fan, G.; Chen, L.; Yu, H.; Liu, D. Modeling and Analyzing Dynamic Fault-Tolerant Strategy for Deadline Constrained Task Scheduling in Cloud Computing. IEEE Trans. Syst. Man Cybern. Syst. 2017, 50, 1260–1274. [Google Scholar] [CrossRef]
  48. Zhou, Z.; Abawajy, J.; Chowdhury, M.; Hu, Z.; Li, K.; Cheng, H.; Alelaiwi, A.; Li, F. Minimizing SLA violation and power consumption in Cloud data centers using adaptive energy-aware algorithms. Future Gener. Comput. Syst. 2017, 86, 836–850. [Google Scholar] [CrossRef]
  49. Pradhan, R.; Satapathy, S. Energy-Aware Cloud Task Scheduling algorithm in heterogeneous multi-cloud environment. Intell. Decis. Technol. 2022, 16, 279–284. [Google Scholar] [CrossRef]
  50. Chen, H.; Liu, G.; Yin, S.; Liu, X.; Qiu, D. ERECT: Energy-Efficient Reactive Scheduling for Real-Time Tasks in Heterogeneous Virtualized Clouds. J. Comput. Sci. 2017, 28, 416–425. [Google Scholar] [CrossRef]
  51. Duan, H.; Chen, C.; Min, G.; Wu, Y. Energy-aware scheduling of virtual machines in heterogeneous cloud computing systems. Future Gener. Comput. Syst. 2017, 74, 142–150. [Google Scholar] [CrossRef]
  52. Shaikh, M.B.; Waghmare Shinde, K.; Borde, S. Challenges of Big Data Processing and Scheduling of Processes Using Various Hadoop Schedulers: A Survey. Int. J. Multifaceted Multiling. Stud. 2019, III, 1–6. [Google Scholar]
  53. Mohapatra, S.; Mohanty, S.; Rekha, K. Analysis of Different Variants in Round Robin Algorithms for Load Balancing in Cloud Computing. Int. J. Comput. Appl. 2013, 69, 17–21. [Google Scholar] [CrossRef] [Green Version]
  54. Li, R.; Hu, H.; Li, H.; Wu, Y.; Yang, J. MapReduce Parallel Programming Model: A State-of-the-Art Survey. Int. J. Parallel Program. 2016, 44, 832–866. [Google Scholar] [CrossRef]
  55. Shyam, G.K.; Manvi, S.S. Resource allocation in cloud computing using agents. In Proceedings of the 2015 IEEE International Advance Computing Conference (IACC), Banglore, India, 12–13 June 2015; pp. 458–463. [Google Scholar] [CrossRef]
  56. Zhao, Q.; Xiong, C.; Yu, C.; Zhang, C.; Zhao, X. A new energy-aware task scheduling method for data-intensive applications in the cloud. J. Netw. Comput. Appl. 2016, 59, 14–27. [Google Scholar] [CrossRef]
  57. Dubey, K.; Kumar, M.; Chandra, M.A. A priority based job scheduling algorithm using IBA and EASY algorithm for cloud metaschedular. In Proceedings of the 2015 International Conference on Advances in Computer Engineering and Applications, Ghaziabad, India, 19–20 March 2015; pp. 66–70. [Google Scholar] [CrossRef]
  58. Nasr, A.A.; El-Bahnasawy, N.A.; Attiya, G.; El-Sayed, A. A new online scheduling approach for enhancing QOS in cloud. Future Comput. Inform. J. 2018, 3, 424–435. [Google Scholar] [CrossRef]
  59. Reddy, G.; Kumar, S. MACO-MOTS: Modified Ant Colony Optimization for Multi Objective Task Scheduling in Cloud Environment. Int. J. Intell. Syst. Appl. 2019, 11, 73–79. [Google Scholar] [CrossRef]
  60. Biswas, D.; Samsuddoha, M.; Asif, M.R.A.; Ahmed, M.M. Optimized Round Robin Scheduling Algorithm Using Dynamic Time Quantum Approach in Cloud Computing Environment. Int. J. Intell. Syst. Appl. 2023, 15, 22–34. [Google Scholar] [CrossRef]
  61. Soltani, N.; Barekatain, B.; Soleimani Neysiani, B. MTC: Minimizing Time and Cost of Cloud Task Scheduling based on Customers and Providers Needs using Genetic Algorithm. Int. J. Intell. Syst. Appl. 2021, 13, 38–51. [Google Scholar] [CrossRef]
  62. Mohseni, Z.; Kiani, V.; Rahmani, A. A Task Scheduling Model for Multi-CPU and Multi-Hard Disk Drive in Soft Real-time Systems. Int. J. Inf. Technol. Comput. Sci. 2019, 11, 1–13. [Google Scholar] [CrossRef]
  63. Zaharia, M.; Borthakur, D.; Sen Sarma, J.; Elmeleegy, K.; Shenker, S.; Stoica, I. Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling; EuroSys’10; Association for Computing Machinery: New York, NY, USA, 2010; pp. 265–278. [Google Scholar] [CrossRef]
  64. He, C.; Lu, Y.; Swanson, D. Matchmaking: A New MapReduce Scheduling Technique. In Proceedings of the 2011 IEEE Third International Conference on Cloud Computing Technology and Science, Athens, Greece, 29 November–1 December 2011; pp. 40–47. [Google Scholar] [CrossRef] [Green Version]
  65. Kosar, T.; Balman, M. A new paradigm: Data-aware scheduling in grid computing. Future Gener. Comput. Syst. 2009, 25, 406–413. [Google Scholar] [CrossRef]
  66. Vobugari, S.; Somayajulu, D.V.L.N.; Subaraya, B.M. Dynamic Replication Algorithm for Data Replication to Improve System Availability: A Performance Engineering Approach. IETE J. Res. 2015, 61, 132–141. [Google Scholar] [CrossRef]
  67. Bouhouch, L.; Zbakh, M.; Tadonki, C. A Big Data Placement Strategy in Geographically Distributed Datacenters. In Proceedings of the 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), Marrakesh, Morocco, 24–26 November 2020; pp. 1–9. [Google Scholar] [CrossRef]
  68. Bouhouch, L.; Zbakh, M.; Tadonki, C. Dynamic data replication and placement strategy in geographically distributed data centers. Concurr. Comput. Pract. Exp. 2022. early view. [Google Scholar] [CrossRef]
  69. Mohamed, A.; Najafabadi, M.K.; Wah, Y.B.; Zaman, E.A.K.; Maskat, R. The state of the art and taxonomy of big data analytics: View from new big data framework. Artif. Intell. Rev. 2020, 53, 989–1037. [Google Scholar] [CrossRef]
  70. Samadi, Y.; Zbakh, M.; Tadonki, C. DT-MG: Many-to-one matching game for tasks scheduling towards resources optimization in cloud computing. Int. J. Comput. Appl. 2021, 43, 233–245. [Google Scholar] [CrossRef]
  71. Calheiros, R.N.; Ranjan, R.; Beloglazov, A.; De Rose, C.A.F.; Buyya, R. CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw. Pract. Exp. 2011, 41, 23–50. [Google Scholar] [CrossRef]
  72. Calheiros, R.; Ranjan, R.; De Rose, C.; Buyya, R. CloudSim: A Novel Framework for Modeling and Simulation of Cloud Computing Infrastructures and Services. arXiv 2009, arXiv:0903.2525. [Google Scholar]
  73. Bouhouch, L.; Zbakh, M.; Tadonki, C. Data Migration: Cloudsim Extension. In Proceedings of the ICBDR 2019: 2019 the 3rd International Conference on Big Data Research, Cergy-Pontoise, France, 20–22 November 2019; pp. 177–181. [Google Scholar] [CrossRef]
  74. Niznik, C.A. Min-max vs. max-min flow control algorithms for optimal computer network capacity assignment. J. Comput. Appl. Math. 1984, 11, 209–224. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Example to model our proposed scheduling technique OTS-DMDR, where (a–c) depict, respectively, the configuration of the system, the first iteration of the execution, and the second iteration of the execution.
Figure 2. Response Time Scheme.
Figure 3. Flowchart of our proposed scheduling strategy.
Figure 4. Flowchart of computing response time matrix for incoming tasks in the queue.
Figure 5. Example of generating the preference list.
Figure 6. Flowchart of task selection process.
Figure 7. Example of task selection (iterations 1 and 2), where the selected elements are highlighted in green and the deleted elements are highlighted in red.
Figure 8. Example of task selection (iterations 3 and 4), where the selected elements are highlighted in green.
Figure 9. Average Response Time for Task Variation of the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52].
Figure 10. Percentage Response Time for Tasks Variation of the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52].
Figure 11. Average Response Time for Machine Variation of the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52].
Figure 12. Percentage Response Time for Machines Variation of the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52].
Figure 13. Response Time of Dataset Variation for the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52], where (a) is for a small number of required data, (b) is for a medium number of required data and (c) is for a large number of required data.
Figure 14. Percentage Response Time of Dataset Variation for the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52], where (a) is for a small number of required data, (b) is for a medium number of required data and (c) is for a large number of required data.
Figure 15. Throughput [28].
Figure 16. Degree of Imbalance for the proposed methods OTS-DM and OTS-DMDR, compared to Li et al. [28], Delay Scheduling [63] and FCFS [52].
Table 1. Summary of scheduling methods in the literature.

| Method | Technique | Advantages | Limitations | Parameters |
| --- | --- | --- | --- | --- |
| First Come First Served [52] | The last arrived tasks must wait until the end of the execution of earlier ones. | Simple to implement and efficient. | Increases the waiting time of tasks; task size not considered. Imbalanced load and decreased data locality. | – |
| Shortest Job First [30] | Chooses the shortest task to be executed first. | Reduces execution time in comparison with FCFS and RR. | Uneven load distribution on the servers. | Execution time. Response time. |
| Round Robin [53] | Circularly distributes tasks. | An equal amount of CPU time is given to every task. | A higher average waiting time. | – |
| Shyam and Manvi [55] | VM migration. | Maximizes resource usage while minimizing time and budget. | Needs more agents for searching the best resource. | Execution time. Makespan time. Response time. Resource utilization. |
| Wang et al. [32] | Dynamic resource provisioning. | Considers resources such as CPU, memory and storage. | Data locality and replication not addressed. | Execution cost. Availability. |
| Zhao et al. [56] | Energy-efficient technique where datasets and tasks are treated as a binary tree using a data correlation clustering algorithm. | Minimizes the energy usage of cloud data centers. | Online scheduling is not considered. | Execution cost. Resource utilization. Energy consumption. |
| Dubey et al. [57] | Scheduling algorithm based on IBA (Improved Backfill Algorithm). | Minimizes the execution time and increases resource usage. | More tasks imply less performance. Considers task priority. | Execution time. Makespan. Resource utilization. |
| Elseoud et al. [58] | Online Potential Finish Time heuristic algorithm. | Improves execution time and cost in cloud computing; executes tasks with the least amount of delay. | Data locality and replication not addressed. | Execution time/cost. Makespan. Response time. Resource utilization. |
| Delay algorithm [63] | Assigns tasks based on input data location; delays a task until its required data is available. | Ease of scheduling. Achieves data locality. | Imbalanced load, which can cause higher delays in task execution and lower throughput. | Execution time. |
| Matchmaking algorithm [64] | Before assigning a new task to nodes, every node has a fair chance to utilize its local tasks. | High node utilization. High cluster utilization. | No particular limitation. | Resource utilization. Availability. |
| Li et al. [28] | Online job scheduling based on data migration, with a trade-off between data transfer cost and waiting time cost. | Handles data migration. | Schedules tasks sequentially, which increases the waiting time. System characteristics not considered. Data replication not discussed. | Throughput. |
| Reddy et al. [59] | Modified Ant Colony Optimization. | Considers VM RAM, bandwidth, storage, processing speed and makespan in the fitness function. | Data-intensive online tasks not addressed. | Resource utilization. Makespan. Load balance. |
| Biswas et al. [60] | Dynamic round robin. | Dynamically determines the time quantum. | Starvation not handled. | Turnaround. Waiting time. |
| Soltani et al. [61] | Genetic meta-heuristic. | Multi-purpose weighted genetic algorithm to enhance performance. | Data-intensive online tasks not addressed. | Response time. Waiting time. Makespan. |
| Mohseni et al. [62] | Hard Disk Drive and CPU Scheduling (HCS) algorithm. | Schedules multiple tasks among multi-core systems. | Memory, bandwidth and latency of multi-core systems are not considered. | Execution time. Energy consumption. |
Table 2. Setup characteristics.

| Characteristic | Value |
| --- | --- |
| Number of machines | [5–100] |
| P_CPU (MIPS) | [1000–5000] |
| RAM (GB) | [64–2048] |
| Storage capacity (TB) | [1–25] |
| Number of tasks | [30–2000] |
| Size of tasks (MI) | [1000–4000] |
| Number of datasets | 300 |
| Size of datasets (GB) | [1–100] |
| Number of required datasets | [1–10] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

