1 Introduction

With the development of sensing and computing technologies, autonomous driving systems such as autonomous vehicles and mobile robots have attracted considerable research attention and seen increasing commercialization. Mobile robots are used in logistics centers to transport goods and in various other fields, such as serving in restaurants and cafes, delivering food to homes, providing guidance information, and cleaning at airports. Recently, to help prevent the spread of COVID-19, mobile robots have also been deployed for quarantine tasks such as body temperature measurement and sanitizing. In operation, mobile robots use internal sensors such as encoders and an IMU to measure their speed, posture, and travel distance, and external sensors such as lidar and cameras to detect surrounding walls and obstacles and measure their distance. With these sensors, they explore the environment, localize themselves, and drive autonomously to their destination. The goal is for these mobile robots to travel safely and quickly to their destination without colliding with any obstacles.

The navigation of these mobile robots typically comprises first localizing the robot on a map and then generating a path from the robot’s current position to the target position, which the robot then follows. For new obstacles that are not on the map, the robot detects and avoids them using a distance-measuring sensor, namely lidar, and finally reaches the target position. In many working scenarios the movement of dynamic objects is irregular, and it is a challenge for a mobile robot to avoid dynamic obstacles while traveling and reach the destination without collision. This is mainly because the robot knows neither the behavior of the dynamic obstacles nor their intended destination. Avoiding such dynamic obstacles has been studied actively with both traditional analytical techniques [1, 2] and empirical methods [3,4,5].

However, conventional obstacle avoidance methods such as the dynamic window approach (DWA) [6] and timed elastic band (TEB) [7] are designed for static obstacle avoidance. Bypassing dynamic obstacles requires predicting their next movement, which in turn requires sharing information with the dynamic obstacles or detecting newly appearing dynamic objects. The velocity obstacle (VO) family of methods [8,9,10,11] has been used by several researchers to address the dynamic obstacle avoidance problem. While predicting the next position of an obstacle from its movement path is advantageous for dynamic obstacle avoidance, these conventional methods have the disadvantage of requiring heavy computing power due to their combination of complex conditions and equations.

Empirical, learning-based methodologies can simplify the formulas represented by combinations of complex conditions in traditional methods. Reinforcement learning (RL) has been actively studied because it can solve irregular problems that are otherwise not solvable by traditional methods [12, 13]. Representative RL-based obstacle avoidance methods include CADRL [14] and MRCA [15]. CADRL aims to drive while avoiding people and requires obtaining each person’s position, speed, and size. When a person is detected it shows excellent avoidance performance and can even learn social rules, such as left-hand or right-hand traffic, from human movements. However, if a person is not detected, avoidance cannot be performed, and since non-human dynamic objects are not detected, collisions with them cannot be avoided. MRCA, on the other hand, can drive without collision in a multi-agent environment and, unlike CADRL, predicts and avoids the movements of dynamic obstacles directly from lidar distance information without detecting the objects. It can therefore avoid various dynamic obstacles, not only specific dynamic objects, and it learns quickly because multiple robots learn the policy, each as a separate agent. Each robot applies the collision avoidance policy while appearing to the other robots as a dynamic object with irregular movement. This advantage, however, can also be a disadvantage: since all robots are assumed to use the same avoidance policy, it is relatively difficult to avoid dynamic obstacles with different avoidance policies or with no avoidance policy at all. In addition, since the policy was trained in an environment where dynamic factors were not considered, its driving and obstacle avoidance performance differs somewhat between the real world and the training environment.

In this context, for decentralized dynamic obstacle avoidance, the present study proposes an RL-based method that predicts the next movement of an obstacle from its observed motion and avoids collision. The movement of dynamic objects is predicted from lidar distance information without detecting the objects, so that various obstacles can be avoided. Furthermore, to reduce the difference between driving in the real and training environments, the policy was trained in an environment where inertia and friction dynamics were considered. A multi-robot environment was also configured to enable fast learning, and dynamic objects without obstacle avoidance policies were placed in the training environment alongside the robots to enable effective avoidance of dynamic obstacles governed by other policies. However, RL-based obstacle avoidance alone failed to find a path in certain situations. To tackle this problem and improve path efficiency, a path planner was integrated with the reinforcement-learning-based obstacle avoidance.

We implemented the reinforcement-learning-based obstacle avoidance on a differential drive mobile robot and performed the task of driving to the target position while avoiding obstacles. To train the policy and evaluate its performance, we constructed simulation environments and evaluated the efficiency of obstacle avoidance and driving through simulation and real-world experiments. Section 2 describes the differential drive robot used in the present study and the parameters of the model. Section 3 details the reinforcement-learning-based collision avoidance method; the state, action, reward function, termination conditions, training environment, and training algorithm are described in detail, and the training results along with a limitation of the method are discussed. In Sect. 4, a modified method that includes a path planner is proposed and explained. Section 5 evaluates the driving and obstacle avoidance performance of the proposed method through real-world experiments in comparison with the dynamic window approach (DWA) and timed elastic band (TEB). Section 6 contains analysis and discussion of the results. Finally, conclusions are drawn in Sect. 7.

2 Description of mobile robot

The methodology used in the current study can be applied to any type of mobile robot, but a differential drive robot, which is widely used for indoor services, was adopted for application and validation. The industrial logistics robot SR7 (Syscon co., Inchon, Korea) is shown in Fig. 1, and its specifications are listed in Table 1.

Fig. 1
figure 1

Differential drive wheeled SR7 robot

Table 1 Specification of SR7 robot

The kinematics of the differential drive robot is represented in two-dimensional coordinates as indicated in Fig. 2. The yaw angle \(\psi \) represents the robot’s travel direction measured from the x-axis at the midpoint between the two wheels. The velocities of the left and right wheels are \(v_{\text {L}}\) and \(v_{\text {R}}\), respectively; the forward velocity of the mobile robot is \(v=(v_{\text {L}}+v_{\text {R}})/2\) and the yaw rate is \({\dot{\psi }}=\omega =(v_{\text {R}}-v_{\text {L}}) /W\), where W is the wheel tread.

Fig. 2
figure 2

Coordinate system for a differential drive robot

The required rotation speeds of the motors, \(\omega _{\text {R}}\) and \(\omega _{\text {L}}\), can be obtained from the robot inputs, i.e., the forward velocity v and the yaw rate \(\omega \), under the assumption that there is no slip,

$$\begin{aligned} \begin{bmatrix} \omega _{\text {L}} \\ \omega _{\text {R}} \\ \end{bmatrix}= \frac{1}{R} \begin{bmatrix} \frac{1}{2} &{} \frac{1}{2} \\ -\frac{1}{W} &{} \frac{1}{W} \\ \end{bmatrix}^{-1} \begin{bmatrix} v \\ \omega \\ \end{bmatrix} \end{aligned}$$
(1)

where R is the radius of the drive wheels.

The equation of motion for global coordinates is given by

$$\begin{aligned} \begin{bmatrix} {\dot{x}} \\ {\dot{y}} \\ {\dot{\psi }} \\ \end{bmatrix}= \begin{bmatrix} \cos \psi &{} 0 \\ \sin \psi &{} 0 \\ 0 &{} 1 \\ \end{bmatrix} \begin{bmatrix} v \\ \omega \\ \end{bmatrix}. \end{aligned}$$
(2)

The kinematics of the robot is used to convert the linear and angular velocity derived from the algorithm presented in this paper into the rotation speeds of the left and right motors.
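
For illustration, a minimal Python sketch of this conversion and of the pose propagation in Eqs. (1) and (2) is given below; the numerical values of R and W are placeholders, not the SR7 parameters.

```python
import math

def wheel_speeds(v, omega, R=0.1, W=0.5):
    """Convert body velocities (v, omega) to wheel angular speeds (Eq. 1).

    R: drive-wheel radius [m], W: wheel tread [m] (placeholder values).
    Assumes no slip: v = (v_L + v_R)/2 and omega = (v_R - v_L)/W.
    """
    v_left = v - 0.5 * W * omega
    v_right = v + 0.5 * W * omega
    return v_left / R, v_right / R   # (omega_L, omega_R)

def integrate_pose(x, y, psi, v, omega, dt):
    """Propagate the global pose with the unicycle model (Eq. 2)."""
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi += omega * dt
    return x, y, psi
```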

3 RL-based dynamic obstacle avoidance

In this section, we develop an obstacle avoidance model based on reinforcement learning, named Mobile robot Collision Avoidance Learning (MCAL). In MCAL, the agent-level decentralized collision avoidance paradigm is adopted, similar to the earlier multi-robot collision avoidance (MRCA) study [15]. However, to reduce the difference in obstacle avoidance performance between simulation and real-world environments and to achieve high sample efficiency and a fast learning speed, MCAL was trained in an environment with dynamics considered, using the off-policy actor-critic method soft actor critic (SAC) [16].

Figure 3 shows the framework of MCAL, an RL-based obstacle avoidance method with continuous actions, with each block performing the following tasks.

  • The robot obtains its position and speed data (/Odometry) and the lidar data (/Scan) through interaction with the external environment (world).

  • For localization [17], the map information (/Map) and lidar data (/Scan) are compared, and the robot’s position on the map (/Global Pose) is derived.

  • The relative difference between the target position (/Goal) and the robot’s position on the map (/Global Pose) gives the distance from the robot to the target position (/Relative Goal), which becomes an input of the RL agent.

  • The RL agent gathers the distance from the robot to the target position (/Relative Goal), 3 steps of lidar scan data (/Scan), and the speed of the robot (/Odometry.speed), and outputs the forward and rotational velocity (/Velocity) for driving to the target point without collision, through the trained deep neural network.

  • The robot moves by controlling the forward and rotational velocity obtained from the RL agent.

This process repeats until the target position is reached.
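
As an illustration of this loop, a minimal ROS (Python) sketch is given below. The node structure, topic names, and the `policy.act` and `build_state` interfaces are assumptions for illustration, not the authors’ implementation.

```python
import rospy
from geometry_msgs.msg import Twist
from nav_msgs.msg import Odometry
from sensor_msgs.msg import LaserScan

class MCALAgentNode:
    """Illustrative version of the Fig. 3 loop (assumed interfaces, not the authors' code)."""

    def __init__(self, policy, build_state):
        self.policy = policy              # trained SAC actor: act(state) -> (v, omega)   [assumed]
        self.build_state = build_state    # assembles the Eq. (3) state from raw messages [assumed]
        self.scan_msg = None
        self.odom_msg = None
        rospy.Subscriber("/scan", LaserScan, self._on_scan)
        rospy.Subscriber("/odom", Odometry, self._on_odom)
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)

    def _on_scan(self, msg):
        self.scan_msg = msg

    def _on_odom(self, msg):
        self.odom_msg = msg

    def step(self, relative_goal):
        """One control cycle: observe, query the policy, publish a velocity command."""
        if self.scan_msg is None or self.odom_msg is None:
            return
        state = self.build_state(self.scan_msg, self.odom_msg, relative_goal)
        v, omega = self.policy.act(state)
        cmd = Twist()
        cmd.linear.x = v
        cmd.angular.z = omega
        self.cmd_pub.publish(cmd)
```

Here `relative_goal` corresponds to /Relative Goal computed by the localization block, and `step` is called at a fixed rate until the goal is reached.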

Fig. 3
figure 3

Structure of mobile robot collision avoidance learning

3.1 Reinforcement learning formulation

3.1.1 State

To formulate the collision avoidance problem within a reinforcement learning framework, the state consists of the lidar data, which provide distance information about the surrounding environment, the forward velocity v and rotational velocity \(\omega \) of the robot, and the relative distances x and y from the robot to the target position:

$$\begin{aligned} s_{t} =&[s_{t}^{\text {lidar}}, s_{t}^{\text {goal}}, s_{t}^{\text {speed}}] \end{aligned}$$
(3)

where \(s_{t}^{\text {lidar}} = [o_{\text {l}}^{t-2}, o_{\text {l}}^{t-1}, o_{\text {l}}^{t}]\), \(s_{t}^{\text {goal}} = o_{\text {g}}^t = [x, y]\), and \(s_{t}^{\text {speed}} = o_{\text {s}}^t = [v, \omega ]\).

\(s_{t}^{\text {lidar}}\) is the lidar data, which capture the relationship between the obstacles and the robot by measuring distances and thereby sensing objects. The lidar data of three consecutive time-steps are used to implicitly predict the movement direction and speed of objects. \(s_{t}^{\text {goal}}\) is the relative distance from the robot to the target position; it provides clear information on whether the robot is moving in the right direction and thus effectively determines the driving direction. \(s_{t}^{\text {speed}}\) provides the velocity information of the robot, from which the speed limit, the inertia of the robot, and the appropriate avoidance behavior for the current speed can be inferred.
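
A minimal sketch of how the state of Eq. (3) might be assembled is shown below, assuming 512 beams per scan (the lidar input size used in Sect. 3.2); the buffer handling and frame conventions are illustrative.

```python
from collections import deque
import numpy as np

NUM_BEAMS = 512          # beams per scan, matching the 512 x 3 network input of Sect. 3.2
HISTORY = 3              # consecutive time-steps of lidar data

scan_buffer = deque(maxlen=HISTORY)

def build_state(scan, goal_xy, speed):
    """Assemble s_t = [s_lidar, s_goal, s_speed] as in Eq. (3).

    scan    : array of NUM_BEAMS range measurements
    goal_xy : (x, y) relative distance from the robot to the target, robot frame
    speed   : (v, omega) current forward and rotational velocity
    """
    scan_buffer.append(np.asarray(scan, dtype=np.float32))
    while len(scan_buffer) < HISTORY:                # pad with the first scan at start-up
        scan_buffer.appendleft(scan_buffer[0].copy())
    s_lidar = np.stack(scan_buffer)                  # shape (3, 512): o_l^{t-2}, o_l^{t-1}, o_l^t
    s_goal = np.asarray(goal_xy, dtype=np.float32)   # shape (2,)
    s_speed = np.asarray(speed, dtype=np.float32)    # shape (2,)
    return s_lidar, s_goal, s_speed
```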

3.1.2 Action

The action of the mobile robot is defined as a continuous action to enable smooth movement and avoidance in various ways; it consists of two-dimensional information, the forward velocity v and the rotational velocity \(\omega \), as follows.

$$\begin{aligned} a_{t} =&[v, \omega ] \end{aligned}$$
(4)

where v and \(\omega \) are continuous values subject to the velocity constraints \(v \in [0.00, 0.55]\) and \(\omega \in [-0.60, 0.60]\).
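
These limits can be enforced, for example, by squashing the sampled Gaussian action with tanh and rescaling it into the allowed ranges, a common choice in SAC implementations; the paper does not state the exact mechanism or units (m/s and rad/s are assumed here), so this is only one plausible realization.

```python
import numpy as np

V_MIN, V_MAX = 0.00, 0.55     # forward velocity limits (assumed m/s)
W_MIN, W_MAX = -0.60, 0.60    # rotational velocity limits (assumed rad/s)

def squash_action(raw_action):
    """Map an unbounded 2-D sample to a_t = [v, omega] within the limits."""
    u = np.tanh(np.asarray(raw_action, dtype=np.float32))   # each component in (-1, 1)
    v = V_MIN + 0.5 * (u[0] + 1.0) * (V_MAX - V_MIN)
    omega = W_MIN + 0.5 * (u[1] + 1.0) * (W_MAX - W_MIN)
    return np.array([v, omega])
```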

Fig. 4
figure 4

Actor and critic networks for mobile robot collision avoidance learning

3.1.3 Reward

In this study, the robot aims to reach the target position (\(P_x\), \(P_y\)); the final orientation is not considered, since the differential drive robot can easily rotate in place. To reach the target position, the robot must drive without colliding with obstacles while remaining within its performance limits. The reward was therefore decomposed into three terms, and the total reward function is their sum,

$$\begin{aligned} R =&R_{\text {g}} + R_{\text {c}} + R_{\omega }. \end{aligned}$$
(5)

If the mobile robot reaches the target position, the agent receives a large reward of 10. Additionally, while moving toward the target position, a reward is given whenever the distance to the target becomes shorter than before, since the robot is then moving in the right direction.

$$\begin{aligned} R_{\text {g}} = {\left\{ \begin{array}{ll} 10 &{} \hbox { if}\ {\text {dis}}_{\text {curr}} < 0.5 \\ {\text {dis}}_{\text {pre}} - {\text {dis}}_{\text {curr}} &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(6)

where the distance is given by

$$\begin{aligned} {\text {dis}} = \sqrt{(p_x^{\text {goal}}-p_x^{\text {robot}})^2 + (p_y^{\text {goal}}-p_y^{\text {robot}})^2}. \end{aligned}$$
(7)

Moving away from the target position incurs a penalty corresponding to the distance traveled in one step, while moving closer to the target position yields a reward corresponding to the distance traveled in one step.

The reward \(R_{\text {c}}\) imposes a large penalty of \(-10\) when there is a collision with an obstacle, so that the robot learns a policy that avoids collisions with obstacles.

$$\begin{aligned} R_{\text {c}} = {\left\{ \begin{array}{ll} -10 &{} \text {if Collision} \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(8)

The last reward, \(R_{\omega }\), treats the robot’s performance limit as a penalty instead of a hard constraint. When the 150 kg SR7 robot rotates rapidly, it is difficult to control due to inertia, so \(R_{\omega }\) imposes a penalty on rotational velocities beyond the threshold value to prevent this problem.

$$\begin{aligned} R_{\omega } = {\left\{ \begin{array}{ll} -0.1 | \omega | &{} \hbox { if}\ | \omega | > 0.6 \\ 0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(9)

The weight of each reward term was determined by trial and error through experiments, so that the robot learns various methods of obstacle avoidance according to the situation, such as avoidance through acceleration and deceleration, avoidance by stopping, and avoidance via a change of direction.
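
For reference, Eqs. (5)–(9) translate directly into code; the thresholds follow the equations, and the goal and robot positions are assumed to be given in map coordinates (a minimal sketch).

```python
import math

def distance_to_goal(goal_xy, robot_xy):
    """Euclidean distance of Eq. (7)."""
    return math.hypot(goal_xy[0] - robot_xy[0], goal_xy[1] - robot_xy[1])

def total_reward(prev_dist, curr_dist, collided, omega):
    """R = R_g + R_c + R_omega as in Eqs. (5)-(9)."""
    # R_g: +10 on arrival (closer than 0.5 m), otherwise progress toward the goal (Eq. 6)
    if curr_dist < 0.5:
        r_goal = 10.0
    else:
        r_goal = prev_dist - curr_dist
    # R_c: -10 on collision (Eq. 8)
    r_collision = -10.0 if collided else 0.0
    # R_omega: penalize rotational velocity beyond the 0.6 rad/s threshold (Eq. 9)
    r_omega = -0.1 * abs(omega) if abs(omega) > 0.6 else 0.0
    return r_goal + r_collision + r_omega
```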

3.1.4 Termination conditions

The termination condition for stopping an episode during training covers three cases: the robot reaches the target position, it collides with an obstacle, or the number of steps in the episode exceeds 2000.

$$\begin{aligned} T = {\left\{ \begin{array}{ll} \text {True} &{} \text { if (Reached goal } \textit{or} \text { Collision }{} \textit{or} \text { step}>2000) \\ \text {False} &{} \text { otherwise} \end{array}\right. } \end{aligned}$$
(10)

The limit of 2000 steps suppresses the tendency to reach the target position by hovering around it for a long time, or to avoid obstacles by taking an inefficient path. Without such a limit, the robot would learn a very conservative obstacle avoidance behavior, such as slowing down or stopping, gaining only a small \(R_{\text {g}}\) over a long time.
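
The termination condition of Eq. (10) is then a simple predicate; the 0.5 m arrival radius below is an assumption taken from Eq. (6), as the paper does not state the arrival threshold used for termination explicitly.

```python
MAX_STEPS = 2000    # episode step limit from Eq. (10)
GOAL_RADIUS = 0.5   # assumed arrival threshold, matching Eq. (6)

def is_terminal(curr_dist, collided, step):
    """True when the episode should stop: goal reached, collision, or step > 2000."""
    return curr_dist < GOAL_RADIUS or collided or step > MAX_STEPS
```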

3.2 Network

To apply the SAC algorithm to collision avoidance, an actor network that derives the policy and a critic network that derives the Q-function for evaluating the policy were constructed.

Following the network of MRCA [18], the actor network shown in Fig. 4a consists of two convolution layers, three fully connected layers, two nonlinear ReLU activation functions, and one linear activation function. Among the state components in Eq. (3), the 512 \(\times \) 3 lidar data corresponding to 3 sample times are input and passed through the two convolution layers to extract the temporal change.

This information, combined with the relative distance (x, y) from the robot to the target position and the velocity of the robot \((v,\omega )\), is passed through two fully connected layers. Finally, the mean velocity \(v^t_{means}\) and log standard deviation \(v^t_{logstd}\) form a Gaussian distribution of actions, from which the final action \(a_{t}\) is sampled.

The critic network shown in Fig. 4b is similar to the actor network, but the action obtained from the actor network is added as an input and the final output is the Q-value.
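
A PyTorch sketch of an actor along the lines of Fig. 4a is given below, assuming 512 beams × 3 time-steps of lidar input; the channel counts, kernel sizes, and hidden widths are placeholders since the paper does not list them, and the critic differs only in concatenating the action and outputting a single Q-value.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy over (v, omega); layer sizes are illustrative placeholders."""

    def __init__(self, num_beams=512, history=3):
        super().__init__()
        # 1-D convolutions over the beam axis extract features from the stacked lidar scans
        self.conv = nn.Sequential(
            nn.Conv1d(history, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
        )
        with torch.no_grad():
            feat_dim = self.conv(torch.zeros(1, history, num_beams)).flatten(1).shape[1]
        # Fully connected layers combine lidar features, relative goal (2) and speed (2)
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + 4, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.mean_head = nn.Linear(128, 2)      # v_means
        self.logstd_head = nn.Linear(128, 2)    # v_logstd

    def forward(self, lidar, goal, speed):
        """lidar: (B, 3, 512), goal: (B, 2), speed: (B, 2) -> action distribution."""
        h = self.conv(lidar).flatten(1)
        h = self.fc(torch.cat([h, goal, speed], dim=1))
        mean = self.mean_head(h)
        log_std = self.logstd_head(h).clamp(-20, 2)
        return torch.distributions.Normal(mean, log_std.exp())
```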

3.3 Training

3.3.1 Training environment

Two simulators were used for the training simulation. The first, the Stage simulator [19] in Fig. 5a, considers only kinematic factors, so behaviors governed by dynamic characteristics such as inertia and friction cannot be trained accurately, resulting in significant differences between the real-world and training environments. However, because the simulator is lightweight, the simulation time can be accelerated, which allows fast learning.

Fig. 5
figure 5

Environments for the Stage and Gazebo simulators for training

The second is the Gazebo simulator [20] in Fig. 5b. It takes into account dynamic factors such as inertia and friction as well as kinematic factors. A difference between the real-world and training environments also exists for the Gazebo simulator, but it is smaller; the drawback is a longer simulation time. In consideration of these advantages and disadvantages, after fast training in the Stage simulator, the actor and critic networks were trained again in the Gazebo simulator to reduce the difference between the real-world and learning environments.

The training environment in the Stage simulator was a circular space with a diameter of 30 m, in which 7 static objects and 4 dynamic objects were placed randomly; the dynamic objects have no avoidance policy and move in a straight line at random speeds between 0.5 and 1 m/s. In Fig. 5a, the red squares are the robots, the blue objects are dynamic objects, and the black objects are static objects. We also deployed 4 robots to train the policy simultaneously; this multi-robot training has the advantage that each robot treats the others as dynamic obstacles and learns to avoid obstacles with various movements. Moreover, since the 4 robots gather the information necessary for training together, the learning speed increases. The training environment in the Gazebo simulator was a square space of 20 m by 20 m, with objects placed randomly just as in the Stage environment, and 4 robots were trained on the policy simultaneously. In Fig. 5b, the white boxes are the robots, whose goals are marked by the red, blue, yellow, and white circles, respectively. The walking people are dynamic objects, and the gray boxes and cylinders are static objects.

3.3.2 Training algorithm

After experimenting with different algorithms, we adopted SAC to train the weights of the deep neural networks.

SAC is an off-policy learning algorithm that can use all the information collected during the training process for deep neural network learning, so its sample efficiency is high. The objective function of SAC is given by

$$\begin{aligned} J(\theta ) = \sum _{t=0}^{T}E_{\pi }[r(s_t, a_t, s_{t+1}) + \alpha H(\pi (\cdot \mid s_{t}))]. \end{aligned}$$
(11)

The entropy function H and the hyper-parameter \(\alpha \), which determines the relative importance of the entropy term against the reward, are added to the conventional objective function. In training with SAC, the agent acts more randomly, which should result in effective learning. Since the resulting Gaussian policy is relatively flat, if the optimal policy does not work well in the real environment due to the difference between the training and real environments, there is a possibility of solving the problem with the next-best policy.
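
For the diagonal Gaussian policy over \((v, \omega )\) used here, the entropy term in Eq. (11) has the standard closed form (shown only to illustrate how \(\alpha \) trades reward against action randomness),

$$\begin{aligned} H(\pi (\cdot \mid s_t)) = \tfrac{1}{2}\ln \left[ (2\pi e)^2 \sigma _v^2(s_t)\, \sigma _\omega ^2(s_t) \right] = \ln (2\pi e) + \ln \sigma _v(s_t) + \ln \sigma _\omega (s_t), \end{aligned}$$

so a larger \(\alpha \) favors policies that keep the standard deviations \(\sigma _v\) and \(\sigma _\omega \) large, i.e., that keep exploring.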

In fact, we compared the results after training with the on-policy algorithm Proximal Policy Optimization (PPO), and confirmed that SAC achieves a faster learning speed due to its high sample efficiency.

The off-policy algorithm SAC requires a replay memory to store the state, reward, and action values generated in every step of the training process, and the training framework shown in Fig. 6 improves the policy using the information stored in this repository. Pre-processing in Fig. 6 refers to the process of obtaining the global pose from the scan data, odometry data, and map data; it transforms the information obtained from the environment to match the input of the network.
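
A compact sketch of this experience flow is given below; the `env` and `agent` interfaces, buffer capacity, and batch size are assumptions for illustration, not the authors’ implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples for off-policy SAC updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

def train_episode(env, agent, buffer, batch_size=256):
    """One episode of the Fig. 6 loop (assumed env/agent interfaces)."""
    state, done = env.reset(), False
    while not done:
        action = agent.act(state)                    # sample from the current Gaussian policy
        next_state, reward, done = env.step(action)  # simulator step (Stage or Gazebo)
        buffer.push((state, action, reward, next_state, done))
        if len(buffer.buffer) >= batch_size:
            agent.update(buffer.sample(batch_size))  # SAC actor/critic update on a minibatch
        state = next_state
```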

Fig. 6
figure 6

Training framework for mobile robot collision avoidance learning

3.3.3 Training results

The policy was trained using the Stage simulator with the hyper-parameters in Table 2. The reward graph in Fig. 7 confirms that the policy converged after about 20,000 episodes. The policy was then trained for 3000 more episodes using the Gazebo simulator to reduce the difference between the training environment and the real-world environment.

Table 2 Hyper-parameters for training in SAC
Fig. 7
figure 7

Reward for the stage simulation training

3.4 Problem of RL-based dynamic obstacle avoidance

It was confirmed that the robot trained by the above process was able to move without colliding with static or dynamic obstacles in open spaces. In particular, it responded differently to dynamic and static obstacles: as a dynamic obstacle approaches, the forward velocity decreases significantly and the rotational velocity increases significantly, and as the movement of the dynamic obstacle slows down, these velocity changes diminish.

However, certain problems were found when the robot moved from point A to point B in a virtual environment, as shown in Fig. 8a.

Fig. 8
figure 8

Problem with MCAL (square marker: start point, circle marker: end point)

The movement path of the robot is shown in Fig. 8b, revealing the problem that the robot avoids the wall in the middle but fails to reach the target point and continues to hover around the wall. This problem is caused by the conflicting objectives of minimizing the distance to the target point and maintaining a certain distance from the obstacle while moving according to the proposed method. Thus, following the actions suggested by the trained MCAL policy, the robot moves along an inefficient path and, in certain circumstances, cannot find the driving direction. This occurs in situations such as when an obstacle is close to the front of the robot, and can be attributed to a lack of lidar data and a lack of training in sufficiently varied environments.

These problems can be mitigated with extensive training in a variety of environments. Previous RL-based robot navigation studies also addressed this problem through effective exploration [21, 22]. However, it is still not possible to learn an optimal path comparable to that of traditional algorithms that solve an optimization problem; problem-solving through exploration is likely to fail to derive the optimal path when similar observations are produced across a complex and wide environment. Therefore, we address these problems by integrating RL-based mobile robot navigation with a traditional method.

4 RL-based dynamic obstacle avoidance with optimal path

To solve the MCAL problem mentioned above, we combined MCAL with classical optimal path planning to obtain an improved method, so that the robot follows the optimal path that the global path planner generates based on the map. As shown in Fig. 9, we adopted the concept of the look-ahead distance from the path-following method Pure Pursuit [23]. The look-ahead point is a moving point that maintains a certain distance from the robot and replaces the target point in the MCAL input.

Fig. 9
figure 9

Look-ahead point and optimal path

In this section, we propose Mobile robot Collision Avoidance Learning with Path (MCAL_P), which follows the look-ahead point through the trained policy and reaches the target point while avoiding collisions.

The overall procedure, including the path planner and look-ahead point, is shown in Fig. 10.

Fig. 10
figure 10

The MCAL_P framework

  • The robot obtains its position and speed data (/Odometry) and the lidar data (/Scan) through interaction with the external environment (world).

  • For localization, the map information (/Map) and lidar data (/Scan) are compared, and the robot’s position on the map (/Global Pose) is derived.

  • Path Planning, using an algorithm such as \(\hbox {A}^*\) [24], generates the global optimal path from the robot’s location to the target point (/Path) from the input of the map information (/Map), the target position (/Goal), and the robot’s location (/Global pose).

  • Find Look Ahead Point finds a look-ahead point (/Look Ahead Point), a point on the path one look-ahead distance away from the robot, based on the input of the optimal path (/Path) and the position of the robot on the map (/Global pose); a minimal code sketch of this block is given below.

  • The relative difference between the look-ahead point on the path (/Look Ahead Point) and the position of the robot on the map (/Global pose) gives the distance to the look-ahead point (/Relative Goal), which becomes the input of the RL agent.

  • The RL agent receives the relative distance from the robot to the look-ahead point (/Relative Goal), 3 steps of lidar data (/Scan), and the speed of the robot (/Odometry.speed), and outputs the forward and rotational velocity (/Velocity) for driving to the look-ahead point without collision through the trained deep neural network.

  • The robot moves by controlling the forward and rotational velocity obtained from the RL agent.

Without new training, the RL agent described in Sect. 3 was reused.
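
A minimal sketch of the Find Look Ahead Point block is given below: walk along the planned path from the waypoint nearest the robot and return the first waypoint at least one look-ahead distance away. The distance value and the (non-interpolating) search scheme are placeholders, as the paper does not specify them.

```python
import math

def find_look_ahead_point(path, robot_xy, look_ahead_dist=1.0):
    """path: list of (x, y) waypoints from the global planner (e.g., A*).

    Returns the first waypoint beyond the look-ahead distance, searching forward
    from the waypoint nearest the robot, or the final waypoint (the goal) when
    the remaining path is shorter than the look-ahead distance.
    """
    dists = [math.hypot(x - robot_xy[0], y - robot_xy[1]) for x, y in path]
    nearest = dists.index(min(dists))
    for i in range(nearest, len(path)):
        if dists[i] >= look_ahead_dist:
            return path[i]
    return path[-1]
```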

4.1 Improvement of the MCAL problem

To check whether the problem of MCAL introduced in Sect. 3 was solved using MCAL_P, the test in Sect. 3.4 was conducted in the same manner.

Figure 11 compares the movement paths of the mobile robot using MCAL and MCAL_P. As intended, MCAL_P drove effectively in environments where MCAL failed, by obtaining a favorable driving direction from the optimal path.

Fig. 11
figure 11

Robot odometry MCAL vs. MCAL_P (square marker: start point, circle marker: end point)

5 Simulation and real-world experiments

5.1 Real-world experiments in standardized environments

The proposed method was implemented based on the Navigation stack [25] in the Robot Operating System (ROS) [26]. We also implemented DWA and TEB, the two most commonly used conventional non-communicating obstacle avoidance driving methods. We compared the obstacle avoidance success, travel time, driving path, computation time, and velocities of the proposed method with those of TEB and DWA through driving experiments in the same environments.

The common parameters required for DWA and TEB are shown in Table 3; the additional parameters for DWA are listed in Table 4 and those for TEB in Table 5. For the remaining parameters, the default values provided by the Navigation stack in ROS were used. The parameters of each algorithm were set so that straight-line driving without obstacles was as similar as possible across methods; however, it was difficult to make them completely identical due to structural differences.

Table 3 Common parameters for DWA and TEB
Table 4 Parameters for DWA
Table 5 Parameters for TEB

5.1.1 Experimental environment and methods

The real-world experiments were conducted in 6 environments, as shown in Fig. 12. In each environment, the mobile robot drives 10 round trips from position A to position B while avoiding obstacles along the path, using MCAL_P, DWA, and TEB, respectively. MCAL_P was not trained in these 6 environments, so this experiment checks the generalization of the policy trained in the simulator.

  1. Map A

    Empty environment without objects (Fig. 12a)

  2. Map B

    Environment where 3 static objects are placed at 4 m intervals (Fig.  12b)

  3. Map C

    Environment where 3 static objects are placed at 2 m intervals (Fig.  12c)

  4. Map D

    Environment where two dynamic objects reciprocate in a direction perpendicular to the robot’s moving direction (Fig.  12d)

  5. Map E

    Environment where two dynamic objects reciprocate in a direction parallel to the robot’s moving direction (Fig.  12e)

  6. Map F

    Environment where two dynamic objects reciprocate in a direction diagonal to the robot’s moving direction (Fig.  12f)

Fig. 12
figure 12

Environments for performance test (https://youtu.be/xxzoh1XbAl0?t=25)

5.1.2 Experimental results

Table 6 shows the obstacle avoidance success rate and the average travel time over 10 runs. The experimental results were analyzed in terms of obstacle avoidance success rate, travel time, computation time, trajectory, and driving tendency.

Table 6 Success rate and travel time for MCAL_P performance test

Success rate Table 6 shows that MCAL_P, DWA, and TEB performed well in the Map B and Map C environments, which concern static obstacle avoidance, and relatively well in the Map D and Map F environments, which concern dynamic obstacle avoidance. However, in Map E, where dynamic objects move parallel to the robot’s direction of travel, DWA failed to evade, while TEB succeeded in avoiding only two out of ten times. In all environments, MCAL_P showed a significantly higher probability of dynamic obstacle avoidance than DWA and TEB.

Travel time Overall, MCAL_P drove faster than DWA and TEB in all environments. This is because DWA and TEB are always subject to acceleration and deceleration according to a speed profile, whereas MCAL_P has no speed profile and accelerates or decelerates only in certain situations according to the trained policy, which seems to result in faster driving. However, in narrow-road environments such as Map C, MCAL_P drove slower than DWA and TEB.

Computation time The time taken to derive the forward and rotational velocity, the outputs of each method, was compared on the same PC (Intel i9-9900K with 32 GB RAM and a GeForce GTX 1080 Ti). MCAL_P was the fastest at 0.00745 s, followed by TEB at 0.015 s, while DWA was the slowest at 0.028 s.

Trajectory Figure 13 shows the trajectories of the robot in each map. In the obstacle-free Map A environment (Fig. 13a), MCAL_P traveled with relatively large y-axis movement compared to DWA and TEB, indicating that it is very difficult for MCAL_P to drive straight; this is discussed further in the next section. When avoiding static obstacles in the Map B (Fig. 13b) and Map C (Fig. 13c) environments, MCAL_P took a relatively wide path around the obstacles; such avoidance, allowing for obstacles that might suddenly start moving, was expected from the training process of MCAL_P. In the Map D environment with dynamic obstacles, the path of MCAL_P in Fig. 13d again shows larger movement along the y-axis compared to DWA and TEB.

Fig. 13
figure 13

Robot odometry for performance test (square marker: start point, circle marker: end point)

These characteristics of the driving trajectories were observed probably because MCAL_P avoided obstacles by various methods, such as stopping and turning, depending on the situation. In contrast, DWA and TEB tended to stop first to avoid obstacles and wait for them to pass before driving on. This avoidance tendency can also be confirmed through the trajectories in the Map E environment. In Fig. 13e, DWA and TEB collided with an obstacle at point (−4.5, 11) and moved toward the target point only after the obstacle had moved away. The trajectories of DWA and TEB in Map E in Fig. 13e are almost identical to those in Map A, which means that DWA and TEB could not adopt behaviors different from those in the obstacle-free Map A environment when an obstacle approached head-on, and could only try to avoid the collision by stopping and waiting for the obstacle to pass. MCAL_P, however, moved rapidly in the y-axis direction near point (−4.5, 11), close to the obstacle, and by changing direction avoided the collision with the obstacle approaching from the front. Due to this difference in avoidance methods, DWA and TEB cannot avoid dynamic obstacles with various movements, whereas MCAL_P can avoid such obstacles in various ways.

Driving tendency Figures 14 and 15 show the forward and rotational velocities for the six environments. When avoiding static obstacles in Figs. 14 and 15b, c, DWA and TEB avoided the obstacles by accelerating and decelerating without stopping; however, to avoid the dynamic obstacles in Figs. 14 and 15d–f, DWA and TEB avoided obstacles simply by stopping. Because of this avoidance method, dynamic obstacles moving ahead of the robot cannot be avoided. Unlike DWA and TEB, MCAL_P avoided obstacles by moving, changing direction, or accelerating/decelerating depending on the situation, without distinguishing whether the obstacle was static or dynamic. MCAL_P can avoid obstacles with varying motions, but in narrow sections such as Map C, the forward velocity graphs in Fig. 14c show large variation and a tendency to steer after stopping to avoid obstacles. As a result, the driving performance is degraded by performing avoidance through a somewhat inefficient path when avoiding static obstacles. Interestingly, at the start of driving in Fig. 14, MCAL_P and the other algorithms show different accelerations. TEB and DWA have acceleration limits, but their actual acceleration at the start was much smaller than the limit; while moving, DWA and TEB can accelerate and decelerate faster than at the start, as can be seen from the sharp drops in Fig. 14c–e. We consider these factors to be characteristics of the algorithms.

Fig. 14
figure 14

Forward velocity of the robot

Fig. 15
figure 15

Rotational velocity of the robot

In addition, the rotational velocity graph in Fig. 15 shows that MCAL_P repeatedly rotates left and right and moves with considerable oscillation. Such movements occur even in situations where there are no obstacles, as in Fig. 13a, and are responsible for the robot’s failure to move along a straight line. Since the robot performs neither curved nor straight-line motion smoothly, it cannot travel along the optimal path and always follows a somewhat inefficient path while driving. This is a disadvantage of MCAL_P.

5.2 Real-world experiment in real working environment

To examine the possibility of obstacle avoidance driving in spaces such as real collaborative factories and parks, we constructed an environment in which a large number of people moved and the robot drove using the MCAL_P methodology.

The robot started from position A in Fig. 16 and reached the final target point D via positions B and C. In this environment, there were 4 static obstacles and 7 dynamic obstacles (see Fig. 17).

Fig. 16
figure 16

Map for real working environment

Fig. 17
figure 17

Experiment for real working environment (https://youtu.be/xxzoh1XbAl0?t=91)

MCAL_P showed excellent avoidance performance even when a large number of people congregated and moved around. In this experiment, there were 20 situations in which the robot had to avoid various obstacles, and it succeeded in all 20. The robot avoided fast-approaching people in various ways, such as making a sudden stop or changing direction.

Thus, it is expected that the robot will be able to perform excellent obstacle-avoidance driving even in collaborative environments such as actual factories and crowded environments in parks and airports.

5.3 Multi-robot driving simulation

In this work, MCAL and MCAL_P were trained in a multi-robot setting. Accordingly, a simulation experiment was conducted to confirm that each robot avoids the others and travels to its target point even in an environment where a number of robots travel together, such as a logistics center.

5.3.1 Experimental environment and methods

A 10 \(\times \) 5 m environment in which two robots were run (Fig. 18a) and a 10 \(\times \) 10 m environment in which four robots were run (Fig. 18b) were constructed for multi-robot navigation and driving. In Fig. 18a, robots A and B, each taking the other’s position as its target, conducted 10 obstacle avoidance runs. Similarly, in Fig. 18b, an experiment was conducted 10 times in which robot A moved to the position of robot C, robot B to the position of robot D, robot C to the position of robot A, and robot D to the position of robot B as their respective target positions.

Fig. 18
figure 18

Virtual environment for multi robot test (https://youtu.be/xxzoh1XbAl0?t=148)

5.3.2 Experimental results

In both the experiment in which the two robots avoided each other and the one in which the four robots avoided one another, the robots succeeded in all 10 attempts and showed excellent obstacle avoidance performance. Although the robots did not communicate with each other, they shared the same policy; therefore, they took the same kind of avoidance actions, and unlike the people and objects that appeared as dynamic obstacles in the previous experiments, the robots easily avoided each other. The paths resulting from the shared policy can be seen in Fig. 19: the robots followed similar driving paths under similar conditions and showed a tendency to avoid by turning to the left of the others.

Fig. 19
figure 19

Robots odometry for multi-robot test (square marker: start point, circle marker: end point)

6 Analysis and discussion

When driving to avoid obstacles with MCAL, the problem was that, when the driving direction could not be found, driving could not continue for a long time. To improve this, the proposed MCAL_P provides the robot with a favorable driving direction through an optimal path. MCAL_P was able to avoid dynamic obstacles effectively in various ways, such as acceleration, directional change, and stopping, unlike DWA and TEB, which avoided dynamic obstacles only by stopping, confirming the best dynamic obstacle avoidance performance. This appears to be because the objective functions of DWA and TEB are calculated without considering the movement direction and speed of the obstacle, and are thus only favorable for static obstacle avoidance. In contrast, MCAL_P can grasp the temporal movement of obstacles around the robot by using the lidar data of the current step and the previous two steps as the state, and its dynamic obstacle avoidance performance is excellent because the obstacle avoidance policy was trained considering the moving direction and speed of the dynamic object. In addition, MCAL_P showed excellent avoidance performance in a complex environment where a large number of people move, and showed smooth avoidance behavior even in multi-robot environments.

However, MCAL_P also has certain disadvantages. It tends not to decelerate or accelerate except when avoiding obstacles, which can be a problem when the robot is carrying out its work. Despite reducing inefficient driving by using the optimal path, inefficient behaviors such as frequent deceleration and stopping remain when avoiding closely spaced static obstacles. This is attributed to training focused on dynamic obstacle avoidance, in which static obstacles are given a wide berth in consideration of their possible movement, and to the robot’s poorly controlled, off-target movements.

Another drawback is that the robot cannot follow a straight path, which is believed to be inherent to the reinforcement learning approach. During training, the robot drives directly, receives an evaluation of the drive as a reward, and learns the policy for obstacle avoidance and driving in a way that maximizes its reward. Therefore, if straight-line driving never occurs during training, the robot cannot determine whether a straight line is a good way of driving, because it has no information on straight-line driving. In reinforcement learning with continuous actions, however, the probability that a rotational velocity of exactly 0 rad/s is produced continuously over all steps is close to 0, so information on straight-line driving is not obtained.

7 Conclusion

The current study proposed a method by which a mobile robot can avoid both static and dynamic obstacles and drive to the target point based on reinforcement learning. Assuming decentralized control without communication between agents, only the distance information obtained through lidar was used to derive the avoidance policy for new and various obstacles. For effective dynamic obstacle avoidance, the movements of dynamic obstacles were predicted and avoided based on the lidar distance information of the present and the past two steps. Instead of separately predicting the motions of obstacles, planning, and following a path, reinforcement learning was used to account for the various motions of an irregular dynamic environment directly from the observed state and to output the action directly. To reduce the difference between the real driving environment and the training environment, we developed a training environment in which dynamic characteristics were considered, and used soft actor critic as the reinforcement learning algorithm to learn the policy; owing to the high sample efficiency of SAC, the policy could be obtained in a short time. The reinforcement-learning-based obstacle avoidance driving method first proposed in this study outputs the forward and rotational velocity, which are continuous actions, for obstacle avoidance and driving to the target position by taking as input the lidar data over time, the speed of the robot, and the distance to the target position. However, driving in certain environments proved impossible. To solve this problem, a path planner was integrated, and the input of the trained policy was modified to the lidar data, the speed of the robot, and the distance to the look-ahead point. The improved method, while following the optimal path globally, performed various kinds of avoidance locally, through acceleration/deceleration, stopping, and direction changes, according to the movement of the dynamic obstacle and the distance between the robot and the obstacle.

Through various simulations and real-world experiments, it was confirmed that the obstacle avoidance performance was excellent in an actual work environment. The dynamic object avoidance performance of MCAL_P was much better than that of the DWA and TEB algorithms, and the computation time required to derive the algorithm’s outputs, the linear and angular velocity, was also shorter. In the real-world environment experiments, we confirmed that robots using MCAL_P can also avoid suddenly appearing objects with a fast response. We also found that, in multi-agent environments, avoidance proceeds more effectively and safely because robots using MCAL_P avoid each other with the same policy.

However, certain disadvantages remain to be improved: the robot cannot drive in a straight line, follows inefficient paths while avoiding static obstacles, and does not accelerate or decelerate while driving except when avoiding obstacles.

Videos of the experiment are available at https://youtu.be/xxzoh1XbAl0.