1. Introduction
Automated parking is a topic of intense interest in both academia and industry. Autonomous vehicles (AVs) can improve driving safety, efficiency and convenience through Advanced Driver Assistance Systems (ADAS) [1]. Automated parking is an essential ADAS application for autonomous vehicles and has been deployed by many car manufacturers, such as Audi, BMW, Mercedes-Benz and BYD [2]. In recent years, parking space has become scarce in many cities as the number of vehicles grows while infrastructure lags behind. Narrow, crowded parking environments increase the risk of collision and make parking more difficult for drivers, while also posing new challenges for automated parking technology [3]. This paper aims to achieve a more efficient collision-free trajectory planning scheme for automated parking in narrow parking spaces (Figure 1).
Trajectory planning is a critical component of autonomous vehicles, generating a collision-free, smooth and kinematically feasible curve for the vehicle's motion. Compared to trajectory planning for on-road autonomous driving, trajectory planning for automated parking faces additional challenges. Specifically: (1) Polynomial-based path planning approaches are unsuitable for automated parking, because parking paths can contain non-differentiable points (cusps where the vehicle reverses direction), whereas polynomial-based methods can only generate smooth path curves. (2) Obstacle avoidance constraints in automated parking are more complex than those encountered in on-road autonomous driving due to the intricate parking environment. (3) Automated parking requires full use of the vehicle's steering capability, in contrast to the relatively smooth curves of on-road autonomous driving [4]. Current research on trajectory planning for automated parking falls mainly into three categories: sample-and-search-based methods, optimization-based methods and neural-network-based methods.
Sample-and-search-based path planning discretizes the continuous state space into a graph of nodes and searches this graph structure for the optimal path linking the start and goal points. Typical node-based search methods with sampling are RRT and RRT*, which can generate a curve to the goal point relatively quickly. However, these methods offer no guarantee of kinematic feasibility, so the resulting trajectories may not be reliable [5]. D. Dolgov et al. proposed the Hybrid A* algorithm, which discretizes and samples the control space to obtain a smooth and kinematically feasible trajectory curve [6]. He et al. proposed a fast anchor-point-based A* path planning algorithm, built on Hybrid A*, for reverse parking path planning in narrow spaces [7]. Sample-and-search-based methods can be applied to a wide range of scenarios. However, balancing computational accuracy against time is a challenge: low sampling resolution can lead to inaccurate path planning, while high sampling resolution can consume excessive computational resources.
Optimization-based trajectory planning algorithms such as Model Predictive Control (MPC) have been applied to unmanned aircraft and autonomous vehicles. In general, the trajectory planning obstacle avoidance problem is NP-hard; optimization-based methods treat obstacle avoidance as an Optimal Control Problem (OCP), which is solved using Nonlinear Programming (NLP) techniques [8]. This category has two branches: soft constraint-based and hard constraint-based optimization. Among soft constraint-based methods, K. Shibata et al. designed an Artificial Potential Field (APF) in the cost function as a soft constraint by considering the obstacle as a point mass, but this cannot guarantee collision avoidance [4]. Among hard constraint-based methods, Zhang et al. proposed the Optimization-Based Collision Avoidance (OBCA) algorithm, which reformulates a smooth obstacle avoidance constraint that accounts for the vehicle geometry [8]. Li et al. proposed a unified OCP model for unstructured parking spaces that uses a triangle-area criterion to construct the collision-avoidance constraints [9]. However, when the parking space is narrow, the feasible domain is greatly restricted by the collision avoidance constraints, and nonlinear programming solvers may fail to compute a feasible solution in finite time.
Neural-network-based methods have been proposed to generate parking trajectories in narrow scenarios while retaining the time-optimality of the trajectory and a low computational overhead. Few studies have applied neural-network-based methods to automated parking trajectory planning [10,11,12,13,14]. Zhang et al. [15] used Deep Deterministic Policy Gradient (DDPG), a deep reinforcement learning method, to solve the perpendicular parking problem. However, this work focuses only on perpendicular parking and does not consider speed planning. Song et al. [16] proposed a method for the parallel parking problem that fuses Nonlinear Programming (NLP) and Monte Carlo Tree Search (MCTS), using NLP to generate data and train a policy neural network offline, then using the policy network to guide MCTS to complete the trajectory planning. These neural-network-based methods can trade off computational accuracy against computational time. However, they generalize poorly to other scenarios, because each network is trained in a single scene.
In the field of Intelligent Transportation Systems (ITSs), Federated Learning (FL) has emerged as a promising approach for collaborative model training without lengthy data transfers or sacrificing user privacy. By leveraging the distributed computing power of participating devices, federated learning enables the collaborative training of a globally shared vehicle AI model, which all participating devices can then use for improved performance and efficiency. In recent years, federated learning has gained significant attention from the research community due to its potential to enhance the effectiveness and privacy of ITSs [17,18]. For privacy protection in connected vehicles, federated learning has been applied in several intelligent transportation systems to achieve traffic flow prediction, parking reservation and traffic edge computing without privacy leakage [19,20,21,22]. Moreover, the application of Federated Reinforcement Learning (FRL) has extended beyond ITSs to a variety of industrial Internet of Things (IoT) applications, including robot task scheduling [23] and robot navigation [24]. FRL has demonstrated improved generalization and faster convergence of neural network models, enabling the efficient and effective training of complex reinforcement learning models across distributed devices.
In this paper, we propose a hierarchical trajectory planning method with deep reinforcement learning in a federated learning scheme, named HALOES, to rapidly and accurately generate collision-free automated parking trajectories in multiple narrow spaces. HALOES accounts for both the computation time and the vehicle kinematic constraints of automated parking trajectory planning in extremely narrow spaces. The contributions of this paper are as follows:
A novel hierarchical trajectory planning method integrating deep reinforcement learning with an optimization-based approach is proposed to achieve a trade-off between computational accuracy and computational time. The method uses a high-level neural network model for rapid reference trajectory generation and a low-level optimization model to refine the trajectory.
A decentralized training scheme is introduced in the model training module to improve the generalization of the model by fusing model parameters trained in different parking scenarios. A federated learning scheme is used in the decentralized deep reinforcement learning to protect the privacy of the vehicles' data during model parameter fusion.
Simulation results demonstrate that the proposed HALOES method enables efficient automated parking in narrow spaces and outperforms other state-of-the-art methods, such as Hybrid A* and OBCA, in terms of planning time, trajectory accuracy and model generalization.
The rest of the paper is organized as follows. Section 2 presents the kinematics of vehicles, the formulation of model predictive trajectory planning and the background of deep reinforcement learning. The details of HALOES are shown in Section 3. In Section 4, the federated learning scheme is presented. Simulations on several narrow-space parking scenarios and detailed analyses of the simulation results are presented in Section 5. Finally, conclusions are drawn in Section 6.
3. Hierarchical Trajectory Planning with Deep Reinforcement Learning
This section introduces the kinematic vehicle model used by HALOES and the overall scheme of the method.
3.1. Kinematic Vehicle Model
This paper employs the Kinematic Single-track (KS) model as the kinematic vehicle model. As shown in Figure 2, in the global coordinate system, the reference point of the vehicle is the rear-wheel axle mid-point $(x, y)$ and the orientation is $\theta$. $v$ represents the current velocity, $\delta$ represents the steering angle, $L_w$ is the wheelbase of the vehicle, $L_r$ is the vehicle rear hang length, $L_f$ is the vehicle front hang length and $L_b$ is the vehicle width. According to the KS model, the state transfer equation of the vehicle is as follows:

$$
\begin{aligned}
x(t + \Delta t) &= x(t) + v(t)\cos\theta(t)\,\Delta t, \\
y(t + \Delta t) &= y(t) + v(t)\sin\theta(t)\,\Delta t, \\
\theta(t + \Delta t) &= \theta(t) + \frac{v(t)\tan\delta(t)}{L_w}\,\Delta t, \\
v(t + \Delta t) &= v(t) + a(t)\,\Delta t, \\
\delta(t + \Delta t) &= \delta(t) + \omega(t)\,\Delta t,
\end{aligned}
\tag{1}
$$

in which $t$ and $\Delta t$ represent the current time and the control interval, respectively. $a(t)$ and $\omega(t)$ are the control profiles, representing the acceleration and the angular velocity of the steering angle at the current time, respectively.
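To make the discrete update in Equation (1) concrete, the following minimal Python sketch implements one forward-Euler step of the KS model. The state ordering, function name and numeric values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def ks_step(state, control, dt, L_w):
    """One forward-Euler step of the kinematic single-track (KS) model.

    state = [x, y, theta, v, delta]: rear-axle position, heading,
    speed and steering angle; control = [a, omega]: acceleration and
    steering rate; L_w is the wheelbase.
    """
    x, y, theta, v, delta = state
    a, omega = control
    return np.array([
        x + v * np.cos(theta) * dt,
        y + v * np.sin(theta) * dt,
        theta + v * np.tan(delta) / L_w * dt,
        v + a * dt,
        delta + omega * dt,
    ])

# Example: accelerate gently while turning the wheel.
z = np.zeros(5)
for _ in range(10):
    z = ks_step(z, control=(1.0, 0.05), dt=0.1, L_w=2.8)
```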
Traditional optimization-based approaches typically use model predictive trajectory planning, which can be considered an OCP and solved via an NLP solver. Let $\mathbf{z}$ be the vehicle state profile and $\mathbf{u}$ the control profile; the details of $\mathbf{z}$ and $\mathbf{u}$ at time $t$ are as follows:

$$
\mathbf{z}(t) = [x(t), y(t), \theta(t), v(t), \delta(t)]^{\mathrm{T}}, \qquad \mathbf{u}(t) = [a(t), \omega(t)]^{\mathrm{T}}.
\tag{2}
$$

Let $k$ represent the discrete time index with time interval $\Delta t$; the state transfer Equation (1) can then be written as

$$
\mathbf{z}_{k+1} = \mathbf{z}_k + f(\mathbf{z}_k, \mathbf{u}_k)\,\Delta t,
\tag{3}
$$

in which $f(\cdot)$ denotes the differential equation of the vehicle state $\mathbf{z}$, which can be written as follows:

$$
f(\mathbf{z}, \mathbf{u}) = \left[\, v\cos\theta, \ v\sin\theta, \ \frac{v\tan\delta}{L_w}, \ a, \ \omega \,\right]^{\mathrm{T}}.
\tag{4}
$$

The finite-horizon OCP with collision avoidance constraint can be written as:

$$
\begin{aligned}
\min_{\mathbf{z}, \mathbf{u}} \quad & J(\mathbf{z}, \mathbf{u}) && \text{(5a)} \\
\text{s.t.} \quad & \mathbf{z}_{k+1} = \mathbf{z}_k + f(\mathbf{z}_k, \mathbf{u}_k)\,\Delta t, \quad k = 0, \dots, N - 1, && \text{(5b)} \\
& \mathbf{z}_0 = \mathbf{z}_{\mathrm{init}}, \quad \mathbf{z}_N = \mathbf{z}_{\mathrm{goal}}, && \text{(5c)} \\
& \operatorname{dist}\big(\mathbb{E}(\mathbf{z}_k), \mathbb{O}\big) > 0, \quad k = 0, \dots, N, && \text{(5d)}
\end{aligned}
$$

in which $N$ represents the prediction horizon and $J$ is the cost function that penalizes deviations from the target point. Equation (5d) is the obstacle avoidance constraint, where $\mathbb{E}(\mathbf{z}_k)$ denotes the region occupied by the ego vehicle in state $\mathbf{z}_k$ and $\mathbb{O}$ the region occupied by the obstacles.
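As an illustration only, the sketch below poses the OCP (5) with CasADi's Opti stack, assuming a quadratic cost and the KS dynamics of Equation (4); the horizon, bounds and goal pose are made-up values, and the obstacle avoidance constraint (5d) is left as a placeholder since its tractable form is derived in Section 3.3.

```python
import casadi as ca

N, dt, L_w = 30, 0.2, 2.8              # horizon, step and wheelbase (assumed)
opti = ca.Opti()
Z = opti.variable(5, N + 1)            # states x, y, theta, v, delta
U = opti.variable(2, N)                # controls a, omega

z_init = ca.DM([0, 0, 0, 0, 0])
z_goal = ca.DM([5, 2, ca.pi / 2, 0, 0])

def f(z, u):                           # KS dynamics, Equation (4)
    return ca.vertcat(z[3] * ca.cos(z[2]), z[3] * ca.sin(z[2]),
                      z[3] * ca.tan(z[4]) / L_w, u[0], u[1])

J = 0
for k in range(N):
    opti.subject_to(Z[:, k + 1] == Z[:, k] + f(Z[:, k], U[:, k]) * dt)  # (5b)
    J += ca.sumsqr(Z[:, k] - z_goal) + 0.1 * ca.sumsqr(U[:, k])         # (5a)
opti.subject_to(Z[:, 0] == z_init)                                      # (5c)
opti.subject_to(Z[:, N] == z_goal)
opti.subject_to(opti.bounded(-1.0, U[0, :], 1.0))   # acceleration bounds
opti.subject_to(opti.bounded(-0.5, U[1, :], 0.5))   # steering-rate bounds
# (5d): collision avoidance constraints would be added here (Section 3.3).
opti.minimize(J)
opti.solver('ipopt')
sol = opti.solve()                     # may fail if the problem is infeasible
```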
3.2. DRL-Based Trajectory Planning for Automated Parking
Automated parking trajectory planning can be modeled as a Markov decision process. This section describes the formulation of DRL for automated parking trajectory planning and the training process of the DRL model.
3.2.1. Setup of the DRL-Based Trajectory Planning
In this section, we define the setup of the DRL framework: the state set $\mathcal{S}$, the action set $\mathcal{A}$ and the reward function $r$.
Automated parking trajectory planning in a narrow space requires constant consideration of the vehicle coordinates, target point coordinates, obstacle location information and vehicle kinematic states such as speed and steering wheel angle. In this paper, we define the observed state at time step $t$ as:

$$
s_t = \big[\, (x_t, y_t, \theta_t), \ (x^g_t, y^g_t, \theta^g_t), \ v_t, \ \delta_t, \ \mathcal{O}_t \,\big],
\tag{6}
$$

in which $(x_t, y_t, \theta_t)$ represents the position and heading of the ego vehicle at time step $t$; $(x^g_t, y^g_t, \theta^g_t)$ denotes the pose of the target point relative to the ego vehicle, where $(x^g_t, y^g_t)$ and $\theta^g_t$ represent the coordinates of the two-dimensional vehicle target point and the heading angle, respectively; $v_t$ and $\delta_t$ denote the speed and steering wheel angle of the ego vehicle at time step $t$, respectively; and $\mathcal{O}_t = \{o^1_t, \dots, o^N_t\}$ is the set of relative positions of the obstacles with respect to the ego vehicle, in which $o^n_t = \big\{(x^{n,1}_t, y^{n,1}_t), \dots, (x^{n,K_n}_t, y^{n,K_n}_t)\big\}$ and $K_n$ is the number of vertices of the $n$th obstacle.
Since the vehicle kinematic characteristics, such as the maximum steering wheel angle and maximum turning radius, must be fully exploited in parking, the action output of the neural network needs to conform to the vehicle kinematic constraints. In this paper, instead of using relative changes in position as the vehicle action set, e.g., $(\Delta x, \Delta y)$, the acceleration and the steering wheel angular rate are used as the action, and the action space is defined as:

$$
a_t = (\hat{a}, \hat{\omega}) \in [-1, 1]^2.
\tag{7}
$$

Normalization of the action space improves the learning efficiency of the neural network. The control inputs to the vehicle are $u_t = (\hat{a} \cdot a_{\max},\ \hat{\omega} \cdot \omega_{\max})$, where $a_{\max}$ and $\omega_{\max}$ are the maximum values of the acceleration and steering wheel rate, respectively.
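As a sketch of how the observed state of Equation (6) and the normalized action of Equation (7) might be assembled in code (the vector layout and the limit values are assumptions, not the paper's exact encoding):

```python
import numpy as np

def build_observation(ego_pose, goal_pose, v, delta, obstacles):
    """Flatten the observed state s_t of Equation (6) into one vector.

    ego_pose = (x, y, theta); goal_pose = target pose relative to the
    ego vehicle; obstacles = list of (K_n, 2) vertex arrays expressed
    relative to the ego vehicle.
    """
    parts = [np.asarray(ego_pose, float), np.asarray(goal_pose, float),
             np.array([v, delta], float)]
    parts += [np.asarray(o, float).ravel() for o in obstacles]
    return np.concatenate(parts)

def scale_action(a_norm, a_max=2.0, omega_max=0.6):
    """Map a network output in [-1, 1]^2 to the control inputs u_t."""
    a_hat, w_hat = np.clip(a_norm, -1.0, 1.0)
    return np.array([a_hat * a_max, w_hat * omega_max])
```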
The reward function in deep reinforcement learning mixes multiple reward terms to guide the convergence of the model. In automated parking trajectory planning, three aspects of reward need to be considered: the duration of parking, the distance to the target point and collision with obstacles. In addition, we use negative rewards to penalize the agent so that it converges faster. The whole reward function is:

$$
r_t = -\big(\lambda_1 r_{\mathrm{time}} + \lambda_2 r_{\mathrm{dist}} + \lambda_3 r_{\mathrm{ori}} + \lambda_4 r_{\mathrm{coll}}\big),
\tag{8}
$$

where $\lambda_i$ is the penalty factor for each term. $r_{\mathrm{time}}$ is the time-term penalty, a fixed value per time step. $r_{\mathrm{dist}}$ is the penalty term for the distance to the target point, which represents the change in the distance of the vehicle from the target point:

$$
r_{\mathrm{dist}} = \sqrt{(x_t - x^g)^2 + (y_t - y^g)^2} - \sqrt{(x_{t-1} - x^g)^2 + (y_{t-1} - y^g)^2}.
\tag{9}
$$

$r_{\mathrm{ori}}$ is the penalty term for the orientation angle to the target point, which represents the change in the orientation of the vehicle relative to the target point:

$$
r_{\mathrm{ori}} = \left| \theta_t - \theta^g \right| - \left| \theta_{t-1} - \theta^g \right|.
\tag{10}
$$
$r_{\mathrm{coll}}$ represents the obstacle collision penalty, which applies a fixed penalty when the vehicle collides with an obstacle. There are two prominent cases in which the vehicle collides with an obstacle: either at least one vertex of the vehicle polygon is located inside the obstacle, or at least one vertex of the obstacle polygon is located inside the vehicle. At time $t$, if neither case occurs, the automated parking trajectory can be considered collision-free. Note that each obstacle polygon should be convex; if a polygonal obstacle is non-convex, it must first be divided into several convex polygons. As shown in Figure 3, the set of vertices of the obstacle is $\mathbb{V}_O = \{P_1, \dots, P_K\}$ and the set of vertices of the vehicle is $\mathbb{V}_E = \{Q_1, \dots, Q_4\}$. Let any point $P \in \mathbb{V}_O$ denote the coordinates of a vertex of the obstacle and any point $Q \in \mathbb{V}_E$ denote the coordinates of a vertex of the vehicle. The collision-free condition can be obtained from the triangular area criterion:

$$
\sum_{i=1}^{K} S_{\triangle}(Q, P_i, P_{i+1}) > S_O \quad \forall\, Q \in \mathbb{V}_E,
\qquad
\sum_{j=1}^{4} S_{\triangle}(P, Q_j, Q_{j+1}) > S_E \quad \forall\, P \in \mathbb{V}_O,
\tag{11}
$$

where $S_{\triangle}$ denotes the triangle area, $S_O$ and $S_E$ denote the area of the polygonal obstacle and the area of the polygonal vehicle, respectively, and the vertex indices wrap around (i.e., $P_{K+1} = P_1$ and $Q_5 = Q_1$). Here, the area of a triangle and the area of a polygonal obstacle can be calculated using the shoelace theorem; let the set of vertices of a convex polygon, in order, be $\{(x_1, y_1), \dots, (x_n, y_n)\}$, where $(x_i, y_i)$ denotes the coordinate of the $i$th vertex; then:

$$
S = \frac{1}{2} \left| \sum_{i=1}^{n} \big( x_i y_{i+1} - x_{i+1} y_i \big) \right|, \qquad (x_{n+1}, y_{n+1}) = (x_1, y_1).
\tag{12}
$$

Thus, $r_{\mathrm{coll}}$ is defined as follows:

$$
r_{\mathrm{coll}} =
\begin{cases}
1, & \text{if a collision occurs at time } t, \\
0, & \text{otherwise.}
\end{cases}
\tag{13}
$$
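The collision test above reduces to a few area computations. The following numpy sketch mirrors the two-case vertex-containment test of Equation (11), using the shoelace formula of Equation (12); both polygons are assumed convex with ordered vertices.

```python
import numpy as np

def shoelace_area(poly):
    """Polygon area via the shoelace formula, Equation (12); poly is (n, 2)."""
    x, y = poly[:, 0], poly[:, 1]
    return 0.5 * abs(np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y))

def tri_area(p, q, r):
    return 0.5 * abs((q[0] - p[0]) * (r[1] - p[1])
                     - (r[0] - p[0]) * (q[1] - p[1]))

def inside_convex(pt, poly, eps=1e-9):
    """Triangle-area criterion: pt lies inside a convex polygon iff the
    triangle fan from pt over all edges sums to the polygon area."""
    n = len(poly)
    fan = sum(tri_area(pt, poly[i], poly[(i + 1) % n]) for i in range(n))
    return fan <= shoelace_area(poly) + eps

def collides(vehicle, obstacle):
    """Two-case vertex test of Equation (11) for convex polygons."""
    return (any(inside_convex(q, obstacle) for q in vehicle) or
            any(inside_convex(p, vehicle) for p in obstacle))
```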
3.2.2. Training Process of DRL Model
The training goal of reinforcement learning is to find the optimal policy function $\pi^*$, parameterized by $\phi$, which maximizes the expected discounted reward of Equation (14):

$$
\pi^* = \arg\max_{\pi_\phi} \, \mathbb{E}_{\pi_\phi} \left[ \sum_{t=0}^{T} \gamma^t r_t \right],
\tag{14}
$$

where $r_t$ is calculated via Equation (8) and $\gamma$ is the discount factor. Deep reinforcement learning methods are mainly divided into off-policy and on-policy methods. Off-policy methods, such as Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC), improve sampling efficiency by maintaining an experience pool, so the model can reuse old data for training. On-policy methods, such as Proximal Policy Optimization (PPO), optimize the same policy as their own decision network, making model updates more stable by directly optimizing the objective function. Trajectory planning in complex environments often involves extensive exploration, making it a computationally challenging task. Off-policy RL methods can reuse previously collected data, making more efficient use of available resources and accelerating the training process; this improves the overall performance and robustness of the learned policies, making off-policy methods well suited to this application. DDPG is an actor-critic, model-free algorithm based on the deterministic policy gradient for learning policies in environments with continuous action spaces. DDPG uses two neural networks, an actor network and a critic network, to update the policy and the value function, respectively. The actor network is updated using the sampled policy gradient:

$$
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q\big(s, a \mid \theta^{Q}\big)\Big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu\big(s \mid \theta^{\mu}\big)\Big|_{s_i},
\tag{15}
$$

where $N$ is the mini-batch size, and $Q(s, a \mid \theta^{Q})$ and $\mu(s \mid \theta^{\mu})$ are the critic network and actor network with parameters $\theta^{Q}$ and $\theta^{\mu}$, respectively. The critic network is updated by minimizing the loss:

$$
L = \frac{1}{N} \sum_{i} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^2, \qquad y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big),
\tag{16}
$$

where $r_i$ is the reward, and $Q'$ and $\mu'$ are the target critic network and target actor network with parameters $\theta^{Q'}$ and $\theta^{\mu'}$, respectively. Equations (15) and (16) together constitute the overall optimization objective of DDPG.
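A compact PyTorch sketch of one DDPG update, matching Equations (15) and (16) with soft target updates; the network sizes, dimensions and learning rates are placeholders, and the full training loop (replay sampling, exploration noise) is given in Algorithm 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

OBS_DIM, ACT_DIM = 16, 2                      # placeholder dimensions

def make_actor():
    return nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(),
                         nn.Linear(64, ACT_DIM), nn.Tanh())

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, actor_t = make_actor(), make_actor()
critic, critic_t = Critic(), Critic()
actor_t.load_state_dict(actor.state_dict())   # targets start as copies
critic_t.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next, gamma=0.99, tau=0.005):
    """One DDPG step on a mini-batch: critic loss of Equation (16),
    sampled policy gradient of Equation (15), soft target updates."""
    with torch.no_grad():                     # TD target y_i
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)             # Equation (16)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()              # Equation (15)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    for net, tgt in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)     # soft update
```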
In Algorithm 1, we use prioritized experience replay, a technique that enables the efficient utilization of experience to accelerate the convergence of the model. Furthermore, the algorithm adds Gaussian noise to the decision-making process, enhancing the model's exploratory power.
Algorithm 1 DDPG-based trajectory planning.
Input: batch size $B$, discount factor $\gamma$, target smoothing coefficient $\tau$, number of training episodes $M$, timesteps of each episode $T$, exponents $\alpha$ and $\beta$ of prioritization sampling, noise variance $\sigma^2$.
1: Initialize the actor network $\mu(s \mid \theta^{\mu})$ and critic network $Q(s, a \mid \theta^{Q})$ with parameters $\theta^{\mu}$ and $\theta^{Q}$.
2: Initialize the target actor network $\mu'$ and target critic network $Q'$ with parameters $\theta^{\mu'} \leftarrow \theta^{\mu}$ and $\theta^{Q'} \leftarrow \theta^{Q}$.
3: Initialize the prioritized replay memory $\mathcal{D}$.
4: for episode $= 1, \dots, M$ do
5:   Reset the environment and receive the initial observation state $s_1$.
6:   for $t = 1, \dots, T$ do
7:     Select action $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$ according to the actor network and exploration noise.
8:     Execute action $a_t$ and obtain the reward $r_t$ and the next observation state $s_{t+1}$.
9:     Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{D}$ with maximal priority.
10:    Sample a mini-batch of $B$ transitions from $\mathcal{D}$ with probability proportional to the priorities raised to the power $\alpha$.
11:    Compute the importance-sampling weights with exponent $\beta$.
12:    Compute the TD-errors.
13:    Update the critic by minimizing the loss (16).
14:    Update the actor using the sampled policy gradient (15).
15:    Update the transition priorities and accumulate the weight-change.
16:    Update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}$.
17:  end for
18: end for
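A minimal sketch of the proportional prioritized replay used in Algorithm 1, assuming a flat array of priorities (real implementations typically use a sum-tree for O(log n) sampling); alpha and beta correspond to the algorithm's prioritization and importance-sampling exponents.

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized experience replay (sketch)."""
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.data, self.prio = [], np.zeros(capacity)
        self.capacity, self.alpha, self.beta, self.pos = capacity, alpha, beta, 0

    def add(self, transition):
        max_p = self.prio.max() if self.data else 1.0   # maximal priority
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.prio[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        p = self.prio[:len(self.data)] ** self.alpha    # P(j) ~ p_j^alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        w = (len(self.data) * p[idx]) ** (-self.beta)   # IS weights
        w /= w.max()
        return idx, [self.data[i] for i in idx], w

    def update(self, idx, td_errors, eps=1e-6):
        self.prio[idx] = np.abs(td_errors) + eps        # new priorities
```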
3.3. Optimization-Based Trajectory Planning
A trajectory from the starting point to the target point is obtained quickly via DRL-based trajectory planning. However, parts of this trajectory may penetrate obstacles, because the DRL stage enforces collision avoidance only through soft penalty terms, so the trajectory must be corrected via the optimization-based method. Because the reference points for the warm start are obtained via DRL-based trajectory planning, the optimization-based trajectory planning with obstacle constraints can be solved quickly. Suppose $\{\tilde{\mathbf{z}}_i\}_{i=1}^{T}$ is the sequence of trajectory points obtained via soft-constraint path planning. It is resampled before being passed to the hard-constrained trajectory planner; this can be upsampling or downsampling, depending on the target accuracy of the trajectory planning. Finally, the sampled trajectory points are obtained as $\{\tilde{\mathbf{z}}_i\}_{i=1}^{M}$, where $M$ is the number of trajectory points after sampling.
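The resampling step can be as simple as linear interpolation over arc length; a sketch (assuming the trajectory is stored as an array whose first two columns are x and y, and ignoring heading wrap-around):

```python
import numpy as np

def resample_trajectory(traj, M):
    """Up- or down-sample a coarse trajectory to M points by linear
    interpolation over arc length; traj is an (n, d) array whose first
    two columns are x and y."""
    seg = np.linalg.norm(np.diff(traj[:, :2], axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])          # arc length
    s_new = np.linspace(0.0, s[-1], M)
    return np.column_stack([np.interp(s_new, s, traj[:, j])
                            for j in range(traj.shape[1])])
```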
In optimization-based trajectory planning, this paper uses the sampled trajectory points as the reference curve that constrains OBCA, which uses the exact vehicle geometry to develop collision avoidance constraints for tight maneuvers in a narrow space. The nonlinear optimization is formulated as follows:

$$
\begin{aligned}
\min_{\mathbf{z}, \mathbf{u}, \lambda, \mu} \quad & \sum_{k=1}^{M} \Big( \big\lVert \mathbf{z}_k - \tilde{\mathbf{z}}_k \big\rVert_{Q}^{2} + \big\lVert \mathbf{u}_k \big\rVert_{R}^{2} \Big) && \text{(17a)} \\
\text{s.t.} \quad & \mathbf{z}_{k+1} = \mathbf{z}_k + f(\mathbf{z}_k, \mathbf{u}_k)\,\Delta t, && \text{(17b)} \\
& \mathbf{z}_0 = \mathbf{z}_{\mathrm{init}}, \quad \mathbf{z}_M = \mathbf{z}_{\mathrm{goal}}, && \text{(17c)} \\
& \underline{\mathbf{u}} \le \mathbf{u}_k \le \overline{\mathbf{u}}, \quad \underline{\mathbf{z}} \le \mathbf{z}_k \le \overline{\mathbf{z}}, && \text{(17d)} \\
& \lambda_k^{(m)} \ge 0, \quad \mu_k^{(m)} \ge 0, && \text{(17e)} \\
& \mathbb{E}(\mathbf{z}_k) = R(\mathbf{z}_k)\,\mathbb{B} + t(\mathbf{z}_k), \quad \mathbb{B} = \{ y : G y \le g \}, \quad \mathbb{O}^{(m)} = \{ y : A^{(m)} y \le b^{(m)} \}, && \text{(17f)} \\
& -g^{\mathrm{T}} \mu_k^{(m)} + \big( A^{(m)} t(\mathbf{z}_k) - b^{(m)} \big)^{\mathrm{T}} \lambda_k^{(m)} > 0, && \text{(17g)} \\
& G^{\mathrm{T}} \mu_k^{(m)} + R(\mathbf{z}_k)^{\mathrm{T}} \big( A^{(m)} \big)^{\mathrm{T}} \lambda_k^{(m)} = 0, && \text{(17h)} \\
& \big\lVert \big( A^{(m)} \big)^{\mathrm{T}} \lambda_k^{(m)} \big\rVert \le 1, && \text{(17i)}
\end{aligned}
$$

where $\mathbf{z}$ and $\mathbf{u}$ are the trajectory and control sequence generated via hard-constraint trajectory planning, respectively, and (17a) is the penalty term for the deviation from the reference trajectory and for the magnitude of the control inputs. Here, $\mathbb{B}$ is the ego vehicle footprint in body coordinates, $R(\mathbf{z}_k)$ and $t(\mathbf{z}_k)$ are the rotation and translation given by the vehicle pose, $\mathbb{O}^{(m)}$ is the $m$th convex obstacle and $\lambda_k^{(m)}, \mu_k^{(m)}$ are dual variables. Equations (17g)–(17i) are the smooth and exact reformulations of the collision avoidance constraint using Theorem 2 in [8].
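For illustration, the sketch below shows how dual constraints of the form (17g)–(17i) could be appended to the CasADi problem from Section 3.1's sketch, for one convex obstacle $\{y : A y \le b\}$ and an ego footprint $\{y : G y \le g\}$; the matrix shapes and the helper interface are assumptions.

```python
import casadi as ca

def add_obca_constraints(opti, Z, A, b, G, g, d_min=1e-3):
    """Append the dual collision avoidance constraints (17g)-(17i) for one
    convex obstacle {y : A y <= b} against the ego footprint {y : G y <= g}.
    Z is the 5 x (N+1) state matrix of the OCP sketch in Section 3.1."""
    n_obs, n_ego = A.shape[0], G.shape[0]
    for k in range(Z.shape[1]):
        lam = opti.variable(n_obs)          # lambda_k >= 0
        mu = opti.variable(n_ego)           # mu_k >= 0
        opti.subject_to(lam >= 0)
        opti.subject_to(mu >= 0)
        th, t = Z[2, k], Z[0:2, k]          # heading and rear-axle position
        R = ca.vertcat(ca.horzcat(ca.cos(th), -ca.sin(th)),
                       ca.horzcat(ca.sin(th), ca.cos(th)))
        # (17g): strictly positive separation between the two polytopes
        opti.subject_to(-ca.dot(ca.DM(g), mu)
                        + ca.dot(ca.mtimes(ca.DM(A), t) - ca.DM(b), lam)
                        >= d_min)
        # (17h): dual feasibility / alignment of the normals
        opti.subject_to(ca.mtimes(ca.DM(G).T, mu)
                        + ca.mtimes(R.T, ca.mtimes(ca.DM(A).T, lam)) == 0)
        # (17i): normalization of the obstacle dual variable
        opti.subject_to(ca.sumsqr(ca.mtimes(ca.DM(A).T, lam)) <= 1)
```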
3.4. Overall Scheme of Hierarchical Trajectory Planning with Deep Reinforcement Learning
The framework of Hierarchical Trajectory Planning with Deep Reinforcement Learning (HTP-DRL) is shown in Figure 4. It is mainly divided into two parts: DRL-based trajectory planning and optimization-based trajectory planning. After obtaining the parking space information and the vehicle's parameters, the reinforcement learning model makes decisions based on the observed state and outputs a coarse trajectory from the start to the goal point, with the kinematic vehicle model used to update the state. In contrast to traditional optimization-based or sampling-based online trajectory planning methods, which incur computational overhead from constant collision detection and nonlinear solving, a rough trajectory from start to finish can be obtained quickly with the reinforcement learning model.
After the rough trajectory is obtained via DRL-based trajectory planning, optimization-based trajectory planning is needed to constrain it at the sampled points and achieve collision-free automated parking. Here, we adopt the OBCA method, which achieves more accurate collision avoidance by using the exact vehicle geometry; since DRL-based trajectory planning has already provided the reference trajectory points, OBCA converges faster, so the automated parking trajectory can be solved in a limited time.
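The two-stage flow can be summarized in a few lines of orchestration code; this is a structural sketch only, where `env`, `actor` and `refine_with_obca` are assumed interfaces standing in for the simulator, the trained policy and the optimization stage, and `resample_trajectory` is the helper sketched in Section 3.3.

```python
import numpy as np

def plan_parking_trajectory(env, actor, refine_with_obca, M=40, max_steps=200):
    """HALOES-style pipeline: DRL rollout -> resample -> OBCA refinement."""
    obs = env.reset()
    coarse = [env.vehicle_state()]
    for _ in range(max_steps):
        action = actor(obs)               # high-level DRL decision
        obs, done = env.step(action)      # KS-model state update
        coarse.append(env.vehicle_state())
        if done:                          # goal reached or episode ended
            break
    reference = resample_trajectory(np.asarray(coarse), M)
    return refine_with_obca(reference)    # warm-started optimization stage
```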
4. Federated Learning-Based Model Training
The integration of communication technology and intelligent terminals has produced intelligent transportation systems aimed at improving the safety and efficiency of transportation. Realizing such systems depends on the widespread application of artificial intelligence technology, which in turn benefits from access to the large amount of real-time activity data generated by traffic participants. Consequently, most AI-based intelligent transportation system solutions rely on centralized learning frameworks hosted in data centers. Model training based on federated learning, by contrast, offers strong privacy protection and low communication overhead. Thus, federated learning plays an important role in privacy-protected, distributed training of reinforcement-learning-based trajectory planning.
The proposed federated learning-based training method for the reinforcement learning model is designed to achieve a privacy-preserving distributed reinforcement learning framework, as shown in Figure 1b. Algorithm 2 presents the reinforcement learning training procedure under the federated learning framework in pseudo-code form. At the start of the training process, the $N$ federated learning participant vehicles download the global model parameters, including the actor and critic network parameters, from the central server node via base stations. Subsequently, each participant trains its own model on the data collected from its respective environment. Afterward, each agent uploads its own model parameters to the central server. The central server node then performs the model aggregation update over all models and distributes the updated model to all participants.
Algorithm 2 The federated learning-based reinforcement learning model training.
Input: number of federated learning participants $N$, parameter aggregation interval $L$, number of training episodes $M$.
1: Initialize the global actor network and global critic network parameters $\theta_g^{\mu}, \theta_g^{Q}$.
2: Initialize the participant actor network and critic network parameters $\theta_n^{\mu}, \theta_n^{Q}$ for $n = 1, \dots, N$.
3: for episode $= 1, \dots, M$ do
4:   Each DRL participant $n$ downloads $\theta_g^{\mu}, \theta_g^{Q}$ from the central server node.
5:   Each DRL participant updates $\theta_n^{\mu}, \theta_n^{Q}$ locally with Algorithm 1 on its own current observations.
6:   if episode mod $L = 0$ then
7:     Each DRL participant uploads its trained model parameters to the central server.
8:     The central server receives all weights and performs federated averaging, $\theta_g \leftarrow \frac{1}{N}\sum_{n=1}^{N} \theta_n$, for both the actor and critic parameters.
9:     Broadcast the averaged model parameters $\theta_g$ to all participants.
10:  end if
11: end for
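The aggregation step in Algorithm 2 is plain federated averaging; a PyTorch sketch of the server-side update follows (the participant interface is assumed).

```python
import torch

def federated_average(state_dicts):
    """FedAvg step of Algorithm 2: element-wise mean of the participants'
    parameters; applied separately to the actor and critic networks."""
    avg = {k: v.detach().clone().float() for k, v in state_dicts[0].items()}
    for sd in state_dicts[1:]:
        for k in avg:
            avg[k] += sd[k].float()
    for k in avg:
        avg[k] /= len(state_dicts)
    return avg

# Server side, every L episodes:
# global_actor = federated_average([p.actor.state_dict() for p in participants])
# for p in participants:
#     p.actor.load_state_dict(global_actor)   # broadcast
```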
6. Conclusions
In this paper, to improve automated parking in narrow spaces, we propose HALOES, a novel hierarchical trajectory planning method that integrates deep reinforcement learning with an optimization-based approach to achieve a trade-off between computational accuracy and computational time. HALOES uses DDPG as its baseline and employs federated learning to train the reinforcement learning models, enabling privacy-preserving training of agents in intelligent transportation systems. In its hierarchical approach to trajectory planning, the output of the reinforcement learning model is used as the reference for a secondary optimization of the trajectory, balancing computational time and accuracy. Extensive experiments have demonstrated that HALOES generalizes well to a wide range of scenarios and achieves the same accuracy as the reinforcement learning algorithm, and evaluations of planning time across a wide range of cases show that HALOES is significantly better than Hybrid A* and OBCA.
The scenarios verified in this paper all contain only static obstacles, so the proposed method is limited when moving obstacles are present. Automated parking trajectory planning under the influence of moving obstacles is a more realistic and complex problem. Therefore, in future work, we will focus on optimizing automated parking trajectory planning when moving obstacles, such as other vehicles and pedestrians, exist in narrow parking environments.