1 Introduction

Recently, with the development of manufacturing technologies, human–cyber–robot systems (HCRS) have received extensive attention, especially in the field of intelligent manufacturing [6, 23, 25, 39]. Traditional man–machine methods cannot adapt to complex and changeable tasks and therefore cannot meet the needs of diversification and individualization. In an advanced HCRS, however, the advantages of humans (such as intelligence and dexterity) and robots (such as fast response and high efficiency) are combined efficiently to achieve a deep integration of human, information, and robot systems. In this context, human–robot skill transmission, in which human skills are transferred to the robot through a generalized description and then reproduced as motion, is currently a hot issue. Human–robot skill transmission via multi-sensor fusion is a significant new topic in human-assisted systems [19, 26, 32, 36]. More specifically, it is promising for controlling a mobile rescue robot to deliver the wounded or materials safely in real-world environments with uncertain scenarios [12]. For example, during an epidemic, transporting an infected patient with a robot is the safest method. In the case of COVID-19, from which the world is currently suffering, mobile robots that can autonomously carry patients to designated locations can reduce infections among medical staff. However, few studies have applied teaching by human demonstration to mobile robots so that they can acquire such skills, and it is particularly interesting to combine the Kinect sensor for this kind of skill transfer. Thus, unlike most applications involving robotic arms, this paper presents human skill transmission to a mobile robot via imitation learning.

Human–robot skill transmission technology is the primary step in transferring skills from humans to robots and determines how people teach robots [10, 18, 30]. According to the interactive interface, common transfer methods can be summarized in three forms: teleoperation-based [31], physical human–robot interaction-based [37], and vision-based [35]. In teleoperation-based human–machine skill transfer, the instructor operates the slave robot through a master device. For example, a robot learning framework via teleoperation with visual interaction was considered in [33], where an extreme learning machine method was used to improve the teaching trajectory for robot motion. Furthermore, in [38], a deep imitation learning scheme with virtual reality headsets was discussed for teleoperated robots under complex manipulation tasks. The advantage of teaching based on remote operation is that it can be used in remote-control scenes and working scenes that are unsuitable for direct contact between the instructor and the robot; however, teleoperation systems often suffer from delay and tremor problems. On the other hand, physical human–robot interaction-based skill transmission, in which the instructor directly contacts the robot, is another commonly used technique [17]. For instance, an optimal physical human–robot interaction approach using machine learning was considered in [1] for humanoid robots. However, physical interaction cannot be applied to clinical scenes involving infectious diseases. Vision-based skill transmission is a simple and effective method: a visual sensor (such as the Kinect camera) captures and tracks human motion information, and learning algorithms then model the motion to obtain a generalized description. For example, a discrete-time method was presented in [22], which utilizes the Kinect camera to capture the joint angles of the teacher's arm during movement and then maps the human joint angles to the joint space of the robot. The vision-based skill transfer method is convenient for teaching: since the human body does not directly contact the robot, the movement of the teacher is not restricted, which is especially important for a mobile robot. Based on this, we use the vision-based method for imitation learning.

On the other hand, there are several advanced technologies for human skill transmission via teaching by human demonstration [5, 34], such as dynamic movement primitives (DMP), the hidden Markov model (HMM), the Gaussian mixture model (GMM) [3], and Gaussian mixture regression (GMR) [4]. For example, the GMM has been introduced into the imitation learning framework of humanoid robots, and the movement information learned by the GMM can be used to dynamically adjust the corresponding actions according to the movement of the target object [11]. Calinon et al. [2] proposed an architecture based on the HMM–GMR model so that the robot can learn human motion skills, where the HMM models the human motion information and GMR regresses the motion control commands. This architecture is similar to GMM–GMR, but it can characterize richer motion information with effective robustness.

In general, DMP has the advantages of a simple model, high computational efficiency, and strong generalization, but it characterizes each movement dimension independently and loses the correlation between dimensions. Furthermore, GMM and HMM cannot distinguish the residence time of each state, so learning the model parameters takes a long time. However, an architecture combining DMP with GMR can represent richer motion information, giving the algorithm stronger robustness and generalization capabilities, and this paper mainly centers on applying this method to the motion representation of mobile robots.

How to combine Kinect sensing with human-like control is the main challenge, because robots and humans need to collaborate effectively. Meanwhile, for the mobile rescue application, it is important to deliver patients and materials without colliding with people or equipment. Therefore, this paper aims to bridge human–robot cooperative control for human activity recognition and robot-assisted systems via teaching by human demonstration. The main contributions of this article are summarized as follows:

  1. The Kinect camera is adopted to recognize human movement points, and then the imitation learning technique is implemented to generate the teaching trajectory for the mobile robot.

  2. Aiming at high tracking accuracy, a model predictive tracking controller is developed to track the learned path, and a recurrent neural network is simultaneously utilized to evaluate the uncertain interaction.

  3. Different from the conventional application of imitation learning to manipulators, this article establishes human-like control for a mobile robot, which enhances the transfer of human skills to the mobile robot.

The remainder of this article is arranged as follows: Sect. 2 investigates the methodology of the human skill transfer framework based on the Kinect sensor, the trajectory regression, and the neural network-based predictive tracking controller. The experimental validation is presented in Sect. 3, and the conclusion is given in Sect. 4.

2 Methodology

2.1 Human movement recognition

The position information of the joint points of the human body in three-dimensional space can be obtained by the Kinect sensor [8], which can accurately capture the three-dimensional coordinates of 25 joint points of the human body in real time. From the three-dimensional joint positions, the angle and distance features of each joint point can be expressed. The recognition process of human motion is presented in Fig. 1: the images captured by the Kinect camera record the human movement; the joints are then extracted and scaled, normalized with respect to scale and position, and finally subjected to posture analysis.

Fig. 1 The recognition process of human motion

First of all, normalizing the standard action features and the real-time action features can improve the calculation speed and accuracy. Considering \({\mathcal {B}}(k)=(x(k), y(k), z(k))\), the depth image points \(D(i, j)\), and the color image \(\vartheta (i, j)\), the probability \({\mathcal {B}}(t \mid \vartheta )\) can be calculated as below:

$$\begin{aligned} {\mathcal {B}}(t \mid \vartheta )=\frac{{\mathcal {B}}(\vartheta \mid t) {\mathcal {B}}(t)}{{\mathcal {B}}(\vartheta )} \end{aligned}$$
(1)

where \({\mathcal {B}}(\vartheta )\) represents the probability of the movement point, and \({\mathcal {B}}(\vartheta | t)\) is the prior color probability.

At certain points, because of occlusion or missing joint details, additional features that provide human shape information are needed to boost the classifier's accuracy. To obtain the frontal 2D image and its profile, orthogonal Cartesian planes are applied to the depth map, and the silhouette is then converted by mapping Cartesian coordinates to polar coordinates [14].

$$\begin{aligned}&{\mathcal {R}}_{i}=\sqrt{\left( x_{i}-x_{j}\right) ^{2}+\left( y_{i}-y_{j}\right) ^{2}} \end{aligned}$$
(2)
$$\begin{aligned}&\theta _{i}=\tan ^{-1} \frac{y_{i}-y_{j}}{x_{i}-x_{j}} \end{aligned}$$
(3)

where \(\left( x_{i}, y_{i}\right) \) denotes the teacher's joint position, \(\left( {\mathcal {R}}_{i}, \theta _{i}\right) \) represents the corresponding polar radius and angle, and \(\left( x_{j}, y_{j}\right) \) denotes the central reference point of the human skeleton.

Finally, the average \({\mathcal {E}}_{{\textit{mean}}}\) of the image sequence over T frames can be computed as below:

$$\begin{aligned} {\mathcal {E}}_{{\textit{mean}}}=\frac{1}{T} \sum _{t=1}^{T} I(x, y, z, t) \end{aligned}$$
(4)
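
As an illustration of how the skeleton stream could be turned into the features of Eqs. (2)–(4), a minimal NumPy sketch is given below; the function names and the choice of the reference joint are illustrative assumptions rather than the exact pipeline used here.

```python
import numpy as np

def polar_features(joints_xy, ref_xy):
    """Map 2D joint positions to the polar features of Eqs. (2)-(3).

    joints_xy : (N, 2) array of joint positions (x_i, y_i)
    ref_xy    : (2,) central reference point (x_j, y_j) of the skeleton
    """
    d = joints_xy - np.asarray(ref_xy)                # offsets from the reference joint
    radius = np.sqrt(d[:, 0] ** 2 + d[:, 1] ** 2)     # R_i, Eq. (2)
    angle = np.arctan2(d[:, 1], d[:, 0])              # theta_i, Eq. (3) (quadrant-aware)
    return radius, angle

def mean_image(frames):
    """Average of T frames I(x, y, z, t), as in Eq. (4)."""
    return np.mean(np.asarray(frames, dtype=float), axis=0)
```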

2.2 Dynamic movement primitive

The Kinect sensor is used to extract human skeleton data and then obtain the distance and angular features. These features can be dynamically matched with the template animation in real time with good recognition performance. At the same time, the method is algorithmically simple and fast, which meets the real-time requirements of pose matching. The next step is to generate the human movement trajectory; hence, we apply the imitation learning technique to encode the teaching position data. The DMP algorithm can be expressed as follows:

$$ \begin{array}{l}{\ddot{{\mathcal {C}}}_{t}=G^{p}\left( g-{\mathcal {C}}_{t}\right) -G^{v} \dot{{\mathcal {C}}}_{t}+F\left( v_{t}\right) } \\ {{\dot{v}}_{t}=-\alpha _{v} v_{t}} \\ {F\left( v_{t}\right) =h_{t}^{\mathrm {T}}\left( v_{t}\right) \omega \left( g-{\mathcal {C}}_{0}\right) }\end{array}$$
(5)

subjected to

$$\begin{aligned}&h_{t}\left( v_{t}\right) =\frac{\sum _{i=1}^{N} \psi _{i}\left( v_{t}\right) v_{t}}{\sum _{i=1}^{N} \psi _{i}\left( v_{t}\right) } \end{aligned}$$
(6)
$$\begin{aligned}&\psi _{i}\left( v_{t}\right) =\exp \left( -\frac{1}{2 \varXi _{i}}\left( v_{t}-c_{i}\right) ^{2}\right) \end{aligned}$$
(7)

where \(\left[ {\mathcal {C}}_{t}, \dot{{\mathcal {C}}}_{t}, \ddot{{\mathcal {C}}}_{t}\right] \) denotes the position, velocity, and acceleration in Cartesian space, \(G^{v}\) represents the damping term, and \(G^{p}\) denotes the structural stiffness. \({\mathcal {C}}_{0}\) represents the initial state, and g denotes the goal point of the mobile robot. Besides, \(\alpha _{v}\) represents the decay factor of the canonical system in the established learning system, and \(\omega \) denotes the corresponding dynamic movement primitive weight. It should be noted that \(v_{t}\) gradually tends to zero. Finally, \(\varXi _{i}\) is the width of the Gaussian basis function and \(c_{i}\) is its center, where i represents the index of the basis function.

In particular, the dynamic movement primitive consists of two components: a linear spring–damper term and a nonlinear term, namely \(G^{p}\left( g-{\mathcal {C}}_{t}\right) -G^{v} \dot{{\mathcal {C}}}_{t}\) and \(F\left( v_{t}\right) \). In order to efficiently obtain the motion representation of the mobile robot, these terms are used to model the teacher's motion process in the imitation learning method. Therefore, the intention of the DMP method is to imitate the teacher's motion toward the goal point g.
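
To make the formulation concrete, a minimal one-dimensional sketch of Eqs. (5)–(7) is given below. The gains, the number of basis functions, and the least-squares fitting of the weights \(\omega \) are illustrative assumptions; the paper itself encodes the demonstrations with GMM/GMR as described in the following subsections.

```python
import numpy as np

class DMP1D:
    """Minimal one-dimensional DMP following Eqs. (5)-(7).

    The gains, the number of basis functions, and the least-squares fitting
    are illustrative assumptions, not the parameters used in the paper.
    """

    def __init__(self, n_basis=20, gp=25.0, gv=10.0, alpha_v=4.0):
        self.gp, self.gv, self.alpha_v = gp, gv, alpha_v
        self.c = np.exp(-alpha_v * np.linspace(0.0, 1.0, n_basis))  # centres c_i
        self.xi = np.full(n_basis, 0.05)                            # widths Xi_i
        self.w = np.zeros(n_basis)                                  # weights omega

    def _psi(self, v):
        return np.exp(-0.5 * (v - self.c) ** 2 / self.xi)           # Eq. (7)

    def _h(self, v):
        psi = self._psi(v)
        return psi * v / (psi.sum() + 1e-10)                        # Eq. (6)

    def fit(self, demo, dt=0.01):
        """Fit the weights to one demonstrated 1-D path (a simplification;
        the paper encodes the demonstrations with GMM/GMR instead)."""
        C = np.asarray(demo, dtype=float)
        dC, ddC = np.gradient(C, dt), np.gradient(np.gradient(C, dt), dt)
        c0, g = C[0], C[-1]
        v = np.exp(-self.alpha_v * dt * np.arange(len(C)))          # canonical variable
        F_target = ddC - self.gp * (g - C) + self.gv * dC           # invert Eq. (5)
        H = np.stack([self._h(vi) for vi in v])
        scale = (g - c0) if abs(g - c0) > 1e-10 else 1e-10
        self.w = np.linalg.lstsq(H * scale, F_target, rcond=None)[0]

    def rollout(self, c0, g, dt=0.01, T=1.0):
        """Integrate Eq. (5) from c0 toward the goal g."""
        C, dC, v, path = c0, 0.0, 1.0, []
        for _ in range(int(T / dt)):
            F = self._h(v) @ self.w * (g - c0)                      # forcing term
            ddC = self.gp * (g - C) - self.gv * dC + F
            dC += ddC * dt
            C += dC * dt
            v += -self.alpha_v * v * dt                             # canonical decay
            path.append(C)
        return np.array(path)
```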

2.3 Gaussian mixture model

The dynamic movement primitive is a movement representation of human activities. Consequently, we apply the Gaussian mixture model to encode the human movement positions recorded by the Kinect sensor. This model is based on statistical probability and computes the probability density distribution of the data set with several Gaussian components. Thus, according to the dynamic movement primitive function presented in (5), the corresponding Gaussian mixture model can be addressed as follows:

$$\begin{aligned} \ddot{{\mathcal {C}}}=\sum _{k=1}^{G} h_{k}\left( G_{k}^{p}\left( \mu _{k}^{{\mathcal {C}}}-{\mathcal {C}}\right) -G_{k}^{v} \dot{{\mathcal {C}}}+F\right) \end{aligned}$$
(8)

where \(G_{k}^{p}\) represents the structural stiffness of the k-th component, and \(G_{k}^{v}\) denotes the corresponding damping term in Cartesian space.

At the same time, the human demonstration in Cartesian space is defined as \(v_{j}=\left( v_{t, j}, v_{{\mathcal {C}}, j}\right) \) \((j=1, \ldots , N)\), including the time component \(v_{t, j}\) and the position component \(v_{{\mathcal {C}}, j}\).

Subsequently, according to the related motion point, we apply the following Gaussian mixture model.

$$\begin{aligned} p\left( v_{j}\right) =\sum _{k=1}^{G} p(k) p\left( v_{j} | k\right) \end{aligned}$$
(9)

Among them, p(k) and \(p\left( v_{j} | k\right) \) represent the prior probability and the conditional probability density, respectively, and k denotes the component index.

Hence, the items of Gaussian mixture model can be addressed as below:

$$\begin{aligned} p(k)=\lambda _{k} \end{aligned}$$
(10)
$$\begin{aligned} p\left( v_{j} | k\right) =\frac{\exp \left( -\frac{1}{2}\left( v_{j}-\mu _{k}\right) ^{T} \varXi _{k}^{-1}\left( v_{j}-\mu _{k}\right) \right) }{\sqrt{8 \pi ^{3}\left| \varXi _{k}\right| }} \end{aligned}$$
(11)

where \(\lambda _{k}\) is the prior probability, \(\mu _{k}\) is the mean, \(\varXi _{k}\) is the covariance, and \(D_{k}\) is the accumulated posterior probability of the k-th component.

Then, according to statistical theory, the related \(D_{k}\) can be calculated as below:

$$\begin{aligned} D_{k}=\sum _{j=1}^{N} p\left( k | v_{j}\right) \end{aligned}$$
(12)

subjected to

$$\begin{aligned} p\left( k | v_{j}\right) =\frac{p(k) p\left( v_{j} | k\right) }{\sum _{m=1}^{G} p(m) p\left( v_{j} | m\right) } \end{aligned}$$
(13)
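
A didactic sketch of one expectation–maximization iteration for Eqs. (9)–(13) is given below, assuming NumPy and a generic data matrix of demonstration points; the initialization strategy and the convergence test are omitted and would have to be supplied in practice.

```python
import numpy as np

def gmm_e_step(V, lam, mu, Xi):
    """Responsibilities p(k | v_j) of Eq. (13) for a data matrix V of shape (N, d).

    lam : (G,) priors lambda_k; mu : (G, d) means; Xi : (G, d, d) covariances.
    """
    N, d = V.shape
    G = lam.shape[0]
    weighted = np.zeros((N, G))
    for k in range(G):
        diff = V - mu[k]
        inv = np.linalg.inv(Xi[k])
        expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
        norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(Xi[k]))
        weighted[:, k] = lam[k] * np.exp(expo) / norm      # lambda_k * p(v_j | k), Eqs. (10)-(11)
    return weighted / weighted.sum(axis=1, keepdims=True)  # Eq. (13)

def gmm_m_step(V, resp):
    """Update priors, means, and covariances from the responsibilities (cf. Eq. (12))."""
    Dk = resp.sum(axis=0)                                  # Eq. (12)
    lam = Dk / V.shape[0]
    mu = (resp.T @ V) / Dk[:, None]
    Xi = np.stack([((V - mu[k]).T * resp[:, k]) @ (V - mu[k]) / Dk[k]
                   for k in range(len(Dk))])
    return lam, mu, Xi
```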

2.4 Gaussian mixture regression

The probability-related coefficients have been obtained through the Gaussian mixture model; the nonlinear part is considered next. The remaining problem is how to effectively obtain the regression term F. After the Gaussian mixture model provides the probabilities, we utilize Gaussian mixture regression to reproduce the motion points, i.e., to estimate \(v_{{\mathcal {C}}}\). Therefore, we have

$$\begin{aligned} \mu _{k}=\left\{ \mu _{t, k}, \mu _{{\mathcal {C}}, k}\right\} \end{aligned}$$
(14)
$$\begin{aligned} \varXi _{k}=\left( \begin{array}{ll} \varXi _{t t, k} &{}\quad \varXi _{t C, k} \\ \varXi _{c t, k} &{}\quad \varXi _{c c, k} \end{array}\right) \end{aligned}$$
(15)

where \(\mu _{k}\) is the mean vector and \(\varXi _{k}\) is the covariance matrix, partitioned into the time and position blocks.

The next step is to calculate the conditional distribution of \(v_{{\mathcal {C}}, k}\); then, we have

$$\begin{aligned} p\left( v_{{\mathcal {C}}, k} | v_{t}, k\right) ={\mathcal {N}}\left( v_{{\mathcal {C}}, k} ; {\hat{v}}_{{\mathcal {C}}, k}, {\hat{\varXi }}_{\mathcal {C C}, k}\right) \end{aligned}$$
(16)
$$\begin{aligned} {\hat{v}}_{{\mathcal {C}}, k}=\mu _{{\mathcal {C}}, k}+\varXi _{{\mathcal {C}} t, k}\left( \varXi _{t t, k}\right) ^{-1}\left( v_{t}-\mu _{t, k}\right) \end{aligned}$$
(17)
$$\begin{aligned} {\hat{\varXi }}_{\mathcal {C C}, k}=\varXi _{\mathcal {C C}, k}-\varXi _{{\mathcal {C}} t, k}\left( \varXi _{t t, k}\right) ^{-1} \varXi _{t {\mathcal {C}}, k} \end{aligned}$$
(18)

where \({\hat{v}}_{{\mathcal {C}}, k}\) and \({\hat{\varXi }}_{{\mathcal {C}} {\mathcal {C}}, k}\) are the conditional mean and covariance determined by the probability distribution.

Finally, the conditional probability density can be defined as follows:

$$\begin{aligned} p\left( v_{{\mathcal {C}}} | v_{t}\right) =\sum _{k=1}^{G} h_{k} {\mathcal {N}}\left( v_{{\mathcal {C}}} ; {\hat{v}}_{{\mathcal {C}}, k}, {\hat{\varXi }}_{\mathcal {C C}, k}\right) \end{aligned}$$
(19)
$$\begin{aligned} h_{k}=\frac{p(k) p\left( v_{t} | k\right) }{\sum _{i=1}^{G} p(i) p\left( v_{t} | i\right) }=\frac{\lambda _{k} {\mathcal {N}}\left( v_{t} ; \mu _{t, k}, \varXi _{t t, k}\right) }{\sum _{i=1}^{G} \lambda _{i} {\mathcal {N}}\left( v_{t} ; \mu _{t, i}, \varXi _{t t, i}\right) } \end{aligned}$$
(20)

Therefore, the movement \(\left\{ {\hat{v}}_{t}, {\hat{v}}_{{\mathcal {C}}}\right\} \) can be generated by estimating \(\left\{ {\hat{v}}_{{\mathcal {C}}}, {\hat{\varXi }}_{{\mathcal {C}} {\mathcal {C}}}\right\} \) at the time step \(v_{t}\), that is,

$$\begin{aligned} {\hat{v}}_{{\mathcal {C}}}=\sum _{k=1}^{G} h_{k} {\hat{v}}_{{\mathcal {C}}, k} \end{aligned}$$
(21)
$$\begin{aligned} {\hat{\varXi }}_{C C}=\sum _{k=1}^{G} h_{k}^{2} {\hat{\varXi }}_{C C, k} \end{aligned}$$
(22)
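
The regression step of Eqs. (16)–(22) can be sketched as follows, under the assumption of a single scalar time input and NumPy arrays for the fitted GMM parameters; the function is an illustration rather than the exact implementation.

```python
import numpy as np

def gmr(v_t, lam, mu, Xi, t_idx=0):
    """Gaussian mixture regression, Eqs. (16)-(22): estimate v_C at time v_t.

    mu : (G, d) means and Xi : (G, d, d) covariances of the joint (t, C) model;
    t_idx marks the (scalar) time dimension, the remaining ones are outputs.
    """
    G, d = mu.shape
    out = [i for i in range(d) if i != t_idx]
    # Mixing weights h_k of Eq. (20), using the marginal Gaussian over time.
    h = np.array([lam[k]
                  * np.exp(-0.5 * (v_t - mu[k, t_idx]) ** 2 / Xi[k, t_idx, t_idx])
                  / np.sqrt(2.0 * np.pi * Xi[k, t_idx, t_idx])
                  for k in range(G)])
    h /= h.sum() + 1e-12
    vC_hat = np.zeros(len(out))
    SCC_hat = np.zeros((len(out), len(out)))
    for k in range(G):
        Stt = Xi[k, t_idx, t_idx]
        SCt = Xi[k][np.ix_(out, [t_idx])]                              # Xi_{Ct,k}
        StC = Xi[k][np.ix_([t_idx], out)]                              # Xi_{tC,k}
        SCC = Xi[k][np.ix_(out, out)]                                  # Xi_{CC,k}
        vCk = mu[k, out] + (SCt / Stt).ravel() * (v_t - mu[k, t_idx])  # Eq. (17)
        SCCk = SCC - SCt @ StC / Stt                                   # Eq. (18)
        vC_hat += h[k] * vCk                                           # Eq. (21)
        SCC_hat += h[k] ** 2 * SCCk                                    # Eq. (22)
    return vC_hat, SCC_hat
```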

2.5 Tracking controller development

The Kinect camera collects the human motion points, and the taught trajectory is then generated with the imitation learning method. The final task is to control the mobile robot to track the generated trajectory using a predictive tracking controller. At the same time, neural networks are used to evaluate the uncertain interaction during the tracking process [7, 20, 24, 27, 28].

Fig. 2 Six-wheeled independent steering model

Figure 2 displays the kinematic model and turning model of the robot. The turning model can be defined as follows:

$$\begin{aligned} R=\frac{W}{2}+\frac{L}{2 \tan \theta _{{\textit{in}}}} \end{aligned}$$
(23)
$$\begin{aligned} R_{{\textit{out}}}=\sqrt{\left( \frac{L}{2}\right) ^{2}+\left( W+\frac{L}{2 \tan \theta _{{\textit{in}}}}\right) ^{2}} \end{aligned}$$
(24)
$$\begin{aligned} R_{{\textit{in}}}=\sqrt{\left( \frac{L}{2}\right) ^{2}+\left( \frac{L}{2 \tan \theta _{{\textit{in}}}}\right) ^{2}} \end{aligned}$$
(25)

where the inner steering angle is denoted by \(\theta _{{\textit{in}}}=\theta _{1}=\theta _{5}\), and the wheelbase and track width are represented by L and W, respectively. The centroid, inner, and outer turning radii are denoted by R, \(R_{{\textit{in}}}\), and \(R_{{\textit{out}}}\), respectively. Thus, the robot speed and yaw rate are described as follows:

$$\begin{aligned} v_{x}=v \cos \varphi \end{aligned}$$
(26)
$$\begin{aligned} v_{y}=v \sin \varphi \end{aligned}$$
(27)
$$\begin{aligned} \omega _{c}=\frac{v}{R}=\frac{2 v \tan \theta _{{\textit{in}}}}{W \tan \theta _{{\textit{in}}}+L} \end{aligned}$$
(28)

where \(\varphi \) and v denote the course angle and the robot speed, respectively. Thus, the steering relationship can be represented as below:

$$\begin{aligned} \tan \theta _{{\textit{out}}}=\frac{\frac{L}{2}}{\frac{W}{2}+R} \end{aligned}$$
(29)
$$\begin{aligned} \theta _{{\textit{out}}}=\arctan \frac{L}{W+2 R} \end{aligned}$$
(30)
$$\begin{aligned} R_{6}=R-W \end{aligned}$$
(31)
$$\begin{aligned} R_{15}=\sqrt{\left( \frac{L}{2}\right) ^{2}+\left( R-\frac{W}{2}\right) ^{2}} \end{aligned}$$
(32)
$$\begin{aligned} R_{24}=\sqrt{\left( \frac{L}{2}\right) ^{2}+\left( W+\frac{L}{2 \tan \theta _{{\textit{in}}}}\right) ^{2}} \end{aligned}$$
(33)
$$\begin{aligned} R_{3}=R+W \end{aligned}$$
(34)

where \(R_{6}\), \(R_{15}\), \(R_{24}\), and \(R_{3}\) represent the turning radii of wheel 6, wheels 1 and 5, wheels 2 and 4, and wheel 3, respectively.
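
For reference, the turning geometry of Eqs. (23)–(34) can be transcribed directly; the following sketch uses the symbols of the text, with parameter values to be supplied by the user.

```python
import numpy as np

def steering_geometry(theta_in, L, W):
    """Turning radii and outer steering angle of Eqs. (23)-(34).

    theta_in : inner steering angle (rad); L : wheelbase; W : track width.
    """
    R = W / 2.0 + L / (2.0 * np.tan(theta_in))                      # Eq. (23)
    R_out = np.hypot(L / 2.0, W + L / (2.0 * np.tan(theta_in)))     # Eq. (24)
    R_in = np.hypot(L / 2.0, L / (2.0 * np.tan(theta_in)))          # Eq. (25)
    theta_out = np.arctan(L / (W + 2.0 * R))                        # Eq. (30)
    wheel_radii = {
        'R6': R - W,                                                # Eq. (31)
        'R15': np.hypot(L / 2.0, R - W / 2.0),                      # Eq. (32)
        'R24': np.hypot(L / 2.0, W + L / (2.0 * np.tan(theta_in))), # Eq. (33)
        'R3': R + W,                                                # Eq. (34)
    }
    return R, R_in, R_out, theta_out, wheel_radii
```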

The kinematic model of BIT-6NAZA mobile robot [21] can be addressed as follows:

$$\begin{aligned} \dot{{\mathcal {Z}}}=\left[ \begin{array}{c}{\dot{x}}_{c} \\ {\dot{y}}_{c} \\ {\dot{\varphi }}_{c}\end{array}\right] =\left[ \begin{array}{cc}\cos \varphi _{c} &{} 0 \\ \sin \varphi _{c} &{} 0 \\ 0 &{} 1\end{array}\right] \left[ \begin{array}{l}v_{c} \\ \omega _{c}\end{array}\right] \end{aligned}$$
(35)

where \((x_{c}, y_{c}, \varphi _{c})\) denote the position along the X- and Y-axes and the course angle, respectively, \(\omega _{c}\) denotes the yaw rate, and \(v_{c}\) represents the linear velocity.

Then, the tracking error function exhibited can be indicated as follows:

$$\begin{aligned} \dot{{\mathcal {Z}}}_{e}&=\left[ \begin{array}{c} {\dot{x}}_{c}-{\dot{x}}_{d} \\ {\dot{y}}_{c}-{\dot{y}}_{d} \\ {\dot{\varphi }}_{c}-{\dot{\varphi }}_{d} \end{array}\right] =\left[ \begin{array}{ccc} 0 &{} 0 &{} -v_{c} \sin \varphi _{d} \\ 0 &{} 0 &{} v_{c} \cos \varphi _{d} \\ 0 &{} 0 &{} 0 \end{array}\right] {\mathcal {Z}}_{e}\nonumber \\&\quad +\left[ \begin{array}{cc} \cos \varphi _{d} &{} 0 \\ \sin \varphi _{d} &{} 0 \\ \frac{2 \tan \delta }{L(1+\tan \delta )} &{} \frac{2 v_{c}}{L(1+\tan \varphi )^{2} \cos ^{2} \delta _{d}} \end{array}\right] u_{e} \end{aligned}$$
(36)

where \((x_{d}, y_{d}, \varphi _{d})\) and \((v_{d}, \delta _{d})\) denote the desired state variables and control variables, respectively, L represents the wheelbase of the robot, and \(\delta \) denotes the turning angle.

Subsequently, the error function can be described as follows:

$$\begin{aligned} \tilde{{\mathcal {Z}}}(k+1)={\mathcal {U}}_{k, t} \tilde{{\mathcal {Z}}}(k)+{\mathcal {K}}_{k, t} {\tilde{u}}(k) \end{aligned}$$
(37)

subjected to

$$\begin{aligned} {\mathcal {U}}_{k, t}=\left[ \begin{array}{ccc} 1 &{} 0 &{} -v_{d} T \sin \varphi _{d} \\ 0 &{} 1 &{} v_{d} T \cos \varphi _{d} \\ 0 &{} 0 &{} 1 \end{array}\right] \end{aligned}$$
(38)
$$\begin{aligned} {\mathcal {K}}_{k, t}=\left[ \begin{array}{cc} {T\cos } \varphi _{d} &{} 0 \\ {T\sin } \varphi _{d} &{} 0 \\ \frac{2 T \tan \delta }{L(1+\tan \delta )} &{} \frac{2 v_{d} T}{L(1+\tan \varphi )^{2} \cos ^{2} \delta _{d}} \end{array}\right] \end{aligned}$$
(39)

where T denotes the sampling period.
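
A direct transcription of the linearized error model of Eqs. (37)–(39) is sketched below; it assumes that the \(\varphi \) appearing in the last entry of Eq. (39) denotes the turning angle \(\delta \), which is our reading of the notation.

```python
import numpy as np

def error_model(v_d, phi_d, delta, delta_d, L, T):
    """Discrete tracking-error matrices U_{k,t} and K_{k,t} of Eqs. (38)-(39).

    v_d, phi_d, delta_d : desired speed, course angle, and turning angle;
    delta : current turning angle; L : wheelbase; T : sampling period.
    """
    U = np.array([
        [1.0, 0.0, -v_d * T * np.sin(phi_d)],
        [0.0, 1.0,  v_d * T * np.cos(phi_d)],
        [0.0, 0.0,  1.0],
    ])                                                   # Eq. (38)
    K = np.array([
        [T * np.cos(phi_d), 0.0],
        [T * np.sin(phi_d), 0.0],
        [2.0 * T * np.tan(delta) / (L * (1.0 + np.tan(delta))),
         2.0 * v_d * T / (L * (1.0 + np.tan(delta)) ** 2 * np.cos(delta_d) ** 2)],
    ])                                                   # Eq. (39)
    return U, K
```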

Considering the objective function in [16], the optimization function can be selected as below:

$$\begin{aligned} {\mathcal {A}}_{d}(k)&=\sum _{i=1}^{P_{p}}\left\| \varPsi (k+i \mid t)-\varPsi _{{\textit{ref}}}(k+i \mid t)\right\| _{{\mathcal {L}}}^{2}\nonumber \\&\quad +\sum _{i=1}^{P_{c}-1}\left\| \varDelta U(k+i \mid t)\right\| _{R}^{2}+\sigma \psi ^{2} \end{aligned}$$
(40)

where \(P_{p}\) and \(P_{c}\) represent the prediction horizon and control horizon, respectively, and \(\sigma \) and \(\psi \) denote the corresponding weighting factor and relaxation variable, respectively.

Finally, the specific constraint of the tracking controller can be defined as below:

$$\begin{aligned} \begin{array}{c} {\left[ \begin{array}{l} -0.4 \\ -25 \end{array}\right] \le u \le \left[ \begin{array}{l} 0.6 \\ 32 \end{array}\right] } \\ {\left[ \begin{array}{l} -0.03 \\ -0.07 \end{array}\right] \le \varDelta U \le \left[ \begin{array}{l} 0.03 \\ 0.07 \end{array}\right] } \end{array} \end{aligned}$$
(41)
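
To illustrate how the cost of Eq. (40) and the box constraints of Eq. (41) could be assembled, a dense single-shooting sketch is given below. The solver (SciPy's SLSQP), the weight matrices, and the omission of the slack term \(\sigma \psi ^{2}\) and of the absolute input bounds are simplifying assumptions, not the controller actually implemented on the robot.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(z_err, U_mat, K_mat, Pp=25, Pc=10, Qw=None, Rw=None):
    """One receding-horizon step for the cost of Eq. (40) under the
    increment bounds of Eq. (41); a dense single-shooting sketch."""
    Qw = np.diag([1.0, 1.0, 0.5]) if Qw is None else Qw   # illustrative state weights
    Rw = np.diag([0.1, 0.1]) if Rw is None else Rw        # illustrative increment weights

    def predict_cost(du_flat):
        du = du_flat.reshape(Pc, 2)
        z, u, J = z_err.copy(), np.zeros(2), 0.0
        for i in range(Pp):
            if i < Pc:
                u = u + du[i]                 # accumulate the control increments
            z = U_mat @ z + K_mat @ u         # propagate the error model, Eq. (37)
            J += z @ Qw @ z                   # tracking-error penalty
            if 0 < i < Pc:
                J += du[i] @ Rw @ du[i]       # increment penalty, second sum in Eq. (40)
        return J

    # Eq. (41): box bounds on the control increments, repeated for every step.
    bounds = [(-0.03, 0.03), (-0.07, 0.07)] * Pc
    res = minimize(predict_cost, np.zeros(2 * Pc), method='SLSQP', bounds=bounds)
    return res.x[:2]                          # apply only the first increment
```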

To successfully reproduce the trajectory taught by human demonstration, the unknown disturbance in the trajectory tracking process of the mobile robot needs to be handled [9, 13, 29]. In this paper, we consider that the uncertainty of the robot system includes two components: internal coupling and external disturbance. Considering the following function \({\mathcal {G}}({\mathcal {K}}): R^{q} \rightarrow R\), the unknown dynamics can be approximated over a compact set as

$$\begin{aligned} {\mathcal {G}}_{n n}\left( {\mathcal {K}}_{i n}\right) ={\mathcal {Q}}^{T} \varTheta \left( {\mathcal {K}}_{i n}\right) \end{aligned}$$
(42)

where \(\varTheta \left( {\mathcal {K}}_{i n}\right) =\left[ \varTheta _{1}\left( {\mathcal {K}}_{i n}\right) , \varTheta _{2}\left( {\mathcal {K}}_{i n}\right) , \ldots , \varTheta _{B}\left( {\mathcal {K}}_{i n}\right) \right] ^{T}\) is the vector of Gaussian basis functions of the hidden layer, \({\mathcal {Q}}=\left[ \xi _{1}, \xi _{2}, \ldots , \xi _{B}\right] \in R^{B}\) is the weight vector, and \({\mathcal {K}}_{i n} \in \varOmega \subset R^{q}\) denotes the input of the neural network.

$$\begin{aligned} \varTheta _{i}\left( {\mathcal {K}}_{i n}\right) =\exp \left[ \frac{-\left( {\mathcal {K}}_{i n}-u_{i}\right) ^{T}\left( {\mathcal {K}}_{i n}-u_{i}\right) }{\eta _{i}^{2}}\right] \end{aligned}$$
(43)

where \(i=1,2, \ldots , B\), \(u_{i}=\left[ u_{i 1}, u_{i 2}, \ldots , u_{i q}\right] ^{T} \in R^{q}\) is the center of the i-th Gaussian function, and \(\eta _{i}\) is its width.

Therefore, we have

$$\begin{aligned} \left\| \varTheta \left( {\mathcal {K}}_{i n}\right) \right\| \le \tau \end{aligned}$$
(44)

where \(\tau \) denotes a positive variable.

$$\begin{aligned} {\mathcal {G}}_{n n}\left( {\mathcal {K}}_{i n}\right) ={\mathcal {Q}}^{* T} \varTheta \left( {\mathcal {K}}_{i n}\right) +\varepsilon \end{aligned}$$
(45)
$$\begin{aligned} {\mathcal {Q}}^{*}=\arg \min _{{\mathcal {Q}} \in R^{B}}\left\{ \sup _{{\mathcal {K}}_{i n} \in \varOmega }\left| {\mathcal {G}}_{n n}\left( {\mathcal {K}}_{i n}\right) -{\mathcal {Q}}^{T} \varTheta \left( {\mathcal {K}}_{i n}\right) \right| \right\} \end{aligned}$$
(46)

where \({\mathcal {Q}}^{*}\) is the ideal weight vector defined over the compact set \(\varPhi _{{\mathcal {K}}_{i n}} \subset R^{q}\), and \(\varepsilon \) is the approximation error satisfying \(\Vert \varepsilon \Vert \le \tau _{c}\).
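
A minimal sketch of the Gaussian-basis approximator of Eqs. (42)–(44) is given below; the gradient-style weight update is an illustrative assumption, since the adaptation law is not stated in this section.

```python
import numpy as np

def rbf_layer(K_in, centres, eta):
    """Gaussian hidden-layer outputs Theta_i(K_in) of Eq. (43).

    K_in : (q,) input vector; centres : (B, q) centres u_i; eta : (B,) widths eta_i.
    """
    diff = K_in - centres
    return np.exp(-np.sum(diff ** 2, axis=1) / eta ** 2)

def nn_output(K_in, centres, eta, Q):
    """Network output Q^T Theta(K_in) of Eq. (42), with Q the (B,) weight vector."""
    return Q @ rbf_layer(K_in, centres, eta)

def adapt_weights(Q, K_in, centres, eta, error, gamma=4e-6):
    """Gradient-style update of the weights driven by the tracking error
    (an illustrative adaptation law, not the rule used in the paper)."""
    return Q + gamma * error * rbf_layer(K_in, centres, eta)
```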

3 Experiment validation

In this part, the experimental demonstration is performed to evaluate the developed imitation learning framework in a real-world application [15], and the experimental environment is presented in Fig. 3. One Kinect sensor (Microsoft Xbox One) is used in the demonstration. The human acts as the leader, and the mobile robot acts as the follower. The primary objective of this investigation is to verify that the mobile robot can effectively reproduce the trajectory learned from human demonstration.

Fig. 3 The experimental environment in the real-world scenario. The Kinect camera is used to detect the human movement; the DMP combined with GMR is then applied to encode the learning trajectory, and the mobile robot finally follows the teaching trajectory

At the same time, the main experimental parameters are set as follows: the mobile robot velocity is 400 r/min; the prediction horizon and control horizon are set to 25 and 10, respectively; the controller sample time is \(T=0.01\,\mathrm{s}\); the weight matrices of the neural networks are initialized as \({\mathcal {Q}}_{1}(0)={\mathbf {0}} \in {R}^{3 l_{1} \times 3}\) and \({\mathcal {Q}}_{2}(0)={\mathbf {0}} \in {R}^{3 l_{2} \times 3}\); the corresponding learning rates of the neural networks are set to 0.000004 and 0.000006, respectively; \(\eta _{i}=1.008\) and \(u_{i}=\left[ \begin{array}{llllll}-2.5&-2.0&0&2.0&3.0&4.0\end{array}\right]. \)

On the other hand, the procedure is as follows: the human demonstrates the same trajectory five times, and the Kinect sensor records the moving trajectory points; the DMP with GMR method is then used to generate the desired trajectory; finally, the robot is controlled to follow the teaching trajectory through the tracking controller. In this case, we set up two teaching tracks: a straight trajectory and a C-shaped trajectory.

Fig. 4 a The regression process of the straight line; b the tracking results of x-error \(x_{e}\), y-error \(y_{e}\), course angle error \(\theta _{e}\), robot velocity \(v_{{\textit{robot}}}\), pitch angle, roll angle, and tracking trajectory in the real world

Fig. 5 The tracking performance of teaching by demonstration under the C-shape trajectory. a The regression process of the C-shape; b the tracking results of x-error \(x_{e}\), y-error \(y_{e}\), course angle error \(\theta _{e}\), robot velocity \(v_{{\textit{robot}}}\), pitch angle, roll angle, and tracking trajectory in the real world

The results are shown in Fig. 4. The Kinect sensor extracts the point information of the human skeleton and then selects the gravity center of the human as the reference point to record the set of track points during the teaching process. Figure 4a displays the regression process of teaching by human demonstration. In order to improve the regression accuracy of the Gaussian model, the trajectory data set is obtained by teaching five times, and the regression method of DMP with GMR is then implemented to generate the teaching trajectory. Finally, the mobile robot is controlled to follow the teaching trajectory using the model predictive tracking controller. Figure 4b shows the tracking results of x-error \(x_{e}\), y-error \(y_{e}\), course angle error \(\theta _{e}\), robot velocity \(v_{{\textit{robot}}}\), pitch angle, roll angle, and tracking trajectory in the real world. It can be concluded that the horizontal and vertical position errors of the robot are basically constrained within \(\pm 0.05\) m, so the control accuracy is high. At the same time, the course angle error is constrained within \(1^{\circ }\), which shows that the proposed neural network can effectively eliminate external interference. Moreover, under the constraints of the predictive controller, the speed of the robot is kept within the range of 400 r/min, and the attitude angles such as the pitch angle and roll angle are also kept within a reasonable range, realizing stable control of the robot.

Furthermore, the teaching and tracking experiment of the C-shaped trajectory is carried out, and the experimental performance is shown in Fig. 5. Similarly, five teaching movements of the human are collected through the Kinect sensor, and the gravity center is selected as the reference point to form a teaching set, as shown in Fig. 5a. The Gaussian model generates the desired trajectory of the robot through a reasonable number of iterations, and the mobile robot achieves human-like trajectory tracking performance by following the teaching trajectory. Figure 5b exhibits the tracking results of x-error \(x_{e}\), y-error \(y_{e}\), course angle error \(\theta _{e}\), robot velocity \(v_{{\textit{robot}}}\), pitch angle, roll angle, and tracking trajectory in the real world. Benefiting from the model predictive controller and the approximation capability of the neural networks, the mobile robot can effectively follow the desired trajectory. The position errors along the x-axis and y-axis are basically constrained within \(\pm 0.06\) m, and the course angle error is controlled within \(2^{\circ }\), which is in line with the expected effect. In addition, the speed control and attitude control of the mobile robot also meet the steady-state error requirement, and no oscillation occurs.

4 Conclusion

This paper studies a human–robot skill transfer method for controlling a mobile robot using learning by demonstration in real-world situations, with an emphasis on material transportation and wounded rescue. Learning by demonstration, which acquires skills from repeated teachings, is used to realize the human-to-robot skill transfer. A skill transmission framework is investigated in which the Kinect camera is used for human activity recognition and to establish the expected route. Furthermore, the dynamic movement primitive approach is used to represent the teaching results, and the learned trajectory is encoded using Gaussian mixture regression. On the other hand, a model predictive tracking controller is studied in order to achieve precise path tracking, where the recurrent neural network is used to compensate for the unknown interaction. Extensive demonstrations highlight the reasonable results in a real-world environment, offering a possible alternative for a mobile robot with human-like skill capacity. In future work, we will consider combining the Internet of Things and multiple sensors to achieve high-performance human activity recognition. At the same time, advanced tracking algorithms such as reinforcement learning control will be investigated to improve control accuracy and real-time performance.