
1 Introduction

Over the past few decades, there has been a significant surge in the digitalization and automation of industrial settings, primarily driven by the adoption of Industry 4.0 principles. At its essence, Industry 4.0 aims to establish a world of interconnected, streamlined, and secure industries, built upon fundamental concepts such as the advancement of cyber-physical systems (CPS) [1,2,3], the Internet of Things (IoT) [4,5,6], and cognitive computing [7].

Computer numerical control machines (CNCs) play a pivotal role in aligning with the principles of Industry 4.0 [8,9,10], facilitating automated and efficient manufacturing of intricate and high-quality products. They have revolutionized various industries such as woodworking, automotive, and aerospace by enhancing automation and precision. By automating industrial processes, CNCs reduce the need for manual labor in repetitive and non-value-added activities, fostering collaboration between machine centers and human operators in factory settings [11]. Moreover, CNCs’ modular design and operational flexibility empower them to perform a wide range of applications with minimal human intervention, ensuring the creation of secure workspaces through built-in security measures. These machines often incorporate advanced sensing and control technologies, optimizing their performance and minimizing downtime.

In parallel with the rapid adoption of CNCs in the market, simulation techniques have evolved to meet the industry’s latest requirements. The emergence of the digital twin (DT) [3] concept has particularly contributed to advancing cyber-physical systems (CPS) by establishing seamless coordination and control between the cyber and physical components of a system [12]. While there is no universally accepted definition of digital twin, it can be understood as a virtual representation of a physical machine or system. DTs offer numerous advantages for controlling and analyzing the performance of physical machines without the need for direct physical intervention. By conducting research and testing on virtual representations instead of physical machines, specialists can experiment and evaluate process performance from technological and operational perspectives, conserving physical resources and avoiding associated costs such as energy consumption, operational expenses, and potential safety risks during the research and development stages [13, 14].

The exponential growth of data generated by machines, coupled with the integration of information from their digital twins [3], has opened up new possibilities for data-driven advancements [12]. These developments leverage state-of-the-art analysis techniques to optimize processes in an adaptive manner. In the realm of robotics and automation, reinforcement learning (RL) has emerged as a foundational technology for studying optimal control. Reinforcement learning [15,16,17], a branch of Artificial Intelligence (AI), revolves around analyzing how intelligent agents should behave within a given environment based on a reward function. The underlying principle of RL draws inspiration from various fields of knowledge, including social psychology. In RL algorithms, intelligent agents interact with their environment, transitioning between states to discover an optimal policy that maximizes the expected value of their total reward function. These algorithms hold tremendous potential for overcoming existing limitations in the control of robotic systems at different length scales [18], offering new avenues for advancements in this field.

A significant challenge in the realm of mobile industrial machinery lies in designing path trajectories that effectively control robot movement [19,20,21,22]. Traditionally, computer-aided design (CAD) and computer-aided manufacturing (CAM) systems are utilized to generate these trajectories, ensuring adherence to security specifications such as maintaining a safe distance between the robot-head and the working piece to prevent product damage. However, these path trajectories often exhibit discontinuities and encounter issues in corners and curves due to the mechanical limitations of the physical machinery. Moreover, factors such as overall distance or the number of movements between two points, as well as the possibility of collisions among the robot’s moving parts, are not efficiently optimized by these systems.

The optimization of path trajectories becomes increasingly complex as the number of movement dimensions and potential options for movement increases. In this context, reinforcement learning emerges as a promising solution for addressing high-dimensional spaces in an automated manner [23, 24], enabling the discovery of optimal policies for controlling robotic systems with a goal-oriented approach. Reinforcement learning algorithms offer the potential to tackle the challenges associated with path trajectory optimization, providing a framework for finding efficient and effective movement strategies for robotic systems [25]. By leveraging reinforcement learning techniques, mobile industrial machinery can navigate complex environments and optimize its trajectories, taking into account multiple dimensions of movement and achieving superior control and performance through the knowledge that the reinforcement learning system progressively builds about its environment.

2 Reinforcement Learning

Reinforcement learning (RL) [15, 16] is widely recognized as the third paradigm of Artificial Intelligence (AI), alongside supervised learning [26, 27] and unsupervised learning [28]. RL focuses on the concept of learning through interactive experiences while aiming to maximize a cumulative reward function. The RL agent achieves this by mapping states to actions within a given environment, with the objective of finding an optimal policy that yields the highest cumulative reward as defined in the value function.

Two fundamental concepts underpin RL: trial-and-error search and the notion of delayed rewards. Trial-and-error search involves the agent’s process of selecting and trying different actions within an initially uncertain environment. Through this iterative exploration, the agent gradually discovers which actions lead to the maximum reward at each state, learning from the outcomes of its interactions.

The concept of delayed rewards [15, 26] emphasizes the consideration of not only immediate rewards but also the expected total reward, taking into account subsequent rewards starting from the current state. RL agents recognize the importance of long-term consequences and make decisions that maximize the cumulative reward over time, even if it means sacrificing immediate gains for greater overall rewards.

By incorporating trial-and-error search and the notion of delayed rewards, RL enables agents to learn effective policies by actively interacting with their environment, continuously adapting their actions based on the feedback received, and ultimately maximizing cumulative rewards.

Reinforcement learning (RL) problems consist of several key elements that work together to enable the learning process. These elements include a learning agent, an environment, a policy, a reward signal, a value function, and, in some cases, a model of the environment. Let’s explore each of these elements:

  1. Learning agent: The learning agent is an active decision-making entity that interacts with the environment. It aims to find the optimal policy that maximizes the long-term value function through its interactions. The specific approach and logic employed by the learning agent depend on the RL algorithm being used.

  2. Environment: The environment is where the learning agent operates and interacts. It can represent a physical or virtual world with its own dynamics and rules. The environment's dynamics are not controlled by the agent; the agent must navigate and adapt to them to maximize rewards.

  3. Policy: The policy determines the behavior of the learning agent by mapping states of the environment to the actions taken by the agent. It can be either stochastic (probabilistic) or deterministic (fixed), guiding the agent's decision-making process.

  4. Reward signal: The reward signal is a numerical value that the agent receives as feedback after performing a specific action in the environment. It represents the immediate feedback obtained during state transitions. The goal of the agent is to maximize the cumulative reward over time by selecting actions that yield higher rewards.

  5. Value function: The value function represents the expected total reward obtained by the agent in the long run, starting from a specific state. It takes into account the sequence of expected rewards by considering the future states and their corresponding rewards. The value function guides the agent in estimating the desirability of different states and helps in decision-making.

  6. Model (optional): In some cases, RL algorithms incorporate a model of the environment. The model mimics the behavior of the environment, enabling the agent to make inferences about how the environment will respond to its actions. In model-free RL algorithms, no model is utilized.

In a typical reinforcement learning (RL) problem, the learning agent interacts with the environment based on its policy. The agent receives immediate rewards from the environment and updates its value function accordingly. This RL framework is rooted in the Markov decision process (MDP) [28], a mathematical framework for sequential decision-making that is widely used in process control.
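To make this interaction concrete, the following minimal Python sketch shows the canonical agent–environment loop; the `env.reset`/`env.step` interface and the `agent.act`/`agent.update` methods are hypothetical placeholders, not part of the systems described later in this chapter.

```python
# Minimal sketch of the RL agent-environment loop (hypothetical interfaces).
# `env` exposes reset() -> state and step(action) -> (next_state, reward, done);
# `agent` exposes act(state) -> action and update(s, a, r, s_next, done).

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                      # initial state of the environment
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # policy: map state -> action
        next_state, reward, done = env.step(action)            # environment transition
        agent.update(state, action, reward, next_state, done)  # learn from feedback
        total_reward += reward               # accumulate the (undiscounted) return
        state = next_state
        if done:                             # terminal state reached
            break
    return total_reward
```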

RL has been proposed as a modeling tool for decision-making in both biological [29] and artificial systems [18]. It has found applications in various domains such as robotic manipulation, natural language processing, and energy management. RL enables agents to learn optimal strategies by exploring and exploiting the environment's feedback. Inverse RL, often formulated using hidden Markov models, is another extensively studied topic in the field. It aims to extract information about the underlying rules followed by a system that generates observable behavioral sequences. Such approaches have been applied in diverse fields including genomics, protein dynamics in biology, speech and gesture recognition, and music structure analysis. The broad applicability of RL and its ability to address different problem domains make it a powerful tool for understanding and optimizing complex systems in various disciplines.

Through iterative interactions, the agent adjusts its policy and value function to optimize its decision-making process and maximize cumulative rewards. Figure 1 illustrates the common interaction flow in an RL problem.

Fig. 1
Reinforcement learning classic feedback loop: the agent observes the state S, takes an action A, transitions from s to s′, and receives the reward Rt

As previously mentioned, the learning agent is situated within and interacts with its environment. The environment's state reflects the current situation or condition, defined over a set of possible states denoted as S. The agent moves between states by taking actions from a set of available actions, denoted as A. Whenever the agent chooses and performs an action a ∈ A, the environment E undergoes a transformation, causing the agent to transition from one state s to another s′, where (s, s′) ∈ S. Additionally, the agent receives a reward r based on the chosen action. The ultimate objective of the agent is to maximize the expected cumulative reward over the long term, which is estimated and re-estimated throughout the learning process so as to incorporate and adapt to the new knowledge acquired.

A significant challenge in RL is striking the right balance between exploration and exploitation. On one hand, it is advantageous for the agent to exploit its existing knowledge gained from past experiences. By selecting actions that have previously yielded high rewards, the agent aims to maximize the cumulative reward over time. On the other hand, exploration is crucial to enable the agent to discover new states and potentially identify better actions, thus avoiding suboptimal policies. Different RL algorithms employ various approaches to address this trade-off.

A fundamental characteristic of MDPs and RL is their ability to handle stochastic influences in the state–action relationship. This stochasticity is typically quantified by a transition function, which represents a family of probability distributions that describe the potential outcomes resulting from an action taken in a particular state.

By knowing the transition function, the agent can estimate the expected outcomes of applying an action in a state by considering all possible transitions and their corresponding probabilities. This analysis allows the agent to assess the desirability or undesirability of certain actions.

To formalize this process, a value function U is defined [16]. The value function assigns a numerical value to each state, representing the expected cumulative reward the agent can achieve starting from that state and following a specific policy. It serves as a measure of the desirability or utility of being in a particular state.

The value function helps guide the agent’s decision-making process by allowing it to compare the potential outcomes and make informed choices based on maximizing the expected cumulative reward over time.

$$ U^{\ast}(s)=\max_{\pi} E\left[\sum_{t=0}^{\infty}\gamma^{t} r_{t}\right] $$

Indeed, the parameter γ in the value function equation is widely referred to as the discount factor. It regulates the importance of future events in the decision-making process, given the delayed nature of rewards: by adjusting the discount factor, one determines the relative significance of immediate rewards compared to future rewards. In the optimal value function equation above, the discount factor discounts future rewards geometrically [15, 16], so rewards obtained further in the future are weighted less than immediate rewards. The specific value and impact of the discount factor depend on the chosen model of optimality, of which three main variants can be considered: finite horizon, infinite horizon, and average reward. Here we focus on the infinite-horizon model, in which future rewards are discounted geometrically.
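As a simple numerical illustration of the discounted objective above, the snippet below computes the discounted return of a finite reward sequence for two values of γ; the numbers are arbitrary and only meant to show how γ weights delayed rewards.

```python
# Discounted return G = sum_t gamma^t * r_t for a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]          # a single delayed reward at t = 3
print(discounted_return(rewards, 0.99))  # ~9.70: the future reward is barely discounted
print(discounted_return(rewards, 0.50))  # 1.25: the future reward is heavily discounted
```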

In the equation, the policy function π represents the mapping from states to actions and serves as the primary focus of optimization for the RL agent. It determines the action to be taken in each state based on the agent’s acquired knowledge. The asterisk (*) symbol signifies the “optimal” property of the function being discussed, indicating that the equation represents the value function associated with the optimal policy.

One can extend the expression of the optimal value by writing the expected value of the reward using the transition function T as

$$ U^{\ast}(s)=\max_{a}\left[r(s,a)+\gamma \sum_{s^{\prime}\in S}T\left(s,a,s^{\prime}\right)U^{\ast}\left(s^{\prime}\right)\right] $$

This is the Bellman equation, which is a fundamental concept in dynamic programming. It encompasses the maximization operation, highlighting the nonlinearity inherent in the problem. The solution to the Bellman equation yields the policy function, which determines the optimal actions to be taken in different states.

$$ \pi^{\ast}(s)=\operatorname{argmax}_{a}\left[r(s,a)+\gamma \sum_{s^{\prime}\in S}T\left(s,a,s^{\prime}\right)U^{\ast}\left(s^{\prime}\right)\right] $$

As mentioned above, this expression returns the action to be applied in each state; once the learning process has converged, it yields the optimal action for every state.
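When the transition function T and the reward r(s, a) are known, the Bellman equation can be solved by iterating it to a fixed point (value iteration). The sketch below is a generic NumPy illustration for a small discrete MDP given as arrays; it is not tied to any particular system in this chapter.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """T[s, a, s'] : transition probabilities, R[s, a] : immediate rewards."""
    n_states, n_actions, _ = T.shape
    U = np.zeros(n_states)
    while True:
        # Q(s, a) = r(s, a) + gamma * sum_{s'} T(s, a, s') * U(s')
        Q = R + gamma * T @ U          # shape (n_states, n_actions)
        U_new = Q.max(axis=1)          # Bellman optimality backup
        if np.max(np.abs(U_new - U)) < tol:
            break
        U = U_new
    policy = Q.argmax(axis=1)          # pi*(s) = argmax_a Q*(s, a)
    return U, policy
```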

The value function within an MDP can also be expressed, or summarized, in a matrix that stores the value associated with an action a in a given state s. This matrix is typically called the Q-matrix and is represented by

$$ U^{\ast}(s)=\max_{a}Q^{\ast}(s,a) $$

so that the Bellman equation becomes

$$ Q^{\ast}(s,a)=r(s,a)+\gamma \sum_{s^{\prime}\in S}T\left(s,a,s^{\prime}\right)\max_{a^{\prime}}Q^{\ast}\left(s^{\prime},a^{\prime}\right) $$

This formulation opens the door to what is commonly known as Q-learning. It is important to note that the system's model, also known as the transition function, may be either known or unknown. In model-based Q-learning the model is known, while in model-free Q-learning it is not. When dealing with an unknown model, the temporal-difference approach has proven to be an effective tool for tackling strategy-search problems in actual systems. In this approach, the agent is not required to possess prior knowledge of the system's model. Instead, information is propagated after each step, eliminating the need to wait until the conclusion of a learning episode. This characteristic makes the method more feasible to implement in real robotic systems.
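A minimal tabular Q-learning update (model-free, temporal-difference) could look like the following sketch; the learning rate alpha and the state/action indexing are illustrative assumptions.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.95):
    """One temporal-difference update of the Q-matrix (no model of T required)."""
    target = r if done else r + gamma * np.max(Q[s_next])  # bootstrap from the best next action
    Q[s, a] += alpha * (target - Q[s, a])                  # move the estimate toward the target
    return Q
```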

2.1 Toward Reinforcement Learning in Manufacturing

Teaching agents to control themselves directly from high-dimensional sensory inputs, such as vision and speech, has long been a significant challenge in RL. In many successful RL applications in these domains, a combination of hand-crafted features and linear value functions or policy representations has been utilized. It is evident that the performance of such systems is heavily dependent on the quality of the feature representation employed.

In recent years, deep learning has witnessed significant progress, enabling the extraction of high-level features directly from raw sensory data. This breakthrough has had a transformative impact on fields such as computer vision and speech recognition. Deep learning (DL) techniques leverage various neural network architectures such as convolutional networks, multilayer perceptrons, restricted Boltzmann machines, and recurrent neural networks. These architectures have successfully employed both supervised and unsupervised learning approaches. Given these advancements, it is reasonable to inquire whether similar techniques can also benefit reinforcement learning (RL) when dealing with sensory data.

The advancements in deep learning (DL) [30] have paved the way for deep neural networks (DNNs) to automatically extract compact representations (features) from high-dimensional data. This capability is particularly useful for mitigating the curse of dimensionality, commonly encountered in domains such as images, text, and audio. DNNs possess powerful representation learning properties, enabling them to learn meaningful features from raw data. Deep reinforcement learning (DRL) [31] refers to a class of RL algorithms that leverage the representation learning capabilities of DNNs to enhance decision-making abilities.

The algorithm framework for DNN-based RL is illustrated in Fig. 2. In DRL, the DNN plays a crucial role in extracting relevant information from the environment and inferring the optimal policy in an end-to-end manner. Depending on the specific algorithm employed, the DNN can be responsible for outputting the Q-value (value-based) for each state–action pair or the probability distribution of the output action (policy-based). The integration of DNNs into RL enables more efficient and effective decision-making by leveraging the power of representation learning.

Fig. 2
Representation of the deep reinforcement learning (DRL) feedback loop
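As a minimal illustration of the two variants mentioned above (value-based vs. policy-based outputs), the PyTorch sketch below defines one small network head for each; the layer sizes are arbitrary and unrelated to the architectures used in Sect. 3.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Value-based head: outputs one Q-value per discrete action."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))        # Q(s, a) for every action a

    def forward(self, state):
        return self.net(state)

class PolicyNetwork(nn.Module):
    """Policy-based head: outputs a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1))  # pi(a | s)

    def forward(self, state):
        return self.net(state)
```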

This scenario has made classic problems in manufacturing far more accessible and tractable to RL approaches. In the following section, we present two cases where RL has migrated to DRL to address specific problems within manufacturing environments.

3 Deep Reinforcement Learning in Virtual Manufacturing Environments

In this section, we present two distinct examples that demonstrate the advancements made in RL, specifically in the context of deep reinforcement learning (DRL). These examples involve the application of RL within a virtual environment, which allows for the development of strategies that can later be translated to real systems. This approach opens up the possibility of deploying this technology in manufacturing environments.

One key advantage of utilizing virtual environments is that it mitigates the large number of learning episodes typically required by RL agents to develop an optimal policy. In a real manufacturing system, the time needed to explore numerous strategies would make the process highly inefficient for reaching an optimal solution. Moreover, certain strategies may introduce risks, such as safety concerns, that cannot be easily managed or assumed in a manufacturing environment. For instance, in the second example, robotic systems operating collaboratively may pose safety risks, while in the first example, machines with high power consumption may introduce operational risks.

By leveraging virtual environments [32, 33], RL techniques can be effectively applied to develop optimal strategies while minimizing risks and reducing the time and costs associated with experimentation in real systems. This approach enables the integration of RL technology into manufacturing environments, paving the way for enhanced efficiency, productivity, and safety [34,35,36]. Considering these issues, the development of digital environments (such as simulation platforms or digital twins) has been taken as the ideal scenario to train RL agents until the systems reach a certain level of convergence or, in other words, trustworthiness. Once certain strategies have reached a reasonable point, they can be tested in the real scenario, and even a reduced optimization process can take place at that point to finally find the optimal strategy in the real context.

In this sense, the quality of the virtual context, or its divergence with respect to the real process, becomes critical for the achievement of an optimal strategy later in the real world. However, the optimal digitalization of processes is out of the scope of this chapter.

The two scenarios presented in this section address the optimization of two different systems. The first is the optimization of trajectories in a CNC cutting machine designed for different operations on large wood panels; the problem in this case is the optimization of the path between two different operations (cutting, drilling, milling, etc.). The second scenario addresses the robotic manipulation of a complex material; in particular, the manipulation of fabric by two robotic arms in order to reduce wrinkles. These problems have been addressed within specifically developed digital environments to deliver optimal strategies that are later tested in the real system.

3.1 CNC Cutting Machine

The digital twin (DT) of the physical CNC presented here was developed and shared by one of the partners in the MAS4AI Project (GA 957204), within the ICT-38 AI for manufacturing cluster, for the development of the RL framework. The DT is built on X-Sim, and it incorporates the dynamics of the machine, its parts, and the effects on the working piece, simulating the physical behavior of the CNC. The CNC of our study was a machining center for woodworking processes, more specifically for cutting, routing, and drilling.

The machine considered (not shown for confidentiality reasons) consists of a working table, on which wood pieces are located and held throughout the woodworking process, and the robot head of the CNC, which is responsible for performing the required actions to transform the raw wood piece into the wood product (Fig. 3).

Fig. 3
Movements of the robot head along its five axes

The model-based machine learning (ML) agent developed in this context aims at optimizing the path trajectories of the five-axes head of a CNC within a digital twin environment. Currently, the DT CNC enables the 3D design of the wood pieces by users, creating all the necessary code to control the physical CNC.

Controlling a five-axes head automatically in an optimized way is still an engineering challenge. The CNC must not only perform the operation required to transform the wood piece into the desired product; it must also avoid potential collisions with other parts of the machine, control the tools required by the head of the robot, and keep operation times short by reducing unnecessary movements to enhance productivity and save energy, all while ensuring machine integrity, the safety of operators throughout the process, and high-quality products.

In this context, a model-based ML agent based on a DRL framework was trained to optimize the path trajectories of the five-axes head of the CNC in order to avoid potential collisions while optimizing the overall operation time. The difficulty of working in a five-dimensional space (due to the five-axis robot head) is increased by the dimensions of the working table of the CNC, which measures up to 3110 × 1320 mm. In the DT environment, the measurement scale is micrometers, resulting in more than 4,105,200 million states to be explored by the agent in a discrete state–action space in the XY plane of the board alone, without considering the extra three axes of the robot head. This complexity makes discrete approaches impractical, which is why only a continuous action space using Deep Deterministic Policy Gradient (DDPG) [37] is shown here.

The model-based AI CNC agent was trained to work considering different operations. The ultimate goal of the agent is to optimize the path trajectories between two jobs in a coordinated manner, considering the five axes of the CNC head. For this reason, the inputs of the model are the coordinates of the initial location, i.e., the state of the five-axes head, and the destination location or a label representing the desired operation to be performed by the CNC. The agent returns the optimized path trajectory to reach the goal destination by means of a set of coordinates representing the required movements of the five-axes head. Currently, the agent has been trained separately to perform each operation independently. In a future stage, a multi-goal DRL framework will be explored in order to enhance generalization.

Different operations and different learning algorithms were explored during the development of the deep RL framework, including 2-D, 3-D, and 5-D movements of the five-axes head of the CNC, different path trajectories to be optimized, and different learning algorithms including Q-learning [15, 38], deep Q-learning (DQL) [39], and DDPG.

As seen previously, Q-learning is a model-free, off-policy RL algorithm that seeks to find an optimal policy by maximizing a cost function that represents the expected value of the total reward over a sequence of steps. It is used in finite Markov decision processes (stochastic, discrete), and it learns an optimal action-selection policy by identifying the set of optimal actions that the agent should take in order to maximize the total reward (Rt). The algorithm is based on an agent, a set of actions A, a set of states S, and an environment E. Every time the agent selects and executes an action a ∈ A, the environment E is transformed, and the agent transitions from one state, s, to another, s′, with (s, s′) ∈ S, receiving a reward r according to the action selected.

DDPG is an off-policy algorithm that simultaneously learns a Q-function and a policy based on the Bellman equation in continuous action spaces. DDPG makes use of four neural networks, namely an actor, a critic, a target actor, and a target critic. The algorithm is based on the standard “actor” and “critic” architecture [40], although the actor directly maps states to actions instead of outputting a probability distribution over a discrete action space.

In order to solve the problem of exhaustively evaluating all possible actions from a continuous action space, DDPG learns an approximator to Q(s, a) by means of a neural network, the critic Qθ(s, a), with θ corresponding to the parameters of the network (Fig. 4).

Fig. 4
Actor–critic architecture workflow [41]
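A possible PyTorch definition of the critic Qθ(s, a) and the deterministic actor μϕ(s) is sketched below; the hidden sizes and the tanh scaling to the action bounds are assumptions, not the exact networks used for the CNC agent.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q_theta(s, a): state and action are concatenated and mapped to a scalar value."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    """mu_phi(s): deterministic mapping from state to a bounded continuous action."""
    def __init__(self, state_dim, action_dim, max_action, hidden=256):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())   # outputs in [-1, 1]

    def forward(self, state):
        return self.max_action * self.net(state)        # scale to the action bounds
```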

Qθ learns from an experience replay buffer that serves as a memory for storing previous experiences. This replay buffer contains a set D of transitions, each comprising the initial state (s), the action taken (a), the obtained reward (r), the new state reached (s’), and whether the state is terminal or not (d). In other words, each transition (s, a, r, s’, d) is an element of the set D.
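A simple replay buffer storing the (s, a, r, s′, d) transitions described above could be implemented as follows; the fixed capacity and uniform sampling are standard choices rather than details reported for this work.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s, a, r, s_next, done) and samples uniform minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = zip(*batch)    # regroup by field
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)
```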

In order to evaluate the performance of Qθ in relation to the Bellman equation, the mean-squared Bellman error (MSBE) can be computed. The MSBE quantifies the discrepancy between the estimated Q-values produced by Qθ and the values predicted by the Bellman equation. It is typically calculated by taking the mean squared difference between the Q-value estimate and the expected Q-value, using the current parameters θ. The MSBE provides a measure of how well the Q-function approximated by Qθ aligns with the optimal Q-values as defined by the Bellman equation. Minimizing the MSBE during training helps the DRL algorithm converge toward an optimal Q-function approximation.

$$ L=\frac{1}{N}\sum \left({Q}_{\theta}\left(s,a\right)-y\left(r,{s}^{\prime },d\right)\right)^2 $$
$$ y\left(r,{s}^{\prime },d\right)=r+\gamma \left(1-d\right){Q}_{\theta \textrm{target}}\left({s}^{\prime },{\mu}_{\phi \textrm{target}}\left({s}^{\prime}\right)\right) $$

where the Qθtarget and μϕtarget networks are lagged versions of the Qθ (critic) and μϕ (actor) networks, introduced to mitigate the instability of minimizing the MSBE caused by interdependencies among parameters. Hence, the critic network is updated by performing gradient descent on the loss L. The actor policy, in turn, is updated using sampled policy-gradient ascent with respect to the policy parameters by means of:

$$ {\nabla}_{\phi}\frac{1}{N}\ \sum_{s\in D}{Q}_{\theta}\left(s,{\mu}_{\phi }(s)\right) $$

Finally, the target networks are updated by Polyak averaging their parameters over the course of training:

$$ {\displaystyle \begin{array}{l}{\theta}_{Q\textrm{target}}\leftarrow {\rho \theta}_Q+\left(1-\rho \right){\theta}_{Q\textrm{target}}\\ {}{\phi}_{\mu \textrm{target}}\leftarrow {\rho \phi}_{\mu }+\left(1-\rho \right){\phi}_{\mu \textrm{target}}\end{array}} $$
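Putting the three preceding equations together, one DDPG update step could be sketched as below in PyTorch; it assumes the Actor/Critic modules and replay buffer sketched earlier and omits details such as moving tensors to a device. The hyperparameter values are illustrative only.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, gamma=0.99, rho=0.005):
    # Batch tensors; r and d are column tensors of shape (batch, 1) so that
    # they broadcast against the critic output.
    s, a, r, s_next, d = batch

    # Critic update: minimize the mean-squared Bellman error (MSBE).
    with torch.no_grad():
        a_next = actor_t(s_next)                              # mu_phi_target(s')
        y = r + gamma * (1 - d) * critic_t(s_next, a_next)    # Bellman target
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: gradient ascent on Q_theta(s, mu_phi(s)),
    # implemented as descent on its negative.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks: Polyak averaging, target <- rho * online + (1 - rho) * target,
    # with a small rho so the targets lag behind the online networks.
    with torch.no_grad():
        for p, p_t in zip(critic.parameters(), critic_t.parameters()):
            p_t.copy_(rho * p + (1 - rho) * p_t)
        for p, p_t in zip(actor.parameters(), actor_t.parameters()):
            p_t.copy_(rho * p + (1 - rho) * p_t)
```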

During training, uncorrelated, mean-zero Gaussian noise is added to the actions to enhance the agent’s exploration. The pseudocode for training a DDPG algorithm is given in Table 1.

Table 1 Pseudocode for training a DDPG algorithm

The CNC AI agent has a multi-goal nature within the environment. First, the agent shall learn not to collide with other machine parts; a collision corresponds to a penalty of −1000 and the reset of the environment. Second, the agent shall learn how to reach a goal destination by exploring the environment, which has an associated reward of +500 and also causes the reset of the environment. Third, the agent shall optimize its policy considering the operational time and the quality of the path. The operational time is calculated based on the distance that the robot head needs to travel to reach the goal destination following the proposed path. The quality of the path is calculated based on the number of actions needed to reach the destination. These two aspects are favored by an extra reward of +100 to the agent (Fig. 5).

Fig. 5
Information flow in the DDPG framework during the training phase
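Under the reward structure just described, a per-step reward function could be sketched as follows; the helper flags (collision and goal checks) and the exact criterion used to grant the +100 bonus for time and path quality are hypothetical stand-ins for the project’s internal logic, with only the numerical values taken from the text.

```python
def cnc_reward(collided, goal_reached, path_length, n_actions,
               length_budget, action_budget):
    """Reward shaping for the CNC path agent (hedged sketch; helpers are hypothetical)."""
    if collided:
        return -1000.0, True            # collision: penalty and episode reset
    if goal_reached:
        reward = 500.0                  # reaching the goal region
        if path_length <= length_budget and n_actions <= action_budget:
            reward += 100.0             # extra bonus for a short, low-action-count path
        return reward, True             # reaching the goal also resets the environment
    return 0.0, False                   # intermediate step: no terminal reward
```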

Figure 6 shows four exemplary paths found by the agent in a 2-D movement, for ease of visualization. In this problem, the agent needs to learn how to move from right to left without colliding with any machine parts. Since the action space is continuous, the goal destination does not correspond to a specific coordinate but to a subspace of the environment. The circles drawn on the paths represent the subgoal coordinates proposed by the agent (each action corresponds to a new coordinate to which the robot head is moved). From the figure, it can be seen that the yellow and pink trajectories comprise more actions and a longer path than the blue and green trajectories. Although the latter two contain the same number of actions (three movements), the green trajectory requires a shorter path and is thus preferred.

Fig. 6
Exemplary trajectories, shown in a 2-D movement for simplicity. Circles on the paths correspond to the sets of coordinates obtained after performing the actions proposed by the framework in the environment. The total length of the trajectory and the number of actions are considered in the reward function. The red square corresponds to the starting point, while the red rectangle corresponds to the final region (target)

3.2 Robotic Manipulation of Complex Materials

The challenge of manipulating complex materials involves the identification of measurable quantities that offer insights into the system, which can then be utilized to make informed decisions and take appropriate actions [42, 43]. This essentially involves combining a perception system with a decision-making process. The Markov decision process (MDP) framework, as described earlier, is well-suited to address the task of defining optimal strategies for material manipulation. RL is particularly suitable for this purpose due to its probabilistic nature. RL accommodates the inherent uncertainty associated with characterizing the state of complex materials, which often presents challenges in traditional approaches.

The first step for the robotic manipulation of a fabric is the definition of the information required to perform the optimal actions for the manipulation of the material [44,45,46]. For that purpose, the state of the system needs to be characterized, a prediction needs to be made to infer the next state of the system under the application of a given action, and a criterion needs to be chosen to decide which action to take for a given target.

To address these points, on the one hand, a procedure for the generation of synthetic data has been deployed to automatically generate thousands of synthetic representations of a fabric and their transitions under the application of certain actions. On the other hand, in order to exploit such a tool and build a data-driven solution, a neural network has been developed and trained. Given that in the real scenario for this work a point-cloud camera was used to detect the fabric, the entropy of the point cloud has been calculated and taken as a reference magnitude to evaluate the goodness of a transition in terms of wrinkledness reduction.

In order to quantify the amount of knowledge in the system, we use entropy as a measurement of the information of the system. Using entropy maps, the wrinkledness of the fabric has been characterized. These entropy maps are calculated from the distributions of normal vectors within local regions, using the classic form of the information entropy of a distribution, as follows:

$$ H(X)=-\sum_{i=1}^np\left({x}_i\right)\log p\left({x}_i\right) $$

Entropy is usually thought of as a measurement of how much information a message or distribution holds, in other words, how predictable such a distribution is. In the context of the work presented here, the entropy gives an idea of the distribution of the orientations of the normal vectors within a given area of points, taken with respect to a reference one.
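A minimal NumPy sketch of this entropy computation is given below: the normal-vector orientations within a local patch are histogrammed and the discrete Shannon entropy of that distribution is computed. The binning scheme and patch extraction are illustrative assumptions, not the exact implementation used in the project.

```python
import numpy as np

def patch_entropy(normals, n_bins=16):
    """Shannon entropy of the orientation distribution of a patch of unit normal vectors.

    normals: array of shape (N, 3); a flat patch (aligned normals) yields low entropy,
    a wrinkled patch (spread-out normals) yields high entropy.
    """
    # Spherical angles of each normal (azimuth phi, inclination theta).
    phi = np.arctan2(normals[:, 1], normals[:, 0])
    theta = np.arccos(np.clip(normals[:, 2], -1.0, 1.0))

    # 2-D histogram of orientations -> empirical probability distribution p(x_i).
    hist, _, _ = np.histogram2d(phi, theta, bins=n_bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]                               # drop empty bins (0 log 0 := 0)

    return -np.sum(p * np.log(p))              # H(X) = -sum_i p(x_i) log p(x_i)
```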

In order to address the massive complex manipulation of fabrics to reduce wrinkles over the surface, a specific digital environment has been developed: the clothsim environment, made available as open access as part of the results of the MERGING Project. The detailed description of the simulation can be found in its own repository and is out of the scope of this chapter.

The clothsim environment has been used for the initial random configuration of the cloth and, later, for the training of the system. The learning routine suggests different actions considering the Q-values (Q-matrix) through an argmax function. After the application of the actions, clothsim returns the transition of the fabric, and the values of the Q-matrix are updated following a Q-learning update rule, where a reward function establishes how good or bad the action was according to the entropy calculated for each state. The DQL model has been built on a ResNet18 architecture as a backbone, with an input shape of (224, 224, 4), and the RL has been set with the following characteristic parameters: the gamma factor was initially set to 0.75, and epsilon was decayed from 1 to 0.3 over 8000 steps. During the whole learning process, a log file captures the global status of the knowledge acquired by the system.
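A hedged sketch of such a Q-network and of the linear epsilon decay is shown below; adapting the first convolution to four input channels and flattening the 6 × 6 Q-matrix into 36 outputs are reasonable interpretations of the stated configuration, not code from the project.

```python
import torch.nn as nn
from torchvision.models import resnet18

def build_q_network(n_actions=36):  # the 6 x 6 Q-matrix flattened to 36 outputs
    net = resnet18(weights=None)    # recent torchvision API; older versions use pretrained=False
    # Replace the stem so the network accepts 4-channel inputs
    # (the (224, 224, 4) observations, channels-first in PyTorch).
    net.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # One Q-value per (corner, direction) combination.
    net.fc = nn.Linear(net.fc.in_features, n_actions)
    return net

def epsilon(step, eps_start=1.0, eps_end=0.3, decay_steps=8000):
    """Linear decay of the exploration rate, as described in the text."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```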

Figure 7 shows an extended example of fabric manipulation in such a virtual environment (clothsim) in order to reduce the wrinkledness, represented by the entropy, which is estimated based on the phi, theta, and z components of the normal vectors of the points.

Fig. 7
Example of fabric manipulation using the developed environment for training under a complex manipulation task. The red and blue points indicate grab and pull points, respectively

The images show the status of the fabric during the application of actions selected based on the values of the Q-matrix. These values are developed during the training procedure, where the system tries to establish, for a given state, which action drives the maximum entropy reduction. This means that the entropy is the metric used during the whole process for the minimization of wrinkles in the fabric (Fig. 8).

Fig. 8
Different examples of the Q-matrix, which is estimated for each state in order to drive an action according to the position of its maximum value. The minimum values correspond to actions producing a very small displacement of the corners, and therefore a very low reduction of the entropy, which is why they show low Q-values

In order to select the actions, the knowledge of the system is encoded in the classic Q-matrix, which is inferred by the system for a given state. This codification is done using a 6 × 6 matrix that considers the corners to manipulate and the directions that can be taken within the fabric manipulation. The final outcome of the procedure is a set of three points: one static point to fix the fabric, a second point that represents the corner to be manipulated, and a third point that represents the location where this corner has to be placed (grab point, pull point initial coordinates, pull point final coordinates). To decide which action to take, the system evaluates the Q-matrix and selects the action that corresponds to the maximum value in the matrix through an argmax function. The Q-values are updated using a reward function that is positive when the entropy decreases and negative otherwise. The Q-values therefore hold the information about the reduction of entropy that the system expects from the application of a given action, in such a way that, by applying the action corresponding to the maximum Q-value, the entropy reduction is also expected to be maximum in the long run (considering a whole episode).
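The action-selection and update logic described above might be sketched as follows; the flat argmax over the 6 × 6 matrix mapped to a (corner, direction) pair and the entropy-difference reward are simplified, hypothetical versions of the project’s logic.

```python
import numpy as np

_rng = np.random.default_rng()

def select_action(q_matrix, eps):
    """Epsilon-greedy argmax over the 6 x 6 Q-matrix -> (corner, direction) indices."""
    if _rng.random() < eps:
        return int(_rng.integers(6)), int(_rng.integers(6))      # explore
    corner, direction = np.unravel_index(np.argmax(q_matrix), q_matrix.shape)
    return int(corner), int(direction)                           # exploit the maximum Q-value

def entropy_reward(entropy_before, entropy_after):
    """Positive when the action reduced the entropy (wrinkledness), negative otherwise."""
    delta = entropy_before - entropy_after
    return delta if delta > 0 else -1.0                          # simplified, hypothetical shaping
```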

To quantify the solution, we validate the use of the entropy as a metric for the wrinkledness of the fabric, and its minimization as the target of the algorithm for the development of a strategy to reduce such wrinkledness. Figure 9 shows how the entropy is reduced through the actions selected according to the Q-matrix, until the fabric reaches an acceptance distance from the target (plane) state.

Fig. 9
Example of entropy evolution during synthetic fabric manipulation. The figure shows the initial state of the fabric and the final state after different actions are applied. The red arrow (top-left corner) indicates the action suggested by the system. The right side of the figure shows the entropy evolution during the application of the actions; it decreases until a certain threshold (orange horizontal line) is crossed, meaning that the fabric is close to an ideal (plane) state

The action-selection method, which serves as the system’s plan to efficiently eliminate wrinkles from fabric, relies on a knowledge structure. This knowledge structure can be validated by evaluating the entropy variation resulting from the application of different actions, taking into account the information stored in the Q-matrix. In this process, various actions are applied to a given initial state, ordered based on their optimality as determined by the Q-matrix. By examining the entropy variation, one can assess the effectiveness of the selected actions and validate the underlying knowledge structure.

Furthermore, the entropy’s evolution along the optimal path can be compared with a scenario where suboptimal actions are taken, disregarding the Q-matrix. This comparison allows us to observe how the selection of the maximum Q-value drives a more robust curve toward entropy minimization. In Fig. 10, we can see that the entropy follows a decreasing curve toward the minimum entropy when the optimal actions are taken, considering the evolving state of the fabric. However, when suboptimal actions are consistently chosen for the same fabric state, the entropy exhibits a more erratic behavior. This demonstrates how the developed approach provides a long-term optimal solution, ensuring a continuous reduction in entropy along the optimal path.

Fig. 10
Optimal path: comparison of the entropy evolution when the maximum of the Q-matrix is always taken as the optimal action (blue curve) with the entropy evolution under the selection of suboptimal actions (green and orange curves). The starting point for each transition is always the state achieved by applying the optimal action (argmax of the Q-values). Suboptimal actions drive a more irregular behavior

The strategies have been also tested in a real scenario, exploiting the information captured through a point cloud and following the outcomes suggested by the analysis of the Q-matrix associated with the given state. However, a complete description of the entire work behind this demonstration in terms of hardware is out of the scope of this chapter. Nevertheless, Fig. 11 shows the state of a fabric from an initial configuration to a final (ordered) state after the application of two actions to reduce the wrinkles.

Fig. 11
Example of real fabric manipulation results. The figure shows three different steps during the manipulation following the actions suggested by the system. The sample manipulated is one from the real Use Case of the MERGING Project. In the figure, it can be clearly appreciated how the wrinkledness is reduced as the actions are applied

4 Conclusions

We have provided an introduction to reinforcement learning as a robust AI technology for the development of complex strategies to be exploited by active agents.

For its deployment in manufacturing environments, the applicability of RL depends strongly on the digitalization of the process or on its correct modeling, in order to provide a learning scenario in which complex strategies can be developed and demonstrated digitally before being tested in the real situation.

We have shown, as part of the results of two projects, examples of the application of RL in manufacturing. First, the application of Deep Deterministic Policy Gradient methods for the path optimization of a CNC machine in a digital twin. Second, deep Q-learning has been shown as a method for the development of optimal strategies for the manipulation of fabric in a manufacturing environment, with results presented in a dedicated digital environment as well as examples of the system’s performance in reducing the wrinkles on the fabric.

By utilizing reinforcement learning in a digital context, we have shown how to overcome the limitations posed by restricted training phases in manufacturing industries. Our research contributes to the development of effective strategies that can be tested and refined in digital environments before being applied to real-world systems. This approach allows for safer and more efficient exploration, enabling the optimization of manufacturing processes and performance.

The two manufacturing scenarios presented in this chapter highlight the potential and applicability of reinforcement learning in improving industrial processes. By bridging the gap between digital and real environments, we strive to advance the field of manufacturing and drive innovation in these sectors.

Overall, this research sheds light on the benefits of applying reinforcement learning in a digital context for manufacturing industries. It underscores the importance of leveraging digital environments to enhance training and strategy development, ultimately leading to improved performance and efficiency in real-world systems.