Paper | Open access

Deep reinforcement learning for optical systems: A case study of mode-locked lasers


Published 15 October 2020 © 2020 The Author(s). Published by IOP Publishing Ltd
Citation: Chang Sun et al 2020 Mach. Learn.: Sci. Technol. 1 045013. DOI: 10.1088/2632-2153/abb6d6


Abstract

We demonstrate that deep reinforcement learning (deep RL) provides a highly effective strategy for the control and self-tuning of optical systems. Deep RL integrates the two leading machine learning architectures of deep neural networks and reinforcement learning to produce robust and stable learning for control. Deep RL is ideally suited for optical systems, as tuning and control rely on interactions with the environment with a goal-oriented objective to achieve optimal immediate or delayed rewards. This allows the optical system to recognize bi-stable structures and navigate, via trajectory planning, to optimally performing solutions, the first such algorithm demonstrated to do so in optical systems. We specifically demonstrate the deep RL architecture on a mode-locked laser, where robust self-tuning and control can be established by giving the deep RL agent access to the laser's waveplates and polarizer. We further integrate transfer learning to help the deep RL agent rapidly learn new parameter regimes and generalize its control authority. Additionally, deep RL can be easily integrated with other control paradigms to provide a broad framework for controlling any optical system.


Machine learning (ML) and artificial intelligence (AI) algorithms are transforming the scientific landscape [1, 2]. From self-driving cars and autonomous vehicles to digital twins and manufacturing, there are few scientific and engineering disciplines that have not been profoundly impacted by the rise of ML/AI methods. Optics is no exception, with a significant growth of ML/AI methods developed for applications ranging from imaging to optical communications [3, 4]. For control applications, a variety of ML strategies have been developed for stabilizing optical systems such as mode-locked lasers [5–12]. From genetic algorithms to deep neural networks, these studies provide a broad perspective on how a diverse set of optimization algorithms can be used to automate the control and self-tuning of a given optical device. However, one of the most successful ML architectures has yet to be implemented for mode-locked lasers: reinforcement learning (RL) [13–16]. RL is a rapidly growing branch of ML/AI that is based upon goal-oriented algorithms in which an agent learns from interactions with the environment. It is the algorithmic basis for the popular AI success stories on games like chess and Go [17]. Given its leading status as a control and goal-oriented strategy, we show that RL can be integrated with optical systems, specifically mode-locked lasers, to produce an architecture for intelligent and stable self-tuning operation.

The power of RL lies in its ability to learn from interactions with the environment with goal-oriented objectives. This is unlike the two other dominant ML paradigms of supervised and unsupervised learning [1, 2]. With a trial-and-error search, an RL agent learns to sense the state of its environment and take actions accordingly to achieve optimal immediate or delayed rewards. Specifically, the RL agent arrives at different states by performing actions, with the selected actions leading to positive or negative rewards for learning. Importantly, the RL agent is capable of learning delayed rewards, which is critical for many optical systems since a trajectory to the optimal solution must be learned. This is equivalent to mapping out a set of moves, or long-term strategy, to win a chess game. RL seeks optimal policies that maximize the total reward across an episode. By assumption, each state satisfies the Markov property, i.e. it is determined only by the previous state and the transition taken to reach it. Thus a large number of trials must be evaluated in order to determine an optimal policy. This is accomplished in chess and Go by self-play [17], which is exactly what the mode-locked laser is allowed to do to learn an optimal policy.

In the context of mode-locked lasers, the RL agent is given access to the components of the laser typically used for generating stable operation: the waveplates and polarizer (see figure 1). The RL agent then explores ways to maximize its reward, which is centered on stable mode-locking of the laser cavity. Specifically, the highest-energy mode-locked pulse is typically sought in the high-dimensional space generated by the waveplates and polarizer. We show that the RL agent can learn to stabilize a mode-locked laser in a robust manner. More than that, it can learn pathways to circumvent regions in parameter space where bi-stabilities exist. Indeed, the delayed reward structure of the RL agent allows the system to learn how to maneuver around bi-stabilities in order to achieve optimal mode-locking performance. Such a trajectory cannot be discovered with the variety of ML methods used so far on laser systems [5–12]. Specifically, previous methods rely on searching through the parameter space, or finding local maxima of the objective function. However, the stable operation of the mode-locked laser is more nuanced than these approaches can accommodate. Through delayed reward, RL learns self-tuning trajectories through parameter space that are capable of circumventing the ubiquitous bi-stability of the objective function. Thus RL provides the requisite architecture for overcoming the practical tuning aspects of real lasers. In summary, the RL framework is especially valuable for systems where enough self-exploration can be promoted in order to sample the entirety of parameter space. This can be done with mode-locked lasers, so the RL architecture provides a clear pathway toward technological implementation and more robust turn-key technologies.

Figure 1.

Figure 1. Schematic of the self-tuning fiber laser. The mode-locked fiber laser, including the laser cavity and optical components, is discussed in detail in the Methods section 3.2. The deep reinforcement learning controller is discussed in the Methods section 3.3.


1. Results

We demonstrate the efficacy of deep reinforcement learning control on the mode-locked fiber laser shown in figure 1. We first demonstrate the deep reinforcement learning (deep RL) strategy for a single-input control (α1). The deep RL controller is then applied in a multi-input setting to find the optimal orientation of the waveplates ($\alpha_1, \alpha_2, \alpha_3$) and polarizer (αp). Finally, the controller is generalized to find optimal solutions with varying values of the fiber birefringence, which is an unmeasured latent variable that dictates the performance of the laser cavity. RL is shown to be a robust and stable way to enact control. The optimization objective, or reward, is detailed in the Methods section. Previous work [5, 6] has found it to be well modeled by the cavity energy divided by the kurtosis (fourth moment) of the spectrum.

1.1. Single-input control for fixed birefringence

Figure 2 shows the variation of the rewards and loss function during the training process of the deep RL controller for a single-input, single-output case. The quarter-waveplate angle α1 is the control variable, which can be varied in 2° steps. The search starts with an initial value of $\alpha_1 = -15^{\circ}$, and all other angles are held fixed at pre-determined, locally maximizing values. The deep reinforcement learning agent takes an action from a state using an epsilon-greedy policy, and the exploration rate epsilon decays exponentially during training. We observe an increase of the total reward over a complete episode as the training proceeds, as shown in figure 2(a). Note that the deep RL agent adapts to a new policy when the loss rises, as shown in figure 2(b).
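For concreteness, a minimal Python sketch of epsilon-greedy action selection with an exponentially decaying exploration rate follows. The decay constant, floor value, and three-action set (hold, ±2°) are illustrative assumptions rather than the training hyperparameters used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3            # hold, or change alpha_1 by +/- 2 degrees (illustrative)
EPS_START, EPS_MIN = 1.0, 0.05
DECAY = 0.995            # assumed per-episode decay constant

def epsilon(episode):
    """Exponentially decaying exploration rate."""
    return max(EPS_MIN, EPS_START * DECAY ** episode)

def select_action(q_values, episode):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon(episode):
        return rng.integers(N_ACTIONS)          # explore
    return int(np.argmax(q_values))             # exploit

# toy usage: Q values for (hold, +2 deg, -2 deg) at some state
print(select_action(np.array([0.1, 0.4, 0.2]), episode=200))
```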

Figure 2.

Figure 2. The variation of the rewards and loss function during training shows that the deep reinforcement learning controller adapts to improved policies as training proceeds.


Extending the initial values of α1 from −15° during training enables us to train a model that drives the laser dynamics to mode-locking with different initial values of α1, as shown in figure 3. With a fixed birefringence parameter K, the deep reinforcement learning controller correctly interprets and extracts features from the input states, and takes actions to efficiently drive the laser dynamics to mode-locked solutions with α1 starting anywhere in $[-40^{\circ}, -10^{\circ}]$. For example, with the initial value of $\alpha_1 = -12^{\circ}$, the reward of each step continues increasing from the initial value as the deep RL controller drives the intra-cavity dynamics to mode-locking, as shown in figure 3(iii). Interestingly, we observe hysteresis in the corresponding change of α1 while the reward is consistently increasing, as shown in figure 4(a). The deep reinforcement learning controller correctly identifies the bi-stability of the intra-cavity dynamics and arrives eventually at the globally maximizing solution in this case. No other ML architecture to date has been able to identify bi-stability. RL achieves this due to its deferred reward structure, which allows it to plan a path around the instability. Figure 4(b) describes the system bi-stability with two controllers α1 and α2. Our deep RL agent successfully discovers the path to drive the laser dynamics to mode-locking, marked as path (i) in figure 4(b). The corresponding final states of the electric fields u and v are also pictured in figure 4(b), with a high reward r = 0.2062. We compare the path selected by our deep RL agent to the direct connection of the start and end points, which is marked as path (ii) in figure 4(b). The corresponding final states of the electric fields u and v are also pictured, for which we observe constant waveforms (plane waves) with r = 0.0066. Note that in this work we only extend the initial values of α1 to the range $[-40^{\circ}, -10^{\circ}]$ during training, but the deep RL agent can be generalized to a larger range of initial values of α1.

Figure 3.

Figure 3. The deep reinforcement learning controller effectively drives the laser dynamics to mode-locked solutions with α1 starting from $[-40^{\circ}, -10^{\circ}]$. The left panel shows the change of rewards in each episode starting with different initial values of α1. The deep reinforcement learning controller adaptively selects actions to continue with the current waveplate orientation, or increase/decrease α1 by 2°. Control results for experiments starting with $\alpha_1 = -36^{\circ}$, −28°, and −12° are shown in detail in panels (i)–(iii). Note that the intra-cavity electric fields u and v start as hyperbolic secant pulses in each experiment. The colormap (with colorbar), used here and in subsequent figures, denotes the reward value.

Figure 4.

Figure 4. (a) The deep reinforcement learning controller for a single-input, single-output (SISO) case. The reward r rises from the initial value as the controller drives the intra-cavity dynamics to mode-locking, and we observe hysteresis in the corresponding change of α1 while the reward r is consistently increasing. (b) The deep reinforcement learning control for two controllers α1 and α2. Starting with $\alpha_1 = -15^{\circ}$ and $\alpha_2 = -3^{\circ}$, the laser dynamics successfully arrives at the mode-locked solution with r = 0.2062 following path (i) selected by the deep reinforcement learning controller, whereas we observe the plane-wave solution (r = 0.0066) following path (ii) for comparison.


1.2. Multi-input control for fixed birefringence

Figure 5 shows the deep reinforcement learning controller for the multiple-input, single-output (MISO) case where we control all four waveplate orientations simultaneously. This multi-input control is more complicated than the single-input case as the number of possible actions is significantly larger. Thus transfer learning is leveraged to prevent the model from diverging at the early stage of training: the number of controllers is gradually increased until the desired performance is achieved. The search starts with initial values of $\alpha_1 = -15^{\circ}, \alpha_2 = -5^{\circ}, \alpha_3 = 20^{\circ}$, and $\alpha_p = 84^{\circ}$. With fixed fiber birefringence, the deep reinforcement learning controller takes the correct actions to drive the laser dynamics from constant waveforms (plane waves) to the mode-locked solutions, as shown in figure 5. After the mode-locked state is achieved, the deep reinforcement learning agent continues searching through the action space with the reward oscillating near the optimal performance. Because the waveplate orientations are varied simultaneously, the large action space results in a slow search for the deep RL agent, and we find it difficult to eliminate such oscillations. Other adaptive controllers, for example extremum-seeking control [5], can be combined with deep RL to stabilize and attenuate such oscillations for better performance.

Figure 5.

Figure 5. The deep reinforcement learning controller for the multiple-input, single-output (MISO) case where we are controlling all four waveplate orientations simultaneously (K = 0). The experiment starts with hyperbolic secant pulses u and v in the cavity, which are promptly attenuated to constant waveforms with initial values of $\alpha_1 = -15^{\circ}, \alpha_2 = -5^{\circ}, \alpha_3 = 20^{\circ}$, and $\alpha_p = 84^{\circ}$. The four controllers α1, α2, α3, and αp either hold the current orientations or increase/decrease by 0.5° in each step.


1.3. Multi-input control for varying birefringence

We find that the neural network parameters of the deep RL controller trained with birefringence K = 0.1 can be generalized to control the mode-locking dynamics at different values of K. Except for a few cases, the deep reinforcement learning controller successfully drives the laser dynamics to mode-locking with different values of the birefringence K from −0.2 to 0.4, as shown in figure 6(a). Note that in such generalized cases, the deep RL controller takes more steps on average to drive the laser dynamics to mode-locking, and in some cases the mode-locked solutions achieved are not as tightly confined as those shown previously. Moreover, in a few cases, the deep reinforcement learning controller successfully achieves mode-locking but gradually wanders afterwards.

Figure 6.

Figure 6. (a) Neural network parameters of the deep RL controller trained with birefringence K = 0.1 can be generalized to environments with different values of K. (b) With transfer learning, neural network parameters of the deep RL controller can be rapidly fine-tuned with a small amount of newly collected experiences and updated to control effectively in the new environments with different values of K.


One feasible solution to improve the control performance in such cases is to retrain the model completely for different values of the birefringence K. However, we find that the neural network parameters of our deep RL agent adapt well and quickly with transfer learning [18] to different values of the birefringence K. In other words, there is no need to retrain the model completely with varying K; instead we can slightly increase the exploration rate of the current model and collect more experiences by interacting with the new environment of changed birefringence K. These newly collected experiences enable the model to quickly update and adapt to the new environment. This provides the possibility of building an online model for what is typically a stochastic and slowly varying birefringence. In our experiments, training the neural network parameters with the birefringence K = 0.1 takes at least 2000 episodes for the deep RL controller to converge to the optimal policy, whereas transfer learning takes only 300 episodes on average to adapt to environments with varying birefringence K. With fine-tuned parameters for K ∈ {−0.2, 0, 0.2, 0.4}, we improve the performance of the deep reinforcement learning controller to drive the laser dynamics to mode-locking with other values of the birefringence K ranging from −0.2 to 0.4, as shown in figure 6(b). Note that only four sets of neural network parameters are used to generalize the controller to a range of K values.
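To illustrate how only four fine-tuned parameter sets can serve a continuous range of K, a minimal sketch of a nearest-K lookup is given below. The dictionary keys follow the values quoted above, but the file names and the nearest-neighbor selection rule are hypothetical assumptions, not details taken from this work.

```python
# Minimal sketch: reuse one of four fine-tuned parameter sets for an arbitrary K.
# The file names and the nearest-neighbor rule are hypothetical illustrations.
FINE_TUNED = {-0.2: "ddqn_K-0.2.h5", 0.0: "ddqn_K0.0.h5",
               0.2: "ddqn_K0.2.h5",  0.4: "ddqn_K0.4.h5"}

def select_weights(K):
    """Pick the fine-tuned parameter set trained at the birefringence closest to K."""
    K_nearest = min(FINE_TUNED, key=lambda k: abs(k - K))
    return FINE_TUNED[K_nearest]

print(select_weights(0.13))   # -> "ddqn_K0.2.h5"
```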

As previously noted, we find in some cases that the deep RL controller successfully achieves mode-locking, but gradually drifts away from the desired solution. In such cases we can rely on an extremum-seeking controller [5] or other adaptive controllers for better performance. Extremum-seeking control, for example, is a form of perturb-and-observe control that estimates the gradient of an objective function by injecting an additional sinusoidal signal as input. The signal converges more rapidly when the objective function has a large gradient. Extremum-seeking control can lock the system to a local maximum and reacts rapidly to moderate changes of the intra-cavity dynamics [19]. However, it depends on the initial conditions of the parameters and state of the system since it only finds local maxima. Moreover, extremum-seeking control cannot recover when the system is knocked far from the desired local maximum by drastic perturbations. Therefore, combining the deep RL agent with the extremum-seeking controller is a viable integrative strategy, since it can evade poor local maxima and stabilize the intra-cavity dynamics around the mode-locked solution. To implement this integrated strategy, the deep RL controller is first executed in order to find a good mode-locked solution in a rapid manner. Indeed, deep RL can find near-optimal stable mode-locking within several steps of propagation. The extremum-seeking controller is then turned on to stabilize the system. The schematic is shown in figure 7.
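To make the perturb-and-observe idea concrete, a minimal sketch of sinusoidal-dither extremum seeking on a toy one-parameter objective is given below. The dither amplitude, frequency, gains, and the Gaussian objective are all illustrative assumptions, not the values or objective used in the laser simulations.

```python
import numpy as np

def objective(theta):
    """Toy stand-in for the mode-locking objective (energy / kurtosis)."""
    return np.exp(-(theta - 1.0) ** 2)          # single local maximum at theta = 1

# illustrative extremum-seeking gains (not values from the paper)
a, omega, k, dt = 0.05, 5.0, 0.8, 0.01
theta_hat, lp = 0.0, 0.0                        # parameter estimate, low-pass state

for n in range(20000):
    t = n * dt
    theta = theta_hat + a * np.sin(omega * t)   # inject sinusoidal dither
    J = objective(theta)                        # measured objective
    lp += dt * (J - lp)                         # crude low-pass to track the DC level
    grad_est = (J - lp) * np.sin(omega * t)     # demodulation ~ local gradient estimate
    theta_hat += dt * k * grad_est              # climb the estimated gradient

print(round(theta_hat, 3))                      # settles near the maximizer theta = 1
```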

Figure 7.

Figure 7. Combining the deep reinforcement learning controller with the extremum-seeking controller stabilizes the intra-cavity dynamics and achieves mode-locking with varying birefringence K.


2. Discussion

Deep reinforcement learning is a learning paradigm that integrates the power of RL and deep neural networks. It is an ideal ML paradigm for complex dynamical systems where the learning agent is allowed to explore the system and for which trajectory planning is critical for success: both aspects typically manifest in optical systems. Optical systems also manifest one of the key physics features that necessitates the RL agent: bi-stability. Previous self-tuning algorithms take advantage of parameter sweeps and extremum seeking to find optimal solutions. However, these sweeps miss critical bi-stable structures which must be navigated around in order to achieve ideal mode-locking. Indeed, the best solutions are achieved by taking specific trajectories in parameter space, something that the path-planning aspects of RL can easily find. Here, we demonstrate a fast, reliable self-tuning controller for the passive mode-locked fiber laser with deep RL. The controller varies all four waveplate orientations simultaneously to achieve a tightly confined, high-energy mode-locked state. Interestingly, the control paths selected by the deep reinforcement learning controller reflect the bi-stability of the laser dynamics, and demonstrate the efficacy of deep learning control to correctly sense the state of the environment in a bi-stable system. Although no new optical physics is demonstrated in this work, we have provided a principled control strategy to escape from poor local maxima by interacting with the environment, which is important for building controllers in systems with bi-stability. Moreover, the deep RL controller architecture provided here can be easily integrated and generalized to experimental environments and other optical systems, including, for instance, managing instabilities from dispersion management [20], controlling pulse compression [21], and/or circumventing Q-switching instabilities [22]. Importantly, given a well-defined reward criterion and state space, the deep RL architecture generates experiences with its environment in order to train the deep reinforcement learning controller.

The deep RL framework demonstrated here can be combined and integrated with other control paradigms, for example the extremum-seeking controller, for better control performance. Since the deep RL controller continues searching the entire space even after a good mode-locked solution has been found, it is difficult to eliminate the oscillations, and in some cases the controller eventually drifts away from the mode-locked solution. An integration with the extremum-seeking controller stabilizes the control performance around the optimal solution discovered, even with slowly varying birefringence. With drastic perturbations to the birefringence, our deep reinforcement learning controller can be promptly fine-tuned to adapt to the new environment using transfer learning. Once a new mode-locked solution is found by the deep RL controller, the extremum-seeking controller is turned on to stabilize the system. This hybrid approach marries the ability of deep RL to search globally in a large control space with the increased stability provided by extremum-seeking control via local optimization. Future work will also consider emerging methods for producing certifiable RL strategies that can guarantee control performance [23].

3. Methods

3.1. Reinforcement learning

As noted earlier, RL is a branch of ML that uses goal-oriented algorithms that learn from interactions with the environment. Using a trial-and-error search, an agent learns to sense the state of its environment and take actions accordingly to achieve optimal immediate or delayed rewards. Specifically, the RL agent arrives at different states by performing actions, with the selected actions leading to positive or negative rewards for learning. The agent's behavior is defined by the policy of the reinforcement learning algorithm in the environment, and we seek the optimal policy that maximizes the total reward across an episode (a trajectory generated by tuning the optical system).

3.1.1. Q-learning

We leverage deep Q-learning, specifically a deep off-policy temporal-difference control algorithm [24, 25], which updates its current estimates based on previously learned estimates. In reinforcement learning algorithms, the state-action value function, or Q-function, is defined as the expected discounted return of rewards starting from the state s with the action a according to policy π. The Q-function quantifies the agent's performance when taking a particular action to transition from the current state to the next under the chosen policy. During training the reinforcement learning agent learns and converges to the optimal policy that maximizes the total reward across an episode.

Q-learning [26] is a particular approach to learning optimal actions in such sequential decision problems and has been recognized as a form of temporal difference learning [27]. Suppose we take action a in the current state s and arrive at state $s'$; Q-learning obtains

$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a'), \qquad (1)$

where r(s, a) is the reward collected by performing action a to move from state s to $s'$, and γ ∈ [0, 1] is the discount factor that controls the contribution of rewards collected in the future to the total reward after the episode is finished. During an episode, the agent proceeds by either choosing the action with the highest Q value (exploitation), or randomly selecting an action to explore other possible states which may return higher delayed rewards (exploration). The agent moves forward to the next state $s'$ with the selected action a and collects the associated reward r. Q-learning updates the current Q value of the experienced state-action pair using the collected reward after transitioning to $s'$ and the possible future rewards obtained by taking the optimal action thereafter:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right], \qquad (2)$

where α ∈ [0, 1] is the learning rate. Note that the difference between the actual reward $r+\gamma \max_{a'} Q(s', a')$ and the expected reward Q(s, a) is used to update the value of Q(s, a). The parameter α is important for convergence since it determines to what extent the current Q function is updated by the newly explored information. The Q function is arbitrarily initialized and updated following equation (2) until the Q-learning algorithm has converged.
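A minimal, self-contained sketch of the tabular update in equation (2) is given below; it uses a toy one-dimensional chain of discrete "angle" states with a Gaussian reward profile, which is an illustrative stand-in rather than the laser environment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D "angle" chain: 11 discrete states, reward peaks at state 8 (illustrative).
N_STATES, N_ACTIONS = 11, 2                  # actions: 0 = step left, 1 = step right
REWARD = np.exp(-0.5 * (np.arange(N_STATES) - 8) ** 2)

gamma, alpha, eps = 0.9, 0.1, 0.1
Q = np.zeros((N_STATES, N_ACTIONS))

for episode in range(500):
    s = 2                                     # fixed initial state
    for step in range(30):
        a = rng.integers(N_ACTIONS) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = np.clip(s + (1 if a == 1 else -1), 0, N_STATES - 1)
        r = REWARD[s_next]
        # Q-learning update, equation (2)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))                   # learned policy: mostly "step right"
```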

3.1.2. Deep Q neural networks (DQN)

In discrete environments represented by a finite number of possible states and actions, we often search through all possible state-action pairs exhaustively to find the optimal Q(s, a) values and the associated policy. However, this is computationally expensive and becomes infeasible with more than a small number of state-action pairs. In continuous environments, it is impossible to list and search through each state with different actions. In contrast, deep Q-learning [25] allows one to approximate the tabular Q function Q(s, a) as a parameterized function Q(s, a; θ). Considering that neural networks can provide good approximations to possibly very complex functions, we utilize here deep neural networks as the estimator of the Q value function. In particular, Q(s, a; θ) is modeled as a multi-layered neural network with parameters θ that takes a given state s as input and yields a vector of values Q(s,·; θ), each associated with a particular action a.

Following the Q-learning update rule defined in equation (2), we refer to $r+\gamma \max_{a'} Q(s', a'; \theta)$ as the target value and Q(s, a; θ) as the predicted value; the difference between the target and prediction is minimized as the current policy converges to the optimum. In deep Q-learning, we can analogously define the loss function as the squared difference between the target and predicted value:

$L(\theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \right)^2. \qquad (3)$

The loss is minimized by learning updates to the deep neural network parameters θ that converge to the optimal policy. In summary, we use neural networks for the approximation of the Q function in deep Q learning, and converge to the optimal policy by minimizing the loss.

In particular, we employ, as the deep reinforcement learning agent, an adaptation of the double deep Q neural network (DDQN) [28, 29] to the self-tuning laser control problem. The architecture of the DDQN is shown in figure 8, where the inputs fed into the action network describe the current state that the deep RL agent is in, and the output of the action network is the approximated Q function, specifically the Q values for all possible actions given the current state. With the loss function defined in equation (3), we would observe strong divergence during training since the same neural network with parameters θ calculates both the predicted value and the target value. To mitigate this divergence, two separate networks are employed, one for selecting an action and the other for evaluating the selected action. Specifically, the target network with parameters $\theta'$ is used to calculate the target value, while the action network with parameters θ yields the predicted Q values associated with each action. The new loss function is defined as:

$L(\theta) = \left( r + \gamma \max_{a'} Q(s', a'; \theta\,') - Q(s, a; \theta) \right)^2, \qquad (4)$

where $\theta'$ and θ denote the parameters of the target network and the action network, respectively. The parameters of the target network are either periodically frozen for several episodes during training before being updated by copying the parameters from the action network, or partially updated with parameters from the action network in each episode to stabilize the training.
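The sketch below illustrates one possible implementation of the DDQN update of equation (4) together with a partial (soft) update of the target parameters θ′. It uses the modern tf.keras API rather than the TensorFlow 1.10 interface used in this work, and the layer sizes, learning rate, soft-update rate, and action count are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

N_STATE, N_ACTIONS, GAMMA, TAU = 8, 9, 0.99, 0.01   # illustrative sizes and rates

def make_qnet():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(N_STATE,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS),            # Q(s, a; theta) for each action
    ])

action_net = make_qnet()                             # parameters theta
target_net = make_qnet()                             # parameters theta'
target_net.set_weights(action_net.get_weights())

opt = tf.keras.optimizers.Adam(1e-3)

def train_step(s, a, r, s_next):
    """One gradient step on the loss of equation (4)."""
    q_next = target_net(s_next)                      # target network evaluates s'
    target = GAMMA * tf.reduce_max(q_next, axis=1) + r
    with tf.GradientTape() as tape:
        q = tf.reduce_sum(action_net(s) * tf.one_hot(a, N_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(target - q))
    grads = tape.gradient(loss, action_net.trainable_variables)
    opt.apply_gradients(zip(grads, action_net.trainable_variables))
    # partial (soft) update of the target parameters theta'
    for w_t, w_a in zip(target_net.weights, action_net.weights):
        w_t.assign(TAU * w_a + (1.0 - TAU) * w_t)
    return float(loss)

# toy batch of transitions
s = np.random.randn(32, N_STATE).astype("float32")
s2 = np.random.randn(32, N_STATE).astype("float32")
a = np.random.randint(0, N_ACTIONS, 32)
r = np.random.rand(32).astype("float32")
print(train_step(s, a, r, s2))
```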

Figure 8.

Figure 8. The architecture of the double deep Q neural network. A target network is included to stabilize the training. More details are discussed in section 3.1.2.


To stabilize the training and reduce the overfitting caused by correlation between the deep RL agent's experiences, we train the DDQN with an experience replay buffer [30], which is usually defined as a queue that saves a fixed number of the most recent experiences. An experience $\langle s, a, r, s' \rangle$ of the deep RL agent is defined as the concatenation of the current state s, the selected action a, the next state $s'$ after performing the action, and the associated reward r received in this transition. During training, rather than training directly with the newest experiences collected, we sample a random batch of experiences $\langle s, a, r, s' \rangle$ from the replay buffer and feed the sampled batch to the neural network for parameter updates. The deep RL agent benefits from the replay buffer by learning from an enlarged range of random and less correlated experiences.
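A minimal sketch of such a fixed-size replay buffer with uniform random sampling is given below; the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque, namedtuple

# Experience tuple <s, a, r, s'> as described above
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    """Fixed-size queue of recent experiences with uniform random sampling."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are discarded

    def push(self, state, action, reward, next_state):
        self.buffer.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        # a random batch decorrelates consecutive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# toy usage
buf = ReplayBuffer(capacity=1000)
for i in range(100):
    buf.push(state=i, action=i % 3, reward=0.1 * i, next_state=i + 1)
batch = buf.sample(8)
print(len(batch), batch[0])
```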

3.2. Mode-locked fiber laser model

Our model of the laser cavity is a well-established computational model which treats the cavity component by component, separately applying the non-linear optical propagation through the fiber and the discrete effects of the waveplates and polarizer in each round trip. This model produces a rich set of dynamics that we wish to control [31]. We model the propagation of the intra-cavity fields with the coupled non-linear Schrödinger equations, with modifications to account for the bandwidth-limited gain and cavity losses [32–34]:

$i u_z + \frac{D}{2} u_{tt} + \left( |u|^2 + A |v|^2 \right) u + B v^2 u^* + K u = i R u, \qquad (5a)$

$i v_z + \frac{D}{2} v_{tt} + \left( A |u|^2 + |v|^2 \right) v + B u^2 v^* - K v = i R v, \qquad (5b)$

where u(z, t) and v(z, t) are often referred to as the fast and slow components of the two intra-cavity electric field envelopes, which are orthogonally polarized. The propagation distance z is non-dimensionalized by the cavity length, and the dimensionless time t is normalized by the full width at half maximum of the pulse. D is the average group velocity dispersion, A and B, determined by physical properties of the laser fiber, are the non-linear coupling parameters characterizing the cross-phase modulation and the four-wave mixing, respectively. In this work we consider a silica fiber with A = 2/3 and B = 1/3. The fiber birefringence, quantified by K, represents a major disturbance to the laser dynamics due to its sensitivity to thermal fluctuations. The dissipative term R, characterizing the bandwidth-limited gain and attenuation arising from the Yb-doped amplification, is defined as

$R = \dfrac{2 g_0 \left( 1 + \tau \,\partial_t^2 \right)}{1 + \frac{1}{e_0} \int_{-\infty}^{\infty} \left( |u|^2 + |v|^2 \right) \mathrm{d}t} - \Gamma, \qquad (6)$

where g0 is the dimensionless pumping strength and e0 is the dimensionless saturation energy of the gain medium. The bandwidth of the pump is set by τ, while Γ characterizes losses caused by output coupling and fiber attenuation.
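The sketch below evaluates this saturated, bandwidth-limited gain term for a given pair of fields, applying the spectral filter in the Fourier domain. It assumes the operator form reconstructed in equation (6) and the parameter values of table 1; it is an illustration, not the simulation code used in this work.

```python
import numpy as np

# illustrative grid and parameters (table 1 values assumed)
Nt, Lt = 256, 60.0
t = np.linspace(-Lt / 2, Lt / 2, Nt, endpoint=False)
omega = 2 * np.pi * np.fft.fftfreq(Nt, d=Lt / Nt)
g0, e0, tau, Gamma = 1.73, 4.23, 0.1, 0.1

def apply_gain(u, v):
    """Apply the saturated, bandwidth-limited gain R to the field u
    (sketch of equation (6); the operator form is an assumption)."""
    energy = np.sum(np.abs(u) ** 2 + np.abs(v) ** 2) * (Lt / Nt)   # total cavity energy
    g_sat = 2.0 * g0 / (1.0 + energy / e0)                         # gain saturation
    # (1 + tau * d^2/dt^2) acts as (1 - tau * omega^2) in the Fourier domain
    u_filtered = np.fft.ifft((1.0 - tau * omega ** 2) * np.fft.fft(u))
    return g_sat * u_filtered - Gamma * u

# toy usage with a hyperbolic-secant seed pulse in each polarization
u0 = 1.0 / np.cosh(t)
v0 = 0.1 / np.cosh(t)
print(np.round(np.max(np.abs(apply_gain(u0, v0))), 3))
```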

The effect of the waveplates and polarizer during each round trip is modeled by the discrete application of Jones matrices:

$W_{\lambda/4} = \begin{pmatrix} e^{-i\pi/4} & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}, \qquad W_{\lambda/2} = \begin{pmatrix} -i & 0 \\ 0 & i \end{pmatrix}, \qquad (7a)$

$W_{p} = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}. \qquad (7b)$

Note that $W_{\lambda/4}$ characterizes the effects of quarter-waveplates α1 and α2, $W_{\lambda/2}$ is for the half-waveplate α3, and Wp is for the polarizer αp . An additional rotation matrix R(α) is necessary to account for the offset between the direction of the intra-cavity fast field and the principal axes of the waveplates and polarizer, and we define

$J_j = R(\alpha_j)\, W_j\, R(-\alpha_j), \qquad (8a)$

$R(\alpha_j) = \begin{pmatrix} \cos\alpha_j & -\sin\alpha_j \\ \sin\alpha_j & \cos\alpha_j \end{pmatrix}, \qquad (8b)$

where αj (j = 1, 2, 3, p) is a waveplate or polarizer angle. These rotation angles are easily manipulated via electronic control [35], and are considered as the control variables of the deep reinforcement learning agent for driving the laser dynamics to mode-locking in this work. Typical parameters used in these simulations are given in table 1. The round-trip length is 1.5 dimensionless units. The non-dimensional scalings can be found in references [32–34]. Such lasers typically produce pulse widths on the order of hundreds of femtoseconds with repetition rates in the tens of megahertz.
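A minimal sketch of the discrete waveplate/polarizer step applied to the field vector (u, v) once per round trip is shown below. The Jones-matrix entries follow the standard forms reconstructed in equations (7)–(8), and the ordering of the elements in the loop is an assumption made for illustration.

```python
import numpy as np

def rot(alpha):
    """Rotation matrix R(alpha) of equation (8b)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s], [s, c]])

# Jones matrices in their principal axes (standard forms; assumed here)
W_quarter = np.diag([np.exp(-1j * np.pi / 4), np.exp(1j * np.pi / 4)])
W_half = np.diag([-1j, 1j])
W_pol = np.array([[1.0, 0.0], [0.0, 0.0]])

def jones(W, alpha):
    """J = R(alpha) W R(-alpha), equation (8a)."""
    return rot(alpha) @ W @ rot(-alpha)

def apply_cavity_discrete(u, v, a1, a2, a3, ap):
    """One round-trip application of polarizer and waveplates to the field (u, v).
    The ordering of the elements is an assumption for illustration."""
    field = np.stack([u, v])                              # 2 x Nt field vector
    for J in (jones(W_pol, ap), jones(W_quarter, a1),
              jones(W_half, a3), jones(W_quarter, a2)):
        field = J @ field
    return field[0], field[1]

# toy usage on a hyperbolic-secant pulse, angles in radians
t = np.linspace(-30, 30, 256)
u, v = 1 / np.cosh(t) + 0j, 0.1 / np.cosh(t) + 0j
u2, v2 = apply_cavity_discrete(u, v, np.deg2rad(-15), np.deg2rad(-5),
                               np.deg2rad(20), np.deg2rad(84))
print(np.round(np.max(np.abs(u2)), 3), np.round(np.max(np.abs(v2)), 3))
```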

Table 1. CNLS Simulation parameters.

τ      Γ      A      B      D      K      g0      e0      Lt     Nt
0.1    0.1    2/3    1/3    −0.4   0.1    1.73    4.23    60     256

3.3. Deep reinforcement learning control

A schematic of the self-tuning mode-locked laser with deep reinforcement learning control is shown in figure 9, comprising a deep RL controller and a mode-locked fiber laser cavity based on passive non-linear polarization rotation. The mode-locking laser cavity, which is discussed in detail in the previous section, is interpreted as the interactive environment in the reinforcement learning framework, and the waveplate angles $\alpha_1, \alpha_2, \alpha_3$ and polarizer angle αp are considered the controllable actions of the deep reinforcement learning agent. We take the concatenated components of the electric fields u, v and the current waveplate orientations $\alpha_1, \alpha_2, \alpha_3$ and αp as the input to the deep reinforcement learning agent. Specifically, the deep reinforcement learning controller, built with TensorFlow 1.10.0, contains three alternately stacked convolutional and max-pooling layers, followed by six fully connected layers with leaky ReLU activation functions. The convolutional layers extract features from the input state by identifying the solitons inside the electric fields u and v, and the max-pooling layers detect the existence of the solitons and reduce the input dimensionality before feeding into the fully connected layers. Note that we demonstrate in this work the efficacy of the deep RL architecture in a numerical simulation of the laser cavity, but it is possible to train the deep RL controller directly in an experiment, as the controller only relies on information that is readily available in experiments.
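A sketch of such a network is given below using the modern tf.keras API (this work used TensorFlow 1.10). The filter counts, kernel sizes, dense-layer widths, the representation of the complex fields by two real channels, and the size of the joint action space are all illustrative assumptions; only the overall structure (three conv/max-pool stages followed by fully connected layers with leaky ReLU, outputting one Q value per action) follows the description above.

```python
import tensorflow as tf

N_T = 256          # number of time-domain samples per field (table 1)
N_ANGLES = 4       # alpha_1, alpha_2, alpha_3, alpha_p
N_ACTIONS = 3 ** N_ANGLES   # hold / +0.5 deg / -0.5 deg per controller (assumed joint space)

def build_q_network():
    """Sketch of the controller network: three conv + max-pool stages on the
    field profiles, followed by fully connected layers with leaky ReLU.
    Filter counts and layer widths are illustrative assumptions."""
    fields = tf.keras.Input(shape=(N_T, 2), name="fields")     # e.g. |u| and |v| profiles
    angles = tf.keras.Input(shape=(N_ANGLES,), name="angles")

    x = fields
    for filters in (16, 32, 64):
        x = tf.keras.layers.Conv1D(filters, 5, padding="same")(x)
        x = tf.keras.layers.LeakyReLU(0.1)(x)
        x = tf.keras.layers.MaxPooling1D(2)(x)                 # detect/compress soliton features
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Concatenate()([x, angles])             # append waveplate orientations

    for units in (256, 128, 128, 64, 64, 32):                  # six dense layers
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.LeakyReLU(0.1)(x)
    q_values = tf.keras.layers.Dense(N_ACTIONS, name="q_values")(x)
    return tf.keras.Model([fields, angles], q_values)

model = build_q_network()
model.summary()
```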

Figure 9.

Figure 9. Schematic of the self-tuning mode-locked laser with deep reinforcement learning control. The input to the deep RL controller describes the current state that the controller is in, and is defined as the concatenated electric fields u, v and waveplate orientations α1, α2, α3, αp. The control inputs to the laser cavity are then updated by the selected action of the deep RL controller, which results in changes to the laser cavity dynamics and returns new electric fields u, v and a reward r, as defined in equation (9), to the deep RL agent. Given the updated inputs and the associated reward r, the deep RL controller adjusts its strategy accordingly to select the next action and optimize the control inputs to the laser cavity.


The performance of the deep reinforcement learning controller is evaluated in terms of a reward r. In particular, we seek to steer the system to high-energy mode-locked states. However, the reward/cost landscape is very complex and exhibits many local optima. In addition, evaluating the energy alone is not sufficient, as there are many chaotic solutions which have significantly higher energy than mode-locked states [5]. To define the reward r, we therefore include the fourth moment (kurtosis) M of the Fourier spectrum of the waveform, which is large for chaotic solutions but relatively small for the desired mode-locked states. To have a large reward r only for tightly confined temporal wave packets with relatively large energy, we define [5]

$r = \frac{E}{M}, \qquad E = \int_{-\infty}^{\infty} \left( |u|^2 + |v|^2 \right) \mathrm{d}t. \qquad (9)$

To penalize the ineffective actions more efficiently during training, we rescale the reward to be centered around zero, so that the desired actions result in positive rewards while the ineffective ones return negative rewards. We rescale it back as defined in equation (9) after training for consistency and interpretability.
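A minimal sketch of this reward is given below. It follows equation (9), computing the intra-cavity energy and the fourth moment of the Fourier spectrum about its spectral mean; the exact moment normalization and the use of u + v for the spectrum are assumptions made for illustration.

```python
import numpy as np

def reward(u, v, t):
    """Sketch of the reward of equation (9): intra-cavity energy divided by the
    fourth moment M of the Fourier spectrum of the waveform. The exact moment
    normalization used in the paper is an assumption here."""
    dt = t[1] - t[0]
    energy = np.sum(np.abs(u) ** 2 + np.abs(v) ** 2) * dt

    spectrum = np.abs(np.fft.fftshift(np.fft.fft(u + v))) ** 2
    omega = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(t), d=dt))
    p = spectrum / np.sum(spectrum)                 # normalized spectral density
    mean = np.sum(omega * p)
    M = np.sum((omega - mean) ** 4 * p)             # broad (chaotic) spectra -> large M

    return energy / M

# a tightly confined pulse scores far higher than a broadband noisy waveform
t = np.linspace(-30, 30, 256, endpoint=False)
pulse = 1 / np.cosh(t)
noise = 0.4 * np.random.default_rng(0).standard_normal(len(t))
print(reward(pulse, 0 * t, t) > reward(noise, 0 * t, t))   # True
```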

Our deep RL agent uses an epsilon-greedy policy to balance exploration and exploitation, and the parameters $\theta'$ of the target network are partially updated in each training step to improve stability. Note that the deep RL agent spans a large action space in the MISO case, when the three waveplate orientations α1, α2, α3 and the polarizer orientation αp are all considered as controllers. Thus we observe convergence difficulties when training the model directly from randomly initialized neural network parameters. To deal with this problem, we start training our deep RL agent with a single controller α1, and gradually increase the number of controllers, initializing each larger model with the previously trained parameters. Such a parameter-initialization strategy effectively prevents the model from diverging, especially at the early stages of training.
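One possible way to implement this warm-start is sketched below: the hidden layers of a Q-network trained with a single controller are copied into a network whose output head covers the enlarged joint action space, while the new head keeps its fresh initialization. The layer sizes, state dimension, and action counts are illustrative assumptions; the text does not specify how the parameter copy is performed.

```python
import tensorflow as tf

def build_qnet(n_actions, n_state=16):
    """Small Q-network; hidden sizes are illustrative assumptions."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_state,)),
        tf.keras.layers.Dense(64, activation="relu", name="h1"),
        tf.keras.layers.Dense(64, activation="relu", name="h2"),
        tf.keras.layers.Dense(n_actions, name="q_out"),
    ])

# stage 1: agent trained with a single controller (3 actions: hold / +2 / -2 deg)
small = build_qnet(n_actions=3)
# ... (training of `small` omitted) ...

# stage 2: warm-start the multi-controller agent (e.g. two controllers -> 9 joint actions)
large = build_qnet(n_actions=9)
for name in ("h1", "h2"):                      # copy the shared hidden layers
    large.get_layer(name).set_weights(small.get_layer(name).get_weights())
# the new, wider output head `q_out` keeps its fresh random initialization

print([w.shape for w in large.get_layer("q_out").get_weights()])
```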

The deep RL agent takes as input the electric fields and waveplate orientations, and its output yields the Q values associated with all possible actions given the current state that the deep RL agent is in. Following the epsilon-greedy policy, our RL agent either randomly selects an action for exploration, or greedily selects the action with the highest Q value for exploitation, and moves to the next state accordingly. Specifically, the parameters α1, α2, α3, and αp are adjusted according to the selected action, and consequently we observe changes of the electric fields u and v in the fiber laser cavity. The reward r of this transition, as defined in equation (9), is computed from the new intra-cavity fields u and v after the transition. The procedure is repeated until the completion of the entire episode, and then a new episode is started with the same initial conditions of the electric fields and waveplate/polarizer orientations to collect more samples. Thus, the deep RL agent learns from different trials using exploration and exploitation, and eventually converges to the optimal policy that leads to the highest total reward across the entire episode. Note that after the training stage, the learned policy of the deep RL agent is evaluated by greedily selecting the action with the highest Q value. The complete model of controlling all four waveplate orientations simultaneously (K = 0) takes about 25 h to train on a CPU, with initial values of $\alpha_1 = -15^{\circ}, \alpha_2 = -5^{\circ}, \alpha_3 = 20^{\circ}$, $\alpha_p = 84^{\circ}$ and hyperbolic secant pulses u and v in the cavity [36].

Acknowledgments

SLB acknowledges support from the Army Research Office (ARO W911NF-19-1-0045; Program Manager Matthew Munson). SLB and EK acknowledge support from the Army Research Office (ARO W911NF-17-1-0306; Program Managers Matthew Munson and Samuel Stanton). SLB, EK, and JNK acknowledge support from the UW Engineering Data Science Institute, NSF HDR award #1934292. JNK acknowledges support from the Air Force Office of Scientific Research (FA9550-17-1-0200).
