Introduction

Swimming microorganisms have evolved versatile navigation strategies by switching their locomotory gaits in response to their surroundings1. Their navigation strategies typically involve switching between translation and rotation modes such as run-and-tumble and reverse-and-flick in bacteria2,3,4,5, as well as run-stop-shock and run-and-spin in eukaryotes6,7. Such an adaptive, multimodal gait-switching ability is particularly desirable for biomedical applications of artificial microswimmers such as targeted drug delivery and microsurgery8,9,10,11,12, which require navigation towards target locations in biological media with uncontrolled and/or unpredictable environmental factors13,14,15.

Pioneering works by Purcell and subsequent studies demonstrated how simple reconfigurable systems with ingenious locomotory gaits can generate net translation and rotation, given the stringent constraints for locomotion at low Reynolds numbers16. Yet, the design of locomotory gaits becomes increasingly intractable when more sophisticated maneuvers are required or environmental perturbations are present. Existing microswimmers are therefore typically designed with fixed locomotory gaits and rely on manual interventions for navigation8,17,18,19,20,21. It remains an unresolved challenge to develop microswimmers with adaptive locomotory strategies, similar to those of biological cells, that can navigate complex environments autonomously. Modular microrobotics and the use of soft active materials22,23 have been proposed to address this challenge.

More recently, the rapid development of artificial intelligence (AI) and its applications in locomotion problems24,25,26,27,28,29 have opened different paths towards designing the next generation of smart microswimmers30,31. Various machine learning approaches have enabled the navigation of active particles in the presence of background flows32,33, thermal fluctuations34,35, and obstacles36. In these minimal models, the microswimmers are often represented as active particles with prescribed self-propelling velocities and certain degrees of freedom for speed variation and re-orientation. However, the complex adjustments in locomotory gaits required for such adaptations are typically not accounted for. Recent studies have begun to examine how different machine learning techniques enable reconfigurable microswimmers to evolve effective gaits for self-propulsion37 and chemotactic response38.

Here, we combine reinforcement learning (RL) with artificial neural networks to enable a simple reconfigurable system to perform complex maneuvers in a low-Reynolds-number environment. We show that the deep RL framework empowers a microswimmer to adapt its locomotory gaits in accomplishing sophisticated tasks, including targeted navigation and path tracing, without being explicitly programmed. The multimodal gait-switching strategies are reminiscent of those adopted by swimming microorganisms. Furthermore, we examine the performance of these locomotion strategies against perturbations by background flows. The results showcase the versatility of AI-powered swimmers and their robustness in media with uncontrolled environmental factors.

Results and discussion

Model reconfigurable system

We consider a simple reconfigurable system consisting of three spheres with radius R and centers ri (i = 1, 2, 3) connected by two arms with variable lengths and orientations as shown in Fig. 1a. This setup generalizes previous swimmer models proposed by Najafi and Golestanian39 and Ledesma-Aguilar et al. 40 by allowing more degrees of freedom. The interaction between the system and the surrounding viscous fluid is modeled by low Reynolds number hydrodynamics, imposing stringent constraints on the locomotive capability of the system. Unlike the traditional paradigm where the locomotory gaits are prescribed in advance39,40,41,42,43,44, here we exploit a deep RL framework to enable the system to self-learn a set of locomotory gaits to swim along a target direction, θT. We employ a deep neural network based on the Actor-Critic structure and implement the Proximal Policy Optimization (PPO) algorithm29,45 to train and update the agent (i.e., AI) in charge of the decision making process (Fig. 1b). The deep RL framework here extends previous studies from discrete action spaces to continuous action spaces32,35,37,46, enhancing the swimmer’s capability in developing more versatile locomotory gaits for complex navigation tasks (see the “Methods” section for implementation details of the Actor-Critic neural network and PPO algorithm).

Fig. 1: Schematics of the model microswimmer and the deep neural network with Actor-Critic structure.

a Schematic of the model microswimmer consisting of three spheres with radius R and centers ri (i = 1, 2, 3). We mark the leftmost sphere r1 as red and the other two spheres r2, r3 as blue to indicate the current orientation of the swimmer. The spheres are connected by two arms with variable lengths L1, L2 and orientations θ1, θ2, where θ31 is the intermediate angle between the two arms. The swimmer's orientation θo is defined based on the relative position between the swimmer's centroid rc = ∑iri/3 and r1 as \({\theta }_{\rm o}=\arg ({\mathbf{r}}_{\rm c}-{\mathbf{r}}_{1})\). The swimmer is trained to swim along a target direction θT. b Schematic of Actor-Critic neural networks. Both networks consist of three sets of layers (input layer, hidden layers, and output layer). Each layer is composed of neurons (marked as nodes). The weights of the neural network are illustrated as links between the nodes. The input layer has the same dimension as the observation. The three linear hidden layers have dimensions of 64, 32, and 32, respectively. The output layer dimension of the actor network is the same as the action space dimension, whereas the output layer of the critic network has only 1 neuron. We discuss the general idea as follows: based on the current observation, a reinforcement learning agent decides the next action using the Actor neural network. The next action is then evaluated by the Critic neural network to guide the training process. The swimmer performs the action advised by the agent and interacts with the hydrodynamic environment, leading to movements that constitute the next observation and reward. Both the Actor and Critic neural networks are updated periodically to improve the overall performance. See more details in the “Methods” section.
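To make the architecture concrete, the sketch below (in PyTorch) shows one way such an Actor-Critic pair could be constructed with the layer sizes quoted above; the tanh activations and the learnable log-standard deviation of the Gaussian policy are our own illustrative assumptions rather than reported implementation details.

```python
# Sketch of an Actor-Critic pair with the layer sizes quoted in Fig. 1b
# (5-dimensional observation, 3-dimensional action, hidden layers of 64, 32, 32).
# The tanh activations and learnable log-standard deviation are illustrative assumptions.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(64, 32, 32)):
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.Tanh()]
        d = h
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

class Actor(nn.Module):
    """Maps an observation to a Gaussian distribution over the 3 action components."""
    def __init__(self, obs_dim=5, act_dim=3):
        super().__init__()
        self.mu_net = mlp(obs_dim, act_dim)
        self.log_std = nn.Parameter(-0.5 * torch.ones(act_dim))  # assumed initial spread

    def forward(self, obs):
        return torch.distributions.Normal(self.mu_net(obs), self.log_std.exp())

class Critic(nn.Module):
    """Maps an observation to a scalar value estimate V_phi(o)."""
    def __init__(self, obs_dim=5):
        super().__init__()
        self.v_net = mlp(obs_dim, 1)

    def forward(self, obs):
        return self.v_net(obs).squeeze(-1)
```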

Hydrodynamic interactions

The interaction between the spheres and their surrounding fluid is governed by the Stokes equation (\(\nabla p=\mu {\nabla }^{2}\mathbf{u}\), \(\nabla \cdot \mathbf{u}=0\)). Here, p, μ, and u represent, respectively, the pressure, dynamic viscosity, and velocity field. In this low Reynolds number regime, the velocities of the spheres Vi and the forces Fi acting on them can be related linearly as

$${\mathbf{V}}_{i}={\mathbf{G}}_{ij}{\mathbf{F}}_{j},$$
(1)

where Gij is the Oseen tensor47,48,49 given by

$${\mathbf{G}}_{ij}=\left\{\begin{array}{ll}\dfrac{1}{6\pi \mu R}\,\mathbf{I}, & i=j,\\ \dfrac{1}{8\pi \mu |{\mathbf{r}}_{i}-{\mathbf{r}}_{j}|}\left(\mathbf{I}+{\hat{\mathbf{r}}}_{ij}{\hat{\mathbf{r}}}_{ij}\right), & i\ne j.\end{array}\right.$$
(2)

Here, I is the identity matrix and \({\hat{\mathbf{r}}}_{ij}=({\mathbf{r}}_{i}-{\mathbf{r}}_{j})/|{\mathbf{r}}_{i}-{\mathbf{r}}_{j}|\) denotes the unit vector between spheres i and j. The torque acting on sphere i is calculated by Ti = ri × Fi. The rate of actuation of the arm lengths \(\dot{L}_{1}\), \(\dot{L}_{2}\) and the intermediate angle \(\dot{\theta }_{31}\) can be expressed in terms of the velocities of the spheres Vi. The kinematics of the swimmer is fully determined upon applying the force-free (∑iFi = 0) and torque-free (∑iTi = 0) conditions. The Oseen tensor hydrodynamic description is valid when the spheres are not in close proximity (R ≪ L). We therefore constrain the arm and angle contractions such that 0.6L ≤ L1, L2 ≤ L and 2π/3 ≤ θ31 ≤ 4π/3.
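As an illustrative sketch (not the original implementation), the Oseen mobility matrix of Eq. (2) for three spheres with planar centers could be assembled as follows:

```python
# Sketch: assemble the Oseen mobility matrix of Eq. (2) for three spheres with
# planar (2D) centers; G maps the stacked forces to the stacked sphere velocities.
import numpy as np

def oseen_mobility(centers, mu=1.0, R=0.1):
    """centers: (3, 2) array of sphere centers r_i; returns the 6x6 matrix G."""
    G = np.zeros((6, 6))
    for i in range(3):
        for j in range(3):
            block = np.s_[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            if i == j:
                G[block] = np.eye(2) / (6 * np.pi * mu * R)        # self-mobility
            else:
                d = centers[i] - centers[j]
                dist = np.linalg.norm(d)
                rhat = d / dist
                G[block] = (np.eye(2) + np.outer(rhat, rhat)) / (8 * np.pi * mu * dist)
    return G
```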

The actuation rate of the arm lengths \({\dot{L}}_{1},{\dot{L}}_{2}\) can be expressed in terms of the relative velocities of the spheres parallel to the arm orientations:

$$({\mathbf{V}}_{2}-{\mathbf{V}}_{1})\cdot {\hat{\mathbf{r}}}_{21}={\dot{L}}_{1},$$
(3)
$$({\mathbf{V}}_{3}-{\mathbf{V}}_{2})\cdot {\hat{\mathbf{r}}}_{32}={\dot{L}}_{2}.$$
(4)

The actuation rate of the intermediate angle \({\dot{\theta }}_{31}\) can be expressed in terms of the relative velocities of the spheres perpendicular to the arm orientations:

$$({\mathbf{V}}_{2}-{\mathbf{V}}_{1})\cdot \frac{{\rm{d}}{\hat{\mathbf{r}}}_{21}}{{\rm{d}}{\theta }_{1}}={L}_{1}{\dot{\theta }}_{1},$$
(5)
$$({\mathbf{V}}_{3}-{\mathbf{V}}_{2})\cdot \frac{{\rm{d}}{\hat{\mathbf{r}}}_{32}}{{\rm{d}}{\theta }_{2}}={L}_{2}{\dot{\theta }}_{2},$$
(6)
$${\dot{\theta }}_{1}-{\dot{\theta }}_{2}={\dot{\theta }}_{31},$$
(7)

where \(\dot{\theta }_{1}\) and \(\dot{\theta }_{2}\) are the arm rotation speeds. Together with the Oseen tensor description of the hydrodynamic interaction between the spheres, Eqs. (1) and (2), and the overall force-free and torque-free conditions, the kinematics of the swimmer is fully determined.
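Continuing the sketch above, the snippet below illustrates (under our stated conventions that arm 1 points from sphere 1 to sphere 2 at angle θ1 and arm 2 from sphere 2 to sphere 3 at angle θ2) how Eqs. (3)–(7), together with the force-free and torque-free conditions, could be assembled into a single linear solve for the sphere forces and arm rotation rates; the sphere velocities then follow from Eq. (1).

```python
# Sketch: given the actuation rates chosen by the agent, solve the linear system formed
# by Eqs. (3)-(7) plus the force-free and torque-free conditions for the sphere forces
# and arm rotation rates, then recover V_i = G_ij F_j (Eq. 1).
# Reuses oseen_mobility() from the previous snippet; the conventions are our assumptions.
import numpy as np

def sphere_velocities(centers, theta1, theta2, L1, L2,
                      L1_dot, L2_dot, theta31_dot, mu=1.0, R=0.1):
    G = oseen_mobility(centers, mu, R)
    t1 = np.array([np.cos(theta1), np.sin(theta1)])    # unit vector along arm 1
    t2 = np.array([np.cos(theta2), np.sin(theta2)])    # unit vector along arm 2
    n1 = np.array([-np.sin(theta1), np.cos(theta1)])   # d(t1)/d(theta1)
    n2 = np.array([-np.sin(theta2), np.cos(theta2)])   # d(t2)/d(theta2)

    # Relative-velocity rows expressed through V = G F (forces are unknowns 0..5).
    dV21 = G[2:4, :] - G[0:2, :]
    dV32 = G[4:6, :] - G[2:4, :]

    # Unknowns x = [F1x, F1y, F2x, F2y, F3x, F3y, theta1_dot, theta2_dot].
    A, b = np.zeros((8, 8)), np.zeros(8)
    A[0, :6] = t1 @ dV21; b[0] = L1_dot                 # Eq. (3)
    A[1, :6] = t2 @ dV32; b[1] = L2_dot                 # Eq. (4)
    A[2, :6] = n1 @ dV21; A[2, 6] = -L1                 # Eq. (5)
    A[3, :6] = n2 @ dV32; A[3, 7] = -L2                 # Eq. (6)
    A[4, 6], A[4, 7] = 1.0, -1.0; b[4] = theta31_dot    # Eq. (7)
    A[5, [0, 2, 4]] = 1.0                               # force-free, x-component
    A[6, [1, 3, 5]] = 1.0                               # force-free, y-component
    for i in range(3):                                  # torque-free (z-component)
        A[7, 2 * i] = -centers[i][1]
        A[7, 2 * i + 1] = centers[i][0]

    x = np.linalg.solve(A, b)
    return (G @ x[:6]).reshape(3, 2)                    # sphere velocities V_i (Eq. 1)
```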

In presenting our results, we scale lengths by the fully extended arm length L, velocities by a characteristic actuation rate of the arm Vc, and hence time by L/Vc and forces by μLVc (see Non-dimensionalization under Supplementary methods).

Targeted navigation

We first use the deep RL framework to train the model system to swim along a target direction θT, given any arbitrary initial swimmer orientation θo. The swimmer’s orientation is defined based on the relative position between the swimmer’s centroid rc = ∑iri/3 and r1 as \({\theta }_{{\rm {o}}}=\arg ({\mathbf{r}}_{{\rm {c}}}-{\mathbf{r}}_{1})\) (Fig. 1).

In the RL algorithm, the state s ∈ (r1, L1, L2, θ1, θ2) of the system is specified by the sphere center r1, arm lengths L1, L2, and arm orientations θ1, θ2. The observation \(o\in ({L}_{1},{L}_{2},{\theta }_{31},\cos {\theta }_{{\rm {d}}},\sin {\theta }_{{\rm {d}}})\) is extracted from the state, where θ31 is the intermediate angle and θd = θT − θo is the difference between the target direction θT and the swimmer’s orientation θo; note that the angle difference is expressed in terms of \((\cos {\theta }_{{\rm {d}}},\sin {\theta }_{{\rm {d}}})\) to avoid discontinuity in the orientation space. The AI decides the swimmer’s next action based on the observation using the Actor neural network: for each action step Δt, the swimmer performs an action \(a\in ({\dot{L}}_{1},{\dot{L}}_{2},{\dot{\theta }}_{31})\) by actuating its two arms, leading to a swimmer displacement. To quantify the success of a given action, the reward is measured by the displacement of the swimmer’s centroid along the target direction, \({r}_{t}=({\mathbf{r}}_{{{\rm {c}}}_{t+1}}-{\mathbf{r}}_{{{\rm {c}}}_{t}})\cdot (\cos {\theta }_{{\rm {T}}},\,\sin {\theta }_{{\rm {T}}})\).
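As a minimal sketch (with illustrative variable names), the observation and reward defined above could be computed from the swimmer state as follows:

```python
# Sketch: build the 5-dimensional observation and the reward used for training.
# Names are illustrative; the quantities follow the definitions in the text.
import numpy as np

def observation(L1, L2, theta31, theta_T, theta_o):
    theta_d = theta_T - theta_o            # difference between target and orientation
    # express the angle difference as (cos, sin) to avoid the orientation discontinuity
    return np.array([L1, L2, theta31, np.cos(theta_d), np.sin(theta_d)])

def reward(centroid_old, centroid_new, theta_T):
    # displacement of the centroid projected onto the target direction
    target_dir = np.array([np.cos(theta_T), np.sin(theta_T)])
    return (centroid_new - centroid_old) @ target_dir
```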

We divide the training process into a total of Ne episodes, with each episode consisting of Nl = 150 learning steps. To ensure a full exploration of the observation space o, both the initial swimmer state s and the target direction θT are randomized in each episode. Based on the training results after every 20 episodes, the Critic neural network guides the update of the AI to maximize the expected long-term rewards \(E[{R}_{t=0}\,|\,{\pi }_{\theta }]\), where πθ is the stochastic control policy, \({R}_{t}=\mathop{\sum }\nolimits_{t^{\prime}=t}^{\infty }{\gamma }^{t^{\prime} -t}{r}_{t^{\prime} }\) is the infinite-horizon discounted future return, and γ is the discount factor measuring the greediness of the algorithm45,50. A large discount factor γ = 0.99 is set here to ensure farsightedness of the algorithm. As the episodes proceed, the Actor-Critic structure progressively trains the AI and thereby enhances the performance of the swimmer.

In Fig. 2 (Supplementary Movie 1) we visualize the navigation of a trained swimmer along a target direction θT, given a substantially different initial orientation θo. The swimmer’s targeted navigation is accomplished in three stages: (1) in the initial phase (blue curve and regime), the swimmer employs “steering” gaits primarily for re-orientation, followed by (2) the “transition” phase (red curve and regime), in which the swimmer continues to adjust its direction while self-propelling, before reaching (3) the “translation” phase (green curve and regime), in which the re-orientation is complete and the swimmer simply self-propels along the target direction. This example illustrates how an AI-powered reconfigurable system evolves a multimodal navigation strategy without being explicitly programmed or relying on any prior knowledge of low-Reynolds-number locomotion. We next analyze the locomotory gaits of each mode in the evolved strategy.

Fig. 2: Example of target navigation utilizing three distinct locomotory gaits.

The Artificial Intelligence powered swimmer switches between distinct locomotory gaits (steering, transition, translation) advised by the reinforcement learning algorithm to steer itself towards a specified target direction θT (black arrow) and swim along the target direction afterwards. Different parts of the swimmer's trajectory are colored to represent the locomotion due to different locomotory gaits, where the steering, transition, and translation gaits are marked in blue, red, and green, respectively. Schematics of the swimmer configurations (not to scale) are shown for illustrative purposes, where the leftmost sphere is marked in red and the other two spheres in blue to indicate the swimmer's current orientation (gray arrows). The inset shows the change in the swimmer's orientation θo over action steps. An animation of this simulation is shown in Supplementary Movie 1.

Multimodal locomotory gaits

Here we examine the details of the locomotory gaits acquired by the swimmer for targeted navigation in the steering, transition, and translation modes. We distinguish these gaits by visualizing their configurational changes in the three-dimensional (3D) configuration space of the swimmer (L1, L2, θ31) in Fig. 3. Here we utilize an example of a swimmer navigating towards a target direction with θd > π/2 to illustrate the switching between different locomotory gaits (Fig. 3a, Supplementary Movies 2 and 3). The swimmer needs to re-orient itself in the counter-clockwise direction in this example; an example for the case of clockwise rotation is included in Supplementary Note 1 (Supplementary Fig. 1, Movies 7 and 8). The dots in Fig. 3a represent configurations at different action steps. The configurations for the steering (blue dots), transition (red dots), and translation (green dots) gaits are clustered in different regions of the configuration space. A representative sequence of configurational changes for each mode of gait is shown as a solid line to aid visualization (Fig. 3a).

Fig. 3: Analysis of configurational changes revealing three distinct modes of locomotory gaits.

The steering, transition, and translation gaits are marked in blue, red, and green, respectively. a A 3D configuration plot for a typical simulation in which the swimmer aligns with the target direction via a counterclockwise rotation, where L1, L2 are the arm lengths and θ31 is the intermediate angle. Each dot represents one specific configuration of a locomotory gait. The solid lines mark an example cycle of each locomotory gait. b The changes in the arm lengths L1 and L2 and the intermediate angle θ31 with respect to the configuration number for each locomotory gait. c The average translational velocity \(\langle \dot{x}\rangle\) and rotational velocity \(\langle \dot{\theta }\rangle\) are calculated by averaging the centroid translation along the target direction θT and the change in the swimmer's orientation θo over the total number of action steps for each locomotory gait. d Representative configurations labeled with the configuration number are displayed to illustrate the configurational changes for each selected sequence of locomotory gaits for the steering (blue box), transition (red box), and translation (green box) modes. The leftmost sphere of the swimmer is marked in red and the other two spheres in blue to indicate the swimmer's current orientation. The gray arrows indicate the contraction/extension of the arms and the intermediate angle. For illustration, the reference frame of the configurations is rotated consistently such that the left arm of the first configuration is aligned horizontally in each sequence. Animations of the counterclockwise and clockwise simulations are shown in Supplementary Movies 2, 3 and 7, 8.

We further examine the evolution of L1, L2, and θ31 using the representative sequences of configurational changes identified in Fig. 3a for each mode of gait. For the steering gaits (Fig. 3b, blue lines and Fig. 3d, blue box), the swimmer repeatedly extends and contracts L2 and θ31, but keeps L1 constant (the left arm rests in the fully contracted state). The steering gaits thus reside in the L2–θ31 plane in Fig. 3a (blue line). The large variation in θ31 generates net rotation, substantially re-orienting the swimmer with a relatively small net translation (Fig. 3c). For the transition gaits (Fig. 3b, red lines and Fig. 3d, red box), the swimmer repeatedly extends and contracts all of L1, L2, and θ31, leading to significant amounts of both net rotation and translation (Fig. 3c). In the configuration space (Fig. 3a), the transition gaits tilt into the L1–L2 plane with an average θ31 less than π (red line). Compared with the steering gaits, the variation of θ31 becomes more restricted (Fig. 3b), resulting in a smaller net rotation for fine-tuning the swimmer’s orientation in the transition phase. Finally, for the translation gaits (Fig. 3b, green lines and Fig. 3d, green box), the swimmer’s orientation is aligned with the target direction (θd ≈ 0); the swimmer repeatedly extends and contracts L1 and L2, while keeping θ31 close to π (i.e., all three spheres of the swimmer are aligned), resembling the swimming gaits of Najafi–Golestanian swimmers39,51. In the configuration space (Fig. 3a), the translation gaits reside largely in the L1–L2 plane with an approximately zero average \(\dot{\theta}_{31}\), generating the maximum net translation with minimal net rotation (Fig. 3c). The details of gait categorization are summarized under Supplementary methods.

It is noteworthy that the multimodal navigation strategy emerges solely from the AI without relying on prior knowledge of locomotion. The switching between the steering, transition, and translation gaits is analogous to the switching between turning and running modes observed in bacterial locomotion2,5. These results demonstrate how an AI-powered swimmer, without being explicitly programmed, self-learns complex locomotory gaits from rich action and configuration spaces and undergoes autonomous gait switching in accomplishing targeted navigation.

Performance evaluation

Here we investigate the improvement of the swimmer’s performance with an increasing number of training episodes Ne. At the initial stage of training with a small Ne, the swimmer may fail to identify the right sets of locomotory gaits to achieve targeted navigation due to insufficient training. Continued training with an increased number of episodes enables the swimmer to identify better locomotory gaits to complete navigation tasks. Here we measure the improvement of the swimmer’s performance with increasing Ne by three locomotion tests: (1) Random target test: the swimmer is assigned a target direction selected randomly from a uniform distribution in [0, 2π]; (2) Rotation test: the swimmer is assigned a target direction with a large angular difference from the swimmer’s orientation (i.e., θd = ±π/2); (3) Translation test: the swimmer is assigned a target direction equal to the swimmer’s orientation (i.e., θd = 0). A test is considered successful if the swimmer travels along the target direction for a distance of 5 units within 10,000 action steps. These tests ensure that the trained swimmer acquires a set of effective locomotory gaits to swim along any specified direction with robust rotation and translation.
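A sketch of how one such trial could be scored is given below; the `step` callable is a hypothetical stand-in for one action step of the trained swimmer, returning the updated centroid position.

```python
# Sketch: score one locomotion test. The trial succeeds if the swimmer's centroid
# advances 5 length units along the target direction within 10,000 action steps.
# `step` is a hypothetical callable performing one action step of the trained swimmer.
import numpy as np

def run_trial(step, centroid, theta_T, max_steps=10_000, target_distance=5.0):
    start = np.asarray(centroid, dtype=float)
    direction = np.array([np.cos(theta_T), np.sin(theta_T)])
    for _ in range(max_steps):
        centroid = np.asarray(step(theta_T), dtype=float)
        if (centroid - start) @ direction >= target_distance:
            return True
    return False

# Illustrative usage: success rate of the random target test over 100 trials.
# rate = np.mean([run_trial(step, r_c0, np.random.uniform(0, 2 * np.pi)) for _ in range(100)])
```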

We consider the success rates of the three tests over 100 trials (Fig. 4). For Ne = 3 × 104, success rates of around 90% are obtained for the three tests. When Ne is increased to 9 × 104, the swimmer masters translation with a 100% success rate but still needs more training for rotation. When Ne is increased further to 15 × 104, the swimmer obtains 100% success rates for all tests. This result demonstrates the continuous improvement in the robustness of targeted navigation with increasing Ne up to 15 × 104. As we increase Ne further, we find the relationship between Ne and performance to be non-monotonic: for a total number of training episodes much greater than 15 × 104, the overall success rate begins to drop and eventually fluctuates around 95%. We therefore selected the result trained at Ne = 15 × 104 for its best overall performance.

Fig. 4: Analysis of the swimmer’s performance with increasing number of episodes.

Number of episodes Ne indicates the total training time of the swimmer. Each episode during training contains a fixed number of action steps Nl = 150. a We used three tests (random target test, rotation test, and translation test) to measure the swimmer's performance within a fixed number of training steps Nl = 150. For all tests, the swimmer starts with a random initial configuration to ensure a full exploration of the observation space. A total of 100 trials are considered for each test with swimmers trained at different Ne. A swimmer with insufficient training (3 × 104 episodes) may occasionally fail the three tests (success rate ≈ 90%). At Ne = 9 × 104, the swimmer masters translation and improves its rotation ability. When Ne increases to 1.5 × 105, the swimmer obtains a 100% success rate in all tests. b Schematics of the random target test, rotation test, and translation test. The leftmost sphere is marked in red and the other spheres in blue to indicate the swimmer's orientation θo (red dashed arrows). Given a random initial configuration, we test the swimmer's ability to translate along or rotate towards a target direction θT (solid red arrows). The black dashed arrows indicate the swimmer's intended moving direction.

To better understand the swimmer’s training process, we also varied the number of steps in each episode, Nl. For Nl ranging from 100 to 300 at a fixed total number of episodes Ne, we found that Nl = 150 provides the most efficient balance between translation and rotation, requiring the least number of action steps to complete both the rotation and translation tests. We remark that, when Nl = 100, the swimmer was only able to translate but not to rotate, indicating the significant role Nl plays in learning.

Lastly, we remark that the swimmer appears to require more training, in both Ne and Nl, to learn rotation compared with translation. This may be attributed to the inherent complexity of the rotation gaits, in which the swimmer needs to actuate its intermediate angle in addition to the actuation of the two arms required in the translation gaits.

Path tracing: “SWIM”

Next we showcase the swimmer’s capability in tracing complex paths in an autonomous manner. To illustrate, the swimmer is tasked to trace out the English word “SWIM” (Fig. 5, Supplementary Movie 4). We note that the hydrodynamic calculations required to design the locomotory gaits to trace such complex paths become quickly intractable as the complexity increases. Here, instead of explicitly programming the gaits of the swimmer, we only select target points (pi, i = 1, 2, . . . , 17, red spots in Fig. 5) as landmarks and require the swimmer to navigate towards these landmarks with its own AI, with the target direction at action step t + 1 given by \({\theta }_{{T}_{t+1}}=\arg ({\mathbf{p}}_{i}-{\mathbf{r}}_{{{\rm{c}}}_{t}})\). The swimmer is assigned the next target point pi+1 when its centroid is within a certain threshold (0.1 of the fully extended arm length) of pi. Completing these navigation tasks sequentially enables the swimmer to trace out the word “SWIM” with high accuracy (Fig. 5, Supplementary Movie 4). In accomplishing this task, the swimmer switches between the three modes of locomotory gaits autonomously to swim towards individual target points and turn around the corners of the path based on the AI-powered navigation strategy. It is noteworthy that the swimmer is able to navigate around some corners (e.g., at target points 4 and 6) without activating the steering gaits, which are employed for corners with more acute angles (e.g., at target points 8, 14, and 16). While past approaches based on detailed hydrodynamic calculations, manual interventions, or other control methods may also complete such tasks, here we present reinforcement learning as an alternative approach for accomplishing these complex maneuvers in a more autonomous manner.
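A sketch of this waypoint logic is shown below; `step` is again a hypothetical stand-in for one action step of the trained swimmer, returning the updated centroid position.

```python
# Sketch: recompute the target direction towards the current landmark at every action
# step and switch to the next landmark once the centroid is within the threshold
# (0.1 of the fully extended arm length). `step` is a hypothetical one-step helper.
import numpy as np

def trace_path(step, centroid, waypoints, threshold=0.1):
    centroid = np.asarray(centroid, dtype=float)
    for p in waypoints:                          # target points p_1, ..., p_17
        p = np.asarray(p, dtype=float)
        while np.linalg.norm(p - centroid) > threshold:
            dx, dy = p - centroid
            theta_T = np.arctan2(dy, dx)         # theta_T at step t+1 = arg(p_i - r_c)
            centroid = np.asarray(step(theta_T), dtype=float)
    return centroid
```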

Fig. 5: Demonstration of complex navigation capability of Artificial Intelligence powered swimmer.

The Artificial Intelligence powered swimmer switches between various locomotory gaits autonomously in tracing a complex trajectory, “SWIM”. The trajectory of the central sphere of the swimmer is colored based on the mode of locomotory gait: steering (blue), transition (red), and translation (green). The swimmer is given a list of target points (1–17), one target point at a time. The black arrows at each point indicate the intended direction of the swimmer. From the current target point, the swimmer determines the target direction for the next action step t + 1, \({\theta }_{{T}_{t+1}}\), and adapts the locomotory gaits based on its AI in navigating towards that direction. Schematics of the swimmer configurations (not to scale) are shown for illustrative purposes, where the leftmost sphere is marked in red and the other two spheres in blue to indicate the swimmer's current orientation. An animation of this simulation is shown in Supplementary Movie 4.

Robustness against flows

Last, we examine the performance of targeted navigation under the influence of flows (Fig. 6a, b, Supplementary Movies 5, 6). In particular, to determine to what extent the AI-powered swimmer is capable of maintaining its target direction against flow perturbations, we use the same AI-powered swimmer trained without any background flow, and impose a rotational flow generated by a rotlet at the origin47,48, \(\mathbf{u}=-\boldsymbol{\gamma }\times \mathbf{r}/{r}^{3}\), where \(\boldsymbol{\gamma }=\gamma {\mathbf{e}}_{z}\) prescribes the strength of the rotlet in the z-direction and \(r=|\mathbf{r}|\) is the magnitude of the position vector r from the origin (see the section “Simulations of background flow” under Supplementary methods). Here the AI-powered swimmer is tasked to navigate towards the positive x-direction under flow perturbations due to the rotlet. We examine how the swimmer adapts to the background flow when performing this task. For comparison, we contrast the resulting motion of the AI-powered swimmer with that of an untrained swimmer (i.e., a Najafi–Golestanian (NG) swimmer that performs only fixed locomotory gaits without any adaptivity39). Without the background flow, both swimmers self-propel with the same speed. Both swimmers are initially placed close to the rotlet with rc = −5ex, and we sample their performance with three different initial orientations, \({\theta }_{{o}_{0}}=-\pi /3\), 0, and π/3, under different flow strengths. Under a relatively weak flow (γ = 0.15, Fig. 6a, Supplementary Movie 5), the AI-powered swimmer is capable of navigating towards the positive x-direction against the flow perturbations regardless of its initial orientation. In contrast, the trajectories of the NG swimmer are passively influenced by the rotlet flow, depending on the initial orientation of the swimmer. For an increased flow strength (γ = 1.5, Fig. 6b, Supplementary Movie 6), the NG swimmer completely loses control of its direction and is scattered by the rotlet into different directions, again due to the absence of any adaptivity. Under such a strong flow, the AI-powered swimmer initially circulates around the rotlet but eventually manages to escape from it, navigating to the positive x-direction successfully with similar trajectories for all initial orientations. We note that the vorticity experienced by the swimmer in this case is comparable with the typical re-orientation rates of the AI-powered swimmer. We also remark that when navigating under flow perturbations, the AI-powered swimmer adopts the transition gaits to constantly re-orient itself towards the positive x-direction and eventually self-propels along that direction. These results showcase the AI-powered swimmer’s capability of adapting its locomotory gaits to navigate robustly against flows.
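For reference, a short sketch of the rotlet velocity field evaluated in the swimming plane, following the sign convention written above, is given below; in the simulations this background flow is superposed onto the swimmer's gait-generated motion.

```python
# Sketch: the background rotlet flow written above, u = -gamma x r / |r|^3 with
# gamma = gamma_z e_z, evaluated at a point in the swimming plane (z = 0);
# only the in-plane velocity components are returned.
import numpy as np

def rotlet_flow(point_xy, gamma_z=0.15):
    x, y = point_xy
    r3 = (x * x + y * y) ** 1.5
    # gamma x r = (-gamma_z * y, gamma_z * x, 0), so u = -gamma x r / r^3 gives:
    return np.array([gamma_z * y, -gamma_z * x]) / r3
```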

Fig. 6: Analysis of the performance of targeted navigation under the influence of flows.

a The Artificial Intelligence powered swimmer and the Najafi–Golestanian (NG) swimmer escape from a relatively weak rotlet flow, \(\mathbf{u}=-\boldsymbol{\gamma }\times \mathbf{r}/{r}^{3}\), where \(\boldsymbol{\gamma }=\gamma {\mathbf{e}}_{z}\) prescribes the strength of the rotlet in the z-direction and \(r=|\mathbf{r}|\) is the magnitude of the position vector r from the origin (γ = 0.15). The leftmost sphere of the AI-powered swimmer is marked in red and the other spheres in blue to indicate the swimmer's current orientation (blue dashed arrow). The NG swimmer is colored red with its orientation marked as red dashed arrows. Three sets of trajectories (dashed, dotted, and solid lines) are shown with different initial swimmer orientations \({\theta }_{{{\rm {o}}}_{0}}\). The AI-powered swimmer travels to the right regardless of its initial orientation, whereas the trajectory of the NG swimmer is highly affected by the rotlet flow. b We compare the trajectories of the AI-powered swimmer and the NG swimmer in a strong rotlet flow (γ = 1.5). The NG swimmer completely loses control in the flow, while the AI-powered swimmer maintains its orientation towards the positive x-direction, with similar trajectories for different initial orientations. Animations of the two simulations are shown in Supplementary Movies 5 and 6.

Conclusions

In this work, we present a deep RL approach to enable navigation of an artificial microswimmer via gait switching advised by the AI. In contrast to previous works that considered active particles with prescribed self-propelling velocities as minimal models32,34,35 or simple one-dimensional swimmers37,38,46, here we demonstrate how a reconfigurable system can learn complex locomotory gaits from rich and continuous action spaces to perform sophisticated maneuvers. Through RL, the swimmer develops distinct locomotory gaits for a multimodal (i.e., steering, transition, and translation) navigation strategy. The AI-powered swimmer can adapt its locomotory gaits in an autonomous manner to navigate towards any arbitrary direction. Furthermore, we show that the swimmer can navigate robustly under the influence of flows and trace convoluted paths. Instead of explicitly programming the swimmer to perform these tasks, as in the traditional approach, the swimmer is advised by the AI to perform complex locomotory gaits and autonomous gait switching in accomplishing these navigation tasks. The multimodal strategy employed by the AI-powered swimmer is reminiscent of the run-and-tumble motion of bacteria2,5. Taken together, our results showcase the vast potential of this deep RL approach in realizing adaptivity similar to that of biological organisms for robust locomotive capabilities. Such adaptive behaviors are crucial for future biomedical applications of artificial microswimmers in complex media with uncontrolled and/or unpredictable environmental factors.

We finally discuss several possibilities for subsequent investigations based on this deep RL approach. While we demonstrate only planar motion in this work, the approach can be readily extended to three-dimensional navigation by allowing out-of-plane rotation of the swimmer’s arms, with expanded observation and action spaces for the additional degrees of freedom. Moreover, the deep RL framework is not tied to any specific swimmer; a simple multi-sphere system is used in this work for illustration, and the same framework applies to other reconfigurable systems. We also remark that the AI-powered swimmer is able to overcome some influences of flows even though such flows were absent in the training. Subsequent investigations that include flow perturbations in the training may lead to an even more powerful AI that could exploit the flows to further enhance the navigation strategies. Another practical aspect to consider is the effect of Brownian noise52,53,54. Specifically, the characterization of the effect of thermal fluctuations on both the training process of the swimmer and its resulting navigation performance is currently underway. In addition to flow and thermal fluctuations, other environmental factors, including the presence of physical boundaries and obstacles, may be addressed in a similar manner in future studies. The deep RL approach here opens an alternative path towards designing adaptive microswimmers with robust locomotive and navigation capabilities in more complex, realistic environments.

Methods

Here we briefly explain the Proximal Policy Optimization (PPO) algorithm used to train our AI-powered swimmer.

In the PPO algorithm, the agent’s motion control is managed by a neural network with an Actor-Critic structure. The Actor network can be considered as a stochastic control policy πθ(at∣ot): it generates an action at, given an observation ot, following a Gaussian distribution. Here θ represents all the parameters of the Actor neural network. The Critic network is used to compute the value function Vϕ, assuming the agent starts at an observation o and acts according to a particular policy πθ. The parameters of the Critic network are represented as ϕ.
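A minimal sketch (in PyTorch, using the Actor and Critic modules sketched after Fig. 1; the exact interface is our assumption) of how an action and the quantities needed later for the update could be sampled:

```python
# Sketch: sample an action from the Gaussian policy pi_theta(a_t | o_t) and record the
# quantities needed later for the PPO update. Uses the Actor/Critic modules sketched
# after Fig. 1; the calling convention is an illustrative assumption.
import torch

@torch.no_grad()
def act(actor, critic, obs):
    """obs: tensor of shape (5,). Returns the action, its log-probability, and V_phi(o)."""
    dist = actor(obs)                        # Normal(mu(obs), exp(log_std))
    action = dist.sample()                   # a_t ~ pi_theta(. | o_t)
    log_prob = dist.log_prob(action).sum()   # joint log-probability over 3 action dims
    value = critic(obs)                      # value estimate used by the Critic update
    return action, log_prob, value
```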

To effectively train the swimmer, we divide the total training process into episodes. Each episode can be considered as one round, which terminates after a fixed number of training steps (Nl = 150). To ensure a full exploration of the observation space, we randomly initialize the swimmer’s geometric configuration (L1, L2, θ1, θ2) and the target direction (θT) at the beginning of each episode.

At time t, the agent receives its current observation ot and samples an action at based on the policy πθ. Given at, the swimmer interacts with its surroundings, and the next state st+1 and reward rt are calculated. The next observation ot+1, extracted from st+1, is sent to the agent for the next iteration. All the observations, actions, rewards, and sampling probabilities are stored for the agent’s update. The update process begins after running a fixed number of episodes, NE = 20 (the total number of training steps per update is therefore N = NE × Nl = 3000). The goal of the update is to optimize θ so that the expected long-term reward J(πθ) = E[Rt=0 ∣ πθ] is maximized.
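The rollout collection described above could be sketched as follows; `env_reset` and `env_step` are hypothetical stand-ins for the hydrodynamic environment, and `act` is the sampling helper sketched earlier.

```python
# Sketch of the rollout collection: N_E = 20 episodes of N_l = 150 steps
# (N = 3000 transitions) are gathered before each agent update.
N_E, N_l = 20, 150

def collect_rollout(actor, critic, env_reset, env_step):
    buffer = {"obs": [], "act": [], "rew": [], "logp": [], "val": []}
    for _ in range(N_E):
        obs = env_reset()                        # random configuration and target theta_T
        for _ in range(N_l):
            action, log_prob, value = act(actor, critic, obs)
            next_obs, reward = env_step(action)  # swimmer hydrodynamics, Eqs. (1)-(7)
            buffer["obs"].append(obs)
            buffer["act"].append(action)
            buffer["rew"].append(reward)
            buffer["logp"].append(log_prob)
            buffer["val"].append(value)
            obs = next_obs
    return buffer
```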

The expectation is taken with respect to each running episode, τ. Here, we use the infinite-horizon discounted returns \({R}_{t}=\mathop{\sum }\nolimits_{t^{\prime}=t}^{\infty }{\gamma }^{t^{\prime} -t}{r}_{t^{\prime} }\), where γ is the discount factor measuring the greediness of the algorithm. We set γ = 0.99 to ensure farsightedness. To solve this optimization problem, we use the typical policy gradient approach, estimating ∇θJ(πθ). More specifically, we implemented the clipped-advantage PPO algorithm to avoid large changes in each gradient update. We estimate the surrogate objective for J(πθ) by clipping the probability ratio r(θ) times the advantage function \({\hat{A}}_{t}\). The probability ratio measures the probability of selecting an action under the current policy relative to the old policy, \(r(\theta )={\pi }_{\theta }{(a|o)}_{N\times 1}/{\pi }_{{\theta }_{\rm old}}{(a|o)}_{N\times 1}\). The advantage function \({\hat{A}}_{t}\) describes the relative advantage of taking an action a based on an observation o over a randomly selected action and is calculated by subtracting the value function VN×1 from the discounted return RN×1, \({\hat{A}}_{t}={R}_{N\times 1}-{V}_{N\times 1}\).
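The returns, advantage, and probability ratio could be computed as in the following sketch:

```python
# Sketch: discounted returns R_t, advantage estimates A_hat = R - V, and the PPO
# probability ratio r(theta) from stored (old) and current log-probabilities.
import torch

def discounted_returns(rewards, gamma=0.99):
    returns, running = [], 0.0
    for r in reversed(rewards):                 # R_t = r_t + gamma * R_{t+1}
        running = r + gamma * running
        returns.append(running)
    return torch.tensor(returns[::-1], dtype=torch.float32)

def advantage(returns, values):
    return returns - values                     # A_hat_t = R_{N x 1} - V_{N x 1}

def prob_ratio(new_log_probs, old_log_probs):
    return torch.exp(new_log_probs - old_log_probs)   # pi_theta / pi_theta_old
```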

We then update the parameters θ and ϕ via a standard gradient-based optimizer (Adam). The full details of our implementation are included in Algorithms 1 and 2 below. Here, K is the total number of epochs, Nl is the number of steps in one episode, and N is the total number of steps for each update. The PPO algorithm uses fixed-length trajectory segments τ. During each iteration, each of NA parallel actors collects T time steps of data; we then construct the surrogate loss on these NA × T time steps of data and optimize it with Adam for K epochs.

In the following we present the algorithm tables for the PPO algorithm employed in this work. We refer the readers to classical monographs for more details45.

Algorithm 1
Environment

1: for time step t = 0, 1, . . . do
2:   if mod(t, Nl) = 0 then
3:     Reset state st
4:     Compute observation ot
5:   end if
6:   Sample action at from policy πθ
7:   Evaluate the next state st+1 and reward rt following the swimmer’s hydrodynamics
8:   Compute the next observation ot+1 from state st+1
9:   if t = 0 or mod(t, N) ≠ 0 then
10:     Append observation ot+1, action at, reward rt, and action sampling probability πθ(at∣ot) to the observation list oN×5, action list aN×3, reward list RN×1, and action sampling probability list πθold(a∣o)N×1
11:   else
12:     Update the agent using Algorithm 2
13:   end if
14: end for

Algorithm 2
Proximal Policy Optimization, Actor-Critic, Update the Agent

1: Input: initial policy parameters θ, initial value function parameters ϕ
2: for k = 0, 1, 2, …, K do
3:   Compute the infinite-horizon discounted returns RN×1
4:   Evaluate the expected returns VN×1 using observations oN×5 and the value function Vϕ
5:   Compute the advantage function: \({\hat{A}}_{t}={R}_{N\times 1}-{V}_{N\times 1}\)
6:   Evaluate the probability for policy πθ using observations oN×5 and actions aN×3; store the probability in πθ(a∣o)N×1
7:   Compute the probability ratio: \(r(\theta )={\pi }_{\theta }{(a|o)}_{N\times 1}/{\pi }_{{\theta }_{\rm old}}{(a|o)}_{N\times 1}\)
8:   Compute the clipped surrogate loss: \({L}^{\rm CLIP}(\theta )={\mathbb{E}}[\min (r(\theta ){\hat{A}}_{t},\,{\rm clip}(r(\theta ),1-\epsilon ,1+\epsilon ){\hat{A}}_{t})]\), with constant ϵ
9:   Compute the value-function loss: \({L}^{\rm VF}(\phi )=\frac{1}{2}{\mathbb{E}}[{({R}_{N\times 1}-{V}_{N\times 1})}^{2}]\)
10:   Compute the entropy loss: LS = αS[πθ], with constant α
11:   Compute the total loss: L(θ, ϕ) = −LCLIP(θ) + LVF(ϕ) − LS
12:   Optimize the total loss L with respect to (θ, ϕ) using Adam, with K epochs and minibatch size M ≤ NA × T, where NA is the number of parallel actors and T is the number of time steps collected per actor
13:   θold ← θ, ϕold ← ϕ
14: end for
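For readers who prefer code, the update in Algorithm 2 could be condensed into the following PyTorch sketch; the clipping constant ϵ, entropy weight α, number of epochs, and the use of a single optimizer over both networks are illustrative assumptions rather than reported hyperparameters.

```python
# Sketch of the update in Algorithm 2: clipped surrogate loss, value-function loss,
# entropy bonus, and Adam steps on the combined objective over several epochs.
# epsilon, alpha, the epoch count, and the shared optimizer are assumptions.
import torch

def ppo_update(actor, critic, optimizer, obs, actions, returns, old_log_probs,
               epochs=10, eps=0.2, alpha=0.01):
    for _ in range(epochs):                               # K epochs over the batch
        dist = actor(obs)
        log_probs = dist.log_prob(actions).sum(-1)
        values = critic(obs)
        advantage = (returns - values).detach()           # A_hat = R - V
        ratio = torch.exp(log_probs - old_log_probs)      # r(theta)
        l_clip = torch.min(ratio * advantage,
                           torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).mean()
        l_vf = 0.5 * ((returns - values) ** 2).mean()     # value-function loss
        l_s = alpha * dist.entropy().sum(-1).mean()       # entropy loss
        loss = -l_clip + l_vf - l_s                       # total loss L(theta, phi)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In such a sketch the optimizer would be constructed once over both networks, e.g. `torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)`, with the learning rate again an assumed value.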