Introduction

Simulations of wall-bounded turbulent flows have become a key element in the design cycle of wind farms1 and aircraft2 and the major factor in the predictive capabilities of simulations of atmospheric flows3. Due to the high Reynolds numbers associated with these flows, direct numerical simulations (DNS), where all scales of motion are resolved, are not attainable with current computing capabilities. LES aims to reduce the necessary grid requirements by resolving only the energy-containing eddies and modeling the smaller scale motions. However, this requirement is still hard to meet in the near-wall region, as the stress-producing eddies become progressively smaller, scaling linearly in size with the distance to the wall. Several studies4,5,6 have estimated that the number of grid points necessary for wall-resolved LES scales as \({{{{{{{\mathcal{O}}}}}}}}(R{e}^{13/7})\), where Re is the characteristic Reynolds number of the flow. This number of computational elements is orders of magnitude smaller than that required for DNS, yet it remains prohibitive. In turn, modeling the near-wall flow such that only the large-scale motions in the outer region of the boundary layer are resolved, the grid-point requirement for wall-modeled LES (WMLES) scale at most as \({{{{{{{\mathcal{O}}}}}}}}(Re)\). With WMLES, certification by analysis—prediction of the aerodynamic quantities of interest for engineering applications by numerical simulations alone—may soon be a reality. Certification by analysis is expected to narrow the number of wind tunnel experiments, reducing both the turnover time and cost of the design cycle.

Several strategies for modeling the near-wall region have been explored7,8,9,10. The taxonomy of WMLES approaches can be broadly categorized as Hybrid LES/RANS methods and wall-flux modeling. Hybrid LES/RANS and its variants8 combine Reynolds-averaged Navier–Stokes (RANS) equations close to the wall and LES in the outer layer, with the interface between RANS and LES domains enforced implicitly through the change in the turbulence model. In wall-flux modeling, the usual no-slip and thermal wall boundary conditions are replaced with stress and heat flux boundary conditions provided by the wall model. Examples of well-known approaches involve computing the wall stress using either the law of the wall11,12,13 or the RANS equations14,15,16,17,18,19,20. Models account for nonlinear advection and pressure gradient by solving the unsteady three-dimensional RANS equations15,17 or accounting only for the wall-normal diffusion reducing the computational requirements to the solution of a system of ordinary differential equations19,20.

The main impediments of the above-mentioned models are that they rely on RANS parametrization that requires the use of a priori empirical coefficients calibrated for a particular flow state (usually fully developed turbulence in equilibrium over a flat plate). Such wall models do not function as intended in real-world applications, where various flow states coexist (e.g., separated flows, flow over roughness, predicting transition, etc.)7. The use of RANS parametrization for wall modeling was challenged with a dynamic wall model that is free of a priori specified coefficients at a negligible additional cost21,22. The two approaches were formulated by requiring consistency between the filtered velocity field at the wall and a differential filter kernel.

Dynamic wall models provide encouraging results, but they also face significant challenges. They are robust for changes in Reynolds number and grid resolution but sensitive to numerical methods employed in the flow solver and the choice of the subgrid-scale (SGS) model. This is attributed to the dominance of numerical errors close to the wall that in turn affect the evaluation of the necessary wall-model constants23. Furthermore, the methodology has only been exploited specifically for structured, incompressible flow solvers, with limited applicability for compressible flows or complex geometries.

The essential requirements for a successful dynamic wall model are that it (i) accommodates diverse flow solvers and SGS models and (ii) generalizes beyond their calibration flow fields. Recent advances in machine learning and data science aim to address these issues and complement the existing turbulence modeling approaches. To date, most efforts have focused on the application of supervised learning to SGS modeling24,25,26,27,28,29,30 and wall modeling31,32,33. However, despite the demonstrated promise, these methods encounter difficulties in generalizing beyond the distributions of the training data. In supervised learning, the parameters of the neural network are commonly derived by minimizing the model prediction error, which is often based on single-step target values to limit computational challenge. Therefore, it is necessary to differentiate between a priori and a posteriori testing. The first measures the accuracy of the supervised learning model in predicting the target values on a database of reference simulations, typically obtained via DNS. A posteriori testing is performed after training, by integrating in time the Navier–Stokes equations along with trained supervised learning closure and comparing the obtained statistical quantities to that of DNS or other reference data. Due to the single-step cost function, the resultant neural network model is not trained to compensate for the systematic discrepancies between DNS and LES (or WMLES) and the compounding errors. The issue of ill-conditioning of data-driven SGS models has been exposed by studies that perform a posteriori testing27,34,35,36. Wall models are more sensitive than SGS models22, and we expect the compounding of errors to play a more detrimental role in WMLES.

Here, we propose SciMARL for the development of wall models in LES. Reinforcement learning (RL) identifies optimal strategies for agents that perform actions, contingent on their information about the environment, and measures their performance via scalar reward functions. In this work, the agents correspond with the computational elements and their actions compensate both for the closure terms and errors associated with the numerics of the flow solver. RL is a semi-supervised learning framework with foundations on dynamic programming37 and a broad range of applications ranging from robotics38,39, games40,41, and more recently flow control39,42,43,44,45. We note that SciMARL has been used in fluid mechanics only recently for the development of SGS models in LES of homogeneous turbulent flow46.

In the case of WMLES, the performance of the SciMARL can be measured by comparing the statistical properties of the simulation to those of reference data such as the wall-shear stress. SciMARL is a semi-supervised learning algorithm that requires information about the flow formulated in terms of a reward rather than detailed spatiotemporal data as in the case of supervised learning. In the case of wall modeling, SciMARL does not rely on a priori knowledge of the log-law coefficients but rather aims to discover active closure policies according to patterns in the flow physics captured by the filtered equations. The respective wall models are robust with respect to the numerical discretizations, as these errors are taken into consideration in the training process. Furthermore, the model discovery method can be readily extended to complex geometries and different flow configurations, such as flow over rough surfaces and stratified and compressible boundary layers.

Results

Multi-agent reinforcement learning for wall modeling

In RL, the agent interacts with its environment by sampling its states (s), performing actions (a), and receiving rewards (r). At each time step, the agent performs the action and the system is advanced in time before the agent can observe its new state, receive a scalar reward, and choose a new action. The agent infers a policy π(s, a) through its repeated interactions with the environment to maximize its long-term rewards. The optimal policy π*(s, a) is found by maximizing the expected utility, which is given by the expected cumulative reward. Throughout the paper, x, y, and z denote the streamwise, wall-normal, and spanwise directions, respectively. The corresponding velocity components are u, v, and w. RL agents are distributed evenly on each channel wall with each agent located at (x, z) receiving local states sn(x, z) and rewards rn(x, z) and providing local actions an(x, z) at each time step tn. A single policy is maintained and updated by the multiple agents present in the domain (Fig. 1).

Fig. 1: Diagram of the SciMARL setup.
figure 1

Agents are distributed evenly along the wall, with each agent obtaining state information wall-normal height hm away from the wall, computing the reward at the wall and supplying into the policy π to obtain actions a for the next time step.

In order for the RL to be universally applicable for a wide range of flow parameters, the states are nondimensionalized using viscosity ν and the modeled instantaneous friction velocity

$${u}_{\tau }^{m}(x,z,t)={\left(\frac{{\tau }_{w}^{m}(x,z,t)}{\rho }\right)}^{1/2},$$
(1)

where \({\tau }_{w}^{m}\) is the modeled wall-shear stress and ρ is the density. These quantities are only dependent on the output of the wall model and can be obtained without any prior knowledge of the flow. This non-dimensionalization is noted by the superscript * and is distinct from the one by the true friction velocity uτ, noted by the superscript + , which will be used for the assessment of model performance. The goal of the wall model is to predict the correct wall-shear stress τw, and thus the uτ, which will allow for good predictions of quantities such as the mean velocity profile and turbulence intensities47.

Velocity-based wall model

We first train the model to adapt to the variation of the velocity with the wall-normal height, which has a universal behavior in the log region. We set as states sn(x, z) the instantaneous velocity u*(x, hm, z, tn), the wall-normal derivative ∂u*/∂y*(x, hm,  z,  tn), and the wall-normal location \({y}^{* }={({h}^{m})}^{* }\) of the sampling point. Agents act to adjust the wall-shear stress through a multiplication factor an(x, z)  [0.9, 1.1] such that \({\tau }_{w}^{m}(x,z,{t}_{n+1})={a}_{n}(x,z){\tau }_{w}^{m}(x,z,{t}_{n})\). This choice does not require the model to produce the exact wall-shear stress (which is dependent on Reynolds number), but rather proposes an action that adjusts the wall-shear stress. The reward (see “Methods” for definition) is also incremental and proportional to the improvement in the prediction of the wall-shear stress compared to the one obtained in the previous time step. The agent behavior is rendered stable by providing additional reward if the predicted wall-shear stress is within 1% of the true value.

Log-law-based wall model

The second model is based on the existence of a logarithmic (log) layer in the near-wall region of turbulent flows, present in all flows with an inner–outer scale separation48. In the log layer, the velocity profile is expressed as:

$${u}^{+}=\frac{1}{\kappa }\log {y}^{+}+B,$$
(2)

where κ is the von Karman constant and B is the intercept constant. The exact value of κ and B depends on the flow configurations and wall roughness; however, for the current study, we use values attributed to a canonical smooth zero-pressure-gradient boundary layer. The states for the second model are the local instantaneous coefficients for the log law κm and Bm, computed from the instantaneous velocity, velocity gradient, and wall-normal location information. We emphasize that this model does not take as input the a priori known values of κ and B from the log law, but rather derived quantities from the instantaneous flow. This has an advantage over the first model in that the values do not depend on the value of y* and thus the model can learn the log-law behavior for y* outside the range of values it trained on. This allows the model to be extended to higher Reynolds numbers or coarser grids more readily. The actions and rewards are the same as the first model.

State-action map

We inquire into the learned models by examining the state-action map conditioned to positive rewards for the channel flow at friction Reynolds number Reτ = 2000, 4200, 8000. As seen in Fig. 2a, the velocity-based wall model (VWM) has distinguished states (y*, u*) with distinct actions corresponding to positive rewards. The model is able to up/down-shift the wall-shear stress based on whether the (y*, u*) pair is located above or below the log-law profile. The model initially does not have any prior knowledge of the log-law coefficients, yet it is able to learn to adjust the wall-shear stress through the RL process. However, because the model is trained on a limited range of \({({h}^{m})}^{+}\) in the training set, the extrapolation of this behavior to much larger values of \({({h}^{m})}^{+}\) may be challenging. This can be alleviated by refining the grid in the wall-normal direction with Ny~Re4,5,6.

Fig. 2: State-action map of VWM and LLWM.
figure 2

a Probability density function of states u* for given y* for VWM conditioned to events with r > 0.1 and a < 0.95 (blue) or a > 1.05 (red). Contour levels are 30, 50, 70% of maximum value. Line indicates the log law \({u}^{+}=1/\kappa \log {y}^{+}+B\) with κ = 0.41, B = 5.2. b Joint probability density functions of states 1/κm and Bm for LLWM conditioned to events with r > 0.1 and a < 0.95 (blue) or a > 1.05 (red). Contour levels are 30, 50, 70% of maximum value. Dashed lines indicate κ = 0.41, B = 5.2; the solid, dotted, and dot-dashed lines are \((1/{\kappa }^{m}-1/\kappa )\log ({y}^{+})+({B}^{m}-B)=0\) for \({({h}^{m})}^{+}=500\), 100 and 104, respectively.

The log-law-based wall model (LLWM) similarly has distinct states with different actions corresponding to positive rewards (see Fig. 2b). The main mechanism for controlling the wall-shear stress is similar to the VWM, with the wall-shear stress being up/down-shifted based on whether the point corresponding to the slope and intercept of 1/κm and Bm are under/over-predicting the log law. Depending on the wall-normal location of hm, the classification of whether the point is above or below the log law may vary, especially for points farther away from the origin. However, the majority of the states are centered around the true value of 1/κ and B, and the mechanism will work as expected.

Testing: turbulent channel flow

We examine the model predictions on turbulent channel flow for Reynolds numbers in the range from 5200 to 106 (Fig. 3). In the case of VWM, we expect that as long as \({({h}^{m})}^{+}\) is within the range observed during the training process (\(150 \; < \; {({h}^{m})}^{+} < \; 1200\)), the model will perform as expected. Cases at Reτ = 2 × 104 and 5 × 104 produce high errors as the \({({h}^{m})}^{+}\) is not within the trained range. Once the values of \({({h}^{m})}^{+}\) are adjusted to be within the range by refining the grid, the errors decrease significantly. This entails refining the grid for higher Reynolds numbers to allow the first grid point off the wall to be within the trained range of \({({h}^{m})}^{+}\). In the case of LLWM, we observe that the prediction error in the friction velocity is less than 4% while the mean velocity profiles are well-aligned with the log law regardless of the value of \({({h}^{m})}^{+}\). The error increases with Reynolds number, most likely due to the high variation of the streamwise wall-normal gradient with increasing Reynolds number as well as the departure of \({({h}^{m})}^{+}\) from the trained range of values. Still, the results are comparable to the results obtained from the widely used equilibrium wall model (EQWM) up to Reτ ≈ 105, which uses an empirical coefficient tuned for this particular flow configuration. This range of Reynolds numbers is sufficient for various external aerodynamic and geophysical flows. The predicted mean velocity profiles for both models are shown in Fig. 4.

Fig. 3: Errors in the friction velocities.
figure 3

a Error in time-averaged wall-shear stress obtained from the VWM (empty) and LLWM (filled) for various Reynolds numbers. Circles indicate the standard grid with Δy = 0.05 and triangles indicate refined cases. b Zoomed in the version of (a) for LLWM with error in EQMW (crosses) for three Reynolds numbers.

Fig. 4: Predicted mean velocity profiles for turbulent channel flow.
figure 4

Mean velocity profiles for the (a) VWM cases and (b) LLWM cases are shown in Fig. 3. Dashed line is \({u}^{+}=1/\kappa \log ({y}^{+})+B\) for κ = 0.41 and B = 5.2. The two largest Reynolds number cases for VWM are omitted as the velocity profiles are outside the plotted range.

Testing: spatially evolving turbulent boundary layer

The predictive performance of the LLWM is assessed in a zero-pressure-gradient flat-plate turbulent boundary layer. The simulation ranges from Reθ = 1000–7000, where Reθ is the Reynolds number based on the momentum thickness.

The modeled skin friction coefficient \({C}_{f}^{m}={\tau }_{w}^{m}/(\rho {U}_{\infty }^{2}/2)\) for the full simulation domain is comparable to the Cf from the empirical values49 (Fig. 5a). This shows that the model is capable of adapting to variations of wall-shear stress in the streamwise direction, even when it was only trained on a channel-flow simulation.

Fig. 5: Predicted friction coefficients for the turbulent boundary layer.
figure 5

Friction coefficient Cf as a function of Reθ. Symbols are LLWM and line is the empirical Cf49.

Distribution of wall-shear stress

A growing body of studies in wall-bounded turbulence has shown that the generation of wall-shear stress fluctuations is directly connected with outer-layer large-scale motions50,51. This observation supports the idea that the log-layer flow contains the information necessary to predict not only the mean wall-shear stress but also the fluctuations. However, in deterministic wall models such as the EQWM, the wall-shear stress is perfectly correlated with the velocity at the sampling location52,53, as opposed to a correlation coefficient of 0.3 observed in DNS50. This can be observed in Figs. 6a and  7, where the wall-shear stress predicted by the EQWM is perfectly correlated with the velocity fluctuations at the sampling location hwm. On the other hand, LLWM results in a smaller correlation between the velocity at an off-wall location and the wall-shear stress (Figs. 6b and  7) with a maximum correlation of ~0.3, which matches the expected correlation from DNS.

Fig. 6: Comparison of the instantaneous off-wall streamwise velocity and wall-shear stress.
figure 6

Instantaneous snapshots of xz plane of streamwise velocity fluctuation \({u}^{^{\prime} * }\) at hm (top) and \({\tau }_{w}^{m}/{\tau }_{w}\) (bottom) for (a) EQWM and (b) LLWM.

Fig. 7: Correlation of instantaneous off-wall streamwise velocity and wall-shear stress.
figure 7

Cross-correlation coefficient between the wall-shear stress \({\tau }_{w}^{m^{\prime} * }\) and streamwise velocity \({u}^{^{\prime} * }\) at sampling location hm = 0.1δ for LLWM (red solid line) and EQWM (black dashed line).

Potential of SciMARL wall models

We demonstrate that the SciMARL wall models perform as well as the RANS-based EQWM, which has been tuned for this particular flow configuration. The SciMARL wall model is able to achieve these results by training on moderate Reynolds number flows with a reward function only based on the mean wall-shear stress rather. Moreover, RL models are trained in-situ with WMLES and do not require any DNS simulation data. This is in contrast to supervised learning methods, where a vast amount of data need to be generated using high-fidelity DNS simulations to proceed with the learning process. For example, in the case of a moderate Reynolds number channel flow (Reτ = 4200), LLWM can be trained using O(103) CPU-hours with less than 1 GB of storage. For supervised learning, generating the DNS data will require O(107) CPU-Hours with more than 100 TB storage. DNS databases might be already available for canonical cases such as channel flow, but it would be more difficult to obtain for cases regarding wall roughness or adverse pressure gradients, where wall models will be more useful. The additional overhead for generating data for supervised learning makes it less practical for real-world applications of wall modeling.

The LLWM is easy to extend to complex geometries and flow simulations utilizing different numerical methods or SGS models, as it only takes as states the instantaneous streamwise (or wall-parallel) velocity, its wall-normal gradient, and the distance from the wall. These quantities do not depend heavily on numerics or SGS models, unlike filtered velocities or eddy viscosity values required in the dynamic model22. Thus, the model can be used in a wide range of simulations, much like the EQWM, but without prescribed tunable parameters. Furthermore, the RL framework can be extended to various flow configurations by adding an additional dimension to the state vector. Since all flow with an inner–outer scale separation exhibit a log law48 in the overlap region, the current configuration for wall-model development can be extended to flows exhibiting roughness, stratified flows, compressible flows, among many others. These flows usually have different log-law coefficients κ and B that are manually tuned for existing wall models. However, in this work, these values are adjusted automatically using a SciMARL-based model.” This gives the LLWM a distinct advantage over existing models. For example, in cases with varying pressure gradients over the simulation domain, traditional methods will have to assign different model parameters for each location containing different pressure gradients. In contrast, the SciMARL model can smoothly transition between various pressure-gradient effects with a single policy trained from various canonical cases when the parameters such as pressure and velocity gradients are included as a state. A similar argument can be applied to simulations with varying levels of stratification or compressible effects within a simulation domain. In addition, the evaluation of the LLWM involves evaluating the weights of the trained neural net, which is an order of magnitude faster than the EQWM that solves an ODE at each time step.

Discussion

We have introduced a potent method for the automated discovery of closures in simulations of wall-bounded turbulent flows that uses limited data by fusing scientific computing and multi-agent reinforcement learning (SciMARL). In this method, we solve the filtered Navier–Stokes equations using LES and develop a wall model as a control policy enacted by cooperating agents using the recovery of the correct mean wall-shear stress as a reward. SciMARL requires limited data in contrast to supervised learning methods. The training was performed using LES of a turbulent channel flow at moderate Reynolds numbers (Reτ = 2000, 4200, and 8000). Remarkably, the method generalizes on LES of a turbulent boundary layer and turbulent channel flow at extreme Reynolds numbers.

We examine the robustness of the method by studying two models (VWM and LLWM) with different state spaces. In the VWM, the state space comprises the streamwise velocity and its wall-normal derivatives. This model adjusts the wall-shear stress based on the discrepancy of the velocity profile from the log law. The model captures the mean velocity profile for a wide range of Reynolds numbers when the wall-normal location of the sampling point is within the training set. Alternatively in the LLWM, the state space is based on the instantaneous log-law coefficients. This model generalizes to a broader set of grid resolutions and Reynolds numbers than the VWM. Moreover, despite training in turbulent channel flows we find that the LLWM generalizes to spatially evolving turbulent boundary layer and it recovers the correct skin friction coefficient at a fraction of the cost of high-fidelity simulations.

We note that the LLWM produces correlations between the predicted wall-shear stress and the off-wall velocity profile that are similar to fully resolved flow. This is in contrast to the correlations obtained by the classical RANS models. This implies that the policy of the LLWM replicates the natural mechanisms of wall-shear stress control that can be obtained so far only through highly resolved simulations. Furthermore, as the model only requires instantaneous flow information at one off-wall location, it could be extended to more complex geometries and different numerical methods without additional modifications.

We anticipate that the model can be easily expanded for all wall-bounded flows that exhibit a log law through an inner–outer scale separation48. We envision that when SciMARL is trained over a wide range of flows, the model will also acquire experiences for the key flow patterns that are omnipresent in the fundamental physics of flows in complex configurations. This advance will present a paradigm shift in wall-model development for LES in the prediction and control of industrial aerodynamics and environmental flows.

Methods

Reinforcement learning

Learning is performed through the open-source RL library smarties54. The library leverages efficiently the computing resources by separating the task of updating the policy parameters from the task of collecting interaction data. The flow simulations are distributed across workers who collect, for each agent, experiences organized into episodes,

$${E}_{i}={\{{s}_{n}^{(i)},{r}_{n}^{(i)},{\mu }_{n}^{(i)},{\sigma }_{n}^{(i)},{a}_{n}^{(i)}\}}_{n = 0,...,N},$$

where n tracks in-episode RL steps, μ and σ are the statistics of the Gaussian policy used to sample a, and tN is the final time step for each episode. When a simulation concludes, the worker sends one episode per agent to the central learning process (master) and receives updated policy parameters. The master stores the episodes to a replay memory (RM), which is sampled to update the policy parameters according to Remember-and-Forget Experience Replay (ReF-ER)54. ReF-ER is combined with an off-policy actor-critic algorithm V-RACER which supports continuous state and action spaces.

V-RACER trains a neural network defined by weights w which, given input state s, outputs the mean μw(s) and standard deviation σw(s) of the policy πw and a state-value estimate vw(s). The statistics μw and σw are improved through the policy gradient estimator

$${g}_{\pi }({\mathtt{w}})=\; {\mathbb{E}}\bigg[\left.{g}_{\pi ,n}({\mathtt{w}})\equiv \frac{{\pi }^{{\mathtt{w}}}({a}_{n}| {s}_{n})}{{{{{{{{\mathcal{P}}}}}}}}({a}_{n}| {\mu }_{n},\,{\sigma }_{n})}({\hat{q}}_{n}-{v}^{{\mathtt{w}}}({s}_{n})){\nabla }_{{\mathtt{w}}}\log {\pi }^{{\mathtt{w}}}({a}_{n}| {s}_{n})\right|\\ \{{s}_{n},{r}_{n},{\mu }_{n},{\sigma }_{n},{a}_{n}\} \sim \,{{\mbox{RM}}}\bigg],$$
(3)

where \({{{{{{{\mathcal{P}}}}}}}}({a}_{n}| {\mu }_{n},{\sigma }_{n})\) is the probability of sampling an from \({{{{{{{\mathcal{N}}}}}}}}({\mu }_{n},{\sigma }_{n})\), and \({\hat{q}}_{n}\) is an estimator of the action value which is computed recursively from a Retrace algorithm55 as

$${\hat{q}}_{n}={r}_{n+1}+\gamma {v}^{{\mathtt{w}}}({s}_{n+1})+\gamma \min \left\{1,\frac{{\pi }^{{\mathtt{w}}}({a}_{n}| {s}_{n})}{{{{{{{{\mathcal{P}}}}}}}}({a}_{n}| {\mu }_{n},{\sigma }_{n})}\right\}({\hat{q}}_{n+1}-{v}^{{\mathtt{w}}}({s}_{n+1})),$$
(4)

where γ = 0.995 is the discount factor for rewards into the future. Retrace is also used to derive the gradients for the state-value estimate

$${g}_{v}({\mathtt{w}})= {\mathbb{E}} \bigg[\left.{g}_{v,n}({\mathtt{w}})\equiv \min \left\{1,\frac{{\pi }_{{\mathtt{w}}}({a}_{n}| {s}_{n})}{{{{{{{{\mathcal{P}}}}}}}}({a}_{n}| {\mu }_{n},\,{\sigma }_{n})}\right\}\left({\hat{q}}_{n}-{v}^{{\mathtt{w}}}({s}_{n})\right)\right|\\ \{{s}_{n},{r}_{n},{\mu }_{n},{\sigma }_{n},{a}_{n}\} \sim \,{{\mbox{RM}}}\,\bigg].$$
(5)

The expectations in Eqs. (3) and (5) are approximated by Monte Carlo sampling B observations from RM.

Due to the use of experience replay, V-RACER and similar algorithms become unstable if the policy diverges from the distribution of experiences in the RM. We circumvent this issue by using an importance weight ρt to classify whether an experience is “near-policy" or “far-policy" and clip the gradients computed from far-policy samples to zero54. In ReF-ER, the gradients are computed as

$${\hat{g}}_{n}({\mathtt{w}})=\left\{\begin{array}{ll}\beta {g}_{n}({\mathtt{w}})-(1-\beta ){g}_{n}^{D}({\mathtt{w}}),\quad &\,{{\mbox{if}}}\,\,1/C \; < \; {\rho }_{t} \; < \; C\\ -(1-\beta ){g}_{n}^{D}({\mathtt{w}}),\quad &\,{{\mbox{otherwise}}}\,,\end{array}\right.$$
(6)

where \({\rho }_{t}={\pi }_{{\mathtt{w}}}({a}_{t}| {s}_{t})/{{{{{{{\mathcal{P}}}}}}}}({a}_{t}| {\mu }_{t},{\sigma }_{t})\). Here, \({g}^{D}={\nabla }_{{\mathtt{w}}}{D}_{KL}({\pi }_{{\mathtt{w}}}(\cdot | {s}_{t}))\parallel {{{{{{{\mathcal{P}}}}}}}}(\cdot | {\mu }_{t},{\sigma }_{t})\), where DKL(PQ) is the Kullback–Leibler divergence measure between distributions P and Q. The coefficient β is iteratively updated to keep a constant fraction of samples in the RM within the trust region by

$$\beta \leftarrow \left\{\begin{array}{ll}(1-\eta )\beta ,\quad &\,{{\mbox{if}}}\,\,{r}_{RM} \; > \; D,\\ \beta +(1-\eta )\beta ,\quad &\,{{\mbox{otherwise}}}\,,\end{array}\right.$$
(7)

where rRM is the fraction of the RM with importance weights outside the trust region [1/C, C] and D is a parameter.

The most notable hyperparameters used in our description of the MARL setup are the spatial resolution for the interpolation of the actions onto the grid (determined by \({{{\Delta }}}_{x}^{m}/{{{\Delta }}}_{x}\), and \({{{\Delta }}}_{z}^{m}/{{{\Delta }}}_{z}\)). The default values \({{{\Delta }}}_{x}^{m}/{{{\Delta }}}_{x}\), and \({{{\Delta }}}_{z}^{m}/{{{\Delta }}}_{z}\) reduce the number of experiences generated per simulation to O(105). This value is similar to the number of experiences generated per simulation used for SciMARL of SGS model development46. Consistent with previous studies, we found that further reducing the number of agents per simulation reduced the model’s adaptability and therefore exhibit slightly lower performance. Because we use conventional reinforcement learning update rules in a multi-agent setting, single parameter updates are imprecise. We found that ReF-ER with hyperparameters C = 1.5 and D = 0.05 (Eqs. (6) and (7)) stabilizes training. We ran multiple training runs per reward function and whenever we vary the hyperparameters, but we observe consistent training progress regardless of the initial random seed.

Further implementation details of the algorithm can be found in Novati et al.54.

Overview of the training setup

The models are trained on turbulent channel flow simulations of Reτ = uτδ/ν ≈ 2000, 4200, and 8000, where δ is the channel half-height with grid resolution Δx,y,z 0.05δ. Each WMLES is initialized for uniformly sampled Reτ {2000, 4200, 8000} with the initial velocity field for the training obtained by superposing white noise sampled from \({{{{{{{\mathcal{N}}}}}}}}(0,0.5{u}_{\tau })\) to a previously obtained WMLES flow field at the given Reτ that is run for a short period of time to remove numerical artifacts. The initial wall-shear stress is set to overestimate or underestimate the correct wall-shear stress within ± 20%. At each time step of the WMLES, the location hm is randomly selected between 0.075δ and 0.15δ to train over a smooth range of \({({h}^{m})}^{+}\) within the log layer. The velocity and its wall-normal gradient are then interpolated to the chosen wall-normal location hm to form the state vector. The agents are located with spacings \({{{\Delta }}}_{x}^{m}=4{{{\Delta }}}_{x}\) and \({{{\Delta }}}_{z}^{m}=4{{{\Delta }}}_{z}\). Each iteration of the learning algorithm runs the simulation for 2δ/uτ, updating the model at every time step.

The policy is parameterized by a neural network with two hidden layers of 128 units each, with soft sign activations and skip connections. The neural network is initialized with small outer weights and bias shifted such that the initial policy is approximately \({{{{{{{\mathcal{N}}}}}}}}(1,1{0}^{-4})\)56. Gradients are computed with Monte Carlo estimates with sample size B = 512 from an RM of size 106. The parameters are updated with the Adam algorithm57 with learning rate η = 10−5. ReF-ER hyperparameters of C = 1.5 and D = 0.05 are used to stabilize training. Each training run is advanced for 107 policy gradient steps.

For both VWM and LLWM, the action is given by a multiplication factor an(x, z)  [0.9, 1.1] such that \({\tau }_{w}^{m}(x,z,{t}_{n+1})={a}_{n}(x,z){\tau }_{w}^{m}(x,z,{t}_{n})\). The reward is given by

$${r}_{n}(x,z)=\, \left(\frac{| {\tau }_{w}-{\tau }_{w}^{m}(x,z,{t}_{n})| -| {\tau }_{w}-{\tau }_{w}^{m}(x,z,{t}_{n-1})| }{{\tau }_{w}}\right)\\ +{\mathbb{1}}\left(\frac{| {\tau }_{w}-{\tau }_{w}^{m}(x,z,{t}_{n})| }{{\tau }_{w}} < 0.01\right),$$
(8)

where \({\mathbb{1}}\) is an indicator function and τw is the true mean wall-shear stress. This gives a reward that is proportional to the improvement in the prediction of the wall-shear stress compared to the one obtained in the previous time step with an additional reward if the predicted wall-shear stress is within 1% of the true value. The states of the VWM are the instantaneous velocity u*(x, hm, z, tn), the wall-normal derivative ∂u*/∂y*(x, hm, z, tn), and the wall-normal location \({y}^{* }={({h}^{m})}^{* }\) of the sampling point. The states of the LLWM are

$$\frac{1}{{\kappa }^{m}}(x,z,{t}_{n}) =\left(\frac{\partial {u}^{* }}{\partial {y}^{* }}{y}^{* }\right)(x,{h}^{m},z,{t}_{n}),\,\,{{\mbox{and}}}\,{B}^{m}(x,z,{t}_{n})\\ ={u}^{* }(x,{h}^{m},z,{t}_{n})-\frac{1}{{\kappa }^{m}}(x,z,{t}_{n})\log {\left({h}^{m}\right)}^{* }.$$

Details of the flow simulation

We solve the filtered incompressible Navier–Stokes equations in a channel using LES with a staggered second-order finite difference in space58 with a fractional-step method59 and a third-order Runge–Kutta time-advancing scheme60. The SGS model is given by the anisotropic minimum dissipation (AMD) model61, which is known to perform well in highly anisotropic grids62.

For the channel flow, periodic boundary conditions are imposed in the streamwise and spanwise directions, and the no-slip and no-penetration boundary conditions at the top and bottom walls. The modeled wall stress \({\tau }_{w}^{m}\) is applied to the LES domain through the eddy viscosity at the wall63,

$${\left.{\nu }_{t}\right|}_{w}={\left.\left(\frac{\partial u}{\partial y}\right)\right|}_{w}^{-1}\frac{{\tau }_{w}^{m}}{\rho }-\nu ,$$
(9)

where νt is the eddy viscosity and the subscript w indicates values evaluated at the wall. This boundary condition, compared to the more widely used Neumann boundary condition, is better at resolving the so-called log-layer mismatch for WMLES63. The channel is driven by a constant pressure gradient for the testing cases. For training, the channel is driven by a constant mass flow rate computed from the mean velocity profile of channel flow. The domain size is given by Lx = 2πδ, Ly = 2δ, and Lz = πδ, where δ is the channel half-height.

For the spatially evolving boundary layer, periodic boundary conditions are imposed in the spanwise direction. No-slip and no-penetration boundary condition with viscosity augmentation (Eq. (9)) is used at the wall. In the top plane, we impose u = U (free-stream velocity), w = 0, and v estimated from the known experimental growth of the displacement thickness for the corresponding range of Reynolds numbers49. This controls the average streamwise pressure gradient, whose nominal value is set to zero. The turbulent inflow is generated by the recycling scheme64, in which the velocities from a reference downstream plane, xref, are used to synthesize the incoming turbulence. The reference plane is located well beyond the end of the inflow region to avoid spurious feedback65,66. A convective boundary condition is applied at the outlet with convective velocity U67 with a small correction to enforce global mass conservation66. The spanwise direction is periodic.

The code has been validated in previous studies in turbulent channel flows22,68,69,70 and flat-plate boundary layers22,71.

Testing: channel flow

The model predictions of VWM and LLWM are tested on turbulent channel flow for Reynolds numbers in the range from 5200 to 106 (see Table 1) and for a time span of 300δ/uτ significantly longer than the training period 2δ/uτ. While only results using Δx ≈ Δz ≈ 0.05δ are reported here, using different grid resolutions representative of WMLES also produce similar results.

Table 1 List of channel flow test cases and corresponding Reynolds number, wall-normal grid resolution, and matching location hm.

Note that for LLWM, one of the states, 1/κm = (∂u*/∂y*)y*, depends on the choice of y with respect to the discrete points of the simulation. For example, if y is located at the midpoint of two computational grid points, a central finite difference can be used to compute the wall-normal derivative ∂u*/∂y*. On the other hand, if y is located on the computational grid point, either a left- or right-finite difference is used. In this case, we chose y values that are midpoints of the two computational grid points. Changing the location of y had minor effects on the results, with the wall-shear stress changing ~5% when the location of y was chosen to be on the computational grid point.

Testing: spatially evolving turbulent boundary layer

The predictive performance of LLWM is assessed in a zero-pressure-gradient flat-plate turbulent boundary layer with Reθ ranging from 1000 to 7000. This range was chosen so that the results can be compared against relevant DNS72. The recycling plane for the inlet boundary condition is set to xref/θ0 = 890, where θ0 is the momentum thickness at the inlet. The length, height and width of the simulated box are Lx = 3570θ0, Ly = 100θ0, and Lz = 200θ0. The streamwise and spanwise resolutions are Δx/δ = 0.06 (\({{{\Delta }}}_{x}^{+}=128\)) and Δz/δ = 0.05 (\({{{\Delta }}}_{z}^{+}=105\)) at Reθ = 6500. The grid is uniform in the wall-normal direction with Δy/δ = 0.03 (\({{{\Delta }}}_{y}^{+}=64\)) at Reθ = 6500. The number of wall-normal grid points per boundary layer thickness is chosen to be ~10 at the inlet, which is in line with the channel flow simulations. The sampling point hm was chosen to be at the third grid point off the wall in the wall-normal direction16, which places the point in the log region for most of the domain. All computations were run for 50 washout times after transients.