Machine learning force fields for molecular liquids: Ethylene Carbonate/Ethyl Methyl Carbonate binary solvent

Magdău, Ioan-Bogdan; Arismendi-Arrieta, Daniel J.; Smith, Holly E.; Grey, Clare P.; Hermansson, Kersti; Csányi, Gábor

doi:10.1038/s41524-023-01100-w

Download PDF

Article
Open access
Published: 17 August 2023

Machine learning force fields for molecular liquids: Ethylene Carbonate/Ethyl Methyl Carbonate binary solvent

npj Computational Materials volume 9, Article number: 146 (2023) Cite this article

3997 Accesses
3 Citations
4 Altmetric
Metrics details

Subjects

Atomistic models

Abstract

Highly accurate ab initio molecular dynamics (MD) methods are the gold standard for studying molecular mechanisms in the condensed phase, however, they are too expensive to capture many key properties that converge slowly with respect to simulation length and time scales. Machine learning (ML) approaches which reach the accuracy of ab initio simulation, and which are, at the same time, sufficiently affordable hold the key to bridging this gap. In this work we present a robust ML potential for the EC:EMC binary solvent, a key component of liquid electrolytes in rechargeable Li-ion batteries. We identify the necessary ingredients needed to successfully model this liquid mixture of organic molecules. In particular, we address the challenge posed by the separation of scale between intra- and inter-molecular interactions, which is a general issue in all condensed phase molecular systems.

Enhancing ReaxFF for molecular dynamics simulations of lithium-ion batteries: an interactive reparameterization protocol

Article Open access 10 January 2024

Quantum chemical calculations of lithium-ion battery electrolyte and interphase species

Article Open access 05 August 2021

Fast evaluation technique for the shear viscosity and ionic conductivity of electrolyte solutions

Article Open access 04 May 2022

Introduction

The development of ML methods for molecular modelling has opened the possibility to simulate large length scales and long time scales with ab initio-level accuracy^1,2,3. ML methods have been employed successfully to model isolated molecules^4,5,6,7,8, where the directional intra-molecular interactions dominate, as well as inorganic solids and liquids, where the interactions are homogeneous^{9,10,11,12,13,14,15,16,17,18,19,20,21,22}. The molecular condensed phase presents unique challenges owing to a large separation of scales between intra- and inter-molecular interactions^{23,24,25,26,27,28}. Molecular mixtures, in particular, present further complications owing to the heterogeneity of the inter-molecular environments.

Many important properties of the molecular liquids such as: density, viscosity and dielectric constant; depend specifically, and critically on inter-molecular forces. It is therefore crucial that ML models obtain good accuracies on both intra- and inter-scales. Numerous studies have demonstrated ML potentials for water and aqueous solutions^{29,30,31,32,33}, methane²⁶, ionic liquids²⁸ and, more recently, electrolyte solutions^34,35,36. The standard approach to modelling these molecular systems is to create separate force fields for the intra- and inter- contributions^2,37,38,39. This approach is inspired by classical force fields and neatly solves the problem of scale separation. However the approximation breaks down in reactive systems where the molecular identity changes during the course of the simulation.

In this work, we intentionally choose to fit ML models on total interactions, without an explicit separation of scales, so our potential are entirely general and could, in principle, capture reactivity^40,41. We aim to answer the overarching question: can we develop an accurate and robust ML force field for molecular liquids mixtures without making the scale separation explicit? In particular, can these models generate long and stable molecular dynamics trajectories and reproduce the thermodynamic properties of the reference method? As shown in other studies, even reproducing the density of a molecular system is not a trivial task⁴² and merits a meticulous investigation. Here we show that a good fit on the total loss function results in good intra-, but poor inter- relative accuracies. In practice, this makes it difficult to fit the inter- contribution which underlies the thermodynamic properties of the liquid state.

Specifically, this paper develops an ML force field for the binary solvent EC:EMC (3:7 M) of the standard LP57 electrolyte (1M LiPF₆ in EC:EMC (3:7 M) < 10 ppm H₂O, BASF). This formulation comprises a mixture of cyclic and linear carbonates because of their complementary properties: cyclic carbonates have large dipole moments and can efficiently dissociate the salt, while linear carbonates have higher mobility and improve ion diffusion. The solvation structure and ion transport properties of mixed liquid carbonate electrolytes have been extensively explored using classical dynamics in conjunction with quantum calculations^43,44,45, polarizable force fields^46,47,48, reactive force fields⁴⁹, short ab initio dynamics^47,50 and more recently neural network potentials³⁶. Lower-accuracy methods mischaracterize the ion solvation structure, while higher-accuracy methods are too expensive to capture the slow dynamics and, ultimately, the ion transport. The recent work in ref. ³⁶ demonstrates an ML potential which captures long time scale ion transport with high accuracy, reproducing experimental diffusivities. In their model, however, the inter-molecular interaction is largely captured by a Coulomb interaction based on learned atomic charges and a purely geometry dependent empirical dispersion term.

Recent efforts in ML modelling have developed strategies to deal with the long-range electrostatics^{51,52,53,54,55,56,57} which is important at interfaces³², in molecular clusters and in the vapor phase⁵⁸. Here, we find that short-range models are sufficiently accurate to describe our neutral solvent liquid. This result is in line with local molecular-field theory (LMFT)⁵⁹ which states that homogeneous bulk systems can be modelled with sort-range potentials. The absence of an analytic baseline (e.g. Coulomb, Lennard-Jones) makes our ML modelling more challenging because the force fields must correctly describe the subtle inter-molecular interactions to reproduce the underlying dynamics. However the models we obtain remain general and versatile. We show that by crafting a diverse training set through iterative training and carefully testing intra- and inter-molecular performance, we can fit a general potential that describes the EC:EMC binary solvent liquid for a wide range of compositions. This paves the way for a more general fully reactive force field in the future. More importantly, the conclusions we draw here, likely generalize to other molecular systems as well.

Results

We present our results for fitting a Gaussian Approximation Potential (GAP)⁶⁰ with Smooth Overlap of Atomic Positions (SOAP) descriptors⁶¹ for the EC:EMC molecular solvent. Our main goal is to generate stable dynamics in the NPT ensemble and correctly predict the density of the reference method. For a finite-size system, constrained volume simulation ensembles (NVE, NVT) can mask the poor performance of ML potentials in describing the weak inter-molecular interactions. In the NPT ensemble the density becomes an observable that is highly sensitive to small errors in the inter-molecular interactions.

In the section 'OPLS training sets', we demonstrate that fitting ML models on fixed training-sets results in unstable potentials. In the section 'Iterative training', we show that by iteratively updating the training data set, we are able to continuously improve our GAP models and obtain stable GAP-MD trajectories on the PBE-D2 potential energy surface (PES). We explore different XC functionals (Solvent dimer interactions), retrain our converged GAP model with PBE-D3 data (Comparison to Ab initio MD) and compare the PBE-D3/GAP-MD density prediction to experimental results (Comparison to Experiment). In the sections that follow (Intra-/Inter- decomposition–Model optimization), we present one-by-one the key challenges for fitting ML potentials in multi-component molecular systems, as well as the solutions and tests we developed to address them: the intra-/inter- error decomposition, the rigid-molecular volume-scans, the transferability across molecular compositions and the final model optimizations.

Quantum-mechanical calculations are central to this paper and are used in several different contexts: (i) to provide the energies, forces and virials that build up the training set, (ii) for the testing of XC functionals using molecular complexes, and (iii) to provide ab initio molecular dynamics (AIMD) simulation data as a reference for comparison with the ML-derived results.

OPLS training sets

Obtaining an ML potential without iterative training would be ideal in many applications, but does it work? We test this approach by fitting a series of ML models on a fixed data set without using iterative training. We employed the classical OPLS force field⁶² to sample diverse molecular configurations for EC:EMC (3:7 M) at a wide range of densities (0.4–1.3 g cm⁻³) and temperatures (300–1200 K). At all densities we monitor the mean square displacement (MSD) and choose temperatures at which all ns-long OPLS trajectories remain fluid. The high entropy of the liquid allows the system to explore a diverse distribution of molecular environments. We construct training sets of different sizes (ranging from 200 to 1000 configurations) by selecting uncorrelated samples at intervals of 10 to 50 ps, and recompute their energy, forces and virials at the PBE-D2 level of theory.

We train a range of different GAP models extensively exploring the hyper-parameter space, and obtain force root mean square errors (RMSEs) comparable to the level of force errors inherent in the approximation due to the short range cutoff (up to about 6 Å), as revealed by the “locality test” introduced in ref. ⁹ (see the SI). However, we find that all these models invariably fail when performing NPT dynamics. Figure 1 shows four such GAP models which all behave similarly: shortly after the start of the dynamics, spontaneous bubbles appear and grow within the liquid, which causes the unphysical collapse of the density. This instability reveals a poor description of the inter-molecular degrees of freedom (DoFs).

Interestingly, all models yield stable dynamics in the NVT and NVE ensembles, an example of which is shown in Fig. 1. Molecules remain intact during the simulation which indicates all models capture the intra-molecular DoFs reasonably well. Due to equipartition, most thermal energy is stored in the more numerous intra- modes, which explains why potential and kinetic energies remain stable despite the poor description of the inter- modes. Since inter- modes are soft, they are unlikely to cause integration problems at the 1 fs time step, therefore the dynamics can appear stable in the fixed-volume ensembles. The density observable in the NPT ensemble is much more sensitive to the poor description of the inter-molecular interactions. In the thermodynamic limit, all ensembles are equivalent. Therefore, bubble instability will eventually form in the NVE/NVT simulations as well, but might require significantly larger system sizes.

Iterative training

Iterative training solves the inherent self-consistency problem of ML fitting: good potentials require not only a training set that is representative of the probability distribution of the DFT PES, but the potential needs to be reasonably accurate further away too (for configuration never encountered by a simulation on the true PES) otherwise it will not generate the appropriate distribution when used as a sampler.

We use an iterative training protocol that simultaneously improves our potentials and the diversity of the training data sets generated by the potentials. At each iteration, we expand the training set and we fit a new model. Training is performed on PBE-D2 data and all GAP models employ double SOAP (DS) or double Turbo SOAP (DTS) descriptors⁶³ (details in the Methods section), however only DS models are used to generate new samples in the iterative protocol. Unphysical density fluctuations observed in NPT dynamics are used as a proxy for collecting new configurations and extending the training set (details in the methods section and the SI). As part of the protocol, we address key challenges specific to mixed molecular liquids underpinned by the imbalance between intra- and inter-molecular interactions. In particular, we add multiple molecular compositions, isolated molecules, as well as rigid-molecule volume scans to the training data. In addition, we analyse the errors split into intra- and inter- errors and use these metrics to optimize the SOAP descriptors. We test the potentials using 48-molecule NPT GAP-MD and monitor the density fluctuations at different temperatures and molecular compositions. We also check reproducibility by running independent trajectories at the same temperature and composition, but starting from different configurations. Based on OPLS dynamics and experimental results, we expect the density should monotonically decrease with composition: highest for pure EC to lowest for pure EMC.

Figure 2 shows the systematic improvement of our GAP models across multiple generations of iterative training. After a few iterations, at Gen5/DS, the densities of the target molecular composition, EC:EMC (3:7 M), remain relatively stable over 100 ps of trajectory, even at temperatures as high as 500 K. This model is markedly better than those trained on OPLS data alone and does not form bubbles. Interestingly, the potential is not transferable to different molecular compositions away from the 3:7 M target. At the same time, on the 100 ps time-scale, the model performs poorly with respect to the reproducibility test. These issues may be related to one another: the quality of the energy prediction for various local environments differs substantially based on molecular composition. Since these local environments persist over long time-scales, simulations starting from different initial conditions are dominated by different systematic errors and yield different densities.

**Fig. 2: Iterative training and GAP-MD density.**

After Gen5/DS, we extend the training set with configurations across all molecular compositions, ranging from pure EC to pure EMC and by Gen10/DS we find stable dynamics at all compositions. However, the density ordering with respect to composition remains wrong. In particular, the equilibrium density of pure EC is too low compared to other compositions. On the target composition, we find the model performs better on the reproducibility test, however, the densities at all temperatures are higher than final values obtained with Gen16/DTS.

Starting with Gen11/DS, we add isolated molecules and rigid-molecule volume scans to the training set (details in the sections below), which significantly improves the description of the inter-molecular interactions. The resulting model at Gen13/DS generates more physical dynamics, it solves the density ordering problem and brings the densities closer to final converged values. However, unphysical density fluctuations persist at higher temperatures (500 K), owing to a still insufficiently accurate description of the inter-molecular interactions. We refine our model hyper-parameters and replace the SOAP descriptors with Turbo SOAP, which uses a better radial basis⁶³. With these changes, at Gen16/DTS we achieve converged NPT dynamics across all compositions and temperatures, demonstrating a robust GAP model for the EC:EMC liquid solvent.

Solvent dimer interactions

The iterative training employs PBE-D2 to construct the final training set. In this section we explore the accuracy of other XC functionals and dispersion correction schemes with respect to gold-standard PNO-LCCSD(T)-F12/AVTZ-F12 coupled-cluster theory by comparing the interaction energy of different solvent dimers as shown in Fig. 3. Coupled-cluster calculations show that (EC)₂ is most binding (−0.45 eV), (EMC)₂ is least binding (−0.23 eV), while the mixed (EC)₁(EMC)₁ lies in between (−0.31 eV). All XC functionals tested here follow the same trend as the coupled-cluster reference. Without dispersion correction, all XC functionals underestimate the interaction energies which is consistent with the consensus that GGA and Hybrid DFT functionals require such corrections to describe the weak dispersion interactions present in our system. After including dispersion, we observe a considerable improvement for (EC)₂ and (EC)₁(EMC)₁. In the case of (EMC)₂, the interaction energy is consistently overestimated due to the non-polar character of the EMC molecule.

**Fig. 3: Ab initio dimer interaction energies.**

In choosing an XC functional to model our system we also consider computational cost. In the case of organic electrolytes, only a few AIMD simulations exist because of their intrinsic structural complexity⁶⁴. Very often, such simulations are carried out with GGA functionals due to the large number of atoms required to mitigate finite-size effects. From our dimer scans, we observe that among the selection of functionals in Fig. 3, PBE-D3 (GGA) and PBE0-D3 (hybrid-GGA) exhibit similar accuracy for describing the interaction energies of the solvent components. We chose the less expensive PBE-D3 to validate the training strategy developed in this study.

Comparison to Ab initio MD

We test our best GAP models against ab initio MD (AIMD) results for the target composition (EC:EMC (3:7 M)) at 500 K and show that both potentials perform well. In particular, double Turbo SOAP (Gen16/DTS) gives the best agreement with the DFT density and Equation of States (EoS). All AIMD simulations were carried out at the PBE-D3 level. The original PBE-D2 training set was recalculated with PBE-D3 and both Gen16/DS and Gen16/DTS models were retrained on the new data. Notably, we find good model transferability between the different dispersion schemes (D2 and D3) and no additional iterative training was required.

In general, converging NPT-AIMD simulations with small boxes sizes remains a challenging due to large volume fluctuations, e.g., leading to requiring very dense k-space sampling and large plane wave cutoffs⁶⁵. We circumvent this issue by running multiple NVT simulations where we monitor the pressure expectation value and interpolate it to find the equilibrium density. Each trajectory contains 48-molecules and spans 20 ps. Each AIMD simulation took 0.5 M core-hours, while the equivalent GAP-MD simulation took 250 core-hours, which corresponds to a speed-up of 2000 × .

Figure 4A–C compares a selection of radial distribution functions (RDFs) for the three methods (see SI for details). We compute partial RDFs, separated into intra- and inter- contributions, for each atomic element pair (ab):

$$\begin{array}{ll}{g}_{{{{\rm{intra}}}}}^{ab}(r)\,=\,\frac{V}{4\pi N{r}^{2}}\left\langle \mathop{\sum }\limits_{i}^{{N}_{a}}\mathop{\sum }\limits_{j}^{{N}_{b}}{h}_{ij}\delta (r-{r}_{ij})\right\rangle \\ {g}_{{{{\rm{inter}}}}}^{ab}(r)\,=\,\frac{V}{4\pi N{r}^{2}}\left\langle \mathop{\sum }\limits_{i}^{{N}_{a}}\mathop{\sum }\limits_{j}^{{N}_{b}}(1-{h}_{ij})\delta (r-{r}_{ij})\right\rangle \\ \end{array}$$

(1)

where

$${h}_{ij}=\left\{\begin{array}{l}1{\quad}i,j\,{{\mbox{same molecule}}}\,\\ 0{\quad}i,j\,{{\mbox{different molecules}}}\,\end{array}\right.$$

(2)

such that total RDF is simply:

$$g(r)=\mathop{\sum}\limits_{a,b}{g}_{{{{\rm{intra}}}}}^{ab}(r)+{g}_{{{{\rm{inter}}}}}^{ab}(r)$$

(3)

**Fig. 4: Comparison of GAP-MD to AIMD for EC:EMC (3:7 M), at 500 K.**

Both GAP models agree well with the AIMD results. At short range, the intra-RDF curves are indistinguishable from each-other, indicating the intra-molecular geometries are well reproduced. We report the longer-range intra-RDF (4–8 Å interval) separately to demonstrate that agreement is excellent even for this subtle structure of the RDF. This portion of the signal characterizes the EMC molecule which is longer than EC, and is sensitive to the populations of the EMC conformers. The curves are well reproduced by both surrogate models even at distances larger than R_cut = 6Å, indicating the correct population of EMC isomers is achieved. Finally, the inter-RDFs are in good agreement as well, demonstrating that both models correctly capture the more subtle inter-molecular interactions.

We also compare the self-averaged velocity autocorrelation functions (VACFs) computed as:

$$z(t)={\left\langle \mathop{\sum}\limits_{i}{{{{\bf{v}}}}}_{i}(\tau )\cdot {{{{\bf{v}}}}}_{i}(\tau +t)\right\rangle }_{\tau }$$

(4)

and the vibrational density of states (VDoS):

$$\hat{z}(\omega )={\rm{Re}} \left\{{{{\rm{FT}}}}\left[z(t)-\bar{z}\right]\right\}$$

(5)

where z(t) was sampled every 4 fs, computed up to 1 ps and zero-padded up to 4 ps for smoothing (ω_Nyquist = 4170 cm⁻¹, δω = 33 cm⁻¹, δω_interp = 8 cm⁻¹). Both models reproduce the AIMD vibrational dynamics at high- and low-frequency indicating that intra- and inter-molecular modes are both well described. Gen16/DTS agrees slightly better with AIMD for the high-frequency CH vibration (Fig. 4F).

Figure 4 also shows the pressure distribution obtained with NVT dynamics at three different densities. We compute the expectation value for the pressure and its standard error by averaging over 5 ps trajectory windows. We interpolate these values using the power law P(ρ) = aρⁿ + b, inspired by the Murnaghan-Tait model EoS and mark the equilibrium values of the density at zero pressure. Gen16/DTS density at normal conditions is in good agreement with the AIMD density. Moreover, the pressure distributions at all densities overlap with the AIMD result (Fig. 4H), and so does the fitted EoS. The Gen16/DS model predicts a higher density at normal conditions, but agrees with AIMD within the error tolerance. The agreement with AIMD for the overall EoS is poorer, therefore, we conclude Gen16/DTS is a better model.

Finally, panels I and J of Fig. 4 show the potential energy distributions obtained from the MD simulation as well as the specific heat at constant volume ${C}_{{{{\rm{V}}}}}={{{\rm{Var}}}}\left[E\right]{({N}_{{{{\rm{mol}}}}}{k}_{{{{\rm{B}}}}}T)}^{-1}$³⁶. Both GAP models predict C_V values in agreement with AIMD. Interestingly, energy distributions obtained through GAP-MD are systematically shifted by 2–4 meV atom⁻¹ to lower energies with respect to AIMD. Note that errors in energy per atom are constant with systems size, while fluctuations in this quantity decrease as $E/\sqrt{N}$. For large enough system sizes (here 640 atoms) this error appears large compared to the size of the fluctuations. Moreover, the shift is systematically towards lower energies in GAP because the dynamics biases the energy distribution towards lower energy values. Importantly, this constant offset in energy does not affect the physics.

Comparison to experiment

Figure 5 shows 1000-atom GAP-MD simulations obtained with Gen16/DTS fitted using the PBE-D3 XC functional. Trajectories are stable for one nanosecond at all four compositions. Density distributions are Gaussian and we observe no unphysical density fluctuations.

**Fig. 5: Comparison of PBE-D3 to experiment.**

Given the convergence results in this work, the density predictions of the ML potential represent the results corresponding to DFT. Thus we can state that PBE-D3 densities for EMC-rich simulations are in good agreement with the experiment, whereas the densities of EC-rich compositions are under-estimated. We also note that in our simulations pure EC remains liquid at 300 K, whereas experimentally EC is found to be solid below the 310 K melting temperature. The slope of density with respect to temperature which determines the thermal expansion coefficient is well approximated by PBE-D3 for all compositions.

Intra-/Inter- decomposition

Intra-molecular forces and virials are an order of magnitude larger than inter-molecular contributions and they dominate the ML loss function. At the same time, the thermodynamic and kinetic observables of the EC:EMC liquid such as density and self-diffusivity, are primarily determined by the weaker inter-molecular contributions. It is therefore instructive to test the performance of our GAP models separately on the intra- and inter-components. In one approach, we start from liquid configurations where every atomic coordinate can be expressed as:

$${{{{\bf{r}}}}}_{i}={{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i}$$

(6)

and where τ_j and Θ_j are the translation and rotation of molecule j, and ξ_i represent the internal coordinates of atom i in the reference frame of molecule j. We deconstruct the simulation box into individual molecules. For each molecule we keep the original geometry ξ_i and recompute all of its properties in isolation, obtaining the purely intra-molecular contributions for the whole configuration as:

$$\begin{array}{l}{E}^{{{{\rm{intra}}}}}({{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i})\,=\,\mathop{\sum}\limits_{j}{E}_{j}^{{{{\rm{intra}}}}}({{{{\boldsymbol{\xi }}}}}_{i})\\ {F}^{{{{\rm{intra}}}}}({{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i})\,=\,\mathop{\bigcup}\limits_{j}{{{{\boldsymbol{\Theta }}}}}_{j}^{T}{F}_{j}^{{{{\rm{intra}}}}}({{{{\boldsymbol{\xi }}}}}_{i})\end{array}$$

(7)

where intra-molecular energies are invariant with respect to τ_j and Θ_j, while intra- forces and virials are invariant with respect to τ_j and equivariant with Θ_j. We compute the inter-molecular contributions as the difference:

$$\begin{array}{ll}{E}^{\rm{inter}}({{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i})= {E}^{{{{\rm{total}}}}}({{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i})-\mathop{\sum}\limits_{j}{E}_{j}^{{{{\rm{intra}}}}}({{{{\boldsymbol{\xi }}}}}_{i})\end{array}$$

(8)

for both GAP and DFT values.

Another method we employ for separating the intra-/inter- errors is based on the molecular DoFs. We decompose atomic forces into translational, rotational and vibrational parts⁶⁶, using:

$$\begin{array}{l}{{{{\bf{f}}}}}_{i}^{{{{\rm{trans}}}}}\,=\,\frac{{m}_{i}}{{M}_{j}}\mathop{\sum}\limits_{k\in j}{{{{\bf{f}}}}}_{k}\\ {{{{\bf{f}}}}}_{i}^{{{{\rm{rot}}}}}\,=\,{m}_{i}{{{{\bf{r}}}}}_{i}\times {\left({I}_{j}^{\alpha \beta }\right)}^{-1}\mathop{\sum}\limits_{k\in j}{{{{\bf{f}}}}}_{k}\times {{{{\bf{r}}}}}_{k}\\ {{{{\bf{f}}}}}_{i}^{{{{\rm{vib}}}}}\,=\,{{{{\bf{f}}}}}_{i}-{{{{\bf{f}}}}}_{i}^{{{{\rm{trans}}}}}-{{{{\bf{f}}}}}_{i}^{{{{\rm{rot}}}}}\end{array}$$

(9)

where

$$\begin{array}{l}{M}_{j}\,=\,\mathop{\sum}\limits_{k\in j}{m}_{k}\\ {I}_{j}^{\alpha \beta }\,=\,\mathop{\sum}\limits_{k\in j}{m}_{k}\left(| | {{{{\bf{r}}}}}_{k}| {| }^{2}{\delta }^{\alpha \beta }-{r}_{k}^{\alpha }{r}_{k}^{\beta }\right)\end{array}$$

(10)

are the mass and moment of inertia tensor of each molecule j, r_i are the atomic positions with respect to the molecular center of mass, the dummy index k runs over all atoms in molecule j, and α, β label the Cartesian directions. This decomposition has the advantage that it can be done in place, without additional calculations on the isolated molecules, however it only works for forces and not for energies and virials.

We quantify the errors by using RMSEs and relative RMSEs (RRMSEs) as defined bellow:

$$\begin{array}{l}{{{\rm{RMSE}}}}=\sqrt{\frac{1}{N}\mathop{\sum}\limits_{i}{\left({F}_{i}^{{{{\rm{DFT}}}}}-{F}_{i}^{{{{\rm{GAP}}}}}\right)}^{2}}\\ {{{\rm{RRMSE}}}}=\sqrt{\frac{{\sum }_{i}{\left({F}_{i}^{{{{\rm{DFT}}}}}-{F}_{i}^{{{{\rm{GAP}}}}}\right)}^{2}}{{\sum }_{i}{\left({F}_{i}^{{{{\rm{DFT}}}}}-\overline{{F}^{{{{\rm{DFT}}}}}}\right)}^{2}}}\end{array}$$

(11)

In Fig. 6 we show GAP-DFT force correlation plots for total quantities as well as for intra-/inter- components separately. We used a test set that contains 40 liquid configurations collected from GAP-MD runs with unphysical density fluctuations. Both Gen10/DS and Gen11/DS models yield good total-force RMSEs which are roughly 3% of the maximum force values ( ~ 6 eV Å⁻¹).

**Fig. 6: Separation of intra-/inter- errors.**

Gen10/DS, which does not include isolated molecules in the training set, performs poorly on both the intra-force and inter-force components. This is a consequence of the limited extrapolation power of the model which was trained only on liquid configurations and cannot accurately predict isolated molecule properties (intra- component). The large error on the inter- is simply propagated from the intra- error and is not indicative of the real accuracy of the model. This conclusion is supported by the DoF-based decomposition of the forces, where both models perform similarly well on all force components.

In Gen11/DS we include isolated molecules in the training set and obtain a meaningful decomposition of the intra-/inter- error. This model, yields significantly better RMSEs for both intra- and inter-forces, even though the total RMSE is about the same as in Gen10/DS. In this case, the two RMSE components are predictive of the model accuracy on two separate interaction scales. Although, the absolute RMSEs for the two contributions are similar, in relative terms (RRMSE) the inter-molecular error is significantly larger than the intra-: 1.26 vs 0.09 because inter-forces (~0.5 eVÅ⁻¹) are an order of magnitude smaller than intra-forces (~5 eVÅ⁻¹). Reducing the inter-RMSE (or translational and rotational in the DoF decomposition) is the main challenge in this work and, more generally, in all molecular condensed phase systems where we aim to obtain stable dynamics and properties that depend on inter-molecular forces.

Rigid-molecule volume scans

The rigid-molecule volume scan illustrated in Fig. 7B, was introduced as an additional test to quantify the purely inter-molecular interactions along a special path on the PES:

$${{{{\bf{r}}}}}_{i}(s)=s{{{{\boldsymbol{\tau }}}}}_{j}+{{{{\boldsymbol{\Theta }}}}}_{j}{{{{\boldsymbol{\xi }}}}}_{i}$$

(12)

where internal molecular DoFs and molecular rotations are frozen, and the volume is scaled by a factor s around the starting value by adjusting the center of mass of all molecules. As the starting point for the scans we use liquid configurations from GAP-MD trajectories.

The resulting energy curves are purely inter-molecular in nature:

$${E}^{{{{\rm{total}}}}}(s)={E}^{{{{\rm{inter}}}}}(s)+{E}^{{{{\rm{intra}}}}}{| }_{{{{\boldsymbol{\xi }}}} = {{{\rm{fixed}}}}}$$

(13)

and manifest a steep repulsive part as the volume is compressed beyond its equilibrium value (s < 1) and an attractive part in the opposite direction (s > 1). At long-range (s ≫ 1), the leading contributions to the inter-molecular energy in our system is electrostatic. Specifically, since all molecules are electrically neutral, the main contribution is the dipole-dipole interaction which decays as 1/r³. The remaining dispersion-induction interactions have a 1/r⁶ dependence and vanish at shorter range. In a two-body approximation, the energy along this curve can be understood as the sum over many different molecular dimer curves averaged over multiple orientations. In practice we find the performance on these tests correlates well with the model stability in GAP-MD simulations.

Figure 7C shows the energy curves along such a volume scan for three different GAP models: Gen10/DS trained only on liquid configurations, Gen11/DS trained on liquid and isolated molecule configurations and Gen12/DS where some volume scans are also included in the training set. These predictions are compared to DFT results. Gen10/DS reproduces the ab initio reference near the equilibrium volume (s ≈ 1) but overestimates both the repulsive and attractive regimes. This over-binding reduces the density fluctuations in Gen10/DS compared to earlier models, however, it also leads to density over-estimation (Fig. 2C) and reduced flow in MD simulations. We find that including isolated molecules in Gen11/DS improves the prediction of the energy curve towards the zero-density regime, where the inter-energy vanishes and the total energy becomes purely intra-.

Folding some of the configurations along the volume scans back into the training set (Gen12/DS) improves the predicted energy curves throughout, including on unseen test curves (Fig. 7C). Volume scan configurations interpolate between liquid environments and isolated molecules in configuration space as can be seen in the kernel principal component analysis (K-PCA) plot (Fig. 7A). These special configurations help improve the sampling of the PES along the shallow-curvature of the inter-molecular modes which can be poorly sampled during MD. On one hand, a better estimation of the energy in the s < 1 regime means the repulsive core of atoms is better captured which helps avoid unphysical close-approaches during MD. On the other hand, the improvement in energy prediction in the attractive regime means that Gen12/DS correctly captures the medium-range inter-molecular interactions in the bulk, despite being a finite-range model. With this model, the predicted NPT densities are in better agreement with converged values for all molecular compositions (Fig. 2C).

Molecular compositions

The inter-molecular interactions are difficult to capture because they are weaker, but also because the relevant geometries are more diverse than for intra-molecular interactions. With n the number of molecules and N the average number of atoms per molecule, there are 3n(N − 2) intra-molecular DoFs and 6n inter-molecular ones. In the EC:EMC (3:7 M) system this equates to 34n intra- and just 6n inter- DoFs. However, with just two molecule types in the liquid, most intra- environments are identical to each other, whereas inter- environments vary significantly owing to the continuous reorientation of the molecules in the liquid state and the additional randomness due to mixing. As a result, capturing the richness of inter-molecular environment is a significantly harder task for the ML models.

Models trained only on the target molecular composition ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}=$ 3:7 M (up to Gen5/DS) are not transferable to other compositions ${{{\mathcal{C}}}}\,\ne \,{{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$. This manifests as poor stability of GAP-MD simulations at ${{{\mathcal{C}}}}\,\ne \,{{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$ (Fig. 2B), which are plagued by the formation of bubbles, and it is also apparent in volume scan tests. The poor transferability can be understood by examining the K-PCA plot of a single atom type (SI for details) shown in Fig. 8A. Local atomic environments originating from different liquid compositions ${{{\mathcal{C}}}}$ occupy distinct regions in configuration space, meaning models trained only on ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$ operate in an extrapolative regime when tested on other compositions.

Figure 8 shows a collection of volume scans for two molecular compositions: 3:7 M and 7:3 M, obtained with different GAP models. Gen5/DS which was trained only on the target composition, correctly reproduces the energy of the 3:7 M composition at the equilibrium density but grossly under/overestimates the energy at 7:3 M, by hundreds of meV atom⁻¹. Adding isolated molecules to the training set improves the 7:3 M prediction and the new model, Gen5+IM/DS, correctly reproduces the total energy in the limit s → ∞, where only intra-molecular energy terms survive.

The error in the inter-molecular energy of the 7:3 M liquid state (s ≈ 1) is reduced by about 50 meV atom⁻¹, however adding isolated molecules is not sufficient to fix this problem. In Gen6+IM/DS we further extend the training set by adding configurations across the full range of molecular compositions. As shown in Fig. 8D, this new model performs significantly better and volume scan energy predictions at s ≈ 1 are within a few meV atom⁻¹ of the DFT values. Importantly, GAP-MD trajectories obtained with models above Gen6/DS are stable at all compositions (Fig. 2).

Augmenting the training sets with the full range of compositions, significantly improves the representation of all local molecular environments c and these models perform better on ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$ as well. At a fixed ${{{\mathcal{C}}}}$, the typical GAP-MD trajectory explores a wide range of environments c defined as the number of molecules of each type within a cutoff radius R of a center molecule:

$$c=\{{n}_{{{{\rm{EC}}}}}^{R},{n}_{{{{\rm{EMC}}}}}^{R}\}$$

(14)

The probability distribution of such environments depends on the global composition $f(c| {{{\mathcal{C}}}})$ and some environments are more likely than others. Figure 8B shows an example distribution from GAP-MD at ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$. While the most prevalent environment is $c\equiv {{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$, some atypical environments like c = {3, 0} still occur with non-zero probability. In models up to Gen5/DS (trained only on ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$) such low-probability environments are poorly represented in the training set and therefore yield large errors. Extending the training sets to the full range of compositions ${{{\mathcal{C}}}}$ improves the coverage of all local molecular environments c and reduces the overall errors, including at the target composition ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$. This is reflected in more stable and reproducible GAP-MD trajectories at ${{{{\mathcal{C}}}}}_{{{{\rm{t}}}}}$.

Model optimization

All GAP models above Gen11/DS were trained on isolated molecules, volume scans and multiple compositions. These special configurations significantly improve the versatility of the training sets and the description of inter-molecular interactions. The density collapse problem is largely solved, however models up to Gen15/DS continue to exhibit reversible, but unphysical density fluctuations in GAP-MD, especially at higher temperatures (Figs. 9A, 2A). Iterative training does not improve these issues, suggesting further model optimization is required.

On reevaluating selected points along such a density fluctuations with DFT (Fig. 9B), we find that Gen15/DS reproduces the intra-energy with good accuracy, whereas the inter-energy is systematically under-estimated by around 4–9 meV atom⁻¹. Surprisingly, the 40% drop in density only changes the DFT inter-energy by about 10 meV atom⁻¹, whereas the Gen15/DS model under-estimates this energy barrier by about 5 meV atom⁻¹. This suggests tight inter-energy accuracies are required (order 1 meV atom⁻¹) to keep NPT simulations stable. We find that by tuning the model hyper-parameters we can improve these errors. Specifically, increasing N_sparse improves the flexibility of the model which is required to describe complex inter-molecular environments. A larger σ_atom makes the atomic density smoother and radial functions broader which decreases the resolution but improves model regularity (SI for details).

We find that using Turbo SOAP⁶³ instead of the original SOAP⁶¹, significantly improves the inter- errors on both forces and energies, as well as the smoothness of the model. Figure 9C compares Gen15/DS to Gen15/DTS (double Turbo SOAP) on a test set comprising 40 liquid configurations selected form GAP-MD simulations. Gen15/DTS yields inter-energy RMSEs of around 1 meV atom⁻¹ and 40% smaller inter-force RMSEs. Moreover, the Turbo SOAP model produces smoother volume scan curves as illustrated by the molecular center of mass forces shown in Fig. 9D. This improvement is largely accounted for by the quality of the poly3gauss radial basis, most other parameters being unchanged (SI for more details).

These optimizations for both SOAP and Turbo SOAP, result in improved performance at Gen16. GAP-MD dynamics obtained with both Gen16/DS and Gen16/DTS are stable at all compositions at temperatures up to 500 K (Fig. 2 and SI).

Discussion

In this work we demonstrate a robust ML training protocol for the EC:EMC binary solvent and identify the key challenges in developing converged ML models. We validate our best models against extensive ab initio MD results and compare the density predictions to experimental values. Below we summarize our key findings.

ML potentials fitted on fixed data sets are not MD-stable, especially in the NPT ensemble where density fluctuations are sensitive to an accurate description of the inter-molecular interactions. Iterative training is essential for obtaining a robust model and unphysical density fluctuations can be used as a proxy for extending the training sets.

Intra- and inter-molecular interactions have different scales and dimensionalities which complicates ML modelling. The key difficulties in training our models stem from the large imbalance between intra- and inter-molecular interactions and are likely general in all molecular condensed phase systems. The intra-component is easier to fit from total energies and forces because it dominates the loss function. However, the significantly weaker inter-component determines the thermodynamic properties of the liquid and therefore a good model must describe this component of the interactions with great accuracy. We develop a range of techniques such as rigid-molecule volume scans and intra-/inter- error separation. We employ these techniques to evaluate the potential and to improve its performance on the relevant inter-molecular scale.

Multi-component molecular liquids present additional challenges to ML modelling because they feature an increased diversity of local molecular environments. We demonstrate that a potential intended to model a target molecular composition must still be trained on all compositions. Extending the data in this way, ensures all local molecular environments are uniformly represented in the training set.

We further show that when using SOAP descriptors, the hyper-parameters need to be selected carefully to capture both intra- and inter-molecular interactions. We optimize the radial basis and the sparse set to achieve the required trade-off which performs well on both scales.

We expect our learning protocol to generalize to more complex systems relevant for batteries, such as the electrolyte (solvent + salt) and the solvent-electrode interphase (SEI). In both cases, long-range electrostatics^{51,52,53,54,55,56,57} will likely be required to correctly capture ion transport and interface chemistry. More generally, the lessons we learned here will help guide future ML modelling in these systems.

Methods

GAP models

We fit and test our potentials using the standard implementation of GAP^2,60 from the QUIP library⁶⁷. The SOAP descriptors are computed from the atomic density of each atom i of chemical species a:

$$\begin{array}{ll}{\rho}^{ia}(r)\,=\, \mathop{\sum}\limits_{j}{\delta }_{ab}\exp \left(\frac{-| r-{r}_{ij}{| }^{2}}{2{\sigma }_{{{{\rm{atom}}}}}^{2}}\right){f}_{{{{\rm{cut}}}}}({r}_{ij}\, < \,{R}_{{{{\rm{cut}}}}})\\\qquad\quad=\, \mathop{\sum }\limits_{n}^{{N}_{\max }}\mathop{\sum }\limits_{lm}^{{L}_{\max }}{c}_{nlm}^{ia}{R}_{n}(r){Y}_{lm}(\hat{{{{\bf{r}}}}})\end{array}$$

(15)

where b is the chemical species of atom j, σ_atom is the width of the Gaussian smoothing and R_cut is the cut-off of the potential. The density is expanded in a radial-angular basis (controlled by ${N}_{\max }$ and ${L}_{\max }$) and the expansion coefficients ${c}_{nlm}^{ia}$ are used to construct the rotationally-invariant power spectrum:

$$\xi ={p}_{n{n}^{{\prime} }l}^{ia{a}^{{\prime} }}=\frac{1}{\sqrt{2l+1}}\mathop{\sum}\limits_{m}{c}_{nlm}^{ia}{c}_{{n}^{{\prime} }lm}^{i{a}^{{\prime} }}$$

(16)

These feature vectors enter a polynomial kernel:

$$k(\xi ,{\xi }^{{\prime} })={(\xi \cdot {\xi }^{{\prime} })}^{\zeta }$$

(17)

with ζ = 4 in this work. We use a double SOAP (DS) model with separate kernels:

$$\begin{array}{ll}E={{{\Delta }}}_{1}\mathop{\sum }\limits_{m=1}^{{N}_{{{{\rm{sparse}}}}}^{(1)}}{c}_{m}^{(1)}k({\xi }^{(1)},{\xi }_{m}^{(1)})\\ \quad \quad +\,{{{\Delta }}}_{2}\mathop{\sum }\limits_{m=1}^{{N}_{{{{\rm{sparse}}}}}^{(2)}}{c}_{m}^{(2)}k({\xi }^{(2)},{\xi }_{m}^{(2)})\end{array}$$

(18)

where the first SOAP is higher resolution, shorter range and the second SOAP is lower resolution, longer range: σ_atom,2 = 2 × σ_atom,1 and R_cut,2 = 2 × R_cut,1. Scaling parameters are the same for both SOAPs (Δ₁ = Δ₂ = 0.1), cut-offs are R_cut,1 = 3Å and R_cut,2 = 6Å, while the rest of the parameters (reported in Table 1) were adjusted at different generations of the potential during iterative training.

Table 1 SOAP/GAP hyper-parameters.

Full size table

The loss function is minimized on total energy (E), forces (F) and stress virials (V):

$$\begin{array}{ll}{{{\mathscr{L}}}}=\mathop{\sum }\limits_{n}^{{N}_{{{{\rm{conf}}}}}}\frac{{({E}^{{{{\rm{DFT}}}}}-{E}^{{{{\rm{GAP}}}}})}^{2}}{{N}_{{{{\rm{atom}}}},n}{\sigma }_{{{{\rm{E}}}}}^{2}}\\ \qquad\,+\,\mathop{\sum }\limits_{n}^{{N}_{{{{\rm{conf}}}}}}\mathop{\sum }\limits_{i}^{{N}_{{{{\rm{atom}}}},n}}\frac{\parallel {F}_{i}^{{{{\rm{DFT}}}}}-{F}_{i}^{{{{\rm{GAP}}}}}{\parallel }^{2}}{{\sigma }_{{{{\rm{F}}}}}^{2}}\\ \qquad\,+\,\mathop{\sum }\limits_{n}^{{N}_{{{{\rm{conf}}}}}}\frac{\parallel {V}^{{{{\rm{DFT}}}}}-{V}^{{{{\rm{GAP}}}}}{\parallel }^{2}}{{N}_{{{{\rm{atom}}}},n}{\sigma }_{{{{\rm{V}}}}}^{2}}\\ \qquad\,+\,\mathop{\sum }\limits_{m=1}^{{N}_{{{{\rm{sparse}}}}}}{c}_{m}k({\xi }_{m},{\xi }_{m}^{{\prime} }){c}_{m}^{{\prime} }\end{array}$$

(19)

with regularization parameters: σ_E = 1 meV atom⁻¹ (up to Gen10 models) and σ_E = 2 meV atom⁻¹ (from Gen11), σ_F = 0.1 eV Å⁻¹, σ_V = 0.01 eV. At Gen15 (see SI) and Gen16 we also employ Turbo SOAP descriptors⁶³ which are computed from the atomic density:

$$\begin{array}{l}{\rho}^{ia}(r)=\mathop{\sum}\limits_{j}{\delta }_{ab}\exp \left(\frac{-| r-{r}_{ij}{| }^{2}}{2{\sigma }_{{{{\rm{atom}}}},\parallel }^{2}}\right)\\ \qquad \,\,\,\,\,\,\,\,\,\,\times \,\exp \left(\frac{-{r}_{\perp ,ij}^{2}}{2{\sigma }_{{{{\rm{atom}}}},\perp }^{2}}\right){f}_{{{{\rm{cut}}}}}({r}_{ij}\, < \,{R}_{{{{\rm{cut}}}}})\end{array}$$

(20)

where the radial (∥) and angular (⊥) parts are separated and the σ_atom parameters have linear radial dependence:

$$\begin{array}{ll}&{\sigma }_{{{{\rm{atom}}}},\parallel }={\sigma }_{{{{\rm{atom}}}},\parallel ,0}+{\alpha }_{\parallel }r\\ &{\sigma }_{{{{\rm{atom}}}},\perp }={\sigma }_{{{{\rm{atom}}}},\perp ,0}+{\alpha }_{\perp }r\end{array}$$

(21)

In this work we set σ_atom,∥ = σ_atom,⊥. We explored different single Turbo SOAP models with varying σ_atom (SI), however, in the Results section here, we only report double Turbo SOAP (DTS) with α_∥ = α_⊥ = 0 which is our best model. The Turbo SOAP models use a different radial basis which yields smoother potential energy surfaces (check ref. ⁶³ for details). All descriptors are implemented in the QUIP library, see SI for detail input strings.

Iterative Training

We start from Gen0 with a small subset (50 liquid configurations) chosen from the OPLS data-set. At each subsequent iteration, we train a new GAP model on the current training set and use it to run NPT dynamics at various temperatures, pressures, compositions, cell sizes as detailed in Table 2. We typically run multiple GAP-MD simulations spanning 20 to 100 ps, staring from different coordinate seeds. From each GAP-MD trajectory, we choose a single new configuration (typically around unphysical density fluctuations). This results in a few new configurations per generation (depending on the number of MD trajectories, see Table 2), we compute their ab initio properties, and augment our training set for the next generation of the potential.

Table 2 Training set construction.

Full size table

The sampling for the iterative procedure was carried out exclusively using DS models (see Table 1), while DTS models were employed for additional testing and production simulations.

In addition to these iterations, we add a range of special configurations to our training sets: at Gen11 we add isolated molecules (obtained by deconstructing the 50 OPLS configs from Gen0) and at Gen12 we add volume scans. Up to Gen5 we train only on the target composition, while from Gen6 we expand the data sets to all molecular compositions ranging from pure EC to pure EMC.

The system sizes we use for the iterative procedure are smaller than those employed in the testing/production runs to limit the cost of the DFT calculations. Up to Gen13, we use 12 molecule systems which means different molecular compositions have different system sizes, with the EC molecule being smaller than EMC (10 atoms vs 15). From Gen14 onward, we employ systems with roughly the same number of atoms for each molecular composition, such that the cost of DFT and GAP-MD is more homogeneous across all configurations. This also ensures the finite-size effects are similar for all compositions.

The diversity of molecular environments increases substantially with iterative training. We adjust the number of sparse points (Table 1) to improve the model flexibility and capture the increasing complexity of the data.

Electronic structure calculations

The molecular dimers used for bench-marking (Fig. 3) were geometry-optimized with different DFT XC functionals (BLYP, PBE, B3LYP and PBE0) and with D3BJ⁶⁸ dispersion corrections, using the def2-TZVP basis sets. We check the stability of the local minimum using vibrational frequency analysis. The distance between molecular centre of masses is used to investigate the strength of intermolecular interactions. All dimer DFT calculations (with and without dispersion corrections⁶⁹) were performed with Orca 5.0⁷⁰. The PNO-LCCSD(T)-F12/AVTZ-F12⁷¹ approach as implemented in MOLPRO 2021.2⁷² was used with the “tight" settings to obtain the reference interaction energies at the DFT optimized geometries.

For iterative training (including volume scans and isolated molecules), we obtain the total energies, forces, and virials from spin-unpolarized periodic electronic structure calculations at the Γ point. We used the plane wave implementation of CASTEP⁷³, with the PBE XC functional and D2 dispersion correction (keyword: G06)⁷⁴, an energy cutoff of 800 eV and a Monkhorst-Pack 1 x 1 x 1 k-point grid. Standard CASTEP ultrasoft pseudopotentials were used to model the core electrons.

The final training set (Gen16) was recomputed with the mixed Gaussian/plane wave method implemented in CP2K 8.2⁷⁵. We used the PBE-D3 functional with a TZV2P basis set and a density cutoff of 800 Ry. Core electrons were modelled with norm-conserving GTH pseudopotentials.

MD Simulations

OPLS-MD and GAP-MD simulations were performed in LAMMPS⁶⁶. We obtain the OPLS⁶² parameters with CM1A charges⁷⁶ using the online generator LibParGen⁷⁷. The GAP potentials are loaded in LAMMPS using the pair_style quip command. The equations of motion were integrated using the velocity-Verlet algorithm with a time step of 1 fs. For iterative training we used the Nose-Hoover thermostat with chain length 3 (LAMMPS default) and temperature damping set to 100 fs, and the Nose-Hoover barostat (isotropic pressure) with chain length 3 (LAMMPS default) and pressure damping set to 1 ps. The settings of the simulation (number of molecules, compositions, temperatures, densities) are reported in Table 2.

At selected generations of the potential (Fig. 2) we perform 100 ps NPT GAP-MD tests with larger system sizes (48 molecules), at different compositions and temperatures. The different compositions were modelled using 48 EC molecules (pure EC), 32 EC + 16 EMC (7:3 M), 16 EC + 32 EMC (3:7 M) and 48 EMC (pure EMC).

We also perform 1000 atoms NPT simulations (Fig. 5) with the final versions of the potential (Gen16/DTS) retrained on the PBE-D3 functional. For these simulations, we model four different compositions (Fig. 5) using: 100 EC molecules (pure EC), 57 EC + 29 EMC (7:3 M), 25 EC + 50 EMC (3:7 M) and 67 EMC molecules (pure EMC). All NPT simulations used the barostat, thermostat, integration settings.

AIMD simulations were performed with QUICKSTEP as implemented in the CP2K code⁷⁵. The wave function extrapolation (between MD steps) was obtained with the always stable predictor-corrector method (ASPC) (extrapolation order 3). We used the NVT ensemble and the CSVR thermostat⁷⁸ with 100 fs damping to keep the temperature constant at 500 K. We obtained three independent 48-molecule (12 EC + 36 EMC) trajectories at liquid densities ρ = (0.85, 0.95 and 1.08 g cm⁻³) with cell dimensions L = (19.358, 20.235 and 21.253 Å). The cutoff radius of the dispersion interactions was set to 10 Å. The time step was set to 0.5 fs, and for each box, a 20 ps MD trajectory was generated after a 5 ps equilibration run. For a direct and fair comparison to GAP-MD we ran the same NVT simulations using the LAMMPS implementation of the CSVR thermostat with 100 fs damping, and an integration time step of 0.5 fs. The GAP potentials (Gen16/DS and Gen16/DTS) were retrained with the same electronic structure settings as used in the AIMD.

Experimental details

Mixtures of ethylene carbonate and ethyl methyl carbonate were prepared by weighing EC and EMC in the desired ratio; gentle heating to 60 °C (333 K) was required to fully dissolve EC. The prepared solvent was degassed using three freeze-pump-thaw degas cycles and dried over 4 Å activated molecular sieves. The water content of the mixtures was found to be less than 50 ppm by Karl-Fischer titration. The exact molar ratio was determined by integrating the EC and EMC signals in the ¹H NMR spectra. NMR spectra were recorded on a Bruker 300 MHz Ascend spectrometer. Measured ratios were: 69.3 EC to 30.7 EMC by NMR (7:3 M) and 29.9 EC to 70.1 EMC by NMR (3:7 M).

Density measurements were performed at 22 ^∘C (295 K) using an Anton Paar DMA 48 instrument, which was calibrated with water, ethanol and propylene carbonate before measuring the mixtures. 5 ml of each liquid was used for the measurements.

Data availability

The final train/test sets, final potential and MD trajectories are freely available at https://doi.org/10.5281/zenodo.8043753.

Code availability

Alongside the data, we also provide a Jupyter notebook which explains in detail the structure of the data. The notebook also demonstrates how to create volume scans, intra/inter splits, analyze molecular compositions and analyze the MD trajectories generated in this study. The Python functionality we developed for this project has been made publicly available at https://github.com/imagdau/aseMolec.git.

References

Musil, F. et al. Physics-inspired structural representations for molecules and materials. Chem. Rev. 121, 9759–9815 (2021).
CAS Google Scholar
Deringer, V. L. et al. Gaussian process regression for materials and molecules. Chem. Rev. 121, 10073–10141 (2021).
CAS Google Scholar
Langer, M. F., Goeßmann, A. & Rupp, M. Representations of molecules and materials for interpolation of quantum-mechanical simulations via machine learning. NPJ Comput. Mater. 8, 1–14 (2022).
Google Scholar
Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 9, 1–10 (2018).
CAS Google Scholar
Christensen, A. S. & von Lilienfeld, O. A. On the role of gradients for machine learning of molecular energies and forces. Mach. Learn.: Sci. Technol. 1, 045018 (2020).
Google Scholar
Qiao, Z., Welborn, M., Anandkumar, A., Manby, F. R. & Miller III, T. F. Orbnet: Deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. J. Chem. Phys. 153, 124111 (2020).
CAS Google Scholar
Kovács, D. P. et al. Linear atomic cluster expansion force fields for organic molecules: beyond rmse. J. Chem. Theory Comput. 17, 7696–7711 (2021).
Google Scholar
Rosenberger, D., Smith, J. S. & Garcia, A. E. Modeling of peptides with classical and novel machine learning force fields: A comparison. J. Phys. Chem. B 125, 3598–3612 (2021).
CAS Google Scholar
Deringer, V. L. & Csányi, G. Machine learning based interatomic potential for amorphous carbon. Phys. Rev. B 95, 094203 (2017).
Google Scholar
Bartók, A. P., Kermode, J., Bernstein, N. & Csányi, G. Machine learning a general-purpose interatomic potential for silicon. Phys. Rev. X 8, 041048 (2018).
Google Scholar
Deringer, V. L., Caro, M. A. & Csányi, G. A general-purpose machine-learning force field for bulk and nanostructured phosphorus. Nat. Commun. 11, 1–11 (2020).
Google Scholar
Deringer, V. L., Pickard, C. J. & Csányi, G. Data-driven learning of total and local energies in elemental boron. Phys. Rev. Lett. 120, 156001 (2018).
CAS Google Scholar
Bernstein, N., Csányi, G. & Deringer, V. L. De novo exploration and self-guided learning of potential-energy surfaces. NPJ Comput. Mater. 5, 1–9 (2019).
Google Scholar
Dragoni, D., Daff, T. D., Csányi, G. & Marzari, N. Achieving dft accuracy with a machine-learning interatomic potential: Thermomechanics and defects in bcc ferromagnetic iron. Phys. Rev. Mater. 2, 013808 (2018).
Google Scholar
Maresca, F., Dragoni, D., Csányi, G., Marzari, N. & Curtin, W. A. Screw dislocation structure and mobility in body centered cubic fe predicted by a gaussian approximation potential. NPJ Comput. Mater. 4, 1–7 (2018).
CAS Google Scholar
Zhang, Z., Csányi, G. & Alfè, D. Partitioning of sulfur between solid and liquid iron under earth’s core conditions: Constraints from atomistic simulations with machine learning potentials. Geochim. Cosmochim. Acta. 291, 5–18 (2020).
CAS Google Scholar
Sivaraman, G. et al. Machine-learned interatomic potentials by active learning: amorphous and liquid hafnium dioxide. NPJ Comput. Mater. 6, 1–8 (2020).
Google Scholar
Liu, Y.-B. et al. Machine learning interatomic potential developed for molecular simulations on thermal properties of β-ga2o3. J. Chem. Phys. 153, 144501 (2020).
Google Scholar
Rowe, P., Deringer, V. L., Gasparotto, P., Csányi, G. & Michaelides, A. An accurate and transferable machine learning potential for carbon. J. Chem. Phys. 153, 034702 (2020).
CAS Google Scholar
Caro, M. A., Deringer, V. L., Koskinen, J., Laurila, T. & Csányi, G. Growth mechanism and origin of high sp3 content in tetrahedral amorphous carbon. Phys. Rev. Lett. 120, 166101 (2018).
CAS Google Scholar
Caro, M. A., Csányi, G., Laurila, T. & Deringer, V. L. Machine learning driven simulated deposition of carbon films: From low-density to diamondlike amorphous carbon. Phys. Rev. B 102, 174201 (2020).
CAS Google Scholar
Timmermann, J. et al. Iro_2 surface complexions identified through machine learning and surface investigations. Phys. Rev. Lett. 125, 206101 (2020).
CAS Google Scholar
Mones, L., Ortner, C. & Csányi, G. Preconditioners for the geometry optimisation and saddle point search of molecular systems. Sci. Rep. 8, 1–11 (2018).
CAS Google Scholar
Cheng, B., Mazzola, G., Pickard, C. J. & Ceriotti, M. Evidence for supercritical behaviour of high-pressure liquid hydrogen. Nature 585, 217–220 (2020).
CAS Google Scholar
Zong, H., Wiebe, H. & Ackland, G. J. Understanding high pressure molecular hydrogen with a hierarchical machine-learned potential. Nat. Commun. 11, 1–9 (2020).
Google Scholar
Veit, M. et al. Equation of state of fluid methane from first principles with machine learning potentials. J. Chem. Theory Comput. 15, 2574–2586 (2019).
CAS Google Scholar
Wengert, S., Csányi, G., Reuter, K. & Margraf, J. T. Data-efficient machine learning for molecular crystal structure prediction. Chem. Sci. 12, 4536–4546 (2021).
CAS Google Scholar
Montes-Campos, H., Carrete, J., Bichelmaier, S., Varela, L. M. & Madsen, G. K. A differentiable neural-network force field for ionic liquids. J. Chem. Inf. Model. 62, 88–101 (2021).
Google Scholar
Morawietz, T., Singraber, A., Dellago, C. & Behler, J. How van der waals interactions determine the unique properties of water. Proc. Natl. Acad. Sci. U.S.A. 113, 8368–8373 (2016).
CAS Google Scholar
Schran, C. et al. Machine learning potentials for complex aqueous systems made simple. Proc. Natl. Acad. Sci. USA 118, e2110077118 (2021).
CAS Google Scholar
Zhang, L., Wang, H., Car, R. & Weinan, E. Phase diagram of a deep potential water model. Phys. Rev. Lett. 126, 236001 (2021).
CAS Google Scholar
Niblett, S. P., Galib, M. & Limmer, D. T. Learning intermolecular forces at liquid–vapor interfaces. J. Chem. Phys. 155, 164101 (2021).
CAS Google Scholar
Schran, C., Behler, J. & Marx, D. Automated fitting of neural network potentials at coupled cluster accuracy: Protonated water clusters as testing ground. J. Chem. Theory Comput. 16, 88–99 (2019).
Google Scholar
Yao, N., Chen, X., Fu, Z.-H. & Zhang, Q. Applying classical, ab initio, and machine-learning molecular dynamics simulations to the liquid electrolyte for rechargeable batteries. Chem. Rev. 122, 10970–11021 (2022).
CAS Google Scholar
Jacobson, L. D. et al. Transferable neural network potential energy surfaces for closed-shell organic molecules: Extension to ions. J. Chem. Theory Comput. 18, 2354–2366 (2022).
CAS Google Scholar
Dajnowicz, S. et al. High-dimensional neural network potential for liquid electrolyte simulations. J. Phys. Chem. B 126, 6271–6280 (2022).
CAS Google Scholar
Schriber, J. B. et al. Cliff: A component-based, machine-learned, intermolecular force field. J. Chem. Phys. 154, 184110 (2021).
CAS Google Scholar
Konrad, M. & Wenzel, W. Coni-net: Machine learning of separable intermolecular force fields. J. Chem. Theory Comput. 17, 4996–5006 (2021).
CAS Google Scholar
Cisneros, G. A. et al. Modeling molecular interactions in water: From pairwise to many-body potential energy functions. Chem. Rev. 116, 7501–7528 (2016).
CAS Google Scholar
Behler, J. First principles neural network potentials for reactive simulations of large molecular and condensed systems. Angew. Chem. Int. Ed. 56, 12828–12840 (2017).
CAS Google Scholar
Young, T. A., Johnston-Wood, T., Deringer, V. L. & Duarte, F. A transferable active-learning strategy for reactive molecular force fields. Chem. Sci. 12, 10944–10955 (2021).
CAS Google Scholar
Mondal, A., Kussainova, D., Yue, S. & Panagiotopoulos, A. Z. Modeling chemical reactions in alkali carbonate-hydroxide electrolytes with deep learning potentials. J. Chem. Theory Comput. 19, 4584–4595 (2023).
CAS Google Scholar
Ponnuchamy, V., Mossa, S. & Skarmoutsos, I. Solvent and salt effect on lithium ion solvation and contact ion pair formation in organic carbonates: a quantum chemical perspective. J. Phys. Chem. C 122, 25930–25939 (2018).
CAS Google Scholar
Skarmoutsos, I., Ponnuchamy, V., Vetere, V. & Mossa, S. Li+ solvation in pure, binary, and ternary mixtures of organic carbonate electrolytes. J. Phys. Chem. C 119, 4502–4515 (2015).
CAS Google Scholar
Oldiges, K. et al. Understanding transport mechanisms in ionic liquid/carbonate solvent electrolyte blends. Phys. Chem. Chem. Phys. 20, 16579–16591 (2018).
CAS Google Scholar
Borodin, O. & Smith, G. D. Quantum chemistry and molecular dynamics simulation study of dimethyl carbonate: ethylene carbonate electrolytes doped with lipf6. J. Phys. Chem. B 113, 1763–1776 (2009).
CAS Google Scholar
Borodin, O. et al. Competitive lithium solvation of linear and cyclic carbonates from quantum chemistry. Phys. Chem. Chem. Phys. 18, 164–175 (2016).
CAS Google Scholar
Perner, V. et al. Insights into the solubility of poly (vinylphenothiazine) in carbonate-based battery electrolytes. ACS Appl. Mater. Interfaces 13, 12442–12453 (2021).
CAS Google Scholar
Ong, M. T. et al. Lithium ion solvation and diffusion in bulk organic electrolytes from first-principles and classical reactive molecular dynamics. J. Phys. Chem. B 119, 1535–1545 (2015).
CAS Google Scholar
Zhang, X. & Kuroda, D. G. An ab initio molecular dynamics study of the solvation structure and ultrafast dynamics of lithium salts in organic carbonates: A comparison between linear and cyclic carbonates. J. Chem. Phys. 150, 184501 (2019).
Google Scholar
Ko, T. W., Finkler, J. A., Goedecker, S. & Behler, J. A fourth-generation high-dimensional neural network potential with accurate electrostatics including non-local charge transfer. Nat. Commun. 12, 1–11 (2021).
Google Scholar
Zhang, L. et al. A deep potential model with long-range electrostatic interactions. J. Chem. Phys. 156, 124107 (2022).
CAS Google Scholar
Shao, Y., Andersson, L., Knijff, L. & Zhang, C. Finite-field coupling via learning the charge response kernel. Electron. Struct. 4, 014012 (2022).
Google Scholar
Grisafi, A. & Ceriotti, M. Incorporating long-range physics in atomic-scale machine learning. J. Chem. Phys. 151, 204105 (2019).
Google Scholar
Xie, X., Persson, K. A. & Small, D. W. Incorporating electronic information into machine learning potential energy surfaces via approaching the ground-state electronic energy as a function of atom-based electronic populations. J. Chem. Theory Comput. 16, 4256–4270 (2020).
CAS Google Scholar
Gao, A. & Remsing, R. C. Self-consistent determination of long-range electrostatics in neural network potentials. Nat. Commun. 13, 1–11 (2022).
Google Scholar
Metcalf, D. P., Jiang, A., Spronk, S. A., Cheney, D. L. & Sherrill, C. D. Electron-passing neural networks for atomic charge prediction in systems with arbitrary molecular charge. J. Chem. Inf. Model. 61, 115–122 (2020).
Google Scholar
Yue, S. et al. When do short-range atomistic machine-learning models fall short? J. Chem. Phys. 154, 034111 (2021).
CAS Google Scholar
Cox, S. J. Dielectric response with short-ranged electrostatics. Proc. Natl. Acad. Sci. USA 117, 19746–19752 (2020).
CAS Google Scholar
Bartók, A. P., Payne, M. C., Kondor, R. & Csányi, G. Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons. Phys. Rev. Lett. 104, 136403 (2010).
Google Scholar
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
Google Scholar
Jorgensen, W. L. & Tirado-Rives, J. Potential energy functions for atomic-level simulations of water and organic and biomolecular systems. Proc. Natl. Acad. Sci. USA 102, 6665–6670 (2005).
CAS Google Scholar
Caro, M. A. Optimizing many-body atomic descriptors for enhanced computational performance of machine learning based interatomic potentials. Phys. Rev. B 100, 024112 (2019).
CAS Google Scholar
Pham, T. A. Ab initio simulations of liquid electrolytes for energy conversion and storage. Int. J. Quantum Chem. 119, e25795 (2019).
Google Scholar
Galib, M. et al. Mass density fluctuations in quantum and classical descriptions of liquid water. J. Chem. Phys. 146, 244501 (2017).
Google Scholar
Plimpton, S. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117, 1–19 (1995).
CAS Google Scholar
Quip. https://github.com/libAtoms/QUIP.
Grimme, S., Ehrlich, S. & Goerigk, L. Effect of the damping function in dispersion corrected density functional theory. J. Comput. Chem. 32, 1456–1465 (2011).
CAS Google Scholar
Grimme, S., Hansen, A., Brandenburg, J. G. & Bannwarth, C. Dispersion-corrected mean-field electronic structure methods. Chem. Rev. 116, 5105–5154 (2016).
CAS Google Scholar
Neese, F. Software update: The orca program system-version 5.0. Wiley Interdiscip. Rev. Comput. Mol. Sci. 12, e1606 (2022).
Google Scholar
Ma, Q. & Werner, H.-J. Accurate intermolecular interaction energies using explicitly correlated local coupled cluster methods[pno-lccsd(t)-f12]. J. Chem. Theory Comput. 15, 1044–1052 (2019).
CAS Google Scholar
Werner, H.-J. et al. The molpro quantum chemistry package. J. Chem. Phys. 152, 144107 (2020).
CAS Google Scholar
Clark, S. J. et al. First principles methods using castep. Z. Kristallogr. Cryst. Mater. 220, 567–570 (2005).
CAS Google Scholar
Grimme, S. Semiempirical gga-type density functional constructed with a long-range dispersion correction. J. Comput. Chem. 27, 1787–1799 (2006).
CAS Google Scholar
Kühne, T. D. et al. Cp2k: An electronic structure and molecular dynamics software package - quickstep: Efficient and accurate electronic structure calculations. J. Chem. Phys. 152, 194103 (2020).
Google Scholar
Dodda, L. S., Vilseck, J. Z., Tirado-Rives, J. & Jorgensen, W. L. 1.14* cm1a-lbcc: localized bond-charge corrected cm1a charges for condensed-phase simulations. J. Phys. Chem. B 121, 3864–3870 (2017).
CAS Google Scholar
Dodda, L. S., Cabeza de Vaca, I., Tirado-Rives, J. & Jorgensen, W. L. Ligpargen web server: an automatic opls-aa parameter generator for organic ligands. Nucleic Acids Res. 45, W331–W336 (2017).
CAS Google Scholar
Bussi, G., Donadio, D. & Parrinello, M. Canonical sampling through velocity rescaling. J. Chem. Phys. 126, 014101 (2007).
Google Scholar
Wang, A. A., Greenbank, S., Li, G., Howey, D. A. & Monroe, C. W. Current-driven solvent segregation in lithium-ion electrolytes. Cell Rep. Phys. Sci. 3, 101047 (2022).
CAS Google Scholar
Johnson, P. H. The properties of ethylene carbonate and its use in electrochemical applications a literature review. Lawrence Berkeley National Laboratory. LBNL Report #: LBL-19886 (1985).

Download references

Acknowledgements

The authors would like to thank Samuel Niblett, James Darby, Dávid Péter Kovács, William Baldwin and Miguel Caro for useful discussions. Holly Smith would like to acknowledge Darren M C Ould for assistance with preparing the EC:EMC mixtures, and Stuart M Clarke and Svetlana Menkin for helpful discussions. The project was funding by the European Union’s Horizon 2020 Research and Innovation Program under Grant Agreement No. 957189 (BIG-MAP project), the Swedish Research Council (VR) and the Swedish National Strategic e-Science program eSSENCE. H.S. was supported by EPSRC grant EP/R513180/1. Computer resources were provided by ARCHER2 via the UKCP consortium funded by EPSRC grant EP/P022065/1, and the Swedish National Infrastructure for Computing (SNIC) at HPC2N and PDC.

Author information

Authors and Affiliations

Engineering Laboratory, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, UK
Ioan-Bogdan Magdău & Gábor Csányi
Department of Chemistry–Ångström Laboratory, Uppsala University, Box 538, 75121, Uppsala, Sweden
Daniel J. Arismendi-Arrieta & Kersti Hermansson
Yusuf Hamid Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
Holly E. Smith & Clare P. Grey

Authors

Ioan-Bogdan Magdău
View author publications
You can also search for this author in PubMed Google Scholar
Daniel J. Arismendi-Arrieta
View author publications
You can also search for this author in PubMed Google Scholar
Holly E. Smith
View author publications
You can also search for this author in PubMed Google Scholar
Clare P. Grey
View author publications
You can also search for this author in PubMed Google Scholar
Kersti Hermansson
View author publications
You can also search for this author in PubMed Google Scholar
Gábor Csányi
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

G.C. and K.H. conceived the study and coordinated the research. I.B.M. developed ML methodology, performed calculation and analysis with input from GC. D.J.A.A. led the generation, curation and benchmarking of QM data and AIMD simulations. H.E.S. performed the experiments with guidance from CPG. I.B.M. wrote the paper with input from all other authors. All authors discussed the theory and results though periodic meetings, and contributed to editing the manuscript.

Corresponding author

Correspondence to Ioan-Bogdan Magdău.

Ethics declarations

Competing interests

The authors declare no competing non-financial interests but the following competing financial interests: G.C. is listed as inventor on a patent filed by Cambridge Enterprise Ltd. related to SOAP and GAP (US patent 8843509, filed on 5 June 2009 and published on 23 September 2014).

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Magdău, IB., Arismendi-Arrieta, D.J., Smith, H.E. et al. Machine learning force fields for molecular liquids: Ethylene Carbonate/Ethyl Methyl Carbonate binary solvent. npj Comput Mater 9, 146 (2023). https://doi.org/10.1038/s41524-023-01100-w

Download citation

Received: 05 December 2022
Accepted: 28 July 2023
Published: 17 August 2023
DOI: https://doi.org/10.1038/s41524-023-01100-w