Accelerating materials discovery

A central question in accelerating the materials discovery process is: how do we guide experiments towards materials with desired properties? This is a question of optimal experimental design, and our focus is to highlight how the computational and analysis tools developed in this and related areas can be employed to accelerate the materials discovery process. The Materials Genome Initiative in the U.S.,1 whose stated objective is to halve the time and costs of finding new materials, has catalyzed and spawned considerable activity. However, the materials discovered so far with enhanced properties as a consequence of this initiative are few and far between. Figure 1 illustrates a perspective in terms of our ability to find materials of increasing complexity as a function of time. Our trajectory has been one in which trial-and-error has been followed by intuition, and rapid progress is often made if the synergy between theorists, who can generate and suggest lists of compounds for possible synthesis, and experimentalists is exploited. High-throughput calculations based on T = 0 K first principles codes have provided the means to generate large databases of varied crystal structures and chemistries with calculated properties.2,3,4 The fidelity of these calculations, which depends on the exchange-correlation functionals used for given properties, including stability, will dictate how rapidly we make progress along the curve of materials complexity versus time. The thesis behind high-throughput computations is one of fully populating the phase space of materials possibilities, given the approximations inherent in these calculations, and then down-selecting worthy candidates for further study.5 If the red curve of Fig. 1 is our desired goal (Target), the jury is very much out on whether this is an optimal strategy for the accelerated discovery process. The perspective we take in this review is that tools from statistical experimental design,6 pattern recognition,7 operations research8 and the fields of reinforcement and active learning9 in computer science offer enormous opportunities and suggest a complementary paradigm for the discovery process. Many of these tools are unfamiliar to materials scientists but have been utilized in other fields, such as drug discovery,10 in biology for learning protein-protein interactions11 and predicting macromolecular structure,12 in cancer genomics,13 as well as in industry in the context of engineering design.14 For instance, in drug discovery it is not tractable to explore all possible compounds, and active learning methods are used to iteratively guide experiments such that only a subset of promising experiments is carried out. Active learning methods offer, in principle, a more systematic means to approach the red line in Fig. 1 and have the potential to change the way materials science will be carried out in the future; not by changing the fundamental iterative nature of the materials design process, with its reliance on distilling physical principles and mechanisms, but by improving the choices that we make in deciding which material designs to test and by improving our ability to learn from previous data, experimental, computational or both. Experimental observations or calculations can be difficult and time consuming; therefore, maximizing the value of each experimental observation by designing experiments to be optimal by some measure is critical.
We would like to be able to make decisions about what to measure, which variables to interrogate and what experimental conditions to use.

Fig. 1

The trajectory for the discovery of new materials with increasing complexity as a function of time as we accelerate the process from trial and error to using high throughput calculations and statistical design methods to improve our ability to learn from existing data and make decisions for the next materials to test (TARGET)

The materials challenge in its full generality encompasses a very high-dimensional discovery or search space, with millions of possible compounds of which only a very small fraction have been experimentally explored. The space spans aspects of chemistry, crystal structure, processing conditions, and microstructure; the compounds can be multicomponent, for example, solid solutions; and the properties can depend on materials descriptors, or features, at several length scales. Most efforts using first principles codes have largely focused on establishing a library of crystal structures and chemistries relevant to the problem, defining the training space in terms of samples and features, which can be elemental properties (e.g., electronegativity, Mendeleev number, etc.), bond angles, bond lengths, energetics from first principles calculations, and aspects of thermodynamics from experiments or codes such as Calphad, in order to down-select promising candidates for experiments or further studies.15 A number of studies have also used inference models, such as off-the-shelf regression and classification learning tools, which can include deep neural networks for large datasets of microstructures and high-throughput computational databases, to make predictions. The approach has been employed to suggest new Heusler alloys,16 polymers,17 and thermoelectrics,18 to name a few. In the case of the Heusler alloys, the predicted compounds have also been synthesized. In spite of a significant amount of work using this approach, few examples exist where the final properties exceed those of the best compounds in the training data sets. There are now a number of articles and reviews19,20,21,22,23,24,25,26 emphasizing the merits of using machine learning and statistical inference to make predictions in a large combinatorial space. However, few focus on aspects of making the optimal next decisions for synthesis and characterization by experiments or calculations.27,28,29,30,31,32,33,34,35,36 The predictions from machine learning are not necessarily optimal. Aspects related to multiscale modeling and constitutive response at the engineering design scale are discussed in the book by McDowell et al.37

In contrast, the approaches we will discuss are based on an active learning or adaptive design paradigm, whereby the predictions from a surrogate model, which can be an inference model or a physics-based or reduced order model (ROM), serve as the input to define a utility or acquisition function, the optimum of which dictates the next experiment or calculation to be performed.38,39 This is the optimal experimental design component, the key decision-making aspect of the active learning loop of Fig. 2. The results of the experiment or computation then augment the training data, and the loop continues until the material requirements for the desired targets are met. The approach is distinctive in that most of the work done in the field essentially involves only one or two of these steps: taking data, building a model, making predictions, and validating them with calculations or experiments. We are not aware of studies in which new materials have been found even via the inner feedback loop (green) of Fig. 2, let alone by incorporating uncertainties and performing experimental design. However, the Bayesian and decision-theoretic approach discussed in this review naturally lends itself to adaptive sampling and active learning.40 The input to the decision making can come from predictions from any inference, surrogate or machine learned model. One first defines the utility of an experiment or calculation and then, taking into account uncertainties in both the parameter values (if any) and the observations or objectives, chooses experiments or calculations by maximizing an expected utility. The utilities are defined according to information theoretic considerations given the desired goals. The approach can also be used for model selection, that is, the design of experiments using maximum information criteria to distinguish between models. For any experimental design procedure, we should have a notion of the value of the information (or cost of uncertainty) that is gained (or reduced) by observing a specific data point. The possible alternatives to observe (experiments) can then be ranked by the expected value of information they provide, so that we can prioritize experiments accordingly.

Fig. 2

The adaptive design paradigm to iteratively learn a surrogate model and use uncertainties to trade-off exploitation and exploration of the search space of unexplored materials to select the next best experiment or calculation. Red arrow represents efforts such as AFLOWLIB,2 Materials Project3 and OQMD,4 etc

To set the stage, we first review briefly the classic design of experiments approach in the "Design of experiments" section, followed by the Bayesian approach to inference and design41 in the "Adaptive sampling and Bayesian optimization" section, with its focus on incorporating prior information and data to predict the property or objective. However, other approaches to inference based on machine learning methods can also be used for this purpose. The concept of a utility function was previously introduced in decision theory42 in the context of the value of information (and hence uncertainties), and this provides us with the means to make decisions about where to optimally sample next. We review a number of these functions, such as the widely studied expected improvement,38 which provide criteria for making choices. This two-stage approach, also known as Bayesian Global Optimization,40,43,44 provides a natural active learning strategy to iteratively improve the model. In the "Applications to materials science" section, we discuss applications of this framework to discover new compositions of materials with targeted properties, as well as to increase the computational efficiency of simulations towards the targeted design of optoelectronic structures, and their use in high throughput density functional theory calculations. These applications reflect areas of our own expertise; we follow these examples with a review of other applications and related studies. We conclude in the "Challenges and future development" section by suggesting directions for future research and challenges to be overcome, with their impact on materials science.

Design of experiments

Design of experiments (DoE)45 encompasses many statistical approaches that seek to study the behavior of a system as its controlling inputs and the resulting outputs are varied. The field has a rich history dating from the work of Fisher,46 Box and Wilson,47 and more recent developments due to Taguchi.48 In classical design, factors refer to the input variables, which can adopt many discrete levels, and their combinations over the allowed ranges of variables define the search space over which a decision needs to be made regarding the best choice of variables to use for the next experiment. Factorial and composite designs45 are examples of the usual experimental design approach, often used to find a relationship between the input and output variables. The basic idea is to distill the interrelationships between the various factors and their bearing on the response. Prioritizing these contributions and encoding them within an inference model then allows us to study how the response can be optimized. The models are usually of the form \(y_i = f_\theta(x_i) + e_i\), where the uncertainties are assumed to be normally distributed, N(0, σ). Assuming a linear model, the variances in the parameters θ and the objective \(y_i\) are given in terms of the moment matrix \(M = (X^TX)^{-1}\), where X denotes the design matrix with rows \((1, f_1(x_i), \ldots, f_p(x_i))\), p being the number of parameters/coefficients, such that \({\mathrm{var}}(\theta) = \sigma^2M\) and \({\mathrm{var}}(y_i) = \sigma^2x_i^TMx_i\). The optimal design is then chosen based on the choice of optimality condition, such as A-optimality, which chooses the design points by minimizing \(Tr(M)\), or the common G-optimality, which minimizes the maximum variance of the predicted \(f(x)\) over the design space.
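
To make these criteria concrete, the short Python sketch below evaluates the A-criterion \(Tr(M)\) and the maximum scaled prediction variance \(x^TMx\) (the quantity targeted by G-optimality) for two hypothetical candidate designs of a quadratic model; the design points, model degree and evaluation grid are illustrative assumptions rather than values from any particular study.

```python
import numpy as np

def design_matrix(x, degree=2):
    """Rows (1, x, x^2, ...) of the design matrix for a simple polynomial model y = f_theta(x) + e."""
    return np.vander(x, N=degree + 1, increasing=True)

def a_criterion(X):
    """A-optimality criterion: trace of the moment matrix M = (X^T X)^{-1}."""
    M = np.linalg.inv(X.T @ X)
    return np.trace(M)

def max_prediction_variance(X, x_grid, degree=2):
    """G-type criterion: the largest scaled prediction variance x^T M x over a grid of inputs."""
    M = np.linalg.inv(X.T @ X)
    G = design_matrix(x_grid, degree)
    return np.max(np.einsum("ij,jk,ik->i", G, M, G))

# Compare two hypothetical 5-point designs on [0, 1] for a quadratic model.
rng = np.random.default_rng(0)
candidates = {
    "equispaced": np.linspace(0.0, 1.0, 5),
    "random": rng.uniform(0.0, 1.0, 5),
}
x_grid = np.linspace(0.0, 1.0, 101)
for name, x in candidates.items():
    X = design_matrix(x)
    print(name, "A-value:", a_criterion(X),
          "max prediction variance:", max_prediction_variance(X, x_grid))
```

In practice, DoE software enumerates or exchanges candidate points to optimize such criteria; the sketch only shows how the criteria themselves are computed.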

An important consideration in DoE is to minimize the number of experiments that need to be carried out in the overall space of variables and parameters (the design space) in order to reduce costs and time. Ordinary DoE puts the same amount of resources into high and poor performance regions of the parameter space, so the costs can go up exponentially in the search for the optimal design. As a result, in the early design stages, where uncertainties are higher, we need appropriate strategies for the analysis of relatively expensive experiments or high-fidelity codes, because seeking an accurate numerical value of the optimum at the conceptual design stage is not efficient when there can still be unknowns in the design that need to be addressed later in the design cycle. Also, traditional sequential design of experiments, which involves one design step followed by another, tends to be "exploitative" towards the point of interest based on previous experimental data. Hence, in the case of fairly expensive functions, adaptive sampling (also called on-line or infill sampling) is preferred over traditional DoE.14

Adaptive sampling and Bayesian optimization

Adaptive sampling features two possible aspects that can compete: exploration of the overall search space, so that any model to be developed samples it adequately and globally, and exploitation of local regions of the search space where the model is likely to yield good predictions near the optimal solution. It serves the objective of rapidly assessing potential regions of interest at the initial stages of a given design. It is thus usual for engineers to employ cheap surrogate models when the engineering model is computationally intense. These surrogate models are based on a relatively small number of training points (Fig. 3), and the use of Bayesian Global Optimization (BGO) has had impact in industry for addressing large scale computational tasks since the 1990s, following the work of Jones, Schonlau and Welch, who introduced the efficient global optimization (EGO) method.38 This relied substantially on early work by Kushner49 on optimizing unknown functions, as well as work by Mockus and Zilinskas.50 The notion of maximizing the probability of improvement over the "best-so-far" was introduced in these studies. BGO has been used extensively in design applications where optimization of unknown functions is required, especially when the computations are expensive.14

Fig. 3

Bayesian Global Optimization builds a surrogate model from data based on input x and selects new design points with estimates for the output \(\hat y\) for the next calculation, which in turn will yield the actual value y

Utility functions

The development and applications of BGO in statistics and engineering have progressed steadily, and over the last five years there has been much interest from the machine learning community, where BGO has been employed for tuning the hyperparameters of machine learning models.51 Only recently have these methods been applied to problems related to materials discovery and computation, the application area we will focus on. However, complementary ideas date back to developments in decision theory with the work of Howard,52 who presented a theory of the value of information for the joint probabilistic and economic factors affecting decisions, and showed how numerical values could be attributed to the reduction and suppression of any uncertainty. Lindley,53 using a decision theoretic approach, brought a number of ideas together into a fully Bayesian scheme involving the definition of a utility function for experimental design, so that an experiment is selected to maximize the expected utility. The expected utility is defined as \(U(e) = {\int}_Y {\int}_\Theta u(e,y,\theta )p(\theta ,y|e){\mathrm{d}}\theta {\mathrm{d}}y\), where u(e, y, θ) is a utility function chosen to reflect the purpose, cost or usefulness of the experiment, e, selected from a large number of possible experiments, with a particular outcome, y, belonging to the set Y of all possible outcomes.41,54 As the outcome is not known before the experiment is carried out, because we do not know the parameters, θ, the expectation is taken with respect to the joint posterior distribution of θ and y. The choices of utility function include the gain in Shannon or relative entropy (Kullback-Leibler divergence), quadratic loss and improvement over the best, which we will discuss in greater detail. We emphasize that the usually studied uncertainty quantification approaches based on variance and entropy take into account the statistical characteristics of uncertainty without regard to the objective, and the information gain associated with an experiment is measured in terms of the reduction in entropy or variance.6
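
As a concrete, if simplified, illustration of Lindley's prescription, the following Python sketch estimates the expected utility U(e) of a candidate experiment e by Monte Carlo, averaging a user-supplied utility u(e, y, θ) over samples of the unknown parameters θ and simulated outcomes y. The toy response model, the samples of θ and the improvement-type utility are our own assumptions, not taken from the cited work.

```python
import numpy as np

def expected_utility(u, e, theta_samples, outcome_sampler):
    """Monte Carlo estimate of U(e) = E_{theta, y | e}[ u(e, y, theta) ] over the current belief about theta."""
    return np.mean([u(e, y, th)
                    for th in theta_samples
                    for y in [outcome_sampler(e, th)]])

# Toy setup: the experiment e is a measurement location, the unknown theta shifts the response,
# the outcome is noisy, and the utility is the improvement over the best value observed so far.
rng = np.random.default_rng(0)
theta_samples = rng.normal(0.5, 0.1, size=2000)            # samples from the current distribution of theta
outcome_sampler = lambda e, th: -(e - th) ** 2 + rng.normal(0.0, 0.02)
best_so_far = -0.05
u_improvement = lambda e, y, th: max(y - best_so_far, 0.0)

candidates = np.linspace(0.0, 1.0, 11)
utilities = [expected_utility(u_improvement, e, theta_samples, outcome_sampler) for e in candidates]
e_next = candidates[int(np.argmax(utilities))]              # experiment maximizing the expected utility
```

Different choices of u (Shannon information gain, negative quadratic loss, improvement over the best) recover the different design criteria mentioned above.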

Two steps of Bayesian global optimization (BGO): model prediction and decision-making

The problem we want to solve is stated as \({\mathrm{max}}_xf(x)\), that is, to find the design x that maximizes y = f(x). BGO offers a powerful and efficient means of searching for the extrema of objective functions that are costly to evaluate and for which derivatives and convexity properties are unknown. The efficiency results from being able to incorporate prior beliefs, in the form of the smoothness and locality of f(x), to guide the adaptive sampling and active learning. If \(D = \{ x_i,f(x_i)\}\), i = 1, …, n, is the set of observations, then by Bayes' rule the posterior distribution is given by P(f|D) ∝ P(D|f)P(f), where the posterior captures the updated beliefs about the unknown f(x), combining the prior information about f with how likely the data seen so far are under the model, P(D|f). This step can be interpreted as estimating f(x) with a response surface or surrogate model, such as a Gaussian process (GP) or any machine learned model. We could choose a number of possibilities for the prior, for example, a Wiener process or a GP. A common example of the former is a random walk with random step sizes, and this stochastic process was used in the early work of Kushner.49 However, it has been shown that a GP, an extension of the multivariate Gaussian distribution to a stochastic process with an infinite number of variables, is well suited if we make a number of simplifying assumptions, and it has become the mainstay for BGO.55,56 Instead of a Gaussian distribution over a random variable specified by a mean and variance, a GP is a distribution over functions specified by its mean function, m, and covariance function, k, i.e., \(f(x)\sim {\cal G}{\cal P}\left( {m(x),k(x,x\prime )} \right)\). A common choice for k is the squared exponential \(k(x,x\prime ) = {\mathrm{exp}}\left( { - \frac{1}{2}||x - x\prime ||^2} \right)\), which approaches 1 as the data points x and x′ get close together and 0 when they are far apart. However, the GP allows for several kernels. In addition to the Gaussian kernel \({\mathrm{exp}}\left( { - \frac{1}{2}(h/\theta )^2} \right)\), with h = ||x − x′|| and θ a length-scale parameter, the exponential kernel exp(−h/θ), the Matern kernels \(\left( {1 + \sqrt 3 \,h/\theta } \right){\mathrm{exp}}\left( { - \sqrt 3 \,h/\theta } \right)\) and \(\left( {1 + \sqrt 5 \,h/\theta + \frac{5}{3}(h/\theta )^2} \right){\mathrm{exp}}\left( { - \sqrt 5 \,h/\theta } \right)\), and the power exponential exp(−(h/θ)^p) are also often used. In principle, the choice of these priors can affect the surrogate model predictions; in practice, this depends on the data sets. The Matern kernels are more flexible, as they add free parameters to the model compared with the standard Gaussian kernel; on the other hand, in a limiting case of its parameters the Matern kernel reduces to the Gaussian kernel.
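
For readers unfamiliar with GP regression, the following self-contained Python sketch implements the squared-exponential and Matern 3/2 kernels quoted above and the standard GP posterior mean and uncertainty. The one-dimensional toy data and the fixed length scale θ = 1 are illustrative assumptions; in practice θ would be learned, as discussed below.

```python
import numpy as np

def sq_exp(h, theta=1.0):
    """Squared-exponential (Gaussian) kernel: approaches 1 as h -> 0 and 0 for large h."""
    return np.exp(-0.5 * (h / theta) ** 2)

def matern32(h, theta=1.0):
    """Matern 3/2 kernel quoted in the text."""
    return (1.0 + np.sqrt(3.0) * h / theta) * np.exp(-np.sqrt(3.0) * h / theta)

def gp_posterior(X_train, y_train, X_test, kernel=sq_exp, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at the test points (1-D inputs)."""
    H = np.abs(X_train[:, None] - X_train[None, :])          # pairwise distances between training points
    K = kernel(H) + noise * np.eye(len(X_train))
    K_s = kernel(np.abs(X_test[:, None] - X_train[None, :]))
    K_ss = kernel(np.abs(X_test[:, None] - X_test[None, :]))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Toy 1-D example with a handful of (x, y) observations.
X_train = np.array([0.1, 0.4, 0.6, 0.9])
y_train = np.sin(2 * np.pi * X_train)
X_test = np.linspace(0.0, 1.0, 50)
mu, sigma = gp_posterior(X_train, y_train, X_test, kernel=matern32)
```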

Figure 4 summarizes the two components of BGO:

  1. (a)

    a model that provides the value or posterior distribution of the objective f(x) at unevaluated points based on the available data \((x_i, y_i)\), i = 1, …, n

  2. (b)

    choice of a utility function, \(u(x|D_{1:t - 1})\), such that its maximum over the sampling domain, \(x_t = {\mathrm{argmax}}_xu(x|D_{1:t - 1})\), provides a potentially high value of the objective at which to evaluate the function, guiding the next experiment or calculation. This is the "good" decision-making step to direct future sampling (a minimal code sketch of this two-step loop follows the list).
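
A minimal sketch of this two-step loop, written here with scikit-learn's GP regressor and, for simplicity, the pure-exploration utility illustrated in Fig. 4. The toy objective, domain and number of iterations are invented for illustration; in a real campaign the call to the objective would be an experiment or a physics code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expensive_evaluation(x):
    """Stand-in for the experiment or physics code; in practice this is the costly step."""
    return -(x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

# (a) fit a surrogate model over the sampling domain; (b) maximize a utility to pick the next point.
domain = np.linspace(0.0, 1.0, 201).reshape(-1, 1)
rng = np.random.default_rng(1)
idx = rng.choice(len(domain), size=4, replace=False)
X, y = domain[idx], expensive_evaluation(domain[idx]).ravel()

for t in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(domain, return_std=True)
    utility = sigma                      # pure exploration, as in Fig. 4; swap in E[I] for the trade-off
    x_next = domain[np.argmax(utility)].reshape(1, -1)
    y_next = expensive_evaluation(x_next).ravel()
    X, y = np.vstack([X, x_next]), np.concatenate([y, y_next])
```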

Fig. 4

Model prediction and decision making by maximizing the utility function, which in this case chooses the sample with the largest uncertainty, i.e., "exploration". a Actual function (cyan) represents the ground truth function fitted to the data (Response = −826.815x^6 + 2443.982x^5 − 2573.867x^4 + 1093.097x^3 − 118.369x^2 − 14.842x + 5.087, where x is the Input). We randomly pick five data points (training data) to train a surrogate model with GP Regression, the predictions of which are shown by the red points (new data) and fitted curve. The GP predicts the Response (y-axis) for the test points along with the associated uncertainties (error bars). The exploration strategy recommends the data point with the largest uncertainty for the next experiment or computation. b The updated surrogate model after the first iteration with now 6 training data points. Notice the reduction in the error bars in b compared to a. The data point with the largest uncertainty is recommended for the next (second) iteration for validation and feedback

If f(x) is the outcome of an experiment, the observations are influenced by noise and hence the objective needs to be written as \(y(x_i) = f(x_i) + \varepsilon_i\), where the noise ε is distributed normally. The posterior distribution for f(x) will now be a sum of two normal distributions and will itself be normally distributed, so that the effect of noise is that the posterior distribution no longer interpolates the data points; rather, the credible or confidence interval has positive width at all data points. In addition, machine learning algorithms also contain "hyperparameters", such as the penalty term, the insensitive-loss term and the coefficient of the kernel function, that control model complexity and help balance the bias-variance tradeoff. In general, the hyperparameters are adjusted to optimize the leave-one-out error, which in turn is computed by constructing an independent model for each training set formed by leaving out a single data point. The trained model then predicts the response of the left-out data point. The leave-one-out error is defined as the average of the out-of-training-sample residuals over the models. The hyperparameters that minimize the leave-one-out error are identified and used to train a single final model. On the other hand, in a Bayesian treatment (such as a GP), one can choose the hyperparameters directly from the training data by minimizing the negative log marginal likelihood with respect to the hyperparameters and the noise (useful where the cost of Monte Carlo methods is prohibitively high). Typically, starting values are drawn randomly from a prior distribution (such as a uniform prior, or chosen based on domain knowledge) and the optimization is allowed to converge. The final estimate corresponds to the instance with the smallest negative log marginal likelihood.
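
As a transparent sketch of the marginal-likelihood route (rather than the leave-one-out route), the code below minimizes the negative log marginal likelihood of a zero-mean GP with a squared-exponential kernel with respect to the length scale and noise, restarting from several randomly drawn starting values; the data, kernel and restart ranges are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(params, X, y):
    """NLML of a zero-mean GP with a squared-exponential kernel; params are log(theta), log(noise)."""
    theta, noise = np.exp(params)
    H = np.abs(X[:, None] - X[None, :])
    K = np.exp(-0.5 * (H / theta) ** 2) + (noise + 1e-10) * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

# Hypothetical 1-D training data; several random restarts mimic drawing starting values from a prior.
X = np.array([0.05, 0.2, 0.5, 0.7, 0.95])
y = np.sin(2 * np.pi * X) + 0.05 * np.random.default_rng(2).normal(size=len(X))
starts = np.log(np.random.default_rng(3).uniform([0.05, 1e-4], [2.0, 0.1], size=(5, 2)))
best = min((minimize(neg_log_marginal_likelihood, s, args=(X, y)) for s in starts),
           key=lambda r: r.fun)
theta_opt, noise_opt = np.exp(best.x)      # hyperparameters with the smallest NLML
```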

Examples of utility functions

Expected improvement, E[I]

The utility function we will focus on is an improvement based function, I, initially suggested by Kushner49 in order to maximize the probability of improvement over the best observed so far, f(x+). That is, I = max(f(x)−f(x+),0) would be the improvement associated with each choice x. Before evaluating f(x), we do not know what the improvement will be but as we have the probability distribution on f(x), we can evaluate the amount of improvement using E[I(x)] = E[f(x)−f(x+)] for positive I, where E is the expectation over the posterior distribution.38 Since f(x) is normally distributed with mean, μ, and standard deviation, σ, it is straightforward to show that

$$E[I(x)] = {\int}_{f(x^ + )}^\infty {(z - f(x^ + ))\phi (z)\,{\mathrm{d}}z} = \left( {\mu (x) - f(x^ + )} \right){\mathrm{\Phi }}\left( {\frac{{\mu (x) - f(x^ + )}}{{\sigma (x)}}} \right) + \sigma (x)\phi \left( {\frac{{\mu (x) - f(x^ + )}}{{\sigma (x)}}} \right)$$
(1)

where φ and Φ are the standard normal density and cumulative distribution functions, respectively. Thus, a new sample point, x*, is chosen amongst the candidate points based on the largest expected improvement, i.e., \(x^ \ast = {\mathrm{argmax}}_xE[I(x)]\) (a short computational sketch of E[I] is given after the list below). The limiting cases of uncertainty on E[I(x)] give us:

  • Small σ: E[I(x)] → μ(x) − f(x+), i.e., choose points whose predicted mean μ exceeds f(x+) to maximize E[I(x)], exploiting the model predictions.

  • Large σ: E[I(x)] → σ, i.e., choose the point with the largest uncertainty to maximize E[I(x)], exploring the sample space.
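
Equation (1) is straightforward to evaluate; the short sketch below computes E[I] for a set of candidate points given posterior means and standard deviations from any surrogate model. The numerical values are invented purely to illustrate the exploitation and exploration limits listed above.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form E[I(x)] for a Gaussian posterior; xi is an optional exploration margin."""
    sigma = np.maximum(sigma, 1e-12)            # guard against zero predicted uncertainty
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Hypothetical posterior means/uncertainties over candidate materials; the next sample is argmax E[I].
mu = np.array([1.10, 1.25, 0.90, 1.20])
sigma = np.array([0.05, 0.02, 0.40, 0.30])
f_best = 1.22                                    # best property value observed so far
ei = expected_improvement(mu, sigma, f_best)
next_index = int(np.argmax(ei))
```

In this example the candidate combining a mean close to f(x+) with a large uncertainty attains the largest E[I], while the candidate with the highest mean but a tiny uncertainty does not.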

Additionally, E[I(x)] is largest where both μ(x) and σ(x) are large. That is, the points likely to give the most gains are those with the best predictions but also where the uncertainty is greatest. These aspects emphasize the exploration-exploitation tradeoff, which is central to aspects of active57 and reinforcement learning.58 It characterizes many problems, including the multi-armed bandit problem, in which decisions need to be made repeatedly, the outcomes are uncertain, and we wish to sample optimally now so that we obtain better results in the future. The multi-armed bandit problem59 is an example illustrating the exploration-exploitation dilemma and refers to a slot machine with n arms or "bandits", each arm having its own probability distribution of success. Pulling any one of the arms gives a stochastic reward of either success or failure. The objective is to pull the arms one by one in sequence such that the total reward collected in the long run is maximized. The problem has applications in many real-world situations where one would like to select the "best" bandit out of a group of bandits, e.g., A/B testing, line-up optimization, and evaluating social media influence.

Knowledge gradient (KG)

In the presence of noise in the data, the utility function becomes a generalization of expected improvement, namely the knowledge gradient, KG(x).43,60 With noise, the f(x) values are not exactly known. Instead, we consider μ(x), so that the best choice at the next step, with the largest improvement, is given by \(\mu _{n + 1}^ \ast = {\mathrm{max}}_x\mu _{n + 1}(x)\), and \({\mathrm{KG}}(x) = E[\mu _{n + 1}^ \ast - \mu _n^ \ast ]\). If we normalize the change in means by the number of standard deviations, σ, then KG(x) is given by

$$KG(x) = \sigma \left[ {\phi \left( {\frac{{\mu (x) - f(x^ + )}}{{\sigma (x)}}} \right)} \right] + {\mathrm{\Phi }}\left( {\frac{{\mu (x) - f(x^ + )}}{{\sigma (x)}}} \right)$$
(2)

The next sampling point is then the one for which KG(x) is largest, that is, \({\mathrm{argmax}}_x{\mathrm{KG}}(x)\). The sampling region over which \(\mu _n^ \ast\) and \(\mu _{n + 1}^ \ast\) are determined is not restricted to the observed data, as in the case of E[I(x)], but can be extensive, covering many allowed possibilities. This aspect, and the noise in the data, are the key differences from expected improvement. Recently, expected improvement has been generalized to also include noise.61
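
Closed-form expressions for KG exist for Gaussian beliefs,43,60 but the definition above can also be evaluated by brute force. The sketch below "fantasizes" outcomes at each candidate from the current GP posterior, refits, and averages the gain in the maximum posterior mean. It is a deliberately naive and expensive illustration of the one-step look-ahead idea, not the algorithm used in the cited work, and it assumes a GP already fitted with scikit-learn (e.g., as in the earlier loop sketch).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def knowledge_gradient(gp, X, y, domain, n_fantasies=20, rng=None):
    """One-step look-ahead KG(x) = E[max mu_(n+1) - max mu_n], estimated by Monte Carlo (slow!)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu_n, sigma_n = gp.predict(domain, return_std=True)
    best_now = mu_n.max()
    kg = np.zeros(len(domain))
    for i, x in enumerate(domain):
        # Fantasize noisy outcomes at x from the current posterior, refit, and average the gain.
        fantasies = rng.normal(mu_n[i], sigma_n[i], size=n_fantasies)
        gains = []
        for y_f in fantasies:
            gp_f = GaussianProcessRegressor(kernel=gp.kernel_, optimizer=None, normalize_y=True)
            gp_f.fit(np.vstack([X, x.reshape(1, -1)]), np.append(y, y_f))
            gains.append(gp_f.predict(domain).max() - best_now)
        kg[i] = np.mean(gains)
    return kg
```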

Mean objective cost of uncertainty (MOCU)

An objective based uncertainty quantification scheme was recently introduced to study interactions among genes to develop drugs to mitigate aberrant phenotypes, such as cancers, in the design of gene regulatory networks.62,63,64 The basic idea is that the presence of large uncertainties can degrade experimental design and the aim of new measurements should be to reduce uncertainties in finding materials with desirable properties. That is, one needs an objective-based uncertainty quantification scheme as compared to reducing overall variances or entropy to monitor information gain in an experiment. A prior distribution that encodes knowledge about the unknown parameters or features, say θ, is assumed, which after knowing the outcome of an experiment, is updated to a posterior distribution. If further measurements are required, then this distribution is used as the new prior for the next experimental design loop. The MOCU serves as the utility function and measures the deterioration in the design due to the presence of model uncertainty, i.e., it measures the degradation in performance between an optimal design based on partial prior knowledge and data compared to a design with full knowledge of the system. One chooses the next experiment to shrink the uncertainty the most, i.e., the experiment that is expected to minimize the variance in the posterior distribution the most. That is, the alternative that is expected to decrease the MOCU the most is selected at each iteration, so that after observing that experiment’s output, we would be able to make a better decision towards the objective.

Clarifying further, suppose we are at iteration 0: we have a cost function f(x) with unknown parameters θ, i.e., we have observed initial data points and have updated the prior distribution over θ to a posterior, but have not yet selected the next experiment x. If at this step we want to stop doing any more measurements and select the x that, for example, maximizes f(x) based on our current state of belief, we would select the x that maximizes the expected value of f(x) over the unknown parameters, θ, assuming we have a prior distribution for θ that allows us to evaluate the expected value. Thus, we would select \(x_{{\mathrm{robust}}} = {\mathrm{argmax}}_xE_\theta [f_\theta (x)]\). This is the best we can do "on average", hence the term "robust", because we do not know the θ's and we are not guaranteed to choose the true optimum. In many ways it is this "robust" design we seek, as it is expected to perform optimally on average. Now, corresponding to each parameter value, θ, the "optimal" material, x+, can be found by performing \(x^ + = {\mathrm{argmax}}_xf_\theta (x)\). We will have an optimum for each parameter value θ, and this would be our best choice if there were no unknown parameters. As we do not know the true parameter values, θ, the amount we lose is the difference between the true maximum value f_θ(x+) and the value associated with the robust selection. This difference in the absence and presence of uncertainty is the objective cost of uncertainty (OCU), and its expectation over all θ is the MOCU, which is given by

$${\mathrm{MOCU}} = E_\theta [f_\theta (x^ + ) - f_\theta (x_{{\mathrm{robust}}})]$$
(3)

The next experiment selected is the one that reduces the MOCU the most. Hence, for each x, we consider all possible values of the experimental outcome, y, and calculate the MOCU weighted by the probability distribution, P(y|x), of y being the true experimental outcome of x. The selected experiment for measurement, x*, is then given by

$$x^ \ast = argmin_xE_{y|x}(E_{\theta |x}[f_{\theta |x}(x^ + ) - f_{\theta |x}(x_{{\mathrm{robust}}})]),$$
(4)

where \(E_{y|x}\) is the expectation over P(y|x). If we want to stop measurements at the next time step and select the candidate with the highest expected y = f(x), our final selection will be given by the robust alternative obtained by taking the expectation over θ.
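
Equation (3) reduces to simple averages once the objective can be evaluated for samples of θ drawn from the current prior (or posterior). The toy sketch below does exactly that for an invented parametric objective and prior; Eq. (4) then amounts to repeating this calculation under each hypothetical outcome y of a candidate experiment x and averaging the results with P(y|x).

```python
import numpy as np

def mocu(F):
    """MOCU from a matrix F[t, i] = f_theta_t(x_i) evaluated for prior samples theta_t (Eq. 3).

    The robust design maximizes the mean over theta; the optimal design for each theta maximizes
    each row; MOCU is the average shortfall of the robust choice relative to those optima.
    """
    i_robust = np.argmax(F.mean(axis=0))           # x_robust = argmax_x E_theta[f_theta(x)]
    return np.mean(F.max(axis=1) - F[:, i_robust])

# Hypothetical toy model: f_theta(x) = -(x - theta)^2 with an uncertain optimum location theta.
rng = np.random.default_rng(4)
x_grid = np.linspace(0.0, 1.0, 51)
theta_samples = rng.beta(2.0, 5.0, size=500)       # stands in for the prior over the unknown parameters
F = -(x_grid[None, :] - theta_samples[:, None]) ** 2
print("MOCU under the prior:", mocu(F))
```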

This method has recently been applied within Landau or phase-field simulations to determine dopants that minimize the energy dissipation, which affects the fatigue of shape memory alloys such as FePd.28 These simulations assume a random initial dopant distribution in the FePd matrix; associated with the dopants are a concentration, potency and range, all of which can vary. Figure 5 summarizes the active learning strategy that the authors have used leveraging MOCU. In the simulations, a stress is incrementally applied to obtain a stress versus strain curve from which the energy dissipated is estimated. The data from the simulations serve as the training data to essentially fit a nonlinear model to the dissipation in terms of the concentration, potency and range of the dopants. This model is then used for the experimental design step in Fig. 5 to select the next best dopant and its features on the basis of MOCU, that is, the dopant that most reduces the deterioration in the design caused by model uncertainty. After the measurement is performed (in the real situation) or, as here, after the outcome is known, the prior distribution over the unknown dopant parameters is updated to give a posterior distribution, which serves as the new prior for the next step in the active learning loop.

Fig. 5

The objective based uncertainty approach, MOCU, to experimental design (right) with its emphasis on construction of a prior distribution to commence the search. The plot on the left shows the performance of the method for simulations of shape memory alloys where the objective, the average energy dissipation, is plotted versus the number of measurements.28 MOCU finds the material features for minimizing the objective in far fewer experiments than pure exploitation or random selection. Reproduced with permission from ref. 28 Copyright Elsevier 2017

The prior distribution is chosen as a Dirichlet distribution by generating different samples using different dopant concentration, potency and range parameters. The plot in Fig. 5 shows how the average energy dissipation behaves after different numbers of measurements. The five measurements for the proposed approach are compared to ten measurements for the pure exploitation and random selection policies. A summary of the utility functions discussed here is given in Table 1.

Table 1 Utility functions

Applications to materials science

We discuss three independent examples to demonstrate the breadth of application of active learning strategies in materials science. In all three examples discussed below, several iterations of the design loop are performed: five iterations for each of the four AL strategies in the experimental materials discovery of ferroelectrics in Example 1, up to 1000 iterations for the optoelectronic simulations of Example 2, and twenty-five iterations for one AL strategy guiding the first principles code used in Example 3. Unlike the more common statistical studies (discussed later) on data with assumed distributions, experimental data as well as computational data from physics codes do not offer the luxury of randomly selecting different training sets, because each evaluation of a sample data point is expensive.

Materials discovery

The discovery of materials with targeted properties typically involves investigating a vast search space efficiently. Systematic design strategies aid in iteratively guiding or recommending the optimal compounds for synthesis. The strategies discussed above have been applied to alloys27,29 and ceramic materials.30,35,36

Example 1: piezoelectrics with large electrostrains

We review a recent example35 where the goal was to find Barium Titanate (BaTiO3) based piezoelectrics with large electrostrains. The idea is to use chemical substitution (i.e., doping), such that Ca2+ and Sr2+ cations replace Ba2+ on the A-site of the ABO3 perovskite structure, and Zr4+ and Sn4+ substitute for Ti4+ on the B-site. Thus, the family of solid solutions being considered is given by (Ba1−x−yCaxSry)(Ti1−u−vZruSnv)O3, where the compositions x, y, u, and v are constrained by 1−x−y > 0.6, x < 0.4, y < 0.3, 1−u−v > 0.6, u < 0.3 and v < 0.3 so that relaxor phases are avoided. As the composition of each ion can be controlled to 0.01 in the synthesis process, there are potentially about 605,000 possible compositions, of which the training data consists of only 61 that have been synthesized (~0.01%). All 61 training data points, corresponding to different chemistries and compositions, were synthesized and characterized using identical protocols by the same researchers to remove lab-to-lab and processing variability. In addition, all data points were chosen essentially at random, i.e., no prior knowledge was used in performing the experiments or in correlating the results from differing compositions to build the training data set.

The features, which should be easily available for all unexplored materials, are expected to be physically meaningful and should have a bearing on the property. The seven features (Fig. 6) we employed included microscopic attributes such as electronegativity, polarizability, ionic displacement, ionic radius and volume, as they impact polarization and strain, as well as features which capture the direction (increase, decrease or no change) of the dependence of the cubic to tetragonal ferroelectric transition temperature and tetragonal to orthorhombic ferroelectric transition temperature, respectively, on the doping elements. The data has been uploaded as part of the Supplemental Information.

Fig. 6

Active learning loop comparing experimentally the performance of the expected improvement, E[I] (trade-off), based on balancing the trade-off between exploitation-exploration with design selection strategies based on a maximum uncertainty from the surrogate model prediction b the best (maximum) prediction from the model, and c purely random choice, for finding a BaTiO3 based piezoelectric with the largest electrostrain at 20 kV/cm. Reproduced with permission from ref. 35 Copyright John Wiley and Sons, Inc. 2018

The result of the dimensional reduction using multidimensional scaling is given in Supplementary Fig. 1, which shows little evidence of clustering amongst the compounds. Moreover, the analysis of how the ML was done, including feature selection and the use of bootstrap methods for ensemble sampling to estimate uncertainties of the surrogate model, is given in ref. 35 and its Supplement. A number of models, including polynomial fits, Lasso, support vector regression (SVR) with linear (SVR.LIN) and radial basis function (SVR.RBF) kernels, and Gradient Tree Boosting, were used. Given such a high quality experimental data set from an unknown distribution, the statistical method of non-parametric bootstrap sampling with replacement65 was used to generate 1000 samples of training data. The basis of bootstrap sampling is that the empirical distribution so obtained mimics the statistics associated with the original population. An ensemble of 1000 models then gave predictions with means and variances. Cross-validation (CV), typically 10-fold, with a grid search to optimize the hyperparameters, was used for each model. To be specific, for the case of SVR on the given training data, we first constructed a model with the best combination of hyperparameters leading to the smallest CV error. We then evaluated the model performance by monitoring two errors, the cross-validation error and the mean squared error. For the mean squared error, we used bootstrap sampling on the whole training data with replacement. For each selected bootstrap sample, one model was built with the hyperparameters tuned by 10-fold cross-validation, and then applied on the whole training data. In total we had 1000 bootstrap samples, so that 1000 models were trained giving 1000 predictions, from which the mean squared error was calculated.
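
The essence of this bootstrap-ensemble construction is sketched below in Python with scikit-learn; for brevity the hyperparameters are tuned once on the full training set by grid search with k-fold CV (whereas ref. 35 also retuned them per bootstrap sample when estimating errors), and the grid values are arbitrary placeholders.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

def bootstrap_svr_ensemble(X, y, X_unexplored, n_boot=1000, cv=10):
    """Bootstrap ensemble of SVR.RBF models giving a mean prediction and an uncertainty (std)."""
    # Tune hyperparameters once on the full training set by grid search with k-fold CV.
    grid = GridSearchCV(SVR(kernel="rbf"),
                        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                        cv=cv)
    grid.fit(X, y)
    preds = []
    for b in range(n_boot):
        Xb, yb = resample(X, y, replace=True, random_state=b)   # bootstrap sample with replacement
        model = SVR(kernel="rbf", **grid.best_params_).fit(Xb, yb)
        preds.append(model.predict(X_unexplored))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

The returned mean and standard deviation for each unexplored composition are exactly the μ(x) and σ(x) fed into the utility functions of the previous section.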

A number of steps were carried out to build an adequate model. Before the first iteration, a model was built with 55 data points and 6 randomly chosen test points, using bootstrap sampling and 10-fold cross validation on the 55 points. These results, together with the test results on the 6 points, are shown in Supplemental Fig. 2. The mean squared error for the training data was 0.0047 and that for the test data was 0.0044, suggesting that the model is not overfitting. Also considered were the safeguards usually undertaken to avoid overfitting, including regularization as well as the use of ensemble models. Regarding the former, the hyperparameters were optimized using cross validation, and for the latter, bagging as well as boosting were used (see Supplemental Fig. 3 for the model fit on the 55 points using SVR.RBF with bootstrap and Gradient tree boosting); they give similar error estimates.

Figure 6 captures the active learning loop in which the performance of the expected improvement, E[I], is compared experimentally with other design selection strategies. These include selection of a composition for synthesis, from all allowed possible samples, based on (a) maximum uncertainty in the electrostrain from the surrogate model prediction, (b) the best (maximum) prediction from the model, and (c) purely random choice. Thus, (a) corresponds to pure exploration in Fig. 6, (b) to pure exploitation, and the trade-off between the two is characterized by E[I]. Random bootstrap sampling (ref. 65) with replacement on the training data set was used to generate 1000 samples, from which 1000 models were learned to predict the mean and standard deviation. Figure 7a compares the relative performance of the design after synthesis and characterization of the new compounds predicted by each of the selection criteria after five iterations of the loop. After each iteration, the new results augment the training data and the iterative loop repeats. The compound with the largest electrostrain, (Ba0.84Ca0.16)(Ti0.90Zr0.07Sn0.03)O3, was obtained in the third iteration. The trade-off, given by E[I], was the best performing strategy in every iteration, with random selection being the worst overall performer. Interestingly, pure exploration and exploitation also had their peaks in the third iteration. The predictions in Fig. 7b of the electrostrain from the inference and the different selection criteria show a similar tendency to the experimental results, although the compound with the largest strain occurs an iteration later.

Fig. 7

a Comparison of the relative performance of the design on the four selection strategies after synthesis and characterization of new compounds after five iterations of the loop. The third iteration leads to the compound (Ba0.84Ca0.16)(Ti0.90Zr0.07Sn0.03)O3 with the largest electrostrain with E[I] the best performer in every iteration. b The predictions from the design. c The quality of the model based on the training data and after each iteration. Reproduced with permission from ref. 35 Copyright John Wiley and Sons, Inc. 2018

Figure 7c shows the measured strain (x-axis) vs the predicted strain (y-axis) obtained by SVR.RBF. Groups 1 to 5 indicate the 20 new compositions that were added as a result of AL (four per group or iteration). The blue data points correspond to the original trained model before any new experiments. The model at the end of the fourth group of experiments is shown in Supplemental Information 4; the experimental measurements, overall, are quite consistent with the predictions. Pure exploitation is superior to pure exploration and is preferred for this problem: it tends to sample the property-feature space in the vicinity of local minima, in contrast to exploration, which samples more globally in regions where there is limited data and uncertainties are high. As 9 of the 20 compounds had a strain value larger than the best in the training set, the Fisher p value is <0.001, indicating that the result is significant and that the probability of it arising by chance is low. Also, the results of the experiments (as depicted in Fig. 7a) show that the AL strategies perform better over five experimental iterations and 20 newly synthesized and characterized samples than random sampling. Additionally, Supplementary Figure 5 shows, for the original training data set, that a bootstrap statistical analysis with replacement yields a number of new measurements required to find the sample with the largest electrostrain, as a function of training size, that is far smaller for the AL strategies than for random sampling. This result is consistent with the results of Fig. 7a from actual experiments.

We observe from Fig. 7c (and also from Fig. 12b of Example 3) that although the initial surrogate model is reasonable, subsequent iterations do not particularly improve the model; if anything, the model can be argued to get worse as the design guides the selection of the next compound for synthesis. More work is needed to capture the underlying reasons and to study the interplay between the quality of the surrogate model and the observed behavior.

Figure 8a, b show the performance of the design as a function of the number of iterations of the loop. In Fig. 8a, we plot the calculated expected improvement (y-axis) for each composition chosen by each AL strategy as a function of the five iterations (x-axis). The expected improvement does not monotonically decrease; it fluctuates, but the trend is a decreasing one, so that eventually the expected improvement from further experiments will diminish, and in the limit we expect it to tend to zero (a more rigorous study with computational data is discussed in Example 3). In Fig. 8b, we also plot the corresponding standard deviation (our measure of prediction uncertainty) as a function of the five iterations of experiments. The standard deviation of the selected samples, together with their means, goes into determining E[I]. For E[I], the standard deviation decreases with increasing iterations as exploitation of the model is more favored. For the case of the randomly selected compounds, their E[I] values are essentially flat. This is because most of the compounds among the ~605,000 possibilities have an E[I] that is very small or close to zero (that is, most compounds are not likely to lead to any improvement). As expected, the standard deviations are relatively large for the compounds selected on the basis of the pure exploration strategy.

Fig. 8

The behavior of E[I], the expected improvement, and the standard deviation as a function of the number of experimental loops carried out. a The expected improvement does not monotonically decrease, but the trend is a decreasing one, as the expected improvement will eventually diminish with further experiments. For the case of the randomly selected compounds, their E[I] is essentially zero, as most of the compounds in the search space have an E[I] that is very small or close to zero. b The uncertainty or standard deviation in the surrogate model decreases with iteration number, except for pure exploration, which is driven by selecting compounds with maximum variance

Simulations for targeted design

We focus here on two examples that discuss the application of active learning methods to iteratively guide physics-based computational codes: a Poisson-Schrödinger equation solver (APSYS) used in optoelectronic simulations to design light emitting diode (LED) structures, and density functional theory (DFT). The objective is to build cheap surrogate models that are sufficiently accurate for carrying out specific tasks without the need to execute these expensive physics-based codes every time a new instance is chosen. We discuss the details below.

Example 2: building surrogate models for optoelectronic simulations

Poisson-Schrödinger (P-S) simulations are often used to optimize the fabrication of light emitting diode (LED) structures. The structures are built from a series of quantum wells and the objective is to maximize the optical output, typically the quantum efficiency or yield.66 Such simulations for optoelectronic applications self-consistently solve coupled classical and quantum equations and are available in commercial packages such as APSYS. The process is quite time consuming, as different inputs to the code are tried until a satisfactory structure or set of fabrication conditions is obtained. The simulations themselves are computationally demanding, and thus what is desired is to rapidly construct a surrogate model that maps the LED structure or inputs to the simulated efficiency, thereby overcoming time-consuming trial-and-error simulations and obtaining the optimal design in as few iterations as possible. Leduc et al.31 addressed this problem by starting with a database of LEDs with known structure (i.e., number of layers, their composition, doping, and widths) labeled with the electro-luminescence internal quantum efficiency, and built a GP surrogate model to make predictions of the efficiency for as-yet unseen structures. They used the E[I] algorithm for selection and ranking, and their work shows that finding a globally optimal LED favors striking a delicate balance between exploration and exploitation in selecting new sample points. This trade-off serves to improve the global accuracy of the model, thereby minimizing the possibility of becoming stuck in a local region of the LED design space for which the efficiencies are only locally optimal.

Figure 9a shows the structure of the GaN LED studied by Leduc et al., consisting of a feature space of 6 dimensions, so that each point, x, was defined by the indium composition of each of the 5 quantum wells as well as the average indium value of the barriers. The width of the quantum wells is a function of the indium levels as well as the barrier height, so that the wavelength does not vary appreciably. The input parameters used in the APSYS code were those from ref. 66 with a current density of 75 A/cm2, and the output included the band structure as well as the results of the solution of the coupled transport equations subject to the band structure deformation and carrier redistribution. Figure 9b shows the results for the internal quantum efficiency calculated using APSYS with the GP and E[I] performing the optimization. Figure 9b (i) shows that within 25 iterations the code has reached 75% of the total efficiency, hence BGO converges rapidly, and after 75 iterations the efficiency is at its optimum with little improvement thereafter (Fig. 9b (ii)). From 150 iterations onwards, the simulation is essentially working towards reducing the uncertainties, so that at the end R2, the coefficient of determination as determined by cross validation, is greater than 0.99 (Fig. 9b (iii), (iv)).

Fig. 9

a A schematic of the structure of a light emitting diode (LED) showing quantum wells and conduction band that was used in the surrogate based modeling and optimization b (i) and (ii) Results of the simulated quantum efficiency of the LED as a function of the active learning steps executed by the code. The simulation converges rapidly and there is little improvement beyond the optimal reached, except that the uncertainties decrease. (iii), (iv) Parity plots of the out of sample data showing the efficiencies predicted by the GP model and those obtained from the simulations after (iii) 150 iterations and (iv) 1000 iterations when the uncertainties are much smaller and the fits superior. Reproduced from ref. 31

Poisson-Schrödinger simulations require tuning of a number of parameters, such as the damping step of the Newton solver that reaches equilibrium, and convergence is often an issue. Optimizing parameters is a general problem in computational sciences and physics55,67,68,69,70 and in machine learning.71,72 Leduc et al.32 recently demonstrated how a surrogate model, using random forest (RF) to predict the convergence rates and E[I] to do the sampling, can be applied to speed up convergence of simulations, such as the P-S equations encoded in APSYS.

Rather than a GP model, Leduc et al.32 utilized a RF model to predict the fraction of convergence for a given simulation, so that 0% refers to a simulation that does not converge at all and 100% identifies a simulation that converged successfully. For this application, the completion fraction of optoelectronic simulations, in which voltages or currents are ramped up to target values, determines the convergence rate. For example, a simulation where the drain to source voltage is ramped up from 0 to 10 V, but the simulation fails at 7.5 V, is said to have converged at 75%. A RF is a collection of decision trees, and the prediction of the forest is an average of the predictions of each tree, hence various measures of uncertainty can be evaluated. In this work, the empirical estimate of the error between the RF prediction and the true convergence ratio was used as the uncertainty measure.
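
A random forest exposes its individual trees, so one simple uncertainty proxy is the tree-to-tree spread, as in the sketch below; note that ref. 32 instead used an empirical estimate of the error between the RF prediction and the true convergence ratio, so this is only an analogous illustration with assumed training data supplied by the caller.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_predict_with_spread(X_train, y_train, X_query, n_trees=200):
    """Random-forest prediction of the convergence fraction, with the tree-to-tree spread as uncertainty."""
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    per_tree = np.array([tree.predict(X_query) for tree in rf.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)
```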

Figure 10 demonstrates the efficacy of the approach to parameterization. The first 15 iterations show the lack of convergence of simulations for which parameters have been chosen randomly. The simulations that follow from the black line onwards in Fig. 10, using a database of the initial random simulations, are then set up automatically using the active learning strategy with RF and E[I]. As the points showing the convergence fraction indicate, these simulations quickly converge and, similar to the simulations in Fig. 9b, there is little subsequent improvement except that the uncertainties decrease. The occasional drops in the convergence fraction seen in Fig. 10 are indicative of the exploration of regions in parameter space as the simulation proceeds towards building a robust model. The coefficient of determination, R2, past 150 iterations is ~0.9. This example on GaN simulations demonstrates the potential of machine learning for automated speed up in the search for target properties as well as convergence.

Fig. 10

Active learning applied to the convergence of GaN transistor simulations by optimizing numerical parameters (5 in total). Each point on the plot is the convergence ratio for a simulation, and the blue line shows the moving average or trend of the convergence ratio. A random choice of numerical parameters leads to an average convergence fraction of 10%, with no simulations fully converging. These random simulations serve as an initial database and as the active learning is initiated (black vertical line), the simulations converge fully within a few iterations. Reproduced with permission from ref. 32 Copyright AIP Publishing 2017

Example 3: guiding density functional theory codes

One of the popular approaches in computational materials science is to use density functional theory (DFT) methods to calculate properties such as the band gap (Eg), thermal conductivity, and moduli (elastic, bulk, and shear), to name a few, in a high-throughput manner.2,3,4,17 More recently, there has also been interest in exploring the potential of machine learning methods to learn from DFT data and, in turn, make new predictions.17,73,74,75,76 Typically, machine learning combined with DFT studies has the following flavor: a reasonably large dataset from high-throughput DFT is built, a machine learning model is trained on the dataset, and the trained model then predicts new responses for compositions not present in the dataset. Note that this approach does not account for uncertainties or active learning. Here, we demonstrate a specific application of active learning where machine learning methods iteratively guide DFT calculations towards promising regions in the composition space.77

For this work we consider stoichiometric apatite compounds with the chemical formula A10(BO4)6X2, where A and B are divalent and pentavalent cations, respectively, and X is an anion. Apatites find applications as biomaterials, luminescent materials, host lattices for immobilizing heavy and toxic elements, and radiation tolerant materials.78 We focus on a chemical space where A = {Mg, Ca, Sr, Ba, Zn, Cd, Hg, or Pb}, B = {P, As, or V}, and X = {F, Cl, Br, or OH}. This gives a total of 96 unique compositions that satisfy the A10(BO4)6X2 stoichiometry. The goal is to rapidly find an apatite composition with the largest Eg by combining active learning methods with DFT calculations in as few iterations as possible.

The training data consists of 13 randomly chosen apatites for which Eg is calculated within the generalized gradient approximation (GGA). Each apatite is represented by rA, rB, and rX, which correspond to the Shannon ionic radius of A-site, B-site, and X-site elements, respectively.79 The largest Eg (5.35 eV) in the training set belongs to the Sr10(PO4)6F2 (SrPF) composition in the P63/m crystal symmetry. We used an ensemble of 100 SVR-RBF ML models to establish a relationship between the three features (rA, rB, and rX) and the property, Eg. The hyperparameters for each SVR-RBF model were optimized using the leave-one-out cross-validation method. Each SVR-RBF model will return a prediction for Eg. Since there are 100 such models, we can estimate the mean (μ or \(\hat E_g\)) and standard deviation (error bar, σ) from the 100 Eg predictions. We estimated the resubstitution absolute mean squared error to be 0.54 eV/composition.

The iterative design loop is shown in Fig. 11a. Our training set contained 13 compositions from which we built an ensemble of 100 SVR-RBF models. Our next step is to predict the \(\hat E_{\mathrm{g}} \pm \sigma\) for the remaining 83 compositions for which the Eg data from DFT-GGA calculations is not known. We recommend a composition that has the largest E[I] for DFT-GGA validation and feedback. The training set is then augmented with this new composition. We set a computational “budget” of 25 total iterations to gain an understanding of the adaptive design process. In Fig. 11b, we show the DFT-GGA calculated Eg data for the compositions recommended. The optimal composition, [Ca10(PO4)6F2], with the largest Eg of 5.67 eV was identified in the 9th iteration.

Fig. 11

Active learning for guiding Density Functional Theory (DFT) calculations. a Overarching strategy in iteratively guiding DFT calculations to find a composition with the largest band gap. b DFT-GGA Eg data (y-axis) as a function of iteration number (x-axis), where the red dashed line represents a composition with the largest Eg in our initial training set. c Evolution of maximum E[I] at the end of each iteration showing that it does not decrease monotonically, but fluctuates. Further, the data point corresponding to the best composition (9th iteration) need not coincide with the largest E[I] in the whole composition space. Adapted with permission from ref. 77 Copyright Springer 2018

In order to provide insights into the iterative adaptive learning process, we also tracked the maxE[I] at the end of each iteration. Ideally, we expect maxE[I] to have large values during the initial iterations and to decrease monotonically as the number of iterations increases (especially after the optimal material is identified). However, we find that maxE[I] shows a non-monotonic trend and does not decrease smoothly as a function of the number of iterations. This is shown in Fig. 11c. The maxE[I] consistently attains a value of zero only from the 16th iteration onwards. This is intriguing because the optimal composition was found in the 9th iteration, yet the maxE[I] did not immediately reduce to zero. This outcome sheds light on one of the difficult questions related to the stopping criterion, i.e., when should one stop the iterative loop? We do not recommend stopping immediately after maxE[I] has reached a small value. Instead, it is important to confirm that the maxE[I] is consistently small and does not increase. This can be accomplished by running additional experiments or simulations. An alternative, maxE[I]-agnostic criterion would be to stop the iterative loop when a material with the desired response is found. However, this approach will not ensure an optimal ML model because the search space will not be surveyed adequately.
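The recommendation above can be phrased as a simple patience rule; the threshold and patience values below are arbitrary placeholders, not values from the study.

```python
# Illustrative stopping rule: stop only after maxE[I] has stayed below a small
# threshold for several consecutive iterations, not the first time it dips.
def should_stop(max_ei_history, threshold=1e-3, patience=5):
    if len(max_ei_history) < patience:
        return False
    return all(v < threshold for v in max_ei_history[-patience:])
```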

Review of other applications and studies

In addition to examples we have already referred to, there have been a number of recent efforts on the topic of active learning, adaptive design, sequential design, and/or Bayesian optimization in materials science. We briefly review these, deferring to the individual publications for details.

One of the earliest works can be traced to Seko et al.,80,81 who used kriging in conjunction with a simplified version of the probability of improvement to guide DFT calculations towards targeted regions in the design space. They demonstrated this approach by computationally finding compounds with high melting points and low lattice thermal conductivity. A similar strategy, using kriging together with a simplified probability of improvement, was adopted by Kiyohara et al.82 to computationally identify a stable interface structure for a simplified coincidence-site lattice grain boundary of FCC-Cu. Ueno et al. developed a Bayesian optimization software framework (referred to as COMBO) for black-box optimization, especially for large data problems.33 COMBO has several functionalities, including Thompson sampling (a probability-matching strategy that chooses a candidate point according to its probability of being optimal), rank-one Cholesky updates, random feature maps, and automatic hyperparameter tuning. Ling et al.34 emphasize the importance of how estimates for uncertainties are calculated. They bias the usual standard deviation estimate and compare its performance using an RF method on four test data sets, including those for magnetocalorics, superconductivity, and thermoelectrics, with strategies compared in terms of the number of evaluations required to find the best property. They find that although random selection was the inferior strategy in all cases, the performance was essentially data dependent. In the limit of adequately large data sets, different uncertainty estimates would be expected to behave similarly; however, an interesting aspect to be explored is the behavior when data sizes are limited, given that we do not know the actual uncertainties associated with the data.83 For Example 1 with the electrostrain data, Xue et al.84 have shown that with a data size of 61, the uncertainties calculated by standard deviation and jackknife, based on an RF model using 50,000 bootstrap samples, behave very similarly. Yamawaki et al.85 use COMBO to perform a structural optimization of graphene nano-ribbons with the objective of identifying a configuration with a high thermoelectric figure of merit zT by separately controlling the electron and phonon transport. Herbol et al.86 used Bayesian optimization with the expected improvement utility criterion to maximize the intermolecular binding energy when exploring the combinatorial space associated with lead ion solvation in hybrid inorganic–organic perovskites.
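For readers unfamiliar with Thompson sampling, the sketch below is a generic illustration with a Gaussian process surrogate rather than the COMBO implementation: one draw from the posterior over the candidate set is taken, and the candidate maximizing that draw is queried next. The kernel choice and function name are assumptions for the sketch.

```python
# Generic Thompson-sampling illustration (not the COMBO implementation):
# draw one GP posterior sample over the candidates and query its argmax.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def thompson_sample_next(X_train, y_train, X_candidates, seed=0):
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X_train, y_train)
    sample = gp.sample_y(X_candidates, n_samples=1, random_state=seed).ravel()
    return int(np.argmax(sample))  # index of the candidate to evaluate next
```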

The reports discussed above use the formalism of active learning to guide computations or have been applied to test cases to demonstrate the potential of these methods in materials science. There have now been a number of studies where iterative machine learning methods have been used to guide experiments. One such work is that of Ren et al.,87 who used RF to identify glass formers in ternary bulk metallic alloy systems (Co-V-Zr), which in turn were validated by high-throughput combinatorial experiments. The authors showed the importance of feedback to improve the machine learning model predictions.

Broadening the scope, a number of studies have utilized surrogate-based optimization to accelerate materials discovery or codes by minimizing the number of experiments or evaluations. Chen et al.88 applied the knowledge gradient function with Bayesian inference to study the stability of an emulsion whose release is triggered by the excitation of gold nanoparticles on the surface of oil droplets; their focus was targeted, controlled release through a set of tunable model control parameters. The knowledge gradient was similarly used in maximizing the output current in an optoelectronic device.89 Xue et al.29 studied the design of NiTi-based alloys with high transformation temperatures. They employed polynomial regression with features for chemical bonding and atomic size, coupled with E[I], knowledge gradient, and maximum variance, to guide the next experiments. Aggarwal et al.25 studied a Cahn-Hilliard model with Bayesian optimization, maximizing the expected utility of the KL divergence to select experiments that provide maximal information about parameters in the model given a certain budget of experiments. Their problem of interest was to find the subsurface properties in a thin film deposited on a substrate. They show that the information gained with two values of the parameters is comparable to that with eight randomly selected values of the same parameters, thus reducing the experimental effort compared to a random strategy by roughly a factor of four. They also studied the problem of metal-metal interfaces and trapping of helium (He) impurities, important in nuclear materials, and again used the KL divergence for selection amongst competing models.

In the field of ferroelectrics, Xue et al. used a Landau model together with Bayesian regression, as well as support vector regression combined with E[I], to find Pb-free piezoelectrics in the BaTiO3 family with morphotropic phase boundaries that show minimum sensitivity to temperature.30 Their training data consisted of 20 well characterized phase diagrams of solid solutions, with a search space of 1200 possible phase diagrams. Balachandran et al.36 developed a two-step machine learning approach that combines classification learning with regression and adaptive design methods to recommend promising chemical compositions of Bi-containing PbTiO3 solid solutions that simultaneously satisfy two constraints: (i) they should have a perovskite crystal structure and (ii) they should have a high ferroelectric Curie temperature. From screening a set of 61,000 potential compositions, ten compositions were identified in five iterations; six of them were confirmed by X-ray diffraction to be in the perovskite crystal structure, and a composition in the Bi(Co,Fe)O3-PbTiO3 chemical space had the highest Curie temperature (898 K) among them.

Optimizing interphase properties in polymer nanocomposites was the problem studied by Wang et al.90 Their objective was to minimize the difference between the bulk dielectric and viscoelastic properties of a nanocomposite predicted by simulations and those obtained from experiments, using a GP model together with E[I].

Genetic algorithms have been extensively used as a means to optimize given target properties iteratively, in a variant of the approaches we have considered. Starting with a random initial population of n-block polymers (for any given n), and letting them evolve in terms of constituent blocks and their neighbors according to a genetic algorithm, a set of polymers may be obtained with properties closest to the provided targets.91 Such methods can be more efficient than random searches in finding materials with desired properties; however, they do not use the exploitation-exploration strategies based on selection and ranking discussed here.

Active learning methods have also been used to parameterize interatomic potentials for molecular dynamics simulations. Podryabinkin and Shapeev92 demonstrated the use of active learning for building machine learning interatomic potentials to resolve the transferability or extrapolation problem. Using a D-optimality criterion, which minimizes the generalized variance of the parameter estimates of a pre-specified model functional form, the authors showed that reliable machine learning interatomic potentials can be built. In a recent paper entitled “Less is more: sampling chemical space with active learning,” Smith et al.93 also consider the development of accurate and transferable potentials for predicting molecular energetics. They use a “query by committee” ranking strategy to sample the chemical space where the potential fails to accurately predict the potential energy, thereby improving the overall fitness of the potentials.
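The query-by-committee idea can be illustrated with a few lines of generic code. The sketch below uses a small committee of random forest regressors as a stand-in for the committee of neural-network potentials in the cited work, and ranks pool configurations by the spread of the committee predictions; the function name and committee size are illustrative assumptions.

```python
# Generic query-by-committee sketch: rank unlabeled configurations by how much
# a committee of surrogate models disagrees about them. Random forests stand
# in here for the neural-network potentials used in the cited work.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rank_by_disagreement(X_train, y_train, X_pool, n_members=10):
    preds = []
    for seed in range(n_members):
        member = RandomForestRegressor(n_estimators=100, random_state=seed)
        member.fit(X_train, y_train)
        preds.append(member.predict(X_pool))
    disagreement = np.std(preds, axis=0)      # large spread -> label these next
    return np.argsort(disagreement)[::-1]     # pool indices, most contentious first
```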

Finally, in addition to the use of active learning methods in computational materials and physics, which we have discussed in Sec. IVB, there is the whole area of the design of computer experiments,94,95,96 which utilizes adaptive and iterative strategies to address various problems by combining system knowledge with space-filling for sample placement. Many physical processes are difficult to simulate, and a simulation code often serves as a proxy for the process with various inputs and outputs. Our Example 2 was a case in point and, with easier access to more powerful computer resources, such studies are becoming important as surrogates to complement physical experiments. Computer experiments allow many of the criteria and designs, such as two-level factorials, Latin hypercubes, U-designs, and lattice designs, to be compared. In addition, in purely deterministic computer experiments, observations from the code with the same input variables will be identical. Thus, in modeling data from such an experiment, bias due to model inadequacy becomes more important than the need to reduce variance.

Challenges and future development

The nature of experimental data in materials science is such that we do not have the luxury of randomly selecting different training sets from a population and evaluating the ground truth, because each experiment is expensive. Typically we have a training data set of samples from an unknown population, and we need to employ statistical methods, such as the non-parametric bootstrap sampling with replacement (and other variants of this methodology) that we have discussed, to generate many training data sets, forming a prescribed distribution, with which to train surrogate models.

There are different ways to evaluate the efficacy of AL methods. One approach is to use simulated datasets, where the ground truth labels are known for all data points in the population. This approach has the advantage that one can randomly select several training sets from a population, explore AL methods independently on each of the training sets, and provide statistically meaningful comparisons. In fact, such studies of different AL strategies on simulated datasets have been performed in the literature.97,98 For example, Theiler and Zimmer97 compare Bayesian global optimization methods using simulated datasets in four distinct regimes, corresponding to high and low levels of measurement noise and to high and low levels of “quenched noise.” They investigate the choice of AL methods when the surrogate model is well matched to the data. Their numerical experiments provide a direct way to compare Bayesian global optimization algorithms, in particular in a way that controls for the choice of surrogate model and avoids the issue of model mismatch, isolating the role of the surrogate model from the AL. So, rather than choosing anecdotal functions to optimize, they draw their functions (thousands of them) from a GP prior, and then use the same prior to initialize the GP regression. By comparing these algorithms over different regimes of quenched and measurement noise, they compare the relative performance of the different AL strategies. Their work on Gaussian-generated data categorically demonstrates that random sampling is the worst performer compared to the other strategies. Similarly, Balachandran et al.99 used data from DFT calculations and evaluated various AL methods with the objective of accelerated materials design, albeit post factum. Here, the authors considered a population of 223 compounds for which the ground truth data was available. They were able to randomly sample a number of training sets from the population and estimate the statistics. Even in this case, AL consistently out-performed random sampling.
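The spirit of such controlled comparisons can be captured in a short script: draw a test function from a GP prior on a grid, then count how many evaluations an E[I]-guided search and a random search need to locate its maximum. The kernel, grid, and budgets below are arbitrary choices for illustration, not those of the cited studies.

```python
# Toy controlled comparison: a test function drawn from a GP prior, optimized
# by EI-guided selection vs. random selection. All settings are illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = np.linspace(0, 1, 200).reshape(-1, 1)
K = RBF(length_scale=0.1)(X)                          # prior covariance on the grid
f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
true_best = f.max()

def evaluations_to_find_max(strategy, n_init=5, n_iter=40):
    idx = list(rng.choice(len(X), n_init, replace=False))
    for t in range(n_iter):
        if f[idx].max() >= true_best - 1e-9:
            return t                                  # iterations needed
        if strategy == "random":
            nxt = int(rng.integers(len(X)))
        else:                                         # EI-guided selection
            gp = GaussianProcessRegressor(RBF(0.1)).fit(X[idx], f[idx])
            mu, sd = gp.predict(X, return_std=True)
            z = (mu - f[idx].max()) / np.maximum(sd, 1e-12)
            ei = (mu - f[idx].max()) * norm.cdf(z) + sd * norm.pdf(z)
            nxt = int(np.argmax(ei))
        idx.append(nxt)
    return n_iter

print("EI iterations:", evaluations_to_find_max("ei"))
print("random iterations:", evaluations_to_find_max("random"))
```

Averaging such runs over many sampled functions and many random initializations is what allows the kind of statistically meaningful comparison described above.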

One of the unresolved questions in adaptive design is the importance of model quality in the accelerated search. We have empirically observed that it is not necessary to have a highly accurate machine learned model to accelerate the search when active learning is also involved in recommending the next experiment or computation. In Fig. 12a, we show a particular example of a machine learned model that was trained using DFT data. The term “G” refers to the shear modulus of a compound obtained from DFT calculations.99 It can be seen that this is not a particularly accurate model. However, when this model is coupled to active learning algorithms (E[I], KG, etc.) in search of a new compound with the largest G, it performs better than the random method (see Fig. 12b). This result indicates that active learning is forgiving of poor model quality. Our preliminary studies have shown that the choice of features may also play a key role in finding the optimal material when dealing with poor model quality. However, the underlying reasons behind these observations are not clear and are worthy of further investigation.

Fig. 12

Model quality and active learning. a Performance of the SVR model (y-axis) vs. DFT data (x-axis) in predicting the shear modulus (G) of MAX phases. The blue and green data points correspond to in-sample and out-of-sample data points, respectively. The red data point corresponds to the next recommendation from active learning at the k = 34th measurement. Symbols μ* and μ′ represent the data points with the largest G in the population and in the SVR predictions on the out-of-sample data, respectively. b Relative performance of various active learning methods (including random sampling; max: highest expected score; max-A: alternates between choosing the highest expected score and the most uncertain estimated score; max-P: maximizes the probability of improvement) in maximizing G. The smaller the number of new measurements (y-axis), the better the accelerated search. The x-axis shows the number of initial in-sample data points used for training the SVR model. Error bars are the standard deviation from 1000 random trials. Reproduced from ref. 99

Similarly, there is a need for results providing stricter guidance on stopping criteria to prevent excessive exploration of the feature space. In the examples we have considered, we have stopped when either our budget has run out or when we are satisfied with the outcome.

The methods we have outlined have been generalized to deal with multiple objectives, so that given two properties, one needs to make the best selection of the next materials in the parts of the two-dimensional Pareto plot likely to improve on the existing Pareto front.100,101,102,103 The axes of the Pareto plot are the properties, so that we can define a characteristic boundary on which lie materials for which neither objective can be improved without degrading the other. Such boundary points, the non-dominated data points, define a Pareto front (PF) that represents the best trade-off between the objectives. Although the method has been exercised on materials databases,103 such as for shape memory alloys and piezoelectrics, it has yet to be demonstrated experimentally to find new materials with enhanced properties.
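As a point of reference, identifying the non-dominated points in a two-property dataset is straightforward; the short sketch below assumes both properties are to be maximized and is independent of any particular materials database.

```python
# Identify the non-dominated (Pareto-optimal) points when both properties are
# to be maximized; these points constitute the Pareto front.
import numpy as np

def pareto_front_indices(Y):
    """Y: (n_points, 2) array of property values; larger is better for both."""
    n = len(Y)
    non_dominated = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i if it is at least as good in both objectives
            # and strictly better in at least one
            if i != j and np.all(Y[j] >= Y[i]) and np.any(Y[j] > Y[i]):
                non_dominated[i] = False
                break
    return np.where(non_dominated)[0]
```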

The extension of the basic ideas here to multi-fidelity optimization, in which larger amounts of relatively cheap data are used in conjunction with smaller amounts of more expensive data to make credible predictions, has also been well studied. These methods employ GPs and extend well-known kriging regression to heterogeneous, variable-fidelity information,104,105 and this framework has been successfully demonstrated within an optimization scheme. Its use in materials science has recently been demonstrated by combining two levels of DFT exchange-correlation functionals (PBE and hybrid functionals) to make predictions of band gaps for double perovskites.106 However, no design was performed and the method still needs to be demonstrated with experiments. Nevertheless, there are a number of materials science problems where this method will become important.
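A heavily simplified flavor of this idea is an additive-correction scheme: fit one surrogate to the plentiful cheap data and a second surrogate to the discrepancy observed at the few expensive points. The sketch below shows only that simplified flavor, not the co-kriging formulation of the cited work, and the class and attribute names are invented for illustration.

```python
# Simplified additive-correction sketch of multi-fidelity modeling (not the
# full co-kriging of the cited work): a GP on cheap data (e.g., PBE band gaps)
# plus a GP on the discrepancy at the few expensive points (e.g., hybrid functional).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class TwoFidelityModel:
    def __init__(self):
        kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
        self.gp_low = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
        self.gp_delta = GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    def fit(self, X_low, y_low, X_high, y_high):
        self.gp_low.fit(X_low, y_low)                    # plentiful low-fidelity data
        delta = y_high - self.gp_low.predict(X_high)     # discrepancy at expensive points
        self.gp_delta.fit(X_high, delta)
        return self

    def predict(self, X):
        return self.gp_low.predict(X) + self.gp_delta.predict(X)
```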

Recently, there has been interest in the use of text mining methods and natural language processing techniques to build datasets from the digital literature for guiding materials synthesis and design.107,108 Such advances can have key implications for the adaptive learning approach, especially in rapidly constructing the training datasets used to build surrogate models. Currently, training sets are constructed manually by surveying the published literature, which can be a time-consuming process. We envision that combining the text mining approach for training set construction with adaptive learning can have an impact on the accelerated search for new materials.

Summary

Our objective has been to bring global optimization tools, developed in other areas, to the attention of the materials community by showing, via three diverse examples involving experiments or computations, how the number of experiments or iterations can be minimized. Although a large number of studies exist that build and make predictions using machine learning models, the importance of experimental design and active learning, that is, the selection of the next material to test, is only now being studied in materials science. In particular, we have highlighted how different design strategies, in the form of surrogate models and utility functions, can minimize the number of iterations needed to find materials with desired properties. These tools have certainly led to new materials in their class and to other advances, as illustrated by our three examples and the variety of other examples we have briefly discussed in Sec. IVC. Other applications relate to processing,39 as well as the use of computational codes for design in the materials and mechanical engineering areas14,37 and functional polymers.91 Many of these studies (experimental and computational) have shown that the design strategies are superior to random selection in minimizing the number of experiments. We also note, as demonstrated by Example 3 in the context of MAX and apatite compounds, that the surrogate model does not necessarily improve over the course of the iterations, even though it leads to fewer measurements or better properties (Fig. 12a, b). This aspect has also been observed in other studies.27,30 There is thus a need for statistical studies of the interplay between the quality of the model and the selections recommended by the design strategies. However, we need to be cognizant of the “no free lunch theorem,”109 which essentially states that there is no universal optimizer for all problems; that is, there is no guarantee that a model trained on one data set will work on a different data set.

We are still at the stage where these methods have barely been exercised experimentally. The examples we have given have involved bulk synthesis, with relatively slow turnaround time and characterization. High-throughput experiments that perform synthesis by sputtering and characterization by X-ray diffraction have had success with thin-film geometries110 as screening strategies for bulk shape memory alloy design. For bulk synthesis, self-propagating high-temperature synthesis (SHS),111 methods based on laser additive manufacturing,112 and suspended droplet alloy processing show significant promise. The holy grail of automated, on-the-fly synthesis and characterization will require the kinds of design tools we have discussed for selecting the next material. Recent steps in this direction include the use of unsupervised learning, such as cluster analysis, to convert composition and structural data from combinatorial libraries and material structure databases into potential composition phase diagrams and potential constituent phases.113,114 The convergence of the information sciences with materials science augurs considerable promise for changing the way new materials with targeted properties will be discovered, and how materials science will be studied in the future.