Introduction

X-ray absorption spectroscopy (XAS) is a widely used technique in the study of the properties, physical states, and local environments of materials.1,2,3 When an atom absorbs an incident X-ray photon with energy greater than the binding energy of a core-level electron, that electron is removed from its quantum level. In XAS, the absorption coefficient, μ(E), is measured as a function of X-ray energy E. Detailed descriptions of X-ray absorption theory and the governing equations can be found in many excellent books and review papers.4,5

The X-ray absorption fine structure (XAFS) is typically divided into two regimes: X-ray absorption near-edge structure (XANES) and extended X-ray absorption fine structure (EXAFS).6 The XANES is a fingerprint of the oxidation state and coordination chemistry of the absorbing atom. Quantitative XANES analyses are typically difficult and are usually conducted in combination with principal component analysis or least-squares fitting. The EXAFS provides local atomic structure information, which can be extracted via coupling with theoretically calculated XAFS spectra using well-established software packages.7 One of the main challenges in interpreting XANES and EXAFS lies in obtaining reference spectra against which to fit the unknown spectra; measuring XAFS spectra experimentally is laborious and time-consuming, requiring X-ray beams of finely tunable energy that are accessible only at synchrotron radiation facilities.5 To the authors’ knowledge, open reference databases usually contain at most hundreds of XAS spectra. For example, the electron energy-loss spectroscopy (EELS) database8 initiated in the 1990s contains 271 spectra, of which only 21 are XAS spectra and 17 are K-edge spectra. EELS is theoretically equivalent to X-ray absorption9 under common acquisition conditions, but is of lower quality in terms of signal-to-noise ratio and energy resolution. Most XAS data are available only in published figures in the literature, from which they cannot be extracted easily for comparison.

In recent years, theoretical calculations of XAFS have become more accurate and accessible due to the successful development of ab initio codes, such as the FEFF program,10,11 as well as advances in computing power. In this work, we discuss the development of a high-throughput framework to generate a reference XAS database (XASdb) for all materials in the Materials Project12 database. This framework combines the power of the Python Materials Genomics (pymatgen) materials analysis library13 with the FireWorks workflow management software14 to carry out hundreds of thousands of XAFS calculations using the FEFF9 code.10 This framework has been implemented in the Atomate package.15 More importantly, we have developed a novel automated XANES spectra matching algorithm that leverages ensemble learning techniques to identify similar XANES spectra from our computed reference XASdb. We believe the combination of the XASdb with these machine-learned spectra matching tools will be an invaluable resource to the materials research community by greatly enhancing the efficiency with which experimental XAS spectra can be analyzed. It should be noted that this work primarily focuses on common K-edge XANES spectra; higher-edge XANES and EXAFS computations and analysis are currently ongoing and will be discussed in future publications.

Results and discussion

We have selected the latest version (v9) of the popular FEFF program as our software of choice in this work. FEFF is a program for ab initio multiple-scattering calculations of XAFS and various other spectra for clusters of atoms. This choice is motivated by three factors: (i) FEFF-computed spectra have been shown to yield excellent agreement with experimentally measured spectra in a broad range of studies;16,17,18 (ii) FEFF calculations are relatively inexpensive compared to other approaches for computing XAS spectra (e.g., a typical FEFF calculation takes < 1 h on a single node, while multi-day, multi-core calculations are necessary for DFT-based spectra calculations); and (iii) FEFF requires few adjustable parameters. These three advantages make FEFF an ideal candidate for automation to generate XAS spectra across a broad range of chemistries. A key step in any automation framework is benchmarking of computational parameters for convergence and accuracy. The benchmarking dataset and criteria are detailed in the Methods section. The Pearson correlation coefficient, as given by the following expression, is used as the benchmarking criterion.

$$S_{{\rm Pearson}}\left( {X,Y} \right) = \frac{{\mathop {\sum }\nolimits_{{i} = 1}^{\mathrm{D}} \left( {X_{i} - \overline X } \right)\left( {Y_{i} - \overline Y } \right)}}{{\sqrt {\left( {\mathop {\sum }\nolimits_{{i} = 1}^{\mathrm{D}} \left( {X_{i} - \overline X } \right)^2} \right)\left( {\mathop {\sum }\nolimits_{{i} = 1}^{\mathrm{D}} \left( {Y_{i} - \overline Y } \right)^2} \right)} }},$$
(1)

where \(X_i\) and \(Y_i\) represent the absorption coefficients of two spectra on the same energy grid of D points. The value of \(S_{\rm Pearson}\) can range from −1 to 1, with a value of 1 being a perfect match. Used in this context, the Pearson correlation coefficient is a similarity metric, i.e., it measures the degree of similarity between two spectra.
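As an illustration, Eq. (1) can be evaluated directly with NumPy. The sketch below is not the authors' implementation; the function name `pearson_similarity` and the synthetic spectra are assumptions made only for this example.

```python
import numpy as np


def pearson_similarity(x, y):
    """Pearson correlation (Eq. 1) between two spectra on the same energy grid."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("both spectra must be sampled on the same energy grid")
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


# Synthetic spectra (hypothetical data): an edge step plus a single peak.
energy = np.linspace(-10.0, 45.0, 200)
mu_a = 1.0 / (1.0 + np.exp(-energy)) + 0.3 * np.exp(-((energy - 5.0) ** 2) / 8.0)
mu_b = 2.0 * mu_a + 0.1  # same shape, different scale and offset

print(pearson_similarity(mu_a, mu_b))   # close to 1: identical shape
print(pearson_similarity(mu_a, -mu_a))  # close to -1: inverted shape
```

Because the metric subtracts the mean and divides by the spread of each spectrum, it is insensitive to a constant offset or overall rescaling of the intensities, which is why it is convenient for comparing spectra computed with different settings.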

We have tested the convergence of the FEFF-calculated spectra with respect to four parameters: the radius of the cluster considered in the full multiple-scattering (FMS) calculation during the self-consistent field (SCF) step (SCF rfms1), the total number of multiple-scattering paths considered (FMS rfms), the exchange-correlation potential (EXCHANGE), and the treatment of the core hole (COREHOLE) (see Methods for a detailed description of the FEFF input file).

The SCF rfms1 was varied from 2 to 8 Å, and the spectrum at the highest value (8 Å) was set as the reference for each material. Figure 1 shows the computed Pearson correlation coefficients between spectra computed at lower rfms1 and the reference. We find that the computed spectra are converged (\(S_{\rm Pearson}\) > 0.95) at around rfms1 = 6 Å for nearly all materials; the exception is the Al K-edge of aluminum nitride, which is converged only at rfms1 = 6.5 Å. Given that the computational cost increases substantially for rfms1 > 7 Å (see Supplementary Fig. 1), we have chosen rfms1 = 7 Å as the default setting for SCF in the high-throughput XANES computations.

Fig. 1 Benchmarking results of the rfms1 parameter in the SCF card for K-edge XANES of various materials. The rfms1 parameter specifies the radius of the cluster considered for the full multiple scattering during self-consistent potential calculations. Pearson correlation coefficients were calculated between spectra calculated at different rfms1 and the reference calculated at rfms1 = 8.0 Å

The rfms field in the FMS card was varied from 3.0 to 11.0 Å at 1.0 Å intervals, and the spectrum at the highest value (11 Å) was set as the reference for each material. We find that the computed spectra are converged (\(S_{\rm Pearson}\) > 0.95) at around rfms = 9 Å for all materials (see Supplementary Fig. 2(a)). Since the computational cost increases substantially for rfms > 9 Å (see Supplementary Fig. 2(b)), we have chosen rfms = 9 Å as the default setting for FMS in the high-throughput XANES computations.
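The convergence test used for both radii can be sketched as follows: the spectrum at the largest radius is taken as the reference, and the smallest radius is reported from which all spectra exceed the 0.95 Pearson threshold. This is a minimal sketch under stated assumptions; the helper `converged_radius` and the synthetic spectra (a step edge with a ripple whose amplitude shrinks with radius) are hypothetical stand-ins for actual FEFF output.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two spectra on the same grid (Eq. 1)."""
    dx, dy = x - x.mean(), y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


def converged_radius(spectra_by_radius, threshold=0.95):
    """Smallest radius from which every spectrum matches the reference.

    The reference is the spectrum at the largest radius. Returns the
    converged radius and the per-radius similarity scores.
    """
    radii = sorted(spectra_by_radius)
    reference = spectra_by_radius[radii[-1]]
    scores = {r: pearson(spectra_by_radius[r], reference) for r in radii}
    converged = radii[-1]
    for r in reversed(radii):  # walk down from the reference radius
        if scores[r] <= threshold:
            break
        converged = r
    return converged, scores


# Synthetic stand-in for spectra computed at several radii (in angstrom).
energy = np.linspace(0.0, 50.0, 200)
reference = 1.0 / (1.0 + np.exp(-(energy - 10.0)))
ripple = np.sin(energy / 2.0)
amplitude = {3.0: 1.0, 5.0: 0.5, 7.0: 0.02, 9.0: 0.01, 11.0: 0.0}
spectra = {r: reference + a * ripple for r, a in amplitude.items()}

radius, scores = converged_radius(spectra)
print(radius, {r: round(s, 3) for r, s in scores.items()})
```

Walking down from the reference radius, rather than taking the first radius that passes, avoids reporting a radius that happens to pass while a larger one fails.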

In FEFF9, two approximations of the core-hole potential have been implemented, i.e., a fully screened potential based on the final-state rule (FSR) and a linear random-phase-approximation (RPA) screening. Systematic reviews of these two approaches have been done by Rehr et al.19 We evaluated the performance of all three core-hole options in FEFF9 (FSR, RPA, and no core hole) on the computed K-edge XANES. As shown in Supplementary Fig. 3(a), spectra obtained using both the FSR and RPA are in much better agreement with experimental results than those without core-hole treatment. The spectra computed without a core-hole treatment lack the edge enhancement observed in the experiments. In general, spectra obtained using FSR and RPA are similar (Supplementary Fig. 3(b)). We have chosen RPA screening as the default setting for the high-throughput XANES computations because the FSR might break down for the L-shell metals.20

Similar evaluations of the EXCHANGE card options reveal that the default Hedin-Lundqvist model is the best option (see Supplementary Fig. 4).

Sensitivity of computed XAS spectra to lattice parameters

The FEFF code uses a self-consistent DFT calculation of the Fermi energy based on the real-space Green’s function (RSGF) approach with muffin-tin potentials for a given lattice structure. Compared with full-potential calculations, we find that the FEFF calculation of the densities of states is typically in fairly good agreement with DFT for many materials. In the Materials Project, the Perdew-Burke-Ernzerhof (PBE)21 generalized gradient approximation functional was used as the default for all relaxation calculations. As it is well known that PBE leads to systematic errors of up to 5% in the lattice parameters (with a tendency to overestimate),22,23,24,25 we tested the sensitivity of computed XANES spectra to ±5% changes in the lattice parameters. The results are shown in Fig. 2.

Fig. 2 (a) Relationship between the Fermi energy level of K-edge XANES and lattice parameter changes. Fermi energy levels of the unstrained structures are used as references. (b) Visualization of Na K-edge XANES spectra in Na2O (mp-2352) calculated with different applied strain values

We find that the Fermi energy level of the spectrum is sensitive to the lattice parameter variation (Fig. 2a). The Fermi energy level shifts towards lower energy as the lattice parameter increases, while the spacing of the spectral features contracts at the same time. An example for Na K-edge of Na2O is shown in Fig. 2b, and additional examples are available in Supplementary Fig. 5.

A portion of the Fermi energy shift can be attributed to artifacts of FEFF’s potential approximation model (see Supplementary Fig. 8). Nevertheless, the shape of the spectra remains unchanged. While different corrections to eliminate the artificial component of the dependence have been reported,26 these approaches are not amenable to a high-throughput approach. Here, we note that due to the approximations used in FEFF, the Fermi level needs to be calibrated against experimental spectra in any case. Therefore, a pure energy shift only translates to an energy calibration value in the post-processing.

In summary, the PBE-relaxed structures from the Materials Project can be used as the input for high-throughput XANES calculations, even though there are other functionals27,28 that may provide better lattice parameter estimates.29,30,31,32

Workflow and database

Using the parameters outlined above, we developed a high-throughput workflow for FEFF XAS calculations within the open-source computational materials science workflow package Atomate.15 Atomate provides a high-level interface to compose workflows using widely used open-source materials science software such as Pymatgen,13 FireWorks,14 and Custodian. The proposed default FEFF9 parameters have been implemented as “input sets” in Pymatgen,13 which ensures reproducible and automated generation of standardized input files for any material. The compounds used in the high-throughput spectra generation were obtained from the Materials Project database.12 For each compound, the K-edge XANES spectrum was computed with each symmetrically unique site in the structure as the absorbing atom.

All computed spectra, as well as accompanying metadata (e.g., input structure, absorbing atom, Materials Project ID, etc.), are stored in a MongoDB database for on-demand querying and retrieval of data. So far, K-edge XANES spectra have been computed for more than 40,000 unique materials in the Materials Project database, which amounts to over 800,000 K-edge spectra. To our knowledge, this is by far the largest repository of XANES spectra in the world, and it is growing rapidly. Future plans include the calculation of XANES for L, M, and N shells as well as EXAFS spectra.

Spectra matching using ensemble learning

To extract the most utility and power from the XASdb, we have developed a novel Ensemble-Learned Spectra IdEntification (ELSIE) algorithm that allows for rapid identification of matching spectra for any experimental XAS spectrum. The main goal of spectral matching is to obtain a list of compounds (the “hit list”) whose spectra are most similar to the target spectrum. Whether a match counts as a success or a failure depends on the information the spectrum is meant to convey. In the case of XANES spectra, the relevant information to be extracted is the coordination environment and oxidation state of the absorbing atom. As multiple materials can have atoms in the same oxidation state and coordination environment, we define the matching to be successful if the correct coordination environment and oxidation state are within the top entries.

The ELSIE algorithm uses the ensemble method to improve the robustness of XAS identification. In ensemble learning, the core concept is the combination of multiple weak learners to achieve superior performance. It relies on the assumption that each weak learner is better than a random guess, and each weak learner captures different aspects of the problem. At the core of the algorithm is the process of building individual weak learners. Taking inspiration from the spectra matching algorithms for Raman spectroscopy33 and other spectra,34,35 we broke down the problem of matching XAS spectra into two main steps, namely preprocessing and similarity computations. We define each weak learner to be a combination of a preprocessor (a specific series of preprocessing steps) with a similarity metric. Figure 3 provides an overview of the ELSIE algorithm (see Methods section for the details on the construction of the ELSIE algorithm).

Fig. 3 Workflow schema of the Ensemble-Learned Spectra IdEntification (ELSIE) algorithm. The algorithm consists of two steps. In the first step, the absorption species is identified and used to narrow down the candidate computed reference spectra. In the second step, the spectral matching ensemble yields a rank-ordered list of computational spectra according to similarity with respect to the target spectrum

We evaluated the ELSIE algorithm using 13 XANES spectra from EELSDb (Supplementary Table 1), supplemented by six high-quality experimental XANES spectra of V2O5, V2O3, VO2, LiNiO2, LiCoO2, and NiO from previous studies.36,37 The inclusion of this latter dataset is motivated by our desire to improve the diversity of the test data, especially with regard to transition metal species.

The first step is to narrow down the candidate computed reference spectra by the absorption element (A). Though this information is usually known a priori, the characteristic XAS absorption edge energy follows a power law with the atomic number,5,6 which leads to clearly separated energy ranges. Hence, we can identify the absorption element with 100% accuracy by comparing the energy range of the target spectrum to tabulated X-ray absorption edge data.38
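A minimal sketch of this first step is given below. It assumes a lookup table of K-edge energies; the six values listed are approximate tabulated values for elements that appear in this work, whereas an actual implementation would draw on the full tabulated edge data of ref. 38. The function name `identify_absorber` and its selection rule (the edge inside the measured window that lies closest to the window start) are assumptions for illustration, not the authors' code.

```python
import numpy as np

# Approximate tabulated K-edge energies in eV (small illustrative subset).
K_EDGE_EV = {
    "C": 284.2,
    "Na": 1070.8,
    "Al": 1559.6,
    "V": 5465.0,
    "Co": 7709.0,
    "Ni": 8333.0,
}


def identify_absorber(energies, edges=K_EDGE_EV):
    """Return the element whose tabulated K-edge lies in the measured window."""
    lo, hi = float(np.min(energies)), float(np.max(energies))
    inside = [el for el, edge in edges.items() if lo <= edge <= hi]
    if not inside:
        raise ValueError("no tabulated edge falls inside the measured range")
    # If several edges fall inside, prefer the one nearest the window start,
    # since a XANES scan begins just below the edge of interest.
    return min(inside, key=lambda el: edges[el] - lo)


print(identify_absorber(np.linspace(8320.0, 8390.0, 100)))  # Ni
print(identify_absorber(np.linspace(1060.0, 1120.0, 100)))  # Na
```

Since K-edge energies of neighbouring elements are separated by far more than the width of a XANES scan, a window test of this kind is sufficient in principle; the sketch does not handle L- or M-edges, which would need their own table.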

Once the absorbing element A is identified, the computed spectra of all materials within the same chemical system are queried from the XASdb. For example, for the Al K-edge of Al2O3, we include the Al K-edge spectra of all Al and \({\rm Al}_x{\rm O}_y\) materials as reference spectra. We excluded compounds with an energy above hull (Ehull) larger than 100 meV/atom since they are not likely to be stable.39 For C K-edge XANES of the diamond structure (\(Fd\bar 3m\)), we relaxed the constraint to 200 meV/atom as the corresponding entry (mp-66, diamond) has an Ehull of 136 meV/atom. It should be noted that although the individual absorption spectrum for each symmetrically distinct site was computed for all crystal structures in the Materials Project database, the reference spectra used for comparison with the target spectra are constructed by summing these individual spectra, taking into account the site multiplicities.
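The construction of a structure-level reference spectrum from site-resolved spectra can be sketched as a multiplicity-weighted combination. In the sketch below, dividing by the total multiplicity (so that the result is a per-atom average) is an assumption of this example; the text only states that the site spectra are summed with their multiplicities, and the final normalization step of the matching algorithm makes the overall scale irrelevant. The function name and the toy data are hypothetical.

```python
import numpy as np


def combine_site_spectra(site_spectra, multiplicities):
    """Multiplicity-weighted average of site spectra on a common energy grid.

    site_spectra: array-like of shape (n_sites, n_points)
    multiplicities: array-like of shape (n_sites,)
    """
    spectra = np.asarray(site_spectra, dtype=float)
    weights = np.asarray(multiplicities, dtype=float)
    if spectra.ndim != 2 or spectra.shape[0] != weights.shape[0]:
        raise ValueError("need one multiplicity per site spectrum")
    return weights @ spectra / weights.sum()


# Toy example: two symmetrically distinct sites with multiplicities 1 and 3.
site_a = np.array([1.0, 1.0, 1.0])
site_b = np.array([3.0, 3.0, 3.0])
print(combine_site_spectra([site_a, site_b], [1, 3]))  # [2.5 2.5 2.5]
```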

To evaluate the overall performance of ELSIE, we looked at three key metrics: (i) whether the correct structure is within the top 5 ranked computed spectra, (ii) whether the top ranked entry has the absorbing species in the correct oxidation state, and (iii) whether the top ranked entry has the absorbing species in the correct coordination environment, i.e., coordination number and geometry. Where the exact structural information is not available (e.g., in the experimental spectra from EELSDb), it is assumed that those spectra correspond to the ground state structures in the Materials Project database with the same chemical composition. It should also be noted that some reference materials may have the same element in multiple oxidation states and coordination environments. Therefore, the application of metrics (ii) and (iii) merely indicates whether at least one of the distinct sites in the top entry has the correct oxidation state and coordination environment. The results are summarized in Table 1.

Table 1 Performance of ELSIE algorithm on 19 test spectra

Of the 19 test spectra, we find that the correct structure is within the top 5 ranked structures for 11 systems, i.e., only 57.9% accuracy. However, the correct oxidation state and coordination environment are in the top entry for 16 and 15 systems, i.e., accuracies of 84.2% and 78.9%, respectively. The best coefficient α is found to be 0.01. Given that XANES is a technique primarily used to extract oxidation state and coordination environment information, these results are a major validation of the effectiveness of the ELSIE matching algorithm.

To emphasize the effectiveness of the ensemble approach, we also performed the same benchmark using a single learner utilizing just the sigmoid squashing function and cosine similarity measure on spectra that have been pre-normalized with respect to summed intensity. The ELSIE algorithm outperforms the single learner approach by 15.8% in identifying both the correct oxidation state and coordination environment.

We will now illustrate the performance of our spectral matching algorithm with a few case studies on diverse chemistries. For all spectra, we have confined our comparison to the energy range from −10 to 45 eV from the absorption edge, which is the region typically referred to as XANES.

Case study 1: main group metals

Figures 4a and 4b show the ELSIE spectral matching results of the Al K-edge XANES of α-Al2O3 and Na K-edge XANES of NaCl, respectively. For both target spectra, the correct oxidation states and coordination environments are found in the top candidates. Furthermore, we observe that our proposed peak shifting approach is effective in aligning the target and reference spectra.

Fig. 4 Results of the similarity ranking returned by the ELSIE matching algorithm on (a) Al K-edge XANES of α-Al2O3; (b) Na K-edge XANES of NaCl; and (c) Na K-edge XANES of Na2O. Detailed information about the retrieved compounds can be found on the Materials Project website: (a) Al2O3 (Pbcn, mp-1938), Al2O3 (Pna21, mp-2254), Al2O3 (\(R\bar 3c\), mp-1143), and Al2O3 (C2/m, mp-7048); (b) NaCl (\(Fm\bar 3m\), mp-22862), Na (\(Im\bar 3m\), mp-127), Na (P63/mmc, mp-10172), and Na (\(I\bar 43d\), mp-567772); and (c) Na (\(Im\bar 3m\), mp-127), Na (P63/mmc, mp-10172), Na2O (\(Fm\bar 3m\), mp-2352), and Na (\(I\bar 43d\), mp-567772), in decreasing similarity order

Figure 4c shows a notable case, the Na K-edge of Na2O, where the ELSIE algorithm fails. Here, the ELSIE algorithm returns elemental Na as the top ranked result, as opposed to Na2O. The main reason for this failure is that the FEFF-computed spectrum is not in good agreement with the experimental spectrum (see Supplementary Fig. 7 for this and a few other examples). Possible solutions include the use of real-space full-potential multiple-scattering theory or other first-principles approaches.40 For Na2O in particular, we find that the experimental Na K-edge XANES of Na2O is more similar to the computed Na K-edge XANES of Na2CO3 (Supplementary Fig. 7(c)), which may indicate possible contamination by the atmosphere in experiments.

Case study 2: transition metal oxides

Figure 5 shows the ELSIE spectra matching results of the Ni K-edge XANES in NiO and the Co K-edge XANES in LiCoO2. From Fig. 5a, we note that although the computed peak positions and amplitudes are not in great quantitative agreement with the experimentally measured spectra, the ground state NiO entry is nevertheless returned as the top ranked candidate. In particular, the small Ni 1s-3d peak at 8332 eV in the experimental Ni K-edge XANES of NiO is not present in the FEFF-calculated spectra. There is, however, a small peak at 8337 eV in the FEFF-calculated spectra, which we believe is the Ni 1s-3d peak. The inaccuracy in the position of the peak may be due to the muffin-tin approximation used in FEFF.

Fig. 5 Results of the similarity ranking returned by the ELSIE matching algorithm on (a) Ni K-edge XANES of NiO; (b) Co K-edge XANES of LiCoO2; and (c) V K-edge XANES of V2O5. Detailed information about the retrieved compounds can be found on the Materials Project website: (a) NiO (\(Fm\bar 3m\), mp-19009), NiO2 (P63/m1, mp-543096), NiO2 (\(R\bar 3m\), mp-25593), and NiO (\(Fm\bar 3m\), mp-715434); (b) Li(CoO2)2 (P2/m, mp-553952), Li6CoO4 (P42/nmc, mp-18925), CoO2 (\(P\bar 3m1\), mp-714976), and LiCoO2 (\(R\bar 3m\), mp-24850); and (c) V2O5 (C2/c, mp-542844), VO2 (Pnnm, mp-714880), V6O13 (Cmcm, mp-715617), and V9O17 (P1, mp-716723), in decreasing similarity order

For LiCoO2 (Fig. 5b), the ground state structure of LiCoO2 (\(R\bar 3m\)) is among the top five entries. All Co3+ ions in the top entry (Li(CoO2)2) are in octahedral coordination, i.e., the same coordination environment as that of the Co3+ ions in LiCoO2 (\(R\bar 3m\)). We may, therefore, conclude that the ELSIE algorithm performs satisfactorily in both instances.

Figure 5c shows the ELSIE spectra matching results for the V K-edge of V2O5 (Pmmn). The ELSIE algorithm fails to retrieve the correct square-pyramidal coordination environment of V5+ in V2O5 (Pmmn). Indeed, vanadium ions in the top five matches returned by the ELSIE algorithm are in octahedral coordination. Here, the relative similarity of the V K-edge spectra for the different V oxidation states and coordination environments seems to be the key issue. Further structural refinement based on EXAFS simulations therefore becomes critical; EXAFS spectra will be made available in the XASdb in the near future.

In conclusion, we have demonstrated the development of a large database for XAS using high-throughput FEFF calculations. Parameter benchmark results indicate that the overall quality of the FEFF9 calculations with default input parameters is in quantitative agreement with experiments, which is adequate for comparison purposes. We developed a novel spectra-matching algorithm, the ELSIE algorithm, that enables the rapid matching of computed reference spectra to any target spectrum. The ensemble learning approach far outperforms any single approach based on a pre-defined preprocessing scheme and similarity metric; accuracies of ~84% and ~79% in identifying the correct oxidation state and coordination environment, respectively, are demonstrated on a diverse test set comprising 19 experimental XANES spectra. The XASdb with the ELSIE algorithm has been integrated into a web application in the Materials Project, providing an important new public resource for the analysis of XAS to all materials researchers, and the ELSIE algorithm itself has been made available as part of veidt, an open-source machine-learning library for materials science.

Methods

Benchmarking details

Robust, well-defined datasets are necessary for any benchmarking exercise. We have used the existing high-quality K-edge XAS spectra available in the open EELS Data Base (EELSDb)8 as reference data, and matched them with the corresponding materials in the Materials Project12 using the Materials API41 and pymatgen.13 For materials in the EELSDb without structural information, ground state structures with identical chemical compositions in the Materials Project were used. Where the EELSDb contains several spectra for the same material, we selected one of them for our study. Supplementary Table 1 summarizes the 13 unique materials used in this work.

FEFF

The FEFF software calculates X-ray absorption spectra using the RSGF formulation of the multiple-scattering theory.11 The X-ray absorption μ is written in terms of the imaginary part of the one-particle Green’s function G(r,r′; E), which incorporates both the inelastic losses and other quasiparticle effects. In terms of G(r,r′; E), μ is given by:

$$\mu = - \frac{1}{\pi }{\rm Im}\left\langle {c\left| {\hat \epsilon \cdot r\,G\left( {r,r^{\prime };E} \right)\,\hat \epsilon \cdot r^{\prime }} \right|c} \right\rangle \theta _{\mathrm{\Gamma }}\left( {E - E_{\rm F}} \right),$$
(2)

where \(\left| c \right\rangle\) is the core state, \(\hat \epsilon\) is the X-ray polarization vector, and \(\theta_{\Gamma}\) is a broadened step function at the Fermi energy \(E_{\rm F}\). This yields a unified treatment of EXAFS and XANES. The treatment of X-ray absorption can then be separated into atomic and scattering parts, i.e., \(G(r,r^{\prime};E) = G_{\rm C}(r,r^{\prime};E) + G_{\rm SC}(r,r^{\prime};E)\). The exact result for \(G_{\rm SC}(r,r^{\prime};E)\) is given by the full matrix inverse, or equivalently, a sum over all multiple-scattering paths.42 For the XANES calculation, FEFF implements the FMS technique, which includes the contributions from all orders of scattering within a cluster containing the absorber and scatterers. The FEFF code also incorporates a GW-based self-energy based on the Hedin-Lundqvist plasmon-pole model, which includes effects of electron–electron interactions such as mean-free paths and self-energy shifts. This method has been well tested and is usually a good approximation for EXAFS and reasonable for XANES. FEFF includes a screened core hole and gives results for excitonic enhancements comparable to GW/Bethe-Salpeter equation (BSE) calculations in many materials. FEFF can also incorporate Debye-Waller factors using correlated-Debye or more advanced models. Further details on the FEFF code and its theoretical foundations can be found in ref. 11 for interested readers.

In the FEFF input file, parameters are specified in control “cards”. The following parameters in FEFF were tested for convergence.

  i. Self-consistent field (SCF): The rfms1 field in the SCF card specifies the radius of the cluster considered in the FMS calculation during the self-consistent potential calculation. The higher the rfms1, the more atoms are included in the calculation.

  ii. Full multiple scattering (FMS): The rfms field in the FMS card determines the total number of multiple-scattering paths considered in the XANES calculation. Default values are used for the other five optional fields in the FMS card.

  iii. EXCHANGE: The EXCHANGE card specifies the exchange-correlation potential model used for the XANES calculation. No shift was applied to the Fermi energy level in this work, i.e., the second and third fields of the EXCHANGE card were kept at 0.

  iv. COREHOLE: The COREHOLE card is used to specify the treatment of the core hole during XAS calculations. The “core hole” is the hole left in an orbital by the excitation of a single electron from that orbital.5 In the FEFF9 code, a combination of BSE and time-dependent density functional theory (TDDFT) is used to improve the approximation of the core-hole interactions.10,20
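For orientation, the defaults selected in the benchmarking above (rfms1 = 7 Å, rfms = 9 Å, the Hedin-Lundqvist exchange model with zero shifts, and RPA core-hole screening) could be collected and rendered as control cards along the following lines. This is only a sketch: the dictionary and the `render_cards` helper are hypothetical, the card arguments reflect our reading of the FEFF9 card syntax and should be checked against the FEFF documentation, and the output is not a complete input file (cards such as TITLE, POTENTIALS, ATOMS, and XANES are omitted). It does not reproduce the Pymatgen input sets mentioned in the text.

```python
# Hypothetical collection of the benchmarked defaults, one entry per card.
DEFAULT_XANES_CARDS = {
    "EDGE": "K",             # absorption edge considered in this work
    "SCF": "7.0",            # rfms1: SCF cluster radius in angstrom
    "FMS": "9.0",            # rfms: FMS cluster radius in angstrom
    "EXCHANGE": "0 0.0 0.0", # assumed: model index 0 = Hedin-Lundqvist, zero shifts
    "COREHOLE": "RPA",       # RPA-screened core hole
}


def render_cards(cards):
    """Render a mapping of card name -> argument string as one card per line."""
    return "\n".join(f"{name} {args}".strip() for name, args in cards.items())


print(render_cards(DEFAULT_XANES_CARDS))
```

Keeping the defaults in a single mapping makes it straightforward to override one card for a benchmark run (for example, scanning the SCF radius) while leaving the others fixed.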

ELSIE algorithm construction

We adopted the ensemble method to identify the spectra in the database that are most similar to a target spectrum. Each weak learner is a unique combination of a few spectral preprocessing techniques and one similarity metric. We describe the preprocessing approaches and similarity metrics in turn.

Each preprocessor comprises a series of steps, designed to emphasize or weaken certain characteristics of the experimental and computed spectra. A preprocessor is generated as follows:

  1. Peak shifting and quantization: This step is required for all preprocessors. Because of the differences in energy sampling intervals and energy ranges, linear interpolation was used to convert each spectrum to a vector of 200 intensity values on an identical energy grid. The reference spectra are shifted such that the onset of absorption, which is well-defined by the photoelectric effect, is aligned with that of the target spectrum. This onset is determined as the lowest incident energy at which the computed absorption intensity reaches 6% of the peak intensity.

  2. Pre-normalization: We included an optional pre-normalization step to rescale the intensity to a similar range. Given a spectrum X, where \(X_i\) is the ith of n intensity values, four normalization approaches are adopted:43

    $$X_i^{{\rm norm}} = \frac{{X_i}}{{ {\sum } X_i}}.$$
    (3)
    $$X_i^{{\rm{norm}}} = {{{X_i}} \over {\sqrt {\sum X_i^2} }}.$$
    (4)
    $$X_i^{{\mathrm{norm}}} = \frac{{X_i - X_{{\mathrm{min}}}}}{{X_{{\mathrm{max}}} - X_{{\mathrm{min}}}}}.$$
    (5)
    $$X_i^{{\mathrm{norm}}} = \frac{{X_i - \mu }}{\sigma },\quad \mu = \frac{1}{n}\sum X_i,\quad \sigma = \sqrt {\frac{1}{n}\sum {\left( {X_i - \mu } \right)^2} } .$$
    (6)

  3. Feature transformation: Several feature transformation functions were implemented in the third step, including the square root and sigmoid squashing functions. The sigmoid squashed spectrum is calculated using \(X^{\prime } = \frac{{1 - {\mathrm{cos}}\left( {\pi X} \right)}}{2}\), and the square root squashing uses \(X^{\prime } = \sqrt X\), where \(X^{\prime}\) is the new, squashed spectrum. This technique has been shown to improve the response sensitivity with respect to different spectral features.44 The feature transformation functions also include taking the first- or second-order derivative of the spectrum, or weighting the spectrum with its first- and second-order derivatives. This step is what makes the weak learners distinct from one another.

  4. Normalization: This last step is applied in all preprocessors. The spectra are all normalized such that the sum of intensities is equal to 1, i.e., \(\mathop {\sum}\nolimits_{i = 1}^{\rm D} {X_i} {\mathrm{ = }}1\).

Both the computed and target spectra are processed using the same series of steps for each preprocessor.
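The preprocessing steps above can be sketched as a single function. This is a minimal illustration rather than the ELSIE implementation: all function names are hypothetical, only the two squashing transformations are included (the derivative-based ones are omitted), and the order of shifting before interpolation as well as the choice of grid are assumptions of this sketch.

```python
import numpy as np


def find_onset(energy, mu, fraction=0.06):
    """Lowest energy at which mu reaches `fraction` of its peak value."""
    energy = np.asarray(energy, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(energy[np.argmax(mu >= fraction * mu.max())])


def pre_normalize(x, method):
    """Optional pre-normalization, Eqs. (3)-(6); `None` skips the step."""
    if method is None:
        return x
    if method == "sum":
        return x / x.sum()
    if method == "l2":
        return x / np.sqrt(np.sum(x**2))
    if method == "minmax":
        return (x - x.min()) / (x.max() - x.min())
    if method == "zscore":
        return (x - x.mean()) / x.std()  # population std, i.e. division by n
    raise ValueError(f"unknown pre-normalization: {method}")


def squash(x, method):
    """Square-root or sigmoid squashing; assumes non-negative input."""
    if method == "sqrt":
        return np.sqrt(x)
    if method == "sigmoid":
        return (1.0 - np.cos(np.pi * x)) / 2.0
    raise ValueError(f"unknown transformation: {method}")


def preprocess(energy, mu, target_onset, grid, pre_norm="minmax", transform="sigmoid"):
    """Shift to the target onset, interpolate onto `grid`, transform, normalize."""
    energy = np.asarray(energy, dtype=float)
    shifted = energy + (target_onset - find_onset(energy, mu))
    x = np.interp(grid, shifted, mu)  # linear interpolation onto the common grid
    x = squash(pre_normalize(x, pre_norm), transform)
    return x / x.sum()                # final normalization: intensities sum to 1


# Synthetic check: a reference that is the target displaced by 3 eV.
e = np.linspace(-20.0, 60.0, 400)
mu = 1.0 / (1.0 + np.exp(-e))
onset = find_onset(e, mu)
grid = onset + np.linspace(-10.0, 45.0, 200)  # 200 points around the onset
target_vec = preprocess(e, mu, onset, grid)
reference_vec = preprocess(e + 3.0, mu, onset, grid)
print(np.allclose(target_vec, reference_vec))
```

Not every combination is meaningful: the square-root transformation, for instance, is undefined for the negative values produced by the z-score normalization, which is one reason why only a subset of the possible preprocessor combinations is useful in practice.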

The preprocessed target and computed spectra are then compared in a pairwise manner using a similarity metric. Only bin-to-bin similarity metrics are used in the ELSIE algorithm development as they are less computationally demanding for high-throughput datasets.45 Four similarity metrics commonly used in the literature are adopted in the ELSIE algorithm:

  1. 1.

    Pearson correlation: as defined in the Benchmarking section.

  2. 2.

    Euclidean similarity: In the D-dimensional spectral feature space, the Euclidean distance between two spectra X and Y is given by the following equation:

    $$d_{{\mathrm{Euc}}}{\mathrm{ = }}\sqrt {\mathop {\sum}\nolimits_{i = 1}^{\rm D} {\left| {X_i - Y_i} \right|^2} } .$$
    (7)

    The spectral similarity measure can be derived from the distance calculated using the following expression:

    $$S_{{\mathrm{Euc}}}\left( {X,Y} \right) = 1 - \frac{{d_{{\mathrm{Euc}}}\left( {X,Y} \right)}}{{d_{{\mathrm{Euc}}}^{{\mathrm{max}}}}},$$
    (8)

    where \(d_{{\mathrm{Euc}}}^{{\mathrm{max}}}\) is the absolute maximum expected Euclidean distance between two probability mass functions.45

  3. 3.

    Cosine similarity: The cosine similarity measure is the normalized inner product and measures the angle between two spectral vectors.46 The cosine similarity between two spectra can be calculated as:

    $$S_{{\mathrm{Cos}}} = \frac{{\mathop {\sum }\nolimits_{i = 1}^{\mathrm{D}} X_iY_i}}{{\sqrt {\mathop {\sum }\nolimits_{i = 1}^{\mathrm{D}} X_i^2} \sqrt {\mathop {\sum }\nolimits_{i = 1}^{\mathrm{D}} Y_i^2} }}.$$
    (9)
  4. 4.

    Ruzicka similarity: The Ruzicka45 similarity between two spectra is given by the following equation:

$$S_{{\mathrm{Ruz}}} = \frac{{\mathop {\sum }\nolimits_{i = 1}^{\mathrm{D}} {\mathrm{min}}\left( {X_i,Y_i} \right)}}{{\mathop {\sum }\nolimits_{i = 1}^{\mathrm{D}} {\mathrm{max}}\left( {X_i,Y_i} \right)}}.$$
(10)
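The four similarity metrics can be sketched as follows. This is an illustrative sketch, not the veidt code: the function names are hypothetical, the Pearson correlation is assumed to be the standard coefficient returned by `numpy.corrcoef`, and the maximum Euclidean distance is assumed to be √2, which is the geometric upper bound for two non-negative vectors that each sum to 1.

```python
import numpy as np


def pearson_similarity(x, y):
    """Standard Pearson correlation coefficient between two spectra."""
    return float(np.corrcoef(x, y)[0, 1])


def euclidean_similarity(x, y, d_max=np.sqrt(2.0)):
    """Eqs. (7)-(8): 1 - d_Euc / d_max.

    d_max defaults to sqrt(2), an assumed bound valid for sum-normalized,
    non-negative spectra.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    d = np.sqrt(np.sum(np.abs(x - y) ** 2))
    return float(1.0 - d / d_max)


def cosine_similarity(x, y):
    """Eq. (9): normalized inner product of the two spectral vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def ruzicka_similarity(x, y):
    """Eq. (10): sum of bin-wise minima over sum of bin-wise maxima."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.minimum(x, y).sum() / np.maximum(x, y).sum())


# Hypothetical sum-normalized spectra on a common energy grid.
target = np.array([0.2, 0.3, 0.5])
reference = np.array([0.25, 0.25, 0.5])
scores = {
    "pearson": pearson_similarity(target, reference),
    "euclidean": euclidean_similarity(target, reference),
    "cosine": cosine_similarity(target, reference),
    "ruzicka": ruzicka_similarity(target, reference),
}
```

All four functions compare the spectra bin by bin, so both inputs must be sampled on the same energy grid and have the same length D.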

The combination of preprocessors and similarity metrics results in a total of 168 learners that can potentially be used to construct the ELSIE algorithm. To make an ensemble that outperforms individual learners, one prerequisite is that each learner should have an error rate lower than random guessing. We therefore filtered the 168 learners down to 33 and adopted these in the ELSIE algorithm. The detailed filtering procedure can be found in the Supplementary Information.

For each target spectrum, each learner (one preprocessor + one similarity metric) outputs similarity scores for the reference spectra. However, the quantitative scores from different similarity metrics cannot be compared directly, even for the same target spectrum. In the ELSIE algorithm, we instead combine the rankings of the reference spectra from each learner to derive an ensemble result. For a mixture of classifiers of various types, ranking-based combination methods have been shown to be more reliable.47 Based on the rankings, we compute the Borda count, defined as the number of candidates that are ranked equal to or below the specific candidate. For example, the top spectrum among ten computed candidates would receive a Borda count of 10, while the second-ranked spectrum would receive a Borda count of 9. For each target spectrum, the Borda counts of the reference spectra under all learners are then summed to arrive at a consensus ranking.48
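The Borda aggregation described above can be illustrated with a short numpy sketch. The function names and the toy score matrix are hypothetical, and ties are counted according to the "equal to or below" definition given in the text.

```python
import numpy as np


def borda_counts(scores):
    """Borda count for one learner.

    For each candidate, count how many candidates (itself included) have a
    similarity score equal to or lower than its own.
    """
    scores = np.asarray(scores, dtype=float)
    # Entry (i, j) is True when candidate j scores no higher than candidate i.
    return (scores[None, :] <= scores[:, None]).sum(axis=1)


def consensus_ranking(score_matrix):
    """Sum Borda counts over learners and rank the references by the total.

    score_matrix has shape (n_learners, n_references).
    """
    score_matrix = np.asarray(score_matrix, dtype=float)
    totals = np.sum([borda_counts(row) for row in score_matrix], axis=0)
    order = np.argsort(-totals, kind="stable")  # best reference first
    return totals, order


# Three hypothetical learners scoring three reference spectra.
score_matrix = [
    [0.9, 0.5, 0.7],
    [0.2, 0.1, 0.8],
    [0.6, 0.3, 0.5],
]
totals, order = consensus_ranking(score_matrix)
# totals -> [8, 3, 7]; order -> [0, 2, 1]
```

Because only the ranks enter the sum, learners whose raw scores live on different scales contribute equally to the consensus.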

Finally, the Borda counts of all reference spectra are combined with a penalty term for the peak shift and converted to a probabilistic estimate using a modified softmax function. The probability of a reference spectrum Xk is denoted by P(Xk), where the superscript k indicates the k-th spectrum, and is calculated as follows:

  1.

    The Borda count of each reference (Rk) is normalized with respect to the sum of the counts over all references: \(R_{{\mathrm{norm}}}^{\mathrm{k}} = \frac{{R^{\mathrm{k}}}}{{{\sum} {R^{\mathrm{k}}} }}\). This step is required to avoid overflow in the exponential.

  2.

    \(P\left( {X^{\mathrm{k}}} \right)\) is then calculated by the following equation:

$$P\left( {X^{\mathrm{k}}} \right) = \frac{{\exp \left( {R_{{\mathrm{norm}}}^{\mathrm{k}}} \right){\mathrm{exp}}\left( { - \frac{{\alpha \left| {\Delta S^{\mathrm{k}}} \right|}}{{\delta _{\rm S}}}} \right)}}{{{\sum} {\exp \left( {R_{{\mathrm{norm}}}^{\mathrm{k}}} \right){\mathrm{exp}}\left( { - \frac{{\alpha \left| {\Delta S^{\mathrm{k}}} \right|}}{{\delta _{\rm S}}}} \right)} }},$$
(11)

where \(\Delta S^{\rm k} = S^{\rm k} - \bar S\), \(S^{\rm k}\) is the peak shift between the reference spectrum Xk and the target spectrum, \(\bar S\) is the mean peak shift of the reference spectra, and δS is the standard deviation of Sk. The coefficient α is fitted to the test dataset. The term \({\mathrm{exp}}\left( { - \frac{{\alpha \left| {\Delta S^{\mathrm{k}}} \right|}}{{\delta _{\rm S}}}} \right)\) therefore imposes a larger penalty on large peak shifts than on small ones.
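A minimal sketch of Eq. (11) is given below. The function name, the example peak shifts, and the value of α are hypothetical. The population standard deviation is assumed for δS, and the penalty is set to 1 when all peak shifts are identical, a case the text does not address.

```python
import numpy as np


def elsie_probabilities(borda_totals, peak_shifts, alpha):
    """Modified softmax of Eq. (11).

    borda_totals: summed Borda count R^k for each reference spectrum.
    peak_shifts:  peak shift S^k between each reference and the target.
    alpha:        penalty coefficient (fitted in the paper; supplied here).
    """
    r = np.asarray(borda_totals, dtype=float)
    s = np.asarray(peak_shifts, dtype=float)

    r_norm = r / r.sum()   # step 1: normalize the counts by their sum
    delta = s - s.mean()   # Delta S^k = S^k - mean(S)
    sigma = s.std()        # assumed: population standard deviation

    if sigma > 0:
        penalty = np.exp(-alpha * np.abs(delta) / sigma)
    else:
        penalty = np.ones_like(s)  # assumed: no penalty if all shifts agree

    weights = np.exp(r_norm) * penalty
    return weights / weights.sum()


# Hypothetical Borda totals and peak shifts (in eV) for three references.
probs = elsie_probabilities([8, 3, 7], [0.5, 2.0, 0.4], alpha=1.0)
best = int(np.argmax(probs))
```

Setting `alpha=0` removes the peak-shift penalty and reduces the expression to an ordinary softmax over the normalized Borda counts.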

The algorithm itself has been highly optimized by leveraging well-established numerical packages such as numpy and scipy.49,50 On a laptop computer with a single 2.6 GHz Intel i5 CPU and 2 GB of RAM, the ELSIE algorithm can perform a comparison between a target spectrum and a candidate spectrum in about 0.03 s. Typically, 20–30 spectra are selected for comparison according to the rules that the computational reference spectra should have the same absorbing species, a limited number of elements, and Ehull < 100 meV/atom. The overall time to perform a complete ranking is, therefore, around 1 s, which allows for on-the-fly matching of uploaded spectra.

Data availability

The computed spectra in the XASdb have been made available on the Materials Project website. A new web application, the XASApp (https://materialsproject.org/#apps/xas/), has been developed that allows any user to compare multiple X-ray absorption spectra and to find matches within the XASdb for an uploaded spectrum using the ELSIE algorithm.

The ELSIE algorithm has also been made publicly available as part of veidt, an open-source Python machine-learning library for materials science developed by the Materials Virtual Lab, which is available on the Python Package Index and GitHub (https://github.com/materialsvirtuallab/veidt).