Density-of-states similarity descriptor for unsupervised learning from materials data

Kuban, Martin; Rigamonti, Santiago; Scheidgen, Markus; Draxl, Claudia

doi:10.1038/s41597-022-01754-z

Analysis
Open access
Published: 22 October 2022

Scientific Data volume 9, Article number: 646 (2022)

Subjects

Condensed-matter physics

Abstract

We develop a materials descriptor based on the electronic density-of-states (DOS) and investigate the similarity of materials based on it. As an application example, we study the Computational 2D Materials Database (C2DB) that hosts thousands of two-dimensional materials with their properties calculated by density-functional theory. Combining our descriptor with a clustering algorithm, we identify groups of materials with similar electronic structure. We introduce additional descriptors to characterize these clusters in terms of crystal structures, atomic compositions, and electronic configurations of their members. This allows us to rationalize the found (dis)similarities and to perform an automated exploratory and confirmatory analysis of the C2DB data. From this analysis, we find that the majority of clusters consist of isoelectronic materials sharing crystal symmetry, but we also identify outliers, i.e., materials whose similarity cannot be explained in this way.

Introduction

The creation of databases for computational materials science has led to a huge amount of stored calculations, far exceeding any human’s ability to comprehend the information in them. Thus, algorithmic data-analysis methods need to be leveraged to allow knowledge extraction from this large pool of data. Domain-specific search interfaces, provided by public databases^1,2,3,4,5, are one way to make information findable. These interfaces allow researchers to identify materials of interest, e.g., in terms of structural features like space group or atom types, or in terms of properties like the electronic band gap. However, such features alone provide little insight. Furthermore, the use of search interfaces is mostly limited to confirmatory analysis: Having a concrete physical mechanism in mind, e.g., the change of properties of alloys with stoichiometry, researchers can manually search for materials that allow them to confirm, or refute, a hypothesis.

Learning from data, however, is not limited to this kind of analysis. For instance, relations between materials in terms of certain properties may become apparent only in large quantities of data. To reveal such relations and make use of them, both an in-depth understanding of when we consider materials to be similar and powerful data-analysis methods are required. A prerequisite for understanding how different materials relate to one another is the availability of descriptive numerical representations (descriptors) that accurately capture (dis)similarities, e.g., stemming from the atomic and/or electronic structure.

In the past years, several descriptors of the atomic structure have been published^6,7,8 and successfully applied for the prediction of material properties using machine learning (ML) techniques. However, descriptors based on the electronic structure are not well established in the ML community. In early work of Isayev and coworkers⁹, descriptors of both the electronic density-of-states (DOS) and the band structure are used to create a graphical representation of more than 20000 materials from the AFLOWlib database. More recently, supervised ML was proposed¹⁰ to predict electronic densities-of-states by their decomposition in local atomic contributions. Furthermore, a descriptor based on atomic distances, the projected densities of states (PDOS), and the Kohn-Sham band-gap was shown¹¹ to improve the prediction of computationally expensive material properties.

The majority of ML approaches in materials science focus on speeding up research. This concerns, for instance, the prediction of materials properties that are time-consuming to compute, like the electronic band gap, or the optimization of established methods, e.g., speeding up molecular-dynamics simulations through ML-based force fields. Thereby, highly non-linear ML models and/or complex material descriptors are necessary to achieve decent accuracy of predictions. Moreover, the underlying data are typically considered only as input for the ML models, and are not further analyzed.

In this work, we aim at obtaining a deeper understanding of large materials data spaces by rationalizing the reasons behind features that materials may share. We demonstrate our approach by the similarity of materials in terms of their electronic properties. To this end, we develop a tunable DOS fingerprint that encodes the DOS of a material into a binary-valued two-dimensional (2D) map, inspired by the work of Ref. ⁹. Combining it with unsupervised ML methods, we showcase its use by revealing similarities in the electronic structure of materials from the Computational 2D Materials Database (C2DB)². We are able to uncover not only expected trends, e.g., clusters consisting of materials containing isoelectronic substitutions of atomic species, but also unexpected correlations, e.g., clusters of structurally very different materials. Our results show that explorative analysis of a database allows for finding relations between materials which could not be foreseen without comprehensive, data-driven analysis.

Methods

Electronic density-of-states fingerprints

The analysis of spectra like the DOS is typically done by visual inspection, i.e., in a qualitative manner. For large datasets, this kind of analysis quickly becomes unfeasible. Therefore, a descriptor that allows for automated processing of such data is required. This includes a suitable numerical representation of the DOS. In the following, we review such representations that have been proposed in the literature, state their drawbacks, and tell how we overcome them.

To quantitatively compare the DOS of materials, Isayev et al.⁹ constructed a DOS fingerprint by encoding the DOS in the energy range between −10 and 10 eV as a series of 256 float (4 bytes) numbers. A similar point-wise representation was considered in Ref. ¹⁰ for building predictive models for the DOS based on Gaussian regression. It was pointed out that such representations were inefficient, as they may potentially require many sampling points of the DOS to efficiently train the models. Moreover, loss functions based on that representation turned out largely insensitive to spectral features with small overlap. To overcome these problems, the authors proposed two approaches: i) a truncated basis expansion based on principal-component analysis (PCA), which leads to an effective reduction of the degrees of freedom of the fingerprint (effectively smoothing the DOS spectra), and ii) a representation based on the cumulative distribution function associated to the DOS. The latter improved on the sensitivity of the loss functions to non-overlapping spectral features. More recently, a high-dimensional fingerprint based on the DOS projected on different atomic orbitals and sites, followed by PCA dimensionality reduction, has been proposed¹¹. A similar fingerprint was later used to predict the G₀W₀ band energies for materials in the C2DB¹².

A common drawback of all these DOS fingerprints is that they weight the contributions from the entire energy range considered in the spectra equally. Thus, they do not account for the fact that different energy regions are associated with distinct physical phenomena. For instance, the shape of the DOS close to the top of the valence band and the size of the band gap are the most important aspects in the search for p-doped materials. Likewise, for metals, the magnitude and shape of the DOS around the Fermi energy are most relevant. Other research may focus on features of the conduction band. Although the PCA-based approach mentioned above can effectively lead to a re-weighting of spectral features, it cannot be tailored at will to focus on specific regions, because the weighting is determined by the training data used for the construction of the descriptors.

To overcome the described issues, we have developed a DOS fingerprint that allows for a tailored weighting of spectral features. Using a non-uniform discretization of the energy axis, the fingerprint can be adapted to focus on desired energy regions. To achieve this discretization, the DOS is transformed into a two-dimensional raster image (Fig. 1(d)) as follows: First, the spectrum is shifted such that the energy ε = 0 is located at a reference energy ε_ref, which defines the main focus of the fingerprint. Then, the DOS ρ(E) (Fig. 1(a)) is integrated over an even number N_ε of intervals of variable widths Δε_i, to obtain a histogram {ρ_i} (Fig. 1(b)):

$${\rho }_{i}={\int }_{{\varepsilon }_{i}}^{{\varepsilon }_{i+1}}\rho (\varepsilon )d\varepsilon ,$$

(1)

with $i\in \left[-{N}_{\varepsilon }/2,{N}_{\varepsilon }/2\right]$, $i\in {\mathbb{Z}}$, ${\varepsilon }_{0}=0$, ${\varepsilon }_{i+1}={\varepsilon }_{i}+\Delta {\varepsilon }_{i}$ for i ≥ 0, and ε_−i = −ε_i. The integration intervals $\Delta {\varepsilon }_{i}$ are defined as

$$\Delta {\varepsilon }_{i}=n({\varepsilon }_{i},W,N)\Delta {\varepsilon }_{min},$$

(2)

where Δε_min is a parameter giving the minimal integration width and the integer-valued function

$$n(\varepsilon ,W,N)=\lfloor \,g(\varepsilon ,W)N+1\rfloor \in [1,N].$$

(3)

$\lfloor \cdot \rfloor $ denotes the ‘round down’ operator and $g\left(\varepsilon ,W\right)=\left(1-{\rm{\exp }}\left(-{\varepsilon }^{2}/2{W}^{2}\right)\right)$. Here, the parameter $N\in {\mathbb{N}}$ (N > 1) determines the maximum interval width $N\Delta {\varepsilon }_{min}$, and the parameter W determines the feature region: For ε = 0, $\Delta {\varepsilon }_{i}$ equals $\Delta {\varepsilon }_{min}$, while it approaches $N\Delta {\varepsilon }_{min}$ for $| \varepsilon | > W$. In this way, a finer discretization of the histogram is obtained for energies in the feature region |ε| < W. This is illustrated by the integration limits indicated by vertical lines in Fig. 1(b). From this histogram, a raster graphic is generated by defining a grid of pixels, as shown in Fig. 1(c). Every column i of the histogram is discretized in a grid of N_ρ intervals of height

$$\Delta {\rho }_{i}=n({\varepsilon }_{i},{W}_{H},{N}_{H})\Delta {\rho }_{min}.$$

(4)

Here, the parameters W_H, N_H, and Δρ_min play a role analogous to W, N, and $\Delta {\varepsilon }_{min}$ above: Close to ε = 0, a fine discretization $\Delta {\rho }_{i}=\Delta {\rho }_{min}$ is obtained, while it approaches N_HΔρ_min for $| \varepsilon | > {W}_{H}$. Finally, the number of “filled” pixels in column i is determined by

$${\rm{\min }}\left(\lfloor \frac{{\rho }_{i}}{\Delta {\rho }_{i}}\rfloor ,{N}_{\rho }\right),$$

(5)

resulting in the 2D raster image in panel (d) of Fig. 1, containing N_ε × N_ρ pixels enumerated by an index α. This image is then transformed into a binary-encoded vector f = (f₁, …, f_{N_ε × N_ρ}) with component f_α = 1 if the pixel α is filled and 0 otherwise.
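The construction in Eqs. (1)-(5) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default parameter values, the numerical integration by interpolation onto a sub-grid, and the choice to evaluate Eq. (4) at the bin edge closest to ε = 0 are all hypothetical.

```python
import numpy as np

def n_steps(eps, width, n_max):
    """Eq. (3): integer multiplier in [1, n_max] that grows with |eps|."""
    g = 1.0 - np.exp(-eps**2 / (2.0 * width**2))
    return int(min(np.floor(g * n_max + 1), n_max))

def energy_edges(n_eps, d_eps_min, width, n_max):
    """Bin edges from Eq. (2), built outward from eps = 0 and mirrored."""
    pos = [0.0]
    for _ in range(n_eps // 2):
        pos.append(pos[-1] + n_steps(pos[-1], width, n_max) * d_eps_min)
    pos = np.array(pos)
    return np.concatenate((-pos[:0:-1], pos))

def dos_fingerprint(energies, dos, e_ref=0.0, n_eps=8, d_eps_min=0.05,
                    width=1.0, n_max=4, n_rho=16, d_rho_min=0.1,
                    width_h=1.0, n_max_h=4):
    """Binary fingerprint of length n_eps * n_rho (all defaults are illustrative)."""
    eps = np.asarray(energies, dtype=float) - e_ref  # shift reference to eps = 0
    dos = np.asarray(dos, dtype=float)
    edges = energy_edges(n_eps, d_eps_min, width, n_max)
    image = np.zeros((n_eps, n_rho), dtype=np.uint8)
    for col in range(n_eps):
        lo, hi = edges[col], edges[col + 1]
        # Eq. (1): integrate the DOS over the bin (trapezoidal rule on a sub-grid).
        x = np.linspace(lo, hi, 51)
        y = np.interp(x, eps, dos, left=0.0, right=0.0)
        rho = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))
        # Eq. (4): pixel height, evaluated at the bin edge closest to eps = 0
        # (an assumption made here so that both sides are treated symmetrically).
        d_rho = n_steps(min(abs(lo), abs(hi)), width_h, n_max_h) * d_rho_min
        # Eq. (5): number of filled pixels, capped at n_rho.
        filled = min(int(np.floor(rho / d_rho)), n_rho)
        image[col, :filled] = 1
    return image.ravel()
```

Calling `dos_fingerprint(energies, dos, e_ref=fermi_energy)` with a sampled DOS returns a flat 0/1 array; `fermi_energy` here stands for whatever reference energy the user selects.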

DOS similarity metric

The similarity between two materials i and j in terms of their DOS fingerprints f_i and f_j is denoted by S(f_i, f_j). As similarity metric, we use the Tanimoto coefficient (Tc)¹³, defined as:

$$S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)=\frac{{{\boldsymbol{f}}}_{i}\cdot {{\boldsymbol{f}}}_{j}}{| {{\boldsymbol{f}}}_{i}{| }^{2}+| {{\boldsymbol{f}}}_{j}{| }^{2}-{{\boldsymbol{f}}}_{i}\cdot {{\boldsymbol{f}}}_{j}}.$$

(6)

$S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)$ can be interpreted as the overlap of the areas covered by the raster images represented by f_i and f_j, divided by the union of the areas. S takes real values in the range [0, 1], being equal to 1 (0) if the images f_i and f_j are identical (have no overlap). A better idea of what these values mean can be obtained by considering two spectra of equal area A. In this case, the overlapping area is given by A · 2S/(1 + S). Thus, a value of S = 0.5 means an overlap of 2/3 of the areas.
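For binary fingerprints, the Tanimoto coefficient reduces to counting shared and total filled pixels. The relation quoted above follows from this: for two images of equal area A with overlap I, the union is 2A − I, so S = I/(2A − I) and hence I = A · 2S/(1 + S). The short Python sketch below illustrates both points; the function name and the convention of returning 1 for two empty fingerprints are assumptions made for this example.

```python
import numpy as np

def tanimoto(f_i, f_j):
    """Tanimoto coefficient of two binary fingerprints (intersection over union)."""
    f_i = np.asarray(f_i, dtype=bool)
    f_j = np.asarray(f_j, dtype=bool)
    shared = np.count_nonzero(f_i & f_j)
    union = np.count_nonzero(f_i) + np.count_nonzero(f_j) - shared
    # Two empty fingerprints are treated as identical (an assumed convention).
    return shared / union if union else 1.0

# Two fingerprints of equal area A = 4 that share I = 2 pixels.
a = [1, 1, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1, 1]
s = tanimoto(a, b)                      # 2 shared pixels, 6 in the union
overlap_fraction = 2.0 * s / (1.0 + s)  # should reproduce I / A
```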

As an example of this metric, Fig. 2 shows the DOS of four different materials from the C2DB and their respective similarities. In the considered energy interval, C₂ (graphene) has far fewer available states than the other examples. Mainly for this reason, it is dissimilar to all of them (S ≤ 0.14, see similarity matrix in Fig. 2). The DOS of MoS₂ is similar to that of FeO₂ in magnitude for |ε| > 1 eV, but since MoS₂ is a semiconductor and FeO₂ is a metal, the overall similarity is low (S = 0.4). MoS₂ and WMo₃S₈ exhibit a high similarity coefficient of S = 0.84, as both the shape and the magnitude are similar.

Clustering algorithm

A similarity metric allows for a range of practical applications, for instance, determining which materials from a dataset are most similar to a given reference. The latter could be a material with a desired property, for which one seeks alternatives. This kind of analysis is commonly applied in chemical similarity searching^13,14 or drug discovery¹⁵. A related application is the detection of (sub)sets of materials, i.e., clusters, that are more similar to one another than to other materials. In this work, we focus on this second application and develop a clustering algorithm that takes advantage of the following property of our similarity measure (Eq. 6): Its complement 1 − S is a distance measure that is identical to the Soergel distance for dichotomous fingerprints¹³. For binary-valued descriptors, it obeys the triangle inequality¹³, i.e.,

$$S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)\ge S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{k}\right)+S\left({{\boldsymbol{f}}}_{k},{{\boldsymbol{f}}}_{j}\right)-1$$

(7)

for any three fingerprints f_i, f_j, and f_k. This can be easily verified with the examples shown in Fig. 2. An important consequence is that any two materials that are more similar to a third one than a threshold S_thres, will be more similar than 2S_thres−1 to each other. This motivates a simple clustering algorithm as follows: Start by (i) making a list of the materials in the database. Then, (ii) identify the material (reference) with the highest number of other materials that are more similar to it than a given threshold S_thres. If no materials can be found for any reference, stop the algorithm, as all possible clusters are found. Otherwise, (iii) consider the found reference and its similar materials as a cluster and extract them from the list; and return to step (ii). The materials that do not belong to any cluster are considered orphans. When two materials have the same number of neighbors and share any of them, the cluster with the highest average similarity is selected.
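The steps (i)-(iii) can be sketched as a greedy loop over a precomputed similarity matrix. The following Python code is a simplified illustration, not the authors' implementation: the function name is hypothetical, neighbors are selected with S ≥ S_thres, and ties in the neighbor count are always broken by the mean similarity of the neighbors to the reference, regardless of whether the candidate clusters share members.

```python
import numpy as np

def greedy_clusters(sim, s_thres=0.75):
    """Greedy clustering on a symmetric similarity matrix.

    Returns (clusters, orphans); each cluster lists its reference first.
    """
    sim = np.asarray(sim, dtype=float)
    remaining = set(range(len(sim)))          # step (i): all materials
    clusters = []
    while True:
        best = None                           # (sort key, reference, neighbors)
        for ref in sorted(remaining):         # step (ii): find the best reference
            nbrs = [k for k in sorted(remaining)
                    if k != ref and sim[ref, k] >= s_thres]
            if not nbrs:
                continue
            # Rank by neighbor count, then by mean similarity (simplified tie-break).
            key = (len(nbrs), float(np.mean(sim[ref, nbrs])))
            if best is None or key > best[0]:
                best = (key, ref, nbrs)
        if best is None:                      # no reference has neighbors left
            break
        _, ref, nbrs = best
        clusters.append([ref] + nbrs)         # step (iii): extract the cluster
        remaining -= {ref, *nbrs}
    return clusters, sorted(remaining)        # leftover materials are orphans
```

Because every member is at least S_thres-similar to the reference, Eq. (7) guarantees a pairwise similarity of at least 2S_thres − 1 within each returned cluster.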

Materials descriptors

In this work, we combine the DOS fingerprint introduced at the beginning of this section with the clustering algorithm defined above to identify clusters of materials with similar electronic structure. To understand why such clusters are formed, we make use of descriptors.

The electronic spectrum of a material can often be understood by counting the valence electrons in the outermost shells of its constituent atoms. This counting can, in principle, be obtained from the average of the column numbers of the atoms in the unit cell:

$${\bar{c}}_{m}=\frac{1}{N}\mathop{\sum }\limits_{i}^{N}{c}_{im}$$

(8)

where i runs over all N atoms in the unit cell of material m, and c_im denotes their column in the Periodic Table of Elements (PTE). ${\bar{c}}_{m}$ is calculated for all materials in a cluster. If it is equal for all of them, we conclude that the cluster is formed by isoelectronic materials. Note that here we employ a lax definition of isoelectronicity that considers only electron counting but not electronic configuration. As an example, ${\bar{c}}_{m}$ for two Si atoms is identical to that of two C atoms or the combination of one Al and one P atom. We call this descriptor the PTE descriptor.
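Equation (8) and the isoelectronicity check can be illustrated with a few lines of Python. This is a sketch under assumptions: the small lookup table of PTE group numbers covers only elements needed for the examples, the function names are hypothetical, and exact fractions are used so that averages can be compared without rounding issues.

```python
from fractions import Fraction

# PTE group (column) numbers for a handful of elements; a real application
# would need a complete table (this subset is for illustration only).
PTE_COLUMN = {
    "C": 14, "Si": 14, "Al": 13, "P": 15,
    "Ti": 4, "Zr": 4, "Hf": 4, "Mo": 6, "W": 6,
    "S": 16, "Se": 16,
}

def mean_column(atoms):
    """Eq. (8): average PTE column over all atoms in the unit cell."""
    return Fraction(sum(PTE_COLUMN[a] for a in atoms), len(atoms))

def is_isoelectronic(cluster):
    """True if every material in the cluster has the same mean column."""
    return len({mean_column(atoms) for atoms in cluster}) == 1

# The example from the text: Si2, C2, and AlP share the same mean column.
same = mean_column(["Si", "Si"]) == mean_column(["C", "C"]) == mean_column(["Al", "P"])
```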

The geometry of the crystal structures can also help explain clusters obtained from the DOS similarity metric. Accordingly, we consider a similarity measure based on the space group (SG) of the crystal structures, after removing all information about the species that form the structure. In practice, this is achieved by first replacing all atoms by a single species and then employing the software package spglib¹⁶ with a tolerance of symprec = 1 × 10⁻¹ to find the SG of the resulting geometry. In the following we call this the SG descriptor.

Results

Identification of clusters

To identify sets of similar materials, we use the clustering algorithm described in the Methods section. We call the materials in a cluster its members and define the size of a cluster as the number of its members. The compactness of a cluster is determined by its radius r_c = 1 − S_min, with S_min ≥ 2S_thres − 1 being the minimum similarity between any two members of the cluster. In this work, we choose a similarity threshold of S_thres = 0.75. Therefore, all materials f_k within a cluster centered at the reference material f_ref have a similarity value of at least S_min = 0.5 to all other cluster members, i.e., an area overlap of at least 67%. With this choice, we find 294 distinct clusters that contain in total ~ 23% of the materials in the entire dataset. Materials not belonging to any cluster are called orphans. Among these, there are 2643 materials whose similarity to any other material in the dataset is less than S_thres. The remaining 54 orphans, about 2%, have at least one neighbor with similarity S ≥ S_thres, but each such neighbor is already part of another cluster (see Clustering algorithm above).

Figure 3(a) presents the distribution of cluster sizes on a logarithmic scale together with the maximal and mean cluster radii for clusters of a given size. About two thirds (200) of the clusters contain only two materials. Since the clustering algorithm requires that any member of the cluster has a similarity of at least S_thres to the reference material, the cluster radii for two-point clusters are bounded by r_c ≤ 0.25. The mean cluster radii for the clusters with more than two members increase to r_c ~ 0.4 with increasing cluster size. Interestingly, even though the clustering algorithm allows for the maximal cluster radius to be as large as r_c = 0.5, the maximal cluster radii of the discovered clusters are all smaller than 0.4.

To illustrate the similarity relations between materials, we calculate pairwise similarities between all materials in the dataset, i.e., a symmetric matrix with elements ${S}_{ij}=S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)$. In other words, each column and row of this matrix corresponds to the similarities of a single material to the rest of the dataset. The diagonal elements of the matrix are identical, S_ii = 1, as they describe the similarity of each material with itself. An excerpt of the full matrix can be seen in Fig. 3(b). The order of columns and rows has been chosen according to the cluster sizes, i.e., such that the largest cluster appears in the top left corner of the matrix, and the cluster size decreases with increasing index. The color code makes apparent that many of the clusters are very dissimilar to each other, i.e., the average similarity of the cluster members to the rest of the dataset is low. For some others, however, the opposite is the case, and one could expect that choosing a smaller threshold would merge them. An example of this will be given in the following section.
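For binary fingerprints stored as rows of an array, the full matrix S_ij can be obtained with a single matrix product, since f_i · f_j counts the shared filled pixels. The Python sketch below is one possible way to do this; the function name and the convention S = 1 for two empty fingerprints are assumptions of this example.

```python
import numpy as np

def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities for an (n_materials, n_pixels) 0/1 array."""
    f = np.asarray(fingerprints, dtype=np.int64)
    shared = f @ f.T                       # f_i . f_j for all pairs
    counts = f.sum(axis=1)                 # |f_i|^2 for binary vectors
    union = counts[:, None] + counts[None, :] - shared
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs with an empty union are set to 1 (assumed convention).
        return np.where(union > 0, shared / union, 1.0)

S = similarity_matrix([[1, 1, 0, 0],
                       [1, 0, 1, 0],
                       [0, 0, 0, 1]])
```

The result is symmetric with unit diagonal, so only the upper triangle needs to be stored for large datasets.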

We note that the parameters chosen here serve the purpose of showcasing our approach. Both the similarity threshold and the energy range can be varied to focus such an analysis on certain aspects of the data. For instance, to enhance compactness, the similarity threshold can be increased. This, however, reduces the number of discovered clusters and their size, which ultimately prevents the discovery of meaningful clusters. Conversely, the reduction of the similarity threshold increases both cluster size and number of clusters, at the expense of larger cluster radii. Cluster radii that are too large bear the risk of masking meaningful relations between data points in large clusters, which hinders the automatic analysis of clusters.

To illustrate these effects, we repeat the clustering process for a wide range of thresholds. The results are shown in Fig. 4(a): The black and orange curves indicate the total number of clusters and its subset of isoelectronic clusters, respectively, as a function of the similarity threshold. For low thresholds, all materials are contained in a few large clusters, which split into a larger number of smaller ones upon increasing the threshold. The maximum number of clusters is found at S_thres = 0.67. This value corresponds to an overlap of the spectral areas of about 50%. We choose a slightly larger value of S_thres = 0.75, corresponding to 67% overlap, to make the clusters more compact. As can be seen from Fig. 3(a), the maximum actual radius reached by a cluster is around 0.4, which means an overlap larger than 75%. This choice is indeed scientifically meaningful as the electronic structure in the chosen energy range (centered at the Fermi level) strongly impacts the structural stability and the dielectric response of the material, as well as the reactivity of its surfaces in, e.g., catalysis processes and chemisorption^17,18,19.
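The overlap percentages quoted for the different thresholds follow from two relations given in the Methods: the guaranteed pairwise similarity S_min = 2S_thres − 1 and the overlap fraction 2S/(1 + S) for spectra of equal area. The small Python sketch below reproduces this arithmetic; the helper names are hypothetical.

```python
def min_similarity(s_thres):
    """Lower bound on the similarity of two cluster members (from Eq. 7)."""
    return 2.0 * s_thres - 1.0

def overlap_fraction(s):
    """Overlapping fraction of two equal-area spectra with similarity s."""
    return 2.0 * s / (1.0 + s)

# Guaranteed overlap for the two thresholds discussed in the text.
for s_thres in (0.67, 0.75):
    s_min = min_similarity(s_thres)
    print(f"S_thres = {s_thres:.2f}: S_min = {s_min:.2f}, "
          f"overlap >= {overlap_fraction(s_min):.1%}")

# A cluster radius r_c corresponds to a minimum similarity of 1 - r_c.
overlap_at_max_radius = overlap_fraction(1.0 - 0.4)
```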

Figure 4(b) presents the DOS of six materials that are orphans. In fact, they remain orphans at S_thres as small as 0.1. Often, orphans have narrow bands leading to well localized peaks in the corresponding DOS. The figure includes four such cases, namely FeHfF₆, Li₂Cl₂O₄, FeZrCl₆, and Na₂F₂O₂. As an example, the latter is composed of FO molecules bridged by Na atoms, in a checkerboard pattern (see also https://cmrdb.fysik.dtu.dk/c2db/row/F2Na2O2-feda03610e19). Their bonding and antibonding states form narrow peaks of large intensity at −1.5 and 1.8 eV, respectively, while the Na states lie mostly outside the considered energy range. These sharp peaks have hardly any overlap with those of the other samples, so the resulting similarity is too small to form a cluster. Less often, orphans are large band-gap materials, such that no appreciable spectral weight is present in the considered energy range. This is the case, for instance, for MgCl₂ and Bi₂F₆.

Analysis of selected clusters

In the following we analyze individual clusters and reason why the materials in these clusters are similar.

Isoelectronic substitutions

Figure 5 presents the DOS (a) and crystal structures (b) of five transition-metal dichalcogenides (TMDC) forming a cluster. Its cluster radius r_c = 0.28 is close to the mean for this cluster size (see Fig. 3(a)). Visual inspection of the corresponding DOS reveals a pronounced overall similarity in terms of i) the shape of the spectra inside the feature region |E| ≲ 2 eV, and ii) the size of the PBE band gap, which varies from 0.52 eV (TiSe₂) to 0.65 eV (Hf₂Ti₂Se₈). Above 2 eV, the DOS become more dissimilar, as expected from the coarser representation of the DOS outside the feature region. The inset of panel (a) presents the similarity matrix of the materials contained in this cluster. Here, sub-clusters become apparent for materials with the same number of substituents, i.e., the materials containing one or two substituents are more similar to one another than to the other cluster members.

Considering the crystal lattice, the cluster members are very similar. All materials consist of a layer of TMs between two layers of Se. The cluster contains the binary phase (TiSe₂), as well as ternary phases, where either one or two Ti atoms are substituted with either Hf or Zr or both. The type of substitution has only a minor influence on the DOS. This does not come as a surprise, as all substitutions within this cluster are isoelectronic, i.e., with atomic species from group 4 of the PTE. We note here that there exists another cluster of materials with the same structural prototype, containing, among other materials, the binary compounds Se₈Hf₄ and Se₈Zr₄. These materials form a separate cluster because they have higher PBE band gaps, ranging between 0.72 eV (Se₈TiHf₃) and 0.82 eV (Se₈HfZr₃). Choosing a lower similarity threshold, these clusters merge.

To further demonstrate the isoelectronic behavior of the materials of the cluster considered here, we compare their PDOS in Fig. 5(c). Their valence bands are mainly composed of fully occupied Se 4p states. The conduction bands have predominant Ti 3d character, with additional contributions from 4d or 5d states of Zr and Hf (when present)²⁰. The latter all lie in the same energy range and sum up to the same number of d states of the four group-4 TMs. The hybridization of TM-3d with Se-p orbitals is evident from small contributions of d states in the valence region and Se-p states in the conduction region. In sum, the similarity of the electronic spectrum of these materials becomes clear: The replacement of Ti by either Hf or Zr does not alter the valence band, while the conduction states are composed of empty d shells of the transition metals, which amount to the same number of empty states.

Several other clusters exist in the dataset which consist of isoelectronic materials, i.e., they may contain different elements but have the same number of valence electrons. To discover them, we make use of the PTE descriptor introduced in the Methods section. Overall, our descriptor identifies 230 clusters, each having the same ${\bar{c}}_{m}$ for all its members (Eq. 8). This number corresponds to 78% of all clusters, and 16.5% of materials in the full dataset. Therefore, we conclude that isoelectronicity is a main reason for the similarity of the DOS of the materials in the C2DB. 88.8% of all clusters contain at least two materials that have the same ${\bar{c}}_{m}$.

The existence of clusters exclusively from isoelectronic materials is observed for a large range of similarity thresholds, as indicated in Fig. 4(a). Comparing their number (orange curve) to the total number of clusters (black curve) reveals that with increasing similarity threshold, the majority of compact clusters are isoelectronic.

Materials with isoelectronic surface groups

The second most common origin of similarity concerns the substitution of fluorine atoms at the materials’ surfaces with OH groups. Figure 6 presents an example of such clusters. These metallic materials consist of five alternating layers of carbon and either Ta or Nb. Again, Ta and Nb are isoelectronic, i.e., from group 5 of the PTE. At the two surfaces of the materials, either F atoms or OH groups form bonds with the underlying TM atoms.

The cluster radius is r_c ~ 0.28, which is close to the mean value for four-point clusters. The general shapes of the curves are similar. Inspection of the PDOS (e.g., https://cmrdb.fysik.dtu.dk/c2db/row/C2H2O2Nb3-137a187a149c) reveals that the whole spectrum is dominated by d states of the TM. They hybridize with C and TM p states. The p bands have the largest contribution around E ≈ 0 and E ≈ 1.5 eV, where also significant contributions from F and O p states are present. Thus, the saturated O in the hydroxyl group acquires an electronic configuration analogous to F and binds similarly to the TM atoms. In other words, the OH group can be regarded as isoelectronic to F. Minor differences between the spectra, as for instance displaced peaks below −1 eV, mainly originate from multiple van Hove singularities which are very sensitive to the precise location of band extrema and flat bands.

In total, we find 33 clusters in the dataset where F at the surface is interchanged with an OH group. In most of these cases, the clusters also contain sets of materials with other isoelectronic substitutions. Let us note in passing that, despite the fact that the similar behavior of fluorine atoms and hydroxyl groups is well-established expert knowledge in chemistry and electronic-structure theory, it is not trivial to access such knowledge in an automated manner. So far, the search interfaces of most databases, as well as descriptors for machine learning, rely on structural features, e.g., the chemical formula or the number of atoms in the unit cell. Thus, similar materials with, e.g., different numbers of atoms in the unit cell are unlikely to be found by these methods.

Role of crystal lattice

Now, we focus our search on clusters of materials composed of identical atoms but different host lattices. Figure 7 presents the data from three different phases of In₂S₂, which belongs to the class of post-transition metal chalcogenides. We designate them by their SG (value of their SG descriptor, cf. Methods), where we distinguish the two materials with SG 164 as SG 164-1 and SG 164-2. The semiconducting structures resulting from the phases SG 164-1 and SG 187 form a cluster due to the similarity of their DOS throughout the whole observed energy range. For comparison, we show the DOS of a third phase, SG 164-2, that shares the symmetry with the first material but shows markedly different behavior and, thus, is not part of the same cluster. While the similarity coefficient between the clustered materials is 0.76, the corresponding values with the third phase are 0.32 and 0.34, respectively. This finding goes hand in hand with the fact that the clustered materials have medium-sized band gaps of 1.60 eV (SG 187) and 1.68 eV (SG 164-1), respectively (values from https://cmrdb.fysik.dtu.dk/c2db/), whereas the third one is metallic. Despite sharing the space group, the two phases with SG 164 show significant structural differences, as is evident from the top views of the unit cells depicted in Fig. 7. This can be further illustrated by considering the stacking of In and S layers: For SG 164-1, the layer sequence corresponds to ABBC stacking; for SG 187, it is ABBA; for SG 164-2, it is ABDC. The (dis-)similarity of the different phases lies in the particular electronic configuration acquired by the atomic species: In the semiconducting phases, In adopts covalent bonding, manifested by a valence band that is dominated by hybridized S and In p states²¹. In this electronic configuration, the In atoms are tetrahedrally coordinated by three S atoms and one In atom. Here the In-In bond length is d_In-In = 2.82 Å. In the metallic phase, In atoms form metallic bonds with a significant contribution from In s states. In this case, every In atom is coordinated with three In atoms, and the bond length is d_In-In = 3.62 Å, which is close to that of bulk metallic In (3.38 Å). The metallic phase is metastable as compared to the semiconducting ones (see also https://cmrdb.fysik.dtu.dk/c2db/row/In2S2-ef93efd2b5c0).

We note that in this case the dissimilarity between the semiconducting and metallic phases can be explained neither by the SG nor the PTE descriptors. Nonetheless, our DOS similarity search is able to capture the underlying electronic configuration and put together structures with identical atomic coordination, albeit different crystal structure.

Outliers

Overall, there are 25 clusters in the dataset with materials that are neither isoelectronic nor share the crystal lattice, i.e., they cannot be explained by our SG and PTE descriptors. Therefore, as a final example, we focus on a cluster that consists of two materials that have no apparent similarities in their atomic structures, neither in symmetry nor in composition. They are presented in Fig. 8(b). While Ta₂BS₂ has the trigonal space group 164, Bi₄Cu₄ is characterized by an orthorhombic lattice with SG 51. Despite their structural dissimilarities, the DOS of both materials resemble each other with a similarity coefficient of 0.76, i.e., slightly above the threshold of S_thres = 0.75. Both materials exhibit a nearly constant DOS between −1.5 and 2 eV, while it increases below. To get a deeper insight, we show in Fig. 8 the band structures of Ta₂BS₂ (c) and Bi₄Cu₄ (d), indicating the atomic character of the bands, together with the corresponding DOS projected on the different atomic species. While in the case of Ta₂BS₂, only one band with mixed p-d character crosses the Fermi level, the energy spectrum of Bi₄Cu₄ exhibits several bands composed almost exclusively of Bi and Cu p states, giving rise to a more complicated topology of the Fermi surface. We conclude that the similarity of these materials’ DOS is accidental and that this cluster can indeed be considered an outlier.

This example demonstrates that our approach has great potential for identifying materials with a desired property or specific feature. This could, e.g., be the electronic band gap or the character of a band edge. Combining unsupervised learning with physics-informed descriptors allows one to explore materials spaces, also regions where one would not expect good candidates.
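The threshold test applied to the outlier cluster above can be illustrated with a short sketch. It assumes that fingerprints are equal-length binary vectors compared with the Tanimoto coefficient; this section does not restate the definition, so the functions `tanimoto` and `is_similar`, the non-strict comparison, and the toy vectors are illustrative assumptions and not the published implementation.

```python
S_THRES = 0.75  # similarity threshold quoted in the text


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary fingerprints (0/1 sequences)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Convention chosen here (an assumption): two all-zero fingerprints give 0.0.
    return both / either if either else 0.0


def is_similar(a, b, threshold=S_THRES):
    """True if the pair reaches the threshold (non-strict comparison is an assumption)."""
    return tanimoto(a, b) >= threshold


# Toy fingerprints, not real DOS data: 3 shared bits out of 4 set in total.
fp_a = [1, 1, 1, 1, 0]
fp_b = [1, 1, 1, 0, 0]
print(tanimoto(fp_a, fp_b), is_similar(fp_a, fp_b))
```

A pair with a coefficient of 0.76, as reported for the two materials discussed above, would pass this test by a narrow margin.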

Discussion

In this work, we have presented a fingerprint of the electronic DOS that allows one to quantitatively evaluate the similarity of materials in terms of their electronic structure. We have applied this fingerprint to the C2DB, a large, heterogeneous dataset of two-dimensional materials. Based on our similarity measure, we have devised a clustering algorithm to filter the data for sets of materials that exhibit pronounced similarities to one another. A significant number of (small) clusters have been identified and further analyzed. More specifically, 23% of the materials can be associated with at least one other material in the dataset. The majority of similarities in these particular materials can be explained by the similarity of the valence configuration of the involved atomic species, thus confirming physical expectations. This confirmatory analysis has been performed in an automatic fashion based on physically meaningful descriptors. In this way, we could identify, for instance, 16.5% of materials being isoelectronic to at least one other material of the investigated database. Moreover, we have shown that this observation is valid for a wide range of similarity thresholds, which indicates the robustness of the method. Our approach could be easily extended by introducing other descriptors with explanatory power.

Our work differs from other approaches in the literature. In many cases^9,12,22, similarity relations between data are employed to attain low-dimensional representations of the whole dataset. Such representations have been shown to be well suited for visualizing relations in material spaces, e.g., by highlighting materials with a certain feature, and subsequently identifying clusters of them. Such representations are used for a rough characterization of the materials space, which may lead to a global identification of descriptor-property correlations. In contrast, by using a strict similarity threshold that yields compact clusters, we focus on the local structure of the dataset. Our analysis confirms, in most cases, physical intuition. This allows us to automatically classify a large subset of clusters. Interestingly, it also leads to the discovery of outliers, i.e., clusters that are not explainable with the PTE or SG descriptors used here.
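To make the idea of compact, threshold-based clusters concrete, the following sketch groups items greedily so that every pair inside a group reaches the threshold. This is a generic illustration under stated assumptions, not the algorithm of the published clustering package: the function `threshold_clusters`, the greedy seeding order, the dropping of singletons, and the toy similarity table are all hypothetical.

```python
def threshold_clusters(names, similarity, threshold=0.75):
    """Greedily build groups in which all pairwise similarities are >= threshold.

    `similarity` is any symmetric function of two names. Items without a
    sufficiently similar partner are left out (a choice made for this sketch).
    """
    clusters = []
    assigned = set()
    for seed in names:
        if seed in assigned:
            continue
        members = [seed]
        for cand in names:
            if cand == seed or cand in assigned:
                continue
            # Require similarity to every current member, which keeps clusters compact.
            if all(similarity(cand, m) >= threshold for m in members):
                members.append(cand)
        if len(members) > 1:
            clusters.append(members)
            assigned.update(members)
    return clusters


# Toy pairwise similarities between placeholder labels (not real materials data).
toy = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.80,
    frozenset(("B", "C")): 0.60,
    frozenset(("C", "D")): 0.78,
}


def toy_similarity(x, y):
    return toy.get(frozenset((x, y)), 0.0)


print(threshold_clusters(["A", "B", "C", "D"], toy_similarity))
```

In this toy case, C is similar to A but not to B, so it is not added to the first group. Requiring similarity to all members rather than to a single one is what keeps the resulting clusters local and compact, at the price of a result that depends on the order of the input.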

The presented analysis can be directly included into materials databases complying with the FAIR principles. An implementation of the DOS descriptor is already available in the NOMAD Encyclopedia²³ (https://nomad-lab.eu/prod/rae/encyclopedia), where one can obtain a list of materials whose DOS are most similar to that of a chosen one. A tutorial on how to use this functionality can be found online (https://nomad-lab.eu/aitoolkit/tutorial-dos-similarity).

The findings of investigations along the lines described here may guide researchers towards possibly interesting materials classes. For instance, experimentalists could select a material that exhibits the desired properties but may be easier to synthesize than a prototypical one. That way, value can be added to high-throughput calculations or even experiments. In the context of electronic or spectroscopic properties, research fields such as photovoltaics, optoelectronics, or catalysis, to name a few, may benefit from such an approach.

Summarizing, our method provides a means of analyzing large datasets from electronic-structure theory and contributes to understanding, controlling, and selecting such data in view of their re-use in other contexts. Last but not least, the finding of accidental (unexpected) similarities may be of relevance in technological applications, where considering materials with different composition but similar properties could lead, e.g., to structures that are easier to synthesize or that reveal other properties superior to those of the known materials.

Dataset

We use data from the Computational 2D Materials Database (C2DB)^2,11, a high-throughput database of atomically thin systems. The majority of its content is generated from structure prototypes that are decorated with different atomic species. These prototypes include Xane (e.g., graphane), Xene (e.g., graphene), MXY Janus (e.g., MoSSe), and TMDCs (e.g., MoS₂). At the point of writing this manuscript, the C2DB contains 4047 structures, composed of 63 different chemical elements. Projected densities of states (PDOS) and atomic structures are available at the C2DB website (https://cmrdb.fysik.dtu.dk/c2db/) for 3491 of these structures. The C2DB captures various materials properties calculated by the electronic-structure package GPAW^24,25, by employing density-functional theory (DFT) in the PBE parameterization of the generalized-gradient approximation (GGA). The properties include heat of formation, stiffness tensor, and phonons to assess the stability, magnetic properties, and optical polarizability as well as electronic band structures and (if applicable) band gaps. The structures are relaxed with respect to their atomic positions and lattice parameters. Metastable materials, i.e., materials with energies above the convex hull or negative eigenvalues of the stiffness tensor, are also contained in the dataset.

Before using this data for our analysis, we preprocess it in the following way: First, we sum over all PDOS of one material in order to obtain the total DOS (TDOS). Then, we define the Fermi level, E_F, as the energy zero. The TDOS of every structure is then normalized with respect to the area of the unit cell spanned by the two periodic cell vectors. In this way, the results can be consistently compared across different geometries. For instance, supercells of the same material are considered identical. The resulting normalized TDOS are then employed to generate the fingerprints that encode the electronic structure. Analyses of atomic structures are performed using the Python package ASE²⁶. Further plots are generated using matplotlib²⁷.
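The three preprocessing steps described above (summing the PDOS, shifting the energies to E_F, and normalizing by the in-plane cell area) can be sketched in plain Python. The sketch assumes that all PDOS channels are sampled on a common energy grid and that the two periodic cell vectors are given as 3-component tuples in Å; the function names `cell_area` and `preprocess` and the toy numbers are illustrative and not taken from the published code.

```python
import math


def cell_area(a1, a2):
    """Area spanned by the two periodic cell vectors, |a1 x a2|."""
    cx = a1[1] * a2[2] - a1[2] * a2[1]
    cy = a1[2] * a2[0] - a1[0] * a2[2]
    cz = a1[0] * a2[1] - a1[1] * a2[0]
    return math.sqrt(cx * cx + cy * cy + cz * cz)


def preprocess(energies, pdos_list, fermi_level, a1, a2):
    """Return (energies relative to E_F, total DOS per unit area)."""
    # Step 1: sum all projected DOS channels point by point to get the TDOS.
    tdos = [sum(values) for values in zip(*pdos_list)]
    # Step 2: use the Fermi level as the energy zero.
    shifted = [e - fermi_level for e in energies]
    # Step 3: normalize by the area of the two-dimensional unit cell.
    area = cell_area(a1, a2)
    return shifted, [d / area for d in tdos]


# Toy data: a unit cell and a 2x1 supercell with doubled DOS and doubled area.
energies = [-1.0, 0.0, 1.0]
unit = preprocess(energies, [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]], 0.5,
                  (2.0, 0.0, 0.0), (0.0, 3.0, 0.0))
supercell = preprocess(energies, [[2.0, 4.0, 6.0], [1.0, 1.0, 1.0]], 0.5,
                       (4.0, 0.0, 0.0), (0.0, 3.0, 0.0))
print(unit[1], supercell[1])
```

In this toy case the unit cell and its supercell yield the same normalized TDOS, which is the invariance the area normalization is meant to provide.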

Data availability

The data used in this publication can be accessed through a public web application programming interface (API) (https://cmrdb.fysik.dtu.dk/c2db/). Furthermore, we provide the preprocessed data and the calculated descriptors in a public GitHub repository²⁸ (https://github.com/kubanmar/dos-fingerprints-data).

Code availability

Our implementation of the DOS fingerprint is part of the NOMAD project²³ and can be obtained as a stand-alone Python package hosted at GitHub²⁹ (https://github.com/kubanmar/dos-fingerprints). Additionally, we provide a tutorial that explains how to use the DOS descriptor to find the materials in the NOMAD Repository¹ that are most similar to a given reference. This tutorial is implemented in the NOMAD AI Toolkit³⁰ and can be accessed online (https://nomad-lab.eu/aitoolkit/tutorial-dos-similarity).

In this work, we choose to focus our DOS descriptor on an energy region around the Fermi energy. To this end, we set the parameters Δε_min = 0.05 eV, Δε_max = 1.05 eV, N = Δε_max/Δε_min = 21, ε_ref = 0 eV, and W = 4 eV (see Eq. 2), as well as W_H = 4 eV, N_ρ = 512, ρ_min = N_ρΔρ_min = 0.25, ρ_max = 2.75, and N_H = ρ_max/ρ_min = 11 (see Eq. 4). An implementation of our clustering algorithm is available at GitHub³¹ (https://github.com/kubanmar/similarity_threshold_clusterer).
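The relations between the quoted parameters can be checked with a few lines of Python. The dictionary keys below are hypothetical names chosen for this sketch and do not correspond to arguments of the published package; only the numerical values and the ratios N = Δε_max/Δε_min, N_H = ρ_max/ρ_min, and ρ_min = N_ρΔρ_min are taken from the text.

```python
# Parameter values as quoted in the text; key names are illustrative only.
params = {
    "delta_eps_min": 0.05,  # eV
    "delta_eps_max": 1.05,  # eV
    "eps_ref": 0.0,         # eV
    "W": 4.0,               # eV
    "W_H": 4.0,             # eV
    "N_rho": 512,
    "rho_min": 0.25,
    "rho_max": 2.75,
}

# Derived quantities, following the ratios stated in the text.
# round() guards against floating-point error in 1.05 / 0.05.
n_eps = round(params["delta_eps_max"] / params["delta_eps_min"])
n_h = round(params["rho_max"] / params["rho_min"])
# rho_min = N_rho * delta_rho_min, hence:
delta_rho_min = params["rho_min"] / params["N_rho"]

print(n_eps, n_h, delta_rho_min)
```

The derived values reproduce N = 21 and N_H = 11 and give Δρ_min = 0.25/512 ≈ 4.9 × 10⁻⁴.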

References

Draxl, C. & Scheffler, M. NOMAD: The FAIR concept for big data-driven materials science. MRS Bulletin 43, 676–682, https://doi.org/10.1557/mrs.2018.208 (2018).
Haastrup, S. et al. The computational 2D materials database: high-throughput modeling and discovery of atomically thin crystals. 2D Materials 5, 042002, https://doi.org/10.1088/2053-1583/aacfc1 (2018).
Curtarolo, S. et al. AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations. Computational Materials Science 58, 227–235, https://doi.org/10.1016/j.commatsci.2012.02.002 (2012).
Jain, A. et al. The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials 1, 011002, https://doi.org/10.1063/1.4812323 (2013).
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: The open quantum materials database (OQMD). JOM 65, 1501–1509, https://doi.org/10.1007/s11837-013-0755-4 (2013).
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115, https://doi.org/10.1103/PhysRevB.87.184115 (2013).
Gastegger, M., Schwiedrzik, L., Bittermann, M., Berzsenyi, F. & Marquetand, P. wACSF–weighted atom-centered symmetry functions as descriptors in machine learning potentials. The Journal of Chemical Physics 148, 241709, https://doi.org/10.1063/1.5019667 (2018).
Huo, H. & Rupp, M. Unified representation of molecules and crystals for machine learning https://doi.org/10.48550/ARXIV.1704.06439 (2017).
Isayev, O. et al. Materials cartography: Representing and mining materials space using structural and electronic fingerprints. Chemistry of Materials 27, 735–743, https://doi.org/10.1021/cm503507h (2015).
Ben Mahmoud, C., Anelli, A., Csányi, G. & Ceriotti, M. Learning the electronic density of states in condensed matter. Phys. Rev. B 102, 235130, https://doi.org/10.1103/PhysRevB.102.235130 (2020).
Gjerding, M. N. et al. Recent progress of the computational 2D materials database (C2DB). 2D Materials 8, 044002, https://doi.org/10.1088/2053-1583/ac1059 (2021).
Knøsgaard, N. & Thygesen, K. Representing individual electronic states for machine learning GW band structures of 2D materials. Nature Communications 13, 468, https://doi.org/10.1038/s41467-022-28122-0 (2022).
Willett, P., Barnard, J. M. & Downs, G. M. Chemical similarity searching. Journal of Chemical Information and Computer Sciences 38, 983–996, https://doi.org/10.1021/ci9800211 (1998).
Maggiora, G., Vogt, M., Stumpfe, D. & Bajorath, J. Molecular Similarity in Medicinal Chemistry. Journal of Medicinal Chemistry 57, 3186–3204, https://doi.org/10.1021/jm401411z. PMID: 24151987 (2014).
Bender, A. & Glen, R. C. Molecular similarity: a key technique in molecular informatics. Org. Biomol. Chem. 2, 3204–3218, https://doi.org/10.1039/B409813G (2004).
Togo, A. & Tanaka, I. Spglib: a software library for crystal symmetry search https://doi.org/10.48550/ARXIV.1808.01590 (2018).
Cohen, M. H., Ganduglia-Pirovano, M. V. & Kudrnovský, J. Orbital symmetry, reactivity, and transition metal surface chemistry. Phys. Rev. Lett. 72, 3222–3225, https://doi.org/10.1103/PhysRevLett.72.3222 (1994).
Cohen, M. H., Ganduglia-Pirovano, M. V. & Kudrnovský, J. Electronic and nuclear chemical reactivity. The Journal of Chemical Physics 101, 8988–8997, https://doi.org/10.1063/1.468026 (1994).
Yang, W. & Parr, R. G. Hardness, softness, and the fukui function in the electronic theory of metals and catalysis. Proceedings of the National Academy of Sciences 82, 6723–6726, https://doi.org/10.1073/pnas.82.20.6723 (1985).
Pal, B. et al. Anomalous orbital structure in two-dimensional titanium dichalcogenides. Scientific Reports 9, 1896, https://doi.org/10.1038/s41598-018-37248-5 (2019).
Zhuang, H. L. & Hennig, R. G. Single-layer group-III monochalcogenide photocatalysts for water splitting. Chemistry of Materials 25, 3232–3238, https://doi.org/10.1021/cm401661x (2013).
De, S., Bartók, A. P., Csányi, G. & Ceriotti, M. Comparing molecules and solids across structural and alchemical space. Phys. Chem. Chem. Phys. 18, 13754–13769, https://doi.org/10.1039/C6CP00415F (2016).
Draxl, C. & Scheffler, M. The NOMAD laboratory: from data sharing to artificial intelligence. Journal of Physics: Materials 2, 036001, https://doi.org/10.1088/2515-7639/ab13bb (2019).
Mortensen, J. J., Hansen, L. B. & Jacobsen, K. W. Real-space grid implementation of the projector augmented wave method. Phys. Rev. B 71, 035109, https://doi.org/10.1103/PhysRevB.71.035109 (2005).
Enkovaara, J. et al. Electronic structure calculations with GPAW: a real-space implementation of the projector augmented-wave method. Journal of Physics: Condensed Matter 22, 253202, https://doi.org/10.1088/0953-8984/22/25/253202 (2010).
Larsen, A. H. et al. The atomic simulation environment—a python library for working with atoms. Journal of Physics: Condensed Matter 29, 273002, https://doi.org/10.1088/1361-648x/aa680e (2017).
Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 90–95, https://doi.org/10.1109/MCSE.2007.55 (2007).
Kuban, M. kubanmar/dos-fingerprints-data: v1 Zenodo https://doi.org/10.5281/zenodo.7153812 (2022).
Kuban, M. & Scheidgen, M. kubanmar/dos-fingerprints: Initial release (Version v1) Zenodo https://doi.org/10.5281/zenodo.7153599 (2022).
Sbailò, L., Fekete, A., Ghiringhelli, L. M. & Scheffler, M. The NOMAD Artificial-Intelligence Toolkit: Turning materials-science data into knowledge and understanding https://doi.org/10.48550/ARXIV.2205.15686 (2022).
Kuban, M. kubanmar/similarity_threshold_clusterer: v1 (Version v1) Zenodo https://doi.org/10.5281/zenodo.7153751 (2022).

Acknowledgements

This work received partial funding from the German Research Foundation (DFG) through the CRC 1404 (FONDA), project number 414984028, and the NFDI consortium FAIRmat, project number 460197019. Partial support from the European Union’s Horizon 2020 research and innovation program under the grant agreement N° 951786 (NOMAD CoE) is appreciated. We acknowledge support by the Open Access Publication Fund of Humboldt-Universität zu Berlin.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Humboldt-Universität zu Berlin, Institut für Physik und IRIS Adlershof, Berlin, 12489, Germany
Martin Kuban, Santiago Rigamonti, Markus Scheidgen & Claudia Draxl

Contributions

M.K. prepared the data, wrote the code, and analyzed the results; M.S. contributed to the development of the fingerprint; S.R. and C.D. supervised and reviewed all parts of the work. All authors contributed to the writing of the manuscript.

Corresponding author

Correspondence to Martin Kuban.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kuban, M., Rigamonti, S., Scheidgen, M. et al. Density-of-states similarity descriptor for unsupervised learning from materials data. Sci Data 9, 646 (2022). https://doi.org/10.1038/s41597-022-01754-z

Received: 14 January 2022
Accepted: 10 October 2022
Published: 22 October 2022
DOI: https://doi.org/10.1038/s41597-022-01754-z
