Introduction

The creation of databases for computational materials science has led to a huge amount of stored calculations, exceeding by far any human’s ability to comprehend the information contained in them. Thus, algorithmic data-analysis methods need to be leveraged to allow knowledge extraction from this large pool of data. Domain-specific search interfaces, provided by public databases1,2,3,4,5, are one way to make information findable. These interfaces allow researchers to identify materials of interest, e.g., in terms of structural features like space group or atom types, or in terms of properties like the electronic band gap. However, such features alone provide little insight. Furthermore, the use of search interfaces is limited mostly to confirmatory analysis: Having a concrete physical mechanism in mind, e.g., the change of properties of alloys with stoichiometry, researchers can manually search for materials that allow them to confirm, or refute, a hypothesis.

Learning from data, however, is not limited to this kind of analysis. For instance, relations between materials in terms of certain properties may become apparent only in large quantities of data. To reveal such relations and make use of them, both an in-depth understanding of when we consider materials to be similar and powerful data-analysis methods are required. A prerequisite for understanding how different materials relate to one another is the availability of descriptive, numerical representations (descriptors) that accurately capture (dis)similarities, e.g., stemming from the atomic and/or electronic structure.

In the past years, several descriptors of the atomic structure have been published6,7,8 and successfully applied for the prediction of material properties using machine learning (ML) techniques. However, descriptors based on the electronic structure are not well established in the ML community. In early work of Isayev and coworkers9, descriptors of both the electronic density-of-states (DOS) and the band structure are used to create a graphical representation of more than 20000 materials from the AFLOWlib database. More recently, supervised ML was proposed10 to predict electronic densities-of-states by their decomposition in local atomic contributions. Furthermore, a descriptor based on atomic distances, the projected densities of states (PDOS), and the Kohn-Sham band-gap was shown11 to improve the prediction of computationally expensive material properties.

The majority of ML approaches in materials science focus on speeding up research. This concerns, for instance, the prediction of materials properties that are time-consuming to compute, like the electronic band gap, or the optimization of established methods, e.g., speeding up molecular-dynamics simulations through ML-based force fields. Thereby, highly non-linear ML models and/or complex material descriptors are necessary to achieve decent accuracy of predictions. Moreover, the underlying data are typically considered only as input for the ML models, and are not further analyzed.

In this work, we aim at obtaining a deeper understanding of large materials data spaces by rationalizing the reasons behind features that materials may share. We demonstrate our approach by analyzing the similarity of materials in terms of their electronic properties. To this end, we develop a tunable DOS fingerprint that encodes the DOS of a material into a binary-valued two-dimensional (2D) map, inspired by the work of Ref. 9. Combining it with unsupervised ML methods, we showcase its use by revealing similarities in the electronic structure of materials from the Computational 2D Materials Database (C2DB)2. We are able to uncover not only expected trends, e.g., clusters consisting of materials containing isoelectronic substitutions of atomic species, but also unexpected correlations, e.g., clusters of structurally very different materials. Our results show that explorative analysis of a database allows for finding relations between materials which could not be foreseen without comprehensive, data-driven analysis.

Methods

Electronic density-of-states fingerprints

The analysis of spectra like the DOS is typically done by visual inspection, i.e., in a qualitative manner. For large datasets, this kind of analysis quickly becomes infeasible. Therefore, a descriptor that allows for automated processing of such data is required. This includes a suitable numerical representation of the DOS. In the following, we review such representations that have been proposed in the literature, state their drawbacks, and describe how we overcome them.

To quantitatively compare the DOS of materials, Isayev et al.9 constructed a DOS fingerprint by encoding the DOS in the energy range between −10 and 10 eV as a series of 256 float (4 bytes) numbers. A similar point-wise representation was considered in Ref. 10 for building predictive models for the DOS based on Gaussian regression. It was pointed out that such representations were inefficient, as they may potentially require many sampling points of the DOS to efficiently train the models. Moreover, loss functions based on that representation turned out largely insensitive to spectral features with small overlap. To overcome these problems, the authors proposed two approaches: i) a truncated basis expansion based on principal-component analysis (PCA), which leads to an effective reduction of the degrees of freedom of the fingerprint (effectively smoothing the DOS spectra), and ii) a representation based on the cumulative distribution function associated to the DOS. The latter improved on the sensitivity of the loss functions to non-overlapping spectral features. More recently, a high-dimensional fingerprint based on the DOS projected on different atomic orbitals and sites, followed by PCA dimensionality reduction, has been proposed11. A similar fingerprint was later used to predict the G0W0 band energies for materials in the C2DB12.

A common drawback of all these DOS fingerprints is that they equally weight the contributions from the entire energy range considered in the spectra. Thus, they do not account for the fact that different energy regions are associated with distinct physical phenomena. For instance, the shape of the DOS close to the top of the valence band and the size of the band gap are the most important aspects in the search for p-doped materials. Likewise, for metals, the magnitude and shape of the DOS around the Fermi energy are most relevant. Other research may focus on some features of the conduction band. Although the PCA-based approach mentioned above can effectively lead to a re-weighting of spectral features, it cannot be tailored at will to focus on specific regions, but is determined by the training data that is used for the construction of the descriptors.

To overcome the described issues, we have developed a DOS fingerprint that allows for a tailored weighting of spectral features. Using a non-uniform discretization of the energy axis, the fingerprint can be adapted to focus on desired energy regions. To achieve this discretization, the DOS is transformed into a two-dimensional raster image (Fig. 1(d)) as follows: First, the spectrum is shifted such that the energy ε = 0 is located at a reference energy εref, which defines the main focus of the fingerprint. Then, the DOS ρ(E) (Fig. 1(a)) is integrated over an even number Nε of intervals of variable widths Δεi, to obtain a histogram {ρi} (Fig. 1(b)):

$${\rho }_{i}={\int }_{{\varepsilon }_{i}}^{{\varepsilon }_{i+1}}\rho (\varepsilon )d\varepsilon ,$$
(1)
Fig. 1

Generation of DOS fingerprints from the electronic DOS, ρ(E). The DOS of a material (a) is numerically integrated over small energy intervals [εi, εi + Δεi) (Eq. 1). The thereby generated histogram of states (b) is subsequently discretized (c), resulting in an image (d) of the DOS. In this image, each dark (light) pixel corresponds to a 1 (0) in the fingerprint. In panel (c), only every fifth discretization step is shown. To increase visibility, we use Nρ = 30, ρmin = 0.075, and ρmax = 0.825 for this figure. The other parameters are set as described in the Code Availability section.

with \(i\in \left[-{N}_{\varepsilon }/2,{N}_{\varepsilon }/2\right]\), \(i\in {\mathbb{Z}}\), \({\varepsilon }_{0}=0\), \({\varepsilon }_{i+1}={\varepsilon }_{i}+\Delta {\varepsilon }_{i}\) for i ≥ 0, and \({\varepsilon }_{-i}=-{\varepsilon }_{i}\). The integration intervals \(\Delta {\varepsilon }_{i}\) are defined as

$$\Delta {\varepsilon }_{i}=n({\varepsilon }_{i},W,N)\Delta {\varepsilon }_{min},$$
(2)

where Δεmin is a parameter giving the minimal integration width and n is the integer-valued function

$$n(\varepsilon ,W,N)=\lfloor \,g(\varepsilon ,W)N+1\rfloor \in [1,N].$$
(3)

\(\lfloor \cdot \rfloor \) denotes the ‘round down’ operator and \(g\left(\varepsilon ,W\right)=\left(1-{\rm{\exp }}\left(-{\varepsilon }^{2}/2{W}^{2}\right)\right)\). Here, the parameter \(N\in {\mathbb{N}}\) (N > 1) determines the maximum interval width \(N\Delta {\varepsilon }_{min}\), and the parameter W determines the feature region: For ε = 0, \(\Delta {\varepsilon }_{i}\) equals \(\Delta {\varepsilon }_{min}\), while it approaches \(N\Delta {\varepsilon }_{min}\) for \(| \varepsilon | > W\). In this way, a finer discretization of the histogram is obtained for energies in the feature region |ε| < W. This is illustrated by the integration limits indicated by vertical lines in Fig. 1(b). From this histogram, a raster graphic is generated by defining a grid of pixels, as shown in Fig. 1(c). Every column i of the histogram is discretized in a grid of Nρ intervals of height

$$\Delta {\rho }_{i}=n({\varepsilon }_{i},{W}_{H},{N}_{H})\Delta {\rho }_{min}.$$
(4)

Here, the parameters WH, NH, and Δρmin play a role analogous to W, N, and \(\Delta {\varepsilon }_{min}\) above: Close to ε = 0, a fine discretization \(\Delta {\rho }_{i}=\Delta {\rho }_{min}\) is obtained, while it approaches NHΔρmin for \(| \varepsilon | > {W}_{H}\). Finally, the number of “filled” pixels in column i is determined by

$${\rm{\min }}\left(\lfloor \frac{{\rho }_{i}}{\Delta {\rho }_{i}}\rfloor ,{N}_{\rho }\right),$$
(5)

resulting in the 2D raster image in panel (d) of Fig. 1, containing Nε × Nρ pixels enumerated by an index α. This image is then transformed into a binary-encoded vector \({\boldsymbol{f}}=({f}_{1},\ldots ,{f}_{{N}_{\varepsilon }\times {N}_{\rho }})\) with component fα = 1 if the pixel α is filled and 0 otherwise.
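The construction in Eqs. 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default parameter values, the trapezoidal integration, and the choice of evaluating Eq. 4 at the interval edge closest to ε = 0 (so that the image is symmetric around the reference energy) are all hypothetical.

```python
import numpy as np

def n_steps(eps, width, n_max):
    """Eq. 3: integer multiplier floor(g * N + 1), clipped to [1, N]."""
    g = 1.0 - np.exp(-eps**2 / (2.0 * width**2))
    return int(min(np.floor(g * n_max + 1), n_max))

def energy_grid(n_eps, d_eps_min, width, n_max):
    """Interval edges with eps_0 = 0 and eps_{-i} = -eps_i (Eq. 2)."""
    edges = [0.0]
    for _ in range(n_eps // 2):
        edges.append(edges[-1] + n_steps(edges[-1], width, n_max) * d_eps_min)
    pos = np.array(edges)
    # Mirror the positive edges (without the zero) to the negative side.
    return np.concatenate([-pos[:0:-1], pos])

def dos_fingerprint(energies, dos, e_ref=0.0, n_eps=8, d_eps_min=0.05,
                    width=1.0, n_max=4, n_rho=16, d_rho_min=0.05,
                    width_h=1.0, n_max_h=4):
    """Binary DOS fingerprint; all default values are illustrative only."""
    eps = np.asarray(energies, dtype=float) - e_ref   # shift reference to 0
    dos = np.asarray(dos, dtype=float)
    edges = energy_grid(n_eps, d_eps_min, width, n_max)

    # Eq. 1: integrate the DOS over each interval (cumulative trapezoid rule).
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (dos[1:] + dos[:-1]) * np.diff(eps))])
    rho = np.diff(np.interp(edges, eps, cum))

    # Assumption: the pixel height of a column is set by its edge nearest to 0.
    nearest = np.minimum(np.abs(edges[:-1]), np.abs(edges[1:]))

    image = np.zeros((n_eps, n_rho), dtype=np.uint8)
    for col in range(n_eps):
        d_rho = n_steps(nearest[col], width_h, n_max_h) * d_rho_min   # Eq. 4
        filled = min(max(int(np.floor(rho[col] / d_rho)), 0), n_rho)  # Eq. 5
        image[col, :filled] = 1
    return image.ravel()   # vector f with N_eps * N_rho binary components
```

Calling `dos_fingerprint(energies, dos)` on a sampled spectrum returns a flat array of zeros and ones that can be passed directly to a similarity measure.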

DOS similarity metric

The similarity between two materials i and j in terms of their DOS fingerprints fi and fj is denoted by S(fi, fj). As similarity metric, we use the Tanimoto coefficient (Tc)13, defined as:

$$S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)=\frac{{{\boldsymbol{f}}}_{i}\cdot {{\boldsymbol{f}}}_{j}}{| {{\boldsymbol{f}}}_{i}{| }^{2}+| {{\boldsymbol{f}}}_{j}{| }^{2}-{{\boldsymbol{f}}}_{i}\cdot {{\boldsymbol{f}}}_{j}}.$$
(6)

\(S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)\) can be interpreted as the overlap of the areas covered by the raster images represented by fi and fj, divided by the union of the areas. S takes real values in the range [0, 1], being equal to 1 (0) if the images fi and fj are identical (have no overlap). A better idea of what these values mean can be obtained by considering two spectra of equal area A. In this case, the overlapping area is given by A · 2S/(1 + S). Thus, a value of S = 0.5 means an overlap of 2/3 of the areas.
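For binary fingerprints, Eq. 6 reduces to counting shared and total filled pixels. The following sketch illustrates this together with the equal-area overlap formula; the function names are hypothetical, and returning S = 1 for two completely empty fingerprints is an assumed convention that the text does not specify.

```python
import numpy as np

def tanimoto(f_i, f_j):
    """Tanimoto coefficient (Eq. 6) of two binary fingerprints."""
    f_i = np.asarray(f_i, dtype=np.int64)
    f_j = np.asarray(f_j, dtype=np.int64)
    common = int(f_i @ f_j)                         # pixels filled in both images
    union = int(f_i @ f_i) + int(f_j @ f_j) - common
    # Assumption: two empty images are treated as identical.
    return 1.0 if union == 0 else common / union

def overlap_fraction(s):
    """Overlapping fraction of two equal-area spectra with similarity s."""
    return 2.0 * s / (1.0 + s)
```

For example, two fingerprints with two filled pixels each and one pixel in common give S = 1/(2 + 2 − 1) = 1/3.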

As an example of this metric, Fig. 2 shows the DOS of four different materials from the C2DB and their respective similarities. In the considered energy interval, C2 (graphene) has far fewer available states than the other examples. Mainly for this reason, it is dissimilar to all of them (S ≤ 0.14, see similarity matrix in Fig. 2). The DOS of MoS2 is similar to that of FeO2 in magnitude for |ε| > 1 eV, but since MoS2 is a semiconductor and FeO2 is a metal, the overall similarity is low (S = 0.4). MoS2 and WMo3S8 exhibit a high similarity coefficient of S = 0.84, as both the shape and the magnitude are similar.

Fig. 2

Illustration of the similarity metric with the examples of four materials. The DOS of graphene (C2), MoS2, WMo3S8, and FeO2 are presented on the left. The Fermi level is located at E = 0 eV. The similarity matrix (rows and columns in the same order, color-coded) is shown in the right panel.

Clustering algorithm

A similarity metric allows for a range of practical applications, for instance, determining which materials from a dataset are most similar to any given reference. The latter could be a material with a desired property, for which one seeks alternatives. This kind of analysis is commonly applied in chemical similarity searching13,14 or drug discovery15. A related application is the detection of (sub)sets of materials, i.e., clusters, that are more similar to one another than to other materials. In this work, we focus on the second case and develop a clustering algorithm that takes advantage of the following property of our similarity measure (Eq. 6): Its complement 1 − S is a distance measure that is identical to the Soergel distance for dichotomous fingerprints13. For binary-valued descriptors, it obeys the triangle inequality13, i.e.,

$$S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)\ge S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{k}\right)+S\left({{\boldsymbol{f}}}_{k},{{\boldsymbol{f}}}_{j}\right)-1$$
(7)

for any three fingerprints fi, fj, and fk. This can be easily verified with the examples shown in Fig. 2. An important consequence is that any two materials that are more similar to a third one than a threshold Sthres will be more similar than 2Sthres − 1 to each other. This motivates a simple clustering algorithm as follows: Start by (i) making a list of the materials in the database. Then, (ii) identify the material (reference) with the highest number of other materials that are more similar to it than a given threshold Sthres. If no such materials can be found for any reference, stop the algorithm, as all possible clusters have been found. Otherwise, (iii) consider the found reference and its similar materials as a cluster, extract them from the list, and return to step (ii). The materials that do not belong to any cluster are considered orphans. When two materials have the same number of neighbors and share any of them, the cluster with the highest average similarity is selected.
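Steps (i) to (iii) can be sketched as a greedy loop over a precomputed similarity matrix. This is an illustrative reading of the description above, not the authors' code: the function name is hypothetical, neighbors are selected with S ≥ Sthres, and ties in the number of neighbors are broken by the mean similarity of the reference to its neighbors, which is a simplifying assumption.

```python
import numpy as np

def cluster_by_threshold(sim, s_thres):
    """Greedy threshold clustering on a symmetric similarity matrix.

    Returns (clusters, orphans); each cluster lists its reference first.
    """
    sim = np.asarray(sim, dtype=float)
    remaining = set(range(len(sim)))               # step (i): all materials
    clusters = []
    while True:
        best = None                                # (key, reference, neighbors)
        for ref in sorted(remaining):              # step (ii): find best reference
            nbrs = sorted(k for k in remaining
                          if k != ref and sim[ref, k] >= s_thres)
            if not nbrs:
                continue
            # Assumption: ties are broken by mean similarity to the reference.
            key = (len(nbrs), float(np.mean(sim[ref, nbrs])))
            if best is None or key > best[0]:
                best = (key, ref, nbrs)
        if best is None:                           # no reference has neighbors
            break
        _, ref, nbrs = best
        members = [ref] + nbrs                     # step (iii): extract cluster
        clusters.append(members)
        remaining -= set(members)
    return clusters, sorted(remaining)             # leftover materials: orphans
```

The loop recomputes neighbor lists after every extraction, so a material whose only neighbors were absorbed by an earlier cluster ends up as an orphan, consistent with the description above.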

Materials descriptors

In this work, we combine the DOS fingerprint introduced at the beginning of this section with the clustering algorithm defined above to identify clusters of materials with similar electronic structure. To understand why such clusters are formed, we make use of descriptors.

The electronic spectrum of a material can often be understood by counting the valence electrons in the outermost shells of its constituent atoms. This counting can, in principle, be obtained from the average of the column numbers of the atoms in the unit cell:

$${\bar{c}}_{m}=\frac{1}{N}\mathop{\sum }\limits_{i}^{N}{c}_{im}$$
(8)

where i runs over all N atoms in the unit cell of material m, and cim denotes their column in the Periodic Table of Elements (PTE). \({\bar{c}}_{m}\) is calculated for all materials in a cluster. If it is equal for all of them, we conclude that the cluster is formed by isoelectronic materials. Note that here we employ a lax definition of isoelectronicity that considers only electron counting but not electronic configuration. As an example, \({\bar{c}}_{m}\) for two Si atoms is identical to that of two C atoms or the combination of one Al and one P atom. We call this descriptor the PTE descriptor.
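Eq. 8 and the isoelectronicity check can be illustrated as follows. The sketch assumes compositions are given as element-to-count dictionaries and uses a small, hand-written lookup table of PTE columns that covers only the elements in the example; both the table and the function names are hypothetical. Exact fractions avoid spurious floating-point mismatches when comparing averages.

```python
from fractions import Fraction

# Illustrative lookup of PTE columns (groups); deliberately incomplete.
PTE_COLUMN = {"C": 14, "Si": 14, "Al": 13, "P": 15,
              "Ti": 4, "Zr": 4, "Hf": 4, "Se": 16}

def mean_column(composition):
    """Eq. 8: average PTE column over all atoms in the unit cell."""
    n_atoms = sum(composition.values())
    total = sum(PTE_COLUMN[element] * count
                for element, count in composition.items())
    return Fraction(total, n_atoms)

def is_isoelectronic(cluster):
    """True if all materials in the cluster share the same mean column."""
    return len({mean_column(c) for c in cluster}) == 1

# Example from the text: Si2, C2, and AlP all have a mean column of 14.
print(mean_column({"Si": 2}), mean_column({"C": 2}), mean_column({"Al": 1, "P": 1}))
```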

The geometry of the crystal structures can also help explain the clusters obtained from the DOS similarity metric. Accordingly, we consider a similarity measure based on the space group (SG) of the crystal structures, after removing all information about the species that form the structure. In practice, this is achieved by first replacing all atoms by a single species and then employing the software package spglib16 with a tolerance of symprec = \(1\times 1{0}^{-1}\) to find the SG of the resulting geometry. In the following, we call this the SG descriptor.

Results

Identification of clusters

To identify sets of similar materials, we use the clustering algorithm described in the Methods section. We call the materials in a cluster its members and define the size of a cluster as the number of its members. The compactness of a cluster is determined by its radius rc = 1 − Smin, with Smin ≥ 2Sthres − 1 being the minimum similarity between any two members of the cluster. In this work, we choose a similarity threshold of Sthres = 0.75. Therefore, all materials fk within a cluster centered at the reference material fref have a similarity value of at least Smin = 0.5 to all other cluster members, i.e., an area overlap of at least 67%. With this choice, we find 294 distinct clusters that contain in total ~23% of the materials in the entire dataset. Materials not belonging to any cluster are called orphans. Among these, there are 2643 materials whose similarity to any other material in the dataset is less than Sthres. The remaining 54 orphans, about 2%, have at least one neighbor with similarity S ≥ Sthres, but these neighbors are already part of another cluster (see Clustering algorithm above).
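The cluster radius defined above is straightforward to evaluate from a similarity matrix. The snippet below is a small sketch with hypothetical function names; it assumes the matrix is symmetric with unit diagonal and that cluster members are given as row indices.

```python
import numpy as np

def cluster_radius(sim, members):
    """r_c = 1 - S_min, with S_min the lowest pairwise similarity in the cluster."""
    sub = np.asarray(sim, dtype=float)[np.ix_(members, members)]
    return 1.0 - float(sub.min())

def max_possible_radius(s_thres):
    """Upper bound on r_c implied by S_min >= 2 * S_thres - 1."""
    return 1.0 - (2.0 * s_thres - 1.0)
```

For Sthres = 0.75, `max_possible_radius` gives 0.5, while a two-member cluster can never exceed a radius of 0.25 because its only pair includes the reference.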

Figure 3(a) presents the distribution of cluster sizes on a logarithmic scale together with the maximal and mean cluster radii for clusters of a given size. About two thirds (200) of the clusters contain only two materials. Since the clustering algorithm requires that any member of the cluster has a similarity of at least Sthres to the reference material, the cluster radii for two-point clusters are as low as rc ≤ 0.25. The mean cluster radii for the clusters with more than two members increase to rc ~ 0.4 with increasing cluster size. Interestingly, even though the clustering algorithm allows for the maximal cluster radius to be as large as rc = 0.5, the maximal cluster radii of the discovered clusters are all smaller than 0.4.

Fig. 3

(a) Distribution of cluster sizes (blue bars) and maximal (black line) and mean (red line) cluster radii of the DOS clusters in the dataset for a similarity threshold of Sthres = 0.75. The dashed line indicates the maximal possible cluster radius for this threshold. The bars in light blue indicate the clusters that are used to generate the similarity matrix in the right panel. (b) Similarity matrix for materials in clusters with more than six members. The red boxes indicate the clusters detected by our algorithm.

To illustrate the similarity relations between materials, we calculate pairwise similarities between all materials in the dataset, i.e., a symmetric matrix with elements \({S}_{ij}=S\left({{\boldsymbol{f}}}_{i},{{\boldsymbol{f}}}_{j}\right)\). In other words, each column and row of this matrix corresponds to the similarities of a single material to the rest of the dataset. The diagonal elements of the matrix are identical, Sii = 1, as they describe the similarity of each material with itself. An excerpt of the full matrix can be seen in Fig. 3(b). The order of columns and rows has been chosen according to the cluster sizes, i.e., such that the largest cluster appears in the top left corner of the matrix, and the cluster radius decreases with increasing index. The color code makes apparent that many of the clusters are very dissimilar to each other, i.e., the average similarity of the cluster members to the rest of the dataset is low. For some others, however, the opposite is the case, and one could expect that choosing a smaller threshold would merge them, as pointed out above. An example of this will be given in the following section.
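For binary fingerprints, the full matrix Sij can be obtained from a single matrix product rather than a double loop. The following is a minimal sketch, assuming the fingerprints are stacked as rows of a 0/1 array; the function name is hypothetical, and pairs of completely empty fingerprints are assigned S = 1 by assumption.

```python
import numpy as np

def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities for fingerprints stacked as rows."""
    f = np.asarray(fingerprints, dtype=np.int64)
    common = f @ f.T                        # shared filled pixels for each pair
    counts = np.diag(common)                # filled pixels per fingerprint
    union = counts[:, None] + counts[None, :] - common
    sim = np.ones(common.shape, dtype=float)
    # Assumption: where both fingerprints are empty, keep the default of 1.
    np.divide(common, union, out=sim, where=union > 0)
    return sim
```

The result is symmetric with unit diagonal, so only the upper triangle needs to be inspected when searching for neighbors.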

We note that the parameters chosen here serve the purpose of showcasing our approach. Both the similarity threshold and the energy range can be varied to focus such an analysis on certain aspects of the data. For instance, to enhance compactness, the similarity threshold can be increased. This, however, reduces the number of discovered clusters and their size, which ultimately prevents the discovery of meaningful clusters. Conversely, the reduction of the similarity threshold increases both cluster size and number of clusters, at the expense of larger cluster radii. Too large cluster radii bear the risk of masking meaningful relations between data points in large clusters, which hinders the automatic analysis of clusters.

To illustrate these effects, we repeat the clustering process for a wide range of thresholds. The results are shown in Fig. 4(a): The black and orange curves indicate the total number of clusters and its subset of isoelectronic clusters, respectively, as a function of the similarity threshold. For low thresholds, all materials are contained in a few large clusters, which split into more, smaller ones upon increasing the threshold. The maximum is found at Sthres = 0.67. This value corresponds to an overlap of the spectral areas of around 50%. We choose a slightly larger value of Sthres = 0.75, corresponding to 67% overlap, to make the clusters more compact. As can be seen from Fig. 3(a), the maximum actual radius reached by a cluster is around 0.4, which means an overlap larger than 75%. This choice is indeed scientifically meaningful as the electronic structure in the chosen energy range (centered at the Fermi level) strongly impacts the structural stability and the dielectric response of the material, as well as the reactivity of its surfaces in, e.g., catalysis processes and chemisorption17,18,19.
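The overlap percentages quoted above can be reproduced by combining the lower bound Smin ≥ 2Sthres − 1 with the equal-area overlap A · 2S/(1 + S) from the DOS similarity metric section. As a worked check, assuming two spectra of equal area A and writing O for their overlapping area:

$$\frac{O}{A}=\frac{2{S}_{min}}{1+{S}_{min}}:\qquad {S}_{thres}=0.67\Rightarrow {S}_{min}=0.34\Rightarrow \frac{O}{A}=\frac{0.68}{1.34}\approx 0.51,\qquad {S}_{thres}=0.75\Rightarrow {S}_{min}=0.5\Rightarrow \frac{O}{A}=\frac{1.0}{1.5}\approx 0.67,\qquad {r}_{c}=0.4\Rightarrow {S}_{min}=0.6\Rightarrow \frac{O}{A}=\frac{1.2}{1.6}=0.75.$$

These are guaranteed minimum overlaps between any two members of a cluster; the overlap of each member with the reference material is at least 2Sthres/(1 + Sthres).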

Fig. 4

(a) Total number of clusters (black) and isoelectronic clusters (orange) as a function of the similarity threshold. The dashed red line indicates Sthres = 0.75, the threshold used in this work. (b) Electronic density of states (DOS) of materials that are orphans for all thresholds considered in panel (a).

Figure 4(b) presents the DOS of six materials that are orphans. In fact, they remain orphans at Sthres as small as 0.1. Often, orphans have narrow bands leading to well-localized peaks in the corresponding DOS. The figure includes four such cases, namely FeHfF6, Li2Cl2O4, FeZrCl6, and Na2F2O2. As an example, the latter is composed of FO molecules bridged by Na atoms, in a checkerboard pattern (see also https://cmrdb.fysik.dtu.dk/c2db/row/F2Na2O2-feda03610e19). Their bonding and antibonding states form narrow peaks of large intensity at −1.5 and 1.8 eV, respectively, while the Na states lie mostly outside the considered energy range. These sharp peaks have hardly any overlap with those of the other samples, so the resulting similarity is too small to form a cluster. Less often, orphans are large band-gap materials, such that no appreciable spectral weight is present in the considered energy range. This is the case, for instance, for MgCl2 and Bi2F6.

Analysis of selected clusters

In the following we analyze individual clusters and reason why the materials in these clusters are similar.

Isoelectronic substitutions

Figure 5 presents the DOS (a) and crystal structures (b) of five transition-metal dichalcogenides (TMDC) forming a cluster. Its cluster radius rc = 0.28 is close to the mean for this cluster size (see Fig. 3(a)). Visual inspection of the corresponding DOS reveals a pronounced overall similarity in terms of i) the shape of spectra inside the feature region |E| < 2 eV, and ii) the size of the PBE band-gap that varies from 0.52 eV (TiSe2) to 0.65 eV (Hf2Ti2Se8). Above 2 eV, the DOS become more dissimilar, as expected from the coarser representation of the DOS outside the feature region. The inset of panel (a) presents the similarity matrix of the materials contained in this cluster. Here, sub-clusters become appreciable for materials with the same number of substituents, i.e., the materials containing one or two substituents are more similar to one another than to the other cluster members.

Fig. 5

Densities of states (a) and unit cells (b) of the materials of a selected DOS cluster. The Fermi level is located at E = 0 eV. The cluster center, Se8Ti3Hf, is indicated by the bold font in the legend. The inset shows their similarity matrix, where the color code is adapted to reflect the high similarities between the cluster members. (c) Corresponding PDOS. Although the contributions of individual orbitals vary between the different materials, due to their similar shape, their sum, i.e., the total DOS, is very similar.

Considering the crystal lattice, the cluster members are very similar. All materials consist of a layer of TMs between two layers of Se. The cluster contains the binary phase (TiSe2), as well as ternary phases, where either one or two Ti atoms are substituted with either Hf or Zr or both. The latter type has only minor influence on the DOS. This does not come as a surprise, as all substitutions within this cluster are isoelectronic, i.e., with atomic species from group 4 of the PTE. We note here that there exists another cluster of materials with the same structural prototype, containing, among other materials, the binary compounds Se8Hf4 and Se8Zr4. These materials form a separate cluster because they have higher PBE band gaps, ranging between 0.72 eV (Se8TiHf3) and 0.82 eV (Se8HfZr3). Choosing a lower similarity threshold, these clusters merge.

To further demonstrate the isoelectronic behavior of the materials of the cluster considered here, we compare their PDOS in Fig. 5(c). Their valence bands are mainly composed of fully occupied Se 4p states. The conduction bands have predominant Ti 3d character, with additional contributions from 4d or 5d states of Zr and Hf (when present)20. The latter lie all in the same energy range and sum up to the same number of d states of the four group-IV TMs. The hybridization of TM-3d with Se-p orbitals is evident from small contributions of d states in the valence region and Se-p states in the conduction region. In sum, the similarity of the electronic spectrum of these materials becomes clear: The replacement of Ti by either Hf or Zr does not alter the valence band, while the conduction states are composed of empty d shells of the transition metals, which amount to the same number of empty states.

Several other clusters exist in the dataset which consist of isoelectronic materials, i.e., they may contain different elements but have the same number of valence electrons. To discover them, we make use of the PTE descriptor introduced in the Methods section. Overall, our descriptor identifies 230 clusters, each of them having the same \({\bar{c}}_{m}\) for all its members (Eq. 8). This number corresponds to 78% of all clusters, and 16.5% of materials in the full dataset. Therefore, we conclude that isoelectronicity is a main reason for the similarity of the DOS of the materials in the C2DB. Moreover, 88.8% of all clusters contain at least two materials that have the same \({\bar{c}}_{m}\).

The existence of clusters exclusively from isoelectronic materials is observed for a large range of similarity thresholds, as indicated in Fig. 4(a). Comparing their number (orange curve) to the total number of clusters (black curve) reveals that with increasing similarity threshold, the majority of compact clusters are isoelectronic.

Materials with isoelectronic surface groups

The second most common origin of similarity concerns the substitution of fluorine atoms at the materials’ surfaces with OH groups. Figure 6 presents an example of such clusters. These metallic materials consist of five alternating layers of carbon and either Ta or Nb. Again, Ta and Nb are isoelectronic, i.e., from group 5 of the PTE. At the two surfaces of the materials, either F atoms or OH groups form bonds with the underlying TM atoms.

Fig. 6

DOS (a) and atomic structures (b) of materials with isoelectronic surface groups. The Fermi level is located at E = 0 eV. The cluster center is indicated in bold face. For increased visibility the unit cell is repeated in both in-plane directions.

The cluster radius is rc ~ 0.28, which is close to the mean value for four-point clusters. The general shapes of the curves are similar. Inspection of the PDOS (e.g., https://cmrdb.fysik.dtu.dk/c2db/row/C2H2O2Nb3-137a187a149c), reveals that the whole spectrum is dominated by d states of the TM. They hybridize with C and TM p states. The p bands have the largest contribution around E ≈ 0 and E ≈ 1.5, where also significant contributions from F and O p states are present. Thus, the saturated O in the hydroxyl group acquires an electronic configuration analogous to F and binds similarly to the TM atoms. In other words, the OH group can be regarded as isoelectronic to F. Minor differences between the spectra, as for instance displaced peaks below −1 eV, mainly originate from multiple van Hove singularities which are very sensitive to the precise location of band extrema and flat bands.

In total, we find 33 clusters in the dataset where F at the surface is interchanged with an OH group. In most of these cases, these clusters also contain sets of materials with other isoelectronic substitutions. Let us note in passing that, despite the fact that the similar behavior of fluorine atoms and hydroxyl groups is well-established expert knowledge in chemistry and electronic-structure theory, it is not trivial to access such knowledge in an automated manner. So far, the search interfaces of most databases, as well as descriptors for machine learning, rely on structural features, e.g., the chemical formula or the number of atoms in the unit cell. Thus, similar materials with, e.g., different numbers of atoms in the unit cell are unlikely to be found by these methods.

Role of crystal lattice

Now, we focus our search on clusters of materials composed of identical atoms but different host lattices. Figure 7 presents the data from three different phases of In2S2, which belongs to the class of post-transition metal chalcogenides. We designate them by their SG (value of their SG descriptor, cf. Methods), where we distinguish the two materials with SG 164 as SG 164-1 and SG 164-2. The semiconducting structures resulting from the phases SG 164-1 and SG 187 form a cluster due to the similarity of their DOS throughout the whole observed energy range. For comparison, we show the DOS of a third phase, SG 164-2, that shares the symmetry with the first material but shows markedly different behavior and, thus, is not part of the same cluster. While the similarity coefficient between the clustered materials is 0.76, the corresponding values with the third phase are 0.32 and 0.34, respectively. This finding goes hand in hand with the fact that the clustered materials have medium-sized band gaps of 1.60 eV (SG 187) and 1.68 eV (SG 164-1), respectively (values from https://cmrdb.fysik.dtu.dk/c2db/), whereas the third one is metallic. Despite sharing the space group, the two phases with SG 164 show significant structural differences, as evident from the top views of the unit cells depicted in Fig. 7. This can be further illustrated by considering the stacking of In and S layers: For SG 164-1, the layer sequence corresponds to ABBC stacking; for SG 187, it is ABBA; for SG 164-2, it is ABDC. The (dis-)similarity of the different phases lies in the particular electronic configuration acquired by the atomic species: In the semiconducting phases, In adopts covalent bonding, manifested by a valence band that is dominated by hybridized S and In p states21. In this electronic configuration, the In atoms are tetrahedrally coordinated by three S atoms and one In atom. Here the In-In bond length is dIn-In = 2.82 Å. In the metallic phase, In atoms form metallic bonds with a significant contribution from In s states. In this case, every In atom is coordinated with three In atoms, and the bond length is dIn-In = 3.62 Å, which is close to that of bulk metallic In (3.38 Å). The metallic phase is metastable as compared to the semiconducting ones (see also https://cmrdb.fysik.dtu.dk/c2db/row/In2S2-ef93efd2b5c0).

Fig. 7

DOS (a) and atomic structures (b) of In2S2. The Fermi level is located at E = 0 eV. The structures with SG 164-1 and SG 187 form a DOS cluster. The structure with SG 164-2 (right) has a dissimilar DOS and is not part of the cluster. The unit cells are repeated in both in-plane directions to increase visibility.

We note that in this case the dissimilarity between the semiconducting and metallic phases can be explained neither by the SG nor by the PTE descriptors. Nonetheless, our DOS similarity search is able to capture the underlying electronic configuration and to group structures with identical atomic coordination despite their different crystal structures.

Outliers

Overall, there are 25 clusters in the dataset with materials that are neither isoelectronic nor share the crystal lattice, i.e., they cannot be explained by our SG and PTE descriptors. Therefore, as a final example, we focus on a cluster that consists of two materials that have no apparent similarities in their atomic structures, neither in symmetry nor in composition. They are presented in Fig. 8(b). While Ta2BS2 has the trigonal space group 164, Bi4Cu4 is characterized by an orthorhombic lattice with SG 51. Despite these structural dissimilarities, the DOS of both materials resemble each other with a similarity coefficient of 0.76, i.e., slightly above the threshold of Sthres = 0.75. Both materials exhibit a nearly constant DOS between −1.5 and 2 eV and an increase of the DOS below this range. To get a deeper insight, we show in Fig. 8 the band structures of Ta2BS2 (c) and Bi4Cu4 (d), indicating the atomic character of the bands, together with the corresponding DOS projected on the different atomic species. While in the case of Ta2BS2 only one band with mixed p-d character crosses the Fermi level, the energy spectrum of Bi4Cu4 exhibits several bands composed almost exclusively of Bi and Cu p states, giving rise to a more complicated topology of the Fermi surface. We conclude that the similarity of these materials’ DOS is accidental and that this cluster can indeed be considered an outlier.

Fig. 8

Top: DOS (a) and atomic structures (b) of a cluster of materials that share neither atomic species nor crystal structure. To increase visibility, the unit cell is repeated in both in-plane directions. Bottom: Band structures and projected DOS of Ta2BS2 (c) and Bi4Cu4 (d), indicating the atomic characters. The Fermi level is located at E = 0 eV in all panels.

This example demonstrates that our approach has great potential for identifying materials with a desired property or specific feature. This could, e.g., be the electronic band gap or the character of a band edge. Combining unsupervised learning with physics-informed descriptors allows one to explore materials spaces, including regions where one would not expect good candidates.

Discussion

In this work, we have presented a fingerprint of the electronic DOS that allows one to quantitatively evaluate the similarity of materials in terms of their electronic structure. We have applied this fingerprint to the C2DB database, a large, heterogeneous dataset of two-dimensional materials. Based on our similarity measure, we have devised a clustering algorithm to filter the data for sets of materials that exhibit pronounced similarities to one another. A significant number of (small) clusters have been identified and further analyzed. More specifically, 23% of the materials can be associated with at least one other material in the dataset. The majority of similarities in these particular materials can be explained by the similarity of the valence configuration of the involved atomic species, thus confirming physical expectations. This confirmatory analysis has been performed in an automatic fashion based on physically meaningful descriptors. In this way, we could identify, for instance, that 16.5% of the materials are isoelectronic to at least one other material of the investigated database. Moreover, we have shown that this observation is valid for a wide range of similarity thresholds, which indicates the robustness of the method. Our approach could be easily extended by introducing other descriptors with explanatory power.
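
To make the statement about materials that are "associated with at least one other material" concrete, the following is a minimal sketch that counts how many entries of a dataset have at least one partner at or above a similarity threshold. It is not the clustering algorithm used in this work (see Methods); the function name `fraction_with_partner`, the use of `>=` at the threshold, and the toy similarity matrix are assumptions made purely for illustration.

```python
import numpy as np

def fraction_with_partner(similarity, threshold=0.75):
    """Fraction of materials with at least one other material at or above threshold.

    similarity: symmetric (n, n) matrix of pairwise similarity coefficients
                (hypothetical input; how it is computed is not specified here).
    threshold:  similarity threshold; 0.75 is the value quoted in the text.
    """
    s = np.array(similarity, dtype=float)   # copy, so the input is not modified
    np.fill_diagonal(s, -np.inf)            # ignore the trivial self-similarity
    has_partner = (s >= threshold).any(axis=1)
    return has_partner.mean()

# Toy example with four made-up materials: only the first two are similar.
toy = [[1.0, 0.8, 0.3, 0.1],
       [0.8, 1.0, 0.2, 0.4],
       [0.3, 0.2, 1.0, 0.5],
       [0.1, 0.4, 0.5, 1.0]]
for s_thres in (0.45, 0.75, 0.9):
    print(s_thres, fraction_with_partner(toy, s_thres))
```

Scanning the threshold as in the loop above is one simple way to check how strongly such a fraction depends on the chosen value.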

Our work differs from other approaches in the literature. In many cases9,12,22, similarity relations between data are employed to attain low-dimensional representations of the whole dataset. Such representations have been shown to be well suited for visualizing relations in materials spaces, e.g., by highlighting materials with a certain feature and subsequently identifying clusters of them. They are used for a rough characterization of the materials space, which may lead to a global identification of descriptor-property correlations. In contrast, by using a strict similarity threshold that yields compact clusters, we focus on the local structure of the dataset. Our analysis confirms, in most cases, physical intuition. This allows us to automatically classify a large subset of clusters. Interestingly, it also leads to the discovery of outliers, i.e., clusters that are not explainable with the PTE or SG descriptors used here.

The presented analysis can be directly integrated into materials databases complying with the FAIR principles. An implementation of the DOS descriptor is already available in the NOMAD Encyclopedia23 (https://nomad-lab.eu/prod/rae/encyclopedia), where one can obtain a list of materials whose DOS are most similar to that of a chosen one. A tutorial on how to use this functionality can be found online (https://nomad-lab.eu/aitoolkit/tutorial-dos-similarity).

The findings of investigations along the lines described here may guide researchers towards possibly interesting materials classes. For instance, experimentalists could select a material that exhibits the desired properties but may be easier to synthesize than a prototypical one. That way, value can be added to high-throughput calculations or even experiments. In the context of electronic or spectroscopic properties, research fields such as photovoltaics, optoelectronics, or catalysis, to name a few, may benefit from such an approach.

In summary, our method provides a means of analyzing large datasets from electronic-structure theory and contributes to understanding, controlling, and selecting such data in view of their re-use in other contexts. Last but not least, the finding of accidental (unexpected) similarities may be of relevance in technological applications, where considering materials with different composition but similar properties could lead, e.g., to structures that are easier to synthesize or that reveal other properties superior to those of the known materials.

Dataset

We use data from the Computational 2D Materials Database (C2DB)2,11, a high-throughput database of atomically thin systems. The majority of its content is generated from structure prototypes that are decorated with different atomic species. These prototypes include Xane (e.g., graphane), Xene (e.g., graphene), MXY Janus (e.g., MoSSe), and TMDCs (e.g., MoS2). At the time of writing this manuscript, the C2DB contains 4047 structures, composed of 63 different chemical elements. Projected densities of states (PDOS) and atomic structures are available at the C2DB website (https://cmrdb.fysik.dtu.dk/c2db/) for 3491 of these structures. The C2DB captures various materials properties calculated with the electronic-structure package GPAW24,25, employing density-functional theory (DFT) in the PBE parameterization of the generalized-gradient approximation (GGA). The properties include the heat of formation, stiffness tensor, and phonons (to assess the stability), magnetic properties, and the optical polarizability, as well as electronic band structures and (if applicable) band gaps. The structures are relaxed with respect to their atomic positions and lattice parameters. Metastable materials, i.e., materials with energies above the convex hull or negative eigenvalues of the stiffness tensor, are also contained in the dataset.

Before using these data for our analysis, we preprocess them in the following way: First, we sum all PDOS of a material to obtain its total DOS (TDOS). Then, we set the Fermi level, EF, as the energy zero. The TDOS of every structure is then normalized with respect to the area of the unit cell spanned by the two periodic cell vectors. In this way, the results can be consistently compared across different geometries. For instance, supercells of the same material are considered identical. The resulting normalized TDOS are then employed to generate the fingerprints that encode the electronic structure. Analysis of atomic structures is performed using the Python package ASE26. Further plots are generated using matplotlib27.
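
The three preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `preprocess_dos`, the array layout (one PDOS per row on a common energy grid), and the convention that the first two rows of `cell` are the periodic in-plane lattice vectors are all hypothetical choices.

```python
import numpy as np

def preprocess_dos(energies, pdos, fermi_level, cell):
    """Return (energies relative to E_F, area-normalized total DOS).

    energies:    (n_E,) energy grid in eV
    pdos:        (n_proj, n_E) projected DOS, one projection per row (assumed layout)
    fermi_level: Fermi level E_F in eV
    cell:        (3, 3) lattice vectors as rows; rows 0 and 1 are assumed
                 to be the two periodic (in-plane) vectors
    """
    energies = np.asarray(energies, dtype=float)
    pdos = np.asarray(pdos, dtype=float)
    cell = np.asarray(cell, dtype=float)

    tdos = pdos.sum(axis=0)                  # step 1: sum all PDOS -> TDOS
    shifted = energies - fermi_level         # step 2: E_F becomes the energy zero
    area = np.linalg.norm(np.cross(cell[0], cell[1]))  # in-plane unit-cell area
    return shifted, tdos / area              # step 3: normalize by the area

# Toy data: a hexagonal cell and a 2x2 supercell with four times the states.
e = np.linspace(-2.0, 2.0, 5)
p = np.array([[1.0, 2.0, 0.0, 1.0, 3.0],
              [0.5, 0.0, 0.0, 2.0, 1.0]])
cell = np.array([[3.0, 0.0, 0.0], [-1.5, 2.598, 0.0], [0.0, 0.0, 15.0]])
supercell = cell.copy()
supercell[:2] *= 2.0                         # double both in-plane vectors

_, dos_prim = preprocess_dos(e, p, 0.5, cell)
_, dos_super = preprocess_dos(e, np.tile(p, (4, 1)), 0.5, supercell)
print(np.allclose(dos_prim, dos_super))
```

In the toy example, the 2×2 supercell has four times the area and four times the number of states, so the normalized TDOS coincides with that of the primitive cell, which is the sense in which supercells of the same material are considered identical.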