1 Introduction

The success of information retrieval in a given task critically depends on the quality of the underlying data. A further challenge is that in many domains knowledge bases are spread across various data sources [14], and it is crucial to be able to combine information from different sources. In this work, we focus on knowledge bases in the form of Knowledge Graphs (KGs), which are particularly suited for information retrieval [17]. Joining information from different KGs is non-trivial, as there is no unified schema or vocabulary. The goal of the entity alignment task is to overcome this problem by learning a matching between entities in different KGs. In the typical setting some of the alignments are known in advance (seed alignments), making the task supervised. More formally, we are given graphs \(G_L = (V_L, E_L)\) and \(G_R = (V_R, E_R)\) with a seed alignment \(A = \{(l_i, r_i)\}_i \subseteq V_L \times V_R\). It is commonly assumed that an entity \(v \in V_L\) can match at most one entity \(v' \in V_R\). Thus, the goal is to infer alignments for the remaining nodes only.

Graph Convolutional Networks (GCNs) [7, 9], which have recently become increasingly popular, are at the core of state-of-the-art methods for entity alignment in KGs [3, 6, 22, 24, 27]. In this paper, we thoroughly analyze one of the first GCN-based entity alignment methods, GCN-Align [22]. Since the other methods under study can be considered extensions of this first work and share a similar architecture, our goal is to understand the importance of its individual components and architectural choices. In summary, our contribution is as follows:

  1. We investigate the reproducibility of the published results of a recent GCN-based method for entity alignment and uncover differences between the method’s description in the paper and the authors’ implementation.

  2. We perform an ablation study to demonstrate the individual components’ contribution.

  3. We apply the method to numerous additional datasets of different sizes to investigate the consistency of results across datasets.

2 Related Work

Table 1. Overview of related work in the field of entity alignment for knowledge graphs, together with the datasets and metrics they use.

In this section, we review previous work on entity alignment for Knowledge Graphs and revisit the current evaluation process. We believe this is useful for practitioners, since we uncover some pitfalls, especially when implementing evaluation metrics and selecting datasets for comparison. An overview of methods, datasets and metrics is provided in Table 1.

Methods. While the problem of entity alignment in Knowledge Graphs has historically been tackled by designing vocabularies that are as broad as possible and establishing them as standards, recent approaches take a more data-driven view. Early methods use classical knowledge graph link prediction models such as TransE [2] to embed the entities of the individual knowledge graphs using an intra-KG link prediction loss, and differ in how they treat the aligned entities. For instance, MTransE [5] learns a linear transformation between the embedding spaces of the individual graphs using an \(L_2\)-loss. BootEA [19] adopts a bootstrapping approach and iteratively labels the most likely alignments to utilize them for further training. In addition to the alignment loss, embeddings of aligned entities are swapped regularly to calibrate the embedding spaces against each other. SEA [15] learns a mapping between the embedding spaces in both directions and additionally adds a cycle-consistency loss: the distance between the original embedding of an entity and the result of translating this embedding to the opposite space and back again is penalized. IPTransE [26] embeds both KGs into the same embedding space and uses a margin-based loss to enforce the embeddings of aligned entities to become similar. RSN [8] generates sequences using different types of random walks which can move between graphs when visiting aligned entities; the generated sequences are fed to an adapted recurrent model. JAPE [18], KDCoE [4], MultiKE [25] and AttrE [20] utilize attributes available for some entities and additional information such as the names of entities and relationships. Graph Convolutional Network (GCN) based models [3, 6, 22, 24, 27]Footnote 1 have in common that they use a GCN to create node representations by aggregating a node’s own representation with the representations of its neighbors. Most GCN approaches do not distinguish between different relations and either consider all neighbors equally [6, 22, 24] or use attention [3] to weight the neighbors’ representations for the aggregation.

Table 2. Overview of the used datasets with their sizes in terms of number of triples (edges), entities (nodes), relations (different edge types) and alignments. For WK3l, the alignment is provided as a directed mapping at the entity level. However, there are additional triple alignments. Following common practice, e.g. [15], we assume that an alignment should be symmetric and that entity alignments can be extracted from the triple alignments. Thereby, we obtain the number of alignments given in brackets.

Datasets. The datasets used by entity alignment methods are generally based on large-scale open-source data sources such as DBpedia [1], YAGO [13], or Wikidata [23]. While there is the DWY-100K dataset, which comprises 100K aligned entities across the three aforementioned knowledge graphs, most of the datasets, such as DBP15K or WK3l, are generated from a single multi-lingual database. There, subsets are formed according to a specific language, and entities which are linked across languages are used as alignments. A detailed description of the most commonly used datasets can be found in Table 2.

As an interesting observation, we found that all papers which evaluate on DBP15K do not evaluate on the full DBP15K datasetFootnote 2 (which we refer to as DBP15K (full)), but rather use a smaller subset provided by the authors of JAPE [18] in their GitHub repositoryFootnote 3, which we call DBP15K (JAPE). The smaller subsets were created by selecting a portion of entities (around 20K of 100K) which are popular, i.e. appear in many triples as head or tail. The number of aligned entities stays the same (15K). As [18] only reports the statistics of the larger dataset and does not mention the reduction, subsequent papers also report the statistics of the larger dataset, although their experiments use the smaller variant [3, 18, 19, 22, 26]. As the metrics rely on absolute ranks, the numbers are better than on the full dataset (cf. Table 3).

Scores. It is common practice to consider only the entities taking part in the test alignment as potential matching candidates. Although we argue that ignoring entities exclusive to a single graph as potential candidates does not reflect the real use case wellFootnote 4, we follow this evaluation scheme in our experiments to maintain comparability.
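To make the restricted evaluation scheme concrete, the following minimal sketch computes H@1 when only test entities are admitted as candidates. The function name and the assumption that test pairs are index-aligned are ours, not from any reference implementation:

    import numpy as np

    def hits_at_1(left_emb, right_emb):
        """H@1 with candidates restricted to test entities.

        left_emb[i] and right_emb[i] are the embeddings of the i-th
        aligned test pair, i.e. row i should match column i.
        """
        # pairwise L2 distances between all left and right test entities
        dist = np.linalg.norm(left_emb[:, None, :] - right_emb[None, :, :], axis=-1)
        pred = dist.argmin(axis=1)  # nearest right test entity per left entity
        return float((pred == np.arange(len(left_emb))).mean())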

3 Method

GCN-Align [22] is a GCN-based approach which embeds all entities from both graphs into a common embedding space. Each entity i is associated with structural features \(h_i \in \mathbb {R}^d\), which are initialized randomly and updated during training. The features of all entities in a single graph are combined into the feature matrix H. Subsequently, a two-layer GCN is applied. A single GCN layer is described by \( H^{(i+1)} = \sigma \left( \hat{D}^{-\frac{1}{2}}\hat{A}\hat{D}^{-\frac{1}{2}}H^{(i)}W^{(i)}\right) \) with \(\hat{A} = A + I\), where A is the adjacency matrix, and \(\hat{D}_{ii} = \sum _{j=1}^n \hat{A}_{ij}\) is the diagonal node degree matrix. The input of the first layer is set to \(H^{(0)} = H\), and \(\sigma \) is a non-linear activation function, chosen as ReLU. The output of the last layer is considered the structural representation, denoted by \(s_i = H^{(2)}_i \in \mathbb {R}^{d}\). Both graphs are equipped with their own node features, but the convolution weights \(W^{(i)}\) are shared across the graphs.
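For illustration, this propagation rule can be sketched in a few lines of dense numpy (our own simplification for readability; practical implementations operate on sparse matrices):

    import numpy as np

    def gcn_layer(A, H, W=None):
        """One propagation step: sigma(D^{-1/2} (A + I) D^{-1/2} H W)."""
        n = A.shape[0]
        A_hat = A + np.eye(n)                           # add self-loops
        d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
        # symmetric normalization D^{-1/2} A_hat D^{-1/2} via broadcasting
        A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
        H_out = A_norm @ H
        if W is not None:                               # shared convolution weights
            H_out = H_out @ W
        return np.maximum(H_out, 0.0)                   # ReLU

Note that passing W=None turns the layer into a parameter-free propagation step, a variant which becomes relevant below.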

The adjacency matrix is derived from the knowledge graph by first computing a score for each relation r, called functionality \(\alpha _r\), as the ratio between the number of different entities which occur as head and the number of triples in which the relation occurs. Analogously, the inverse functionality \(\alpha _r'\) is obtained by replacing the numerator with the number of different tail entities. The final adjacency matrix is obtained as \( A_{ij} = \sum \limits _{(e_i, r, e_j)} \alpha _r' + \sum \limits _{(e_j, r, e_i)} \alpha _r\). Note that, analogously to the structural features, GCN-Align is able to process attributes and integrate them into the final representation. However, since the attributes have little effect on the final score, and to be consistent with other GNN models, we focus here only on the structural representations.
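A sketch of this construction from a triple list (function and variable names are ours, not from the reference implementation):

    from collections import defaultdict

    import numpy as np

    def build_adjacency(triples, num_entities):
        """Weighted adjacency from (head, relation, tail) integer-id triples."""
        heads, tails, counts = defaultdict(set), defaultdict(set), defaultdict(int)
        for h, r, t in triples:
            heads[r].add(h); tails[r].add(t); counts[r] += 1
        # functionality and inverse functionality per relation
        fun  = {r: len(heads[r]) / counts[r] for r in counts}
        ifun = {r: len(tails[r]) / counts[r] for r in counts}
        A = np.zeros((num_entities, num_entities))
        for h, r, t in triples:
            A[h, t] += ifun[r]  # edge h -> t weighted by inverse functionality
            A[t, h] += fun[r]   # reverse edge t -> h weighted by functionality
        return A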

Implementation Specifics. The codeFootnote 5 provided by the authors differs in a few aspects from the method described in the paper. First, when computing the adjacency matrix, fun(r) and ifun(r) are clamped to a minimum value of 0.3. Second, the node embeddings are initialized with values drawn from a normal distribution with variance \(n^{-1/2}\), where n is the number of nodesFootnote 6. Additionally, the node features are always normalized to unit Euclidean length before being passed into the network. Finally, there are no convolution weights. This means that the whole GCN does not contain a single parameter, but is just a fixed function on the learned node embeddings.
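Assuming that “variance \(n^{-1/2}\)” refers to the square of the standard deviation, the initialization used in the code can be sketched as follows (dimensions are illustrative, not taken from the original configuration):

    import numpy as np

    n, d = 19388, 200                     # entity count and embedding dim (illustrative)
    rng = np.random.default_rng(0)
    std = (n ** -0.5) ** 0.5              # variance n^{-1/2}  =>  std n^{-1/4}
    H = rng.normal(0.0, std, size=(n, d))
    H /= np.linalg.norm(H, axis=1, keepdims=True)   # unit L2 norm per node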

4 Experiments

Table 3. Ablation study on using convolution weights and different embedding initializations. We fix whether convolution weights are used and the variance of the normal distribution from which the embedding vectors are initialized, and optimize the other hyperparameters according to validation H@1 (80/20% train-validation split) on DBP15K (JAPE) zh-en in a large-scale hyperparameter search, comprising 1,440 experiments, with the following grid: optim. \(\in \) {Adam, SGD}, lr \(\in \{0.1, 0.5, 1, 10, 20\}\), #layers \(\in \{1, 2, 3\}\), #neg. samples \(\in \{5, 50, 100\}\), #epochs \(\in \{10, 500, 2000, 3000\}\). Hence, we obtain four sets of hyperparameters. For each dataset, we then perform a smaller hyperparameter search to fine-tune lr, #epochs & #layers (again 80/20 split) and evaluate the best models on the official test set, with standard deviations computed across 5 runs.
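As a sanity check on the search size, the grid can be enumerated as follows (a sketch with our own variable names):

    from itertools import product

    grid = {
        "optim": ["Adam", "SGD"],
        "lr": [0.1, 0.5, 1, 10, 20],
        "num_layers": [1, 2, 3],
        "num_neg_samples": [5, 50, 100],
        "num_epochs": [10, 500, 2000, 3000],
    }
    configs = [dict(zip(grid, vals)) for vals in product(*grid.values())]
    # 2 * 5 * 3 * 3 * 4 = 360 configurations; combined with the 2 x 2 fixed
    # settings (weights on/off, two initializations) this gives 1,440 experiments.
    print(len(configs))  # 360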

In initial experiments we were able to reproduce the results reported in the paper using the implementation provided by the authors. Moreover, we were able to reproduce the results with our own implementation, with settings adjusted to the authors’ code. In addition, we replaced the adjacency matrix based on functionality and inverse functionality by a simpler version, where \(a_{ij} = |\{(h,r,t) \in T \mid h=e_i, t=e_j\}|\), and used \(\hat{D}^{-1}\hat{A}\) instead of the symmetric normalization. In total, we see no difference in performance between our simplified adjacency matrix and the authors’ version. We identified two aspects which affect the model’s performance: not using convolution weights, and normalizing the variance when initializing node embeddings. We provide empirical evidence for this finding across numerous datasets. Our results regarding Hits@1 (H@1) are summarized in Table 3.
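A sketch of this simplified construction (our own code, not the authors’):

    import numpy as np

    def simple_adjacency(triples, num_entities):
        """Count-based adjacency: a_ij = |{(h, r, t) in T : h = e_i, t = e_j}|."""
        A = np.zeros((num_entities, num_entities))
        for h, _, t in triples:
            A[h, t] += 1.0
        A_hat = A + np.eye(num_entities)    # add self-loops
        # row-stochastic normalization D^{-1} A_hat instead of the symmetric variant
        return A_hat / A_hat.sum(axis=1, keepdims=True)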

Node Embedding Initialization. Comparing the columns of Table 3, we can observe the influence of the node embedding initialization. Using the settings from the authors’ code, i.e. not using weights, choosing a variance of \(n^{-1/2}\) actually results in inferior performance in terms of H@1 compared to using a standard normal distribution. These findings are consistent across datasets.

Convolution Weights. The first column of Table 3 corresponds to the weight usage and initialization settings used in the code for GCN-Align. We achieve slightly better results than published in [22], which we attribute to a more exhaustive parameter search. Interestingly, all best configurations use the Adam optimizer instead of SGD. Adding convolution weights degrades the performance across all datasets and subsets thereof but one, as witnessed by comparing the first two columns with the last two columns.

5 Conclusion

In this work, we reported our experiences when implementing the Knowledge Graph alignment method GCN-Align. We pointed out important differences between the model described in the paper and the actual implementation, and quantified their effects in an ablation study. For future work, we plan to include other methods for entity alignment in our framework.