1 Introduction and Motivation

Semantic Web initiatives have facilitated the definition of ontologies and large linked datasets, as well as the encoding of domain knowledge by annotating datasets with terms from ontologies. Ontology-based annotations induce annotation graphs or heterogeneous information networks where nodes represent entities or annotations, and links correspond to relationships among entities. Annotations encode domain knowledge required to precisely compute similarity between annotated concepts. Figure 1 presents the therapeutic targets HER1 and HER2 and their annotations from the Gene Ontology (GO). These annotations explicitly describe properties of HER1 and HER2, and state-of-the-art similarity measures like AnnSim [13] or DiShIn [4] decide relatedness between HER1 and HER2 in terms of the similarity of these annotations. However, because annotations correspond to terms in an ontology, they can be of different types or be related through different relationships. Additionally, these annotations can also be used to perform reasoning tasks that infer new implicit annotations. If semantic similarity measures do not consider this information, inaccurate similarity values can be assigned. Our research aims at exploiting all this knowledge to precisely decide relatedness, and at defining a novel similarity measure named ColorSim that is able to: (i) distinguish the types of the relationships in the annotation graphs; and (ii) consider implicit relationships and compare them in terms of the justifications that support these inferences. Further, we will devise an efficient and scalable implementation of ColorSim and implement a framework for link prediction and domain pattern discovery that exploits the properties of ColorSim. For a preliminary evaluation of our approach, we use the online tool Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) [18] to study the quality of ColorSim on a dataset composed of pairs of proteins from UniProt.
We compare ColorSim with three domain-specific similarity measures: Sequence Similarity (SeqSim) [22], ECC [5], and Pfam [18], and with eleven state-of-the-art semantic similarity measures. Experimental results suggest that ColorSim exhibits high correlation with domain-specific measures, and is competitive with similarity measures that consider both information content and structural characteristics of the compared annotations. We plan to extend our study to analyze the impact of ColorSim on link prediction and pattern discovery in the Life Sciences domain, e.g., drug-target interaction collections [2, 16] and GO annotated families of genes [13]; as well as in the e-learning domain, e.g., for the recommendation of learning objects annotated with the Pedagogical Ontology (PO) developed in the INTUITEL project.

Fig. 1. Annotations in GO of genes HER1 and HER2

2 Related Work

We have identified the following similarity measures that are able to deal with heterogeneous information networks: (i) Taxonomic-based, (ii) Meta-Path-based, (iii) Neighborhood-based, (iv) Annotation-based, and (v) Information Content-based similarity measures.

Taxonomic-Based Similarity Measures: Taxonomic-based similarity measures decide relatedness in terms of the topology of the ontology and usually consider only the is-a relationship. \(D_{ps}\) [15] and \(D_{tax}\) [1] are state-of-the-art taxonomic similarity measures that assign higher similarity values to pairs of nodes that are at greater depth in the taxonomy and closer to their lowest common ancestor, i.e., similarity is defined in terms of the deepest common ancestor of these two nodes in the ontology. Usually, they do not consider any kind of semantics; therefore, relationship types or implicit facts may not be taken into account.

Meta-Path-Based Similarity Measures: Meta-path-based similarity measures compute relatedness in terms of the sub-graphs of an original information network that satisfy a meta-path expression. A meta-path is a path expression on the nodes and edges of the information network, and characterizes a set of paths. The intuition behind meta-path-based similarity measures is that the more two concepts are linked by paths that satisfy the input meta-path, the more similar they are. PathSim [23] and HeteSim [20] are meta-path-based similarity measures that compute relatedness based on this idea. These similarity measures are not designed to deal with ontologies, and they do not consider the semantics that describe the terms used to annotate the concepts in the information network. Therefore, they only take into account links that are explicitly defined in the information network, omitting implicit facts and their corresponding justifications.

Neighborhood Based Similarity Measures: Neighborhood based similarity measures define relatedness of two concepts in terms of the similarity of their neighbors. SimRank [7] extends PageRank [12] to compute relatedness between graph related concepts. However, SimRank is not designed to deal with ontologies; thus, it does not differentiate between link types, their semantics, and implicit facts, i.e., all the neighbors are considered in the same way, regardless of the type of the relationships that connect them.

Information Content-Based Similarity Measures: Information Content measures how informative a concept is in a certain corpus. It is calculated with the following formula: \(IC(x) = -\log \left( \frac{\textit{freq}(x)}{N}\right) \), where \(\textit{freq}(x)\) is the number of times the concept \(x\) appears in the corpus, and \(N\) is the size of the corpus; therefore, more frequently used concepts are seen as less informative. The main work in this area is the similarity measure presented by Resnik et al. [19], which defines relatedness between two concepts as the Information Content of their most informative common ancestor. Further, Jiang and Conrath [8], and Lin [11] rely on this idea. Couto et al. refine the similarity measure of Resnik with GraSM [3] and DiShIn [4] by defining the disjunctive common ancestors of two concepts; the similarity is then defined as the average of the Information Content of all the disjunctive common ancestors. The Information Content-based similarity measures are designed to calculate the similarity between words in a thesaurus; therefore, they only consider the topology of the taxonomy.
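The Information Content formula and the Resnik-style measure described above can be illustrated with a minimal Python sketch. The natural logarithm, the concept names, the frequency counts, and the ancestor sets are all assumptions made for illustration; they are not taken from any of the cited works.

```python
from math import log


def information_content(freq, corpus_size):
    """IC(x) = -log(freq(x) / N); the natural logarithm is an assumption."""
    if freq <= 0 or freq > corpus_size:
        raise ValueError("freq must satisfy 0 < freq <= corpus_size")
    return -log(freq / corpus_size)


def resnik_similarity(ancestors_x, ancestors_y, freq, corpus_size):
    """IC of the most informative common ancestor (0.0 if none is shared)."""
    common = ancestors_x & ancestors_y
    return max(
        (information_content(freq[c], corpus_size) for c in common),
        default=0.0,
    )


# Hypothetical toy corpus: the counts and the hierarchy are made up.
FREQ = {"root": 100, "mid": 10, "x": 2, "y": 1}
ANCESTORS_X = {"x", "mid", "root"}
ANCESTORS_Y = {"y", "mid", "root"}

if __name__ == "__main__":
    print(information_content(FREQ["root"], 100))  # the root carries no information
    print(resnik_similarity(ANCESTORS_X, ANCESTORS_Y, FREQ, 100))
```

In this toy corpus the shared ancestor `mid` is rarer than `root`, so it determines the similarity value, which matches the intuition that frequently used concepts are less informative.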

Annotation-Based Similarity Measures: AnnSim [13] is an annotation-based similarity measure that determines relatedness of two entities in terms of the similarity of their annotations. To compute the similarity of annotations, AnnSim combines properties of path- and topological-based similarity measures like \(D_{\textit{tax}}\) and Dice coefficients, and does not consider any additional semantics represented in the corresponding ontology. Contrary to existing approaches, ColorSim considers semantics as a first-class citizen, and exploits this knowledge during the computation of relatedness between ontology-based annotated entities.

3 Problem Statement and Contributions

We hypothesize that the semantics encoded in ontologies hold valuable information that has to be considered to determine relatedness. Our first research goal addresses the challenges of defining a semantic similarity measure able to differentiate between relationship types and exploit their semantics; then, we plan to develop a framework that relies on this measure to enhance data mining tasks. Our research questions (RQ) are the following: (RQ1) What improvement is achieved by considering semantics during the computation of similarity between two annotated concepts? (RQ2) How can semantic similarity measures efficiently scale up to large datasets and be computed in real-time applications? (RQ3) What is the impact of expressive semantic similarity measures on data mining tasks, e.g., to discover domain patterns between annotated concepts?

Existing similarity measures are not able to fully exploit information about relationship types or their properties. Therefore, our first research goal is to propose a novel semantic similarity measure. We rely on OWL2 as the vocabulary to describe concepts and relationships, and the axioms that describe their semantics; further, an OWL2 reasoner is assumed to infer implicit facts. Figure 2(a) presents a taxonomy of relationships in the Gene Ontology (GO). Relationship taxonomies can refine a neighborhood-based similarity approach by assuming that not only the neighbors of a concept influence the similarity measure, but also the relationship type used to infer that a concept is a neighbor. For example, consider four concepts A, B, C, and D that are all identical in terms of taxonomy-based similarity but are related through the following relationships: (i) A part_of D; (ii) B negatively_regulates D; and (iii) C positively_regulates D. Since negatively_regulates and positively_regulates are more similar to each other than to part_of according to the taxonomy (see Fig. 2(a)), B and C must be more similar than A and B, or A and C.
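To make the example concrete, the following Python sketch compares relationship types with a taxonomic similarity. Two things are assumptions here: the small relationship hierarchy (with a hypothetical root `related_to`), which only mimics the kind of taxonomy shown in Fig. 2(a), and the formula used for the taxonomic similarity, which is one common formulation (one minus the ratio between the distances to the lowest common ancestor and the distances to the root) and may differ in detail from \(D_{tax}\) [1].

```python
def path_to_root(node, parent):
    """Return the list [node, parent(node), ..., root] in a tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def taxonomic_similarity(x, y, parent):
    """1 - (d(lca,x) + d(lca,y)) / (d(root,x) + d(root,y)); assumed formula."""
    path_x, path_y = path_to_root(x, parent), path_to_root(y, parent)
    if path_x[-1] != path_y[-1]:
        raise ValueError("nodes do not share a root")
    on_path_y = set(path_y)
    # The first node on x's path that is also an ancestor of y is the LCA.
    lca = next(n for n in path_x if n in on_path_y)
    dist_to_lca = path_x.index(lca) + path_y.index(lca)
    dist_to_root = (len(path_x) - 1) + (len(path_y) - 1)
    if dist_to_root == 0:  # both nodes are the root itself
        return 1.0
    return 1.0 - dist_to_lca / dist_to_root


# Hypothetical relationship taxonomy (child -> parent), for illustration only.
REL_PARENT = {
    "part_of": "related_to",
    "regulates": "related_to",
    "negatively_regulates": "regulates",
    "positively_regulates": "regulates",
}

if __name__ == "__main__":
    print(taxonomic_similarity("negatively_regulates", "positively_regulates", REL_PARENT))
    print(taxonomic_similarity("part_of", "negatively_regulates", REL_PARENT))
```

Under these assumptions the two regulation relationships obtain a higher similarity than either of them obtains with part_of, which is the ordering the example argues for.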

Fig. 2. Differences according to the knowledge encoded in GO

Fig. 3. Examples of implicit facts in GO (dashed arrows)

Additionally, existing semantic similarity measures do not take into account implicit facts. The descriptions of the relationships in the datasets of the Linking Open Data (LOD) cloud include a set of semantic properties specified in OWL2, e.g., transitivity, reflexivity, ObjectPropertyChain, or symmetry, which allow a reasoner to infer new implicit relationships between two concepts. To illustrate, consider the following properties of GO relationships: (i) hasPart is the inverse of partOf; and (ii) regulates is transitive over partOf by means of an ObjectPropertyChain axiom. Additionally, relationships are transitive over the is-a relationship in OWL2. Although considering implicit relationships is a step forward with respect to the state of the art, it is not enough to compute accurate similarity values. We consider that not only the final inference is relevant to the similarity, but also the derivation route followed to reach this inference. This route is provided by OWL2 reasoners as a set of axioms that support the final inference. Figure 2(b) illustrates implicit relationships according to the semantics encoded in GO using dashed arrows. The reasoner infers that A, B, and C negatively regulate E and F. A and B share the justification, while the justification for C is different. The justification for A and B is based on the fact that the property negatively_regulates is transitive over the is-a relationship, while the justification for C relies on the transitivity of negatively_regulates. Further, the same implicit relationship may have more than one justification. For example, the implicit relationship negatively_regulates in Fig. 3(a) can be inferred by applying: (a) transitivity over negatively_regulates, or (b) transitivity over the is-a relationship.
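The idea that one implicit relationship can be supported by several justifications can be sketched with a toy one-step inference procedure in Python. This is not an OWL2 reasoner: it applies only two hand-written rules (transitivity of a property, and propagation of a relationship over is-a) to explicit facts, and the node names, the `is_a` label, and the justification labels are hypothetical.

```python
from collections import defaultdict


def infer_one_step(explicit, transitive_props):
    """Map each implicit fact (s, r, o) to the set of rule labels deriving it.

    Only single-step derivations from explicit facts are considered.
    """
    justifications = defaultdict(set)
    for (s, r1, mid) in explicit:
        for (mid2, r2, o) in explicit:
            if mid != mid2:
                continue
            if r1 == r2 and r1 in transitive_props:
                # s r mid, mid r o  =>  s r o  (r declared transitive)
                fact, label = (s, r1, o), f"transitive({r1})"
            elif r1 == "is_a" and r2 != "is_a":
                # s is_a mid, mid r o  =>  s r o
                fact, label = (s, r2, o), "transitive over is_a"
            else:
                continue
            if fact not in explicit:
                justifications[fact].add(label)
    return dict(justifications)


# Hypothetical explicit facts: X reaches Z by two different derivation routes.
EXPLICIT = {
    ("X", "negatively_regulates", "Y"),
    ("Y", "negatively_regulates", "Z"),
    ("X", "is_a", "W"),
    ("W", "negatively_regulates", "Z"),
}

if __name__ == "__main__":
    inferred = infer_one_step(EXPLICIT, {"negatively_regulates"})
    for fact, labels in inferred.items():
        print(fact, sorted(labels))
```

In this toy graph the single implicit fact that X negatively regulates Z is reported with both justifications, mirroring the two derivation routes described for Fig. 3(a).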

Our second research goal is to provide a framework able to efficiently compute ColorSim in real time and to scale up to large datasets. Currently, Web-based recommendation systems rely on similarity measures that have to be calculated in real time to satisfy users’ requests. Similarity measures used in this context belong to some of the categories presented in Sect. 2; they can be calculated in polynomial time. Additionally, link prediction and domain pattern discovery approaches require the accurate computation of similarity measures for large datasets. Thus, our research will explore different heuristics to efficiently determine the properties of the implicit and explicit ontology facts, as well as the combination of this knowledge to decide relatedness.

Finally, our third research goal is the development of graph mining frameworks that, by exploiting our proposed similarity measures, will be able to predict potential novel interactions and patterns. We will focus on the following three problems in the Life Sciences domain: (1) defining relatedness between semantically annotated surgery procedures [9]; (2) extending the prediction approach proposed by Palma et al. [14] to suggest new interactions between drugs and targets; and (3) analyzing and enhancing the quality of computationally inferred Gene Ontology annotations [21].

4 Proposed Approach and Research Methodology

We aim at enhancing semantic similarity measures with semantics from ontologies, e.g., relationship types, implicit facts and their corresponding justifications, and thus at improving the tasks of link prediction, pattern discovery, and recommendation. We propose ColorSim, a semantic similarity measure that computes relatedness between two entities \(E_1\) and \(E_2\) annotated with ontology terms. ColorSim assigns values of similarity close to 1.0 to \(E_1\) and \(E_2\) if their corresponding annotation sets \(A_1\) and \(A_2\) are highly similar, i.e., similarity depends on how good the matching between the annotations in \(A_1\) and \(A_2\) is. To compute this matching, sets \(A_1\) and \(A_2\) are represented as a weighted bipartite graph \({ WBG}=(A_1 \cup A_2, { WE})\), where WE is a set of weighted edges in the Cartesian product of \(A_1\) and \(A_2\), and an edge weight corresponds to the similarity between the annotations \(a_1 \in A_1\) and \(a_2 \in A_2\) connected by the edge.

The novelty of our approach lies in the computation of the similarity between \(a_1\) and \(a_2\). ColorSim considers not only the class hierarchy of the ontology to decide the relatedness between \(a_1\) and \(a_2\), but also takes into account the explicit and implicit neighbors, the type of the relationships that support the inference of these neighbors, and the reasoning processes performed to infer the implicit facts. To illustrate the impact that considering additional knowledge can have on the computation of the similarity, consider the portion of GO presented in Fig. 3(b). Although the neighbors of cardiac muscle contraction and diaphragm contraction are very different, both in terms of taxonomy-based similarity and in terms of their justifications, \(D_{tax}\)(cardiac muscle contraction, diaphragm contraction) is 0.75. In contrast, our similarity measure considers the semantics encoded in the ontology and detects that these two annotations are dissimilar, i.e., Sim(cardiac muscle contraction, diaphragm contraction) is equal to 0.135.

We define for each annotation \(a_i\), a set \(R_i\) of relationships where \(a_i\) appears as subject. Each element in \(R_i\) is a quadruple \(t=(a_i, a_j, r_{ij}, E_{ij})\), where \(r_{ij}\) is a relationship type such that there is an out-going link from \(a_i\) to \(a_j\) in the ontology, and \(E_{ij}\) is a set composed of the justifications that support the inference of \(r_{ij}\), whenever \(r_{ij}\) is an implicit fact. Figure 4 illustrates neighborhoods of nodes where the same relationships are inferred using different justifications. Quadruples represent the association between two nodes through an explicit or implicit relationship, e.g., \(t_1=\)(A, E, neg-regulates, {transitive over is-a}) is an example of a quadruple where the relationship neg-regulates is implicit and inferred by using the axiom transitive over is-a. Based on the knowledge represented in quadruples, we compute the similarity \(Sim(a_1,a_2)\) as follows:

$$ \textit{Sim}(a_1, a_2) = \frac{\sum \limits _{(t_{1i},t_{2j})\in R_1\,\times \,R_2}\textit{Sim}_{\textit{relationship}}(t_{1i},t_{2j})}{\textit{Max}(|R_1|,|R_2|)} $$

where

  • \(R_1\) and \(R_2\) are the relationships sets of \(a_1\) and \(a_2\), respectively;

  • quadruples \(t_{1i}=(a_1,a_i, r_{1i},E_{1i})\) and \(t_{2j}=(a_2,a_j, r_{2j},E_{2j})\) form a pair \((t_{1i},t_{2j})\) in the Cartesian product \(R_1\,\times \,R_2\); and

  • \(\textit{Sim}_{\textit{relationship}}(t_{1i},t_{2j})\) is defined as a triangular norm \(tN\) that combines the values of similarity of the justifications of \(r_{1i},r_{2j}\) with the taxonomy-based similarity of \(t_{1i}\) and \(t_{2j}\).

Fig. 4. Neighborhoods of nodes in Fig. 2(b). Solid and dashed arrows represent explicit and implicit relationships, respectively. Implicit relationships are labelled with the axioms used to derive the relation.

The \(\textit{Sim}_{\textit{relationship}}(t_{1i},t_{2j})\) is defined as follows:

$$\begin{aligned} \textit{Sim}_{\textit{relationship}}(t_{1i},t_{2j})={ tN}(\textit{Sim}_{D}(t_{1i},t_{2j}), \textit{Sim}_{\textit{justificationSet}}(E_{1i}, E_{2j})) \end{aligned}$$

where,

  • The taxonomic similarity of \(t_{1i}\) and \(t_{2j}\), \(\textit{Sim}_{D}(t_{1i},t_{2j})\), corresponds to a triangular norm that combines three taxonomic similarities: \(D_{\textit{tax}}(a_1,a_2)\), \(D_{\textit{tax}}(a_i, a_j)\), and \(D_{\textit{tax}}(r_{1i}, r_{2j})\); and

  • \(\textit{Sim}_{\textit{justificationSet}}(E_{1i}, E_{2j})\) is a similarity measure that determines the relatedness of the justification sets \(E_{1i}\) and \(E_{2j}\) based on the similarity of the justifications in the Cartesian product of \(E_{1i}\) and \(E_{2j}\).

A justification \(e\) is described in terms of a set \(X\) of axioms used in the derivation of the corresponding relationship. Formally, the similarity of sets \(E_{1i}\) and \(E_{2j}\) is defined as follows:

$$\begin{aligned} \textit{Sim}_{\textit{justificationSet}}(E_{1i}, E_{2j})=\frac{\sum \limits _{(e_{1i},e_{2j})\in (E_{1i}\,\times \,E_{2j})}\textit{Sim}_{\textit{justification}}(e_{1i},e_{2j})}{\textit{Max}(|E_{1i}|,|E_{2j}|)} \end{aligned}$$

where,

  • \(\textit{Sim}_{\textit{justification}}(e_{1i}, e_{2j})\) is defined as the similarity of the sets \(X_{1i}, X_{2j}\) of axioms of \(e_{1i}, e_{2j}\), i.e., \(\textit{Sim}_{\textit{justification}}(e_{1i}, e_{2j}) = \textit{Sim}_{\textit{axiomSet}}(X_{1i}, X_{2j})\); and

  • the similarity of two sets of axioms, \(\textit{Sim}_{\textit{axiomSet}}(X_{1i}, X_{2j})\), is defined in terms of the type of the axioms.

Currently, we consider four types of OWL2 axioms: subClassOf, subPropertyOf, ObjectPropertyChain, and TransitiveProperty. Further, we provide a different definition of similarity for each axiom, and the similarity between different axioms is 0.0.
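The chain of definitions above (\(\textit{Sim}\), \(\textit{Sim}_{\textit{relationship}}\), \(\textit{Sim}_{\textit{justificationSet}}\), and \(\textit{Sim}_{\textit{axiomSet}}\)) can be traced with a minimal Python sketch. Several choices are assumptions that the text leaves open: the product is used as the triangular norm, axioms are represented only by their type label and two axioms of the same type are given similarity 1.0, two explicit relationships (empty justification sets) are given justification similarity 1.0, and the taxonomic similarity is passed in as a function with made-up values.

```python
from itertools import product


def t_norm(x, y):
    """Product t-norm; the choice of t-norm is an assumption."""
    return x * y


def sim_axiom_set(x1, x2):
    """Axioms are type labels: same type scores 1.0 (assumed), different 0.0."""
    if not x1 or not x2:
        return 1.0 if not x1 and not x2 else 0.0
    hits = sum(1.0 for p, q in product(x1, x2) if p == q)
    return hits / max(len(x1), len(x2))


def sim_justification_set(e1, e2):
    """Sum over E1 x E2 of justification similarities, divided by max size."""
    if not e1 or not e2:
        # Assumption: two explicit relationships are treated as fully similar.
        return 1.0 if not e1 and not e2 else 0.0
    total = sum(sim_axiom_set(j1, j2) for j1, j2 in product(e1, e2))
    return total / max(len(e1), len(e2))


def sim_relationship(t1, t2, dtax):
    """Combine taxonomic and justification similarity of two quadruples."""
    a1, ai, r1, e1 = t1
    a2, aj, r2, e2 = t2
    sim_d = t_norm(t_norm(dtax(a1, a2), dtax(ai, aj)), dtax(r1, r2))
    return t_norm(sim_d, sim_justification_set(e1, e2))


def sim(rels1, rels2, dtax):
    """Sim(a1, a2): sum over R1 x R2 divided by max(|R1|, |R2|)."""
    if not rels1 or not rels2:
        return 0.0  # assumption: no relationships, no evidence of similarity
    total = sum(sim_relationship(t1, t2, dtax) for t1, t2 in product(rels1, rels2))
    return total / max(len(rels1), len(rels2))


def toy_dtax(x, y):
    """Hypothetical taxonomic similarity: 1.0 if identical, 0.5 otherwise."""
    return 1.0 if x == y else 0.5


# Quadruples (subject, object, relationship type, set of justifications);
# each justification is a frozenset of axiom type labels.
BY_IS_A = frozenset({frozenset({"subClassOf"})})
BY_TRANSITIVITY = frozenset({frozenset({"TransitiveProperty"})})
T_A = ("A", "E", "neg-regulates", BY_IS_A)
T_B = ("B", "E", "neg-regulates", BY_IS_A)
T_C = ("C", "E", "neg-regulates", BY_TRANSITIVITY)

if __name__ == "__main__":
    print(sim([T_A], [T_B], toy_dtax))  # shared justification
    print(sim([T_A], [T_C], toy_dtax))  # different justification
```

With these assumptions, A and B, whose implicit relationship rests on the same kind of axiom, obtain a positive similarity, while A and C, whose justifications use different axiom types, obtain 0.0.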

Based on the definition of the similarity \(\textit{Sim}(a_1, a_2)\) between two annotations \(a_1\) and \(a_2\), we compute the 1-to-1 maximal weighted bipartite graph matching between two sets of annotations. Given two annotation sets \(A_1\) and \(A_2\), let \({ MWBG}=(A_1 \cup A_2, { WEr})\) be the 1-to-1 maximal weighted bipartite graph matching for a weighted bipartite graph \({ WBG}=(A_1 \cup A_2, { WE})\), where WEr \(\subseteq \) WE. ColorSim on MWBG is defined as follows:

$$ \textit{ColorSim}({ MWBG} ) = \frac{\sum \limits _{(a_1,a_2)\in { WEr}}\textit{Sim}(a_1,a_2)}{\textit{Max}(|A_1|,|A_2|)}$$
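The aggregation step can be sketched in Python as follows. The sketch finds the 1-to-1 maximum weighted matching by brute force over all injective assignments, which is only feasible for very small annotation sets; it is a stand-in for illustration and not the implementation used in this work, where an assignment algorithm such as the Hungarian method would be the natural choice. The annotation names and edge weights are hypothetical.

```python
from itertools import permutations


def color_sim(annots1, annots2, sim):
    """Best 1-to-1 matching weight divided by max(|A1|, |A2|).

    `sim(a1, a2)` must return the similarity of an annotation of the first
    set and an annotation of the second set.
    """
    a1, a2 = list(annots1), list(annots2)
    if not a1 or not a2:
        return 0.0
    swapped = len(a1) > len(a2)
    small, large = (a2, a1) if swapped else (a1, a2)

    def weight(s, l):
        # Keep the argument order sim(first set, second set).
        return sim(l, s) if swapped else sim(s, l)

    # Every injective assignment of the smaller set into the larger one.
    best = max(
        sum(weight(s, l) for s, l in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )
    return best / max(len(a1), len(a2))


# Hypothetical edge weights; missing pairs have similarity 0.0.
WEIGHTS = {
    ("x1", "y1"): 0.75,
    ("x1", "y2"): 0.5,
    ("x2", "y1"): 0.75,
}


def toy_sim(a1, a2):
    return WEIGHTS.get((a1, a2), 0.0)


if __name__ == "__main__":
    print(color_sim(["x1", "x2"], ["y1", "y2", "y3"], toy_sim))
```

In this toy instance the best matching pairs x1 with y2 and x2 with y1 (total weight 1.25) rather than giving y1 to x1, and the unmatched annotation y3 lowers the final value through the normalization by the larger set.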

5 Preliminary Results

We use the Collaborative Evaluation of GO-based Semantic Similarity Measures (CESSM) tool [18] to evaluate ColorSim on a dataset composed of pairs of proteins from UniProt. These proteins are annotated with GO terms separated into the GO hierarchies of biological process (BP), molecular function (MF), and cellular component (CC). GO and UniProt are both from August 2008. CESSM implements eleven semantic similarity measures; some of them were developed specifically for GO, while others are general measures. We evaluated ColorSim on the provided dataset and compared our results with the other measures and the three gold standards. Figures 5(a) and 5(b) report the results of ColorSim produced by the CESSM tool. The correlation between ColorSim and SeqSim is higher than 0.72; its behavior is very similar to that of simGIC (GI) [17] and simUI (UI) [6], two similarity measures specific to GO. Table 1 shows the correlations of ColorSim and state-of-the-art measures w.r.t. three gold standard measures: ECC, Pfam, and SeqSim. ColorSim ranks sixth with ECC, first with Pfam, and fourth with SeqSim. Further, ColorSim is the domain-independent measure with the highest correlation. The Pearson correlation of ColorSim with SeqSim is 0.726, while the state-of-the-art annotation similarity measure AnnSim has a correlation of 0.65 with SeqSim on the same dataset. Both measures rely on the GO annotations to compute similarity. However, AnnSim is based on \(D_{tax}\); it only considers the class hierarchy of the ontology and may assign high similarity values to dissimilar proteins that also have low SeqSim values. In contrast, ColorSim is able to distinguish the relationships that relate the neighbors of two annotations and the axioms used to infer the implicit facts. Thus, ColorSim can assign more accurate values of similarity and exhibits a better correlation with baseline similarity measures.
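For readers who want to reproduce this kind of comparison outside CESSM, the Pearson correlation between two lists of scores for the same protein pairs can be computed as in the following Python sketch. The score lists are made-up placeholders and do not correspond to the CESSM dataset or to any value reported here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for four hypothetical protein pairs.
MEASURE_SCORES = [0.1, 0.4, 0.5, 0.9]
GOLD_SCORES = [0.2, 0.3, 0.6, 0.8]

if __name__ == "__main__":
    print(round(pearson(MEASURE_SCORES, GOLD_SCORES), 3))
```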

Fig. 5. Correlation between SeqSim and ColorSim

Table 1. Correlation with three baseline similarity measures: ECC, Pfam, and SeqSim

6 Evaluation Plan

We will develop an implementation of ColorSim able to efficiently scale up to large datasets. The evaluation of our approach will be conducted on different biomedical datasets that represent associations between drugs and targets [2, 16], and genes and GO terms [13]; as well as on PO annotated learning objects. We also plan to enhance the link prediction approach proposed by Palma et al. [14] with the properties of ColorSim and to study the impact that these new features have on link prediction. Finally, we will extend ColorSim to consider the order between the annotations of two entities; this feature will make it possible to detect relatedness between processes that are described in terms of sequences of annotations. We will use the dataset of semantically annotated surgery procedures [9] to evaluate the quality of our approach.

7 Lessons Learned and Conclusions

We proposed a semantic similarity measure aware of relationship types and of their semantics. Our results show an improvement w.r.t. state-of-the-art measures, with ColorSim being the generic measure most correlated with the gold standards. However, it is important to highlight that, because an OWL2 reasoner needs to be invoked, the worst-case complexity of ColorSim is 2NEXP-Time [10]. Therefore, heuristics are required to compute the justifications of the implicit relationships efficiently. Furthermore, we have observed that in ontologies with a small number of axioms, the benefits of ColorSim are negligible in comparison to its computational cost. Thus, we need to develop strategies to detect the conditions under which computing the implicit relationships and their respective justifications pays off. The study of these computational issues and the development of a graph mining framework that exploits the benefits of ColorSim are part of our future work.