Abstract

In recent years, ontologies have been widely used in various domains such as medical records annotation, medical knowledge representation and sharing, clinical guideline management, and medical decision-making. To enable cooperation between intelligent applications based on biomedical ontologies, it is crucial to establish correspondences between the heterogeneous biomedical concepts in different ontologies, a task known as biomedical ontology matching. Although Evolutionary Algorithms (EAs) are among the state-of-the-art methodologies for matching heterogeneous ontologies, huge memory consumption, long runtime, and the bias improvement of the solutions prevent them from matching biomedical ontologies efficiently. To overcome these shortcomings, we propose a compact Coevolutionary Algorithm to efficiently match biomedical ontologies. In particular, a compact EA with a local search strategy reduces memory consumption and runtime, and three subswarms with different optimization objectives help one another to avoid the solution’s bias improvement. In the experiment, two well-known test cases provided by the Ontology Alignment Evaluation Initiative (OAEI 2017), i.e., the Anatomy track and the Large Biomed track, are used to test our approach’s performance. The experimental results show the effectiveness of our proposal.

1. Introduction

Ontologies provide a shared and common vocabulary for representing a domain of knowledge [1]. In recent years, ontologies have been widely used in various domains such as medical records annotation [2], medical knowledge representation and sharing, clinical guidelines management [3], and medical decision-making [4]. However, most biomedical ontologies are developed independently by different experts, who might define one entity with different names or in different ways, causing the problem of ontology heterogeneity. For example, to describe the muscles that surround and power the human heart, the National Cancer Institute’s thesaurus and ontology (NCI) [5] uses the name “Myocardium,” whereas the Foundational Model of Anatomy (FMA) [6] uses “Cardiac Muscle Tissue.” To enable cooperation between intelligent applications based on biomedical ontologies, it is crucial to establish correspondences between the heterogeneous biomedical concepts in different ontologies, a task known as biomedical ontology matching.

Evolutionary Algorithms (EAs) have recently become one of the state-of-the-art methodologies for matching heterogeneous ontologies [7]. However, huge memory consumption, long runtime, and the bias improvement of the solutions prevent EA-based ontology matching techniques from matching biomedical ontologies efficiently. Thus, besides the quality of the alignments, the main memory consumption and runtime needed by the ontology matcher are of prime importance when matching biomedical ontologies. In this paper, we propose to use the compact EA [8], which utilizes a probabilistic representation of the population, to reduce the memory consumption of the classic EA. Then, we introduce a local search strategy into its evolving process to balance exploration and exploitation and reduce the runtime needed. On this basis, we further propose a compact Coevolutionary Algorithm, which utilizes three subswarms with different objectives that help one another to avoid the solution’s bias improvement caused by the traditional metric f-measure [9].

The rest of the paper is organized as follows: Section 2 describes the related works; Section 3 gives some basic concepts of ontology, ontology alignment, and the similarity measures; Section 4 presents the optimal model problem and the details of the compact Coevolutionary Algorithm for matching biomedical ontologies; Section 5 gives the experimental results and relevant analysis; finally, Section 6 draws the conclusions.

2.1. Evolutionary Algorithm-Based Ontology Matching Technique

Due to the complex and time-consuming nature of the ontology matching process, EA-based methods are a suitable methodology for obtaining ontology alignments and have already been applied to the ontology alignment problem with acceptable results [10]. Different from other EA-based approaches [11–13], which model the ontology alignment process as a meta-matching problem, i.e., how to determine the most appropriate weight configuration in the ontology matching process in order to obtain a satisfactory alignment, in this work the ontology matching problem is considered as a global entity matching problem. Genetic Algorithm-Based Ontology Matching (GAOM) [14] is the representative system, which utilizes a Genetic Algorithm (GA) to determine the optimal ontology alignment. In particular, GAOM uses chromosomes to describe the potential alignments between two ontologies and uses the GA to determine the optimal solution. MapPSO and MapEVO [15], which exploit the Particle Swarm Optimization Algorithm (PSO) [16] and Evolutionary Programming (EP) [17], respectively, also adopt this idea. Acampora et al. [18] designed a Memetic Algorithm (MA) that introduces a local search process to improve the performance of the EA. More recently, Xue et al. [19, 20] used the compact EA and the compact Population-Based Incremental Learning Algorithm (PBIL), respectively, to reduce the memory consumption without sacrificing the solution’s quality. The compact EA and compact PBIL represent the population as a probability vector (PV) over the set of solutions and are operationally equivalent to the order-one behaviour of the simple EA with uniform crossover. In this way, a much smaller number of solutions must be stored in memory, which significantly reduces the memory consumption.

2.2. Coevolutionary Algorithm

The Coevolutionary Algorithm [21] makes multiple swarms evolve simultaneously and communicate with one another to improve the search performance. Currently, distributed coevolution is the most popular coevolving process; it shares the search information among multiple swarms through a population migration strategy. During the search process, different swarms have their own evolving strategies and configurations. Tan et al. [22] proposed to decompose the problem’s solution vector into multiple swarms that evolve simultaneously. Mu and Liu [23] presented an M-elite Coevolutionary Algorithm that applies different elite strategies in the coevolving process: the elite-centered swarm has the highest priority, and the other swarms implement the cooperative coevolving process. In [24], a parallel evolving mechanism was designed by dividing the population into three swarms that evolve independently. However, all the swarms use the same evolving strategy, and each swarm’s evolving process is relatively independent, which decreases the algorithm’s exploration and exploitation ability. More recently, Wang et al. [25] proposed a two-elite strategy that makes use of the differences between two elites to guide the whole evolving process.

Different from all the techniques mentioned above, in this work we propose a compact Coevolutionary Algorithm to match biomedical ontologies, which combines the advantages of the compact EA and the Coevolutionary Algorithm to reduce memory consumption and runtime and to overcome the bias improvement of solutions.

2.3. Preliminaries
2.3.1. Ontology, Ontology Alignment, and Ontology Matching Process

In this work, an ontology is defined as a quadruple $O = (C, P, I, \Lambda)$, where
(i) $C$ is the class set, i.e., the set of concepts that populate the domain of interest,
(ii) $P$ is the property set, i.e., the set of relations between the concepts of the domain,
(iii) $I$ is the instance set, i.e., the set of objects in the real world representing the instances of a concept, and
(iv) $\Lambda$ is the axiom set, i.e., the statements that say what is true about the modeled domain.

An alignment $A$ between two ontologies $O_1$ and $O_2$ is defined as a set of correspondences, and each correspondence is a triple $(e, e', n)$, where $e$ and $e'$ are the entities in $O_1$ and $O_2$, respectively, and $n$ is a confidence value holding for the correspondence between them. In this work, the relation existing between two ontology entities is the equivalence (=). The ontology matching process can be defined as a function $A = f(O_1, O_2, p, r)$ [26], where $p$ is the parameter set and $r$ is the resource set. The ontology matching process returns a new alignment $A$ between the ontologies $O_1$ and $O_2$.
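As a minimal sketch of these definitions, an alignment can be held as a set of correspondence triples. The class name, the helper `filter_alignment`, the entity identifiers, and the confidence values below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One triple: source entity, target entity, confidence value."""
    source: str
    target: str
    confidence: float  # typically a value in [0, 1]


def filter_alignment(alignment, threshold):
    """Keep only correspondences whose confidence reaches the threshold."""
    return {c for c in alignment if c.confidence >= threshold}


# Hypothetical alignment; identifiers and confidences are made up.
alignment = {
    Correspondence("NCI:Myocardium", "FMA:Cardiac Muscle Tissue", 0.92),
    Correspondence("NCI:Heart", "FMA:Lung", 0.15),
}
kept = filter_alignment(alignment, threshold=0.5)
```

Using a frozen dataclass makes each correspondence hashable, so an alignment can be an ordinary Python set.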

2.3.2. Concept Similarity Measure

The concept similarity measure is the foundation of biomedical ontology matching [27]. In this work, we utilize an asymmetrical concept similarity measure to calculate the similarity values of biomedical concepts. First, for each biomedical concept, we construct a profile by collecting the label, comment, and property information (such as the property’s label, domain, and range) from the concept itself and all its direct descendants. Then, the similarity of two biomedical concepts $c_1$ and $c_2$ is measured based on the similarity of their profiles $p_1$ and $p_2$, which can be calculated by the following two asymmetrical measures:

$$sim(c_1 \to c_2) = \frac{|p_1 \cap p_2|}{|p_1|}, \qquad sim(c_2 \to c_1) = \frac{|p_1 \cap p_2|}{|p_2|},$$

where $|p_1|$ and $|p_2|$ are the cardinalities of the profiles $p_1$ and $p_2$, respectively, and $|p_1 \cap p_2|$ is the number of identical elements in $p_1$ and $p_2$. The similarity value of $c_1$ and $c_2$ is derived from these two measures when $|sim(c_1 \to c_2) - sim(c_2 \to c_1)| \le \delta$, and is 0 otherwise.

In this work, $\delta$ is the threshold to measure the extent of the semantic equivalence between $c_1$ and $c_2$. When the similarity value between two profile elements is above the threshold, they are identified as semantically similar. Generally, $\delta$ should be set relatively small to reflect that $sim(c_1 \to c_2)$ and $sim(c_2 \to c_1)$ differ little when the entities $c_1$ and $c_2$ are semantically equivalent. However, if $\delta$ is too small, we would miss many semantically equivalent terms. Therefore, the suggested domain of $\delta$ is [0.01, 0.10]. In this work, to obtain a suitable $\delta$, we conducted a pre-experiment on the benchmark by varying the value of $\delta$ in its suggested domain and found that the semantic equivalence performs well when $\delta$ is set to 0.06.
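The profile-based measure can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: profiles are plain Python sets compared by exact element equality, and the two directions are combined by taking their minimum, which is an assumed choice. The function names are hypothetical; the default threshold 0.06 is the value reported above.

```python
def asymmetric_similarities(profile_1, profile_2):
    """Return the shared-element count divided by each profile's size."""
    if not profile_1 or not profile_2:
        return 0.0, 0.0
    shared = len(profile_1 & profile_2)
    return shared / len(profile_1), shared / len(profile_2)


def concept_similarity(profile_1, profile_2, delta=0.06):
    """Combine both directions; zero when they differ by more than delta.

    ASSUMPTION: taking the minimum of the two directions is an
    illustrative choice, not necessarily the paper's combination.
    """
    s12, s21 = asymmetric_similarities(profile_1, profile_2)
    if abs(s12 - s21) <= delta:
        return min(s12, s21)
    return 0.0


# Hypothetical profiles made of lowercase tokens.
heart = {"heart", "cardiac", "muscle", "tissue"}
same = concept_similarity(heart, set(heart))
lopsided = concept_similarity(heart, {"heart"})
```

In the second call the two directions are 0.25 and 1.0, so their difference exceeds the threshold and the pair is rejected.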

Moreover, the similarity value of two profile elements is calculated by the N-gram distance [28], which is the best-performing string-based similarity measure for the biological ontology matching problem, and a linguistic measure, which calculates a synonymy-based distance through the Unified Medical Language System (UMLS) [29]. Given two words $w_1$ and $w_2$, their similarity is equal to 1 when the two words are synonymous; otherwise, it is given by their N-gram similarity.
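A sketch of this two-stage word similarity is given below. Two things are assumptions: the N-gram measure is implemented as a Dice coefficient over character trigrams, which is only one common variant, and the small `SYNONYMS` table is a hypothetical stand-in for a UMLS lookup.

```python
def char_ngrams(word, n=3):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    if len(w) < n:
        return {w} if w else set()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def ngram_similarity(w1, w2, n=3):
    """Dice coefficient over character n-grams, in [0, 1]."""
    g1, g2 = char_ngrams(w1, n), char_ngrams(w2, n)
    if not g1 or not g2:
        return 0.0
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


def word_similarity(w1, w2, synonyms):
    """1.0 for identical or synonymous words, else the n-gram similarity."""
    pair = frozenset((w1.lower(), w2.lower()))
    if len(pair) == 1 or pair in synonyms:
        return 1.0
    return ngram_similarity(w1, w2)


# Hypothetical synonym table standing in for a UMLS query.
SYNONYMS = {frozenset(("myocardium", "cardiac muscle tissue"))}

synonym_score = word_similarity("Myocardium", "Cardiac Muscle Tissue", SYNONYMS)
string_score = word_similarity("heart", "hearts", SYNONYMS)
```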

2.4. Compact Coevolutionary Algorithm
2.4.1. Rough Alignment Evaluations

In this work, we suppose that, in the golden alignment, one concept in an ontology is matched with only one concept in the other ontology and vice versa. Two rough alignment evaluations, i.e., MatchCoverage and MatchRatio, are utilized to measure the alignment’s quality. In particular, MatchCoverage is utilized to approximate recall [9]; it calculates the fraction of concepts that appear in at least one correspondence of the resulting alignment relative to the total number of concepts in the ontologies. Its formula is as follows:

$$MatchCoverage = \frac{|C_{1,match}| + |C_{2,match}|}{|C_1| + |C_2|},$$

where
(i) $C_{1,match}$ and $C_{2,match}$ are the matched concept sets of ontologies $O_1$ and $O_2$, respectively; and
(ii) $C_1$ and $C_2$ are the concept sets of ontologies $O_1$ and $O_2$, respectively.

MatchRatio is used to approximate precision [9]; it calculates the ratio between the number of found correspondences and the number of matched concepts. Its formula is as follows:

$$MatchRatio = \frac{|C_{1,match}| + |C_{2,match}|}{2 \times |A|},$$

where
(i) $A$ is the correspondence set in the alignment; and
(ii) $C_{1,match}$ and $C_{2,match}$ are the matched concept sets of ontologies $O_1$ and $O_2$, respectively.

In most instances, both MatchCoverage and MatchRatio must be considered to measure the alignment’s quality. By referring to the most common combining function, f-measure [9], we define MatchFmeasure as follows:

$$MatchFmeasure = \frac{2 \times MatchCoverage \times MatchRatio}{MatchCoverage + MatchRatio}.$$
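The three rough evaluations can be sketched in a few lines. The formulas used here are assumptions based on the usual definitions of these measures: coverage is matched concepts over all concepts, ratio is matched concepts over twice the number of correspondences, and the combination is their harmonic mean. An alignment is represented as a set of hypothetical `(source, target)` pairs.

```python
def match_coverage(alignment, concepts_1, concepts_2):
    """Fraction of all concepts that appear in at least one correspondence."""
    matched_1 = {s for s, _ in alignment}
    matched_2 = {t for _, t in alignment}
    return (len(matched_1) + len(matched_2)) / (len(concepts_1) + len(concepts_2))


def match_ratio(alignment):
    """Matched concepts relative to correspondences; 1.0 for a 1:1 alignment."""
    if not alignment:
        return 0.0
    matched_1 = {s for s, _ in alignment}
    matched_2 = {t for _, t in alignment}
    return (len(matched_1) + len(matched_2)) / (2 * len(alignment))


def match_fmeasure(alignment, concepts_1, concepts_2):
    """Harmonic mean of MatchCoverage and MatchRatio."""
    cov = match_coverage(alignment, concepts_1, concepts_2)
    ratio = match_ratio(alignment)
    return 0.0 if cov + ratio == 0 else 2 * cov * ratio / (cov + ratio)


# Toy ontologies with four concepts each; "x" is matched twice.
c1, c2 = {"a", "b", "c", "d"}, {"x", "y", "z", "w"}
toy = {("a", "x"), ("b", "x"), ("c", "y")}
coverage = match_coverage(toy, c1, c2)   # 5 matched concepts out of 8
ratio = match_ratio(toy)                 # 5 matched concepts over 2 * 3
```

Neither measure needs a reference alignment, which is what lets them serve as objectives during the search.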

2.4.2. The Optimal Model for Ontology Entity Matching Problem

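Given two biomedical ontologies $O_1$ and $O_2$, we take maximizing MatchFmeasure as the goal, and the optimal model for the ontology entity matching problem can be defined as follows:

$$\begin{aligned} \max\ & MatchFmeasure(X) \\ \text{s.t.}\ & X = (x_1, x_2, \ldots, x_{|C_1|}, x_{|C_1|+1})^T, \\ & x_i \in \{0, 1, \ldots, |C_2|\},\quad i = 1, 2, \ldots, |C_1|, \\ & x_{|C_1|+1} \in [0, 1], \end{aligned}$$

where the decision variable $X$ represents an alignment between $O_1$ and $O_2$, $x_i$ represents the $i$th correspondence between the $i$th concept in $O_1$ and the $x_i$th concept in $O_2$, $|C_1|$ and $|C_2|$ are the cardinalities of the concept sets in $O_1$ and $O_2$, respectively, and $x_{|C_1|+1}$ is the threshold to filter the final alignment.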

One of the shortcomings of MatchFmeasure is that an improvement in it says nothing about whether MatchCoverage and MatchRatio are both improved. In other words, no matter how large a measured improvement in MatchFmeasure is, it can still depend almost entirely on the improvement of one of the individual metrics [30]. To overcome this bias improvement, we propose a compact Coevolutionary Algorithm, which has three PVs that characterize subswarms aiming at maximizing MatchCoverage, MatchRatio, and MatchFmeasure, respectively. Through the cooperation of the three PVs, we aim to ensure the simultaneous improvement of MatchCoverage and MatchRatio during the evolving process.
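The bias improvement can be illustrated with a small numeric example. The scores below are hypothetical values chosen for illustration, not results from the experiments: the combined measure rises even though one of the two components falls.

```python
def f_measure(coverage, ratio):
    """Harmonic mean of two scores in [0, 1]."""
    if coverage + ratio == 0:
        return 0.0
    return 2 * coverage * ratio / (coverage + ratio)


# Hypothetical scores of two successive elite solutions.
before = {"coverage": 0.50, "ratio": 0.80}
after = {"coverage": 0.80, "ratio": 0.70}

f_before = f_measure(**before)
f_after = f_measure(**after)

# The combined score improves although the ratio component got worse.
improved_both = (after["coverage"] > before["coverage"]
                 and after["ratio"] > before["ratio"])
```

A search guided by the combined score alone would accept the second solution, which is the situation the three cooperating subswarms are meant to counteract.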

2.4.3. Compact Evolutionary Algorithm

Model-based optimization using probabilistic modeling of the search space is one of the areas where research on the Compact Evolutionary Algorithm (CEA) has advanced considerably in recent years. In each generation, CEA updates the probability vector (PV), which is a probabilistic model describing the univariate statistics of the best solutions, and then uses it to generate new candidate solutions. By employing the PV, instead of a population of solutions, to simulate the behavior of the classic EA, a much smaller number of individuals needs to be stored in memory. Thus, CEA can significantly reduce the memory consumption [31]. To further improve CEA’s performance, we introduce a local search strategy into CEA’s evolving process. This combination of global search and local search helps reduce the possibility of premature convergence and increases the convergence speed.
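The two PV operations, generating a candidate and moving the PV toward an elite, can be sketched as below. The four-element PV and the uniform draws are hypothetical values, and the clamping of probabilities to [0, 1] is an assumption made so that the vector stays valid.

```python
import random


def sample(pv, rng=random):
    """Draw one bit string: bit i is 1 when a uniform draw falls below pv[i]."""
    return [1 if rng.random() < p else 0 for p in pv]


def sample_with(pv, draws):
    """Same rule as sample(), with the uniform draws supplied explicitly."""
    return [1 if r < p else 0 for r, p in zip(draws, pv)]


def update(pv, elite, rate=0.1):
    """Shift each element toward the elite's bit, clamped to [0, 1]."""
    return [min(1.0, p + rate) if bit else max(0.0, p - rate)
            for p, bit in zip(pv, elite)]


# Hypothetical PV and draws, chosen for illustration only.
pv = [0.3, 0.7, 0.4, 0.2]
solution = sample_with(pv, [0.6, 0.5, 0.8, 0.9])   # [0, 1, 0, 0]
new_pv = update(pv, solution, rate=0.1)
```

Only the PV and a couple of individuals are held at any time, which is where the memory saving over a stored population comes from.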

Next, the three main components of CEA, i.e., the chromosome-encoding mechanism, the probability vector, and the local search strategy, are presented in turn.

(1) Chromosome-Encoding Mechanism: in this work, the genes are encoded through the binary coding mechanism and can be divided into two parts. The first part stands for the correspondences in the alignment, and the second part stands for a threshold. Given the numbers $n_1$ and $n_2$ of concepts in the source and target ontologies, the first part of a chromosome (or PV) consists of $n_1$ gene segments, and the binary code length (BCL) of each gene segment is chosen so that the segment can represent any target ontology class’s index. The second part of a chromosome (or PV) has only one gene segment, whose BCL is chosen so that the segment can represent any threshold value under the given numerical accuracy. The total length of the chromosome (or PV) is therefore $n_1$ times the BCL of an index segment plus the BCL of the threshold segment.

Given a gene segment $g = (g_1, g_2, \ldots, g_n)$, where $g_i$ is the $i$th gene bit value of the segment, we decode $g$ as a binary number to obtain a decimal number. For the first part, the decimal numbers obtained represent the indexes of the target classes, where 0 means that the source concept is not mapped to any class of the target ontology. For the second part, the decimal number obtained is multiplied by the threshold’s numerical accuracy. Finally, if a decimal number $d$ obtained is larger than its upper bound $u$, it is replaced with a value within the valid range.

(2) Probability Vector: in general, CEA aims at generating a PV that represents a population of highly evaluated solutions, and its operations take place directly on the PV. In this work, the number of elements in the PV equals the number of gene bits of an individual, and each element’s value lies in [0, 1]. As an example of how a four-element PV generates a new solution, first generate four random numbers, such as 0.6, 0.5, 0.8, and 0.9. Then, compare these numbers with the corresponding elements of the PV to determine the gene values of the newly generated individual. For example, since the first random number, 0.6, is larger than the first element of the PV, the first gene bit of the new solution is 0; similarly, the remaining gene bits are 1, 0, and 0, respectively. In this way, the new solution we obtain is 0100. By repeating this procedure, we can obtain various individuals. In addition, if 0100 is the elite solution in the current generation, the PV should be updated according to it. Given the PV’s update rate, say 0.1, if a gene value of the elite is 0, the corresponding element of the PV is decreased by 0.1; otherwise, it is increased by 0.1. Applying this rule to every element yields the updated PV.

(3) Local Search Strategy: the local search process tries to improve the elite solution by searching in its neighborhood. In this work, we utilize a crossover operator to implement the local search process, which randomly copies a sequential fragment of the genes of a newly generated individual $a$ into the corresponding positions of the elite solution $elite$ to generate a new solution $a_{new}$. For the sake of clarity, given the length of the chromosome len and the crossover probability $p_c$, the pseudocode of the binary crossover operator is shown in Algorithm 1.

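Algorithm 1: Binary crossover operator.
a_new = elite;
i = round(random(0, len − 1));
a_new[i] = a[i];
while (random(0, 1) < pc)
  i = i + 1;
  if (i == len)
    i = 0;
  end if
  a_new[i] = a[i];
end while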

This procedure is similar to the two-point crossover, where the first cut point is randomly selected from $[0, len - 1]$ and the second point is determined such that $L$ consecutive genes (counted in a circular manner) are taken from $a$. Since $elite$ and $a$ are both generated through the PV, most of their gene bit values are the same. Therefore, even when a long fragment is copied, the operator changes only a few gene bit values of $elite$. In this sense, this variation operator can be considered fairly exploitative.
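The crossover-based local search described above can be sketched in Python as follows. The function and argument names are hypothetical, and the cap on the number of copied genes is an added safeguard (so the loop also terminates when the crossover probability is 1), not part of the description.

```python
import random


def fragment_crossover(elite, other, pc, rng=random):
    """Copy a circular run of consecutive genes from `other` into a copy of `elite`.

    The run starts at a random position and grows by one gene for as long
    as a uniform draw stays below `pc`, wrapping to index 0 past the end.
    """
    length = len(elite)
    child = list(elite)
    i = rng.randrange(length)          # first cut point in [0, length - 1]
    child[i] = other[i]
    copied = 1
    # ASSUMPTION: cap the run at `length` genes so that pc == 1 terminates.
    while copied < length and rng.random() < pc:
        i = (i + 1) % length
        child[i] = other[i]
        copied += 1
    return child


rng = random.Random(0)
child = fragment_crossover([0] * 8, [1] * 8, pc=0.6, rng=rng)
```

When `elite` and `other` come from the same, largely converged PV, most copied genes already agree, so the child typically differs from the elite in only a few positions.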

2.4.4. Pseudocode of Compact Coevolutionary Algorithm

In this work, we use three PVs to represent the subswarms for maximizing MatchRatio, MatchCoverage, and MatchFmeasure, respectively. In particular, each PV represents the population that consists of the solutions of its corresponding representative subproblem and this subproblem’s neighbor subproblems. These PVs help each other in the process of determining the three representative solutions, as described in the following. Here, we mark the three representative subproblems of maximizing MatchRatio, maximizing MatchCoverage, and maximizing MatchFmeasure with the symbols $SP_r$, $SP_c$, and $SP_f$, respectively, and the three PVs for solving $SP_r$, $SP_c$, and $SP_f$ with the symbols $PV_r$, $PV_c$, and $PV_f$, respectively. We present the pseudocode of the compact Coevolutionary Algorithm in Algorithm 2.

Input:
(i) O_1 and O_2: two biomedical ontologies;
(ii) len: the length of PV;
(iii) maxGen: maximum number of generations;
(iv) UR: PV’s update rate;
(v) pc: crossover probability;
(vi) pm: mutation probability;
(vii) MR: mutation rate.
Output: the solution with the best MatchFmeasure

Step 1. Initialization:
Step 1.1. Set the generation gen = 0;
Step 1.2. Set the neighbor subproblem of SP_r and SP_c as SP_f and the neighbor subproblems of SP_f as SP_r and SP_c;
Step 1.3. Initialize PV_r, PV_c, and PV_f by setting all the probabilities inside to 0.5;
Step 1.4. Use PV_r, PV_c, and PV_f to generate the elites, which are marked with the symbols elite_r, elite_c, and elite_f for SP_r, SP_c, and SP_f, respectively.

Step 2. Evolving process:
Step 2.1. Update PV_r, PV_c, and PV_f, respectively. Take updating PV_r for instance; the procedures of updating PV_c and PV_f are similar.
Step 2.1.1. Crossover
  Generate a new individual a through PV_r;
  [winner, loser] = compete(elite_r, a);
  if (winner == a)
    elite_r = a;
  for (i = 0; i < len; i++)
    if (winner[i] == 1)
      PV_r[i] = PV_r[i] + UR;
      if (PV_r[i] > 1)
        PV_r[i] = 1;
    else
      PV_r[i] = PV_r[i] − UR;
      if (PV_r[i] < 0)
        PV_r[i] = 0;
Step 2.1.2. Mutation
  for (i = 0; i < len; i++)
    if (random(0, 1) < pm)
      mutate PV_r[i] with mutation rate MR;
Step 2.1.3. Local search
  Generate an individual a through PV_r;
  a_new = elite_r;
  Generate i = round(random(0, len − 1));
  a_new[i] = a[i];
  while (random(0, 1) < pc)
    i = i + 1;
    if (i == len)
      i = 0;
    a_new[i] = a[i];
  end while
  [winner, loser] = compete(elite_r, a_new);
  if (winner == a_new)
    elite_r = a_new;
Step 2.2. Update PV_r, PV_c, and PV_f mutually.
  For SP_r (or SP_c), a new probability vector PV_new is generated by applying the uniform crossover operator [32] to PV_r (or PV_c) and its neighbor subproblem’s probability vector PV_f. Then, generate an individual a through PV_new and try to update the two PVs involved through the competition of a with elite_r (or elite_c) and elite_f.
  For SP_f, PV_new is generated by applying the uniform crossover operator between PV_r and PV_c, which are its neighbor subproblems’ PVs. Then, generate an individual a through PV_new and try to update PV_f through the competition of a with elite_f.

Step 3. Stopping criteria:
  if (maxGen is reached)
    stop and return the elite with the best MatchFmeasure;
  else
    gen = gen + 1;
    go to Step 2;
  end if

In the evolving process, we first update $PV_r$, $PV_c$, and $PV_f$ individually (Step 2.1), which is equivalent to updating the solutions of $SP_r$, $SP_c$, and $SP_f$. Then, we update $PV_r$, $PV_c$, and $PV_f$ mutually (Step 2.2), which is equivalent to updating the solutions of $SP_r$, $SP_c$, and $SP_f$ through their shared neighbor subproblems’ solutions, i.e., using the information of a PV to help its neighbor PVs.
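The single-PV part of the evolving process (generate, compete, update, mutate) can be sketched on a toy problem as follows. This is an illustration under stated assumptions rather than the authors' implementation: the objective is a bit count standing in for MatchFmeasure, the PBIL-style mutation formula and the 0.5 mixing probability in `uniform_crossover` are assumed, the local search step is omitted, and all function names are hypothetical.

```python
import random


def sample(pv, rng):
    """Draw one bit string from the probability vector."""
    return [1 if rng.random() < p else 0 for p in pv]


def update_pv(pv, winner, ur):
    """Move each probability toward the winner's bit, clamped to [0, 1]."""
    for i, bit in enumerate(winner):
        pv[i] = min(1.0, pv[i] + ur) if bit else max(0.0, pv[i] - ur)


def mutate_pv(pv, pm, mr, rng):
    """ASSUMED PBIL-style mutation: nudge some entries toward a random bit."""
    for i in range(len(pv)):
        if rng.random() < pm:
            shifted = pv[i] * (1.0 - mr) + rng.randint(0, 1) * mr
            pv[i] = min(1.0, max(0.0, shifted))


def uniform_crossover(pv_a, pv_b, rng):
    """Element-wise mix of two PVs, as used when subswarms share information."""
    return [a if rng.random() < 0.5 else b for a, b in zip(pv_a, pv_b)]


def compact_ea(fitness, length, max_gen=200, ur=0.1, pm=0.03, mr=0.05, seed=0):
    """Evolve one PV against `fitness` (higher is better)."""
    rng = random.Random(seed)
    pv = [0.5] * length
    elite = sample(pv, rng)
    for _ in range(max_gen):
        candidate = sample(pv, rng)
        if fitness(candidate) > fitness(elite):
            elite = candidate        # the new individual wins the competition
        update_pv(pv, elite, ur)     # the elite is the winner in either case
        mutate_pv(pv, pm, mr, rng)
    return elite, pv


# Toy objective (number of ones) standing in for MatchFmeasure.
best, final_pv = compact_ea(sum, length=16)
```

In the full algorithm, three such PVs would run side by side with different objectives and exchange information through `uniform_crossover` on neighboring PVs.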
2.5. Experimental Results and Analysis

In this work, we exploit the Anatomy (http://oaei.ontologymatching.org/2017/anatomy/index.html) and Large Biomed (http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/2017/) tracks, which are provided by the Ontology Alignment Evaluation Initiative (OAEI 2017) (http://oaei.ontologymatching.org/2017), to study the effectiveness of our approach. The Anatomy track includes two ontologies (1 task), i.e., the Adult Mouse Anatomy (AMA) ontology (2,744 classes) and a part of NCI describing the human anatomy (3,304 classes). The Large Biomed track (3 tasks) aims at finding alignments between FMA, SNOMED CT, and NCI, which contain 78,989, 122,464, and 66,724 classes, respectively. In particular, the Large Biomed track is split into three matching problems, FMA-NCI, FMA-SNOMED, and SNOMED-NCI, and each matching problem involves different fragments of the input ontologies.

The Compact Coevolutionary Algorithm uses the following parameters which represent a trade-off setting obtained in an empirical way to achieve the highest average alignment quality on all exploited testing datasets:
(i) Numerical accuracy = 0.01;
(ii) Update rate = 0.1;
(iii) Crossover probability = 0.6;
(iv) Mutation probability = 0.03;
(v) Mutation rate = 0.05;
(vi) Maximum generation = 3000.
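To give a feel for the size of one probability vector under these settings, the sketch below estimates the chromosome length for the Anatomy track. It rests on assumptions: each gene segment is given the minimum number of bits that can hold its largest value, and AMA is treated as the source ontology. The function name is hypothetical, and the exact length formula used in the experiments may differ.

```python
import math


def chromosome_length(n_source, n_target, accuracy):
    """Bits for n_source index segments plus one threshold segment.

    ASSUMPTION: every segment uses the minimum number of bits that can
    represent its largest value.
    """
    # Index segment: values 0..n_target, where 0 means "not matched".
    index_bits = math.ceil(math.log2(n_target + 1))
    # Threshold segment: values 0..(1 / accuracy) in steps of `accuracy`.
    threshold_bits = math.ceil(math.log2(round(1 / accuracy) + 1))
    return n_source * index_bits + threshold_bits


# Anatomy track: AMA (2,744 classes) against NCI (3,304 classes).
anatomy_bits = chromosome_length(n_source=2744, n_target=3304, accuracy=0.01)
```

Under these assumptions a PV for the Anatomy track has 32,935 elements, one probability per gene bit.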

3. Results and Analysis

To compare the quality of our proposal with that of the participants of OAEI 2017 (http://oaei.ontologymatching.org/2017/results/index.html) and the Population-Based Incremental Learning Algorithm (PBIL) [20], which is a state-of-the-art compact EA-based ontology matching technique, we evaluate the obtained alignments with the traditional recall, precision, and f-measure. The results of PBIL and our approach in Table 1 and Table 2 are the mean values over thirty independent runs. The symbols P, R, and F in the tables stand for precision, recall, and f-measure, respectively.
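For reference, precision, recall, and f-measure against a reference alignment can be computed as in the following sketch. Alignments are represented as sets of hypothetical `(source, target)` pairs; the function name and the toy data are illustrative only.

```python
def evaluate(found, reference):
    """Return (precision, recall, f-measure) of `found` against `reference`.

    Both arguments are sets of (source, target) pairs.
    """
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data: two of the three found pairs are in the reference of four pairs.
found = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
p, r, f = evaluate(found, reference)
```

Unlike MatchCoverage and MatchRatio, these measures require the reference alignment, so they are used only for the final evaluation.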

As can be seen from Table 1, our approach’s f-measure outperforms that of all the competitors, and our approach’s runtime ranks 4th. In Table 2, our approach’s f-measure is the highest in task 1, task 2, and task 3. In terms of runtime, our approach ranks 3rd in task 1 and task 2 and 4th in task 3. In both tracks, our approach outperforms AML, which is the top ontology matcher and was developed primarily for biomedical ontology matching, in all tasks in terms of f-measure, and the runtime of our approach is also very close to or less than that of AML. The experimental results show that the cooperation among three swarms with different objectives can effectively overcome the bias improvement and improve the quality of biomedical ontology alignments.

In particular, PBIL works with one PV, whereas our approach utilizes three PVs that cooperate with each other during the evolving process to improve the solution’s quality. As can be seen from the experimental results, although our approach takes only a little more runtime than PBIL, the quality of our results is much better than that of PBIL in terms of both recall and precision, which shows that our approach can effectively overcome the bias improvement of solutions in PBIL.

4. Conclusion

In this work, to overcome the drawbacks of traditional EA-based ontology matching techniques, we propose, for the first time, a compact Coevolutionary Algorithm to efficiently match biomedical ontologies. In our approach, three PVs are utilized to characterize three subswarms that take maximizing MatchCoverage, MatchRatio, and MatchFmeasure as their respective objectives; in each generation, the PVs are first updated following the CEA paradigm and then help each other to search for better solutions in the search space. In the experiment, OAEI 2017’s Anatomy track and Large Biomed track are utilized to test our approach’s performance, and the results show that our approach can efficiently determine better ontology alignments than state-of-the-art biomedical ontology matching techniques.

Data Availability

The data used to support the findings of this study have not been made available because of the protection of technical privacy and confidentiality.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61503082 and 61403121), Natural Science Foundation of Fujian Province (No. 2016J05145), Scientific Research Startup Foundation of Fujian University of Technology (No. GY-Z15007), Scientific Research Development Foundation of Fujian University of Technology (No. GY-Z17162), and Fujian Province Outstanding Young Scientific Researcher Training Project (No. GY-Z160149).