
1 Introduction

A domain ontology represents the knowledge of a given domain in a principled way, but in order to be employed in real applications, an ontology has to be enriched with new lexical resources of that particular domain. This process, known as onto-terminology [1], populates the existing ontology with new concepts without considering the ontological types and relations of these concepts. Therefore, the structure of the existing ontology remains unchanged.

Recently, the population of ontologies with lexical data has been the subject of research by various authors. In this regard, the authors in [2] proposed a new approach named Synopsis for automatically building a lexicon for each specific term, called a criterion. The lexicon built is then used to populate the ontology. An adaptation of the Synopsis approach is presented by the researchers in [3]. They used the same methodology but, rather than building a lexicon of terms, they built the lexicon of ontology concepts. To do this, they built an information retrieval system called CoLexIR which automatically identifies all parts of a document that are related to a given concept.

The learning process for enriching ontology concepts employed in these approaches uses only the contextual aspects of terms and fails to consider their semantic information. Therefore, this paper proposes a new approach, named SEMCON, which combines contextual and semantic information in its learning process for enriching ontology concepts. Besides contextual information, new statistical features such as a term's font size and font type are also considered.

The rest of the paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 describes the subjective experiment, while Sect. 4 describes the objective evaluation of the proposed method. Finally, Sect. 5 concludes the paper.

2 Proposed Model

The proposed model, shown in Fig. 1, initially partitions a document into subsets of text known as passages. After this partitioning, each passage is treated as an independent document. More concretely, each passage, represented by a presentation slide, is considered an independent document.

Fig. 1. Block diagram of the SEMCON model

The next step is a morpho-syntactic analysis using TreeTagger [4], in which the partitioned passages are tokenized and lemmatized. As a result, a list of potential terms is obtained, each of which can be a noun, verb, adverb, or adjective. Finally, only nouns are retained, as they are the most meaningful terms in a document [5].
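For illustration, the sketch below approximates this step in Python; it uses NLTK's tokenizer, part-of-speech tagger and WordNet lemmatizer as a stand-in for TreeTagger, so the exact tags and lemmas may differ from those produced by the tool used in the paper.

```python
# Illustrative sketch of the tokenize -> lemmatize -> noun-filter step.
# NLTK is used only as a stand-in for TreeTagger [4]
# (requires the NLTK 'punkt', POS-tagger and 'wordnet' data packages).
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def extract_nouns(passage: str) -> list[str]:
    """Return the lemmatized nouns found in one passage (slide)."""
    tokens = nltk.word_tokenize(passage)
    tagged = nltk.pos_tag(tokens)                       # (token, POS) pairs
    nouns = [tok for tok, pos in tagged if pos.startswith("NN")]
    return [lemmatizer.lemmatize(n.lower()) for n in nouns]

print(extract_nouns("Web applications store data on remote servers."))
# -> lemmatized nouns, e.g. ['web', 'application', ..., 'server']
```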

Computation of the observation matrix is the next step in the proposed model. The observation matrix is a matrix whose rows represent the terms extracted from a document and whose columns represent the passages of that document. Each entry of the observation matrix holds the observed values for a term, namely its frequency, font size and font type in the corresponding passage, as shown in Eq. 1. Introducing a term's font type and font size, which are important factors in the information finding process [6], is inspired by the representation of tags in a tag cloud [7].

$$\begin{aligned} O_{i,j} = \sum _{i \in t} \sum _{j \in p} (Freq_{i,j}+FT_{i,j}+FS_{i,j}) \end{aligned}$$
(1)

where t and p denote the sets of terms and passages, respectively. Freq \(_{i,j}\) denotes the frequency of occurrence of term t \(_{i}\) in passage p \(_{j}\), while FT \(_{i,j}\) and FS \(_{i,j}\) denote the font type and font size of term t \(_{i}\) in passage p \(_{j}\), respectively.
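A minimal sketch of how the observation matrix could be assembled per Eq. 1 is given below; the term names, passage names and per-passage statistics are hypothetical, and the font-type and font-size values are assumed to be computed as in Eqs. 2 and 3 below.

```python
# Minimal sketch of the observation matrix (Eq. 1): rows are terms, columns
# are passages, each entry sums frequency, font-type and font-size scores.
import numpy as np

terms = ["web", "application", "storage"]
passages = ["slide1", "slide2"]

# stats[(term, passage)] = (Freq, FT, FS) observed for that term in that passage
# (hypothetical values for illustration only)
stats = {
    ("web", "slide2"): (4, 0.75, 1.25),
    ("application", "slide1"): (2, 0.0, 0.75),
}

O = np.zeros((len(terms), len(passages)))
for i, t in enumerate(terms):
    for j, p in enumerate(passages):
        freq, ft, fs = stats.get((t, p), (0, 0.0, 0.0))
        O[i, j] = freq + ft + fs          # O_ij = Freq_ij + FT_ij + FS_ij

print(O)
```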

We adopt a linear weighting for the different font types and font sizes, with values varying in the range between 0 and 1. More formally, the font type score of a term t is computed using Eq. 2, while the font size score is computed using Eq. 3.

$$\begin{aligned} FT(t)=0.75*B+0.5*U+0.25*I \end{aligned}$$
(2)
$$\begin{aligned} FS(t)=1.0*T+0.75*L_{1}+0.50*L_{2}+0.25*L_{3} \end{aligned}$$
(3)

where B, U and I denote bold, underlined and italic font type, respectively. Similarly, T, L \(_{1}\), L \(_{2}\) and L \(_{3}\) represent the title, level 1, level 2 and level 3 font sizes, respectively.
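The two weighting functions can be sketched directly from Eqs. 2 and 3; the boolean flags describing how a term appears in a passage are assumed inputs.

```python
# Sketch of the font-type and font-size weights from Eqs. 2 and 3.
def font_type_score(bold: bool, underlined: bool, italic: bool) -> float:
    # FT(t) = 0.75*B + 0.5*U + 0.25*I
    return 0.75 * bold + 0.5 * underlined + 0.25 * italic

def font_size_score(title: bool, level1: bool, level2: bool, level3: bool) -> float:
    # FS(t) = 1.0*T + 0.75*L1 + 0.50*L2 + 0.25*L3
    return 1.0 * title + 0.75 * level1 + 0.50 * level2 + 0.25 * level3

# A bold term shown at level 1 of a slide:
print(font_type_score(True, False, False))          # 0.75
print(font_size_score(False, True, False, False))   # 0.75
```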

A concrete example of building the observation matrix using statistical features is shown in Fig. 2. It can be seen from the illustration that the term Web occurs 4 times in slide 2: twice with level 1 font size and bold font type, and twice with level 2 font size.

The next step is the computation of the term-to-term contextual score (S\(_{con}\)), which is calculated using the cosine similarity metric over the passage vectors, as given in Eq. 4.

$$\begin{aligned} S_{con}(t_{i},t{_j}) = \frac{{t_i} \cdot {t_j}}{\parallel {t_i} \parallel \parallel {t_j} \parallel } \end{aligned}$$
(4)
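As a sketch, the contextual score can be computed as the cosine similarity between two term rows of the observation matrix; the example vectors below are hypothetical.

```python
# Sketch of the contextual score (Eq. 4): cosine similarity between the
# row vectors of the observation matrix (one row per term, one column per passage).
import numpy as np

def contextual_score(o_i: np.ndarray, o_j: np.ndarray) -> float:
    denom = np.linalg.norm(o_i) * np.linalg.norm(o_j)
    return float(o_i @ o_j / denom) if denom else 0.0

O = np.array([[5.0, 0.0, 2.0],    # hypothetical rows for two terms over three passages
              [3.0, 1.0, 2.5]])
print(contextual_score(O[0], O[1]))
```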

Further, we extract and use a subset of the terms in order to extend the concept list of the ontology. An ontology may contain single-label concepts as well as compound-label concepts. For single-label concepts, we use only those terms from the term square matrix for which an exact term exists in the ontology, e.g., Application or Storage, as shown in Fig. 3. For compound-label concepts, we use those terms from the term square matrix which appear as part of a concept label in the ontology. For example, for the concept InputAndOutputDevices, we consider the terms Input, Output, or Device.
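A possible sketch of this matching step is shown below; the concept labels follow Fig. 3, while the CamelCase splitting and the simple plural handling are assumptions made for illustration.

```python
# Sketch of selecting candidate terms that match ontology concept labels.
import re

concepts = ["Application", "Storage", "InputAndOutputDevices"]
terms = ["application", "web", "output", "device", "network"]

def label_words(concept: str) -> set[str]:
    # "InputAndOutputDevices" -> {"input", "and", "output", "devices"}
    return {w.lower() for w in re.findall(r"[A-Z][a-z]*", concept)}

matches = {c: [t for t in terms
               if t in label_words(c) or t + "s" in label_words(c)]
           for c in concepts}
print(matches)
# {'Application': ['application'], 'Storage': [], 'InputAndOutputDevices': ['output', 'device']}
```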

The following step is the computation of the semantic score (S\(_{sem}\)). The semantic score is computed using the WordNet database. WordNet [8] is a lexical database for the English language which groups words into sets of synonyms called synsets and records the various semantic relations between these synonym sets. We use all the synsets of each term being considered. Going through all the terms in the observation matrix, we take all possible pairs and calculate the semantic score S\(_{sem}\)(t \(_{i}\), t \(_{j}\) ) for each pair t \(_{i}\) and t \(_{j}\), where t \(_{i}\), t \(_{j}\) \(\in \) O and O is the observation matrix. After calculating the semantic score for all pairs of terms, we generate a table for each term in which the most similar terms are taken to be the synonyms of that term. The Wu & Palmer algorithm [9], implemented in the freely available software package WordNet::Similarity [10], is used to compute the semantic score, as given in Eq. 5.

$$\begin{aligned} S_{sem}(t_{i},t_{j}) = \frac{2* depth(lcs)}{depth(t_{i}) + depth(t_{j})} \end{aligned}$$
(5)

where depth(lcs) is the depth of the least common subsumer of terms t \(_{i}\) and t \(_{j}\), and depth(t \(_{i}\) ) and depth(t \(_{j}\) ) are the depths of terms t \(_{i}\) and t \(_{j}\) in the WordNet taxonomy.
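The sketch below illustrates the Wu & Palmer score using NLTK's WordNet interface as a stand-in for the WordNet::Similarity package; taking the maximum over all noun synset pairs is an assumption made here, since each term may map to several synsets.

```python
# Sketch of the semantic score (Eq. 5) via NLTK's WordNet interface
# (requires nltk.download('wordnet')); stand-in for WordNet::Similarity [10].
from nltk.corpus import wordnet as wn

def semantic_score(t_i: str, t_j: str) -> float:
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(t_i, pos=wn.NOUN)
              for s2 in wn.synsets(t_j, pos=wn.NOUN)]
    return max(scores, default=0.0)

print(semantic_score("application", "program"))
```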

Fig. 2. Building the observation matrix using statistical features

Fig. 3. Ontology sample of the computer domain

The overall correlation between two terms t \(_{i}\) and t \(_{j}\) is found using the contextual and semantic scores. Mathematically, the overall score is given in Eq. 6.

$$\begin{aligned} S_{overall}(t_{i},t_{j}) = w*S_{con}(t_{i},t_{j})+(1-w)*S_{sem}(t_{i},t_{j}) \end{aligned}$$
(6)

where w is a weighting parameter, set to 0.5 based on an empirical analysis of the PowerPoint presentation data set. The overall score lies in the range (0, 1] and equals 1 if the two terms are identical.
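Eq. 6 amounts to a weighted average of the two scores, as sketched below with hypothetical input values.

```python
# Sketch of the overall score (Eq. 6): weighted combination of the
# contextual and semantic scores.
def overall_score(s_con: float, s_sem: float, w: float = 0.5) -> float:
    return w * s_con + (1 - w) * s_sem

print(overall_score(0.62, 0.80))   # 0.71 with the paper's w = 0.5
```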

Finally, to obtain the terms most closely related to a given term, a rank cut-off method is applied using a specified threshold. Terms whose scores are above the threshold are considered relevant for enriching the concepts.
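A sketch of the rank cut-off step is given below; the scores and the threshold value are hypothetical, since the paper does not specify the threshold.

```python
# Sketch of the rank cut-off: keep only terms whose overall score with the
# concept exceeds a threshold, sorted by score.
scores = {"program": 0.81, "apps": 0.74, "network": 0.38, "software": 0.69}
THRESHOLD = 0.5   # hypothetical value

enriching_terms = sorted((t for t, s in scores.items() if s > THRESHOLD),
                         key=lambda t: scores[t], reverse=True)
print(enriching_terms)   # ['program', 'apps', 'software']
```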

A simple example of the SEMCON output, given in Table 1, shows the top 10 terms obtained as the most related terms for the Application concept. Six of these terms, namely Application, Program, Apps, Function, Task and Software, are amongst the top 10 terms selected by subjects as the terms closest to the concept Application in the subjective experiment described in Sect. 3.

Table 1. Top 10 closely related terms of the 'Application' concept
Table 2. Borda count of subjects’ responses for the ‘Application’ concept.

3 Subjective Evaluation

To evaluate the performance of SEMCON, we used a PowerPoint presentation dataset from 5 different domains. The dataset consists of 39 slides covering 369 terms and 41 concepts.

A subjective survey was also carried out by publishing an online questionnaire answered by 15 subjects. They were asked to select the 5 most closely related terms from a list of terms for each given concept. From the subjective survey, we then found the terms most related to a given concept using the Borda count method [11]. Mathematically, the Borda count method is defined in Eq. 7.

$$\begin{aligned} BordaCount(t) = \sum _{i=1}^{m}[(m+1-i)*freq_i(t)] \end{aligned}$$
(7)

where freq \(_{i}\) (t) is the frequency with which term t was chosen at position i, and m is the total number of possible positions, in our case 5.
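The Borda count of Eq. 7 can be sketched as follows; the subjects' rankings shown are hypothetical.

```python
# Sketch of the Borda count (Eq. 7) over subjects' ranked choices.
# Each subject lists m = 5 terms in order; position 1 earns m points,
# position m earns 1 point.
from collections import defaultdict

responses = [
    ["program", "software", "apps", "task", "function"],   # subject 1
    ["software", "program", "task", "apps", "function"],   # subject 2
]
m = 5

borda = defaultdict(int)
for ranking in responses:
    for i, term in enumerate(ranking, start=1):
        borda[term] += m + 1 - i        # accumulates (m + 1 - i) * freq_i(t)

print(sorted(borda.items(), key=lambda kv: kv[1], reverse=True))
```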

The scores from the Borda count are then sorted to obtain the top n terms, giving us the refined list of the most relevant terms selected by subjects. For our experiment, we set n = 10, which gives us the top 10 terms. Table 2 shows the top 10 terms selected by subjects as the terms closest to the concept Application.

4 Objective Evaluation

In order to validate the SEMCON model, we also performed an objective evaluation in which the results obtained from SEMCON are compared with those obtained from the tf*idf [12] and \(\chi ^{2}\) [13] methods. To evaluate the effectiveness of the objective metrics, we employed the standard information retrieval measures Precision, Recall and F1 [12].

Table 3. The performance of objective methods

We evaluated the performance of the objective methods on how well they score the top subjective terms. To this end, the scores of the 10 top subjective terms are taken as the ground truth, and the scores of these terms obtained using the objective methods are then evaluated. In this light, the most related terms of the computer concepts are observed, and the comparison in terms of precision, recall and F1 is shown in Table 3. The comparison shows that SEMCON achieves an improvement in finding new terms to enrich the concepts of the computer domain ontology. More precisely, it achieves an average F1 improvement of 12.0 % over tf*idf and 21.7 % over \(\chi ^{2}\).
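As a sketch, the three measures can be computed against the subjective ground truth as follows; the two term sets used here are hypothetical.

```python
# Sketch of the objective evaluation: precision, recall and F1 of a method's
# top-ranked terms against the top subjective terms taken as ground truth.
def precision_recall_f1(retrieved: set[str], relevant: set[str]):
    tp = len(retrieved & relevant)                       # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

ground_truth = {"application", "program", "apps", "function", "task", "software"}
semcon_top = {"application", "program", "apps", "software", "network", "device"}
print(precision_recall_f1(semcon_top, ground_truth))
```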

5 Conclusion and Future Work

In this paper, we proposed a new approach for enriching domain ontologies with new concepts by combining the contextual and semantic information of terms extracted from accompanying documents. The proposed approach is a generic model which can be applied to any existing domain ontology to extend it with new concepts. The model defines the context using new statistical features such as a term's frequency, font size and font type. The semantics is then incorporated by computing a semantic similarity score using the lexical database WordNet. The experimental results show an improved performance of SEMCON compared with tf*idf and \(\chi ^{2}\). In future work, we plan to investigate how the combination of the contextual and semantic components contributes to the overall task of ontology concept enrichment.