
1 Introduction

A domain ontology represents the knowledge of a given domain in a principled way, but in order to be employed in real applications, an ontology has to be enriched with new lexical resources of that particular domain. This process, known as onto-terminology [1], populates the existing ontology with new concepts without considering the ontological types and relations of these concepts. Therefore, the structure of the existing ontology remains unchanged.

Recently, the population of ontologies with lexical data has been the subject of research by various authors. In this regard, the authors in [2] proposed a new approach named Synopsis for automatically building a lexicon for each specific term, called a criterion. The lexicon built is then used to populate the ontology. An adaptation of the Synopsis approach is presented by the researchers in [3]. They used the same methodology but, rather than building a lexicon of terms, they built the lexicon of ontology concepts. To do this, they built an information retrieval system called CoLexIR which automatically identifies all parts of a document that are related to a given concept.

The learning process for enriching ontology concepts employed in these approaches uses only the contextual aspects of terms and fails to consider their semantic information. Therefore, this paper proposes a new approach, named SEMCON, which combines contextual and semantic information in its learning process for enriching ontology concepts. Besides contextual information, new statistical features such as a term's font size and font type are also considered.

The rest of the paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 describes the subjective experiment, while Sect. 4 describes the objective evaluation of the proposed method. Finally, Sect. 5 concludes the paper.

2 Proposed Model

The proposed model, shown in Fig. 1, initially partitions a document into subsets of text known as passages. After this partitioning, each passage is treated as an independent document. More concretely, each passage, represented by a presentation slide, is considered an independent document.

Fig. 1. Block diagram of the SEMCON model

The next step is a morpho-syntactic analysis using TreeTagger [4], in which the partitioned passages are tokenized and lemmatized. As a result, a list of potential terms is obtained, each of which can be a noun, verb, adverb, or adjective. Finally, only nouns are retained, as they are the most meaningful terms in a document [5].
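For illustration, the sketch below approximates this step in Python; it uses NLTK's tokenizer, part-of-speech tagger and WordNet lemmatizer as a stand-in for TreeTagger, so the exact tags and lemmas may differ from those produced by the tool used in the paper.

```python
# Illustrative sketch of the tokenize -> lemmatize -> noun-filter step.
# NLTK is used only as a stand-in for TreeTagger [4]
# (requires the NLTK 'punkt', POS-tagger and 'wordnet' data packages).
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def extract_nouns(passage: str) -> list[str]:
    """Return the lemmatized nouns found in one passage (slide)."""
    tokens = nltk.word_tokenize(passage)
    tagged = nltk.pos_tag(tokens)                       # (token, POS) pairs
    nouns = [tok for tok, pos in tagged if pos.startswith("NN")]
    return [lemmatizer.lemmatize(n.lower()) for n in nouns]

print(extract_nouns("Web applications store data on remote servers."))
# -> lemmatized nouns, e.g. ['web', 'application', ..., 'server']
```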

Computation of the observation matrix is the next step in the proposed model. The observation matrix is a matrix whose rows represent the terms extracted from a document and whose columns represent the passages of that document. Each entry of the observation matrix holds the observed values for a term, namely its frequency, font size and font type in the corresponding passage, as shown in Eq. 1. Introducing a term's font type and font size, which are important factors in the information finding process [6], is inspired by the representation of tags in a tag cloud [7].

$$\begin{aligned} O_{i,j} = \sum _{i \in t} \sum _{j \in p} (Freq_{i,j}+FT_{i,j}+FS_{i,j}) \end{aligned}$$
(1)

where t and p denote the sets of terms and passages, respectively. Freq \(_{i,j}\) denotes the frequency of occurrence of term t \(_{i}\) in passage p \(_{j}\), while FT \(_{i,j}\) and FS \(_{i,j}\) denote the font type and font size of term t \(_{i}\) in passage p \(_{j}\), respectively.
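A minimal sketch of how the observation matrix could be assembled per Eq. 1 is given below; the term names, passage names and per-passage statistics are hypothetical, and the font-type and font-size values are assumed to be computed as in Eqs. 2 and 3 below.

```python
# Minimal sketch of the observation matrix (Eq. 1): rows are terms, columns
# are passages, each entry sums frequency, font-type and font-size scores.
import numpy as np

terms = ["web", "application", "storage"]
passages = ["slide1", "slide2"]

# stats[(term, passage)] = (Freq, FT, FS) observed for that term in that passage
# (hypothetical values for illustration only)
stats = {
    ("web", "slide2"): (4, 0.75, 1.25),
    ("application", "slide1"): (2, 0.0, 0.75),
}

O = np.zeros((len(terms), len(passages)))
for i, t in enumerate(terms):
    for j, p in enumerate(passages):
        freq, ft, fs = stats.get((t, p), (0, 0.0, 0.0))
        O[i, j] = freq + ft + fs          # O_ij = Freq_ij + FT_ij + FS_ij

print(O)
```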

We adopt a linear weighting for the different font types and font sizes, with values varying in the range between 0 and 1. More formally, the font type score of a term t is computed using Eq. 2, while the font size score is computed using Eq. 3.

$$\begin{aligned} FT(t)=0.75*B+0.5*U+0.25*I \end{aligned}$$
(2)
$$\begin{aligned} FS(t)=1.0*T+0.75*L_{1}+0.50*L_{2}+0.25*L_{3} \end{aligned}$$
(3)

where B, U and I denote bold, underlined and italic font type, respectively. Similarly, T, L \(_{1}\), L \(_{2}\) and L \(_{3}\) represent the title, level 1, level 2 and level 3 font sizes, respectively.
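The two weighting functions can be sketched directly from Eqs. 2 and 3; the boolean flags describing how a term appears in a passage are assumed inputs.

```python
# Sketch of the font-type and font-size weights from Eqs. 2 and 3.
def font_type_score(bold: bool, underlined: bool, italic: bool) -> float:
    # FT(t) = 0.75*B + 0.5*U + 0.25*I
    return 0.75 * bold + 0.5 * underlined + 0.25 * italic

def font_size_score(title: bool, level1: bool, level2: bool, level3: bool) -> float:
    # FS(t) = 1.0*T + 0.75*L1 + 0.50*L2 + 0.25*L3
    return 1.0 * title + 0.75 * level1 + 0.50 * level2 + 0.25 * level3

# A bold term shown at level 1 of a slide:
print(font_type_score(True, False, False))          # 0.75
print(font_size_score(False, True, False, False))   # 0.75
```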

A concrete example of building the observation matrix using statistical features is shown in Fig. 2. It can be seen from the illustration that the term Web occurs 4 times in slide 2: twice with level 1 font size and bold font type, and twice with level 2 font size.

The next step is the computation of the term-to-term contextual score (S\(_{con}\)), which is calculated using the cosine similarity metric over the passage vectors, as given in Eq. 4.

$$\begin{aligned} S_{con}(t_{i},t{_j}) = \frac{{t_i} \cdot {t_j}}{\parallel {t_i} \parallel \parallel {t_j} \parallel } \end{aligned}$$
(4)
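As a sketch, the contextual score can be computed as the cosine similarity between two term rows of the observation matrix; the example vectors below are hypothetical.

```python
# Sketch of the contextual score (Eq. 4): cosine similarity between the
# row vectors of the observation matrix (one row per term, one column per passage).
import numpy as np

def contextual_score(o_i: np.ndarray, o_j: np.ndarray) -> float:
    denom = np.linalg.norm(o_i) * np.linalg.norm(o_j)
    return float(o_i @ o_j / denom) if denom else 0.0

O = np.array([[5.0, 0.0, 2.0],    # hypothetical rows for two terms over three passages
              [3.0, 1.0, 2.5]])
print(contextual_score(O[0], O[1]))
```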

Further, we extract and use a subset of the terms in order to extend the concept list of the ontology. An ontology may contain single-label concepts as well as compound-label concepts. For single-label concepts, we use only those terms from the term square matrix for which an exact term exists in the ontology, e.g., Application or Storage, as shown in Fig. 3. For compound-label concepts, we use those terms from the term square matrix which appear as part of a concept label in the ontology. For example, for the concept InputAndOutputDevices, we consider the terms Input, Output, or Device.
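A possible sketch of this matching step is shown below; the concept labels follow Fig. 3, while the CamelCase splitting and the simple plural handling are assumptions made for illustration.

```python
# Sketch of selecting candidate terms that match ontology concept labels.
import re

concepts = ["Application", "Storage", "InputAndOutputDevices"]
terms = ["application", "web", "output", "device", "network"]

def label_words(concept: str) -> set[str]:
    # "InputAndOutputDevices" -> {"input", "and", "output", "devices"}
    return {w.lower() for w in re.findall(r"[A-Z][a-z]*", concept)}

matches = {c: [t for t in terms
               if t in label_words(c) or t + "s" in label_words(c)]
           for c in concepts}
print(matches)
# {'Application': ['application'], 'Storage': [], 'InputAndOutputDevices': ['output', 'device']}
```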

The following step is the computation of the semantic score (S\(_{sem}\)). The semantic score is computed using the WordNet database. WordNet [8] is a lexical database for the English language which groups words into sets of synonyms called synsets and records the various semantic relations between these synonym sets. We use all the synsets of each term being considered. Going through all the terms in the observation matrix, we take all possible pairs and calculate the semantic score S\(_{sem}\)(t \(_{i}\), t \(_{j}\) ) for each pair t \(_{i}\) and t \(_{j}\), where t \(_{i}\), t \(_{j}\) \(\in \) O and O is the observation matrix. After calculating the semantic score for all pairs of terms, we generate a table for each term in which the most similar terms are taken to be the synonyms of that term. The Wu & Palmer algorithm [9], implemented in the freely available software package WordNet::Similarity [10], is used to compute the semantic score, as given in Eq. 5.

$$\begin{aligned} S_{sem}(t_{i},t_{j}) = \frac{2* depth(lcs)}{depth(t_{i}) + depth(t_{j})} \end{aligned}$$
(5)

where depth(lcs) is the depth of the least common subsumer of terms t \(_{i}\) and t \(_{j}\), and depth(t \(_{i}\) ) and depth(t \(_{j}\) ) are the depths of terms t \(_{i}\) and t \(_{j}\) in the WordNet taxonomy.
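The sketch below illustrates the Wu & Palmer score using NLTK's WordNet interface as a stand-in for the WordNet::Similarity package; taking the maximum over all noun synset pairs is an assumption made here, since each term may map to several synsets.

```python
# Sketch of the semantic score (Eq. 5) via NLTK's WordNet interface
# (requires nltk.download('wordnet')); stand-in for WordNet::Similarity [10].
from nltk.corpus import wordnet as wn

def semantic_score(t_i: str, t_j: str) -> float:
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(t_i, pos=wn.NOUN)
              for s2 in wn.synsets(t_j, pos=wn.NOUN)]
    return max(scores, default=0.0)

print(semantic_score("application", "program"))
```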

Fig. 2. Building the observation matrix using statistical features

Fig. 3. Ontology sample of the computer domain

The overall correlation between two terms t \(_{i}\) and t \(_{j}\) is found using the contextual and semantic scores. Mathematically, the overall score is given in Eq. 6.

$$\begin{aligned} S_{overall}(t_{i},t_{j}) = w*S_{con}(t_{i},t_{j})+(1-w)*S_{sem}(t_{i},t_{j}) \end{aligned}$$
(6)

where w is a weighting parameter, set to 0.5 based on an empirical analysis of the PowerPoint presentation data set. The overall score lies in the range (0, 1] and equals 1 if the two terms are identical.
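Eq. 6 amounts to a weighted average of the two scores, as sketched below with hypothetical input values.

```python
# Sketch of the overall score (Eq. 6): weighted combination of the
# contextual and semantic scores.
def overall_score(s_con: float, s_sem: float, w: float = 0.5) -> float:
    return w * s_con + (1 - w) * s_sem

print(overall_score(0.62, 0.80))   # 0.71 with the paper's w = 0.5
```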

Finally, to obtain the terms most closely related to a given term, a rank cut-off method is applied using a specified threshold. Terms whose scores are above the threshold are considered relevant for enriching the concepts.
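A sketch of the rank cut-off step is given below; the scores and the threshold value are hypothetical, since the paper does not specify the threshold.

```python
# Sketch of the rank cut-off: keep only terms whose overall score with the
# concept exceeds a threshold, sorted by score.
scores = {"program": 0.81, "apps": 0.74, "network": 0.38, "software": 0.69}
THRESHOLD = 0.5   # hypothetical value

enriching_terms = sorted((t for t, s in scores.items() if s > THRESHOLD),
                         key=lambda t: scores[t], reverse=True)
print(enriching_terms)   # ['program', 'apps', 'software']
```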

A simple example of the SEMCON output, given in Table 1, shows the top 10 terms obtained as the most related terms for the Application concept. Six of these terms, namely Application, Program, Apps, Function, Task and Software, are amongst the top 10 terms selected by subjects as the terms closest to the concept Application in the subjective experiment described in Sect. 3.

Table 1. Top 10 closely related terms of the 'Application' concept
Table 2. Borda count of subjects’ responses for the ‘Application’ concept.

3 Subjective Evaluation

To evaluate the performance of SEMCON, we used a PowerPoint presentation dataset from 5 different domains. The dataset consists of 39 slides covering 369 terms and 41 concepts.

A subjective survey was also carried out by publishing an online questionnaire answered by 15 subjects. They were asked to select the 5 most closely related terms from a list of terms for each given concept. From the subjective survey, we then found the terms most related to a given concept using the Borda count method [11]. Mathematically, the Borda count method is defined in Eq. 7.

$$\begin{aligned} BordaCount(t) = \sum _{i=1}^{m}[(m+1-i)*freq_i(t)] \end{aligned}$$
(7)

where freq \(_{i}\) (t) is the frequency with which term t was chosen at position i, and m is the total number of possible positions, in our case 5.
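The Borda count of Eq. 7 can be sketched as follows; the subjects' rankings shown are hypothetical.

```python
# Sketch of the Borda count (Eq. 7) over subjects' ranked choices.
# Each subject lists m = 5 terms in order; position 1 earns m points,
# position m earns 1 point.
from collections import defaultdict

responses = [
    ["program", "software", "apps", "task", "function"],   # subject 1
    ["software", "program", "task", "apps", "function"],   # subject 2
]
m = 5

borda = defaultdict(int)
for ranking in responses:
    for i, term in enumerate(ranking, start=1):
        borda[term] += m + 1 - i        # accumulates (m + 1 - i) * freq_i(t)

print(sorted(borda.items(), key=lambda kv: kv[1], reverse=True))
```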

The scores from the Borda count are then sorted to obtain the top n terms, giving us the refined list of the most relevant terms selected by subjects. For our experiment, we set n = 10, which gives us the top 10 terms. Table 2 shows the top 10 terms selected by subjects as the terms closest to the concept Application.

4 Objective Evaluation

In order to validate the SEMCON model, we also performed an objective evaluation in which the results obtained from SEMCON are compared with those obtained from the tf*idf [12] and \(\chi ^{2}\) [13] methods. To evaluate the effectiveness of the objective metrics, we employed the standard information retrieval measures Precision, Recall and F1 [12].

Table 3. The performance of objective methods

We evaluated the performance of the objective methods on how well they score the top subjective terms. To this end, the scores of the 10 top subjective terms are taken as the ground truth, and the scores of these terms obtained using the objective methods are then evaluated. In this light, the most related terms of the computer concepts are observed, and the comparison in terms of precision, recall and F1 is shown in Table 3. The comparison shows that SEMCON achieves an improvement in finding new terms to enrich the concepts of the computer domain ontology. More precisely, it achieves an average F1 improvement of 12.0 % over tf*idf and 21.7 % over \(\chi ^{2}\).
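As a sketch, the three measures can be computed against the subjective ground truth as follows; the two term sets used here are hypothetical.

```python
# Sketch of the objective evaluation: precision, recall and F1 of a method's
# top-ranked terms against the top subjective terms taken as ground truth.
def precision_recall_f1(retrieved: set[str], relevant: set[str]):
    tp = len(retrieved & relevant)                       # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

ground_truth = {"application", "program", "apps", "function", "task", "software"}
semcon_top = {"application", "program", "apps", "software", "network", "device"}
print(precision_recall_f1(semcon_top, ground_truth))
```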

5 Conclusion and Future Work

In this paper, we proposed a new approach for enriching domain ontologies with new concepts by combining the contextual and semantic information of terms extracted from accompanying documents. The proposed approach is a generic model which can be applied to any existing domain ontology to extend it with new concepts. The model defines the context using new statistical features such as a term's frequency, font size and font type. The semantics is then incorporated by computing a semantic similarity score using the lexical database WordNet. The experimental results show an improved performance of SEMCON compared with tf*idf and \(\chi ^{2}\). In future work, we plan to investigate how the combination of the contextual and semantic components contributes to the overall task of ontology concept enrichment.