SOM-Based Clustering of Multilingual Documents Using an Ontology

Minh Hai Pham, Delphine Bernhard, Gayo Diallo, Radja Messai, Michel Simonet
ISBN13: 9781599046181|ISBN10: 1599046180|EISBN13: 9781599046204
DOI: 10.4018/978-1-59904-618-1.ch004
Cite Chapter

MLA

Pham, Minh Hai, et al. "SOM-Based Clustering of Multilingual Documents Using an Ontology." Data Mining with Ontologies: Implementations, Findings, and Frameworks, edited by Hector Oscar Nigro, et al., IGI Global, 2008, pp. 65-82. https://doi.org/10.4018/978-1-59904-618-1.ch004

APA

Pham, M. H., Bernhard, D., Diallo, G., Messai, R., & Simonet, M. (2008). SOM-Based Clustering of Multilingual Documents Using an Ontology. In H. Nigro, S. Gonzalez Cisaro, & D. Xodo (Eds.), Data Mining with Ontologies: Implementations, Findings, and Frameworks (pp. 65-82). IGI Global. https://doi.org/10.4018/978-1-59904-618-1.ch004

Chicago

Pham, Minh Hai, et al. "SOM-Based Clustering of Multilingual Documents Using an Ontology." In Data Mining with Ontologies: Implementations, Findings, and Frameworks, edited by Hector Oscar Nigro, Sandra Elizabeth Gonzalez Cisaro, and Daniel Hugo Xodo, 65-82. Hershey, PA: IGI Global, 2008. https://doi.org/10.4018/978-1-59904-618-1.ch004

Abstract

Clustering similar documents is a difficult task in text data mining. The difficulties stem largely from the way documents are translated into numerical vectors. In this paper, we present a method that uses a Self-Organizing Map (SOM) to cluster medical documents. The originality of the method is that it relies not on the words shared by documents but on concepts taken from an ontology. Our goal is to cluster medical documents into thematically consistent groups (e.g., grouping all the documents related to cardiovascular diseases). Before the SOM algorithm is applied, documents go through several pre-processing steps. First, textual data are extracted from the documents, which can be in either PDF or HTML format. Documents are then indexed using two kinds of indexing units: stems and concepts. After indexing, each document is represented numerically by a vector whose dimensions correspond to indexing units and whose components store the weight of each indexing unit within that document. These vectors are given as inputs to a SOM, which arranges the corresponding documents on a two-dimensional map. We compare the results for two indexing schemes: stem-based indexing and conceptual indexing. We show that using an ontology for document clustering has several advantages. First, it makes it possible to cluster documents written in several languages, since concepts are language-independent. This is especially helpful in the medical domain, where research articles are written in different languages. Second, the use of concepts helps reduce the size of the vectors, which in turn reduces processing time.
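
The pipeline described in the abstract (conceptual indexing, weighted vectors, SOM placement) can be illustrated with a small sketch. It is not the authors' implementation: the toy lexicon `LEXICON`, the concept identifiers, the normalized-frequency weighting, and the helper names `concept_vector`, `best_matching_unit`, and `train_som` are all assumptions made for illustration, and accents are dropped from the French terms to keep tokenization trivial.

```python
import math
import random

# Hypothetical toy lexicon: surface terms (English and French) -> concept IDs.
# A real system would look terms up in a medical ontology instead.
LEXICON = {
    "heart": "C_HEART", "coeur": "C_HEART",
    "artery": "C_ARTERY", "artere": "C_ARTERY",
    "lung": "C_LUNG", "poumon": "C_LUNG",
    "asthma": "C_ASTHMA", "asthme": "C_ASTHMA",
}
CONCEPTS = sorted(set(LEXICON.values()))  # one vector dimension per concept


def concept_vector(text):
    """Map a document to normalized concept frequencies (assumed weighting)."""
    counts = dict.fromkeys(CONCEPTS, 0)
    for token in text.lower().split():
        concept = LEXICON.get(token)
        if concept is not None:
            counts[concept] += 1
    total = sum(counts.values()) or 1  # avoid division by zero for empty matches
    return [counts[c] / total for c in CONCEPTS]


def best_matching_unit(weights, vector):
    """Return the (row, col) of the map node whose weights are closest to vector."""
    return min(weights, key=lambda node: math.dist(weights[node], vector))


def train_som(vectors, rows=2, cols=2, epochs=200, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1 - epoch / epochs
        rate = 0.5 * frac                           # learning rate decays linearly
        radius = max(rows, cols) / 2 * frac + 0.01  # neighborhood shrinks over time
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                # Pull the node's weights toward the input, scaled by proximity to the BMU.
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


docs = {
    "en_cardio": "heart artery heart",
    "fr_cardio": "coeur artere coeur",
    "en_resp": "lung asthma",
    "fr_resp": "poumon asthme",
}
vectors = {name: concept_vector(text) for name, text in docs.items()}
som = train_som(list(vectors.values()))
placement = {name: best_matching_unit(som, v) for name, v in vectors.items()}
```

Because the English and French cardiology documents map to the same concepts, they receive identical vectors and therefore the same map position, which is the language-independence property the abstract points to. The vector length equals the number of concepts rather than the number of distinct words, which is where the reduction in vector size would come from.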
