A tool to discover the main themes in a Spanish or English document
Introduction
Recent years have seen rapid growth in the knowledge available through electronic media. Traditional data handling methods are increasingly unable to meet the demands of this information deluge, and several new tools have been proposed to address the problem. Most of the existing work, however, deals with structured databases (i.e. databases with a well-known structure), whereas most of the information generated by human activity over the years is held in non-structured databases. These databases are collections of text written in natural language (histories, newspaper articles, e-mail messages, web pages, etc.). It is therefore worthwhile to develop tools that extract non-trivial information from a non-structured (i.e. textual) database within a reasonable time.
Computer-based text analysis is very useful. Some interesting problems it can address are the following: constructing abstracts of large documents; finding the semantic relations among the main themes of a document; and finding the conceptual tendency of the documents written by a given author.
Preliminaries
In recent decades, different techniques have been developed to analyze and extract knowledge from non-structured databases. Work in this area of text analysis has attracted interest since the 1960s. Initially, in the 1960s and 1970s, information retrieval systems (IRS) emerged (Cohen and Kjeldsen, 1987, Doyle, 1975, Salton, 1968, Salton, 1989). Later, the CYC project (Lenat and Guha, 1989, Lenat et al., 1990), developed over the course of a decade, was proposed.
The CLASITEX++ system
The main idea of the CLASITEX++ system is simple. Assume that, after reading a text or document, a person can answer the question “what is it about?”, i.e. can say which are the most important themes in the document. This person reads the document and grasps the main themes based on prior knowledge and on how frequently the most important concepts appear. The person discards the words that carry no meaning for him or her. In addition, unknown words are written down in order to obtain...
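As an illustration of this idea, the following Python sketch counts how often the words of a text map to known themes, discards words that carry no meaning, and sets aside unknown words. The stop-word list, the word-to-theme dictionary and the function name main_themes are hypothetical stand-ins; the actual CLASITEX++ knowledge base and algorithm are not described in this excerpt.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the reader's "knowledge"; not taken from CLASITEX++.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "will", "its", "is"}
CONCEPTS = {
    "nasa": "space exploration",
    "cassini": "space exploration",
    "probe": "space exploration",
    "orbit": "space exploration",
    "saturn": "planets",
}


def main_themes(text):
    """Return (themes ranked by frequency, unknown words in order of appearance)."""
    # Lowercase the text and keep alphabetic tokens, including Spanish letters.
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    themes = Counter()
    unknown = []
    for word in words:
        if word in STOP_WORDS:
            continue  # words with no meaning for the reader are discarded
        theme = CONCEPTS.get(word)
        if theme is None:
            # Unknown words are recorded once rather than counted.
            if word not in unknown:
                unknown.append(word)
        else:
            themes[theme] += 1
    return themes.most_common(), unknown


if __name__ == "__main__":
    ranked, unknown = main_themes("NASA will launch the Cassini probe to Saturn")
    print(ranked)
    print(unknown)
```

In this sketch a theme's weight is simply the number of words that point to it; a real system would presumably draw on a much larger concept base.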
Examples
We show the results of the CLASITEX++ system on a short, simple text taken from the magazine “Discover, The World of Science” (Flamsteed, 1998).
Saturn, 2004
Ten or twenty years ago, interplanetary space probes were built like battleships: big, rugged, bristling with instruments — and costing a boatload of money. Although NASA has been phasing out such missions, only in October did it finally launch its last: the Cassini probes to Saturn.
By 2004, if all goes well, Cassini will park itself in orbit
Conclusions
In this work, a tool for text analysis was presented. The system finds the most important themes dealt with in a Spanish or English document and how these concepts are related to each other. To do so, it computes the co-occurrence of the main concepts in the sentences of the document: if the co-occurrence between two concepts is high, the system considers them strongly related. This information can be useful for constructing summaries of very...
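The sentence-level co-occurrence described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual procedure: the function name cooccurrence, the punctuation-based sentence splitter and the sample concept set are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text, concepts):
    """Count, for each pair of concepts, the sentences in which both appear."""
    concepts = set(concepts)
    counts = Counter()
    # Naive sentence split on ., ! and ?; an assumption made for this sketch.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-záéíóúüñ]+", sentence))
        present = sorted(words & concepts)  # sorted so each pair has one key
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Cassini will orbit Saturn. Cassini left Earth. "
        "Saturn has rings and Cassini will study them."
    )
    for pair, n in cooccurrence(sample, {"cassini", "saturn", "orbit"}).most_common():
        print(pair, n)
```

Pairs with higher counts would be treated as more strongly related; how CLASITEX++ thresholds or normalizes these counts is not stated in this excerpt.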
References (22)
- et al. Information retrieval by constrained spreading activation in semantic networks. Information Processing and Management (1987).
- Finding the main themes in a Spanish document. Journal Expert Systems and Applications (1998).
- Aumann, Y., Feldman, R., Yehuda, Y. B., Landau, D., Liphstat, O., & Schler, Y. (1999). Circle graphs: new visualization...
- Beltrán-Martínez, B., Guzmán, A. A., Martínez-Trinidad, J. F., & Ruíz-Shulcloper, J. (1998). CLASITEX+: una...
- et al. I3R: a new approach to the design of document retrieval systems. Journal of the American Society for Information Science (1987).
- Information retrieval and processing (1975).
- et al. The HEARSAY-II speech understanding system: integrating knowledge to resolve uncertainty. ACM Computing Surveys (1980).
- Feldman, R., Aumann, Y., Fresko, M., Liphstat, O., Rosenfeld, B., & Schler, Y. (1999). Text mining via information...
- Feldman, R., & Dagan, I. (1997). Knowledge discovery in textual databases (KDT). Technical report, Bar-Ilan University,...
- Feldman, R., & Hirsh, H. (1996). Mining association in text in the presence of background knowledge. Proceedings of the...
- Saturn 2004. Discover, The World of Science.