A tool to discover the main themes in a Spanish or English document

doi:10.1016/S0957-4174(00)00043-9

Expert Systems with Applications

Volume 19, Issue 4, November 2000, Pages 319-327

https://doi.org/10.1016/S0957-4174(00)00043-9

Abstract

While most work on knowledge discovery in databases has concerned structured databases, little work has addressed the huge amount of information that is available only in unstructured textual form. This paper presents a system based on information retrieval and text mining methods. It shows how the system analyzes a document containing natural language sentences in order to recognize its main topics or themes. The knowledge base used by the system consists of concept trees. The architecture and the main algorithms of the system are discussed.

Introduction

Recent years have seen rapid growth in the knowledge available through electronic media. Traditional data handling methods are becoming less and less capable of meeting the demands of this information deluge, and several new tools have been proposed to address the problem. Most of the existing work, however, has been done on structured databases (i.e. databases with a well-known structure), whereas most of the information generated by human activity over the years is held in non-structured databases. These databases are collections of text written in natural language (histories, newspaper articles, e-mail messages, web pages, etc.). Thus, it is worthwhile to develop tools that extract non-trivial information from a non-structured (i.e. textual) database within a reasonable time.

Computer-based text analysis can address several interesting problems: constructing abstracts of large documents; finding the semantic relations that exist among the main themes of a document; finding the conceptual tendency in documents written by a given author, etc.

Section snippets

Preliminaries

In recent decades, different techniques have been developed to analyze and extract knowledge from non-structured databases. Work in this area of text analysis has attracted interest since 1960. Initially, in the 1960s and 1970s, information retrieval systems (IRS) (Cohen and Kjeldsen, 1987; Doyle, 1975; Salton, 1968; Salton, 1989) emerged. Later, the CYC project (Lenat and Guha, 1989; Lenat et al., 1990), which was developed over one decade, was proposed.

The CLASITEX⁺⁺ system

The main idea of the CLASITEX⁺⁺ system is simple. Assume that after reading a text or document, a person can answer the question “what is the matter?”, i.e. can say which are the most important themes in the document. This person reads the document and, based on prior knowledge and on the frequency of appearance of the most important concepts, grasps the main themes. The person discards the words that carry no meaning for them. Besides, the unknown words are written in order to obtain
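The reading process described above can be illustrated with a minimal Python sketch. It is not the CLASITEX⁺⁺ algorithm, whose details are not given in this snippet. The concept tree, the word-to-concept lexicon, the stop-word list and the name `main_themes` are all hypothetical, and crediting each occurrence to the ancestors of a concept is an assumption made here for illustration.

```python
import re
from collections import Counter

# Hypothetical toy concept tree: concept -> parent concept (None marks the root).
CONCEPT_PARENT = {
    "saturn": "planet",
    "planet": "astronomy",
    "probe": "spacecraft",
    "spacecraft": "astronomy",
    "astronomy": None,
}

# Hypothetical lexicon mapping surface words to concepts in the tree.
WORD_TO_CONCEPT = {
    "saturn": "saturn",
    "probe": "probe",
    "probes": "probe",
    "cassini": "probe",
    "orbit": "astronomy",
}

# Hypothetical list of words that carry no meaning and are discarded.
STOP_WORDS = {"the", "is", "will", "a", "to", "in"}


def main_themes(text, top_n=3):
    """Return (most frequent concepts, set of unknown words) for a text."""
    counts = Counter()
    unknown = set()
    for word in re.findall(r"[a-záéíóúüñ]+", text.lower()):
        if word in STOP_WORDS:
            continue  # meaningless word: discard it
        concept = WORD_TO_CONCEPT.get(word)
        if concept is None:
            unknown.add(word)  # unknown word: record it for later inspection
            continue
        # Assumption: an occurrence also counts for every ancestor concept.
        while concept is not None:
            counts[concept] += 1
            concept = CONCEPT_PARENT[concept]
    return counts.most_common(top_n), unknown


themes, unknown = main_themes("The Cassini probe will orbit Saturn. The probe is big.")
```

In this sketch the theme with the highest count is the general concept that accumulates the occurrences of its descendants, and `unknown` collects the words that the toy lexicon does not cover.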

Examples

We show the results of the CLASITEX⁺⁺ system on a short, simple text taken from “Discover the world of science” magazine (Flamsteed, 1998).

Saturn, 2004
Ten or twenty years ago, interplanetary space probes were built like battleships: big, rugged, bristling with instruments — and costing a boatload of money. Although NASA has been phasing out such missions, only in October did it finally launch its last: the Cassini probes to Saturn.
By 2004, if all goes well, Cassini will park itself in orbit

Conclusions

In this work, a tool for text analysis was presented. The system finds the most important themes dealt with in a Spanish or English document and finds how these concepts are related to each other. The co-occurrence of the main concepts in the sentences of the document is computed. The system considers that if the co-occurrence between two concepts is high, then these concepts are strongly related. This information can be useful in order to construct summaries of very
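Sentence-level co-occurrence of this kind can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: concepts are matched as literal lowercase words, sentences are split on `.`, `!` and `?`, and the function name `cooccurrence` is hypothetical. The measure actually used by the system is not specified in this snippet.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text, concepts):
    """Count, for each pair of concepts, the sentences in which both appear."""
    wanted = set(concepts)
    pair_counts = Counter()
    # Assumption: a sentence ends at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"\w+", sentence))
        present = sorted(wanted & words)  # sorted so each pair has one canonical order
        for pair in combinations(present, 2):
            pair_counts[pair] += 1
    return pair_counts


text = "Cassini flies to Saturn. Cassini will orbit Saturn. NASA built Cassini."
counts = cooccurrence(text, ["cassini", "saturn", "nasa"])
```

Pairs with a high count would then be read as strongly related concepts. A pair that never shares a sentence simply has a count of zero, because `Counter` returns 0 for missing keys.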

References (22)

Cohen, P. R., et al. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing and Management.
Guzmán, A. A. (1998). Finding the main themes in a Spanish document. Journal Expert Systems and Applications.
Aumann, Y., Feldman, R., Yehuda, Y. B., Landau, D., Liphstat, O., & Schler, Y. (1999). Circle graphs: new visualization...
Beltrán-Martínez, B., Guzmán, A. A., Martínez-Trinidad, J. F., & Ruíz-Shulcloper, J. (1998). CLASITEX+: una...
Croft, W. B., et al. (1987). I3R: a new approach to the design of document retrieval systems. Journal of the American Society for Information Science.
Doyle, L. B. (1975). Information retrieval and processing.
Erman, D. L., et al. (1980). The HEARSAY-II speech understanding system: integrating knowledge to resolve uncertainty. ACM Computing Surveys.
Feldman, R., Aumann, Y., Fresko, M., Liphstat, O., Rosenfeld, B., & Schler, Y. (1999). Text mining via information...
Feldman, R., & Dagan, I. (1997). Knowledge discovery in textual databases (KDT). Technical report, Bar-Ilan University,...
Feldman, R., & Hirsh, H. (1996). Mining association in text in the presence of background knowledge. Proceedings of the...
Flamsteed, S. (1998). Saturn 2004. Discover, The World of Science.
