A framework for understanding Latent Semantic Indexing (LSI) performance

https://doi.org/10.1016/j.ipm.2004.11.007

Abstract

In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term by dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.

Introduction

Latent Semantic Indexing (LSI) (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) is a well-known information retrieval algorithm. LSI has been applied to a wide variety of learning tasks, such as search and retrieval (Deerwester et al., 1990), classification (Zelikovitz & Hirsh, 2001) and filtering (Dumais, 1994, 1995). LSI is a vector space approach for modeling documents, and many have claimed that the technique brings out the ‘latent’ semantics in a collection of documents (Deerwester et al., 1990; Dumais, 1992).

LSI is based on a mathematical technique termed Singular Value Decomposition (SVD). The algebraic foundation for LSI was first described in Deerwester et al. (1990) and has been further discussed in Berry et al. (1995) and Berry et al. (1999). These papers describe the SVD process and interpret the resulting matrices in a geometric context. The SVD, truncated to k dimensions, gives the best rank-k approximation to the original matrix. Wiemer-Hastings (1999) shows that the power of LSI comes primarily from the SVD algorithm.
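The best-rank-k property can be checked numerically. The sketch below uses a small hypothetical term by document matrix (illustrative counts, not data from this paper) to truncate a full SVD with NumPy and confirm that the Frobenius-norm error of the rank-k reconstruction equals the energy of the discarded singular values, as the Eckart-Young theorem guarantees:

```python
import numpy as np

# A toy term-by-document matrix (rows: terms, columns: documents).
# Values are illustrative raw term counts, not from the paper.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Full SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the first k dimensions: the best rank-k approximation of A
# in the Frobenius norm (Eckart-Young).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The approximation error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, 'fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```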

Other researchers have proposed theoretical approaches to understanding LSI. Zha and Simon (1998) describe LSI in terms of a subspace model and propose a statistical test for choosing the optimal number of dimensions for a given collection. Story (1996) discusses LSI’s relationship to statistical regression and Bayesian methods. Ding (1999) constructs a statistical model for LSI using the cosine similarity measure.

Although other researchers have explored the SVD algorithm to provide an understanding of SVD-based information retrieval systems, to our knowledge, only Schütze (1992) has studied the values produced by SVD. We expand upon this work, showing here that LSI exploits higher-order term co-occurrence in a collection. We provide a mathematical proof of this fact herein, thereby providing an intuitive theoretical understanding of the mechanism whereby LSI emphasizes latent semantics.

This work is also the first to study the values produced in the SVD term by dimension matrix and we have discovered a correlation between the performance of LSI and the values in this matrix. Thus, in conjunction with the aforementioned proof of LSI’s theoretical foundation on higher-order co-occurrences, we have discovered the basis for the claim that is frequently made for LSI: LSI emphasizes underlying semantic distinctions (latent semantics) while reducing noise in the data. This is an important component in the theoretical basis for LSI.

Additional related work can be found in a recent article by Dominich. In Dominich (2003), the author shows that term co-occurrence is exploited in the connectionist interaction retrieval model, and this can account for or contribute to its effectiveness.

In Section 2 we present an overview of LSI along with a simple example of higher-order term co-occurrence in LSI. Section 3 explores the relationship between the values produced by LSI and term co-occurrence. In Sections 3 and 4 we correlate LSI performance with the values produced by the SVD, indexed by the order of co-occurrence. Section 5 presents a mathematical proof of LSI’s basis in higher-order co-occurrence. We draw conclusions and touch on future work in Section 6.

Section snippets

Overview of Latent Semantic Indexing

In this section we provide a brief overview of the LSI algorithm. We also discuss higher-order term co-occurrence in LSI, and present an example of LSI assignment of term co-occurrence values in a small collection.
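For concreteness, the following is a minimal hypothetical collection in the spirit of such an example (not the paper's own data): two terms never share a document, so their first-order co-occurrence is zero, yet both co-occur with a third term, forming a second-order path. After rank-1 truncation, the SVD assigns the two terms a non-zero similarity:

```python
import numpy as np

# Hypothetical 3-term, 3-document collection. Terms t0 ("car") and
# t2 ("automobile") never appear in the same document, but both
# co-occur with t1 ("engine") -- a second-order path t0-t1-t2.
A = np.array([
    [1.0, 0.0, 0.0],   # "car"        appears in d0 only
    [1.0, 1.0, 1.0],   # "engine"     appears in d0, d1, d2
    [0.0, 0.0, 1.0],   # "automobile" appears in d2 only
])

# First-order co-occurrence matrix: entry (0, 2) is zero.
first_order = A @ A.T
print(first_order[0, 2])   # 0.0: no shared document

# Rank-1 truncated SVD spreads weight along the dominant dimension,
# transitively linking "car" and "automobile".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_k @ A_k.T)         # entry (0, 2) is now non-zero
```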

Higher-order co-occurrence in LSI

In this section we study the relationship between the values produced by LSI and term co-occurrence. We show a relationship between the term co-occurrence patterns and resultant LSI similarity values. This data shows how LSI emphasizes important semantic distinctions, while de-emphasizing terms that co-occur frequently with many other terms (reduces ‘noise’). A full understanding of the relationship between higher-order term co-occurrence and the values produced by SVD is a necessary step

Analysis of the LSI values

In this section we expand upon the work described in Section 3. The results of our analysis show a strong correlation between the values produced by LSI and higher-order term co-occurrences.

Transitivity and the SVD

In this section we present a mathematical proof that the LSI algorithm encapsulates term co-occurrence information. Specifically, we show that a connectivity path exists for every non-zero element in the truncated matrix.

We begin by setting up some notation. Let A be a term by document matrix. The SVD process decomposes A into three matrices: a term by dimension matrix, T, a diagonal matrix of singular values, S, and a document by dimension matrix D. The original matrix is re-formed by multiplying
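In the standard SVD formulation, consistent with the matrices just named, the decomposition and its k-dimensional truncation can be written as:

```latex
A = T S D^{T}, \qquad A_k = T_k S_k D_k^{T}
```

where $T_k$ and $D_k$ retain only the first $k$ columns of $T$ and $D$, and $S_k$ retains the $k$ largest singular values.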

Conclusions and future work

Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining. We have explicitly shown the use of higher-order co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical studies and mathematical analysis prove that term co-occurrence plays a crucial role in LSI. The work shown here will find many practical applications. Below we describe our own

Acknowledgments

This work was supported in part by National Science Foundation Grant Number EIA-0087977. The authors gratefully acknowledge the assistance of Dr. Kyriakos Kontostathis and Dr. Wei-Min Huang in developing the proof of the transitivity in LSI as well as in reviewing drafts of this article. The authors also would like to express their gratitude to Dr. Brian D. Davison for his comments on a draft. The authors gratefully acknowledge the assistance of their colleagues in the Computer Science and

References (24)

  • R.E. Story

    An explanation of the effectiveness of latent semantic indexing by means of a Bayesian regression model

    Information Processing and Management

    (1996)
  • Berry, M. W., Do, T., O’Brien, G., Krishna, V., & Varadhan, S. (1993). SVDPACKC (version 1.0) User’s Guide. Technical...
  • M.W. Berry et al.

    Matrices, vector spaces, and information retrieval

    SIAM Review

    (1999)
  • M.W. Berry et al.

    Using linear algebra for intelligent information retrieval

    SIAM Review

    (1995)
  • S.C. Deerwester et al.

    Indexing by latent semantic analysis

    Journal of the American Society for Information Science

    (1990)
  • Ding, C. H. Q. (1999). A similarity-based probability model for latent semantic indexing. In Proceedings of the...
  • S. Dominich

    Connectionist interaction information retrieval

    Information Processing and Management

    (2003)
  • Dumais, S. T. (1992). LSI meets TREC: A status report. In D. Harman (Ed.), The First Text REtrieval Conference (TREC-1)...
  • Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In D. Harman (Ed.), The Second Text REtrieval...
  • Dumais, S. T. (1995). Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), The Third Text...
  • Edmonds, P. (1997). Choosing the word most typical in context using a lexical co-occurrence network. In Proceedings of...
  • Kontostathis, A., De, I., Holzman, L. E., & Pottenger, W. M. (2004). Use of term clusters for emerging trend detection....