A framework for understanding Latent Semantic Indexing (LSI) performance
Introduction
Latent Semantic Indexing (LSI) (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) is a well-known information retrieval algorithm. LSI has been applied to a wide variety of learning tasks, such as search and retrieval (Deerwester et al., 1990), classification (Zelikovitz & Hirsh, 2001), and filtering (Dumais, 1994, 1995). LSI is a vector space approach for modeling documents, and many have claimed that the technique brings out the ‘latent’ semantics in a collection of documents (Deerwester et al., 1990; Dumais, 1992).
LSI is based on a mathematical technique termed Singular Value Decomposition (SVD). The algebraic foundation for LSI was first described in Deerwester et al. (1990) and has been further discussed in Berry et al. (1995, 1999). These papers describe the SVD process and interpret the resulting matrices in a geometric context. The SVD, truncated to k dimensions, gives the best rank-k approximation to the original matrix. Wiemer-Hastings (1999) shows that the power of LSI comes primarily from the SVD algorithm.
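The best rank-k approximation property (the Eckart–Young theorem) can be checked numerically. The following sketch is ours, not from the article; the toy term-by-document matrix is arbitrary and chosen only for illustration:

```python
import numpy as np

# Toy term-by-document matrix (rows = terms, columns = documents);
# the values are arbitrary, chosen only to illustrate the truncation.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to k dimensions: A_k is the best rank-k approximation to A
# in the Frobenius norm (Eckart-Young).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The approximation error equals the energy in the discarded singular values.
err = np.linalg.norm(A - A_k, ord='fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:] ** 2))))  # True
```

The error identity holds for any matrix, which is why the truncation level k is the main tuning parameter in SVD-based retrieval systems.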
Other researchers have proposed theoretical approaches to understanding LSI. Zha and Simon (1998) describe LSI in terms of a subspace model and propose a statistical test for choosing the optimal number of dimensions for a given collection. Story (1996) discusses LSI’s relationship to statistical regression and Bayesian methods. Ding (1999) constructs a statistical model for LSI using the cosine similarity measure.
Although other researchers have explored the SVD algorithm to provide an understanding of SVD-based information retrieval systems, to our knowledge, only Schütze has studied the values produced by SVD (Schütze, 1992). We expand upon this work, showing here that LSI exploits higher-order term co-occurrence in a collection. We prove this fact herein, thereby providing an intuitive theoretical understanding of the mechanism by which LSI emphasizes latent semantics.
This work is also the first to study the values produced in the SVD term-by-dimension matrix, and we have discovered a correlation between the performance of LSI and the values in this matrix. Thus, in conjunction with the aforementioned proof of LSI’s theoretical foundation in higher-order co-occurrence, we have discovered the basis for the claim frequently made for LSI: that LSI emphasizes underlying semantic distinctions (latent semantics) while reducing noise in the data. This is an important component of the theoretical basis for LSI.
Additional related work can be found in a recent article by Dominich. In Dominich (2003), the author shows that term co-occurrence is exploited in the connectionist interaction retrieval model, and this can account for or contribute to its effectiveness.
In Section 2 we present an overview of LSI along with a simple example of higher-order term co-occurrence in LSI. Section 3 explores the relationship between the values produced by LSI and term co-occurrence. In Sections 3 and 4 we correlate LSI performance with the values produced by the SVD, indexed by the order of co-occurrence. Section 5 presents a mathematical proof of LSI’s basis in higher-order co-occurrence. We draw conclusions and touch on future work in Section 6.
Overview of Latent Semantic Indexing
In this section we provide a brief overview of the LSI algorithm. We also discuss higher-order term co-occurrence in LSI, and present an example of LSI assignment of term co-occurrence values in a small collection.
Higher-order co-occurrence in LSI
In this section we study the relationship between the values produced by LSI and term co-occurrence. We show a relationship between term co-occurrence patterns and the resulting LSI similarity values. These data show how LSI emphasizes important semantic distinctions while de-emphasizing terms that co-occur frequently with many other terms (i.e., reducing ‘noise’). A full understanding of the relationship between higher-order term co-occurrence and the values produced by the SVD is a necessary step…
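The core effect can be demonstrated with a minimal numerical sketch of our own (the three-term collection below is hypothetical, not drawn from the article): two terms that never share a document acquire a positive similarity after truncation, purely through a second-order co-occurrence with a shared third term.

```python
import numpy as np

# Hypothetical collection: terms apple/fruit/banana, documents
# d1 = {apple, fruit} and d2 = {fruit, banana}. 'apple' and 'banana'
# never co-occur directly, but both co-occur with 'fruit'.
A = np.array([
    [1, 0],   # apple
    [1, 1],   # fruit
    [0, 1],   # banana
], dtype=float)

# First-order (direct) co-occurrence between apple and banana: zero.
first_order = (A @ A.T)[0, 2]

# Truncate the SVD to one dimension and recompute the term-term products.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_1 = s[0] * np.outer(U[:, 0], Vt[0, :])
second_order = (A_1 @ A_1.T)[0, 2]

print(first_order)   # 0.0 -- no shared document
print(second_order)  # 0.5 -- positive, via the shared term 'fruit'
```

The truncated space thus assigns a non-zero similarity that the raw term-document matrix cannot, which is exactly the higher-order effect studied in this section.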
Analysis of the LSI values
In this section we expand upon the work described in Section 3. The results of our analysis show a strong correlation between the values produced by LSI and higher-order term co-occurrences.
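The “order” of a co-occurrence can be made concrete as the length of the shortest chain of shared-document links between two terms. The sketch below is our illustration (the four-term chain collection and the helper function are hypothetical), computing that order from a binary term co-occurrence matrix:

```python
import numpy as np

# Hypothetical collection forming a co-occurrence chain:
# d0 = {t0, t1}, d1 = {t1, t2}, d2 = {t2, t3}.
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

# Binary co-occurrence graph: C[i, j] = 1 if terms i and j share a document.
C = (A @ A.T > 0).astype(int)
np.fill_diagonal(C, 0)

def cooccurrence_order(C, i, j, max_order=10):
    """Smallest n such that a chain of n co-occurrence links joins i and j."""
    reach = np.eye(len(C), dtype=int)
    for n in range(1, max_order + 1):
        reach = (reach @ C > 0).astype(int)  # walks of length exactly n
        if reach[i, j]:
            return n
    return None

print(cooccurrence_order(C, 0, 1))  # 1: direct (first-order) co-occurrence
print(cooccurrence_order(C, 0, 3))  # 3: third-order path t0-t1-t2-t3
```

Indexing term pairs by this order is what allows the LSI similarity values to be correlated with the order of co-occurrence, as done in this section.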
Transitivity and the SVD
In this section we present a mathematical proof that the LSI algorithm encapsulates term co-occurrence information. Specifically, we show that a connectivity path exists for every non-zero element in the truncated matrix.
We begin by setting up some notation. Let A be a term-by-document matrix. The SVD process decomposes A into three matrices: a term-by-dimension matrix, T; a diagonal matrix of singular values, S; and a document-by-dimension matrix, D. The original matrix is re-formed by multiplying these matrices: A = TSDᵀ.
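In the article’s notation, the decomposition and re-formation can be sketched in a few lines (the example matrix is ours, chosen arbitrarily for illustration):

```python
import numpy as np

# Arbitrary term-by-document matrix A for illustration.
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)

# SVD in the paper's notation: T (term-by-dimension), S (diagonal matrix
# of singular values), D (document-by-dimension).
T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
D = Dt.T

# The original matrix is re-formed as A = T S D^T.
print(np.allclose(A, T @ S @ D.T))  # True

# Truncation to k dimensions keeps the first k columns/values.
k = 2
A_k = T[:, :k] @ S[:k, :k] @ D[:, :k].T
```

The proof in this section concerns the non-zero elements of the truncated product A_k, so this factored form is the object the transitivity argument operates on.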
Conclusions and future work
Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining. We have explicitly shown the use of higher orders of co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical studies and mathematical analysis prove that term co-occurrence plays a crucial role in LSI. The work shown here will find many practical applications. Below we describe our own…
Acknowledgments
This work was supported in part by National Science Foundation Grant Number EIA-0087977. The authors gratefully acknowledge the assistance of Dr. Kyriakos Kontostathis and Dr. Wei-Min Huang in developing the proof of the transitivity in LSI as well as in reviewing drafts of this article. The authors also would like to express their gratitude to Dr. Brian D. Davison for his comments on a draft. The authors gratefully acknowledge the assistance of their colleagues in the Computer Science and…
References (24)
- Story, R. E. (1996). An explanation of the effectiveness of latent semantic indexing by means of a Bayesian regression model. Information Processing and Management.
- Berry, M. W., Do, T., O’Brien, G., Krishna, V., & Varadhan, S. (1993). SVDPACKC (version 1.0) user’s guide. Technical…
- Berry, M. W., Drmač, Z., & Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review.
- Berry, M. W., Dumais, S. T., & O’Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review.
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science.
- Ding, C. H. Q. (1999). A similarity-based probability model for latent semantic indexing. In Proceedings of the…
- Dominich, S. (2003). Connectionist interaction information retrieval. Information Processing and Management.
- Dumais, S. T. (1992). LSI meets TREC: A status report. In D. Harman (Ed.), The First Text REtrieval Conference (TREC-1)…
- Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In D. Harman (Ed.), The Second Text REtrieval…
- Dumais, S. T. (1995). Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), The Third Text…