A framework for understanding Latent Semantic Indexing (LSI) performance

https://doi.org/10.1016/j.ipm.2004.11.007

Abstract

In this paper we present a theoretical model for understanding the performance of Latent Semantic Indexing (LSI) search and retrieval applications. Many models for understanding LSI have been proposed. Ours is the first to study the values produced by LSI in the term by dimension vectors. The framework presented here is based on term co-occurrence data. We show a strong correlation between second-order term co-occurrence and the values produced by the Singular Value Decomposition (SVD) algorithm that forms the foundation of LSI. We also present a mathematical proof that the SVD algorithm encapsulates term co-occurrence information.

Introduction

Latent Semantic Indexing (LSI) (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) is a well-known information retrieval algorithm. LSI has been applied to a wide variety of learning tasks, such as search and retrieval (Deerwester et al., 1990), classification (Zelikovitz & Hirsh, 2001) and filtering (Dumais, 1994, 1995). LSI is a vector space approach for modeling documents, and many have claimed that the technique brings out the ‘latent’ semantics in a collection of documents (Deerwester et al., 1990; Dumais, 1992).

LSI is based on a mathematical technique termed Singular Value Decomposition (SVD). The algebraic foundation for LSI was first described in Deerwester et al. (1990) and has been further discussed in Berry et al. (1995) and Berry et al. (1999). These papers describe the SVD process and interpret the resulting matrices in a geometric context. The SVD, truncated to k dimensions, gives the best rank-k approximation to the original matrix. Wiemer-Hastings (1999) shows that the power of LSI comes primarily from the SVD algorithm.
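The best-rank-k property can be checked numerically. The sketch below uses a small hypothetical term by document matrix (illustrative counts, not data from this paper) to truncate a full SVD with NumPy and confirm that the Frobenius-norm error of the rank-k reconstruction equals the energy of the discarded singular values, as the Eckart-Young theorem guarantees:

```python
import numpy as np

# A toy term-by-document matrix (rows: terms, columns: documents).
# Values are illustrative raw term counts, not from the paper.
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

# Full SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the first k dimensions: the best rank-k approximation of A
# in the Frobenius norm (Eckart-Young).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The approximation error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, 'fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```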

Other researchers have proposed theoretical approaches to understanding LSI. Zha and Simon (1998) describe LSI in terms of a subspace model and propose a statistical test for choosing the optimal number of dimensions for a given collection. Story (1996) discusses LSI’s relationship to statistical regression and Bayesian methods. Ding (1999) constructs a statistical model for LSI using the cosine similarity measure.

Although other researchers have explored the SVD algorithm to provide an understanding of SVD-based information retrieval systems, to our knowledge, only Schütze (1992) has studied the values produced by SVD. We expand upon this work, showing here that LSI exploits higher-order term co-occurrence in a collection. We provide a mathematical proof of this fact herein, thereby providing an intuitive theoretical understanding of the mechanism whereby LSI emphasizes latent semantics.

This work is also the first to study the values produced in the SVD term by dimension matrix and we have discovered a correlation between the performance of LSI and the values in this matrix. Thus, in conjunction with the aforementioned proof of LSI’s theoretical foundation on higher-order co-occurrences, we have discovered the basis for the claim that is frequently made for LSI: LSI emphasizes underlying semantic distinctions (latent semantics) while reducing noise in the data. This is an important component in the theoretical basis for LSI.

Additional related work can be found in a recent article by Dominich. In Dominich (2003), the author shows that term co-occurrence is exploited in the connectionist interaction retrieval model, and this can account for or contribute to its effectiveness.

In Section 2 we present an overview of LSI along with a simple example of higher-order term co-occurrence in LSI. Section 3 explores the relationship between the values produced by LSI and term co-occurrence. In Sections 3 and 4 we correlate LSI performance with the values produced by the SVD, indexed by the order of co-occurrence. Section 5 presents a mathematical proof of LSI’s basis in higher-order co-occurrence. We draw conclusions and touch on future work in Section 6.

Section snippets

Overview of Latent Semantic Indexing

In this section we provide a brief overview of the LSI algorithm. We also discuss higher-order term co-occurrence in LSI, and present an example of LSI assignment of term co-occurrence values in a small collection.
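For concreteness, the following is a minimal hypothetical collection in the spirit of such an example (not the paper's own data): two terms never share a document, so their first-order co-occurrence is zero, yet both co-occur with a third term, forming a second-order path. After rank-1 truncation, the SVD assigns the two terms a non-zero similarity:

```python
import numpy as np

# Hypothetical 3-term, 3-document collection. Terms t0 ("car") and
# t2 ("automobile") never appear in the same document, but both
# co-occur with t1 ("engine") -- a second-order path t0-t1-t2.
A = np.array([
    [1.0, 0.0, 0.0],   # "car"        appears in d0 only
    [1.0, 1.0, 1.0],   # "engine"     appears in d0, d1, d2
    [0.0, 0.0, 1.0],   # "automobile" appears in d2 only
])

# First-order co-occurrence matrix: entry (0, 2) is zero.
first_order = A @ A.T
print(first_order[0, 2])   # 0.0: no shared document

# Rank-1 truncated SVD spreads weight along the dominant dimension,
# transitively linking "car" and "automobile".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_k @ A_k.T)         # entry (0, 2) is now non-zero
```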

Higher-order co-occurrence in LSI

In this section we study the relationship between the values produced by LSI and term co-occurrence. We show a relationship between the term co-occurrence patterns and resultant LSI similarity values. This data shows how LSI emphasizes important semantic distinctions, while de-emphasizing terms that co-occur frequently with many other terms (reduces ‘noise’). A full understanding of the relationship between higher-order term co-occurrence and the values produced by SVD is a necessary step

Analysis of the LSI values

In this section we expand upon the work described in Section 3. The results of our analysis show a strong correlation between the values produced by LSI and higher-order term co-occurrences.

Transitivity and the SVD

In this section we present a mathematical proof that the LSI algorithm encapsulates term co-occurrence information. Specifically, we show that a connectivity path exists for every non-zero element in the truncated matrix.

We begin by setting up some notation. Let A be a term by document matrix. The SVD process decomposes A into three matrices: a term by dimension matrix, T, a diagonal matrix of singular values, S, and a document by dimension matrix D. The original matrix is re-formed by multiplying
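In the standard SVD formulation, consistent with the matrices just named, the decomposition and its k-dimensional truncation can be written as:

```latex
A = T S D^{T}, \qquad A_k = T_k S_k D_k^{T}
```

where $T_k$ and $D_k$ retain only the first $k$ columns of $T$ and $D$, and $S_k$ retains the $k$ largest singular values.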

Conclusions and future work

Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining. We have explicitly shown the use of higher-order co-occurrence in the Singular Value Decomposition (SVD) algorithm and, by inference, in the systems that rely on SVD, such as LSI. Our empirical studies and mathematical analysis prove that term co-occurrence plays a crucial role in LSI. The work shown here will find many practical applications. Below we describe our own

Acknowledgments

This work was supported in part by National Science Foundation Grant Number EIA-0087977. The authors gratefully acknowledge the assistance of Dr. Kyriakos Kontostathis and Dr. Wei-Min Huang in developing the proof of the transitivity in LSI as well as in reviewing drafts of this article. The authors also would like to express their gratitude to Dr. Brian D. Davison for his comments on a draft. The authors gratefully acknowledge the assistance of their colleagues in the Computer Science and

References (24)

  • R.E. Story

    An explanation of the effectiveness of latent semantic indexing by means of a Bayesian regression model

    Information Processing and Management

    (1996)
  • Berry, M. W., Do, T., O’Brien, G., Krishna, V., & Varadhan, S. (1993). SVDPACKC (version 1.0) User’s Guide. Technical...
  • M.W. Berry et al.

    Matrices, vector spaces, and information retrieval

    SIAM Review

    (1999)
  • M.W. Berry et al.

    Using linear algebra for intelligent information retrieval

    SIAM Review

    (1995)
  • S.C. Deerwester et al.

    Indexing by latent semantic analysis

    Journal of the American Society for Information Science

    (1990)
  • Ding, C. H. Q. (1999). A similarity-based probability model for latent semantic indexing. In Proceedings of the...
  • S. Dominich

    Connectionist interaction information retrieval

    Information Processing and Management

    (2003)
  • Dumais, S. T. (1992). LSI meets TREC: A status report. In D. Harman (Ed.), The First Text REtrieval Conference (TREC-1)...
  • Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In D. Harman (Ed.), The Second Text REtrieval...
  • Dumais, S. T. (1995). Using LSI for information filtering: TREC-3 experiments. In D. Harman (Ed.), The Third Text...
  • Edmonds, P. (1997). Choosing the word most typical in context using a lexical co-occurrence network. In Proceedings of...
  • Kontostathis, A., De, I., Holzman, L. E., & Pottenger, W. M. (2004). Use of term clusters for emerging trend detection....