Principal Components for Automatic Term Hierarchy Building

  • Conference paper
String Processing and Information Retrieval (SPIRE 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Abstract

We show that the singular value decomposition of a term similarity matrix induces a term hierarchy. This decomposition, used in Latent Semantic Analysis and Principal Component Analysis for text, aims at identifying “concepts” that can be used in place of the terms appearing in the documents. Unlike terms, concepts are by construction uncorrelated and hence less sensitive to the particular vocabulary used in documents. In this work, we explore the relation between terms and concepts and show that for each term there exists a latent subspace dimension for which the term coincides with a concept. By varying the number of dimensions, terms that are similar to the concept but more specific can be identified, leading to a term hierarchy.
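The decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the term similarity matrix is the sample covariance of a small synthetic document-by-term count matrix, and the names `concept_scores` and `rank_k_similarity` are hypothetical helpers introduced only for this example. It shows two properties the abstract relies on: documents projected onto the singular vectors ("concepts") are uncorrelated, and the number of retained dimensions `k` can be varied to obtain coarser or finer approximations of the similarity matrix.

```python
import numpy as np


def concept_scores(doc_term):
    """Decompose a term similarity matrix built from a document-by-term matrix.

    Returns the singular values, the term loadings (one column per concept),
    and the document scores on each concept.
    """
    centered = doc_term - doc_term.mean(axis=0)
    # Assumption: term similarity is measured by the sample covariance.
    similarity = centered.T @ centered / (doc_term.shape[0] - 1)
    # For a symmetric positive semi-definite matrix, the left singular
    # vectors are its eigenvectors; each column is one "concept".
    u, s, _ = np.linalg.svd(similarity)
    scores = centered @ u
    return s, u, scores


def rank_k_similarity(s, u, k):
    """Rebuild the similarity matrix from the k leading concepts only."""
    return (u[:, :k] * s[:k]) @ u[:, :k].T


# Synthetic data (an assumption): 50 documents, 6 terms, Poisson counts.
rng = np.random.default_rng(0)
doc_term = rng.poisson(lam=1.0, size=(50, 6)).astype(float)

s, u, scores = concept_scores(doc_term)

# Covariance between concepts: off-diagonal entries vanish up to rounding.
concept_cov = np.cov(scores, rowvar=False)
off_diagonal = concept_cov - np.diag(np.diag(concept_cov))
print("largest off-diagonal concept covariance:", np.abs(off_diagonal).max())

# Varying k changes how much of the term similarity structure is retained.
full = rank_k_similarity(s, u, len(s))
for k in range(1, len(s) + 1):
    error = np.linalg.norm(full - rank_k_similarity(s, u, k))
    print(f"k={k}: reconstruction error {error:.4f}")
```

How a hierarchy is read off from the approximations at different `k` is specific to the paper and is not reproduced here; the sketch only makes the underlying decomposition concrete.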

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dupret, G., Piwowarski, B. (2006). Principal Components for Automatic Term Hierarchy Building. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_4

  • DOI: https://doi.org/10.1007/11880561_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45774-9

  • Online ISBN: 978-3-540-45775-6

  • eBook Packages: Computer Science (R0)
