Abstract
We show that the singular value decomposition of a term similarity matrix induces a term hierarchy. This decomposition, used in Latent Semantic Analysis and in Principal Component Analysis for text, aims to identify “concepts” that can be used in place of the terms appearing in the documents. Unlike terms, concepts are uncorrelated by construction and hence are less sensitive to the particular vocabulary used in documents. In this work, we explore the relation between terms and concepts and show that for each term there exists a latent subspace dimension for which the term coincides with a concept. By varying the number of dimensions, terms that are similar to the concept but more specific than it can be identified, leading to a term hierarchy.
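The claim that concepts are uncorrelated by construction can be checked on a small example. The sketch below is illustrative only and is not the procedure of the paper: the vocabulary, the counts, the choice of correlation as the similarity measure, and the helper `rank_k_similarity` are all assumptions made for this example. It decomposes a term similarity matrix, projects documents onto the resulting directions, and shows how the similarity matrix can be rebuilt from a chosen number of dimensions.

```python
import numpy as np

# Toy document-term count matrix (rows: documents, columns: terms).
# Vocabulary and counts are invented for illustration only.
terms = ["car", "engine", "wheel", "fruit", "apple"]
A = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [3, 1, 2, 0, 1],
    [0, 0, 0, 2, 1],
    [0, 0, 1, 1, 2],
    [0, 1, 0, 3, 2],
], dtype=float)

# Term similarity matrix. Correlation between term columns is an assumed
# choice here; other similarity measures are possible.
S = np.corrcoef(A, rowvar=False)

# S is symmetric positive semi-definite, so its singular value decomposition
# coincides with its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# "Concepts": documents expressed in the eigenvector basis instead of terms.
Z = (A - A.mean(axis=0)) / A.std(axis=0)   # standardized term columns
concepts = Z @ eigvecs

# The covariance of the concepts is diagonal: concepts are uncorrelated.
concept_cov = concepts.T @ concepts / A.shape[0]


def rank_k_similarity(k):
    """Rebuild the similarity matrix from the k leading dimensions (hypothetical helper)."""
    V_k = eigvecs[:, :k]
    return V_k @ np.diag(eigvals[:k]) @ V_k.T


print(np.round(concept_cov, 3))
print(np.round(rank_k_similarity(2), 3))
```

With all dimensions retained, `rank_k_similarity` returns the original similarity matrix; with fewer, it gives a smoothed approximation, which is the quantity one would inspect when varying the number of dimensions.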
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dupret, G., Piwowarski, B. (2006). Principal Components for Automatic Term Hierarchy Building. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_4
DOI: https://doi.org/10.1007/11880561_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science (R0)