Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora

Tholpadi, Goutham; Das, Mrinal Kanti; Bhattacharyya, Chiranjib; Shevade, Shirish

doi:10.1007/978-3-642-28997-2_33

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.

Author information

Authors and Affiliations

Computer Science and Automation, Indian Institute of Science, Bangalore, India
Goutham Tholpadi, Mrinal Kanti Das, Chiranjib Bhattacharyya & Shirish Shevade

Editor information

Editors and Affiliations

Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Ricardo Baeza-Yates & B. Barla Cambazoglu
Centrum Wiskunde & Informatica, Science Park 123, Amsterdam, The Netherlands
Arjen P. de Vries
Websays, Nàpols 294 7-4, 08025, Barcelona, Spain
Hugo Zaragoza
Yahoo! Research, Diagonal 177, 08018, Barcelona, Spain
Vanessa Murdock
Yahoo! Labs, Tower 3, Matam Park, 31905, Haifa, Israel
Ronny Lempel
ISTI-CNR, via G. Moruzzi, 1, 56124, Pisa, Italy
Fabrizio Silvestri

About this paper

Cite this paper

Tholpadi, G., Das, M.K., Bhattacharyya, C., Shevade, S. (2012). Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_33

DOI: https://doi.org/10.1007/978-3-642-28997-2_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28996-5
Online ISBN: 978-3-642-28997-2
eBook Packages: Computer Science, Computer Science (R0)
