Using Section Headings to Compute Cross-Lingual Similarity of Wikipedia Articles

Paramita, Monica Lestari; Clough, Paul; Gaizauskas, Robert

doi:10.1007/978-3-319-56608-5_59

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10193)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Measuring the similarity of interlanguage-linked Wikipedia articles often requires suitable language resources (e.g., dictionaries and MT systems), which can be problematic for languages with limited or poor translation resources. The size of Wikipedia can also make computing similarity computationally demanding. This paper presents a ‘lightweight’ approach to measuring cross-lingual similarity in Wikipedia that uses section headings rather than the entire article, and language resources derived from Wikipedia and Wiktionary to perform translation. Using an existing dataset, we evaluate the approach for 7 language pairs. Results show that performance using section headings is comparable to using all article content, that dictionaries derived from Wikipedia and Wiktionary are sufficient to compute cross-lingual similarity, and that combinations of features can further improve results.
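
The general idea of comparing articles through translated section headings can be illustrated with a minimal sketch. The abstract does not name the similarity features the authors use, so the Jaccard overlap below is an assumption chosen for simplicity, not the paper's method. The dictionary entries, the helper names (`normalise`, `translate_headings`, `heading_similarity`) and the example headings are all hypothetical.

```python
from typing import Dict, Iterable, Set


def normalise(heading: str) -> str:
    """Lower-case a heading and collapse runs of whitespace."""
    return " ".join(heading.lower().split())


def translate_headings(headings: Iterable[str], dictionary: Dict[str, str]) -> Set[str]:
    """Translate whole headings by dictionary lookup.

    Headings without a dictionary entry are kept unchanged, so they only
    match when both languages happen to use the same string.
    """
    translated = set()
    for heading in headings:
        key = normalise(heading)
        translated.add(normalise(dictionary.get(key, key)))
    return translated


def heading_similarity(source_headings: Iterable[str],
                       target_headings: Iterable[str],
                       dictionary: Dict[str, str]) -> float:
    """Jaccard overlap (an assumed measure) between translated source
    headings and target headings."""
    source = translate_headings(source_headings, dictionary)
    target = {normalise(h) for h in target_headings}
    if not source and not target:
        return 0.0
    return len(source & target) / len(source | target)


# Toy German-to-English dictionary; entries are hypothetical and stand in
# for a resource derived from Wikipedia or Wiktionary.
DE_EN = {
    "geschichte": "history",
    "geographie": "geography",
    "weblinks": "external links",
    "einzelnachweise": "references",
}

german = ["Geschichte", "Geographie", "Politik", "Weblinks"]
english = ["History", "Geography", "Economy", "External links"]

score = heading_similarity(german, english, DE_EN)
print(f"{score:.2f}")  # prints 0.60 (3 shared headings out of 5 distinct)
```

Because only a handful of headings per article are compared, a sketch like this touches far less text than a full-content comparison, which is the motivation the abstract gives for a lightweight approach.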

Notes

1. https://en.wikipedia.org/wiki/List_of_Wikipedias (20 Oct 2016).
2. https://meta.wikimedia.org/wiki/Wiktionary#List_of_Wiktionaries (20 Oct 2016).
3. This feature was previously identified as the best language-independent feature to identify cross-lingual similarity in Wikipedia [2].

References

1. Adafre, S.F., de Rijke, M.: Finding similar sentences across multiple languages in Wikipedia. In: Proceedings of EACL 2006, pp. 62–69, 4 April 2006
2. Barrón-Cedeño, A., Paramita, M.L., Clough, P., Rosso, P.: A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles. In: Rijke, M., Kenter, T., Vries, A.P., Zhai, C.X., Jong, F., Radinsky, K., Hofmann, K. (eds.) ECIR 2014. LNCS, vol. 8416, pp. 424–429. Springer, Heidelberg (2014). doi:10.1007/978-3-319-06028-6_36
3. Erdmann, M., Nakayama, K., Hara, T., Nishio, S.: An approach for extracting bilingual terminology from Wikipedia. In: Haritsa, J.R., Kotagiri, R., Pudi, V. (eds.) DASFAA 2008. LNCS, vol. 4947, pp. 380–392. Springer, Heidelberg (2008). doi:10.1007/978-3-540-78568-2_28
4. Mohammadi, M., GhasemAghaee, N.: Building bilingual parallel corpora based on Wikipedia. In: ICCEA 2010, vol. 2, pp. 264–268. IEEE (2010)
5. Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Peters, C., Deselaers, T., Ferro, N., Gonzalo, J., Jones, G.J.F., Kurimo, M., Mandl, T., Peñas, A., Petras, V. (eds.) CLEF 2008. LNCS, vol. 5706, pp. 219–226. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04447-2_28
6. Paramita, M.L., Clough, P., Aker, A., Gaizauskas, R.J.: Correlation between similarity measures for inter-language linked Wikipedia articles. In: LREC 2012, pp. 790–797 (2012)
7. Yasuda, K., Sumita, E.: Method for building sentence-aligned corpus from Wikipedia. In: Proceedings of WikiAI 2008, pp. 64–66, 13–14 July 2008
8. Zesch, T., Müller, C., Gurevych, I.: Using Wiktionary for computing semantic relatedness. In: AAAI Conference on Artificial Intelligence, pp. 861–866 (2008)

Author information

Authors and Affiliations

Information School, University of Sheffield, Sheffield, UK
Monica Lestari Paramita & Paul Clough
Computer Science Department, University of Sheffield, Sheffield, UK
Robert Gaizauskas

Corresponding author

Correspondence to Monica Lestari Paramita.

Editor information

Editors and Affiliations

University of Glasgow, Glasgow, United Kingdom
Joemon M Jose
TU Delft - EWI/ST/WIS, Delft, The Netherlands
Claudia Hauff
Middle East Technical University, Ankara, Turkey
Ismail Sengor Altıngovde
Open University, Milton Keynes, United Kingdom
Dawei Song
Signal Media, London, United Kingdom
Dyaa Albakour
Toronto, Canada
Stuart Watt
JohnTait.net Ltd. and BCS IRSG, Sunderland, United Kingdom
John Tait

About this paper

Cite this paper

Paramita, M.L., Clough, P., Gaizauskas, R. (2017). Using Section Headings to Compute Cross-Lingual Similarity of Wikipedia Articles. In: Jose, J., et al. Advances in Information Retrieval. ECIR 2017. Lecture Notes in Computer Science, vol 10193. Springer, Cham. https://doi.org/10.1007/978-3-319-56608-5_59

DOI: https://doi.org/10.1007/978-3-319-56608-5_59
Published: 08 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56607-8
Online ISBN: 978-3-319-56608-5
eBook Packages: Computer Science, Computer Science (R0)
