Automatic Identification and Disambiguation of Concepts and Named Entities in the Multilingual Wikipedia

Scozzafava, Federico; Raganato, Alessandro; Moro, Andrea; Navigli, Roberto

doi:10.1007/978-3-319-24309-2_27


Conference paper
First Online: 17 October 2015

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9336)

Abstract

In this paper we present an automatic multilingual annotation of the Wikipedia dumps in two languages, with both word senses (i.e. concepts) and named entities. We use Babelfy 1.0, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system. As its reference inventory, Babelfy draws upon BabelNet 3.0, a very large multilingual encyclopedic dictionary and semantic network which connects concepts and named entities in 271 languages from different inventories, such as WordNet, Open Multilingual WordNet, Wikipedia, OmegaWiki, Wiktionary and Wikidata. In addition, we perform both an automatic evaluation of the dataset and a language-specific statistical analysis. In detail, we investigate the word sense distributions by part-of-speech and language, together with the similarity of the annotated entities and concepts for a random sample of interlinked Wikipedia pages in different languages. The annotated corpora are available at http://lcl.uniroma1.it/babelfied-wikipedia/.
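The statistical analysis mentioned in the abstract (sense distributions by part of speech and language, and the similarity of annotations across interlinked pages) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the annotation tuples are invented, the function names `pos_distribution` and `jaccard` are hypothetical, the part of speech is read from the trailing letter of a BabelNet-style synset ID, and Jaccard overlap is used as a stand-in because the abstract does not say which similarity measure the authors applied.

```python
from collections import Counter

# Assumed mapping from the last character of a BabelNet-style synset ID
# (e.g. "bn:00000001n") to a part-of-speech label.
POS_BY_SUFFIX = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


def pos_distribution(annotations):
    """Count annotations per (language, part of speech).

    `annotations` is an iterable of (language, synset_id) pairs.
    """
    counts = Counter()
    for lang, synset_id in annotations:
        pos = POS_BY_SUFFIX.get(synset_id[-1], "unknown")
        counts[(lang, pos)] += 1
    return counts


def jaccard(synsets_a, synsets_b):
    """Jaccard overlap between the synset sets of two pages (0.0 if both are empty)."""
    a, b = set(synsets_a), set(synsets_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Invented sample data, not taken from the released corpora.
    sample = [
        ("EN", "bn:00000001n"),
        ("EN", "bn:00000002v"),
        ("IT", "bn:00000001n"),
    ]
    print(pos_distribution(sample))
    print(jaccard(["bn:00000001n", "bn:00000002v"], ["bn:00000001n"]))
```

Because BabelNet synsets are language-independent identifiers, comparing the sets of synset IDs attached to two interlinked pages is one simple way to quantify how similar their annotations are across languages.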

Author information

Authors and Affiliations

Dipartimento di Informatica, Sapienza Università di Roma, Viale Regina Elena 295, 00161, Roma, Italy
Federico Scozzafava, Alessandro Raganato, Andrea Moro & Roberto Navigli

Corresponding author

Correspondence to Federico Scozzafava.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria, Università di Ferrara, Ferrara, Italy
Marco Gavanelli
Dipartimento di Ingegneria, Università di Ferrara, Ferrara, Italy
Evelina Lamma
Dipartimento di Matematica e Informatica, Università di Ferrara, Ferrara, Italy
Fabrizio Riguzzi

About this paper

Cite this paper

Scozzafava, F., Raganato, A., Moro, A., Navigli, R. (2015). Automatic Identification and Disambiguation of Concepts and Named Entities in the Multilingual Wikipedia. In: Gavanelli, M., Lamma, E., Riguzzi, F. (eds) AI*IA 2015 Advances in Artificial Intelligence. AI*IA 2015. Lecture Notes in Computer Science, vol 9336. Springer, Cham. https://doi.org/10.1007/978-3-319-24309-2_27

DOI: https://doi.org/10.1007/978-3-319-24309-2_27
Published: 17 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24308-5
Online ISBN: 978-3-319-24309-2
eBook Packages: Computer Science, Computer Science (R0)
