Using Wikipedia for Cross-Language Named Entity Recognition

Fernandes, Eraldo R.; Brefeld, Ulf; Blanco, Roi; Atserias, Jordi

doi:10.1007/978-3-319-29009-6_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9546)

Abstract

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on (manually) annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate (partially) annotated corpora for NERC by exploiting the link structure of Wikipedia. First, Wikipedia entries in the source language are labeled with the NERC tag set. Second, Wikipedia language links are exploited to propagate the annotations to the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
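
The three steps of the corpus-generation procedure can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the dictionary representation of entry labels and language links, the (start, end, title) link-span format, and the example data are all assumptions made for illustration. Tokens outside a labeled link receive `None` (unknown) rather than an outside tag, which mirrors the partial annotation described in the abstract.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical span format: (start, end_exclusive, linked target-language title).
LinkSpan = Tuple[int, int, str]


def propagate_labels(source_labels: Dict[str, str],
                     language_links: Dict[str, str]) -> Dict[str, str]:
    """Step 2: carry tags from source entries to target entries via language links.

    Source entries without a language link are silently skipped.
    """
    return {
        language_links[title]: tag
        for title, tag in source_labels.items()
        if title in language_links
    }


def annotate_mentions(tokens: List[str],
                      links: List[LinkSpan],
                      target_labels: Dict[str, str]) -> List[Optional[str]]:
    """Step 3: tag tokens covered by a link to a labeled target entry.

    Every other token stays None, meaning "label unknown", not "no entity".
    """
    tags: List[Optional[str]] = [None] * len(tokens)
    for start, end, title in links:
        tag = target_labels.get(title)
        if tag is None:
            continue  # link points to an entry that has no propagated label
        for i in range(start, end):
            tags[i] = tag
    return tags


# Step 1 (assumed given): source-language entries already carry NERC tags.
source_labels = {"Angela Merkel": "PER", "Madrid": "LOC", "Reuters": "ORG"}

# Toy source-to-target language links; "Reuters" has no link in this example.
language_links = {"Angela Merkel": "Angela Merkel", "Madrid": "Madrid"}

target_labels = propagate_labels(source_labels, language_links)

# Toy target-language sentence with two linked mentions.
tokens = ["Merkel", "visitó", "Madrid", "."]
links = [(0, 1, "Angela Merkel"), (2, 3, "Madrid")]

tags = annotate_mentions(tokens, links, target_labels)
print(list(zip(tokens, tags)))
```

Keeping unlinked tokens as `None` is the design choice that matters here: a learner for partially annotated data can then treat those positions as latent instead of assuming they are not entities.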

Notes

1. http://ifarm.nl/signll/conll/
2. http://evalita.fbk.eu/
3. http://www.wikipedia.org/
4. https://en.wikipedia.org/wiki/Help:Category
5. http://www.reuters.com/
6. http://efe.com/

Author information

Authors and Affiliations

Universidade Federal de Mato Grosso do Sul, Campo Grande, Brazil
Eraldo R. Fernandes
Leuphana University of Lüneburg, Lüneburg, Germany
Ulf Brefeld
Yahoo! Labs, London, UK
Roi Blanco
University of the Basque Country, Donostia, Spain
Jordi Atserias

Corresponding author

Correspondence to Ulf Brefeld.

Editor information

Editors and Affiliations

University of Kassel, Kassel, Hessen, Germany
Martin Atzmueller
BMW Technology Group, Chicago, Illinois, USA
Alvin Chin
Knowledge Engineering, Technische Universität Darmstadt, Darmstadt, Hessen, Germany
Frederik Janssen
TU Darmstadt, Darmstadt, Germany
Immanuel Schweizer
Graz University of Technology, Graz, Austria
Christoph Trattner

About this paper

Cite this paper

Fernandes, E.R., Brefeld, U., Blanco, R., Atserias, J. (2016). Using Wikipedia for Cross-Language Named Entity Recognition. In: Atzmueller, M., Chin, A., Janssen, F., Schweizer, I., Trattner, C. (eds) Big Data Analytics in the Social and Ubiquitous Context. SENSEML MUSE MSM 2015 2014 2014. Lecture Notes in Computer Science, vol 9546. Springer, Cham. https://doi.org/10.1007/978-3-319-29009-6_1

DOI: https://doi.org/10.1007/978-3-319-29009-6_1
Published: 07 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29008-9
Online ISBN: 978-3-319-29009-6
eBook Packages: Computer Science, Computer Science (R0)
