Using Wikipedia for Cross-Language Named Entity Recognition

  • Conference paper
Big Data Analytics in the Social and Ubiquitous Context (SENSEML 2015, MUSE 2014, MSM 2014)

Abstract

Named entity recognition and classification (NERC) is fundamental for natural language processing tasks such as information extraction, question answering, and topic detection. State-of-the-art NERC systems are based on supervised machine learning and hence need to be trained on manually annotated corpora. However, annotated corpora hardly exist for non-standard languages, and labeling additional data manually is tedious and costly. In this article, we present a novel method to automatically generate partially annotated corpora for NERC by exploiting the link structure of Wikipedia. First, Wikipedia entries in the source language are labeled with the NERC tag set. Second, Wikipedia language links are exploited to propagate the annotations to the target language. Finally, mentions of the labeled entities in the target language are annotated with the respective tags. The procedure results in a partially annotated corpus that is likely to contain unannotated entities. To learn from such partially annotated data, we devise two simple extensions of hidden Markov models (HMMs) and structural perceptrons. Empirically, we observe that using the automatically generated data leads to more accurate prediction models than off-the-shelf NERC methods. We demonstrate that the novel extensions of HMMs and perceptrons effectively exploit the partially annotated data and outperform their baseline counterparts in all settings.
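The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy dictionaries, the function names propagate_tags and annotate, and the use of exact string matching to find mentions are all assumptions made for illustration. Tokens that match no known entity receive None (label unknown) rather than an explicit "outside" tag, which is what makes the resulting corpus only partially annotated.

```python
from typing import Dict, List, Optional

# Step 1 (hypothetical toy input): source-language entries with NERC tags.
SOURCE_TAGS: Dict[str, str] = {
    "Germany": "LOC",
    "United Nations": "ORG",
    "Albert Einstein": "PER",
}

# Step 2 (hypothetical toy input): language links, source title -> target title.
LANGUAGE_LINKS: Dict[str, str] = {
    "Germany": "Alemania",
    "United Nations": "Naciones Unidas",
    "Albert Einstein": "Albert Einstein",
}


def propagate_tags(source_tags: Dict[str, str],
                   language_links: Dict[str, str]) -> Dict[str, str]:
    """Carry each tag over to the linked target-language title, if a link exists."""
    return {
        language_links[title]: tag
        for title, tag in source_tags.items()
        if title in language_links
    }


def annotate(tokens: List[str],
             target_tags: Dict[str, str]) -> List[Optional[str]]:
    """Step 3: tag mentions by greedy longest match (an assumed simplification).

    Unmatched tokens stay None, meaning "unknown", not "no entity".
    """
    entries = {tuple(title.split()): tag for title, tag in target_tags.items()}
    max_len = max((len(key) for key in entries), default=0)
    labels: List[Optional[str]] = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in entries:
                for j in range(i, i + n):
                    labels[j] = entries[span]
                step = n
                break
        i += step
    return labels


target_tags = propagate_tags(SOURCE_TAGS, LANGUAGE_LINKS)
sentence = "Albert Einstein nació en Alemania".split()
partial_labels = annotate(sentence, target_tags)
```

A learner that consumes partial_labels has to treat the None positions as latent rather than as negative examples, which is the setting the HMM and perceptron extensions in the abstract are designed for.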

Author information

Corresponding author

Correspondence to Ulf Brefeld.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Fernandes, E.R., Brefeld, U., Blanco, R., Atserias, J. (2016). Using Wikipedia for Cross-Language Named Entity Recognition. In: Atzmueller, M., Chin, A., Janssen, F., Schweizer, I., Trattner, C. (eds) Big Data Analytics in the Social and Ubiquitous Context. SENSEML 2015, MUSE 2014, MSM 2014. Lecture Notes in Computer Science, vol 9546. Springer, Cham. https://doi.org/10.1007/978-3-319-29009-6_1

  • DOI: https://doi.org/10.1007/978-3-319-29009-6_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29008-9

  • Online ISBN: 978-3-319-29009-6

  • eBook Packages: Computer Science (R0)
