Reassessing the value of resources for cross-lingual transfer of POS tagging models

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

When linguistically annotated data is scarce, as is the case for many under-resourced languages, one has to resort to less complete forms of annotations obtained from crawled dictionaries and/or through cross-lingual transfer. Several recent works have shown that learning from such partially supervised data can be effective in many practical situations. In this work, we review two existing proposals for learning with ambiguous labels which extend conventional learners to the weakly supervised setting: a history-based model using a variant of the perceptron, on the one hand; an extension of the Conditional Random Fields model on the other hand. Focusing on the part-of-speech tagging task, but considering a large set of ten languages, we show (a) that good performance can be achieved even in the presence of ambiguity, provided however that both monolingual and bilingual resources are available; (b) that our two learners exploit different characteristics of the training set, and are successful in different situations; (c) that in addition to the choice of an adequate learning algorithm, many other factors are critical for achieving good performance in a cross-lingual transfer setting.

Notes

  1. http://wiktionary.org.

  2. A more recent revision of this tagset distinguishes 15 categories—see http://universaldependencies.github.io/docs/u/pos/index.html.

  3. See https://perso.limsi.fr/wisniews/weakly/index.html for details.

  4. This indicator can only be computed for labeled data.

  5. Alignment links at the word level can easily be obtained using tools such as mgiza (Och and Ney 2003; Gao and Vogel 2008).

  6. If a word appears only in the Wiktionary or only in the bitext constraints, its type constraint will not be modified.

  7. It may seem paradoxical that the precision of the union is sometimes worse than that of wiki; this happens when the coverage of wiki is small (e.g. for Arabic), so that the noise added through projection has an overall negative impact on precision.

  8. Arabic is special mostly due to the very low coverage of the wiki constraints (Table 2), which, by contrast, explains the usefulness of token constraints (Table 3), as they were hardly filtered by the wiki constraints. This can be explained by the complexity of the Arabic morphological system (as well as its peculiar description in Wiktionary), which prevented us from extracting a sufficient number of inflected word forms.

  9. http://wiktionary.org.

  10. Wiktionary statistics can be found at https://meta.wikimedia.org/wiki/Wiktionary.

  11. http://en.wiktionary.org/wiki/Wiktionary:Entry_layout_explained.

  12. Parameters are separated from the template name using the pipe symbol (‘|’); a toy illustration of this format is given after these notes.

  13. http://www.mediawiki.org/wiki/API:Main_page.

  14. We first consider a simple multiclass classification problem, then generalize the main concepts to the more interesting structured learning case.

  15. In a subsequent erratum (http://www.dipanjandas.com/files/erratum), the authors present corrected results which show, as we do, that using type constraints to restrict the search space can in fact be detrimental to training.

  16. The alignment models used in Moses produce asymmetric alignments, with many-to-one links allowed only in one direction. As is customary, we successively compute alignments in the two directions and only keep the alignment links that are found in both (see the sketch after these notes).

  17. The accuracy of this POS tagger on the Penn Treebank exceeds 97 %, close to the best performance reported on this corpus.

  18. At least to languages using (variants of) the Latin alphabet.

  19. We did not observe significant changes when varying the hyper-parameter values, or when using only \(\ell _1\) or only \(\ell _2\) regularization.

  20. The test sets are the same only for Czech, Greek and Swedish.

  21. And consequently the ambiguity level. Note that when sampling additional labels from the type constraints, ambiguity does not always increase with \(\alpha \), as some words may be associated with a single label (see the illustrative sketch after these notes).

  22. For space reasons, we only present in Figs. 4 and 5 the results obtained for the French corpus. Similar observations were made for the other languages.

  23. We even observe cases where the use of type constraints in decoding is detrimental to performance, due to the high number of erroneous tags induced by these constraints.

  24. http://nlp.lsi.upc.edu/freeling/.
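
The following toy sketch (in Python) illustrates the pipe-separated template format mentioned in note 12. The template name and parameters are made up, and the parser deliberately ignores nested templates and named parameters; it is not the extraction code used for the experiments.

    def parse_template(wikitext):
        """Split a MediaWiki template call into its name and parameters.

        Per note 12, parameters are separated from the template name (and
        from one another) by the pipe symbol '|'. Nested templates and
        named parameters are ignored in this toy version.
        """
        inner = wikitext.strip().strip("{}")
        name, *params = inner.split("|")
        return name, params

    # A made-up example of the kind of template found in Wiktionary entries:
    # parse_template("{{some-template|en|noun}}") -> ("some-template", ["en", "noun"])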
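
As an illustration of the symmetrization heuristic of note 16, the sketch below intersects two directional alignments. The file format (one sentence per line, links written as "i-j" index pairs) and the function names are assumptions made for this example, not a description of the exact pipeline used in the paper.

    def read_alignments(path):
        """Yield, for each line, the set of (src, tgt) index pairs it contains."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield {tuple(map(int, link.split("-"))) for link in line.split()}

    def symmetrize_by_intersection(src2tgt_path, tgt2src_path):
        """Keep only the alignment links found in both directions.

        The reverse alignment is assumed to list its links as "j-i", so its
        pairs are flipped before intersecting.
        """
        for forward, backward in zip(read_alignments(src2tgt_path),
                                     read_alignments(tgt2src_path)):
            flipped = {(i, j) for (j, i) in backward}
            yield sorted(forward & flipped)

    # Usage (file names are placeholders):
    # for links in symmetrize_by_intersection("fr-en.align", "en-fr.align"):
    #     print(" ".join(f"{i}-{j}" for i, j in links))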
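
Finally, a minimal sketch of the label sampling alluded to in note 21: each tag licensed by a word's type constraint, other than the reference tag, is added to the token's label set with probability \(\alpha \). The function name and interface are illustrative assumptions; the exact procedure is described in the body of the paper.

    import random

    def ambiguate(gold_tag, type_constraint, alpha, rng=random):
        """Build an ambiguous label set for one token (illustrative only).

        Start from the gold tag and add every other tag allowed by the
        word's type constraint with probability alpha. As note 21 points
        out, a word whose type constraint contains a single tag remains
        unambiguous whatever the value of alpha.
        """
        labels = {gold_tag}
        for tag in type_constraint:
            if tag != gold_tag and rng.random() < alpha:
                labels.add(tag)
        return labels

    # Example: a word licensed as NOUN, VERB or ADJ by its type constraint.
    # ambiguate("NOUN", {"NOUN", "VERB", "ADJ"}, alpha=0.5)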

References

  • Berg-Kirkpatrick, T., Bouchard-Côté, A., DeNero, J., & Klein, D. (2010). Painless unsupervised learning with features. In Proceedings of the 2010 annual conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Los Angeles, California, NAACL HLT’10 (pp. 582–590).

  • Black, E., Jelinek, F., Lafferty, J., Magerman, D.M., Mercer, R., & Roukos, S. (1992). Towards history-based grammars: Using richer models for probabilistic parsing. In Proceedings of the workshop on speech and natural language, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT’91 (pp. 134–139).

  • Bordes, A., Usunier, N., & Weston, J. (2010). Label ranking under ambiguous supervision for learning semantic correspondences. In Proceedings of the international conference on machine learning, ICML’10 (pp. 103–110).

  • Borin, L. (2002). Alignment and tagging. Language and Computers, 43(1), 207–217.

  • Broschart, J. (2009). Why Tongan does it differently: Categorial distinctions in a language without nouns and verbs. Linguistic Typology, 1, 123–166.

  • Christodoulopoulos, C., Goldwater, S., & Steedman, M. (2010). Two decades of unsupervised POS induction: How far have we come? In Proceedings of the 2010 conference on empirical methods in natural language processing, Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’10 (pp. 575–584).

  • Cohen, S.B., Das, D., & Smith, N.A. (2011). Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of the 2011 conference on empirical methods in natural language processing, Edinburgh, Scotland, UK., EMNLP’11 (pp. 50–61).

  • Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12, 1501–1536.

  • Das, D., & Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT ’11 (pp. 600–609).

  • Daumé, H.III., & Marcu, D. (2005). Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd international conference on machine learning, ACM, New York, NY, USA, ICML’05 (pp. 169–176).

  • Dorr, B. J. (1994). Machine translation divergences: A formal description and proposed solution. Computational Linguistics, 20(4), 597–633.

  • Dredze, M., Talukdar, P.P., & Crammer, K. (2009). Sequence learning from data with multiple labels. In Proceedings of the ECML-PKDD 2009 workshop on learning from multi-label data, MLD.

  • Duong, L., Cohn, T., Verspoor, K., Bird, S., & Cook, P. (2014). What can we get from 1000 tokens? A case study of multilingual POS tagging for resource-poor languages. In Proceedings of the 2014 conference on empirical methods in natural language processing, Association for Computational Linguistics, Doha, Qatar, EMNLP’14 (pp. 886–897).

  • Durrett, G., Pauls, A., & Klein, D. (2012). Syntactic transfer using a bilingual lexicon. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP-CoNLL’12 (pp. 1–11).

  • Evans, N., & Levinson, S. C. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32, 429–448.

  • Ganchev, K., & Das, D. (2013). Cross-lingual discriminative learning of sequence models with posterior regularization. In Proceedings of the 2013 conference on empirical methods in natural language processing, Seattle, Washington, USA, EMNLP’13 (pp. 1996–2006).

  • Ganchev, K., Gillenwater, J., & Taskar, B. (2009). Dependency grammar induction via bitext projection constraints. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, Stroudsburg, PA, USA, ACL ’09 (pp. 369–377).

  • Gao, Q., & Vogel, S. (2008). Parallel implementations of word alignment tool. In Software engineering, testing, and quality assurance for natural language processing, SETQA-NLP ’08 (pp. 49–57).

  • Garrette, D., & Baldridge, J. (2013). Learning a part-of-speech tagger from two hours of annotation. In Proceedings of the 2013 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Association for Computational Linguistics, Atlanta, Georgia, NAACL’13 (pp. 138–147).

  • Graça, J., Ganchev, K., & Taskar, B. (2007). Expectation maximization and posterior constraints. In Proceedings of the neural information processing systems.

  • Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M.A., Màrquez, L., Meyers, A., Nivre, J., Padó, S., Štěpánek, J., Straňák, P., Surdeanu, M., Xue, N., & Zhang, Y. (2009). The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the thirteenth conference on computational natural language learning: Shared task, Association for Computational Linguistics, Boulder, Colorado, CoNLL’09 (pp. 1–18).

  • Hwa, R., Resnik, P., Weinberg, A., Cabezas, C., & Kolak, O. (2005). Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering, 11(3), 311–325.

  • Jin, R., & Ghahramani, Z. (2002). Learning with multiple labels. In S. Thrun & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15, NIPS’02 (pp. 897–904). Cambridge, MA: MIT Press.

  • Kazama, J., & Torisawa, K. (2007). A new perceptron algorithm for sequence labeling with non-local features. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning, EMNLP-CoNLL’07 (pp. 315–324).

  • Kim, S., Toutanova, K., & Yu, H. (2012). Multilingual named entity recognition using parallel data and metadata from Wikipedia. In Proceedings of the 50th annual meeting of the Association for Computational Linguistics: Long papers, Stroudsburg, PA, USA, ACL ’12 (pp. 694–702).

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’07 (pp. 177–180).

  • Kozhevnikov, M., & Titov, I. (2013). Cross-lingual transfer of semantic role labeling models. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics: Long papers, Sofia, Bulgaria, ACL’13 (pp. 1190–1200).

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th international conference on machine learning, Morgan Kaufmann, San Francisco, CA, ICML’01 (pp. 282–289).

  • Lavergne, T., Cappé, O., & Yvon, F. (2010). Practical very large scale CRFs. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, ACL’10 (pp. 504–513).

  • Li, S., Graça, J. V., & Taskar, B. (2012a). Wiki-ly supervised part-of-speech tagging. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, EMNLP-CoNLL ’12 (pp. 1389–1398). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390948.2391106.

  • Li, W., Duan, L., Tsang, I. W. H., & Xu, D. (2012b). Co-labeling: A new multi-view learning approach for ambiguous problems. In Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., & Wu, X. (Eds.) Proceedings of the IEEE 12th international conference on data mining, IEEE Computer Society, ICDM (pp. 419–428).

  • Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In Proceedings of the conference on Arabic language resources and tools.

  • McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Bertomeu Castelló, N., & Lee, J. (2013). Universal dependency annotation for multilingual parsing. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics: Short Papers, Sofia, Bulgaria, ACL’13 (pp. 92–97).

  • Moore, R. (2014). Fast high-accuracy part-of-speech tagging by independent classifiers. In Proceedings of the 25th international conference on computational linguistics: Technical papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, COLING’14 (pp. 1165–1176).

  • Mérialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics, 20(2), 155–171.

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Ozdowska, S. (2006). Projecting POS tags and syntactic dependencies from English and French to Polish in aligned corpora. In Proceedings of the international workshop on cross-language knowledge induction, Association for Computational Linguistics, Stroudsburg, PA, USA, CrossLangInduction ’06 (pp. 53–60). http://dl.acm.org/citation.cfm?id=1608842.1608850.

  • Padó, S., & Lapata, M. (2009). Cross-lingual annotation projection of semantic roles. Journal of Artificial Intelligence Research, 36(1), 307–340.

  • Petrov, S., Das, D., & McDonald, R. (2012). A universal part-of-speech tagset. In Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Odijk, J., & Piperidis, S. (Eds.), Proceedings of the eighth international conference on language resources and evaluation, European Language Resources Association (ELRA), Istanbul, Turkey, LREC’12.

  • van der Plas, L., Apidianaki, M., & Chen, C. (2014). Global methods for cross-lingual semantic role and predicate labelling. In Proceedings of the 25th international conference on computational linguistics: Technical papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, COLING’14 (pp. 1279–1290).

  • Prokopidis, P., Desypri, E., Koutsombogera, M., Papageorgiou, H., & Piperidis, S. (2005). Theoretical and practical issues in the construction of a Greek dependency treebank. In Civit, M., Kubler, S., Marti, M. A. (Eds.), Proceedings of the fourth workshop on treebanks and linguistic theories, Universitat de Barcelona, Barcelona, Spain, TLT’05 (pp. 149–160).

  • Ratinov, L., & Roth, D. (2009). Design challenges and misconceptions in named entity recognition. In Proceedings of the thirteenth conference on computational natural language learning, Association for Computational Linguistics, Stroudsburg, PA, USA, CoNLL ’09 (pp. 147–155).

  • Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of the 1996 conference on empirical methods in natural language processing, Association for Computational Linguistics, EMNLP’96.

  • Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of ICNN (pp. 586–591).

  • Ross, S., & Bagnell, D. (2010). Efficient reductions for imitation learning. In Proceedings of the international conference on artificial intelligence on statistics, AISTATS’10 (pp. 661–668).

  • Sutton, C., & McCallum, A. (2012). An introduction to conditional random fields for relational learning. Foundations and Trends in Machine Learning, 4(4), 267–373. doi:10.1561/2200000013.

  • Täckström, O., Das, D., Petrov, S., McDonald, R., & Nivre, J. (2013). Token and type constraints for cross-lingual part-of-speech tagging. Transactions of the Association for Computational Linguistics, 1, 1–12.

  • Tiedemann, J. (2014). Rediscovering annotation projection for cross-lingual parser induction. In Proceedings of the 25th international conference on computational linguistics: Technical papers, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, COLING’14 (pp. 1854–1864).

  • Tsuboi, Y., Kashima, H., Oda, H., Mori, S., & Matsumoto, Y. (2008). Training conditional random fields using incomplete annotations. In Proceedings of the 22nd international conference on computational linguistics, COLING’08 (vol 1, pp. 897–904).

  • Tsuruoka, Y., Miyao, Y., & Kazama, J. (2011). Learning with lookahead: Can history-based models rival globally optimized models? In Proceedings of the fifteenth conference on computational natural language learning, Association for Computational Linguistics, Portland, Oregon, USA, CoNLL’11 (pp. 238–246).

  • Vapnik, V. (1995). The nature of statistical learning. New York: Springer.

  • Wang, M., & Manning, C. D. (2014). Cross-lingual projected expectation regularization for weakly supervised learning. Transactions of the Association for Computational Linguistics, 2, 55–66.

  • Wisniewski, G., Pécheux, N., Gahbiche-Braham, S., & Yvon, F. (2014). Cross-lingual part-of-speech tagging through ambiguous learning. In Proceedings of the 2014 conference on empirical methods in natural language processing, Association for Computational Linguistics, Doha, Qatar, EMNLP’14 (pp. 1779–1785).

  • Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. doi:10.1016/S0893-6080(05)80023-1.

  • Yarowsky, D., & Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL ’01 (pp. 1–8).

  • Yarowsky, D., Ngai, G., & Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on human language technology research, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT ’01 (pp. 1–8).

  • Zeman, D., & Resnik, P. (2008). Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP-08 workshop on NLP for less privileged languages, Hyderabad, India (pp. 35–42).

  • Zhang, Y., Reichart, R., Barzilay, R., & Globerson, A. (2012). Learning to map into a universal POS tagset. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, Stroudsburg, PA, USA, EMNLP-CoNLL ’12 (pp. 1368–1378).

  • Zhao, H., Song, Y., Kit, C., & Zhou, G. (2009). Cross-language dependency parsing using a bilingual lexicon. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’09 (pp. 55–63).

Acknowledgments

We wish to thank Thomas Lavergne and Alexandre Allauzen for early feedback and for providing us with the partially observed CRF implementation.

Author information

Corresponding author

Correspondence to Guillaume Wisniewski.

Appendix

See Table 9.

Table 9 Error rate (in %) achieved by POCRF and the skip version of POCRF, depending on the configuration of type constraints (‘type’ column) and on whether they are used to restrict the allowed labels when computing the partition function at training (‘train’ column) and test (‘test’ column) time

About this article

Cite this article

Pécheux, N., Wisniewski, G. & Yvon, F. Reassessing the value of resources for cross-lingual transfer of POS tagging models. Lang Resources & Evaluation 51, 927–960 (2017). https://doi.org/10.1007/s10579-016-9362-7
