From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

  • Original Paper
  • Language Resources and Evaluation

Abstract

By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in natural language processing. We examine, among other aspects, the quantity of data needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we show how existing resources can be adapted to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. To verify its general applicability, we evaluate its performance separately on different genres, authors and varieties of MHG. We also explore self-learning techniques, which have the advantage that unannotated data can be used to improve tagging performance on specific subcorpora.

Notes

  1. We consider the following tags to be open class tags: NOUN, PROPN, ADJ, ADV, VERB, INTJ and all combining tags.

  2. Since MHG has no native speakers, linguistic intuition is not a reliable criterion; we therefore rely on the validity judgements of educated German medievalists.

  3. Code downloaded from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.

  4. Code downloaded from http://cistern.cis.lmu.de/marmot/.

  5. We use 5-fold cross-validation for all settings. The test set splits are kept the same throughout the experiments.

  6. According to McNemar’s test using the “mid-p” variant (Fagerland et al. 2013) with a significance level \(\alpha \) of 0.05.

  7. The decreased quality stems from the differing goals, tagsets or annotation guidelines of the data resources, not from their intrinsic quality.

  8. The order of the annotated tags does not carry any meaning; all annotated tags are treated equally (regardless of their order) when they are used as features for the disambiguation.

  9. Evaluated on the manually annotated corpus.

  10. We use 80% of the manually annotated data for training the disambiguation model and 20% for testing.

  11. According to McNemar’s test using the “mid-p” variant (Fagerland et al. 2013) with a significance level \(\alpha \) of 0.05.

  12. Normal form is a term used by Klein and Dipper (2016). It describes a word form that minimizes differences in spelling and in the use of diacritics but does not standardize dialectal varieties [cf. Klein and Dipper (2016, pp. 7, 14)].

  13. Following the mapping given in Table 8 in "Appendix".

  14. Closeness is defined as occurrence within the same clause or, where sentence delimiters are missing, within a context window of 10.

  15. https://www.linguistics.rub.de/rem/, 19.06.2017.

  16. For more information on the geographical division of the speaking areas, see Hennings (2003, pp. 18–20).

  17. Development set sizes: 1072 tokens for Hessian and 1007 tokens for Middle Low German.

  18. This value has been determined empirically.

  19. These values have been determined empirically.

  20. According to McNemar’s test using the “mid-p” variant (Fagerland et al. 2013) with a significance level \(\alpha \) of 0.05.

  21. Clarin-D repository, metadata handle: http://hdl.handle.net/11022/1007-0000-0001-877B-D, landing page of TreeTagger where the model can be found: http://hdl.handle.net/11022/1007-0000-0000-8E4D-B

  22. Accessible via http://clarin05.ims.uni-stuttgart.de/mhdtt/index.html
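
Notes 6, 11 and 20 refer to McNemar’s test in its “mid-p” variant (Fagerland et al. 2013). The following is a minimal sketch of how such a comparison between two taggers could be computed; it is not the authors’ code. The function name `mcnemar_midp`, the toy tag sequences and the use of plain Python lists are assumptions made for illustration. The mid-p value is the two-sided exact binomial p-value on the discordant pairs minus the probability of the observed count.

```python
from math import comb


def mcnemar_midp(gold, pred_a, pred_b):
    """Mid-p McNemar test on the tokens where exactly one tagger is correct.

    gold, pred_a, pred_b: equally long sequences of tags (hypothetical inputs).
    Returns the two-sided mid-p value.
    """
    # b: tagger A correct, tagger B wrong; c: tagger A wrong, tagger B correct
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference

    k = min(b, c)

    def pmf(x):
        # Binomial(n, 0.5) probability of exactly x successes
        return comb(n, x) * 0.5 ** n

    cdf_k = sum(pmf(x) for x in range(k + 1))
    # Two-sided exact p-value minus the point probability of the observed count
    return 2 * cdf_k - pmf(k)


# Toy example: tagger B is correct on all 10 tokens, tagger A on none.
gold = ["NOUN"] * 10
tagger_a = ["VERB"] * 10
tagger_b = ["NOUN"] * 10
p_value = mcnemar_midp(gold, tagger_a, tagger_b)
significant = p_value < 0.05  # significance level alpha = 0.05, as in the notes
```

Only the discordant tokens enter the test, so two taggers that make exactly the same errors yield a p-value of 1 regardless of their overall accuracy.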

References

  • Barteld, F., Schröder, I., & Zinsmeister, H. (2015). Unsupervised regularization of historical texts for POS tagging. In Proceedings of the 4th workshop on corpus-based research in the humanities (CRH) (pp. 3–12). Polish Academy of Sciences.

  • Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Wortman Vaughan, J. (2010). A theory of learning from different domains. Machine Learning, 79(1–2), 151–175.

  • Blitzer, J., McDonald, R., & Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP ’06), (pp. 120–128). Stroudsburg, PA, USA. Association for Computational Linguistics.

  • Brants, T. (2000). TnT: A Statistical Part-of-speech tagger. In Proceedings of the Sixth conference on applied natural language processing (ANLC ’00) (pp. 224–231). Stroudsburg, PA, USA. Association for Computational Linguistics.

  • Busa, R. (1980). The annals of humanities computing: The index thomisticus. Computers and the Humanities, 14(2), 83–90.

  • Celano, G., Crane, G., & Majidi, S. (2016). Part of speech tagging for ancient Greek. Open Linguistics, 2(1), 393–399.

  • Choi, J. D. (2016). Dynamic feature induction: The last gist to the state-of-the-art. In NAACL HLT 2016, The 2016 conference of the North American chapter of the association for computational linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, (pp. 271–281).

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213–220.

  • Daumé III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th annual meeting of the association of computational linguistics (pp. 256–263), Prague, Czech Republic. Association for Computational Linguistics.

  • Dipper, S. (2011). Morphological and part-of-speech tagging of historical language data: A comparison. JLCL, 26(2), 25–37.

  • Dipper, S., Donhauser, K., Klein, T., Linde, S., Müller, S., & Wegera, K.-P. (2013). HiTS: ein Tagset für historische Sprachstufen des Deutschen. JLCL, 28(1), 85–137.

  • Fagerland, M. W., Lydersen, S., & Laake, P. (2013). The McNemar test for binary matched-pairs data: mid-p and asymptotic are better than exact conditional. BMC Medical Research Methodology, 13(1), 91.

  • Garrette, D. & Baldridge, J. (2013). Learning a part-of-speech tagger from two hours of annotation. In Proceedings of the North American chapter of the association for computational linguistics: Human Language Technologies (NAACL-HLT-13) (pp. 138–147). Atlanta, GA.

  • Giesbrecht, E. & Evert, S. (2009). Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German web as Corpus. In I. Alegria, I. Leturia & S. Sharoff (Eds.), Proceedings of the 5th web as corpus workshop (WAC5) (pp. 27–35), San Sebastian, Spain.

  • Goldberg, Y., Adler, M., & Elhadad, M. (2008). EM can find pretty good HMM POS-taggers (when given a good start). In K. McKeown, J. D. Moore, S. Teufel, J. Allan, S. Furui (Eds.), ACL (pp. 746–754). The Association for Computer Linguistics.

  • Hardmeier, C. (2016). A neural model for part-of-speech tagging in historical texts. In COLING 2016, 26th international conference on computational linguistics, proceedings of the conference: technical papers, December 11–16, 2016, Osaka, Japan (pp. 922–931).

  • Hennings, T. (2003). Einführung in das Mittelhochdeutsche. De Gruyter Studienbuch: De Gruyter.

  • Hupkes, D. & Bod, R. (2016). POS-tagging of historical Dutch. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC 2016) (pp. 77–82). Paris, France: European Language Resources Association (ELRA).

  • Jiang, J. & Zhai, C. (2007). Instance weighting for domain adaptation in NLP. In ACL 2007 (pp. 264–271).

  • Klein, T. & Dipper, S. (2016). Handbuch zum Referenzkorpus Mittelhochdeutsch.

  • Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics, 20(2), 155–171.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems 26, (pp. 3111–3119). Curran Associates, Inc.

  • Mittelhochdeutsche Begriffsdatenbank (1992–2017). Mittelhochdeutsche Begriffsdatenbank (MHDBDB). http://www.mhdbdb.sbg.ac.at/.

  • Moser, H. (Ed.) (1977). Des Minnesangs Frühling: Unter Benutzung der Ausgaben von Karl Lachmann und Moriz Haupt, Friedrich Vogt und Carl von Kraus. Bearbeitet von Hugo Moser und Helmut Tervooren. S. Hirzel, 36th edition.

  • Müller, T., Schmid, H., & Schütze, H. (2013). Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 conference on empirical methods in natural language processing (pp. 322–332).

  • Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R. T., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., & Zeman, D. (2016). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the tenth international conference on language resources and evaluation LREC 2016, Portorož, Slovenia, May 23–28, 2016 (pp. 1659–1666).

  • Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Conference on empirical methods in natural language processing (pp. 133–142).

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International conference on new methods in language processing (pp. 44–49). Manchester, UK.

  • Schulz, S. & Kuhn, J. (2016). Learning from Within? Comparing PoS tagging approaches for historical text. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC) (pp. 4316–4322). European Language Resources Association (ELRA).

  • Smith, N. A. & Eisner, J. (2005). Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd annual meeting on association for computational linguistics (ACL ’05) (pp. 354–362). Stroudsburg, PA, USA; Association for Computational Linguistics.

  • Søgaard, A. (2010). Simple semi-supervised training of part-of-speech taggers. In Proceedings of the ACL 2010 conference short papers (ACLShort ’10), (pp. 205–208). Stroudsburg, PA, USA: Association for Computational Linguistics.

  • Yang, Y. & Eisenstein, J. (2015). Unsupervised multi-domain adaptation with feature embeddings. In NAACL HLT 2015, the 2015 conference of the North American chapter of the association for computational linguistics: Human Language Technologies, Denver, Colorado, USA, May 31–June 5, 2015 (pp. 672–682).

  • Yang, Y. & Eisenstein, J. (2016). Part-of-speech tagging for historical English. In Proceedings of the 2016 conference of the North American chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1318–1328). Association for Computational Linguistics.

  • Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd annual meeting on association for computational linguistics (ACL ’95) (pp. 189–196). Stroudsburg, PA, USA: Association for Computational Linguistics.

  • Zeldes, A., & Schroeder, C. T. (2015). Computational methods for Coptic: Developing and using part-of-speech tagging for digital scholarship in the humanities. Digital Scholarship in the Humanities, 30(supp–1), 164–176.

  • Zhou, Z.-H., & Li, M. (2005). Tri-Training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11), 1529–1541.

Acknowledgements

This work was completed within the Center for Reflected Text Analytics (CRETA) which is supported by the German Ministry of Education and Research (BMBF) and we are grateful for their financial support. We also want to thank our colleagues at the MHDBDB for their collaboration and the reviewers for their helpful comments. This work is based on a talk given at the meeting of “Digital Humanities im deutschsprachigen Raum” (DHd) in March 2017 in Bern.

Funding

Funding was provided by Bundesministerium für Bildung und Forschung (Grant No. 01UG1601).

Author information

Corresponding author

Correspondence to Sarah Schulz.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations

Appendix: A mapping HiTS to UD tagset

Table 8 Direct mapping from HiTS to universal dependencies tagset

Cite this article

Schulz, S., Ketschik, N. From 0 to 10 million annotated words: part-of-speech tagging for Middle High German. Lang Resources & Evaluation 53, 837–863 (2019). https://doi.org/10.1007/s10579-019-09462-8

  • DOI: https://doi.org/10.1007/s10579-019-09462-8
