ABSTRACT
In the world of non-proprietary NLP software the standard, and perhaps the best, HMM-based POS tagger is TnT (Brants, 2000). We argue here that some of the criticism aimed at HMM performance on languages with rich morphology should more properly be directed at TnT's peculiar license, free but not open source, since it is those details of the implementation which are hidden from the user that hold the key for improved POS tagging across a wider variety of languages. We present HunPos, a free and open source (LGPL-licensed) alternative, which can be tuned by the user to fully utilize the potential of HMM architectures, offering performance comparable to more complex models, but preserving the ease and speed of the training and tagging process.
- Michele Banko and Robert C. Moore. 2004. Part of speech tagging in context. In COLING '04: Proceedings of the 20th international conference on Computational Linguistics, page 556, Morristown, NJ, USA. Association for Computational Linguistics. Google ScholarDigital Library
- Thorsten Brants. 2000. TnT -- a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA. Google ScholarDigital Library
- Kenneth Ward Church. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the second conference on Applied natural language processing, pages 136--143, Morristown, NJ, USA. Association for Computational Linguistics. Google ScholarDigital Library
- Dóra Csendes, Jánós Csirik, and Tibor Gyimóthy. 2004. The Szeged Corpus: A POS tagged and syntactically annotated Hungarian natural language corpus. In Karel Pala Petr Sojka, Ivan Kopecek, editor, Text, Speech and Dialogue: 7th International Conference, TSD, pages 41--47.Google ScholarCross Ref
- Steven J. DeRose. 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14:31--39. Google ScholarDigital Library
- Damien Doligez, Jacques Garrigue, Didier Rémy, and Jérôme Vouillon, 2004. The Objective Caml system. Institut National de Recherche en Informatique et en Automatique.Google Scholar
- Jesús Giménez and Lluís Màrquez. 2003. Fast and accurate part-of-speech tagging: The svm approach revisited. In Proceedings of RANLP, pages 153--163.Google Scholar
- Jan Hajič, Pavel Krbec, Karel Oliva, Pavel Květoň, and Vladimír Petkevič. 2001. Serial combination of rules and statistics: A case study in Czech tagging. In Proceedings of the 39th Association of Computational Linguistics Conference, pages 260--267, Toulouse, France. Google ScholarDigital Library
- Dilek Z. Hakkani-Tür, Kemal Oflazer, and Gökhan Tür. 2000. Statistical morphological disambiguation for agglutinative languages. In Proceedings of the 18th conference on Computational linguistics, pages 285--291, Saarbrücken, Germany. Google ScholarDigital Library
- Péter Halácsy, András Kornai, Csaba Oravecz, Viktor Trón, and Dániel Varga. 2006. Using a morphological analyzer in high precision POS tagging of Hungarian. In Proceedings of LREC 2006, pages 2245--2248.Google Scholar
- Bryan Jurish. 2003. A hybrid approach to part-of-speech tagging. Technical report, Berlin-Brandenburgische Akademie der Wissenschaften.Google Scholar
- Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Karel Pala Petr Sojka, Ivan Kopecek, editor, Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 133--142, University of Pennsylvania.Google Scholar
- Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-based morphological disambiguation with random fields. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, Vancouver. Google ScholarDigital Library
- Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of HLT-NAACL, pages 252--259. Google ScholarDigital Library
- Viktor Trón, Péter Halácsy, Péter Rebrus, András Rung, Péter Vajda, and Eszter Simon. 2006. Morphdb. hu: Hungarian lexical database and morphological grammar. In Proceedings of LREC 2006, pages 1670--1673.Google Scholar
- HunPos: an open source trigram tagger
Recommendations
Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages
The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS ...
SemEval-2010 task 3: cross-lingual word sense disambiguation
SEW '09: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future DirectionsWe propose a multilingual unsupervised Word Sense Disambiguation (WSD) task for a sample of English nouns. Instead of providing manually sensetagged examples for each sense of a polysemous noun, our sense inventory is built up on the basis of the ...
Hindi Word Sense Disambiguation Using Lesk Approach on Bigram and Trigram Words
AICTC '16: Proceedings of the International Conference on Advances in Information Communication Technology & ComputingWord Sense Disambiguation (WSD) is a vital task which provides the definition of particular words according to their sense or according to given context. Lesk algorithm is originally based on the gloss overlap that can be observed as the measure, ...
Comments