Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging

Denis, Pascal; Sagot, Benoît

doi:10.1007/s10579-012-9193-0

Original Paper
Published: 04 July 2012

Volume 46, pages 721–736, (2012)

Language Resources and Evaluation

Abstract

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on French tagging, we introduce a maximum entropy Markov model-based tagging system that is enriched with information extracted from a morphological resource. This system achieves 97.75 % accuracy on the French Treebank, an error reduction of 25 % (38 % on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half.

Notes

The MElt tagger is freely available from http://lingwb.gforge.inria.fr/. Results reported in this paper correspond to release MElt 1.0.
This tagset is known as Treebank+ in Crabbé and Candito (2008), and since then as CC (Candito et al. 2009).
The Lefff is freely distributed under the LGPL-LR license at http://alexina.gforge.inria.fr/.
Ratnaparkhi (1996) and Toutanova and Manning (2000) report accuracy scores of 96.43 and 96.86 % on sections 23–24 of the Penn Treebank, respectively.
Arguably better suited for sequential problems, Conditional Random Fields (CRF) (Lafferty et al. 2001) are considerably slower to train.
Available from http://www.cs.utah.edu/hal/megam/.
Recall that features in MaxEnt are functions ranging over both contexts and classes. A concrete example of one of our features is given below:
$$ f_{100}(h,t) = \begin{cases} 1 & \text{if } w_i = \text{``le''} \;\&\; t = \text{DET} \\ 0 & \text{otherwise} \end{cases} \qquad (3) $$
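As a rough illustration of the binary feature in (3), the following Python sketch encodes it as an indicator function. Representing the context h as a token list plus a position index, and naming the Python function f_100, are assumptions made for this example; they are not taken from the MElt code.

```python
from typing import Sequence


def f_100(words: Sequence[str], i: int, tag: str) -> int:
    """Indicator feature: 1 if the current word is "le" and the candidate tag is DET."""
    # (words, i) stands in for the context h; tag is the candidate class t.
    return 1 if words[i] == "le" and tag == "DET" else 0


sentence = ["le", "chat", "dort"]
print(f_100(sentence, 0, "DET"))  # feature fires
print(f_100(sentence, 0, "NC"))   # same word, different tag: no fire
print(f_100(sentence, 1, "DET"))  # different word: no fire
```

In a MaxEnt model each such feature is paired with a learned weight, and the weights of all features that fire are summed to score a candidate tag.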
Specifically, we used a prior with precision (i.e., inverse variance) of 1 (which is the default in Megam); other values were tested during development but did not yield improvements.
Informally, the effect of this kind of regularization is to penalize artificially large weights by forcing the weights to be distributed according to a Gaussian distribution with mean zero.
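To make the effect of the Gaussian prior concrete, here is a minimal sketch of the penalty term it adds to the training objective. It assumes the textbook form of a zero-mean Gaussian prior, in which the log-likelihood is reduced by precision/2 times the sum of squared weights. How Megam scales this term internally is not stated here, and both function names are hypothetical.

```python
from typing import Sequence


def gaussian_prior_penalty(weights: Sequence[float], precision: float = 1.0) -> float:
    """Negative log of a zero-mean Gaussian prior on the weights, up to a constant."""
    # precision is the inverse variance; larger values penalize large weights more.
    return 0.5 * precision * sum(w * w for w in weights)


def penalized_objective(log_likelihood: float,
                        weights: Sequence[float],
                        precision: float = 1.0) -> float:
    """Quantity maximized in training under the assumed textbook formulation."""
    return log_likelihood - gaussian_prior_penalty(weights, precision)


print(gaussian_prior_penalty([3.0, -4.0]))          # 0.5 * (9 + 16)
print(penalized_objective(-10.0, [1.0, 1.0], 1.0))  # -10 - 0.5 * 2
```

Because the penalty grows quadratically, a single very large weight costs more than several moderate ones, which is what pulls the weights toward zero.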
We tried larger values (i.e., 5, 10, 15, 20) during development, but none of these led to significant improvements.
Available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
The accuracy results of MElt^0_fr on ftb-dev are: 96.7 % overall and 86.2 % on unknown words.
Chi-square statistical significance tests were applied to changes in accuracy, with p set to 0.01 unless otherwise stated.
The accuracy results of MElt^f_fr on ftb-dev are: 97.23 % overall and 90.01 % on unknown words.
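The two sets of ftb-dev results given in these notes can be compared through relative error reduction, the measure used in the abstract. The sketch below assumes that the 96.7 % / 86.2 % figures belong to the tagger without lexical features and the 97.23 % / 90.01 % figures to the lexicon-enriched tagger, which the notes do not spell out. The function name is illustrative.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors removed; accuracies are percentages."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


overall = relative_error_reduction(96.7, 97.23)
unknown = relative_error_reduction(86.2, 90.01)
print(f"overall: {overall:.1%}, unknown words: {unknown:.1%}")
```

Under that assumption, the development-set reductions come to about 16 % overall and about 28 % on unknown words.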
An adaptation to French of the Morfette POS-tagger (Chrupała et al. 2008) using the ftb and the Lefff has been realized by G. Chrupała and D. Seddah (p.c.). Their accuracy results are similar to ours, although slightly lower (on the same data sets, Henestroza and Candito have obtained a 97.68 % accuracy). On other variants of the ftb, Chrupała and Seddah report 97.9 % (p.c.). However, these figures do not correspond exactly to the same experimental setup, as Morfette extracts and uses in its models an additional source of information, namely lemmas.
http://www.limsi.fr/TLP/grace/.
MElt has also been used for training POS taggers for Persian (Sagot et al. 2011) and Kurmanji Kurdish (Walther et al. 2010) (see below) based on noisy corpora and medium-size lexicons, with promising results.
We used a corpus of 20 million words extracted from the L’Est Républicain journalistic corpus, freely available at the web site of the CNRTL (http://www.cnrtl.fr/corpus/estrepublicain).
As computed by the bspline mode of gnuplot's contour lines generation algorithm.
The development times per sentence and per lexical entry mentioned in the previous paragraphs lead to the following formula for the total development time t(s, l) (expressed in seconds), in which s is the number of sentences, l the number of lexical entries: t(s, l) = 36s + 8,400 ⋅ log(s/100 + 1) + 2.4 ⋅ l.
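The formula above translates directly into a small function. This is only a sketch: the note does not say which logarithm base is meant, so the natural logarithm is assumed here, and the function and parameter names are chosen for this example.

```python
import math


def development_time_seconds(sentences: int, lexical_entries: int) -> float:
    """t(s, l) = 36*s + 8400*log(s/100 + 1) + 2.4*l, in seconds."""
    annotation = 36 * sentences                         # linear per-sentence cost
    startup = 8400 * math.log(sentences / 100 + 1)      # assumed natural log
    lexicon = 2.4 * lexical_entries                     # linear per-entry cost
    return annotation + startup + lexicon


# Compare a corpus-only setup with a smaller corpus plus a lexicon.
for s, l in [(1000, 0), (500, 10000)]:
    hours = development_time_seconds(s, l) / 3600
    print(f"s={s}, l={l}: {hours:.1f} h")
```

Under that assumption, 100 sentences with no lexicon come to roughly 9,400 seconds, or about 2.6 hours.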
Performing POS tagging with a morphological lexicon but without any training corpus is a significantly different task, addressed by a growing literature (Merialdo 1994; Ravi and Knight 2009; Smith and Eisner 2005). In that regard, MElt has been used in a simple experiment on Kurmanji Kurdish, a resource-scarce Iranian language (Walther et al. 2010): in that paper, the authors project the morphological lexicon they have built for that language, disambiguate the resulting ambiguous annotation in three different ways, merge these annotations to produce a (noisy) training corpus, and train a MElt tagger based on this corpus and their lexicon. Despite the simplicity of the three disambiguation techniques, the authors report an 85.7 % accuracy on a tiny gold standard using a 36-tag tagset.

References

Abeillé, A., Clément, L., & Toussenel, F. (2003). Building a treebank for French. In A. Abeillé (Ed.), Treebanks. Dordrecht: Kluwer.
Adda, G., Mariani, J., Paroubek, P., Rajman, M., & Lecomte, J. (1999). Métrique et premiers résultats de l’évaluation GRACE des étiqueteurs morphosyntaxiques pour le français. In TALN.
Berger, A., Pietra, S. D., & Pietra, V. D. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–71.
Candito, M., Crabbé, B., & Seddah, D. (2009). On statistical parsing of French with supervised and semi-supervised strategies. In Proceedings of the EACL’09 workshop on grammatical inference for computational linguistics. Athens, Greece.
Chrupała, G., Dinu, G., & van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the 6th language resource and evaluation conference. Marrakesh, Morocco.
Crabbé, B., & Candito, M. (2008). Expériences d’analyse syntaxique statistique du français. In Proceedings of TALN’08. Avignon, France.
Denis, P., & Sagot, B. (2009). Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proceedings of PACLIC 2009. Hong Kong, China.
Denis, P., & Sagot, B. (2010). Exploitation d’une ressource lexicale pour la construction d’un étiqueteur morphosyntaxique état-de-l’art du français. In Traitement Automatique des Langues Naturelles TALN 2010. Montréal, Canada. http://hal.inria.fr/inria-00521231/en.
Hajič, J. (2000). Morphological tagging: Data vs. dictionaries. In Proceedings of ANLP’00 (pp. 94–101). Seattle, WA.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML (pp. 282–289).
Malouf, R. (2002). A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the 6th workshop on natural language learning (pp. 49–55). Taipei, Taiwan.
Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330.
Merialdo, B. (1994). Tagging English text with a probabilistic model. Computational Linguistics, 20(2), 155–172.
Molinero, M. A., Sagot, B., & Nicolas, L. (2009). A morphological and syntactic wide-coverage lexicon for Spanish: The Leffe. In Proceedings of the 7th conference on recent advances in natural language processing (RANLP 2009). Borovets, Bulgaria.
Nasr, A., & Volanschi, A. (2004). Couplage d’un étiqueteur morpho-syntaxique et d’un analyseur partiel représentés sous la forme d’automates finis pondérés. In TALN.
Ratnaparkhi, A. (1996). A maximum entropy model for part-of-speech tagging. In Proceedings of international conference on empirical methods in natural language processing (pp. 133–142).
Ravi, S., & Knight, K. (2009). Minimized models for unsupervised part-of-speech tagging. In Proceedings of the joint conference of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing of the asian federation of natural language processing (ACL-IJCNLP’09) (pp. 504–512). Singapore.
Sagot, B. (2005). Automatic acquisition of a Slovak lexicon from a raw corpus. In Lecture Notes in Artificial Intelligence 3658, Proceedings of TSD’05 (pp. 156–163). Karlovy Vary, Czech Republic: Springer.
Sagot, B. (2010). The Lefff, a freely available, accurate and large-coverage lexicon for French. In Proceedings of the 7th language resources and evaluation conference (LREC 2010). Valletta, Malta.
Sagot, B., Clément, L., de la Clergerie, É., & Boullier, P. (2006). The Lefff 2 syntactic lexicon for French: Architecture, acquisition, use. In Proceedings of the 5th language resource and evaluation conference (LREC 2006). Lisbon, Portugal. http://atoll.inria.fr/sagot/pub/LREC06b.pdf.
Sagot, B., Walther, G., Faghiri, P., & Samvelian, P. (2011). A new morphological lexicon and a POS tagger for the Persian language. In International conference in Iranian linguistics. Uppsala, Sweden. http://hal.inria.fr/inria-00614711/en/.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing. Manchester, UK.
Smith, N., & Eisner, J. (2005). Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL’05) (pp. 354–362). Ann Arbor, MI, USA.
Spoustová, D. J., Hajič, J., Raab, J., & Spousta, M. (2009). Semi-supervised training for the averaged perceptron POS tagger. In EACL ’09: Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (pp. 763–771). Morristown, NJ, USA.
Taulé, M., Martí, M., & Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of 6th international conference on language resources and evaluation. Marrakesh, Morocco.
Toutanova, K., & Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of international conference on new methods in language processing (pp. 63–70). Hong Kong.
Walther, G., Sagot, B., & Fort, K. (2010). Fast development of basic NLP tools: Towards a lexicon and a POS tagger for Kurmanji Kurdish. In International conference on lexis and grammar. Belgrade, Serbia.

Author information

Authors and Affiliations

Alpage, INRIA Paris-Rocquencourt & Université Paris 7, Domaine de Voluceau, Rocquencourt, BP 105, 78153, Le Chesnay Cedex, France
Pascal Denis & Benoît Sagot

Corresponding author

Correspondence to Benoît Sagot.

About this article

Cite this article

Denis, P., Sagot, B. Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging. Lang Resources & Evaluation 46, 721–736 (2012). https://doi.org/10.1007/s10579-012-9193-0

Published: 04 July 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10579-012-9193-0
