Abstract
We evaluate the effectiveness of using our edit distance algorithm to improve an unsupervised, language-independent stemming method. The main idea is to create morphological families by automatically grouping words according to our distance. Stemming is then performed on the basis of that grouping. We evaluated both the capacity of the edit distance algorithm for word clustering and the ability of our method to generate the correct stem for Spanish. The method achieved 98% precision in the creation of morphological families and 99.85% correct stemming.
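The general idea of grouping words by edit distance and then deriving a stem per group can be illustrated with a minimal sketch. It is not the authors' method: it assumes the classic Levenshtein distance in place of the paper's own distance, a hypothetical threshold `max_dist`, a simple greedy grouping, and the longest common prefix of each family as its stem. The names `levenshtein`, `group_families`, and `stem_of` are illustrative only.

```python
from os.path import commonprefix


def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute; each costs 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_families(words, max_dist=2):
    """Greedy grouping (an assumption): a word joins the first family whose
    first member is within max_dist; otherwise it starts a new family."""
    families = []
    for w in words:
        for fam in families:
            if levenshtein(w, fam[0]) <= max_dist:
                fam.append(w)
                break
        else:
            families.append([w])
    return families


def stem_of(family):
    """Assumed stem rule: longest common prefix of the family's words."""
    return commonprefix(family)


words = ["cantar", "cantas", "cantan", "libro", "libros", "libreta"]
families = group_families(words)
stems = [stem_of(f) for f in families]
```

With this threshold, "libreta" ends up in its own family rather than with "libro", which shows how sensitive a plain Levenshtein grouping is to the cutoff and why a distance tailored to morphology may be preferable.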
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fernández, A., Díaz, J., Gutiérrez, Y., Muñoz, R. (2011). An Unsupervised Method to Improve Spanish Stemmer. In: Muñoz, R., Montoyo, A., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2011. Lecture Notes in Computer Science, vol 6716. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22327-3_24
DOI: https://doi.org/10.1007/978-3-642-22327-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22326-6
Online ISBN: 978-3-642-22327-3
eBook Packages: Computer Science (R0)