A hybrid approach for Arabic lemmatization

Boudchiche, Mohamed; Mazroui, Azzeddine

doi:10.1007/s10772-018-9528-3

Published: 18 July 2018

Volume 22, pages 563–573 (2019)

International Journal of Speech Technology

Abstract

We present an Arabic lemmatizer that assigns a single lemma to each word of an Arabic sentence, taking the word's context into account. The proposed system comprises two modules. The first performs an out-of-context analysis based on the morphosyntactic analyser Alkhalil Morpho Sys 2. The second uses the context to select the correct lemma from the candidate lemmas that the first module returns for the word. For this purpose, we use a statistical technique based on hidden Markov models, in which the words of the sentence are the observations and the lemmas are the hidden states. We validate this approach on a labelled corpus of about 500,000 words. The lemmatizer gives the correct lemma for more than 99.24% of the words in the training set and about 94.45% of the words in the test set.
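The second module can be illustrated with a minimal sketch. The abstract only states that words are the observations and lemmas are the hidden states; everything else here is an assumption made for illustration: a first-order (bigram) model, additive smoothing with arbitrary constants, Viterbi decoding, and a tiny hypothetical corpus in rough transliteration. The `candidates` argument stands in for the output of the out-of-context analysis and does not call Alkhalil Morpho Sys 2.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start state

def train(tagged_sentences):
    """Count lemma bigrams and (lemma, word) pairs in a labelled corpus."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_lemma][lemma]
    emit = defaultdict(lambda: defaultdict(int))   # emit[lemma][word]
    for sentence in tagged_sentences:
        prev = START
        for word, lemma in sentence:
            trans[prev][lemma] += 1
            emit[lemma][word] += 1
            prev = lemma
    return trans, emit

def log_prob(table, given, event, alpha=0.1, size=1000):
    """Additively smoothed log P(event | given); alpha and size are assumed."""
    row = table.get(given, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + alpha) / (total + alpha * size))

def viterbi(words, candidates, trans, emit):
    """Pick one lemma per word, restricted to candidates[i] for words[i]."""
    if not words:
        return []
    # best[i][lemma] = (log score of best path ending in lemma, previous lemma)
    best = [{}]
    for lemma in candidates[0]:
        score = log_prob(trans, START, lemma) + log_prob(emit, lemma, words[0])
        best[0][lemma] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for lemma in candidates[i]:
            prev, score = max(
                ((p, s + log_prob(trans, p, lemma))
                 for p, (s, _) in best[i - 1].items()),
                key=lambda pair: pair[1],
            )
            best[i][lemma] = (score + log_prob(emit, lemma, words[i]), prev)
    # Backtrack from the best final lemma.
    lemma = max(best[-1], key=lambda l: best[-1][l][0])
    path = [lemma]
    for i in range(len(words) - 1, 0, -1):
        lemma = best[i][lemma][1]
        path.append(lemma)
    return path[::-1]

# Hypothetical toy corpus: the undiacritized word "ktb" is ambiguous
# between the lemmas "kataba" (verb) and "kitAb" (noun).
corpus = [
    [("qrA", "qaraOa"), ("ktb", "kitAb")],
    [("qrA", "qaraOa"), ("ktb", "kitAb")],
    [("hw", "huwa"), ("ktb", "kataba")],
]
trans, emit = train(corpus)
print(viterbi(["hw", "ktb"], [["huwa"], ["kataba", "kitAb"]], trans, emit))
print(viterbi(["qrA", "ktb"], [["qaraOa"], ["kataba", "kitAb"]], trans, emit))
```

Restricting each position to the analyser's candidate lemmas keeps the search space small: the decoder only compares lemmas that are morphologically possible for the word, and the context decides among them.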

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Faculty of Sciences, Mohammed First University, B-P 717, 60000, Oujda, Morocco
Mohamed Boudchiche & Azzeddine Mazroui

Corresponding author

Correspondence to Mohamed Boudchiche.

About this article

Cite this article

Boudchiche, M., Mazroui, A. A hybrid approach for Arabic lemmatization. Int J Speech Technol 22, 563–573 (2019). https://doi.org/10.1007/s10772-018-9528-3

Received: 19 November 2017
Accepted: 26 June 2018
Published: 18 July 2018
Issue Date: September 2019
DOI: https://doi.org/10.1007/s10772-018-9528-3
