A hybrid approach for Arabic lemmatization

International Journal of Speech Technology

Abstract

We present an Arabic lemmatizer that assigns a single lemma to each word of an Arabic sentence, taking the word's context into account. The proposed system comprises two modules. The first performs an out-of-context analysis based on the morphosyntactic analyser Alkhalil Morpho Sys 2. The second uses the context to select the correct lemma from among the potential lemmas that the first module returns for the word. For this purpose, we use a statistical technique based on hidden Markov models, in which the words of the sentence are the observations and the lemmas are the hidden states. We validate this approach on a labelled corpus of about 500,000 words. The lemmatizer gives the correct lemma for more than 99.24% of the words in the training set and about 94.45% of the words in the test set.
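The second module described above can be illustrated with a minimal Viterbi sketch. This is not the authors' implementation: the function name `viterbi_lemmas`, the placeholder tokens (`w1`, `A`, `B1`, ...), and all probabilities are invented for illustration. The `candidates` list stands in for the output of the out-of-context analysis (the role played by Alkhalil Morpho Sys 2 in the article), and the constant `floor` for unseen events is an assumed stand-in for whatever smoothing the full article uses.

```python
import math


def viterbi_lemmas(words, candidates, start_p, trans_p, emit_p, floor=1e-6):
    """Choose one lemma per word by Viterbi decoding.

    words      -- observed words of the sentence
    candidates -- candidates[i] lists the possible lemmas of words[i]
                  (hypothetical output of a morphological analyser)
    start_p    -- {lemma: P(lemma starts the sentence)}
    trans_p    -- {(prev_lemma, lemma): P(lemma | prev_lemma)}
    emit_p     -- {(lemma, word): P(word | lemma)}
    floor      -- assumed probability for events missing from the tables
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][lemma] = (log-score of the best path ending in lemma, previous lemma)
    best = [{
        lem: (logp(start_p, lem) + logp(emit_p, (lem, words[0])), None)
        for lem in candidates[0]
    }]
    for i in range(1, len(words)):
        layer = {}
        for lem in candidates[i]:
            score, prev = max(
                (best[i - 1][p][0] + logp(trans_p, (p, lem)), p)
                for p in best[i - 1]
            )
            layer[lem] = (score + logp(emit_p, (lem, words[i])), prev)
        best.append(layer)

    # Backtrack from the best final lemma.
    lemma = max(best[-1], key=lambda lem: best[-1][lem][0])
    path = [lemma]
    for i in range(len(words) - 1, 0, -1):
        lemma = best[i][lemma][1]
        path.append(lemma)
    return path[::-1]


# Toy data: the second word is ambiguous between lemmas B1 and B2.
words = ["w1", "w2", "w3"]
candidates = [["A"], ["B1", "B2"], ["C"]]
start_p = {"A": 1.0}
trans_p = {("A", "B1"): 0.2, ("A", "B2"): 0.8,
           ("B1", "C"): 0.9, ("B2", "C"): 0.1}
emit_p = {("A", "w1"): 1.0, ("B1", "w2"): 0.5,
          ("B2", "w2"): 0.5, ("C", "w3"): 1.0}

print(viterbi_lemmas(words, candidates, start_p, trans_p, emit_p))
```

In this toy sentence the left context alone favours `B2` (0.8 against 0.2), but the following lemma `C` makes the path through `B1` more probable overall (0.09 against 0.04), so the decoder returns `['A', 'B1', 'C']`. Restricting each position to the analyser's candidates keeps the search small, since only lemmas compatible with the word's form are ever scored.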

Author information

Corresponding author

Correspondence to Mohamed Boudchiche.

Cite this article

Boudchiche, M., Mazroui, A. A hybrid approach for Arabic lemmatization. Int J Speech Technol 22, 563–573 (2019). https://doi.org/10.1007/s10772-018-9528-3

  • DOI: https://doi.org/10.1007/s10772-018-9528-3
