Visualizing Linguistic Complexity and Proficiency in Learner English Writings

Authors

  • Thomas Gaillat, University of Rennes
  • Antoine Lafontaine, French National Research Institute for Health, Environment and Work
  • Anas Knefati, Orange Business France

DOI:

https://doi.org/10.1558/cj.19487

Keywords:

linguistic complexity, L2 English, automatic essay feedback, visualization

Abstract

In this article, we focus on the design of a second language (L2) formative feedback system that provides linguistic complexity graph reports on the writings of university-level English for specific purposes students. The system is evaluated in light of formative instruction features identified in the literature, and the significance of the complexity metrics is also assessed. A learner corpus of English classified according to the Common European Framework of Reference for Languages (CEFR) was processed with a pipeline that computes 83 complexity metrics. Using analysis of variance (ANOVA) tests, multinomial logistic regression, and clustering methods, we identified and validated a set of nine metrics that are significant with respect to proficiency levels. Validation via classification yielded balanced accuracies of 67.51% (A level), 60.16% (B level), and 60.47% (C level), and clustering showed between 53.10% and 67.37% homogeneity, depending on the level. These metrics were then used to create graphical reports about the linguistic complexity of learner writing. The reports are designed to help language teachers diagnose their students’ writings in comparison with prerecorded cohorts of different proficiencies.
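The analysis pipeline described above lends itself to a compact illustration. The Python sketch below is not the authors’ implementation: the file name, column names, and three-way A/B/C grouping are assumptions made for the example. It shows the general shape of the approach only, screening per-essay complexity metrics with one-way ANOVA, estimating a multinomial logistic regression with cross-validated balanced accuracy, and checking how homogeneous k-means clusters are with respect to CEFR levels.

    # Minimal sketch of the analysis pipeline (assumed data layout: one row per
    # essay, a "cefr_level" column with A/B/C macro-levels, metric columns).
    import pandas as pd
    from scipy.stats import f_oneway
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import homogeneity_score

    df = pd.read_csv("learner_complexity_metrics.csv")  # hypothetical file
    levels = df["cefr_level"]                           # "A", "B" or "C"
    metrics = [c for c in df.columns if c != "cefr_level"]

    # 1. ANOVA screening: keep metrics whose means differ across levels.
    significant = []
    for m in metrics:
        groups = [df.loc[levels == lvl, m] for lvl in ("A", "B", "C")]
        if f_oneway(*groups).pvalue < 0.05:
            significant.append(m)

    X = StandardScaler().fit_transform(df[significant])

    # 2. Multinomial logistic regression: how well do the retained metrics
    # predict proficiency level? (multinomial by default with the lbfgs solver)
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X, levels, cv=5, scoring="balanced_accuracy")
    print("Mean balanced accuracy:", acc.mean())

    # 3. K-means clustering: do unsupervised clusters align with CEFR levels?
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print("Cluster homogeneity w.r.t. levels:", homogeneity_score(levels, clusters))

Note that the study reports balanced accuracy per level rather than a single averaged score; the sketch collapses this into one cross-validated figure for brevity.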

Author Biographies

  • Thomas Gaillat, University of Rennes

    Thomas Gaillat is Associate Professor of corpus linguistics at the University of Rennes in France, where he also teaches English for specific purposes. He is a member of the LIDILE research team. He received his doctorate in 2016 from the University of Sorbonne Paris Cité, graduating summa cum laude. His thesis focused on corpus interoperability as a method to explore how this, that, and it, as referential forms, are used by learners of English. His publications address linguistic questions at the intersection of natural language processing, corpus linguistics, and statistics. His main research focus is language acquisition. He is the principal investigator of a project on analytics for language learning (A4LL), funded by the French National Research Agency (Agence nationale de la recherche, ANR).

  • Antoine Lafontaine, French National Research Institute for Health, Environment and Work

    Antoine Lafontaine is a statistical engineer who worked on educational data for the Data Tank at the National School for Statistics and Data Analysis (École nationale de la statistique et de l’analyse de l’information, ENSAI) from 2019 to 2021. The Data Tank specializes in learning analytics and pedagogical innovations. Since 2021, Antoine has been working with the ELIXIR team at the French National Research Institute for Health, Environment and Work (Institut de recherche en santé, environnement et travail, IRSET; Inserm UMR_S 1085), more specifically on CONSTANCES, a large general population-based epidemiological cohort in France.

  • Anas Knefati, Orange Business France

    Anas Knefati holds a PhD in machine learning and worked as a research engineer specializing in machine learning for the Data Tank at the National School for Statistics and Data Analysis (ENSAI) from 2019 to 2021, where he contributed to the Data Tank’s focus on learning analytics and pedagogical innovations. He currently works at Orange Business, where he is dedicated to industrializing artificial intelligence solutions.

References

Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning and Assessment, 4(3), 3–29. https://ejournals.bc.edu/index.php/jtla/article/view/1650

Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9780511801686

Ballier, N., Canu, S., Petitjean, C., Gasso, G., Balhana, C., …, & Gaillat, T. (2020). Machine learning for learner English. International Journal of Learner Corpus Research, 6(1), 72–103. https://doi.org/10.1075/ijlcr.18012.bal

Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., …, & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774. https://doi.org/10.21105/joss.00774

Biber, D., Gray, B., Staples, S., & Egbert, J. (2020). Investigating grammatical complexity in L2 English writing research: Linguistic description versus predictive measurement. Journal of English for Academic Purposes, 46, 100869. https://doi.org/10.1016/j.jeap.2020.100869

Breiman, L., & Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case. International Statistical Review/Revue Internationale de Statistique, 60(3), 291–319. https://doi.org/10.2307/1403680

Bulté, B., & Housen, A. (2012). Defining and operationalising L2 complexity. Amsterdam: John Benjamins Publishing Company. https://doi.org/10.1075/lllt.32.02bul

Council of Europe (2018). Common European Framework of Reference for Languages: Learning, teaching, assessment—Companion volume. Strasbourg: Council of Europe. http://www.coe.int/t/dg4/linguistic/Source/Framework_FR.pdf

Crossley, S. A., Kyle, K., & Dascalu, M. (2019). The Tool for the Automatic Analysis of Cohesion 2.0: Integrating semantic similarity and text overlap. Behavior Research Methods, 51(1), 14–27. https://doi.org/10.3758/s13428-018-1142-4

Dascalu, M., Dessus, P., Trausan-Matu, S., Bianco, M., & Nardy, A. (2013). ReaderBench, an environment for analyzing text complexity and reading strategies. In H. C. Lane, K. Yacef, J. Mostow, & P. Pavlik (Eds.), Artificial intelligence in education: AIED 2013 (pp. 379–388). Lecture Notes in Computer Science, vol. 7926. Berlin & Heidelberg: Springer. https://doi.org/10.1007/978-3-642-39112-5_39

Dougiamas, M., & Taylor, P. (2003). Moodle: Using learning communities to create an open source course management system. Proceedings of the EDMEDIA 2003 Conference, Honolulu, Hawaii, 171–178. https://www.learntechlib.org/primary/p/13739/

Eisinga, R., Grotenhuis, M. te, & Pelzer, B. (2013). The reliability of a two-item scale: Pearson, Cronbach, or Spearman-Brown? International Journal of Public Health, 58(4), 637–642. https://doi.org/10.1007/s00038-012-0416-3

Ellis, R., Loewen, S., & Erlam, R. (2006). Implicit and explicit corrective feedback and the acquisition of L2 grammar. Studies in Second Language Acquisition, 28(2), 339–368. https://doi.org/10.1017/S0272263106060141

Gaillat, T. (in press). Investigating the scope of textual metrics for learner level discrimination and learner analytics. In A. Lenko-Szymanska & S. Götz (Eds.), Complexity, accuracy and fluency in learner corpus research. John Benjamins.

Gaillat, T., Janvier, P., Dumont, B., Lafontaine, A., Knefati, A., …, & Hamon, C. (2019, December). CELVA.Sp: A corpus for the visualisation of linguistic profiles in language learners. PERL 2019, Paris. https://hal.univ-rennes2.fr/hal-02496713

Gaillat, T., Simpkin, A., Ballier, N., Stearns, B., Sousa, A., …, & Zarrouk, M. (2021). Predicting CEFR levels in learners of English: The use of microsystem criterial features in a machine learning approach. ReCALL, 34(2). https://doi.org/10.1017/S095834402100029X

Granger, S. (2015). The contribution of learner corpora to reference and instructional materials design. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 485–510). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.022

Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1), 100–108. https://doi.org/10.2307/2346830

Hawkins, J. A., & Buttery, P. (2010). Criterial features in learner corpora: Theory and illustrations. English Profile Journal, 1(1). https://doi.org/10.1017/S2041536210000103

Housen, A., Kuiken, F., & Vedder, I. (Eds.). (2012). Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA. Amsterdam: John Benjamins Publishing Company. https://doi.org/10.1075/lllt.32

Kyle, K. (2016). Measuring syntactic development in L2 writing: Fine grained indices of syntactic complexity and usage-based indices of syntactic sophistication. Dissertation, Georgia State University. https://scholarworks.gsu.edu/alesl_diss/35

Kyle, K., Crossley, S., & Berger, C. (2018). The Tool for the Automatic Analysis of Lexical Sophistication (TAALES): Version 2.0. Behavior Research Methods, 50(3), 1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Lai, C., & Li, G. (2011). Technology and task-based language teaching: A critical review. CALICO Journal, 28(2), 498–521. https://doi.org/10.11139/cj.28.2.498-521

Leacock, C., Chodorow, M., & Tetreault, J. (2015). Automatic grammar- and spell-checking for language learners. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 567–586). Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139649414.025

Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins Publishing Company. http://www.jbe-platform.com/content/books/9789027268457; https://doi.org/10.1075/z.195

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496. https://doi.org/10.1075/ijcl.15.4.02lu

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. Modern Language Journal, 96(2), 190–208. https://doi.org/10.1111/j.1540-4781.2011.01232_1.x

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In K. Bontcheva & J. Zhu (Eds.), Proceedings of the 52nd annual meeting of the Association for Computational Linguistics: System demonstrations (pp. 55–60). Association for Computational Linguistics. https://doi.org/10.3115/v1/P14-5010

McNamara, D. S., Boonthum, C., Levinstein, I., & Millis, K. (2007). Evaluating self-explanations in iSTART: Comparing word-based and LSA algorithms. In T. K. Landauer (Ed.), Handbook of latent semantic analysis (pp. 227–241). Mahwah: Lawrence Erlbaum Associates Publishers.

McNamara, D. S., Louwerse, M. M., McCarthy, P. M., & Graesser, A. C. (2010). Coh-Metrix: Capturing linguistic features of cohesion. Discourse Processes, 47(4), 292–330. https://doi.org/10.1080/01638530902959943

Meurers, W. D. (2009). On the automatic analysis of learner language: Introduction to the special issue. CALICO Journal, 26(3), 469–473. https://doi.org/10.1558/cj.v26i3.469-473

Pilán, I., & Volodina, E. (2018). Investigating the importance of linguistic complexity features across different datasets related to language learning. In L. Becerra-Bonache, M. D. Jiménez-López, C. Martín-Vide, & A. Torrens-Urrutia (Eds.), Proceedings of the Workshop on linguistic complexity and natural language processing (pp. 49–58). Association for Computational Linguistics. http://aclweb.org/anthology/W18-4606

Pilán, I., Volodina, E., & Zesch, T. (2016). Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks. In Y. Matsumoto & R. Prasad (Eds.), Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical papers (pp. 2101–2111). COLING 16 Organizing Committee. https://aclanthology.org/C16-1198

Roscoe, R. D., Allen, L. K., Weston, J. L., Crossley, S. A., & McNamara, D. S. (2014). The Writing Pal intelligent tutoring system: Usability testing and development. Computers and Composition, 34, 39–59. https://doi.org/10.1016/j.compcom.2014.09.002

Rudzewitz, B., Ziai, R., Nuxoll, F., Kuthy, K. D., & Meurers, W. D. (2019). Enhancing a web-based language tutoring system with learning analytics. In L. Paquette & C. Romero (Eds.), Joint proceedings of the Workshops of the 12th International Conference on Educational Data Mining co-located with the 12th International Conference on Educational Data Mining, EDM 2019 Workshops (pp. 1–7). CEUR Workshop Proceedings vol. 2592. Aachen: CEUR-WS.

Shute, V. J. (2008). Focus on formative feedback. Review of Educational Research, 78(1), 153–189. https://doi.org/10.3102/0034654307313795

Tack, A., François, T., Roekhaut, S., & Fairon, C. (2017). Human and automated CEFR-based grading of short answers. In J. Tetreault, J. Burstein, C. Leacock, & H. Yannakoudakis (Eds.), Proceedings of the 12th Workshop on innovative use of NLP for building educational applications (pp. 169–179). Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-5018

Vajjala, S. (2018). Automated assessment of non-native learner essays: Investigating the role of linguistic features. International Journal of Artificial Intelligence in Education, 28, 79–105.

Vajjala, S., & Loo, K. (2014). Automatic CEFR level prediction for Estonian learner text. NEALT Proceedings Series, 22, 113–128.

Vajjala, S., & Meurers, D. (2012). On improving the accuracy of readability classification using insights from second language acquisition. In J. Tetreault, J. Burstein, & C. Leacock (Eds.), Proceedings of the 7th Workshop on building educational applications using NLP (pp. 163–173). Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390384.2390404

Wolfe-Quintero, K., Inagaki, S., & Kim, H.-Y. (1998). Second language development in writing: Measures of fluency, accuracy, and complexity. Second Language Teaching & Curriculum Center, University of Hawaii at Manoa.

Yannakoudakis, H., Briscoe, T., & Medlock, B. (2011). A new dataset and method for automatically grading ESOL texts. In D. Lin, Y. Matsumoto, & R. Mihalcea (Eds.), Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies (pp. 180–189). Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2002472.2002496

Yannakoudakis, H., Andersen, Ø. E., Geranpayeh, A., Briscoe, T., & Nicholls, D. (2018). Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31(3), 251–267. https://doi.org/10.1080/08957347.2018.1464447

Published

2023-05-25

Issue

Vol. 40 No. 2 (2023)

Section

Articles

How to Cite

Gaillat, T., Lafontaine, A., & Knefati, A. (2023). Visualizing Linguistic Complexity and Proficiency in Learner English Writings. CALICO Journal, 40(2), 178–197. https://doi.org/10.1558/cj.19487
