Complexity and Difficulty of Items in Learning Systems

Abstract

Complexity and difficulty are closely related but distinct concepts. Both are important in the development of intelligent learning systems, e.g., for item sequencing, student modeling, and content management. We show how complexity and difficulty measures can be used in developing such systems and provide guidance on how to reason and communicate about these notions. To do so, we propose a pragmatic distinction between difficulty and complexity measures. At the same time, we acknowledge the limitations of any simple distinction and discuss several potentially confounding issues: context, biases, and scaffolding. We also provide an overview of specific measures and their applications in several educational domains, together with a detailed analysis of measures for problems in introductory programming.
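
One common way to operationalize this distinction treats complexity as a property computed from the item itself and difficulty as a quantity estimated from learner performance data. The minimal sketch below illustrates this reading for an introductory programming problem; the example item, attempt log, and measure names are hypothetical, not taken from the article.

    import ast
    import statistics

    def solution_complexity(source: str) -> dict:
        """Complexity measures: computed from the item itself (here, a
        reference solution), available before any learner attempts it."""
        tree = ast.parse(source)
        return {
            "lines_of_code": len(source.strip().splitlines()),
            "ast_nodes": sum(1 for _ in ast.walk(tree)),  # rough structural size
        }

    def empirical_difficulty(attempts: list) -> dict:
        """Difficulty measures: estimated from observed performance, so they
        depend on which learners actually attempted the item."""
        failure_rate = sum(1 for a in attempts if not a["solved"]) / len(attempts)
        solve_times = [a["time_s"] for a in attempts if a["solved"]]
        return {
            "failure_rate": failure_rate,
            "median_solve_time_s": statistics.median(solve_times) if solve_times else None,
        }

    # Hypothetical item: a reference solution plus a small attempt log.
    REFERENCE_SOLUTION = """
    def count_evens(numbers):
        count = 0
        for n in numbers:
            if n % 2 == 0:
                count += 1
        return count
    """

    ATTEMPT_LOG = [
        {"solved": True, "time_s": 95},
        {"solved": True, "time_s": 180},
        {"solved": False, "time_s": 240},
    ]

    print(solution_complexity(REFERENCE_SOLUTION))  # item-intrinsic
    print(empirical_difficulty(ATTEMPT_LOG))        # population-dependent

The practical consequence of the split is visible here: the complexity numbers exist before any student sees the item, whereas the difficulty numbers change with the population of learners and only stabilize as attempt data accumulate.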

Author information

Correspondence to Radek Pelánek.

About this article

Cite this article

Pelánek, R., Effenberger, T. & Čechák, J. Complexity and Difficulty of Items in Learning Systems. Int J Artif Intell Educ 32, 196–232 (2022). https://doi.org/10.1007/s40593-021-00252-4
