Complexity and Difficulty of Items in Learning Systems

Abstract

Complexity and difficulty are closely related but distinct concepts. Both are important in the development of intelligent learning systems, e.g., for item sequencing, student modeling, and content management. We show how complexity and difficulty measures can be used in developing such systems and provide guidance on how to reason and communicate about these notions. To do so, we propose a pragmatic distinction between difficulty and complexity measures. At the same time, we acknowledge the limitations of any simple distinction and discuss several potentially confounding issues: context, biases, and scaffolding. We also provide an overview of specific measures and their applications in several educational domains, together with a detailed analysis of measures for problems in introductory programming.
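
One common way to operationalize this distinction treats complexity as a property computed from the item itself and difficulty as a quantity estimated from learner performance data. The minimal sketch below illustrates this reading for an introductory programming problem; the example item, attempt log, and measure names are hypothetical, not taken from the article.

    import ast
    import statistics

    def solution_complexity(source: str) -> dict:
        """Complexity measures: computed from the item itself (here, a
        reference solution), available before any learner attempts it."""
        tree = ast.parse(source)
        return {
            "lines_of_code": len(source.strip().splitlines()),
            "ast_nodes": sum(1 for _ in ast.walk(tree)),  # rough structural size
        }

    def empirical_difficulty(attempts: list) -> dict:
        """Difficulty measures: estimated from observed performance, so they
        depend on which learners actually attempted the item."""
        failure_rate = sum(1 for a in attempts if not a["solved"]) / len(attempts)
        solve_times = [a["time_s"] for a in attempts if a["solved"]]
        return {
            "failure_rate": failure_rate,
            "median_solve_time_s": statistics.median(solve_times) if solve_times else None,
        }

    # Hypothetical item: a reference solution plus a small attempt log.
    REFERENCE_SOLUTION = """
    def count_evens(numbers):
        count = 0
        for n in numbers:
            if n % 2 == 0:
                count += 1
        return count
    """

    ATTEMPT_LOG = [
        {"solved": True, "time_s": 95},
        {"solved": True, "time_s": 180},
        {"solved": False, "time_s": 240},
    ]

    print(solution_complexity(REFERENCE_SOLUTION))  # item-intrinsic
    print(empirical_difficulty(ATTEMPT_LOG))        # population-dependent

The practical consequence of the split is visible here: the complexity numbers exist before any student sees the item, whereas the difficulty numbers change with the population of learners and only stabilize as attempt data accumulate.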

Author information

Correspondence to Radek Pelánek.

About this article

Cite this article

Pelánek, R., Effenberger, T. & Čechák, J. Complexity and Difficulty of Items in Learning Systems. Int J Artif Intell Educ 32, 196–232 (2022). https://doi.org/10.1007/s40593-021-00252-4
