ABSTRACT
Student models are typically evaluated by predicting the correctness of the next answer. This approach is insufficient in the problem-solving context, especially for student models that use performance data beyond binary correctness. We propose more comprehensive methods for validating student models and illustrate them in the context of introductory programming. We demonstrate the insufficiency of the next-answer correctness prediction task: it can neither reveal the low validity of models that use only binary correctness, nor show the increased validity of models that use richer performance data. The key message is that the prevalent use of next-answer correctness for validating student models, and of binary correctness as the models' only input, is not always warranted and limits progress in learning analytics.
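To make the critiqued evaluation protocol concrete, the following is a minimal sketch of next-answer correctness prediction scored with AUC. The synthetic data, the running-average predictor, and all function names are illustrative assumptions, not the paper's method:

```python
# Sketch of the prevalent evaluation protocol discussed in the abstract:
# predict each student's next answer correctness from their answer history,
# then score the predictions with AUC. Data and predictor are illustrative.
import random

random.seed(0)

def simulate_student(skill, n_items=20):
    """Binary answer sequence; success probability equals the latent skill."""
    return [1 if random.random() < skill else 0 for _ in range(n_items)]

def running_mean_predictions(answers, prior=0.5):
    """Predict the next answer as the smoothed mean of past correctness."""
    preds, correct, seen = [], 2 * prior, 2.0  # Laplace-style smoothing
    for a in answers:
        preds.append(correct / seen)
        correct += a
        seen += 1
    return preds

def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels, scores = [], []
for skill in [0.3, 0.5, 0.7, 0.9]:
    answers = simulate_student(skill)
    labels.extend(answers)
    scores.extend(running_mean_predictions(answers))

print(f"next-answer AUC: {auc(labels, scores):.2f}")
```

Note that a predictor can score well on this metric while remaining a poor student model in other respects, which is exactly the gap between predictive accuracy and validity that the paper examines.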