A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements

Journal of Science Education and Technology

Abstract

Machine learning (ML) has been increasingly employed in science assessment to facilitate automatic scoring efforts, although with varying degrees of success (i.e., magnitudes of machine-human score agreements [MHAs]). Little work has empirically examined the factors that impact MHA disparities in this growing field, thus constraining the improvement of machine scoring capacity and its wide applications in science education. We performed a meta-analysis of 110 studies of MHAs in order to identify the factors most strongly contributing to scoring success (i.e., high Cohen's kappa [κ]). We empirically examined six factors proposed as contributors to MHA magnitudes: algorithm, subject domain, assessment format, construct, school level, and machine supervision type. Our analyses of 110 MHAs revealed substantial heterogeneity in \(\kappa\) (mean = .64; range = .09–.97, taking weights into consideration). Using three-level random-effects modeling, MHA score heterogeneity was explained by the variability both within publications (i.e., the assessment task level: 82.6%) and between publications (i.e., the individual study level: 16.7%). Our results also suggest that all six factors have significant moderator effects on scoring success magnitudes. Among these, algorithm and subject domain had significantly larger effects than the other factors, suggesting that technical features and assessment external features might be primary targets for improving MHAs and ML-based science assessments.
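
The abstract summarizes the two quantitative steps of the meta-analysis: computing Cohen's kappa for each machine-human score agreement and pooling those kappas with a three-level random-effects model (the metafor package cited below, Viechtbauer 2010, supports such models). The sketch below is a minimal, hypothetical illustration of that workflow, not the authors' analysis script; the confusion-matrix counts, the file mha_kappas.csv, and the column names (kappa, kappa_var, study_id, task_id, algorithm) are all illustrative assumptions.

```r
# --- Step 1 (illustrative): Cohen's kappa for a single assessment task -------
# Hypothetical 2x2 machine-vs-human confusion matrix of scored responses.
tab <- matrix(c(40, 5, 7, 48), nrow = 2)
p_obs    <- sum(diag(tab)) / sum(tab)                      # observed agreement
p_chance <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
k_hat    <- (p_obs - p_chance) / (1 - p_chance)            # Cohen's kappa

# --- Step 2 (illustrative): three-level random-effects meta-analysis ---------
# One row per MHA estimate; tasks (level 2) nested within publications (level 3).
library(metafor)
dat <- read.csv("mha_kappas.csv")  # assumed columns: kappa, kappa_var, study_id, task_id, algorithm

res <- rma.mv(yi = kappa, V = kappa_var,
              random = ~ 1 | study_id/task_id,
              data = dat, method = "REML")
summary(res)
res$sigma2  # between-publication and within-publication (task-level) variance components

# Moderator analysis, e.g., does the scoring algorithm explain heterogeneity in kappa?
res_mod <- rma.mv(yi = kappa, V = kappa_var, mods = ~ algorithm,
                  random = ~ 1 | study_id/task_id,
                  data = dat, method = "REML")
summary(res_mod)
```

The proportions of heterogeneity reported in the abstract (82.6% at the task level, 16.7% at the publication level) correspond to the relative sizes of these variance components.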

References

  • Altman, D. G. (1991). Mathematics for kappa. Practical statistics for medical research, 1991, 406–407.

  • Anderson, C. W., de los Santos, E. X., Bodbyl, S., Covitt, B. A., Edwards, K. D., & Hancock, J. B. (2018). Designing educational systems to support enactment of the Next Generation Science Standards. Journal of Research in Science Teaching, 55(7), 1026–1052.

  • Bartolucci, A. A., & Hillegass, W. B. (2010). Overview, strengths, and limitations of systematic reviews and meta-analyses. In F. Chiappelli (Ed.), Evidence-based practice: Toward optimizing clinical outcomes (pp. 17–33). Berlin Heidelberg: Springer.

  • *Beggrow, E. P., Ha, M., Nehm, R. H., Pearl, D., & Boone, W. J. (2014). Assessing scientific practices using machine-learning methods: How closely do they match clinical interview performance? Journal of Science Education and Technology, 23(1), 160–182.

  • Bridgeman, B., Trapani, C., & Attali, Y. (2012). Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and country. Applied Measurement in Education, 25(1), 27–40.

  • Borenstein, M., Hedges, L. V., Higgins, J. P., & Rothstein, H. R. (2011). Introduction to meta-analysis. John Wiley & Sons.

  • Castelvecchi, D. (2016). Can we open the black box of AI? Nature, 538(7623), 20–23.

  • *Chanijani, S. S. M., Klein, P., Al-Naser, M., Bukhari, S. S., Kuhn, J., & Dengel, A. (2016). A study on representational competence in physics using mobile eye tracking systems. Paper presented at the International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct.

  • *Chen, C.-K. (2010). Curriculum Assessment Using Artificial Neural Network and Support Vector Machine Modeling Approaches: A Case Study. IR Applications. Volume 29. Association for Institutional Research (NJ1).

  • *Chen, C.-M., Wang, J.-Y., & Yu, C.-M. (2017). Assessing the attention levels of students by using a novel attention aware system based on brainwave signals. British Journal of Educational Technology, 48(2), 348–369.

  • Chen, J., Zhang, Y., Wei, Y., & Hu, J. (2019). Discrimination of the contextual features of top performers in scientific literacy using a machine learning approach. Research in Science Education, 1–30.

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  • Cohen, J. (1968). Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213–220.

  • Cohen, J. (2013). Statistical power analysis for the behavioral sciences. New York: Lawrence Erlbaum Associates.

  • Cooper, H., Valentine, J. C., Charlton, K., & Melson, A. (2003). The effects of modified school calendars on student achievement and on school and community attitudes. Review of Educational Research, 73(1), 1–52.

  • Donnelly, D. F., Vitale, J. M., & Linn, M. C. (2015). Automated guidance for thermodynamics essays: Critiquing versus revisiting. Journal of Science Education and Technology, 24(6), 861–874.

  • Dusseldorp, E., Li, X., & Meulman, J. (2016). Which combinations of behaviour change techniques are effective? Assessing interaction effects in meta-analysis. European Health Psychologist, 18, 563.

  • Duval, S., & Tweedie, R. (2000). A nonparametric “trim and fill” method of accounting for publication bias in meta-analysis. Journal of the American Statistical Association, 95(449), 89–98.

  • *Elluri, S. (2017). A machine learning approach for identifying the effectiveness of simulation tools for conceptual understanding (Unpublished master’s thesis, 10686333). Purdue University, West Lafayette, Indiana.

  • Everitt, B. (1968). Moments of the statistics kappa and weighted kappa. British Journal of Mathematical and Statistical Psychology, 21(1), 97–103.

  • Fleiss, J., Levin, B., & Paik, M. (2013). Statistical methods for rates and proportions. John Wiley & Sons.

  • Gane, B., Zaidi, S., Zhai, X., & Pellegrino, J. (2020). Using machine learning to score tasks that assess three-dimensional science learning. Paper presented at the 2020 annual conference of the American Educational Research Association, California.

  • Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129.

  • Gerard, L., Matuk, C., McElhaney, K., & Linn, M. C. (2015). Automated, adaptive guidance for K-12 education. Educational Research Review, 15, 41–58.

  • Gerard, L. F., Ryoo, K., McElhaney, K. W., Liu, O. L., Rafferty, A. N., & Linn, M. C. (2016). Automated guidance for student inquiry. Journal of Educational Psychology, 108(1), 60–81.

  • *Ghali, R., Frasson, C., & Ouellet, S. (2016, June). Using Electroencephalogram to Track Learner’s Reasoning in Serious Games. In International Conference on Intelligent Tutoring Systems (pp. 382–388). Springer, Cham.

  • *Ghali, R., Ouellet, S., & Frasson, C. (2016). LewiSpace: An exploratory study with a machine learning model in an educational game. Journal of Education and Training Studies, 4(1), 192–201.

  • *Gobert, J. D., Baker, R., & Pedro, M. S. (2011). Using machine-learned detectors to assess and predict students' inquiry performance. Retrieved on November 2018 from https://proxy.cc.uic.edu/login?url=https://search.proquest.com/docview/964185951?accountid=14552.

  • *Gobert, J. D., Baker, R. S., & Wixon, M. B. (2015). Operationalizing and detecting disengagement within online science microworlds. Educational Psychologist, 50(1), 43–57.

  • *Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students’ science inquiry skills using educational data mining. Journal of the Learning Sciences, 22(4), 521–563.

  • Goubeaud, K. (2010). How is science learning assessed at the postsecondary level? Assessment and grading practices in college biology, chemistry and physics. Journal of Science Education and Technology, 19(3), 237–245.

  • Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.

  • *Ha, M. (2013). Assessing scientific practices using machine learning methods: Development of automated computer scoring models for written evolutionary explanations (Doctoral dissertation, The Ohio State University).

  • *Ha, M., & Nehm, R. H. (2016a). The impact of misspelled words on automated computer scoring: A case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358–374.

  • *Ha, M., & Nehm, R. H. (2016b). Predicting the accuracy of computer scoring of text: Probabilistic, multi-model, and semantic similarity approaches. Paper presented at the annual conference of the National Association for Research in Science Teaching, Baltimore, MD, April 14–17.

  • *Ha, M., Nehm, R. H., Urban-Lurain, M., & Merrill, J. E. (2011). Applying computerized-scoring models of written biological explanations across courses and colleges: Prospects and limitations. CBE Life Sciences Education, 10(4), 379–393.

  • Huang, C.-J., Wang, Y.-W., Huang, T.-H., Chen, Y.-C., Chen, H.-M., & Chang, S.-C. (2011). Performance evaluation of an online argumentation learning assistance agent. Computers & Education, 57(1), 1270–1280.

  • Hunt, R. J. (1986). Percent agreement, Pearson’s correlation, and kappa as measures of inter-examiner reliability. Journal of Dental Research, 65(2), 128–130.

  • Hutter, F., Kotthoff, L., & Vanschoren, J. (2019). Automated machine learning. Springer.

  • Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260.

  • Jovic, A., Brkic, K., & Bogunovic, N. (2014, May). An overview of free software tools for general data mining. In 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) (pp. 1112–1117). IEEE.

  • *Kim, K. J., Pope, D. S., Wendel, D., & Meir, E. (2017). WordBytes: Exploring an intermediate constraint format for rapid classification of student answers on constructed response assessments. Journal of Educational Data Mining, 9(2), 45–71.

  • *Klebanov, B., Burstein, J., Harackiewicz, J. M., Priniski, S. J., & Mulholland, M. (2017). Reflective writing about the utility value of science as a tool for increasing stem motivation and retention – Can AI help scale up? International Journal of Artificial Intelligence in Education, 27(4), 791–818.

  • Konstantopoulos, S. (2011). Fixed effects and variance components estimation in three-level meta-analysis. Research Synthesis Methods, 2(1), 61–76.

  • Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163.

  • Krippendorff, K. (1980). Validity in content analysis. In E. Mochmann (Ed.), Computerstrategien für die Kommunikationsanalyse (pp. 69–112). Frankfurt, Germany: Campus. Retrieved from https://repository.upenn.edu/asc_papers/291.

  • *Kyrilov, A. (2014). Using case-based reasoning to improve the quality of feedback generated by automated grading systems. Paper presented at the Proceedings of the tenth annual Conference on International Computing Education Research, Glasgow, Scotland, United Kingdom. https://dl.acm.org/citation.cfm?doid=2632320.2632330

  • Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  • Leacock, C., Messineo, D., & Zhang, X. (2013). Issues in prompt selection for automated scoring of short answer questions. In annual conference of the National Council on Measurement in Education, San Francisco, CA.

  • Lee, H.-S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622.

  • *Lintean, M., Rus, V., & Azevedo, R. (2012). Automatic detection of student mental models based on natural language student input during metacognitive skill training. International Journal of Artificial Intelligence in Education, 21(3), 169–190.

  • Liu, O. L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., & Linn, M. C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19–28.

  • Liu, O. L., Rios, J. A., Heilman, M., Gerard, L., & Linn, M. C. (2016). Validation of automated scoring of science assessments. Journal of Research in Science Teaching, 53(2), 215–233.

  • *Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215–232.

  • *Mao, L., Liu, O. L., Roohr, K., Belur, V., Mulholland, M., Lee, H.-S., & Pallant, A. (2018). Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138.

  • *Mason, R. A., & Just, M. A. (2016). Neural representations of physics concepts. Psychological Science, 27(6), 904–913.

  • McGraw-Hill Education, C. T. B. (2014). Smarter balanced assessment consortium field test: Automated scoring research studies (in accordance with smarter balanced RFP 17).

  • *Moharreri, K., Ha, M., & Nehm, R. H. (2014). EvoGrader: An online formative assessment tool for automatically evaluating written evolutionary explanations. Evolution: Education and Outreach, 7(1), 15.

  • Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning. MIT Press.

  • Montalvo, O., Baker, R. S., Sao Pedro, M. A., Nakama, A., & Gobert, J. D. (2010). Identifying students’ inquiry planning using machine learning. Paper presented at Educational Data Mining 2010.

  • *Muldner, K., Burleson, W., Van de Sande, B., & VanLehn, K. (2011). An analysis of students’ gaming behaviors in an intelligent tutoring system: Predictors and impacts. User Modeling and User-Adapted Interaction, 21(1–2), 99–135.

  • Nakamura, C. M., Murphy, S. K., Christel, M. G., Stevens, S. M., & Zollman, D. A. (2016). Automated analysis of short responses in an interactive synthetic tutoring system for introductory physics. Physical Review Physics Education Research, 12(1), 010122.

  • National Research Council. (2012). A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas. Committee on a Conceptual Framework for New K-12 Science Education Standards. Board on Science Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.

  • National Research Council. (2014). Developing Assessments for the Next Generation Science Standards. Committee on Developing Assessments of Science Proficiency in K-12. Board on Testing and Assessment and Board on Science Education, J.W. Pellegrino, M.R. Wilson, J.A. Koenig, and A.S. Beatty, Editors. Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.

  • *Nehm, R. H., Ha, M., & Mayfield, E. (2012). Transforming biology assessment with machine learning: Automated scoring of written evolutionary explanations. Journal of Science Education and Technology, 21(1), 183–196.

  • *Nehm, R. H., & Haertig, H. (2012). Human vs computer diagnosis of students’ natural selection knowledge: testing the efficacy of text analytic software. Journal of Science Education and Technology, 21(1), 56–73.

  • NGSS Lead States. (2013). Next Generation Science Standards: For States, By States. Washington, DC: The National Academies Press.

  • *Okoye, I., Sumner, T., & Bethard, S. (2013). Automatic extraction of core learning goals and generation of pedagogical sequences through a collection of digital library resources. Paper presented at the Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries.

  • *Okoye, I. U. (2015). Building an educational recommender system based on conceptual change learning theory to improve students' understanding of science concepts. (AAI3704786 Ph.D.), University of Colorado at Boulder.

  • *Opfer, J. E., Nehm, R. H., & Ha, M. (2012). Cognitive foundations for science assessment design: knowing what students know about evolution. Journal of Research in Science Teaching, 49(6), 744–777.

  • Parsons, S. (2016). Authenticity in Virtual Reality for assessment and intervention in autism: A conceptual review. Educational Research Review, 19, 138–157.

  • Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2(1), 37–63.

  • Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641.

  • Rothstein, H. R. (2008). Publication bias as a threat to the validity of meta-analytic results. Journal of Experimental Criminology, 4(1), 61–81.

  • *Ryoo, K., & Linn, M. C. (2016). Designing automated guidance for concept diagrams in inquiry instruction. Journal of Research in Science Teaching, 53(7), 1003–1035.

  • Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229.

  • *Sao Pedro, M., Baker, R. S., Montalvo, O., Nakama, A., & Gobert, J. D. (2010). Using text replay tagging to produce detectors of systematic experimentation behavior patterns. Paper presented at the Educational Data Mining 2010.

  • *Shermis, M. D. (2015). Contrasting state-of-the-art in the machine scoring of short-form constructed responses. Educational Assessment, 20(1), 46–65.

  • *Shermis, M. D., & Burstein, J. C. (2003). Automated essay scoring: A cross-disciplinary perspective. Routledge.

  • *Steele, M. M., Merrill, J., Haudek, K., & Urban-Lurain, M. (2016). The development of constructed response astronomy assessment items. Paper presented at the National Association for Research in Science Teaching (NARST), Baltimore, MD.

  • Sun, S. (2011). Meta-analysis of Cohen’s kappa. Health Services and Outcomes Research Methodology, 11(3–4), 145–163.

  • *Tansomboon, C., Gerard, L. F., Vitale, J. M., & Linn, M. C. (2017). Designing automated guidance to promote the productive revision of science explanations. International Journal of Artificial Intelligence in Education, 27(4), 729–757.

  • Tufféry, S. (2011). Data mining and statistics for decision making. John Wiley & Sons.

  • Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48.

  • *Vitale, J., Lai, K., & Linn, M. (2015). Taking advantage of automated assessment of student-constructed graphs in science. Journal of Research in Science Teaching, 52(10), 1426–1450.

  • *Wang, H. C., Chang, C. Y., & Li, T. Y. (2008). Assessing creative problem-solving with automated text grading. Computers & Education, 51(4), 1450–1466.

  • Wiley, J., Hastings, P., Blaum, D., Jaeger, A. J., Hughes, S., & Wallace, P. (2017). Different approaches to assessing the quality of explanations following a multiple-document inquiry activity in science. International Journal of Artificial Intelligence in Education, 27(4), 758–790.

  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.

  • Williamson, D. M., Bennett, R. E., Lazer, S., Bernstein, J., Foltz, P. W., Landauer, T. K., & Sweeney, K. (2010). Automated scoring for the assessment of common core standards. Retrieved December 15, 2010, from http://professionals.collegeboard.com/profdownload/Automated-Scoring-for-theAssessment-of-Common-Core-Standards.pdf

  • *Yan, J. (2014). A computer-based approach for identifying student conceptual change (Unpublished master’s dissertation). West Lafayette, Indiana: Purdue University.

  • Yeh, S. S. (2009). Class size reduction or rapid formative assessment?: A comparison of cost-effectiveness. Educational Research Review, 4(1), 7–15.

  • *Yoo, J., & Kim, J. (2014). Can online discussion participation predict group project performance? Investigating the roles of linguistic features and participation patterns. International Journal of Artificial Intelligence in Education, 24(1), 8–32.

  • *Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via clustering in educational assessment. Educational and Psychological Measurement, 76(2), 280–303.

  • Zhai, X. (2019). Applying machine learning in science assessment: Opportunity and challenge. A call for a Special Issue in Journal of Science Education and Technology. https://doi.org/10.13140/RG.2.2.10914.07365

  • Zhai, X., Haudek, K., Shi, L., Nehm, R., & Urban-Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning-based science assessment. Journal of Research in Science Teaching, 57(9), 1430-1459. https://doi.org/10.1002/tea.21658

  • Zhai, X., Yin, Y., Pellegrino, J., Haudek, K., & Shi, L. (2020b). Applying machine learning in science assessment: A systematic review. Studies in Science Education, 56(1), 111–151.

  • Zhai, X., Haudek, K., Stuhlsatz, M., & Wilson, C. (2020c). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 1–12. https://doi.org/10.1016/j.stueduc.2020.100916

  • Zhai, X., Krajcik, J., & Pellegrino, J. (In press). On the validity of machine learning-based Next Generation Science Assessments: A validity inferential network. Journal of Science Education and Technology. https://doi.org/10.1007/s10956-020-09879-9

  • Zhu, M., Lee, H. S., Wang, T., Liu, O. L., Belur, V., & Pallant, A. (2017). Investigating the impact of automated feedback on students’ scientific argumentation. International Journal of Science Education, 39(12), 1648–1668.

  • Zhu, M., Liu, O. L., & Lee, H. S. (2020). The effect of automated feedback on revision behavior and learning gains in formative assessment of scientific argument writing. Computers & Education, 143, 103668.

  • Zhai, X. (in press). Advancing automatic guidance in virtual science inquiry: From ease of use to personalization. Educational Technology Research and Development.

Acknowledgements

The project was partially supported by the National Science Foundation (DUE-1561159).

Author information

Corresponding author

Correspondence to Xiaoming Zhai.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Note. * indicates the papers we reviewed for this study.

About this article

Cite this article

Zhai, X., Shi, L. & Nehm, R. A Meta-Analysis of Machine Learning-Based Science Assessments: Factors Impacting Machine-Human Score Agreements. J Sci Educ Technol 30, 361–379 (2021). https://doi.org/10.1007/s10956-020-09875-z
