The development and evaluation of a software prototype for computer-adaptive testing
Introduction
The British Standards Institution estimates that approximately 1000 computer-assisted assessments are performed each day in the United Kingdom (British Standards Institution, 2001). In addition, a number of studies relating to the use of computers in student assessment within British Higher Education have been published in recent years, some examples being Towse and Garside (1998), Harvey and Mogey (1999), Doubleday (2000), Kleeman, Keech, and Wright (2001), Conole and Bull (2002) and Sclater and Howie (2003). These studies covered a wide range of computer-delivered assessments, and this paper focuses on two specific delivery methods: computer-based test (CBT) and computer-adaptive test (CAT).
Harvey and Mogey (1999) and others (e.g., De Angelis, 2000; Mason, Patry, & Bernstein, 2001; Pritchett, 1999) have reported numerous benefits of the CBT approach over the standard paper-and-pencil one. These benefits range from the automation of marking, and the subsequent reduction in marking workload, to the opportunity to provide students with immediate feedback on their performance. Notwithstanding these benefits, previous work by Lord (1980), Freedle and Duran (1987), Wainer (1990) and Carlson (1994) suggested that CBTs have often been viewed as unsatisfactory in terms of efficiency. The reason for this inefficiency is that the questions administered during a given CBT session are not tailored to the specific ability of an individual student. In a typical CBT, the same predefined set of questions is presented to all students participating in the assessment session, regardless of their ability. The questions within this fixed set are typically selected in such a way that a broad range of ability levels, from low to advanced, is catered for (Pritchett, 1999). In this scenario, it is accepted that high-performing students are presented with one or more questions that are below their level of ability. Similarly, low-performing students are presented with questions that are above their level of ability.
The underlying idea of a CAT is to offer each student a set of questions that is appropriate to their level of ability. To this end, in a CAT the questions are dynamically selected for each student, based on his or her individual performance during the test. In general terms, a CAT session starts with a random question of average difficulty. If the student answers the question correctly, the estimate of his or her ability is increased. Since the ability estimate has increased, the rationale is that he or she may also be able to answer a more difficult question. Thus, a more challenging question appropriate for this new higher estimate follows. Conversely, if the response provided is incorrect, the estimate of his or her ability is decreased and an easier question that is suitable for this new lower estimate is presented next.
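To make this selection-and-update cycle concrete, the sketch below outlines it in Python. The question pool, difficulty scale, fixed step size and function names are illustrative assumptions rather than details of the prototype; the prototype itself estimates ability using the Item Response Theory model described later in this paper.

import random

# Illustrative question pool: each question carries a difficulty value on an
# arbitrary scale centred on 0 (average difficulty).
question_pool = [{"id": i, "difficulty": random.uniform(-3.0, 3.0)} for i in range(250)]

def pick_question(pool, theta):
    """Return the question whose difficulty is closest to the current ability estimate."""
    return min(pool, key=lambda q: abs(q["difficulty"] - theta))

def run_adaptive_session(pool, answer_fn, num_questions=10, step=0.5):
    """Simplified adaptive loop: raise the ability estimate after a correct
    response and lower it after an incorrect one, then select the next question
    to match the revised estimate."""
    theta = 0.0                                  # start from an average-ability estimate
    remaining = list(pool)
    for _ in range(num_questions):
        question = pick_question(remaining, theta)
        remaining.remove(question)
        correct = answer_fn(question)            # True if the response was correct
        theta += step if correct else -step      # crude up/down adjustment
    return theta

In this simplified form, answer_fn stands for whatever mechanism records the student's response; a real CAT replaces the fixed step with a statistically grounded re-estimation of ability after each response.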
One of the principles underlying the CAT approach is that administering easy questions to a high-ability student is not efficient, as a correct response would provide low-value information about his or her ability. Likewise, an incorrect response from a less proficient student to a difficult question adds little information about this individual’s ability within the subject being tested. By selecting and administering questions that match the individual student’s estimated level of ability, questions that provide low-value information are avoided (Lilley & Barker, 2002, 2003). In doing so, the test length can be reduced by up to 50% without jeopardising test validity and reliability (Carlson, 1994; Jacobson, 1993; Microsoft Corporation, 2002).
The use of CATs has been increasing, and CATs have indeed been replacing traditional CBTs in some areas of education and training. Usually this replacement is associated with the need for greater efficiency when assessing large numbers of students, for example in online training. The replacement of CBTs with CATs in examinations such as the Graduate Management Admission Test (Graduate Management Admission Council, 2002), the Test of English as a Foreign Language (Educational Testing Service, 2003) and the Microsoft Certified Professional examinations (Microsoft Corporation, 2002) is evidence of this trend.
Inappropriate levels of question difficulty might lead the least proficient students to experience frustration or even bewilderment when overly difficult questions are presented. Similarly, the most proficient students might feel bored if the questions administered during a given assessment session are unchallenging. One could then argue that, in addition to enhanced test length efficiency, the dynamic selection of questions within CATs has the potential to offer higher levels of interaction and student motivation than those afforded by traditional CBTs. This assumption raises questions regarding the evaluation of educational software and the assignment of specific values to benefits that are often intangible, such as improved student motivation.
Interest in how to evaluate educational software has been growing, but a generalised model has not yet been fully established within British Higher Education (Boyle & O’Hare, 2003). This paper focuses on the development and evaluation of a CAT prototype designed at the University of Hertfordshire, and it is hoped that the methods described here will be of interest to both educational researchers and teaching staff.
The following sections give a brief introduction to CAT, followed by the results of an expert evaluation and two user evaluations of the prototype, carried out using questionnaire, online data collection, observation and focus group methods. The final section discusses the future directions of our research within computer-adaptive testing, along with our perceptions of the benefits and limitations of these evaluation methods.
Computer-adaptive test
The prototype introduced here comprised a database containing 250 objective questions related to the use of English language and grammar. A Graphical User Interface was designed to deliver questions simply and effectively for each candidate. The adaptive algorithm used in the prototype was based on the Three-Parameter Logistic Model (3-PL) from Item Response Theory (Lord, 1980).
In order to evaluate the probability P of a student with an unknown ability θ correctly answering a question of difficulty b, the 3-PL model also takes into account the question's discrimination a and its pseudo-guessing parameter c, giving P(θ) = c + (1 − c)/(1 + e^(−a(θ − b))).
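As a sketch of how such a probability can be computed, the Python functions below implement the 3-PL probability together with the corresponding item information function, which quantifies how much a response to a question of a given difficulty reveals about the student's ability; this is the sense in which mismatched questions yield low-value information. The function names and parameter values are illustrative assumptions and are not taken from the prototype or its question bank.

import math

def probability_correct(theta, a, b, c):
    """3-PL model: probability that a student of ability theta correctly answers
    a question with discrimination a, difficulty b and guessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of the question at ability theta; it is small when the
    question's difficulty is far from theta."""
    p = probability_correct(theta, a, b, c)
    return (a ** 2) * ((1.0 - p) / p) * (((p - c) / (1.0 - c)) ** 2)

# Example with illustrative parameter values: an average-difficulty question
# (b = 0), moderate discrimination (a = 1.2) and a 20% guessing chance,
# presented to a slightly above-average student (theta = 0.5).
print(probability_correct(0.5, 1.2, 0.0, 0.2))   # approximately 0.72
print(item_information(0.5, 1.2, 0.0, 0.2))      # larger than for a badly mismatched question

Under these assumptions, the information is greatest when the question's difficulty lies close to the current ability estimate, which is why the adaptive algorithm selects questions that match that estimate.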
The heuristic evaluation
The prototype was first subjected to a heuristic evaluation (Molich & Nielsen, 1990) with the participation of 11 experts, comprising lecturers in Computer Science and in English for Academic Purposes at the University of Hertfordshire. Lilley and Barker (2002) provide details of how this evaluation was performed. Prior to the heuristic evaluation, all experts attended a session in which the main characteristics of a CAT were explained. It was considered important that the experts were
First user evaluation
The first user evaluation involved 27 international volunteers who, without any prior training in how to operate the software, were asked to take a test on the use of English language and grammar. Most of the students were Chinese; their mean age was 24.8 years, and the group comprised 13 male and 14 female students. The test comprised 20 questions, presented in two sections: one of 10 dynamically selected CAT questions and the other of 10 static CBT questions. The order in which the questions
Focus group
One of the main advantages of a focus group is the possibility of gathering information about complex or sensitive issues that are likely to be overlooked by the quantitative methods employed earlier in the study (Preece, Rogers, & Sharp, 2002), such as the online questionnaire. Twelve volunteers took part in a focus group study immediately after undertaking the CAT test described in the previous section of this paper.
The focus group was guided by a facilitator experienced in the area of
Conclusion and future work
In a traditional computer-based test (CBT), all candidates usually answer the same set of questions. The number of questions correctly answered is used as a measure of performance in the test. From this score we make the assumption that those scoring highest know the most about a subject and those scoring lowest know the least. Teachers have long been concerned about this approach for several reasons. Most importantly for us, such tests provide very little information about learner performance.
References (33)
- et al. (2003). User requirements of the “ultimate” online assessment engine. Computers & Education.
- Barker, T., & Barker, J. (2002). The evaluation of complex, intelligent, interactive, individualised human–computer...
- Barker, T., & Lilley, M. (2003). Are individual learners disadvantaged by the use of computer-adaptive testing? In...
- et al. (2002). The use of a co-operative student model of learner characteristics to configure a multimedia application. User Modelling and User Adapted Interaction.
- et al. A comparison of tiled and overlapping windows.
- Boyle, A., & O’Hare, D. (2003). Finding appropriate methods to assure quality computer-based development in UK Higher...
- Designing for multimedia learning. (1997).
- British Standards Institution (2001). New exam guidelines to stop the cyber-cheats [online]. Available:...
- Computer-adaptive testing: A shift in the evaluation paradigm. Journal of Educational Technology Systems. (1994).
- Conole, G., & Bull, J. (2002). Pebbles in the pond: Evaluation of the CAA Centre. In Proceedings of the 6th...