Book/Adaptive Testing

In a conventional test, every examinee receives the same set of items, some of which are poorly matched to that examinee: they may be much harder or much easier than the examinee’s level. The information functions of the test items are spread over the range of difficulty that covers the skills of most test takers. An unwanted side effect is that each test taker answers a series of questions that are too easy or too difficult for them. Both cases are demotivating for the test taker and, from the point of view of the testing institution, a waste of time. In electronic testing, one can therefore imagine an algorithm that selects items for each test taker, with the difficulty adapted to their performance on the previous items.

This approach is called “computer adaptive testing” (CAT). It allows measuring the student's latent skill with the same accuracy as a classic test, but using a smaller number of items.

Thus, adaptive testing tailors the test to the test taker, item by item, based on their responses. A correct answer leads to a more difficult item, while an incorrect answer leads to an easier item. The difficulty of the items is continuously adjusted to the skills of the test taker. A gifted student will receive increasingly difficult items, while an average student will receive easier items. The number of items used depends on the required measurement accuracy: the test stops as soon as a predetermined level of accuracy of the skill estimate is reached. With adaptive testing, the test is only as long as it really needs to be.
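The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular testing system: it assumes the one-parameter (Rasch) IRT model, selects the unused item with the highest information at the current estimate, re-estimates the skill as a posterior mean under a standard normal prior, and stops at a hypothetical standard-error target of 0.4. All function names, the item bank, and the simulated test taker are invented for the example.

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: probability of a correct answer to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = p_correct(theta, b)
    return p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4 to 4

def estimate_ability(responses):
    """Posterior mean and SD of ability under a standard normal prior.

    responses is a list of (difficulty, answered_correctly) pairs.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for b, correct in responses:
            p = p_correct(theta, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_target=0.4, max_items=30):
    """Administer items until the SE target or the item limit is reached."""
    remaining = list(bank)
    responses = []
    theta, se = 0.0, 1.0  # start from average ability (the prior)
    while remaining and len(responses) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        b = max(remaining, key=lambda d: item_information(theta, d))
        remaining.remove(b)
        responses.append((b, answer(b)))
        theta, se = estimate_ability(responses)
    return theta, se, responses

# Simulated test taker with a true ability of 1.0 (assumed for the demo).
rng = random.Random(1)
bank = [d / 10.0 for d in range(-30, 31, 2)]  # 31 item difficulties, -3 to 3
true_theta = 1.0
theta, se, responses = run_cat(
    bank, lambda b: rng.random() < p_correct(true_theta, b)
)
print(f"estimate = {theta:.2f}, SE = {se:.2f}, items used = {len(responses)}")
```

Under the Rasch model the most informative item is the one whose difficulty lies closest to the current estimate, which is why a correct answer is followed by a harder item and an incorrect answer by an easier one.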

Fig. 6.7.1 Principle of adaptive testing. If we know the psychometric properties of the items, we can arrive at the same accuracy of the “knowledge” parameter estimate using a smaller number of items. Let's imagine that we first give the test taker an item of average difficulty (A). The test taker answers correctly and the adaptive testing algorithm chooses a more difficult item (B). The test taker then answers incorrectly and the algorithm offers an easier item (C). Instead of the test taker having to answer all test items, in this illustrative example, answering three items is sufficient to determine their level with sufficient accuracy.

The method is based on item response theory (IRT), which was discussed in the previous chapter.

Advantages and Disadvantages of Computer Adaptive Testing (CAT)

CAT is a modern way of testing that uses algorithms to adapt the test optimally to each examinee. Traditionally, items are compiled into a fixed test set and presented to all students in that form. The most obvious disadvantage of this approach is its inefficiency: the difficulty of the test items does not reflect the skills of the individual test taker. Let's imagine an exceptionally skilled student who answers all the most difficult questions correctly. We can confidently give this student a high score without making them spend time on all the simple questions. While this saving may seem small for one student, applying the same method to the entire tested group makes the time savings significant.

Another problem is the uneven accuracy of measurement for students with different levels of knowledge. In traditional tests, items of medium difficulty usually have the greatest representation. There is a good reason for this: the test takers are likely to include a large number of people with intermediate skills, and the test will evaluate them very accurately. However, this comes at the expense of students with a low or, conversely, a high level of skill, who are evaluated with much less accuracy. For the same reason, students with above-average or below-average skills may have a bad experience with the test. Weak students may feel exhausted and discouraged because most items are too difficult, while above-average students may be demotivated because most items are too easy for them.
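The uneven accuracy can be illustrated numerically. The sketch below assumes the Rasch model and the usual approximation that the standard error of the skill estimate is the reciprocal of the square root of the test information. The 20-item fixed test is hypothetical and was chosen only so that most items have medium difficulty.

```python
import math

def rasch_information(theta, b):
    """Fisher information of a Rasch item of difficulty b at ability theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def standard_error(theta, difficulties):
    """Approximate SE of the ability estimate: 1 / sqrt(test information)."""
    info = sum(rasch_information(theta, b) for b in difficulties)
    return 1.0 / math.sqrt(info)

# Hypothetical fixed test: 20 items, most of them of medium difficulty.
fixed_test = [-1.5, -1.0] + [-0.5] * 4 + [0.0] * 8 + [0.5] * 4 + [1.0, 1.5]

for theta in (-3.0, 0.0, 3.0):
    print(f"theta = {theta:+.1f}  SE = {standard_error(theta, fixed_test):.2f}")
```

The printed standard error is smallest for the average test taker and grows toward both ends of the skill range, which is the pattern described above.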

Advantages of CAT

Shorter tests (by up to 50%)
Stable accuracy
Favorable feedback from test takers
Better motivation of test takers
Lower exposure of test items
Possibility of use for measuring student progress (a retested student receives a different test)

Disadvantages of CAT

Impossibility of returning to previously answered items during the test
Sensitivity to test anxiety
The need for prior calibration of the items
Items with favorable properties tend to be used too often, which can result in their disclosure
Requires a sufficient number of pilot test takers (several hundred)
Preparation requires highly qualified professionals
More demanding to explain to the public – higher public relations costs

Requirements for Computer Adaptive Testing (CAT)

CATs have many advantages, including reducing testing time by up to half, but they require experienced psychometricians, large pilot samples, and specialized software. Here is a basic overview of what to consider when deciding on adaptive testing:

Items must be scorable automatically, because the next item is selected in real time based on the result for the previous item. This excludes some otherwise useful forms of test items (constructed-response questions, essays, etc.).
Resources are needed to develop banks with a large number of items. Banks usually need at least three times as many items as the intended length of the test (although this is often no more than is needed for traditional test formats).
Extensive pilot tests must take place. IRT calibration requires a pilot sample of roughly 100 to 1,000 test takers. The required number depends on the complexity of the IRT model used: more complex IRT models require larger samples.
It is necessary to have experts in psychometrics. For successful deployment, qualified experts are needed, especially for item calibration and IRT analysis, and for simulating adaptive testing with a given item bank.
Analytical software must be available. IRT analysis software (e.g. freely available ShinyItemAnalysis or commercial equivalents) is required for item calibration.
An IRT-supporting item bank capable of storing IRT item parameters and supporting the design of CATs is essential.
Finally, an appropriate test delivery system is needed. It must be capable of adaptive testing based on IRT, at a minimum with appropriate termination criteria and item selection algorithms.
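To make the last two requirements more concrete, the following sketch shows what a record in an IRT-supporting item bank and a simple termination criterion might look like. It is an assumption-laden illustration: the class `CalibratedItem`, its fields, the function `should_stop`, and all threshold values are hypothetical, and the two-parameter logistic (2PL) model is assumed as the calibration model.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class CalibratedItem:
    """One item bank record with 2PL parameters (hypothetical layout)."""
    item_id: str
    discrimination: float        # 2PL parameter a
    difficulty: float            # 2PL parameter b
    times_administered: int = 0  # usage count, useful for monitoring exposure

    def information(self, theta):
        """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
        z = self.discrimination * (theta - self.difficulty)
        p = 1.0 / (1.0 + math.exp(-z))
        return self.discrimination ** 2 * p * (1.0 - p)

def should_stop(standard_error, items_given,
                se_target=0.3, min_items=5, max_items=40):
    """Termination rule: stop at the item limit, or once the SE target
    is met after a minimum number of items (all thresholds assumed)."""
    if items_given >= max_items:
        return True
    return items_given >= min_items and standard_error <= se_target

item = CalibratedItem("q1", discrimination=1.5, difficulty=0.0)
print(item.information(0.0), should_stop(0.25, items_given=10))
```

Combining an accuracy target with minimum and maximum test lengths is one common design choice; a real delivery system would define these criteria according to the purpose of the test.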