Equalization of Test Difficulty

From StaTest

Equalization of test difficulty (also test levelling) is part of standardization. Its objective is to ensure the mutual comparability of different runs or parallel forms of the test (for example, in individual years, at individual schools, etc.).

Equalizing is a technical procedure for recalculating students' scores from individual runs (parallel forms) of the test so that results achieved in one run can be compared with results from other runs[1].

Leveling of difficulty is an important aspect of test quality and directly affects validity. It is an essential tool in educational evaluation, since it underpins the validity of the test across all of its forms and years.

Two procedures are used when comparing tests with each other: linking tests and equalizing tests. Linking two tests means establishing a relationship between their results. For example, we can create a table of corresponding scores, pairing the scores achieved on each test by students of the same level. Based on this table, we can say that students who scored X on the first test will most likely score Y on the second test.

The claim that the difficulty has been equalized is much stronger. If the two tests under consideration were successfully equalized, then we can state that students who scored X on the first test and students who scored Y on the second test have very similar levels of knowledge and skills measured by these tests.

In other words, to say that two forms of a test are balanced (equivalent) means that they measure the same content and support the same conclusions about what students know and can do. If, on the other hand, we say that there is a link between the two tests, this is a much weaker claim, which only means that there is a statistically measurable relationship between the scores on the two tests. This is because the fact that students scored X on the first test and scored Y on the second does not mean that the two tests are really measuring the same thing (the same construct). The linking of tests is therefore not a sufficient argument for us to replace one test with another. To do this, it would be necessary to verify that the tests are equivalent, i.e. to obtain confirmation from experts that both tests cover the same domain with the same means.

Equalizing the difficulty of tests can either precede the administration of the test (pre-equating) or follow it (post-equating). Pre-equating refers to preparing the new test so that its format, content and characteristics correspond to the reference test. In post-equating, the test may also be assembled according to the rules for pre-equating, but the final equalization is carried out only with the help of data obtained from the analysis of the administered test.

To balance the difficulty of two tests, we need comparable data. One possibility is to give both tests to a sufficiently large group of people and compare the results. To limit the effect of test order, the group can be split in half, with each half taking the tests in opposite order. The disadvantage of this approach is that administering two tests is impractical and time-consuming. The security risk also grows, since exposing two tests increases the chance that their items will be divulged.

To limit these drawbacks, we can use so-called test anchoring: a certain number of items included in the test are identical across all versions. These so-called anchor items are then used to compare the different versions of the test. Anchor items should be representative, should cover the test's range of difficulty, and should make up at least 20% of the test length[2]. The topics of the anchor items should replicate the content of the entire test; a set of anchor items can be considered a “mini version” of the whole test[1].

Anchor items can be either “internal” or “external,” depending on whether or not they count toward the test score. They can be “embedded” if they are scattered throughout the test, or “attached” as a separate block of items at the end of the test.

There are many methods for equalizing tests.

Linear equalization is a tool for establishing equivalent scores between two parallel forms of a test within classical test theory. It assumes that the tests differ only in their average raw scores and in the variability of the results (i.e. the size of the standard deviation). Under these assumptions, scores can be converted from one test to the other with a linear transformation: a score a given number of standard deviations above or below the mean of the second test is mapped to the score the same number of standard deviations from the mean of the first test. The result is a linear rescaling of the second test's scores onto the point scale of the first. This method has several limitations:

  • Linear equalization will not work in cases where the relationship between test results is not linear (e.g. with an asymmetric distribution of scores).
  • The transformation applies only to the sample of test takers for which it was calculated.
  • The transformation works best for scores that are less than one standard deviation away from the mean.

The advantage of the linear transformation is that it is easy to understand and computationally simple.
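The transformation described above can be sketched in a few lines of Python; the function name and the summary statistics used in the example are illustrative, not taken from any particular package:

```python
def linear_equate(score_y, mean_x, sd_x, mean_y, sd_y):
    """Map a raw score from form Y onto the point scale of form X.

    Linear equating assumes the forms differ only in mean and standard
    deviation: a score z standard deviations from the mean of Y is
    mapped to the score z standard deviations from the mean of X.
    """
    z = (score_y - mean_y) / sd_y   # standardize on form Y
    return mean_x + z * sd_x        # re-express on form X's scale

# Made-up summary statistics for illustration:
# form X: mean 30, sd 6; form Y: mean 27, sd 5
equated = linear_equate(32, mean_x=30, sd_x=6, mean_y=27, sd_y=5)
# 32 is one sd above Y's mean, so it maps one sd above X's mean: 36.0
```

Note that, as the limitations above suggest, this mapping is only trustworthy for the sample it was computed from and for scores near the mean.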

If we need a more robust method that also works for students at the edges of the measured ability range, we can use, for example, equipercentile equalization.

The equipercentile method aligns results more accurately across the entire range of possible scores. We first determine the percentile rank of each score achieved on both tests, and then match scores with equal percentile ranks using a table. Alternatively, raw scores can first be converted to percentiles and then scaled for both tests together. A number of computer programs can calculate equivalent scores or establish percentile ranks for all achieved scores, and percentile ranks are also often used to communicate results to students. Like linear equating, however, the equipercentile method depends on the specific sample of students and is not readily transferable to other groups. The two methods are similar in many ways; the linear adjustment is sometimes described as an approximation to the equipercentile method[3].
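The table-matching idea can be sketched as follows. The helper functions and the midpoint percentile-rank convention are our own illustrative choices; real applications also smooth the score distributions and would use dedicated software:

```python
def percentile_ranks(scores, max_score):
    """Percentile rank of each integer score 0..max_score, using the
    midpoint convention: percent below plus half the percent exactly at."""
    n = len(scores)
    return [100.0 * (sum(1 for x in scores if x < s)
                     + 0.5 * sum(1 for x in scores if x == s)) / n
            for s in range(max_score + 1)]

def equipercentile_table(scores_x, scores_y, max_score):
    """For each raw score on form Y, the form X score whose
    percentile rank is closest (a crude, unsmoothed matching)."""
    pr_x = percentile_ranks(scores_x, max_score)
    pr_y = percentile_ranks(scores_y, max_score)
    return {y: min(range(max_score + 1),
                   key=lambda x: abs(pr_x[x] - pr_y[y]))
            for y in range(max_score + 1)}

# Toy data: form Y is uniformly 2 points harder than form X
tbl = equipercentile_table(list(range(2, 11)), list(range(0, 9)), max_score=10)
# tbl[4] == 6: a 4 on the harder form corresponds to a 6 on form X
```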

IRT-based equalization methods. In practice, methods based on item response theory (IRT) are more widely used; they have proven more accurate and reliable than methods derived from classical test theory and do not depend on a specific group of test takers.

IRT-based test equalization methods can be divided into two groups:

  • Methods of equalizing observed scores
  • Methods of equalizing true scores

In the first case, the observed total scores on the two test forms are compared: based on the behavior of the anchor items present in both forms, the parameters of the second test are rescaled so that the difficulty estimates of the anchor items converge, the estimated distributions of total scores on both forms are derived from the IRT model, and these distributions are then aligned with the equipercentile method. In the second case, the test characteristic curves of two or more compared forms are plotted together: from the characteristic curve of the first form we find the ability level corresponding to a given true score, and the equated result is the true score that the second form's characteristic curve assigns to that same ability[4].
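Placing two separate calibrations on a common scale via the anchor items is often done with mean–sigma linking; a minimal sketch follows (the anchor-item difficulty values are invented for illustration):

```python
import statistics

def mean_sigma_link(b_ref, b_new):
    """Mean-sigma linking constants from the difficulty estimates of
    the same anchor items calibrated separately on two forms.

    Returns (A, B) such that A * b + B places parameters from the new
    calibration onto the reference scale.
    """
    A = statistics.stdev(b_ref) / statistics.stdev(b_new)
    B = statistics.mean(b_ref) - A * statistics.mean(b_new)
    return A, B

# Hypothetical anchor-item difficulties (illustrative numbers only):
b_ref = [-1.2, -0.4, 0.1, 0.8, 1.5]   # reference calibration
b_new = [-1.0, -0.2, 0.3, 1.0, 1.7]   # same items, new calibration
A, B = mean_sigma_link(b_ref, b_new)
rescaled = [A * b + B for b in b_new]  # now on the reference scale
```

In this toy example the new calibration is simply shifted by 0.2, so the linking recovers A ≈ 1 and B ≈ −0.2 and the rescaled difficulties match the reference values.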

One limitation of IRT-based test equalization methods is the required number of test takers, which should not fall below 500. Parameter estimates from small samples are unsatisfactory, and the problem worsens as the complexity of the IRT model increases.
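The true-score approach described above can be sketched for the simplest case, a Rasch model: we invert the test characteristic curve of the first form by bisection to find the ability behind a given true score, then read off the second form's true score at that ability (function names and item values are illustrative):

```python
import math

def rasch_tcc(theta, difficulties):
    """Test characteristic curve: expected (true) total score on a
    Rasch-model test with the given item difficulties."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def true_score_equate(score_x, b_x, b_y, lo=-6.0, hi=6.0, tol=1e-8):
    """Find the ability yielding true score score_x on form X
    (bisection works because the TCC is monotone increasing), then
    return form Y's true score at that ability.  score_x must lie
    strictly between the TCC values at lo and hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rasch_tcc(mid, b_x) < score_x:
            lo = mid
        else:
            hi = mid
    return rasch_tcc(0.5 * (lo + hi), b_y)

# Sanity check: identical forms equate a score to itself
b_form = [-1.0, 0.0, 1.0]
same = true_score_equate(1.5, b_form, b_form)   # ≈ 1.5
```

With a genuinely easier second form (lower difficulties), the same ability yields a higher true score, so the equated value comes out above the input score, as expected.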

For IRT-based equating, the free program IRTEQ[5] is available; for observed-score linking and equating, the R package equate can be used[6].



References

  1. KOLEN, Michael J. and Robert L. BRENNAN. Test equating, scaling, and linking: methods and practices. 2nd ed. New York: Springer, 2004. xxvi, 548 pp. ISBN 0-387-40086-9.
  2. JELÍNEK, Martin and Petr KVĚTON. Testování v psychologii: Teorie odpovědi na položku a počítačové adaptivní testování. 1st ed. Praha: Grada, 2011. 160 pp. ISBN 978-80-247-3515-3.
  3. HAMBLETON, Ronald K., Hariharan SWAMINATHAN and H. Jane ROGERS. Fundamentals of item response theory. Newbury Park, Calif.: Sage Publications, 1991. ISBN 0-8039-3647-8.
  4. A Practitioner's Introduction to Equating: With Primers on Classical Test Theory and Item Response Theory [online]. Washington: Council of Chief State School Officers, 2021 [cited 2021-10-01]. Available from: https://ccsso.org/resource-library/practitioners-introduction-equating
  5. HAN, K. T. IRTEQ: Windows application that implements IRT scaling and equating [computer program]. Applied Psychological Measurement. 2009, 33(6), 491-493.
  6. ALBANO, Anthony D. equate: An R Package for Observed-Score Linking and Equating. Journal of Statistical Software [online]. 2016, 74(8) [cited 2021-10-01]. ISSN 1548-7660. Available from: doi:10.18637/jss.v074.i08