Book (2022)/Relative Determination of Test Pass Thresholds

Relative standardization is a method of test evaluation in which the performance of the tested individual is compared with the performance of the relevant population. In other words, it ascertains whether the tested individual achieves better or worse results than the other test takers. Tests in which the test taker's performance is assessed in relation to others are called norm-referenced tests (NRT). For example, the SAT, which is used as a decisive criterion for admission to many universities in the US, uses this approach to evaluate an individual’s performance in the context of the performance of others. In our setting, relative standardization, which compares students' performance with each other, is a common part of entrance examinations and various classification tests.

Relative assessment is based on the assumption that the performance of mutually comparable study groups (across space and time) is basically the same.

Advantages of Relative Assessment

Relative assessment is not linked to the content of the test; it evaluates individual participants against each other. Its advantages are that it prevents inflation of the highest grades, clearly differentiates the best students, and makes it unnecessary to standardize each test separately.

Disadvantages of Relative Assessment

Grading students according to relative standardization discourages cooperation and teamwork, because students realize they are competing with each other for a limited number of top grades. It also reduces students' motivation to study by weakening the relationship between their effort and their final grade, which depends not only on their own performance but also on the performance of others. A further disadvantage of relative grading is that the quality of successful students fluctuates with the quality of the group. Especially in smaller groups, even students whose level of knowledge does not meet our requirements may succeed. Conversely, some students may fail the test no matter how well they know the material. Relative standardization can also exaggerate insignificant differences, especially in smaller and homogeneous groups. Considering these limitations, we should use relative evaluation mainly in large, heterogeneous groups in which cooperation is not expected. It should not be used in groups of fewer than 40 students.

From the student's point of view, this method is inherently “unfair”, because the grading depends not only on the student’s own performance, but also on the performance of others with whom they are compared. It is therefore possible that with the same level of knowledge, a student would be graded better in one year than in another. To minimize this risk and ensure year-to-year comparability, leveling of test difficulty is used, which will be discussed in a separate chapter.

Practical Application of Relative Assessment

With relative standardization, the group is ordered by the number of points achieved and graded accordingly. Specific grades are determined, for example, by a z-score or a percentile rank. When a four-level classification scale is used, the boundaries between the individual classification levels may correspond, for example, to z-scores of -2, 0, and 2, as indicated in Figure 5.3.1. With relative grading, the cut-off score can be set arbitrarily; on entrance exams, for example, it can be based on the capacity of the school for which the exam is being conducted.
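The capacity-based cut-off mentioned above can be illustrated with a minimal sketch. The function name `capacity_cutoff` and the sample scores are hypothetical, and the sketch assumes that every applicant who ties with the last admitted score is also admitted, which is only one possible tie-breaking rule.

```python
def capacity_cutoff(scores, capacity):
    """Return the lowest score that still gains admission when only
    `capacity` places are available (hypothetical helper)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ranked = sorted(scores, reverse=True)  # best score first
    if capacity >= len(ranked):
        return ranked[-1]  # everyone fits, so the lowest score passes
    return ranked[capacity - 1]  # score of the last applicant within capacity


# Illustrative data, not taken from any real exam.
scores = [55, 72, 90, 64, 81, 72, 47, 68]
cutoff = capacity_cutoff(scores, capacity=3)

# Assumption: applicants tied at the cut-off are all admitted.
admitted = [s for s in scores if s >= cutoff]
print(cutoff, admitted)
```

With these scores the cut-off is 72 and four applicants are admitted, because two of them share the cut-off score. A real admission procedure would need an explicit rule for such ties.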

Fig. 5.3.1 Relative standardization compares the performance of an individual with other examinees.

In relative standardization, the total score is converted to derived values. To express a student's result within the group, one of the following methods of relative test standardization can be used:
The percentile scale roughly indicates what percentage of the tested population performs worse than the student in question.
The z-scale describes how far a given student's score is from the mean, measured in standard deviations of the data.
The T-scale uses the same metric as the z-scale but rescales it, conventionally to a mean of 50 and a standard deviation of 10, so that typical values fall between 0 and 100.

Percentile scale

The best-known method of comparing the performance of examinees is to express it on a percentile scale. A percentile is determined for the student's result, which roughly tells what percentage of students in the reference group had a worse result than the given student. The percentile thus approximately corresponds to the student's rank converted to the interval from 0 to 1 (or 0 to 100%).

To calculate a student's percentile, the number of students who scored worse than the student is counted, and half of the students who scored the same as the student are added. The result is then expressed as a share of the total number of students. The percentile rank $PR_{i}$ for the person with the $i$-th lowest total score is given by:

$$PR_{i}=100\cdot {\frac {N_{i}-{\frac {n_{i}}{2}}}{n}},$$

where $N_{i}$ is the cumulative frequency for the given outcome, $n_{i}$ is the frequency of the given outcome, and $n$ is the number of students tested. Cumulative frequency expresses the number of students who achieved a given result or worse, so $N_{i}-n_{i}$ is the number of students who scored strictly worse.
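The percentile rank formula can be sketched in a few lines of Python. The function name `percentile_ranks` and the sample scores are illustrative assumptions; the code follows the definitions above, accumulating the cumulative frequency over the distinct scores from lowest to highest.

```python
from collections import Counter


def percentile_ranks(scores):
    """Map each distinct score to PR = 100 * (N_i - n_i / 2) / n."""
    n = len(scores)                 # number of students tested
    freq = Counter(scores)          # n_i for each distinct score
    ranks = {}
    cumulative = 0
    for value in sorted(freq):      # lowest score first
        cumulative += freq[value]   # N_i: students with this score or worse
        ranks[value] = 100 * (cumulative - freq[value] / 2) / n
    return ranks


# Illustrative data: five students, two of them tied at 12 points.
scores = [10, 12, 12, 15, 18]
pr = percentile_ranks(scores)
print(pr)
```

For these five scores the two students with 12 points share a percentile rank of 40: one student scored worse, half of the two tied students adds one more, and 2 out of 5 is 40%.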

Z-score

Another method of standardizing a student's result is to calculate its z-score. A student's z-score shows how far the result lies above or below the mean, measured in standard deviation units. The z-score is calculated as the difference between the student's raw score $X$ and the mean of the whole group $\bar{X}$, divided by the standard deviation $SD$:

$$z={\frac {X-{\bar {X}}}{SD}}.$$

Using the z-score, the teacher can easily identify excellent students (z > 2) and, conversely, very weak students (z < −2). The teacher can also easily compare a student's performance on different parts of the test.
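The following minimal sketch computes z-scores for a group and derives T-scores and grades from them. Several details are assumptions not fixed by the text: the population standard deviation (`statistics.pstdev`) is used, the T-score follows the common convention T = 50 + 10z, the four-level scale uses the example boundaries -2, 0, and 2 with grade 1 as the best, and a score lying exactly on a boundary receives the better grade. The function names and the sample data are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Return the z-score of every raw score in the group."""
    avg = mean(scores)
    sd = pstdev(scores)  # assumption: population SD, not sample SD
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - avg) / sd for x in scores]


def grade_from_z(z):
    """Four-level grade with boundaries at z = -2, 0, 2 (assumed: 1 is best)."""
    if z >= 2:
        return 1
    if z >= 0:
        return 2
    if z >= -2:
        return 3
    return 4


# Illustrative data with mean 5 and population SD 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(scores)
t = [50 + 10 * value for value in z]          # common T-score convention
grades = [grade_from_z(value) for value in z]
print(z, t, grades)
```

In this group the raw score 9 has z = 2 and T = 70 and receives grade 1, while the raw score 2 has z = -1.5 and T = 35 and receives grade 3. With a sample this small the result is only illustrative, given that the text advises against relative grading in groups of fewer than 40 students.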

A more detailed analysis of other methods of standardization (e.g. the C-scale and others) is available, for example, in Jeřábek and Bílek’s Teorie a praxe tvorby didaktických testů [1].

Reference

[1] JEŘÁBEK, Ondřej a Martin BÍLEK. Teorie a praxe tvorby didaktických testů. Olomouc: Univerzita Palackého v Olomouci, 2010. ISBN 978-80-244-2494-1.