Item Analysis

After completing a live test run, the first thing we will probably want to do is evaluate the test to see how the students did on it. However, students' answers contain more than just information about their knowledge and skills, because the test results also reflect the characteristics of the test questions. Whereas the evaluation of a test yields information on how individual test takers performed, item analysis gives us the (psychometric) properties of the individual items. Item analysis is also important for item authors and reviewers, as it provides them with objective feedback on how the items they create or review perform in practice. While reviewers are good at assessing, for example, content validity, their estimates of item difficulty are often very subjective. That is why we are interested in item analysis as a source of objective reflection on our items, a tool for their continuous improvement and for educating authors and item reviewers[1].

The basic assumption of item analysis is that the analyzed test is consistent, i.e. that it was written by qualified teachers, and that it therefore consists of items measuring one area of knowledge or ability. The quality of each item is assessed by comparing students' responses to the item with their overall test score.

The main item characteristics are their difficulty and sensitivity.

Item Difficulty

One of the basic characteristics of a test item is whether at least some test participants can answer it correctly — whether it’s not too difficult for the test takers.

We can estimate the difficulty of an item from the proportion of test takers who were able to answer it correctly. This proportion is called the difficulty index and is denoted by $P$:

$$P = \frac{n_c}{n} \cdot 100\,\%,$$

where $n_c$ is the number of examinees who answered the given item correctly and $n$ is the total number of examinees.

The difficulty index takes values between 0 and 100 % (or between 0 and 1). The more students who answered the item correctly, the closer the value of the index is to 100 % (or 1). This can be confusing: although we speak of difficulty, the index is highest when the item is easiest.

Therefore, an additional quantity, the difficulty value $Q$, is introduced. The difficulty value indicates the proportion of test takers who answered the given item incorrectly, so it is the complement of the difficulty index:

$$Q = 100\,\% - P.$$

For items with more complex scoring, the indices are calculated from the arithmetic mean of the scores obtained by all test takers on the given item, divided by the highest attainable number of points for it.
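
For illustration, the calculation can be sketched in a few lines of Python (a minimal sketch with our own function and variable names, not code from any particular testing system). It assumes a score matrix with one row per examinee and one column per item; binary items are scored 0/1, and for polytomously scored items the maximum attainable points per item are supplied:

```python
import numpy as np

def difficulty_index(scores, max_points=None):
    """Difficulty index P per item, in %: mean achieved score / maximum score."""
    scores = np.asarray(scores, dtype=float)
    if max_points is None:
        max_points = np.ones(scores.shape[1])   # binary 0/1 items
    return 100.0 * scores.mean(axis=0) / np.asarray(max_points, dtype=float)

def difficulty_value(scores, max_points=None):
    """Difficulty value Q per item, in %: the complement of P."""
    return 100.0 - difficulty_index(scores, max_points)

# toy example: 5 examinees, 3 binary items
responses = np.array([[1, 0, 1],
                      [1, 1, 0],
                      [1, 0, 0],
                      [0, 0, 1],
                      [1, 1, 1]])
print(difficulty_index(responses))   # [80. 40. 60.]
print(difficulty_value(responses))   # [20. 60. 40.]
```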

In summative testing, the greatest benefit, i.e. the best discrimination, comes from items whose difficulty value is neither too high nor too low (typically 20-80 %). This is logical: an item that is too difficult will not differentiate between weaker and better test takers, because no one will solve it, and at the opposite end of the difficulty scale an item that is too easy yields almost no information, because even very weak test takers will solve it. For items with borderline difficulty values, the discriminative ability naturally decreases.

Note that this estimate of item difficulty (introduced within classical test theory, CTT) is dependent on the test takers. The value will be different for each group, and if the groups differ significantly from each other, the difficulty of the same item can be completely different for each group. Overcoming this connection between difficulty and test subjects is made possible by item response theory, in which the ability of the test takers is one of the parameters.

Item Sensitivity

The sensitivity of the item, or its discrimination, describes its ability to distinguish between differently performing students. Let's imagine that we divide a group of students into better and worse, e.g. according to their overall result on a test. The difference between the average success rate of both groups when solving a specific item expresses the ability of this item to distinguish between better and worse students and is referred to as the upper-lower index (ULI).

We calculate the ULI as the difference in success between a group of better (U - upper) and worse (L - lower) students when solving a specific item.

$$\mathrm{ULI} = P_U - P_L,$$

where $P_U$ and $P_L$ are the proportions of the upper and lower group, respectively, that answered the item correctly.
Fig. 6.4.1 ULI Index — difference in the probability of correct answering of an item between better and worse students.

For tests that are supposed to distinguish between the best and the second best, e.g. admission tests with a large excess of applicants, we may be interested in how the item differentiates right around the dividing score between accepted and rejected applicants. In such a case, a variant of the ULI index can be used that focuses on the boundary between the score groups on either side of the cut-off score.

Fig. 6.4.2 ULI54 Index - the difference in the probability of a correct answer to the item between the best fifth of students and the next-best fifth.
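
One possible sketch of the ULI calculation in Python (again only an illustration with our own naming: examinees are sorted by total score and split into equally sized groups, the weakest group being group 1):

```python
import numpy as np

def uli(item_scores, total_scores, n_groups=2, upper=None, lower=1):
    """ULI between group `upper` and group `lower` (1 = weakest group)."""
    item_scores = np.asarray(item_scores, dtype=float)
    if upper is None:
        upper = n_groups
    # sort examinees by total score; ties on a group boundary are split
    # arbitrarily by the sort order, which in practice changes the ULI very little
    order = np.argsort(total_scores)
    groups = np.array_split(order, n_groups)         # groups[0] = weakest group
    p_upper = item_scores[groups[upper - 1]].mean()  # success rate of the better group
    p_lower = item_scores[groups[lower - 1]].mean()  # success rate of the worse group
    return p_upper - p_lower

# toy example with simulated data
rng = np.random.default_rng(0)
total = rng.integers(0, 101, size=200)               # total test scores
item = (rng.random(200) < total / 120).astype(int)   # item easier for strong students
print(uli(item, total))                  # classic ULI: better half vs. worse half
print(uli(item, total, 5, 5, 4))         # ULI54: best fifth vs. second-best fifth
```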

The ULI index can theoretically take values between −1 and 1, but negative values indicate a very gross error in the item (or an error in the key) and are rare in practice. A ULI equal to one means that all of the better students answer the item correctly, while none of the worse ones do. Byčkovský and Zvára[2] state that:

  • for items with a difficulty between 0.2 and 0.3, or between 0.7 and 0.8, the ULI sensitivity should be at least 0.15,
  • for items with a difficulty between 0.3 and 0.7, the ULI sensitivity should be at least 0.25.
If the ULI value is lower, the item should be considered suspect.

In practice, items hovering around the stated limits are considered not ideal but tolerable. However, if the ULI value is very low (ULI < 0.1), the item should be checked to see whether it is well constructed and free of serious errors. If we are working with a finer division of the ability interval (as in the case of ULI54), an index value of around 0.1 can be perfectly fine. However, once the value of any ULI is close to zero, or even negative, the item is not working. A negative ULI value means that worse students did better on the item than better students, so there may be something in it that leads the better students down the wrong track; for example, they may be looking for a catch in the question. A negative ULI can also indicate an error in the key by which the item is scored. Such an item must either be corrected or removed from the test.

An interesting problem is the methodology of dividing the ability interval into smaller parts. It may happen that the interval cannot be divided "automatically" in a completely ideal way, e.g. because a large group of students with the same result sits on the boundary between two groups. In practice, it turns out that the way such boundary cases are handled matters little for getting an idea of the item's sensitivity: even if the disputed group is split arbitrarily at the boundary between the intervals, the resulting ULI usually gives a very good picture of the item's behavior.

Some papers use a different division of the ability interval. For example, an examiner divides students into three groups based on their test scores. The division of test takers into an "upper third" and a "lower third" is often used, but studies have shown that when the "upper" and "lower" groups each contain 27 % of students, the discrimination value increases[3]. The 46 % of students with an average test score then do not enter the calculation of the discrimination index at all. This practice is followed, for example, by the Rogō testing system, which calculates the ULI from the bottom and top 27.5 % of students.
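
The same idea with the 27 % tail groups can be sketched as follows (an illustration only, with our own naming; Rogō itself works with the top and bottom 27.5 % and its exact implementation may differ):

```python
import numpy as np

def upper_lower_discrimination(item_scores, total_scores, tail=0.27):
    """Discrimination as the success-rate difference between the top and bottom `tail`."""
    item_scores = np.asarray(item_scores, dtype=float)
    order = np.argsort(total_scores)             # examinees sorted by total score
    k = max(1, int(round(tail * len(order))))    # size of each tail group
    p_lower = item_scores[order[:k]].mean()      # weakest `tail` of examinees
    p_upper = item_scores[order[-k:]].mean()     # strongest `tail` of examinees
    return p_upper - p_lower                     # the middle ~46 % are ignored
```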

Visualization of Item Analysis Results

Examples of Items and Graphical Representation of Their Properties

Let's look at some examples of the behavior of items used in admission tests and their graphical representation.

Fig. 6.4.3 Visualization of item behavior. The item “A person hears sound in the range of ...”, is so easy that it practically does not distinguish between differently skilled students.

Example one:

A person hears sound in the range from
A) 16 to 20,000 Hz
b) up to 100,000 Hz
c) less than 16 Hz
d) more than 20,000 Hz

When reviewing this item, we could discuss a number of errors that the item exhibits. For example, the proposed distractors c) and d) do not have the “range” nature referred to in the question. However, let's see how the use of this item turned out in a real test.


Students were divided into fifths according to their overall result on the test. Correct answer probabilities were calculated for these fifths. We see that even the weakest students achieved over 90% success on this item. Students in the better groups approached 100% success. The item is so easy that it practically does not differentiate between better and worse students.

Fig. 6.4.4 Visualization of item behavior. The item "The energy of a photon is..." is rather difficult. It is very difficult for the weakest students and does not distinguish between them, but it distinguishes very well between the very good and the best students. We also see a significant difference between the weakest and the best students. The item can be very useful in the test.

Example two:

The energy of a photon is
a) inversely proportional to frequency.
b) directly proportional to the wavelength.
C) directly proportional to frequency.
d) independent of wavelength.

The methodology is the same as in the previous example: again, we divided the students into five equal groups according to their overall performance on the test. Note that the last fifth covers a range of 40 points on a 100-point test, so the test as a whole was quite difficult. This particular item behaves similarly. Its maximum discrimination is between the fourth and the fifth group. Items that differentiate at the "difficult" end of the spectrum tend to be quite valuable and are not easy to write. This behavior was a surprise for this item, as the reviewers had predicted it would be easy.

Analysis of Distractors

The analysis of how the offered options, i.e. the correct answer (key) and above all the incorrect options (distractors), contribute to the quality of a multiple-choice question is referred to as distractor analysis. We are trying to find out whether the distractors are sufficiently attractive for the students and what proportion of students chose each of them.

Let's look at the visualization of distractor analysis on a concrete example. On a 70-problem test, students were asked how methanol is formed:

Fig. 6.4.5 Distractor analysis.
What reaction can form methanol?
a) Oxidation of carbon monoxide.
b) Oxidation of methanal.
C) Reduction of formaldehyde.
d) Oxidation of methyl aldehyde.

The authors of the test marked option C as the correct answer.

The students were divided into five groups according to how many points they scored on the whole test. The gray bars indicate what proportion of students in each of these five groups answered the item correctly. We see that the weakest group of students (the leftmost gray column) answered correctly much less often than the group with the best total score (the rightmost column). The height difference between the last and the first gray column, ULI51 = 0.7, shows that the item discriminates well between the best and the worst students, although whether or not it is truly well constructed is debatable. Even the height difference between the fifth and the fourth column, ULI54 = 0.14, is satisfactory and indicates good discrimination between the best and the second-best students. So the item as a whole works very well.

Now let's look at how the offered options function. Their behavior is described by the colored dashed lines, which show, for each group of (similarly successful) students, how likely these students are to choose the given answer. The red line (distractor A) is practically an unacceptable choice for all groups of students: only in the weakest group do about 12 % of students choose this option, and otherwise practically no one does. The blue-green line of the correct answer (key C) rises continuously across the whole ability interval, which indicates that this response is properly constructed. In the weakest group, students choose answer C with the same probability as the other two distractors, so, apart from the unattractive distractor A, students in the weakest group are essentially guessing. This is again a sign of a well-differentiating item. While distractor D (dark blue line) decreases monotonically over the entire ability interval, indicating that it works properly, distractor B (yellow line) first increases a little as the students' ability increases and only then begins to decrease; practically none of the best students choose it. The fact that the decline is not monotonic, however, means that students in the second weakest group think about the option in a way the author did not anticipate. In this item, the authors used three different names for the same substance: formaldehyde, methanal and methyl aldehyde. The first two are fairly common. The second weakest group probably contained many students who knew that methanol can be created by a simple reaction from formaldehyde (methanal), but only guessed whether that reaction is oxidation or reduction.
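
The numbers behind a plot like Fig. 6.4.5 can be obtained with a short sketch along the following lines (our own naming and assumptions: a vector with the option letter chosen by each student and a vector of total test scores):

```python
import numpy as np

def distractor_table(chosen_options, total_scores,
                     options=("a", "b", "c", "d"), n_groups=5):
    """For each ability group, the proportion of students choosing each option."""
    chosen_options = np.asarray(chosen_options)
    order = np.argsort(total_scores)             # ascending by total score
    groups = np.array_split(order, n_groups)     # groups[0] = weakest fifth
    return {opt: [float((chosen_options[g] == opt).mean()) for g in groups]
            for opt in options}

# toy example: 10 students, their chosen option and total score
chosen = ["a", "b", "c", "c", "b", "d", "c", "c", "b", "c"]
totals = [12, 25, 33, 41, 47, 55, 61, 70, 82, 95]
print(distractor_table(chosen, totals))
```

Plotting the row for the key as the gray columns and the remaining rows as dashed lines then reproduces the kind of picture discussed above.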

Let us now consider distractor analysis for a non-functional item. On a 100-item test, students were asked about noble gases:

Fig. 6.4.6 Distractor analysis.
Noble gases
A) They are rarely present in nature and form almost no compounds
B) At least one is used in medicine
c) They are inert, but otherwise normal gases with a diatomic molecule such as hydrogen, for example
d) They are always heavier than air

If we look at the height of the gray bars, we can see that the best students perform the worst on this item. The correct answer should be the simultaneous choice of options A) and B). While the probability that a student chooses option B) increases with ability, this is not the case for option A): students in the two worst groups choose this answer, but after that the probability of choosing it drops steeply.

Answer A) contains a fundamental problem that completely devalues the item. If we examine it, we see that it contains several errors. It is not one statement but a combination of two: "Noble gases are rarely present in nature." and "Noble gases form almost no compounds." The relativizing expressions "rarely" and "almost no" are problematic, because they make the decision about whether the option is correct depend on a purely subjective point of view. An even bigger problem is the meaning of "in nature": the author probably meant the biosphere, while the gifted students probably imagine "nature" as the universe, and in that view the statement is not true. The remaining two distractors (c, d) work correctly, but this can no longer save the item.

If an author happens to write such an item, it should not pass review. Distractor analysis, thanks to its objective perspective, is then the last chance to correct the author's and reviewers' omissions and to remove the item from the test before it is scored.

In order to interpret distractor analyses well, one needs test data from a sufficiently large set of students. While the success rate of the individual groups on the item (the gray columns) is relatively stable, because it reflects the data of all students in the group, the individual distractors are no longer chosen by the whole group and are therefore much more sensitive to random "noise". If the depiction of distractor behavior is to retain reasonable explanatory power, there should be more than a few hundred people in the whole tested group (when it is divided into five subgroups). With smaller numbers, a division into fewer subgroups can be used, in the extreme case only two (two gray columns). We lose the nuance of the detailed view, but the result is less affected by random effects.

A distractor is considered functional (plausible) if it is chosen by at least 5 % of the tested group. Designing sufficiently attractive distractors can be quite difficult, partly because the teacher may no longer be able to imagine what is difficult for students and what is not. When creating new items, the teacher can make use of earlier, preferably formative, testing in which students are presented with a similar item in open, short-answer form; the distractors for the multiple-choice question are then created from the students' incorrect answers.

Graphic Preview of the Overall Test Results

Two-color graph

For quick orientation in how well the test was compiled, a two-color graph is a useful part of the item analysis. In the literature it is also called a difficulty-discrimination plot, or "DD-plot" for short. On the horizontal axis, the items are ordered by difficulty, from easiest to hardest. For each item, the red bar shows its difficulty and the blue bar its sensitivity. At first glance, we can spot "oddly" behaving items, whose sensitivity is small or even negative, and subject them to a more detailed analysis to determine the cause of the anomaly.

Fig. 6.4.7 A two-color graph (difficulty-discrimination plot, DD-plot) shows the test items sorted by difficulty (height of the red bar). For each item, its discrimination is also plotted (blue bars). The horizontal dashed line shows the limit (20 %) below which the discrimination of a functioning item should not fall. Item #12 is very easy and its discriminative power is very low. Item #20 is very difficult and its discriminative power is very small and, moreover, negative, i.e. better students answer it worse than weaker ones; the item probably contains some other problem the author was not aware of and must be excluded from the test.

We get a different, perhaps even more illustrative form of the graph if we plot the discrimination of items (on the vertical axis) depending on their difficulty (on the horizontal axis).

Fig. 6.4.8 For the same 20-item test, item discrimination (on the vertical axis) is plotted against item difficulty (on the horizontal axis). The test is rather difficult and the discrimination of the items in it is rather below average. Both suspect items (#12 and #20) stand out clearly in this representation.
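
Both forms of the graph can be drawn, for example, with matplotlib. The sketch below is only an illustration under our own conventions; it assumes that the difficulty and discrimination of each item have already been computed (in percent, e.g. the difficulty value Q and ULI × 100):

```python
import numpy as np
import matplotlib.pyplot as plt

def dd_plots(difficulty, discrimination, limit=20.0):
    """Two-color bar DD-plot and the discrimination-vs-difficulty scatter plot."""
    difficulty = np.asarray(difficulty, dtype=float)
    discrimination = np.asarray(discrimination, dtype=float)
    order = np.argsort(difficulty)               # items from easiest to hardest
    x = np.arange(len(order))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))

    # two-color graph: red = difficulty, blue = discrimination, sorted by difficulty
    ax1.bar(x - 0.2, difficulty[order], width=0.4, color="red", label="difficulty")
    ax1.bar(x + 0.2, discrimination[order], width=0.4, color="blue", label="discrimination")
    ax1.axhline(limit, linestyle="--", color="gray")   # 20 % discrimination limit
    ax1.set_xticks(x)
    ax1.set_xticklabels([str(i + 1) for i in order])   # original item numbers
    ax1.legend()

    # scatter variant: discrimination plotted against difficulty
    ax2.scatter(difficulty, discrimination)
    for item_no, (d, u) in enumerate(zip(difficulty, discrimination), start=1):
        ax2.annotate(str(item_no), (d, u))
    ax2.axhline(limit, linestyle="--", color="gray")
    ax2.set_xlabel("difficulty [%]")
    ax2.set_ylabel("discrimination [%]")

    fig.tight_layout()
    plt.show()
```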

Rit and Rir Indices

To assess the sensitivity of an item, it is also possible to use the correlation coefficient between the score obtained on the item and the score on the entire test, which is called Rit (item-test correlation), or the correlation coefficient between the item and the rest of the test, Rir (item-rest correlation).

The Rit coefficient is calculated as the point-biserial correlation coefficient between the item score and the total test score. It tells us to what extent the item helps to separate the successful test takers from the rest; in other words, it reflects the item's discriminating power against the test as a whole. Positive values close to 1 mean that students successful on the given item were also successful on the test overall. Negative values mean that students who answered the item correctly achieved a rather low score on the rest of the test. The correlation thus indicates whether the item measures the same construct as the rest of the test; if the test covers several topics, this should be taken into account when interpreting the coefficient. The Rir value is similar to Rit but more accurate, because the item's own contribution to the total score is excluded. Rir is always slightly lower than Rit.
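
Both coefficients can be sketched in a few lines (our own naming; for items scored 0/1 the Pearson correlation computed here coincides with the point-biserial correlation):

```python
import numpy as np

def rit_rir(scores):
    """Rit (item vs. total score) and Rir (item vs. total score without the item)."""
    scores = np.asarray(scores, dtype=float)     # examinees x items
    total = scores.sum(axis=1)
    rit, rir = [], []
    for j in range(scores.shape[1]):
        item = scores[:, j]
        rit.append(np.corrcoef(item, total)[0, 1])         # item-test correlation
        rir.append(np.corrcoef(item, total - item)[0, 1])  # item-rest correlation
    return np.array(rit), np.array(rir)
```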

Recommendations exist for numerical values of the Rit correlation, similar to those for the ULI index:

  • Avoid questions with a Rit value below 0.20.
  • Always interpret Rit together with the difficulty index P.

Although discrimination assessment using ULI is more common, CERMAT, for example, uses Rit when analyzing items on tests of high importance[4].



Links

References

  1. Položková analýza testů studijních předpokladů jako součást zkvalitňování procesu přijímání na vysokou školu. In: MAIEROVÁ, Eva, Lenka ŠRÁMKOVÁ, Kristýna HOSÁKOVÁ, Martin DOLEJŠ and Ondřej SKOPAL. PHD EXISTENCE 2015: česko-slovenská psychologická konference (nejen) pro doktorandy a o doktorandech. Olomouc: Univerzita Palackého v Olomouci, Filozofická fakulta, 2015, pp. 75-84. ISBN 978-80-244-4694-3.
  2. BYČKOVSKÝ and ZVÁRA. (Citation not defined in the source.)
  3. COHEN, Ronald Jay, Mark E. SWERDLIK and Edward STURMAN. Psychological Testing and Assessment: An Introduction to Tests and Measurement. McGraw-Hill Education, 2012. 752 pp. ISBN 9780078035302.
  4. Hodnotící zpráva Matematika+ 2018: Pokusné ověřování obsahu, formy, organizace a hodnocení výběrové zkoušky ze středoškolské matematiky. CERMAT: Centrum pro zjišťování výsledků vzdělávání [online]. Praha, 2018 [cited 2021-11-16]. Available at: https://data.cermat.cz/files/files/Matematika/MA-PLUS_hodnotici_zprava_2018.pdf