Book (2022)/Reviews of Test Items

In the test preparation process, especially for high-stakes exams, checking items through expert review before they are used in a live test (a so-called panel review) plays an irreplaceable role. While a fourth-grade science quiz that a teacher uses to gauge students' knowledge may not need to be assessed by other teachers, a test that is part of an entrance exam or a professional certification exam does. Its items go through several levels of independent review before the first test taker ever sees them.

The item review is divided into several phases, each focused on a specific area. Its objective is to reveal the shortcomings that items and tests usually contain in their initial form. The motivation is to ensure correctness, optimize the test and eliminate subjective influences. Even though the review is initially somewhat demanding in time and organization, its benefit is undeniable and grows with the importance of the test. After completing all the revisions listed below (content review, fairness review, editorial review), the author team should go through the final form of the individual items once more and approve all the changes made.

Why is it necessary to check the items and the whole test?

Test items are part of a tool that we use to measure a certain skill of the test takers. Checking the items' correctness, precise wording and internal consistency makes the test a better measurement tool and reduces the likelihood that the test will be unfair or that any of the participants will complain about it or its individual items.

Who should check the items?

This can vary widely depending on the importance of the test. For tests of minor importance, one additional reviewer is more than sufficient: you simply ask a colleague to take the test and check it. For high-stakes exams, such as entrance exams, graduation tests, etc., each item must be reviewed by several reviewers with clearly assigned roles. The reviewing experts must be experts in the given area and at the same time familiar with the population being tested.

What do the reviewers check?

It depends on the type of reviewer and their role in the review process. Testing institutions often create checklists for reviewers to follow. A reviewer may check that the item stem is well worded, that it is not grammatically instructive and does not make it easier to pick the right answer, that the key is correct and the distractors incorrect, and that all options are of comparable length. The reviewer can also check the correctness of punctuation, the proper use of superscripts and subscripts, and compliance with writing conventions for variables and units.

How is the review work organized?

Although the review form (checklist) can be in paper form, it is more common for it to be in electronic form. It is often directly integrated into the item bank, so items do not leave the bank's secure environment even during review. The test administrator can check the status of reviews in the item bank and motivate reviewers to perform better.
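The review-status tracking described above can be sketched as a minimal data structure. This is an illustrative sketch only; the class and field names are hypothetical and not taken from any real item-bank software.

```python
from dataclasses import dataclass, field
from enum import Enum

class ReviewPhase(Enum):
    CONTENT = "content"
    FAIRNESS = "fairness"
    EDITORIAL = "editorial"

@dataclass
class Review:
    phase: ReviewPhase
    reviewer: str
    approved: bool
    comment: str = ""

@dataclass
class Item:
    item_id: str
    stem: str
    reviews: list = field(default_factory=list)

    def pending_phases(self):
        """Phases that no reviewer has approved yet."""
        done = {r.phase for r in self.reviews if r.approved}
        return [p for p in ReviewPhase if p not in done]

# An administrator checks which items still await review.
item = Item("Q1", "What is the capital of Moravia?")
item.reviews.append(Review(ReviewPhase.CONTENT, "reviewer_a", True))
print([p.value for p in item.pending_phases()])  # fairness and editorial still pending
```

A query like `pending_phases()` is what lets the test administrator see, at a glance, which items are holding up the test-assembly process.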

The review process hinges on cooperation, much like the preparation of a complete test program. Several participating experts independently assess the suitability of individual items and work together to eliminate all shortcomings that could hinder practical use. Teamwork plays a crucial role in reviewing tests and test items.

The process of reviewing the items and the test itself can be divided into three phases, through which the reviewer is guided by the item reviewer form (discussed in more detail below).

Content Review

Are the answers worded correctly and accurately? Are the distractors free of ambiguity?

As part of the content review, it is highly recommended that both the co-authors of the test and independent experts who were not involved in creating the items check the questions and the answer options. A writer's subjective view may have resulted in an ambiguous, i.e. incorrectly worded, test item whose use would reduce the value of the test.

For most educators, creating plausible wrong answers (distractors) tends to be a particularly difficult activity. In general, distractors should not be meaningless statements or absurd possibilities that the examinee will automatically exclude; on the contrary, they should force the examinee to think and eliminate them only after logical reasoning. Multiple true-false (MTF) items are particularly susceptible to ambiguous distractor wording.

Other types of content deficiencies arise for other question types. Single best answer (SBA) questions must be reviewed to ensure there is expert consensus on which answer is clearly the best.

If a teacher faces the task of creating many test items and, in addition to the correct answers, also has to design a number of suitable distractors, they can help themselves by first assigning the new items to students as short-answer questions in formative testing. When generating responses, students will often produce highly functional and attractive distractors.

In general, the following content checks are especially recommended:

  • to check the accuracy of the wording of the assignment/item stem,
  • whether the options in each item are formulated so that, under no interpretation or considered circumstance, a distractor could be the correct answer or the key could be incorrect (this applies especially to MTF),
  • whether the items in the test correspond to the test plan (blueprint).
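The last check, agreement with the test blueprint, lends itself to a simple automated comparison of item counts per topic. The blueprint and item metadata below are invented for illustration; real item banks store this information in their own formats.

```python
from collections import Counter

# Hypothetical blueprint: required number of items per topic.
blueprint = {"mechanics": 4, "optics": 3, "thermodynamics": 3}

# Topics of the drafted items (assumed metadata from the item bank).
item_topics = ["mechanics"] * 4 + ["optics"] * 2 + ["thermodynamics"] * 4

def blueprint_gaps(blueprint, item_topics):
    """Return topics where the drafted test deviates from the blueprint
    (negative = items missing, positive = too many items)."""
    counts = Counter(item_topics)
    return {topic: counts.get(topic, 0) - required
            for topic, required in blueprint.items()
            if counts.get(topic, 0) != required}

print(blueprint_gaps(blueprint, item_topics))  # {'optics': -1, 'thermodynamics': 1}
```

A report like this tells the review coordinator that one optics item is missing while thermodynamics is over-represented, before any reviewer reads a single stem.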

Editorial Review

Are the questions sufficiently comprehensible, typographically uniform and without typographical errors?

An editorial review may at first glance appear not very time-consuming, but in practice it can be more complicated. It is necessary to go through all the test items and verify that they are sufficiently readable, comprehensible, and formally and typographically uniform. It is advisable to rework complex sentences, double negatives and convoluted questions into a simpler form so that the student cannot get lost in the wording. The item assignment and the offered options should be constructed as clearly as possible. Because the uniformity and style of item writing vary among test writers, both terminology and typography are homogenized in this phase of the review. Grammatical correctness is an integral part of checking any text, and test items are no exception: eliminating all grammatically incorrect or questionable expressions should be the final stage of the editorial review.

In practice, a single review turns out to be absolutely insufficient. The ideal of 5-7 reviews is hard to achieve with limited funds, but 3 reviews seem like a workable minimum. Often, only one of the reviewers draws attention to a given problem, so the person processing the reviews must pay close attention to every suggestion in order not to overlook a possible flaw.

Example: The editorial review can also reveal grammatically or graphically instructive wording of questions (a so-called suggestive assignment). In the Czech original of the following item, the grammatical gender required by the wording of the stem agrees with only some of the offered place names, helping the test taker narrow down the options; in English the cue is lost in translation: Jan Amos Komenský's birthplace was:

  1. Uherský Brod
  2. Nivnice
  3. Komňa
  4. Brno

Item Reviewer Form

From a practical point of view, it is beneficial to provide the reviewers with a form that guides them through the review of the test items. Answering the individual questions on the form forces the reviewer to engage with the test item from all the points of view the form covers. It is not absolutely necessary that each test item pass on all monitored parameters; however, the reviewer should register and comment on any deviations. Below is an example of such a form for item reviewers.

Table 3.7.1 Review of a single best answer question

Item assignment: ____________
Reviewer: ____________

For each criterion, mark Yes ✓ or No ✗ and add a comment where needed:

  • Tests essential knowledge.
  • Corresponds to the topic according to the test plan.
  • Tests the application of knowledge, not just recall of isolated facts.
  • Corresponds to the required level of knowledge.
  • The assignment is clearly formulated.
  • The assignment does not contain trick elements (e.g. a double negative).
  • An expert would think of the correct answer even without seeing the options offered.
  • The distractors are homogeneous.
  • The wording of the options does not indicate the correct answer.
  • None of the options is disproportionately difficult.
  • It does not take the form of "which statement is correct" or "all statements are correct except".
  • It does not contain the words "always", "usually", "rarely", "never", etc.
  • One of the offered options is the best.
  • The options are sorted alphabetically or in some other logical order.
  • The options are of similar length and content.
  • The options are compatible with the question.
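When the form is kept electronically, summarizing the reviewer's objections is a one-liner. The representation below is a hypothetical sketch, not a real form format: each criterion from Table 3.7.1 maps to a (passed, comment) pair, and the invented comments are for illustration only.

```python
# Hypothetical completed reviewer form: criterion -> (passed, comment).
completed_form = {
    "Tests essential knowledge.": (True, ""),
    "The assignment is clearly formulated.": (True, ""),
    "The distractors are homogeneous.": (False, "Option D is a joke answer."),
    "The options are of similar length and content.": (False, "Key is much longer."),
}

def deviations(form):
    """List the criteria the reviewer marked 'No', with their comments."""
    return [(criterion, comment)
            for criterion, (passed, comment) in form.items()
            if not passed]

for criterion, comment in deviations(completed_form):
    print(f"NO: {criterion} ({comment})")
```

Extracting only the deviations matches the text's advice: an item need not pass every check, but every failed check must surface with the reviewer's comment attached.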

Fairness Review

Do the items only measure the specific knowledge or skill required and nothing else?

Every item, every test, should test the required knowledge, know-how or skill and nothing else. By definition, the fairness of a test is the extent to which conclusions drawn from test results are valid for different groups of test takers.

If answering a question requires knowledge or skills that, for whatever reason, were not comparably available to all test takers, i.e. if not all of them had the same opportunity to acquire them, the item is not fair. Such a question is easier for the group of students who were advantaged in some way and, conversely, harder for another group disadvantaged through no fault of their own. An example is the excessive use of technical terms or complex sentence constructions that may not be understandable to everyone: although the author wanted to verify certain knowledge, they are inadvertently also testing language proficiency and command of professional terminology. Another complication in this context is testing the students' attention through "tricks in the question", such as double negatives and the like.

The item should not favor any group based on age, sex, origin, social and economic status, religion, race, native language, etc. Since the breakdown into groups is not restricted in any way, it is not realistic to examine fairness for all possible groups in the testing participant population. It is therefore recommended to examine fairness towards those groups that experience or research has shown might be adversely affected. These are often groups that have been discriminated against based on factors such as ethnicity, disability, gender, or native language. Students from different groups with the same level of knowledge should be equally likely to answer the question correctly.
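The last sentence above suggests a simple empirical check: among test takers with the same total score (a rough proxy for equal ability), compare the share answering the item correctly in each group. This is only a minimal sketch of the idea with invented data; a proper differential-item-functioning analysis, as the text notes later, is covered in the chapter on item analysis.

```python
from collections import defaultdict

# Invented response records: (group, total_score, answered_item_correctly).
responses = [
    ("A", 8, True), ("A", 8, True), ("A", 8, False),
    ("B", 8, True), ("B", 8, False), ("B", 8, False),
    ("A", 5, False), ("B", 5, False),
]

def correct_rate_by_group(responses, score):
    """Proportion correct per group, restricted to one ability level."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, s, ok in responses:
        if s == score:
            totals[group] += 1
            correct[group] += ok
    return {g: correct[g] / totals[g] for g in totals}

# Among test takers with the same total score, group A succeeds
# noticeably more often: a warning sign worth investigating.
print(correct_rate_by_group(responses, score=8))
```

A gap like this does not prove the item is unfair, but it flags it for the kind of closer fairness review described above.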

Basic recommendations and rules for the creation of test items and tests with respect to the fairness of the items are given, for example, in the ETS Standards for Quality and Fairness[1]. These standards recommend verifying that test items:

  • Are not offensive or controversial
  • Do not reinforce stereotypes of any groups
  • Are free of racial, ethnic, gender, socio-economic and other forms of bias
  • Do not have content that would be considered inappropriate or offensive to any group

The unfairness of an item can often be revealed by a thorough fairness review of the assignment itself. Sometimes, however, even an experienced reviewer fails to detect it. This is why, when analyzing test results, we also examine differential item functioning, as we will show in the chapter dedicated to item analysis.



References

  1. Educational Testing Service. ETS Standards for Quality and Fairness [online]. Educational Testing Service, 2014. Available from <https://www.ets.org/about/fairness>.