Book (2022)/Automation of Test Item Creation

As computer-assisted testing becomes more widespread, and especially with the development of adaptive testing, methods that could simplify the creation of test items are attracting attention. In the traditional approach to test construction, each item is created by subject-matter specialists. First the author writes the item, then other experts review it, and then the teacher tries it out in a pilot test and revises it according to the results. Only then is the item used for actual testing. The whole process is long and expensive, so it is becoming ever harder to meet the growing demand for test items^[1]. Automatic item generation (AIG) could save considerable time and resources and is therefore being researched intensively. Some approaches to this problem have already reached the stage of practical testing.

In the first approach, automatic item cloning, the process can be divided into two steps. Test item writers first create item models that serve as templates. In doing so, they try to distill the essence of the item, that is, what is fundamental for demonstrating the knowledge being tested. Interchangeable terms are then proposed for suitable places in these templates (often by machine, e.g. using synonym dictionaries). Using this set of interchangeable terms, an algorithm turns each template into a group of related items by generating all possible combinations. The resulting items are “new” but not independent: no more than one item from each clone group can be used in a particular test administration. In addition, some combinations are nonsensical or implausible and must be excluded^[2], ^[3], ^[4].
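The two-step cloning procedure can be illustrated with a minimal Python sketch. The template, the placeholder names, the term lists, and the plausibility rule below are all hypothetical and chosen only for illustration; they do not come from the cited sources, and a real item model would need rules written by subject-matter experts.

```python
from itertools import product

# Hypothetical item model: a stem with named placeholders.
TEMPLATE = (
    "A {age}-year-old {sex} presents with {symptom}. "
    "What is the most likely diagnosis?"
)

# Hypothetical interchangeable terms for each placeholder.
SLOTS = {
    "age": [25, 70],
    "sex": ["man", "woman"],
    "symptom": ["chest pain", "pelvic pain in pregnancy"],
}


def is_plausible(values):
    """Reject nonsensical combinations (illustrative rule only)."""
    if values["symptom"] == "pelvic pain in pregnancy":
        return values["sex"] == "woman" and values["age"] < 50
    return True


def generate_clones(template, slots, keep=is_plausible):
    """Fill the template with every combination of terms that passes `keep`."""
    names = list(slots)
    clones = []
    for combo in product(*(slots[name] for name in names)):
        values = dict(zip(names, combo))
        if keep(values):
            clones.append(template.format(**values))
    return clones


clones = generate_clones(TEMPLATE, SLOTS)
for stem in clones:
    print(stem)
```

Of the eight possible combinations, the illustrative rule keeps five. All five belong to one clone group, so at most one of them should appear in any single test.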

It remains a matter of discussion whether item banks should contain only the resulting clones, or the source templates together with the variable components of the items. The aim is to obtain items whose psychometric characteristics can be estimated from the known results of another item in the same clone group. Because the creation process is demanding and forces authors to clarify the essence of each item, items created in this way are often of surprisingly high quality. Note that the usability of machine-generated variants also depends on the language. In Czech, for example, with its complicated grammar, machine generation of variants would be extremely difficult.
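One of the two storage options, keeping templates and their variable components in the bank, can be sketched as follows. This is only an assumed design, not one described in the cited sources: the class `ItemModel`, the function `assemble_test`, and the arithmetic templates are hypothetical. Because each test draws distinct item models and renders one variant from each, the rule that at most one clone per group appears in a test holds by construction.

```python
import random
from dataclasses import dataclass


@dataclass
class ItemModel:
    """A stored template plus its variable components (hypothetical structure)."""
    model_id: str
    template: str
    slots: dict  # placeholder name -> list of allowed values

    def render(self, rng):
        # Pick one value per placeholder. A real system would also have to
        # exclude implausible combinations at this point.
        values = {name: rng.choice(options) for name, options in self.slots.items()}
        return self.template.format(**values)


def assemble_test(bank, n_items, rng=None):
    """Return (model_id, item text) pairs, using each item model at most once."""
    rng = rng or random.Random()
    if n_items > len(bank):
        raise ValueError("not enough item models for the requested test length")
    chosen = rng.sample(bank, n_items)  # sampling without replacement
    return [(model.model_id, model.render(rng)) for model in chosen]


BANK = [
    ItemModel("add", "What is {a} + {b}?", {"a": [2, 4, 6], "b": [3, 5]}),
    ItemModel("sub", "What is {a} - {b}?", {"a": [9, 8], "b": [1, 2, 3]}),
    ItemModel("mul", "What is {a} x {b}?", {"a": [2, 3], "b": [4, 5]}),
]

test_form = assemble_test(BANK, 2, random.Random(0))
for model_id, text in test_form:
    print(model_id, text)
```

Storing only the finished clones would instead require tagging each clone with its group and filtering at assembly time; the trade-off is that clones can carry their own calibration data, while templates cannot.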

A similar procedure was also tested in an effort to create tests with comparable items for verifying reliability by the test-retest method. Modifying items that originally worked well by changing the interchangeable terms was shown to produce items of greater difficulty^[5]. This somewhat undermines the original assumption that cloned items have the same psychometric parameters as the original. It is therefore questionable whether the whole process is worthwhile if the newly created items still need to be calibrated.

The second approach is emerging from the first papers on the use of artificial intelligence to automate test item creation. A model procedure was demonstrated at a workshop at the meeting of the European Board of Medical Assessors in Braga, Portugal. A group of test writers was asked to create a cognitive map for a given topic (abdominal pain). A cognitive map describes the problem step by step in terms of elements (e.g., age, gender, context, vital signs, cause, diagnosis), each of which can take a set of different values. Experienced test developers need several hours to create a cognitive map. A computer then generates a set of items representing different combinations of the elements of the cognitive map. During the workshop, this combining of elements was done in Excel. Deploying a similar system could make life easier for test writers in the future^[6]. The drawback of this approach is that creating a cognitive map is time-consuming and expensive. A paper on automated item generation in first grade mathematics shows that automated generation is cost-effective (compared to traditional item writing) if more than 200 items can be generated from a single cognitive model^[7].
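The combining of cognitive map elements, and the break-even reasoning behind the cost argument, can be sketched in Python. Everything here is an assumption made for illustration: the element values, the function names, and the simple cost model (a one-off cost for the cognitive model plus a per-item cost) are hypothetical, and the numbers are invented rather than taken from the cited paper.

```python
import math
from itertools import product

# Hypothetical fragment of a cognitive map for abdominal pain:
# each element has a set of possible values.
COGNITIVE_MAP = {
    "age": ["25", "60"],
    "gender": ["male", "female"],
    "vital signs": ["normal", "fever and tachycardia"],
}


def enumerate_cases(cognitive_map):
    """Return every combination of element values as a dict."""
    names = list(cognitive_map)
    return [dict(zip(names, combo)) for combo in product(*cognitive_map.values())]


def break_even_items(model_cost, cost_per_generated_item, cost_per_traditional_item):
    """Smallest n with model_cost + n * generated < n * traditional, or None.

    Assumed cost model: one-off cost of the cognitive model plus a fixed
    cost per generated item, compared with a fixed cost per traditional item.
    """
    saving = cost_per_traditional_item - cost_per_generated_item
    if saving <= 0:
        return None  # generation never pays off under this model
    return math.floor(model_cost / saving) + 1


cases = enumerate_cases(COGNITIVE_MAP)
print(len(cases), "cases, e.g.", cases[0])

# Invented costs in arbitrary units, for illustration only.
print(break_even_items(600, 2, 8))
```

With these invented costs the break-even point is 101 items; the more expensive the cognitive map is to build, the more items must be generated from it before the approach pays off.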

In the same year, another system was presented that uses artificial intelligence to mine data from a subject-specific bibliographic database and to produce item stems and suggested distractors. These draft items can serve human writers as semi-finished material from which new items can be created more easily^[8].

References

[1] DRASGOW, Fritz, Richard M. LUECHT and Randy E. BENNETT. Technology and testing. In: BRENNAN, Robert L. Educational measurement. 4th ed. Praeger Publishers, 2006. 779 pp. Washington, DC: American Council on Education. ISBN 0275981258, 9780275981259.
[2] GIERL, Mark J. and Thomas M. HALADYNA. Automatic item generation: Theory and practice. 1st ed. New York: Routledge, 2012. 256 pp. ISBN 978-0-415-89750-1.
[3] GIERL, Mark J. and Hollis LAI. The Role of Item Models in Automatic Item Generation. International Journal of Testing [online]. 2012, 12(3), 273-298 [cited 2021-09-26]. ISSN 1530-5058. Available from: doi:10.1080/15305058.2011.635830
[4] GIERL, Mark J., Hollis LAI and Simon R. TURNER. Using automatic item generation to create multiple-choice test items. Medical Education [online]. 2012, 46(8), 757-765 [cited 2021-10-04]. ISSN 0308-0110. Available from: doi:10.1111/j.1365-2923.2012.04289.x
[5] FIŘTOVÁ, Lenka. Klonování úloh jako cesta k vyrovnání obtížnosti různých variant testu? [Item cloning as a way to equalize the difficulty of different test variants?] In: Konference Psychologická diagnostika. Brno: MUNI FSS, 2021.
[6] VAN DER VLEUTEN, Cees. Automatic Item Generation by Cees van der Vleuten [online]. Maastricht University, 2019 [cited 2021-10-04]. Available from: https://www.maastrichtuniversity.nl/news-events/newsletters/article/NyJydZFCFpcpCYHi4Fadew
[7] KOSH, Audra E., Mary Ann SIMPSON, Lisa BICKEL, Mark KELLOGG and Ellie SANFORD-MOORE. A Cost–Benefit Analysis of Automatic Item Generation. Educational Measurement: Issues and Practice [online]. 2018, 38(1), 48-53 [cited 2021-10-04]. ISSN 0731-1745. Available from: doi:10.1111/emip.12237
[8] VON DAVIER, Matthias. Training Optimus Prime, M.D.: Generating Medical Certification Items by Fine-Tuning OpenAI's gpt2 Transformer Model. arXiv, 2019, abs/1908.08594.
