Book (2022)/Security Analysis of Tests
In the event of a violation of academic integrity, test scores may not reflect the actual skills and knowledge of test takers. Forensic analysis of tests (educational data forensics, EDF) is the statistical analysis of test results aimed at detecting deviations that potentially indicate tampering, favoritism, or outright test fraud. When academic integrity is violated at the level of test administrators or item bank administrators, forensic analysis is practically the only tool that can systematically uncover such activity.
The analysis should answer questions of the following types:
- Questions focused on individuals
  - Is there anything unusual about this individual?
  - Did they answer every item with “C”?
  - Were they answering too quickly?
  - Did they spend 10 minutes on each of the first 5 items and skip the rest?
  - Did they get a high score in a suspiciously short amount of time?
  - Did they change a noticeable number of incorrect answers to correct ones?
- Questions focused on relationships between individuals
  - Are some participants' answers strikingly similar?
  - Were these participants sitting near each other? In the same classroom?
  - Does anything unusual appear when this person is compared with others?
  - Are there individuals sitting nearby whose answers are almost identical?
- Group-level questions
  - Are some schools or teachers performing unusually well?
  - Do some test centers have unusually high pass rates and short test times?
  - Are similarly answered tests concentrated in a certain group of test takers? What do the members of this group have in common?
  - Does any group of examinees answer the questions of one profile subject significantly better?
  - Does any group of examinees do significantly better on questions that are new or, conversely, old? Or newly reviewed? Reviewed by a single reviewer?
  - Are there significant differences between classrooms?
  - Are there significant differences between candidates from different rounds of the test?
Statistical Indications of Possible Fraudulent Conduct
There are many data forensics methods that can be used to detect cheating[1]. Statistical methods for detecting suspected irregularities include:
- Evaluating the similarity of responses between pairs of examinees. The simplest methods use descriptive statistics to summarize the number (or proportion) of jointly correct answers or common errors. For example, the responses in common index (RIC) is the number of questions on which two examinees gave the same answer (a minimal R sketch of this computation follows this list). More complex methods estimate the probability that the observed agreement could still have arisen by chance.
- Analysis of changed (erased) answers tracks the number of answers students changed on answer sheets or in testing software. An implausibly large number of changed answers in a class may indicate tampering (e.g. mass copying in the absence of supervision). The number of changes from a wrong answer to a correct answer (wrong-to-right changes) is an extremely strong indicator of fraudulent behavior[2], [3]; see the per-class screening sketch after this list.
- Analysis of predicted vs. actual performance: statistical analysis of test results from previous years can be used to predict current performance. Unexpectedly high aggregate results may indicate cheating, especially when large gains are not repeated in the following year; conversely, high scores confirmed in subsequent years are less suspicious. Genuine improvement due to better teaching is more gradual and long-term.
- Analyzing student responses: it should be considered suspicious if students miss a large number of easy questions while answering an improbable number of difficult questions correctly. Analysts can likewise look for other statistically significant similarities across tests.
- Comparison of scores between subjects: it is suspicious if there is a significant difference in results for subjects whose results are otherwise highly correlated, for example when students in one test room score improbably high in just one subject.
- Mismatch between test scores and prior academic performance: if students with poor prior academic performance score high on tests, this may indicate cheating. Machine learning approaches to detecting such anomalies are an innovative development in this area[4].
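As an illustration of the simplest similarity measure mentioned above, the following sketch computes the responses in common (RIC) index for every pair of examinees. It is a minimal base R example; the response matrix `answers` (rows = examinees, columns = items, cells = the selected option) is hypothetical.

```r
# Minimal sketch: responses in common (RIC) for all pairs of examinees.
# `answers` is a hypothetical character matrix: rows = examinees,
# columns = items, cells = the selected option ("A", "B", "C", ...).
ric_pairs <- function(answers) {
  n <- nrow(answers)
  out <- data.frame()
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      ric <- sum(answers[i, ] == answers[j, ], na.rm = TRUE)
      out <- rbind(out, data.frame(examinee1 = i, examinee2 = j, RIC = ric))
    }
  }
  # Sort so that the most similar pairs appear first.
  out[order(-out$RIC), ]
}
```

Pairs at the top of the resulting table are only candidates for closer inspection; whether the agreement is still plausible by chance requires the probability-based indices discussed above.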
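Wrong-to-right answer changes can be screened in a similarly simple way. The sketch below is a base R illustration under assumed data: a hypothetical data frame `erasures` with one row per student and columns `class` and `wrong_to_right`; it flags classes whose mean wrong-to-right count lies far above the overall distribution.

```r
# Hedged sketch: flag classes with an implausibly high average number of
# wrong-to-right answer changes. `erasures` is hypothetical data with
# columns `class` and `wrong_to_right` (count per student).
flag_classes <- function(erasures, z_threshold = 3) {
  class_means <- tapply(erasures$wrong_to_right, erasures$class, mean)
  z <- (class_means - mean(class_means)) / sd(class_means)
  names(class_means)[z > z_threshold]   # classes that stand out
}
```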
Forensic Test Analysis Tools
We are looking for ways to identify improbable patterns in test data that may indicate possible cheating. Few user-friendly software tools for data forensics exist.
PerFit
One strategy is to plot, for each student, the relative success in answering items ordered by difficulty. One would expect this curve to be a monotonically decreasing function of item difficulty, so significant deviations are easily recognizable. For this analysis we can use, for example, the PerFit package in R[5].
This is a “person-fit” analysis, which detects non-standard test performance for a given student with modest sensitivity (about 25%) and high specificity (about 90%). The cause does not have to be cheating (copying or knowing the questions in advance); it may also be random guessing, perhaps only on a certain part of the test, and so on. Although its sensitivity and specificity do not make it a cure-all, it can be a valuable way of extracting more from data that already exists anyway.
The package does not require any external data. It works with a students × items matrix containing only dichotomous item scores: 1 (correct) or 0 (incorrect). The tool itself calculates item difficulty and the probability of a correct answer for each student. The resulting graphs are based on the raw data from the given test; nothing more is needed.
The procedure is well suited to cases where everyone takes the same test, or where the data can be recalculated to a common form (e.g. everyone had the same items, only in a different order and with shuffled options). The path from the matrix to the graph is straightforward: two or three lines of code produce a graph for a given student, as in the sketch below.
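A minimal sketch follows; it assumes a hypothetical dichotomous 0/1 response matrix `resp` (rows = students, columns = items) and uses function names from the PerFit documentation[5], which should be checked against the installed package version.

```r
# Minimal sketch, assuming a hypothetical 0/1 response matrix `resp`
# (rows = students, columns = items); function names follow PerFit[5].
library(PerFit)

# Person response function plot for student 12: relative success against
# item difficulty; non-monotonic curves stand out visually.
PRFplot(resp, respID = 12)

# A person-fit statistic (Ht) with a resampling-based cutoff, used to
# flag students whose response patterns are unusual.
ht <- Ht(resp)
flagged.resp(ht, cutoff.obj = cutoff(ht))
```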
SIFT
SIFT (Software for Investigation of Fraud in Testing) is a tool that uses advanced statistical methods to investigate fraud in testing. It is provided free of charge (upon registration) by one of the leading suppliers of commercial testing systems – Assessment Systems Corporation (ASC). A user manual and sample data are available for the program; support is not included but can be purchased separately. SIFT calculates a number of indices pointing to different types of cheating (copying, teacher assistance, missed items, etc.) and can aggregate the results by grouping variables such as classroom or the test taker's location within that classroom. It supports all three areas of analysis – focused on individuals, on relationships between them, and on groups. SIFT provides objectively measured statistics for decision-making, but their interpretation in a given situation is up to the user[6].
CopyDetect
CopyDetect (Zopluoglu, 2016) is a package for the open-source R statistical programming language (R Core Team, 2013) that computes several answer-copying indices within and beyond the IRT framework. Among them are the Omega index introduced by Wollack[7], K indices[8], and S indices[9]. CopyDetect processes only one examinee pair at a time, so it is up to the user to write a routine for processing larger amounts of data; a hedged sketch of such a routine follows. Note that R packages are open-source, user-contributed software, so they should be approached with some caution.
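The sketch below illustrates one way to wrap the package in a loop over all pairs. It assumes the dichotomous-data entry point is called `CopyDetect1()` and takes a `pair` argument with two row indices; the exact function name and arguments differ between package versions, so check the installed documentation before use.

```r
# Hedged sketch: screen all pairs of examinees with CopyDetect.
# Assumes the package exposes CopyDetect1(data, pair = c(i, j)) for
# dichotomous data; verify the function name and arguments against the
# installed package version, as they may differ between releases.
library(CopyDetect)

screen_pairs <- function(resp) {
  n <- nrow(resp)
  results <- list()
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      # The package handles one suspected pair per call.
      results[[paste(i, j, sep = "-")]] <- CopyDetect1(resp, pair = c(i, j))
    }
  }
  results
}
```

Screening all pairs this way multiplies the number of statistical tests, so flagged pairs should be treated only as leads for further investigation, in line with the cautions below.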
Statistical methods allow us to express suspicion of unauthorized cooperation during a test, but we should draw conclusions with caution. Statistical procedures should not be the only evidence of copying, especially when used for general screening purposes. While it is clear that the higher the agreement between responses, the more likely it is that cheating has occurred, even a high level of agreement is not conclusive evidence of cheating. There is always a chance, however small, that matching answers are the result of honest test completion. On the other hand, if someone copies fewer than about 10% of the items, statistical methods cannot distinguish this from chance.
Links
References
- ↑ Cizek, Gregory J., and James A. Wollack (Eds.). Handbook of Quantitative Methods for Detecting Cheating on Tests. Abingdon: Routledge, 2016.
- ↑ Maynes, D. Educator cheating and the statistical detection of group-based test security threats. In Wollack, James A., and John J. Fremer (Eds.), Handbook of Test Security (pp. 187–214). New York: Routledge, Psychology Press, 2013. ISBN 978-0-203-66480-3.
- ↑ Ranger, J., Schmidt, N., & Wolgast, A. (2020). The Detection of Cheating on E-Exams in Higher Education – The Performance of Several Old and Some New Indicators. Frontiers in Psychology, 11, 568825. https://doi.org/10.3389/fpsyg.2020.568825
- ↑ Kamalov, F., Sulieman, H., & Santandreu Calonge, D. (2021). Machine learning based approach to exam cheating detection. PLoS ONE, 16(8), e0254340. https://doi.org/10.1371/journal.pone.0254340
- ↑ Tendeiro, Jorge N., Rob R. Meijer, and A. Susan M. Niessen. PerFit: An R Package for Person-Fit Analysis in IRT. Journal of Statistical Software [online]. 2016, 74(5), 1–27 [cited 2021-10-07]. ISSN 1548-7660. Available from: doi:10.18637/jss.v074.i05
- ↑ Thompson, Nathan. SIFT: A new tool for statistical detection of test fraud. Assessment Systems Corporation (ASC) [online]. 2016 [cited 2021-11-16]. Available from: https://assess.com/sift-new-tool-statistical-detection-test-fraud/
- ↑ Wollack, James A. A Nominal Response Model Approach for Detecting Answer Copying. Applied Psychological Measurement [online]. 1997, 21(4), 307–320 [cited 2021-10-06]. ISSN 0146-6216. Available from: doi:10.1177/01466216970214002
- ↑ van der Linden, Wim J., and Leonardo Sotaridona. Detecting Answer Copying When the Regular Response Process Follows a Known Response Model. Journal of Educational and Behavioral Statistics [online]. 2006, 31(3), 283–304 [cited 2021-10-06]. ISSN 1076-9986. Available from: doi:10.3102/10769986031003283
- ↑ Sotaridona, Leonardo S., and Rob R. Meijer. Two New Statistics to Detect Answer Copying. Journal of Educational Measurement [online]. 2003, 40(1), 53–69 [cited 2021-10-06]. ISSN 0022-0655. Available from: doi:10.1111/j.1745-3984.2003.tb01096.x