Application of machine learning methods in test security

doi:10.3724/SP.J.1042.2024.01814

Abstract

Abnormal response behavior in psychological and educational tests compromises the reliability of the test and the validity of the resulting scores. In academic achievement tests, such behavior can lead teachers to misjudge students' learning levels. Similarly, in questionnaires, it can undermine the reliability of the instrument and the interpretation of the results. These consequences pose a significant threat to test security and to the quality of the screening decisions made by test administrators. At present, the prevailing approach to test security relies on statistical methods. However, the growing diversity of testing formats and the large volumes of real-time process data they generate have introduced new considerations for test security. Integrating diverse detection processes with complex interactions poses a significant challenge for statistical methods, and the analysis of unstructured process data calls for novel approaches that extend beyond latent feature modeling.

The application of machine learning methods is becoming increasingly prevalent in psychological and educational measurement research. Machine learning algorithms can learn from data and make predictions or decisions about unknown events without explicit instructions. These algorithms offer several advantages over traditional methods. Firstly, they are not limited by specific theories or assumptions and are designed to identify generalizable predictive patterns. Secondly, they can jointly model all variables related to the participants as input features, thus utilizing all available information. Thirdly, the training of machine learning models is often based on real data, reducing the problem of misfit between statistical models and empirical data. Finally, machine learning algorithms are highly efficient and capable of modeling and analyzing large amounts of assessment data in real time.

The review was divided into three principal sections. First, machine learning algorithms were classified into three principal categories: supervised, unsupervised, and semi-supervised learning methods. Within these categories, three subcategories were further distinguished: ensemble learning, deep learning, and transfer learning. Each reviewed study was assigned to a broad category according to the underlying model it used. For each machine learning method, the theory was introduced first, and its applications were then reviewed. The test security issues addressed in these studies could be broadly classified into two categories: cheating in educational tests and careless responding in questionnaires. Second, we examined the applicability of the various machine learning methods across different test types and anomaly types. Third, we presented three practical recommendations for researchers and practitioners. (1) High-quality labeled data for test security studies are difficult to obtain. Three methods for obtaining labeled data are simulation, manual labeling, and the synthetic minority over-sampling technique (SMOTE). (2) Techniques for preprocessing the raw data include missing value imputation, data encoding, and feature scaling. (3) The selection of input features is also an important consideration. Finally, prospective avenues for future research were identified from the following perspectives: machine learning-based person-fit research, machine learning test security research based on multimodal data, test security research based on generative adversarial networks, and the interpretability of research results.
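
As an illustration of recommendation (1), the core idea of SMOTE (Chawla et al., 2002) can be sketched in plain Python: a synthetic minority-class case is placed at a random point on the line segment between an existing minority case and one of its nearest minority neighbors. This is a minimal sketch under stated assumptions, not the procedure of any reviewed study. The function name `smote`, its parameters, and the two toy features (proportion correct and mean log response time) are hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority-class samples by interpolation.

    minority: list of equal-length numeric feature vectors for the rare
    class (e.g. flagged examinees). Hypothetical helper, not a library API.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to it.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Choose one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): [proportion correct, mean log response time]
cheaters = [[0.95, 1.1], [0.90, 1.3], [0.97, 0.9], [0.92, 1.0]]
new_samples = smote(cheaters, n_new=6, k=2, seed=42)
for sample in new_samples:
    print([round(value, 3) for value in sample])
```

Because every synthetic case is an interpolation, it stays inside the range spanned by the observed minority cases. The sketch handles numeric features only; applied work would more likely rely on a maintained implementation such as the `SMOTE` class in the imbalanced-learn package.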

Key words: machine learning, psychological tests, educational tests, test security, statistics

CLC Number: B841

GAO Xuliang, LI Ning. Application of machine learning methods in test security[J]. Advances in Psychological Science, 2024, 32(11): 1814-1828.

References 67

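[1]	Han, D., Guo, Q., Wang, Z., & Chen, X. (2008). A review of psychometric research on identifying answer copying in examinations [in Chinese]. Advances in Psychological Science, 16(1), 175-183.
[2]	Hu, J., Huang, M., & Luo, F. (2020). Research progress on techniques for detecting test cheating: Detection of individual cheating [in Chinese]. Journal of China Examinations, (11), 32-36.
[3]	Huang, M., Pan, Y., & Luo, F. (2020). A two-stage cheating detection method combining information from multiple-choice and constructed-response items [in Chinese]. Journal of Psychological Science, 43(1), 75-80.
[4]	Liu, D., Luo, F., Tu, Z., Rao, S., & Shen, Y. (2024). Current status and challenges of artificial intelligence technology empowering the development of psychology [in Chinese]. Journal of Beijing Normal University (Natural Science), 60(1), 30-37.
[5]	Liu, Y., & Liu, H. (2021). New techniques for handling aberrant responses in psychological and educational tests: Mixture model methods [in Chinese]. Advances in Psychological Science, 29(9), 1696-1710. doi: 10.3724/SP.J.1042.2021.01696
[6]	Luo, F., Wang, X., Xu, Y., & Feng, W. (2020). Research progress on techniques for detecting test cheating: Detection of group cheating [in Chinese]. Journal of China Examinations, (11), 37-41.
[7]	Tong, H., Yu, X., Qin, C., Peng, Y., & Zhong, X. (2022). Person-fit research based on residual statistics in polytomously scored tests [in Chinese]. Acta Psychologica Sinica, 54(9), 1122-1136. doi: 10.3724/SP.J.1041.2022.01122
[8]	Wang, Z., Guo, Q., & Yue, Y. (2007). Review and prospects of person-fit research in psychological tests [in Chinese]. Advances in Psychological Science, 15(3), 559-566.
[9]	Xu, J., Luo, F., Ma, Y., Hu, L., & Tian, X. (2024). Automated scoring of open-ended situational judgment tests [in Chinese]. Acta Psychologica Sinica, 56(6), 831-844. doi: 10.3724/SP.J.1041.2024.00831
[10]	Zhang, L., Wang, X., Cai, Y., & Tu, D. (2020). A new technique for detecting aberrant responses in psychological and educational tests: Change-point analysis [in Chinese]. Advances in Psychological Science, 28(9), 1462-1477. doi: 10.3724/SP.J.1042.2020.01462
[11]	Zhong, X., Li, M., & Li, L. (2021). Controlling and identifying careless responding by participants in questionnaire surveys [in Chinese]. Advances in Psychological Science, 29(2), 225-237. doi: 10.3724/SP.J.1042.2021.00225
[12]	Zhong, X., Yu, X., Miao, Y., Qin, C., Peng, Y., & Tong, H. (2022). An exploration of change-point analysis based on response time data for detecting speeded responding: Known and unknown item parameters [in Chinese]. Acta Psychologica Sinica, 54(10), 1277-1292. doi: 10.3724/SP.J.1041.2022.01277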
[13]	Alpaydin, E. (2020). Introduction to machine learning. MIT press.
[14]	Alsabhan, W. (2023). Student cheating detection in higher education by implementing machine learning and LSTM techniques. Sensors, 23(8), 4149.
[15]	Arias, V. B., Garrido, L. E., Jenaro, C., Martínez-Molina, A., & Arias, B. (2020). A little garbage in, lots of garbage out: Assessing the impact of careless responding in personality survey data. Behavior Research Methods, 52(6), 2489-2505.
[16]	Arthur, W., Jr., Hagen, E., & George, F., Jr. (2021). The lazy or dishonest respondent: Detection and prevention. Annual Review of Organizational Psychology and Organizational Behavior, 8, 105-137.
[17]	Cavalcanti, E. R., Pires, C. E., Cavalcanti, E. P., & Pires, V. F. (2012). Detection and evaluation of cheating on college exams using supervised classification. Informatics in Education, 11(2), 169-190.
[18]	Chan, K., & Stolfo, J. (1997). On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8(1), 5-28.
[19]	Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
[20]	Chen, R. C., Dewi, C., Huang, S. W., & Caraka, R. E. (2020). Selecting critical features for data classification based on machine learning methods. Journal of Big Data, 7(1), 52.
[21]	Cizek, G. J., & Wollack, J. A. (Eds.). (2017). Handbook of quantitative methods for detecting cheating on tests. New York, NY: Routledge.
[22]	Curran, P. G. (2016). Methods for the detection of carelessly invalid responses in survey data. Journal of Experimental Social Psychology, 66, 4-19.
[23]	Di Mattia, F., Galeone, P., De Simoni, M., & Ghelfi, E. (2019). A survey on GANs for anomaly detection. arXiv preprint arXiv:1906.11632. https://doi.org/10.48550/arXiv.1906.11632
[24]	Dong, X., Yu, Z., Cao, W., Shi, Y., & Ma, Q. (2020). A survey on ensemble learning. Frontiers of Computer Science, 14, 241-258. doi: 10.1007/s11704-019-8208-z
[25]	Du, M., Liu, N., & Hu, X. (2019). Techniques for interpretable machine learning. Communications of the ACM, 63(1), 68-77.
[26]	Foltýnek, T., Meuschke, N., & Gipp, B. (2019). Academic plagiarism detection: A systematic literature review. ACM Computing Surveys (CSUR), 52(6), 1-42.
[27]	Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
[28]	Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11), 139-144.
[29]	Gorgun, G., & Bulut, O. (2022). Identifying aberrant responses in intelligent tutoring systems: An application of anomaly detection methods. Psychological Test and Assessment Modeling, 64(4), 359-384.
[30]	Heaton, J. (2016). An empirical analysis of feature engineering for predictive modeling. In SoutheastCon 2016 (pp. 1-6). IEEE.
[31]	Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85-126.
[32]	Huang, J. L., Liu, M., & Bowling, N. A. (2015). Insufficient effort responding: Examining an insidious confound in survey data. Journal of Applied Psychology, 100(3), 828-845. doi: 10.1037/a0038510 pmid: 25495093
[33]	Hussein, F., Al-Ahmad, A., El-Salhi, S., Alshdaifat, E. A., & Al-Hami, M. T. (2022). Advances in contextual action recognition: Automatic cheating detection using machine learning techniques. Data, 7(9), 122.
[34]	Jiao, H., Yadav, C., & Li, G. (2023). Integrating psychometric analysis and machine learning to augment data for cheating detection in large-scale assessment. OSF. https://doi.org/10.31234/osf.io/fjz2c
[35]	Kamalov, F., Sulieman, H., & Santandreu Calonge, D. (2021). Machine learning based approach to exam cheating detection. Plos One, 16(8), e0254340. https://doi.org/10.1371/journal.pone.0254340
[36]	Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16(4), 277-298.
[37]	Kim, D., Woo, A., & Dickison, P. (2016). Identifying and investigating aberrant responses using psychometrics-based and machine learning-based approaches. In G. J. Cizek & J. A. Wollack (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 70-97). New York, NY: Routledge.
[38]	Liao, M., Patton, J., Yan, R., & Jiao, H. (2021). Mining process data to detect aberrant test takers. Measurement: Interdisciplinary Research and Perspectives, 19(2), 93-105.
[39]	Man, K., Harring, J. R., & Sinharay, S. (2019). Use of data mining methods to detect test fraud. Journal of Educational Measurement, 56(2), 251-279. doi: 10.1111/jedm.12208
[40]	Meng, H., & Ma, Y. (2023). Machine learning-based profiling in test cheating detection. Educational Measurement: Issues and Practice, 42(1), 59-75.
[41]	Pan, Y., Sinharay, S., Livne, O., & Wollack, J. A. (2022). A machine learning approach for detecting item compromise and preknowledge in computerized adaptive testing. Psychological Test and Assessment Modeling, 64(4), 385-424.
[42]	Pan, Y., & Wollack, J. A. (2021). An unsupervised-learning-based approach to compromised items detection. Journal of Educational Measurement, 58(3), 413-433.
[43]	Pan, Y., & Wollack, J. A. (2023). A machine learning approach for the simultaneous detection of preknowledge in examinees and items when both are unknown. Educational Measurement: Issues and Practice, 42(1), 76-98.
[44]	Ranger, J., Schmidt, N., & Wolgast, A. (2020). The detection of cheating on E-exams in higher education—The performance of several old and some new indicators. Frontiers in Psychology, 11, 568825. https://doi.org/10.3389/fpsyg.2020.568825
[45]	Ranger, J., Schmidt, N., & Wolgast, A. (2023). Detecting cheating in large-scale assessment: The transfer of detectors to new tests. Educational and Psychological Measurement, 83(5), 1033-1058. doi: 10.1177/00131644221132723 pmid: 37663534
[46]	Rodríguez-Villalobos, M., Fernandez-Garza, J., & Heredia-Escorza, Y. (2023). Monitoring methods and student performance in distance education exams. The International Journal of Information and Learning Technology, 40(2), 164-176.
[47]	Schroeders, U., Schmidt, C., & Gnambs, T. (2022). Detecting careless responding in survey data using stochastic gradient boosting. Educational and Psychological Measurement, 82(1), 29-56. doi: 10.1177/00131644211004708 pmid: 34992306
[48]	Sinharay, S. (2017). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42(1), 46-68.
[49]	Stekhoven, D., & Bühlmann, P. (2012). MissForest - non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1), 112-118. doi: 10.1093/bioinformatics/btr597 pmid: 22039212
[50]	Taloni, A., Scorcia, V., & Giannaccare, G. (2024). Modern threats in academia: Evaluating plagiarism and artificial intelligence detection scores of ChatGPT. Eye, 38(2), 397-400.
[51]	Tang, S., Samuel, S., & Li, Z. (2023). Detecting atypical test-taking behavior with behavior prediction using LSTM. Psychological Test and Assessment Modeling, 65(2), 76-124.
[52]	Thomas, S. L. (2016). So happy together? Combining Rasch and item response theory model estimates with support vector machines to detect test fraud. (Unpublished doctoral dissertation). University of Virginia.
[53]	Tiong, L. C. O., & Lee, H. J. (2021). E-cheating prevention measures: Detection of cheating at online examinations using deep learning approach--a case study. arXiv preprint arXiv:2101.09841. https://doi.org/10.48550/arXiv.2101.09841
[54]	Ullah, A., Xiao, H., & Barker, T. (2019). A dynamic profile questions approach to mitigate impersonation in online examinations. Journal of Grid Computing, 17, 209-223.
[55]	van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365-384.
[56]	van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). CUSUM-based person-fit statistics for adaptive testing. Journal of Educational and Behavioral Statistics, 26(2), 199-217.
[57]	Ward, M. K., & Meade, A. W. (2023). Dealing with careless responding in survey data: Prevention, identification, and recommended best practices. Annual Review of Psychology, 74, 577-596.
[58]	Weiss, K., Khoshgoftaar, T. M., & Wang, D. (2016). A survey of transfer learning. Journal of Big Data, 3, 1-40.
[59]	Welz, M., & Alfons, A. (2023). I don't care anymore: Identifying the onset of careless responding. arXiv preprint arXiv:2303.07167. https://doi.org/10.48550/arXiv.2303.07167
[60]	Zenati, H., Foo, C. S., Lecouat, B., Manek, G., & Chandrasekhar, V. R. (2018). Efficient GAN-based anomaly detection. arXiv preprint arXiv:1802.06222. https://doi.org/10.48550/arXiv.1802.06222
[61]	Zhen, Y., & Zhu, X. (2024). An ensemble learning approach based on TabNet and machine learning models for cheating detection in educational tests. Educational and Psychological Measurement, 84(4), 780-809.
[62]	Zhou, T., & Jiao, H. (2022). Data augmentation in machine learning for cheating detection in large-scale assessment: An illustration with the blending ensemble learning algorithm. Psychological Test and Assessment Modeling, 64(4), 425-444.
[63]	Zhou, T., & Jiao, H. (2023). Exploration of the stacking ensemble machine learning algorithm for cheating detection in large-scale assessment. Educational and Psychological Measurement, 83(4), 831-854. doi: 10.1177/00131644221117193 pmid: 37398846
[64]	Zhu, X., & Goldberg, A. B. (2009). Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1), 1-130.
[65]	Zhu, Z., Arthur, D., & Chang, H. H. (2022). A new person-fit method based on machine learning in CDM in education. British Journal of Mathematical and Statistical Psychology, 75(3), 616-637.
[66]	Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(5), 363-387.
[67]	Zopluoglu, C. (2019). Detecting examinees with item preknowledge in large-scale testing using extreme gradient boosting (XGBoost). Educational and Psychological Measurement, 79(5), 931-961. doi: 10.1177/0013164419839439 pmid: 31488920

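Method type	Specific methods	Test and anomaly type
Supervised learning	Decision tree (Cavalcanti et al., 2012); neural network (Zhu et al., 2022); gradient boosting (Schroeders et al., 2022); quadratic discriminant analysis (Ranger et al., 2023); support vector machine (Thomas, 2016; Pan et al., 2022); extreme gradient boosting (Zopluoglu, 2019); support vector machine, k-nearest neighbors, random forest (Man et al., 2019); stacking and blending ensemble learning built from support vector machine, decision tree, logistic regression, naive Bayes, discriminant analysis, neural network, gradient boosting, and random forest models (Jiao et al., 2023; Zhou & Jiao, 2022, 2023); decision tree, logistic regression, naive Bayes, quadratic discriminant analysis, neural network, gradient boosting, random forest, k-nearest neighbors, multilayer perceptron, adaptive boosting, Gaussian process, and the deep neural network TabNet (Zhen & Zhu, 2024); long short-term memory network (Alsabhan, 2023; Kamalov et al., 2021; Tang et al., 2023; Tiong & Lee, 2021); logistic regression, linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbors, naive Bayes, support vector machine, decision tree, random forest, adaptive boosting, and neural network (Meng & Ma, 2023)	Cheating in educational tests: (Alsabhan, 2023; Cavalcanti et al., 2012; Jiao et al., 2023; Kamalov et al., 2021; Man et al., 2019; Meng & Ma, 2023; Pan et al., 2022; Ranger et al., 2023; Tang et al., 2023; Thomas, 2016; Tiong & Lee, 2021; Zhen & Zhu, 2024; Zhou & Jiao, 2022, 2023; Zopluoglu, 2019) Cheating, random responding, and sleeping effect in educational tests: (Zhu et al., 2022) Careless responding in questionnaires: (Schroeders et al., 2022)
Unsupervised learning	Hierarchical clustering (Pan & Wollack, 2021, 2023); k-means clustering (Liao et al., 2021); k-means clustering, Gaussian mixture model, self-organizing map clustering (Man et al., 2019); isolation forest, elliptic envelope, one-class support vector machine, density-based clustering (Jiao et al., 2023; Zhou & Jiao, 2022); Gaussian mixture model (Ranger et al., 2023); Gaussian mixture model, Bayesian Gaussian mixture model, isolation forest, Mahalanobis distance, local outlier factor, and elliptic envelope (Gorgun & Bulut, 2022); kernel density estimation (Kamalov et al., 2021); autoencoder (Pan et al., 2022; Pan & Wollack, 2023; Welz & Alfons, 2023); market basket analysis (Kim et al., 2016)	Cheating in educational tests: (Gorgun & Bulut, 2022; Jiao et al., 2023; Kamalov et al., 2021; Kim et al., 2016; Liao et al., 2021; Man et al., 2019; Pan et al., 2022; Pan & Wollack, 2021, 2023; Zhou & Jiao, 2022) Careless responding in questionnaires: (Welz & Alfons, 2023)
Semi-supervised learning	Self-training algorithm (Pan et al., 2022; Ranger et al., 2023)	Cheating in educational tests: (Pan et al., 2022; Ranger et al., 2023)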