Detection of aberrant response patterns using a residual-based statistic in testing with polytomous items

doi:10.3724/SP.J.1041.2022.01122

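Abstract

Abstract (translated from Chinese):

This paper proposes a person-fit statistic, R, for polytomously scored items, examines its performance in detecting six common aberrant response patterns (cheating, guessing, random responding, careless responding, creative responding, and mixed aberrance), and compares it with the standardized log-likelihood statistic l_zp. The results show that: (1) when aberrant responses cover a small proportion of items and the aberrance type is cheating or guessing, the detection rate of R is significantly higher than that of l_zp; (2) the detection rates of both statistics rise as test length and the examinees' degree of aberrance increase; (3) under some conditions, R and l_zp perform similarly. An empirical data analysis further illustrates how the R statistic is used, and the results also suggest that the R statistic has good application prospects.

Keywords: polytomous items, item response theory, person-fit statistic, aberrant behavior detection, graded response model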

Abstract:

Tests are widely used in educational measurement and psychometrics, and an examinee's aberrant responses can distort the estimate of that examinee's ability. Examinees with aberrant responses should not be treated with conventional methods; what matters is to screen them out of the normal group accurately. A common way to do this is to construct person-fit statistics that check whether a response pattern fits the examinee's estimated ability.
In this study, a residual-based person-fit statistic, R, is proposed that can be applied to both dichotomous and polytomous IRT models. R is built from weighted residuals between the observed and the expected responses. Accumulating the weighted residuals gives a goodness-of-fit value, which is compared with a critical value to decide whether an examinee is aberrant. Because tests with polytomous items can provide more information, polytomously scored items are increasingly popular in educational measurement and psychometrics. This article therefore focuses on the ability of the R statistic to detect aberrant response patterns under the graded response model.
An existing polytomous person-fit statistic, l_zp, was included for comparison because of its standardized form and strong power. In the first study, a simulation was conducted to generate the empirical distributions of R and l_zp. R, being an accumulation of weighted residuals, has a positively skewed distribution; l_zp has a negatively skewed distribution when the test has fewer than 80 items. Because both differ from the standard normal distribution, critical values must be set according to the Type I error rate and then used to decide whether each respondent's response pattern fits. In the second study, examinees with different aberrant behaviors (cheaters, lucky guessers, random respondents, careless respondents, creative respondents, and a mixed group) were simulated under different test lengths, and the detection rate and the area under the curve (AUC) were used to compare the effectiveness of the two person-fit statistics. The results show that R has a higher detection rate than l_zp when the aberrant behavior affects only a few items or when the aberrant behavior is cheating or guessing. When the aberrant behavior covers many items, l_zp is slightly better than R. Finally, an empirical study was conducted to show the power of the R statistic.
Both the R statistic and l_zp have their own pros and cons, so they could be combined in future person-fit studies. The R statistic has a better detection rate than l_zp under certain conditions, especially when cheating and lucky guessing occur. Considering that cheating and guessing by low-ability examinees are of particular concern among aberrant test behaviors, the R statistic is worthy of further research and exploration in real-world applications.
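
The abstract describes R only in words (an accumulation of weighted residuals under the graded response model) and does not give its formula. The following is a minimal sketch of that general idea, not the authors' definition: the function names are hypothetical, and the weight used here (the inverse of the model-implied score variance) is an assumption, so the resulting values are not on the scale of the critical values reported for R.

```python
import math

def grm_category_probs(theta, a, b):
    """Category probabilities for scores 0..m under the graded response model.

    theta: ability; a: discrimination; b: ordered thresholds b_1 < ... < b_m.
    """
    # Cumulative probabilities P(X >= k), padded with P(X >= 0) = 1 and P(X >= m+1) = 0.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def weighted_residual_stat(responses, theta, items):
    """Accumulate squared residuals, each weighted by 1 / model variance.

    The weighting is an illustrative assumption, not the paper's formula.
    items: list of (a, b) pairs, one per item.
    """
    total = 0.0
    for x, (a, b) in zip(responses, items):
        probs = grm_category_probs(theta, a, b)
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        total += (x - expected) ** 2 / variance
    return total

# Four identical 4-category items and a low-ability examinee (theta = -1.5).
items = [(1.5, [-1.0, 0.0, 1.0])] * 4
consistent = weighted_residual_stat([0, 0, 0, 0], -1.5, items)
aberrant = weighted_residual_stat([3, 3, 3, 3], -1.5, items)
print(consistent < aberrant)
```

A low-ability examinee who scores full marks on every item produces much larger residuals than one who scores 0, which is the kind of pattern a residual-based statistic is meant to pick up.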

Key words: polytomous items, item response theory, residual-based person-fit statistic, aberrant detection, polytomous item response models

Chinese Library Classification (CLC) number:

B841

TONG Hao, YU Xiaofeng, QIN Chunying, PENG Yafeng, ZHONG Xiaoyuan. (2022). Detection of aberrant response patterns using a residual-based statistic in testing with polytomous items. Acta Psychologica Sinica, 54(9), 1122-1136.
	[童昊, 喻晓锋, 秦春影, 彭亚风, 钟小缘. (2022). 多级计分测验中基于残差统计量的被试拟合研究. 心理学报, 54(9), 1122-1136.]

Figures/Tables: 14

Number of items	Type I error rate	$R$ (UB)	$l_{zp}$ (LB)
20	0.01	706.9	-2.215
	0.025	416.2	-1.770
	0.05	282.9	-1.399
40	0.01	1057.4	-2.176
	0.025	691.9	-1.760
	0.05	519.6	-1.417
60	0.01	1407.7	-2.125
	0.025	949.2	-1.717
	0.05	730.4	-1.383
80	0.01	1904.7	-2.127
	0.025	1278.3	-1.738
	0.05	983.6	-1.411
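
One way to read the table above is as a lookup for a flagging rule. The sketch below assumes that UB and LB stand for upper and lower bounds, so that an examinee is flagged when R exceeds its bound or when l_zp falls below its bound; the dictionary and function names are hypothetical, and only the 20-item and 40-item rows are copied in to keep the example short.

```python
# Critical values copied from the table: (number of items, Type I error rate).
R_UPPER = {
    (20, 0.01): 706.9, (20, 0.025): 416.2, (20, 0.05): 282.9,
    (40, 0.01): 1057.4, (40, 0.025): 691.9, (40, 0.05): 519.6,
}
LZP_LOWER = {
    (20, 0.01): -2.215, (20, 0.025): -1.770, (20, 0.05): -1.399,
    (40, 0.01): -2.176, (40, 0.025): -1.760, (40, 0.05): -1.417,
}

def flag_by_r(r_value, n_items, alpha):
    """Flag an examinee when R is above its upper critical value (assumed rule)."""
    return r_value > R_UPPER[(n_items, alpha)]

def flag_by_lzp(lzp_value, n_items, alpha):
    """Flag an examinee when l_zp is below its lower critical value (assumed rule)."""
    return lzp_value < LZP_LOWER[(n_items, alpha)]

print(flag_by_r(300.0, 20, 0.05))    # 300.0 > 282.9
print(flag_by_lzp(-1.5, 20, 0.01))   # -1.5 is not below -2.215
```

The one-sided directions follow from the skew described in the abstract: large R values and strongly negative l_zp values both indicate misfit.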

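Aberrance type	Definition	Operational definition
Cheating	Low-ability examinees obtain full scores on items with high average difficulty	Randomly select low-ability examinees $(\theta < Z_{0.375})$; they obtain full scores on the n most difficult items
Lucky guessing	Low-ability examinees obtain full scores by guessing on items with high average difficulty	Randomly select low-ability examinees $(\theta < Z_{0.375})$; on each of the n most difficult items, they obtain the full score with probability 0.2 and keep the original response with probability 0.8
Random responding	May occur among examinees across the whole ability range; there is some probability of scoring 0	Randomly select examinees and randomly draw n items; on each, the score is 0 with probability 0.8 and the original response is kept with probability 0.2
Careless responding	High-ability examinees have some probability of scoring 0 on items with low average difficulty	Randomly select high-ability examinees $(\theta > Z_{0.625})$; on each of the n easiest items, the score is 0 with probability 0.8 and the original response is kept with probability 0.2
Creative responding	High-ability examinees score 0 on the easiest items	Randomly select high-ability examinees $(\theta > Z_{0.625})$; they score 0 on the n easiest items
Mixed	A mixture of the aberrance types above	Each of the five types above accounts for one fifth of the aberrant examinees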
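
The operational definitions above can be sketched as a single function that modifies one examinee's simulated responses. This is an illustration under assumptions, not the authors' simulation code: the function name is hypothetical, each item is assumed to have a single difficulty value (for example, the mean of its thresholds), and selecting low-ability or high-ability examinees is left to the caller.

```python
import random

def apply_aberrance(responses, difficulty, max_score, kind, n, rng):
    """Return a copy of `responses` with one aberrance type applied to n items."""
    out = list(responses)
    order = sorted(range(len(out)), key=lambda j: difficulty[j])  # easiest first
    easiest = order[:n]
    hardest = order[len(order) - n:]
    if kind == "cheating":            # full score on the n hardest items
        for j in hardest:
            out[j] = max_score[j]
    elif kind == "lucky_guessing":    # full score with probability 0.2
        for j in hardest:
            if rng.random() < 0.2:
                out[j] = max_score[j]
    elif kind == "random":            # n random items, score 0 with probability 0.8
        for j in rng.sample(range(len(out)), n):
            if rng.random() < 0.8:
                out[j] = 0
    elif kind == "careless":          # n easiest items, score 0 with probability 0.8
        for j in easiest:
            if rng.random() < 0.8:
                out[j] = 0
    elif kind == "creative":          # score 0 on the n easiest items
        for j in easiest:
            out[j] = 0
    else:
        raise ValueError(f"unknown aberrance type: {kind}")
    return out

rng = random.Random(0)
original = [1, 2, 0, 3, 1]
difficulty = [0.5, -1.0, 1.5, -0.2, 0.9]
max_score = [3, 3, 3, 3, 3]
print(apply_aberrance(original, difficulty, max_score, "cheating", 2, rng))
print(apply_aberrance(original, difficulty, max_score, "creative", 2, rng))
```

For the mixed condition, the caller would assign each aberrant examinee one of the five types in equal proportions and call the function with that type.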

Aberrance type	Type I error rate of the critical value	Low aberrance (0.1)				Medium aberrance (0.25)				High aberrance (0.5)
		False alarm rate		Detection rate		False alarm rate		Detection rate		False alarm rate		Detection rate
		R	l_zp	R	l_zp	R	l_zp	R	l_zp	R	l_zp	R	l_zp
Cheating	0.01	0.010	0.011	0.499	0.224	0.009	0.011	0.592	0.937	0.010	0.010	0.714	0.994
	0.025	0.024	0.026	0.729	0.400	0.023	0.026	0.827	0.973	0.024	0.026	0.901	0.998
	0.05	0.049	0.052	0.889	0.581	0.047	0.051	0.957	0.989	0.048	0.052	0.972	1
Lucky guessing	0.01	0.010	0.011	0.124	0.031	0.009	0.011	0.230	0.096	0.010	0.011	0.457	0.295
	0.025	0.025	0.026	0.196	0.067	0.024	0.026	0.362	0.165	0.024	0.026	0.579	0.396
	0.05	0.049	0.052	0.262	0.119	0.048	0.052	0.472	0.243	0.048	0.052	0.673	0.487
Random responding	0.01	0.010	0.010	0.138	0.068	0.010	0.010	0.181	0.205	0.010	0.011	0.173	0.387
	0.025	0.025	0.025	0.201	0.120	0.025	0.025	0.272	0.279	0.025	0.026	0.321	0.465
	0.05	0.050	0.050	0.270	0.185	0.050	0.050	0.363	0.354	0.051	0.051	0.463	0.535
Careless responding	0.01	0.010	0.011	0.826	0.491	0.009	0.011	0.832	0.887	0.009	0.011	0.646	0.995
	0.025	0.024	0.026	0.914	0.632	0.025	0.026	0.952	0.934	0.024	0.026	0.907	0.998
	0.05	0.050	0.051	0.946	0.736	0.051	0.052	0.985	0.961	0.050	0.052	0.989	0.999
Creative responding	0.01	0.009	0.011	0.922	0.704	0.010	0.011	0.839	0.998	0.009	0.011	0.653	1
	0.025	0.024	0.026	0.989	0.854	0.025	0.026	0.966	1	0.025	0.026	0.957	1
	0.05	0.050	0.052	0.999	0.939	0.051	0.052	0.995	1	0.050	0.051	0.997	1
Mixed	0.01	0.010	0.010	0.459	0.299	0.010	0.011	0.464	0.600	0.010	0.012	0.430	0.659
	0.025	0.025	0.025	0.552	0.405	0.025	0.026	0.585	0.638	0.025	0.028	0.596	0.688
	0.05	0.050	0.051	0.615	0.495	0.051	0.051	0.661	0.669	0.051	0.054	0.678	0.715
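
The false alarm and detection rates in the table, and the AUC mentioned in the abstract, can be computed from simulated statistic values and known aberrance labels roughly as follows. This is a minimal sketch with hypothetical function names and made-up example values; it assumes an upper-tail statistic such as R, where larger values indicate misfit (for l_zp the comparison would be reversed).

```python
def false_alarm_and_detection(stats, aberrant, critical):
    """False alarm rate among normal examinees and detection rate among aberrant ones."""
    flagged = [s > critical for s in stats]
    n_aberrant = sum(aberrant)
    n_normal = len(aberrant) - n_aberrant
    detected = sum(1 for f, a in zip(flagged, aberrant) if f and a)
    false_alarms = sum(1 for f, a in zip(flagged, aberrant) if f and not a)
    return false_alarms / n_normal, detected / n_aberrant

def auc(stats, aberrant):
    """AUC as the share of (aberrant, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, a in zip(stats, aberrant) if a]
    neg = [s for s, a in zip(stats, aberrant) if not a]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Made-up statistic values for six examinees; True marks simulated aberrant examinees.
stats = [100.0, 900.0, 250.0, 50.0, 800.0, 300.0]
aberrant = [False, True, True, False, True, False]
far, dr = false_alarm_and_detection(stats, aberrant, critical=282.9)
print(far, dr, auc(stats, aberrant))
```

The detection rate depends on the chosen critical value, whereas the AUC summarizes performance across all possible cutoffs, which is why the two are reported together.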