A comparison of standard residual methods and a mixture hierarchical model for handling non-effortful responses

doi:10.3724/SP.J.1041.2022.00411

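Abstract

Summary:

Using a simulation study, this article compared the mixture hierarchical model method with the family of standard residual methods in terms of identifying non-effortful responses and estimating parameters, both when the assumptions of the mixture hierarchical model were met and when they were violated. The results showed that: (1) when there were no non-effortful responses or their severity was low, the methods performed similarly; (2) when the severity of non-effortful responding was high, the iterative standard residual method with fixed item parameters was generally superior, and the mixture hierarchical model method performed well only when its assumptions were met and the response times of the two types of responses differed greatly. In practice, the iterative standard residual method with fixed item parameters should be preferred.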

Key words: non-effortful response, standard response time residual, iterative purification, mixture hierarchical model, Bayesian estimation

Abstract:

Assessment datasets contaminated by non-effortful responses may lead to serious consequences if not handled appropriately. Previous research has proposed two different strategies: down-weighting and accommodating. Down-weighting tries to limit the influence of aberrant responses on parameter estimation by reducing their weight. The extreme form of down-weighting is the detection and removal of irregular responses and response times (RTs). The standard residual-based methods, including the recently developed residual method using an iterative purification process, can be used to detect non-effortful responses in the framework of down-weighting. In accommodating, on the other hand, one tries to extend a model in order to account for the contaminations directly. This boils down to a mixture hierarchical model (MHM) for responses and RTs. However, to the authors’ knowledge, few studies have compared standard residual methods and MHM under different simulation conditions. It is unknown which method should be applied in different situations. Meanwhile, MHM has strong assumptions for different types of responses. It would be valuable to examine the performance of the method when the assumptions are violated. The purpose of this study is to compare standard residual methods and MHM under a fully crossed simulation design. In addition, specific recommendations for their applications are provided.
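
The residual-based detection idea can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a lognormal RT model in which the log RT of person i on item j has mean beta_j - tau_i and standard deviation 1/alpha_j, and it flags a response when its standardized residual falls below a one-sided cutoff of -1.645. The function names and the cutoff are assumptions made for this example.

```python
import numpy as np

def standardized_rt_residuals(log_rt, alpha, beta, tau):
    """Residual z_ij = alpha_j * (log t_ij - (beta_j - tau_i)).

    Assumes a lognormal RT model; all parameters are treated as known here.
    """
    log_rt = np.asarray(log_rt, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    tau = np.asarray(tau, dtype=float)
    expected = beta[None, :] - tau[:, None]  # persons in rows, items in columns
    return alpha[None, :] * (log_rt - expected)

def flag_fast_responses(z, cutoff=-1.645):
    """Flag unusually fast responses (hypothetical one-sided 5% cutoff)."""
    return np.asarray(z) < cutoff

# Toy data: 2 persons, 3 items, with a very short RT on the last item.
log_rt = np.array([[0.0, -0.5, -2.0],
                   [0.1, 0.0, -0.9]])
alpha = np.array([2.0, 2.0, 2.0])
beta = np.zeros(3)
tau = np.zeros(2)

z = standardized_rt_residuals(log_rt, alpha, beta, tau)
flags = flag_fast_responses(z)
```

In an iterative purification scheme, flagged responses would be removed and the parameters re-estimated until the set of flags stabilizes; that loop is omitted here because it depends on the estimation routine.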
The simulation study included two scenarios. In simulation scenario I, data were generated under the assumptions of MHM. In simulation scenario II, the MHM assumptions concerning both non-effortful responses and non-effortful RTs were violated. Simulation scenario I had three manipulated factors. (1) Non-effort prevalence ($\pi$), the proportion of individuals with non-effortful responses, had three levels: 0%, 20% and 40%. (2) Non-effort severity ($\pi_{i}^{non}$), the proportion of non-effortful responses for each non-effortful individual, had two levels: low, with $\pi_{i}^{non}$ generated from $U(0, 0.25)$, and high, with $\pi_{i}^{non}$ generated from $U(0.5, 0.75)$, where $U$ denotes a uniform distribution. (3) The difference between the RTs of non-effortful and effortful responses ($d_{RT}$) had two levels, small and large. The log RTs of non-effortful responses were generated from a normal distribution $N(\mu, 0.5^{2})$, with $\mu = -1$ when $d_{RT}$ was small and $\mu = -2$ when $d_{RT}$ was large. Non-effortful responses were generated following Wang, Xu and Shang (2018), with the probability of a correct response, $g_{j}$, set at 0.25 for all non-effortful responses. In simulation scenario II, only the first two factors were manipulated. Non-effortful RTs were generated from a uniform distribution with a lower bound of $\exp(-5)$ and an upper bound equal to the 5th percentile of the RT on item $j$ with $\tau = 0$. The probability of a correct response for non-effortful responses depended on the ability level of each examinee. In all conditions, the sample size was fixed at I = 2,000 and the test length at J = 30. For each condition, 30 replications were generated.
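
The scenario I generation of non-effortful responses described above might look like the sketch below. Several details are assumptions not stated in the abstract: exactly round(pi * I) examinees are marked non-effortful, each of their item responses is non-effortful independently with probability $\pi_{i}^{non}$, and the function and argument names are hypothetical.

```python
import numpy as np

def simulate_non_effort(n_persons=2000, n_items=30, prevalence=0.2,
                        severity="high", d_rt="large", g=0.25, seed=0):
    """Sketch of scenario I non-effortful data (assumptions noted inline)."""
    rng = np.random.default_rng(seed)

    # Assumption: a fixed number of examinees, round(pi * I), is non-effortful.
    n_non = int(round(prevalence * n_persons))
    is_non_person = np.zeros(n_persons, dtype=bool)
    is_non_person[rng.choice(n_persons, size=n_non, replace=False)] = True

    # Severity pi_i^non: U(0, 0.25) if low, U(0.5, 0.75) if high.
    lo, hi = (0.0, 0.25) if severity == "low" else (0.5, 0.75)
    pi_non = np.where(is_non_person, rng.uniform(lo, hi, n_persons), 0.0)

    # Assumption: each response is non-effortful independently w.p. pi_i^non.
    delta = rng.random((n_persons, n_items)) < pi_non[:, None]

    # Log RTs of non-effortful responses: N(mu, 0.5^2), mu = -1 or -2.
    mu = -1.0 if d_rt == "small" else -2.0
    log_rt = rng.normal(mu, 0.5, size=(n_persons, n_items))

    # Non-effortful responses are correct with probability g = 0.25.
    resp = (rng.random((n_persons, n_items)) < g).astype(int)

    # Values are meaningful only where delta is True; the remaining cells
    # would be filled from the effortful-response model.
    return delta, log_rt, resp

delta, log_rt_non, resp_non = simulate_non_effort()
```

With the defaults (prevalence 20%, high severity), roughly 0.2 x 0.625 = 12.5% of all responses end up non-effortful.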
For effortful responses, responses and RTs were simulated from van der Linden’s (2007) hierarchical model. Item parameters were generated with $a_{j} \sim U(1, 2.5)$, $b_{j} \sim N(0, 1)$, $\alpha_{j} \sim U(1.5, 2.5)$ and $\beta_{j} \sim U(-0.2, 0.2)$. The person parameters $(\theta_{i}, \tau_{i})$ were generated from a bivariate normal distribution with mean vector $\boldsymbol{\mu} = (0, 0)'$ and covariance matrix $\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}$. Four methods were compared under each condition: the original standard residual method (OSR), the conditional-estimate standard residual method (CSR), the conditional-estimate standard residual method with fixed item parameters and an iterative purification procedure (CSRI), and MHM. The methods were implemented in R and JAGS, using Bayesian MCMC sampling for parameter calibration. They were evaluated in terms of convergence rate, detection accuracy and parameter recovery.
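
A rough sketch of the effortful data generation follows. It uses the parameter distributions listed above, but two modelling choices are assumptions for illustration: responses follow a two-parameter logistic model in $a_{j}$ and $b_{j}$, and log RTs are normal with mean $\beta_{j} - \tau_{i}$ and standard deviation $1/\alpha_{j}$. The function name is hypothetical.

```python
import numpy as np

def simulate_effortful(n_persons=2000, n_items=30, seed=0):
    """Sketch of effortful responses and RTs under a hierarchical model."""
    rng = np.random.default_rng(seed)

    # Item parameters, drawn from the distributions given in the text.
    a = rng.uniform(1.0, 2.5, n_items)       # discrimination
    b = rng.normal(0.0, 1.0, n_items)        # difficulty
    alpha = rng.uniform(1.5, 2.5, n_items)   # time discrimination
    beta = rng.uniform(-0.2, 0.2, n_items)   # time intensity

    # Person parameters (theta, tau): bivariate normal, mean 0.
    cov = np.array([[1.0, 0.25],
                    [0.25, 0.25]])
    theta, tau = rng.multivariate_normal([0.0, 0.0], cov, size=n_persons).T

    # Assumption: 2PL response model.
    p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
    resp = (rng.random((n_persons, n_items)) < p).astype(int)

    # Assumption: lognormal RT model, log t_ij ~ N(beta_j - tau_i, 1/alpha_j^2).
    log_rt = rng.normal(beta[None, :] - tau[:, None], 1.0 / alpha[None, :])

    return {"a": a, "b": b, "alpha": alpha, "beta": beta,
            "theta": theta, "tau": tau, "resp": resp, "log_rt": log_rt}

sim = simulate_effortful()
```

The covariance matrix implies a correlation of 0.25 / (1 x 0.5) = 0.5 between ability and speed, which the simulated person parameters should approximately reproduce.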
The results were as follows. First, MHM suffered from convergence issues, especially for the latent variable indicating non-effortful responses, whereas all the standard residual methods converged successfully. The convergence issues were more serious in simulation scenario II. Second, when all responses were effortful, the false positive rate (FPR) of MHM was 0. Although the standard residual methods had an FPR of around 5% (the nominal level), the accuracy of the parameter estimates was similar across methods. Third, when data were contaminated by non-effortful responses, CSRI had a higher true positive rate (TPR) in almost all conditions. MHM showed a lower TPR but also a lower false discovery rate (FDR), and its TPR was even lower in simulation scenario II. When $\pi_{i}^{non}$ was high, CSRI and MHM showed greater advantages over the other methods in terms of parameter recovery. However, when $\pi_{i}^{non}$ was high and $d_{RT}$ was small, MHM generally had a higher RMSE than CSRI. MHM also performed worse in simulation scenario II than in simulation scenario I. The only weakness of CSRI was its overestimation of the time discrimination parameter in all conditions except when $\pi = 40\%$ and $d_{RT}$ was large. In a real data example, all the methods were applied to a dataset collected for program assessment and accountability purposes from undergraduates at a mid-sized southeastern university in the USA. Evidence of convergent validity showed that CSRI and MHM might detect non-effortful responses more accurately and obtain more precise parameter estimates for these data.
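
The detection indices mentioned above can be computed from the true and flagged indicators as in this sketch. It uses the usual confusion-matrix definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN), FDR = FP / (TP + FP)); the function name is hypothetical and the exact computation used in the study is not given in the abstract.

```python
import numpy as np

def detection_rates(truth, flagged):
    """Return TPR, FPR and FDR for boolean truth and flag arrays.

    A rate is NaN when its denominator is zero (e.g. TPR when no response
    is truly non-effortful, as in the 0% prevalence condition).
    """
    truth = np.asarray(truth, dtype=bool)
    flagged = np.asarray(flagged, dtype=bool)

    tp = int(np.sum(truth & flagged))
    fp = int(np.sum(~truth & flagged))
    fn = int(np.sum(truth & ~flagged))
    tn = int(np.sum(~truth & ~flagged))

    def ratio(num, den):
        return num / den if den > 0 else float("nan")

    return {"TPR": ratio(tp, tp + fn),
            "FPR": ratio(fp, fp + tn),
            "FDR": ratio(fp, tp + fp)}

# Toy example: 3 truly non-effortful responses, 3 flagged, 2 of them correct.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [1, 1, 0, 1, 0, 0, 0, 0]
rates = detection_rates(truth, flagged)
```

In the toy example, two of the three non-effortful responses are caught (TPR = 2/3), one of five effortful responses is wrongly flagged (FPR = 0.2), and one of the three flags is wrong (FDR = 1/3).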
In conclusion, CSRI generally performed better than the other methods across the conditions. It is highly recommended for use in practice because: (1) it showed an acceptable FPR and fairly accurate parameter estimates even when all responses were effortful; (2) it is free of strong assumptions and should therefore be robust across situations; (3) its advantages in detecting non-effortful responses and improving parameter estimation were greatest when $\pi_{i}^{non}$ was high. To improve the estimation of the time discrimination parameter in CSRI, robust estimation methods that down-weight flagged response patterns could be used as an alternative to directly removing non-effortful responses (the approach taken in the current study). MHM can perform well when all its assumptions are met, $\pi_{i}^{non}$ is high and $d_{RT}$ is large. However, some MHM parameters have difficulty converging, which will limit its application in practice.

Key words: non-effortful response, standard response time residual, iterative purification, mixture hierarchical model, Bayesian estimation

CLC number:

B841

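Liu Yue, Liu Hongyun, You Xiaofeng, Yang Jianqin. (2022). A comparison of standard residual methods and a mixture hierarchical model for handling non-effortful responses. Acta Psychologica Sinica, 54(4), 411-425.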

LIU Yue, LIU Hongyun, YOU Xiaofeng, YANG Jianqin. (2022). A comparison of standard residual methods and a mixture hierarchical model for detecting non-effortful responses. Acta Psychologica Sinica, 54(4), 411-425.

Figures/Tables (11)

References (30)

[1]	Borghans, L., & Schils, T. (2012). The leaning tower of PISA: Decomposing achievement test scores into cognitive and noncognitive components. The Netherlands: School of Business and Economics, Maastricht University.
[2]	Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15(2), 223-234. doi: 10.1037/1040-3590.15.2.223
[3]	Feinberg, R., & Jurich, D. (2018, April). Using rapid responses to evaluate test speededness. Paper presented at the meeting of the National Council on Measurement in Education (NCME), New York, NY.
[4]	Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457-472.
[5]	Hong, M., Rebouças, D. A., & Cheng, Y. (2021). Robust estimation for response time modeling. Journal of Educational Measurement, 58(2), 262-280. doi: 10.1111/jedm.v58.2
[6]	Köhler, C., Pohl, S., & Carstensen, C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54(4), 397-419. doi: 10.1111/jedm.2017.54.issue-4
[7]	Liu, Y., Cheng, Y., & Liu, H. (2020). Identifying effortful individuals with mixture modeling response accuracy and response time simultaneously to improve item parameter estimation. Educational and Psychological Measurement, 80(4), 775-807. doi: 10.1177/0013164419895068
[8]	Liu, Y., & Liu, H. (2021). Detecting noneffortful responses based on a residual method using an iterative purification process. Journal of Educational and Behavioral Statistics, 46(6), 717-752. doi: 10.3102/1076998621994366
[9]	Lu, J., Wang, C., Zhang, J., & Tao, J. (2020). A mixture model for responses and response times with a higher‐order ability structure to detect rapid guessing behaviour. British Journal of Mathematical and Statistical Psychology, 73(2), 261-288. doi: 10.1111/bmsp.v73.2
[10]	Matzke, D., Love, J., & Heathcote, A. (2017). A Bayesian approach for estimating the probability of trigger failures in the stop-signal paradigm. Behavior Research Methods, 49(1), 267-281. doi: 10.3758/s13428-015-0695-8 pmid: 26822670
[11]	McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143-149. pmid: 23894860
[12]	Molenaar, D., Bolsinova, M., & Vermunt, J. K. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71(2), 205-228. doi: 10.1111/bmsp.2018.71.issue-2
[13]	Pastor, D. A., Ong, T. Q., & Strickman, S. N. (2019). Patterns of solution behavior across items in low-stakes assessments. Educational Assessment, 24(3), 189-212. doi: 10.1080/10627197.2019.1615373
[14]	Plummer, M. (2003, March). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Retrieved from https://www.r-project.org/conferences/DSC-2003/Drafts/Plummer.
[15]	Qian, H., Staniewska, D., Reckase, M., & Woo, A. (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38-47.
[16]	Ranger, J., Wolgast, A., & Kuhn, J. T. (2019). Robust estimation of the hierarchical model for responses and response times. British Journal of Mathematical and Statistical Psychology, 72(1), 83-107. doi: 10.1111/bmsp.2019.72.issue-1
[17]	R Development Core Team. (2009). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org (ISBN 3-900051-07-0)
[18]	Rios, J. A., Guo, H., Mao, L., & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74-104. doi: 10.1080/15305058.2016.1231193
[19]	Rose, N. (2013). Item nonresponses in educational and psychological measurement (Unpublished doctoral dissertation). Friedrich Schiller University, Jena, Germany.
[20]	Setzer, J. C., Wise, S. L., van den Heuvel, J. R., & Ling, G. (2013). An investigation of examinee test-taking effort on a large-scale assessment. Applied Measurement in Education, 26(1), 34-49. doi: 10.1080/08957347.2013.739453
[21]	Ulitzsch, E., von Davier, M., & Pohl, S. (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item‐level non‐response. British Journal of Mathematical and Statistical Psychology, 73(S1), 83-112. doi: 10.1111/bmsp.v73.s1
[22]	van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287-308. doi: 10.1007/s11336-006-1478-z
[23]	van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365-384. doi: 10.1007/s11336-007-9046-8
[24]	Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456-477. doi: 10.1111/bmsp.2015.68.issue-3
[25]	Wang, C., Xu, G., & Shang, Z. (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83(1), 223-254. doi: 10.1007/s11336-016-9525-x
[26]	Wang, C., Xu, G., Shang, Z., & Kuncel, N. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469-501. doi: 10.3102/1076998618767123
[27]	Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237-252. doi: 10.1080/08957347.2015.1042155
[28]	Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52-61. doi: 10.1111/emip.2017.36.issue-4
[29]	Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19-38. doi: 10.1111/jedm.2006.43.issue-1
[30]	Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86-105. doi: 10.1111/jedm.12102

Scenario	$\pi$	$\pi_{i}^{non}$	$d_{RT}$	Response classification parameter ($\Delta_{ij}$)	Person parameters	Total
Scenario I	0%			0.05	0.00	0.05
	20%	Low	Small	15.83	0.00	14.80
			Large	11.70	0.00	10.94
		High	Small	11.10	0.00	10.38
			Large	12.11	0.01	11.33
	40%	Low	Small	12.88	0.00	12.04
			Large	12.73	0.00	11.91
		High	Small	9.30	0.00	8.70
			Large	13.15	0.00	12.30
Scenario II	20%	Low		16.75	0.00	15.67
	20%	High		15.53	0.00	14.52
	40%	Low		7.08	0.00	6.62
	40%	High		11.93	0.00	11.15

Scenario	$\pi$	$\pi_{i}^{non}$	$d_{RT}$	Index	OSR	CSR	CSRI	MHM
Scenario I	0%			FPR	0.05	0.05	0.06	0.00
	20%	Low (0.025)	Small	TPR	0.59	0.59	0.69	0.39
				FDR	0.69	0.69	0.71	0.20
				Pr	0.05	0.05	0.06	0.01
			Large	TPR	0.91	0.91	0.97	0.87
				FDR	0.47	0.49	0.53	0.09
				Pr	0.04	0.04	0.05	0.02
		High (0.125)	Small	TPR	0.19	0.25	0.50	0.03
				FDR	0.48	0.54	0.43	0.08
				Pr	0.04	0.07	0.11	0.00
			Large	TPR	0.31	0.50	0.93	0.82
				FDR	0.16	0.36	0.28	0.07
				Pr	0.05	0.10	0.16	0.11
	40%	Low (0.050)	Small	TPR	0.55	0.55	0.65	0.51
				FDR	0.46	0.45	0.47	0.20
				Pr	0.05	0.05	0.06	0.03
			Large	TPR	0.87	0.87	0.94	0.91
				FDR	0.17	0.16	0.18	0.09
				Pr	0.05	0.05	0.06	0.05
		High (0.250)	Small	TPR	0.13	0.24	0.49	0.16
				FDR	0.23	0.31	0.23	0.10
				Pr	0.04	0.09	0.16	0.05
			Large	TPR	0.17	0.49	0.93	0.94
				FDR	0.03	0.17	0.14	0.07
				Pr	0.04	0.15	0.27	0.25
Scenario II	20%	Low (0.025)		TPR	0.77	0.78	0.90	0.64
				FDR	0.52	0.53	0.55	0.10
				Pr	0.04	0.04	0.05	0.02
		High (0.125)		TPR	0.27	0.34	0.72	0.18
				FDR	0.17	0.35	0.24	0.01
				Pr	0.04	0.07	0.12	0.02
	40%	Low (0.050)		TPR	0.70	0.69	0.82	0.73
				FDR	0.22	0.21	0.22	0.11
				Pr	0.04	0.04	0.05	0.04
		High (0.250)		TPR	0.20	0.29	0.56	0.13
				FDR	0.02	0.10	0.06	0.00
				Pr	0.05	0.08	0.15	0.03
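
The parenthetical values in the severity column appear to equal the expected overall proportion of non-effortful responses, that is, prevalence times the mean of the severity distribution. This reading is an inference from the numbers, not something stated in the excerpt. Likewise, Pr is not defined here; if it denotes the overall proportion of flagged responses, it should roughly equal TPR x base rate / (1 - FDR). The short check below works through both calculations with hypothetical helper names.

```python
def expected_base_rate(prevalence, lo, hi):
    """Prevalence times the mean of a U(lo, hi) severity distribution."""
    return prevalence * (lo + hi) / 2.0

def implied_flag_rate(tpr, fdr, base_rate):
    """Proportion flagged implied by TPR, FDR and the non-effort base rate.

    True positives make up TPR * base_rate of all responses, and they are a
    fraction (1 - FDR) of everything that is flagged.
    """
    return tpr * base_rate / (1.0 - fdr)

# Base rates for the four prevalence-by-severity combinations.
base_rates = {
    ("20%", "low"): expected_base_rate(0.2, 0.0, 0.25),
    ("20%", "high"): expected_base_rate(0.2, 0.5, 0.75),
    ("40%", "low"): expected_base_rate(0.4, 0.0, 0.25),
    ("40%", "high"): expected_base_rate(0.4, 0.5, 0.75),
}

# CSRI, scenario I, 20% prevalence, low severity, small RT difference.
csri_flag_rate = implied_flag_rate(tpr=0.69, fdr=0.71, base_rate=0.025)
```

For the CSRI row used above, the implied flagged proportion is about 0.059, which agrees with the tabulated Pr of 0.06 after rounding.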

Criterion	Parameter	OSR	CSR	CSRI	MHM
bias	a	-0.01	-0.01	-0.01	0.01
	b	0.00	0.00	0.00	0.00
	α	-0.21	-0.22	-0.26	0.00
	β	-0.07	-0.07	-0.08	0.02
	θ	0.00	-0.01	-0.01	0.01
	τ	-0.01	-0.01	-0.01	0.02
RMSE	a	0.11	0.11	0.11	0.10
	b	0.05	0.05	0.05	0.05
	α	0.22	0.22	0.27	0.03
	β	0.07	0.07	0.08	0.02
	θ	0.29	0.29	0.29	0.28
	τ	0.10	0.10	0.11	0.09
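
For reference, bias and RMSE for a set of parameter estimates can be computed as in this minimal sketch, which uses the standard definitions (mean error and root mean squared error). How the study averaged over items, persons and replications is not described in this excerpt, so the aggregation here is an assumption.

```python
import numpy as np

def bias(estimates, true_values):
    """Mean signed error of the estimates."""
    estimates = np.asarray(estimates, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    return float(np.mean(estimates - true_values))

def rmse(estimates, true_values):
    """Root mean squared error of the estimates."""
    estimates = np.asarray(estimates, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_values) ** 2)))

# Toy example with three hypothetical parameter values.
true_values = [1.0, 2.0, 3.0]
estimates = [1.1, 1.9, 3.3]

example_bias = bias(estimates, true_values)
example_rmse = rmse(estimates, true_values)
```

A negative bias, as for $\alpha$ under the residual methods in the table, means the estimates fall below the generating values on average.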