ISSN 0439-755X
CN 11-1911/B

Institute of Psychology, Chinese Academy of Sciences

• Research Report •

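A comparison of standardized residual methods and the mixture hierarchical model for handling non-effortful responses

1 Institute of Brain and Psychological Sciences, Sichuan Normal University, Chengdu 610066
2 Beijing Key Laboratory of Applied Experimental Psychology
3 Faculty of Psychology, Beijing Normal University, Beijing 100875
4 School of Mathematics and Information Science, Nanchang Normal University, Nanchang 360111
• Received: 2021-04-08 Published: 2022-04-25 Online: 2022-02-21
• Corresponding author: LIU Hongyun E-mail: hyliu@bnu.edu.cn
• Funding:
National Natural Science Foundation of China (32071091)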

A comparison of standard residual methods and a mixture hierarchical model for detecting non-effortful responses

LIU Yue1, LIU Hongyun2,3, YOU Xiaofeng4, YANG Jianqin4

1 Institute of Brain and Psychological Sciences, Sichuan Normal University, Chengdu 610066, China
2 Beijing Key Laboratory of Applied Experimental Psychology, Beijing Normal University, Beijing 100875, China
3 Faculty of Psychology, Beijing Normal University, Beijing 100875, China
4 School of Mathematics and Information Science, Nanchang Normal University, Nanchang 360111, China
• Received: 2021-04-08 Published: 2022-04-25 Online: 2022-02-21
• Contact: LIU Hongyun E-mail: hyliu@bnu.edu.cn

Abstract:

Assessment datasets contaminated by non-effortful responses may lead to serious consequences if not handled appropriately. Previous research has proposed two different strategies: down-weighting and accommodating. Down-weighting tries to limit the influence of aberrant responses on parameter estimation by reducing their weight. The extreme form of down-weighting is the detection and removal of irregular responses and response times (RTs). The standard residual-based methods, including the recently developed residual method using an iterative purification process, can be used to detect non-effortful responses in the framework of down-weighting. In accommodating, on the other hand, one tries to extend a model in order to account for the contaminations directly. This boils down to a mixture hierarchical model (MHM) for responses and RTs. However, to the authors’ knowledge, few studies have compared standard residual methods and MHM under different simulation conditions. It is unknown which method should be applied in different situations. Meanwhile, MHM has strong assumptions for different types of responses. It would be valuable to examine the performance of the method when the assumptions are violated. The purpose of this study is to compare standard residual methods and MHM under a fully crossed simulation design. In addition, specific recommendations for their applications are provided.
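The flag-and-remove idea with iterative purification can be illustrated with a deliberately simplified sketch. It is not the OSR, CSR or CSRI procedure from the study: it works on the log RTs of a single item, uses the sample mean and standard deviation in place of model-based estimates, and the function name `flag_fast_responses` and the one-sided critical value of -1.645 are assumptions made for illustration only.

```python
import statistics

def flag_fast_responses(log_rts, crit=-1.645, max_iter=20):
    """Flag unusually fast responses on one item via standardized residuals.

    Hypothetical illustration: the mean and SD of log RT are re-estimated
    from the currently unflagged observations, all observations are
    re-scored, and the loop stops when the flagged set no longer changes.
    """
    flagged = [False] * len(log_rts)
    for _ in range(max_iter):
        clean = [x for x, f in zip(log_rts, flagged) if not f]
        if len(clean) < 2:
            break  # too few observations left to estimate an SD
        mu = statistics.fmean(clean)
        sd = statistics.stdev(clean)
        if sd == 0:
            break  # residuals are undefined without variation
        new_flagged = [(x - mu) / sd < crit for x in log_rts]
        if new_flagged == flagged:
            break  # purification has converged
        flagged = new_flagged
    return flagged

# Eight ordinary log RTs followed by two fast ones (made-up numbers).
example = [0.9, 1.1, 0.95, 1.05, 1.0, 1.0, 0.9, 1.1, -2.0, 0.5]
single_pass = flag_fast_responses(example, max_iter=1)
purified = flag_fast_responses(example)
```

In this toy example a single pass flags only the most extreme value, because that value inflates the estimated standard deviation and masks the milder one; re-estimating from the unflagged observations then picks up the second fast response as well.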
The simulation study included two scenarios. In scenario I, data were generated under the assumptions of MHM. In scenario II, the assumptions of MHM concerning non-effortful responses and RTs were both violated. Scenario I had three manipulated factors. (1) Non-effort prevalence ($\pi$), the proportion of individuals with non-effortful responses, had three levels: 0%, 20% and 40%. (2) Non-effort severity ($\pi_{i}^{non}$), the proportion of non-effortful responses for each non-effortful individual, had two levels: low, with $\pi_{i}^{non}$ generated from $U(0, 0.25)$, and high, with $\pi_{i}^{non}$ generated from $U(0.5, 0.75)$, where $U$ denotes a uniform distribution. (3) The difference between the RTs of non-effortful and effortful responses ($d_{RT}$) had two levels, small and large. The logarithms of the RTs of non-effortful responses were generated from a normal distribution $N(\mu, 0.5^{2})$, where $\mu = -1$ when $d_{RT}$ was small and $\mu = -2$ when $d_{RT}$ was large. Non-effortful responses were generated following Wang, Xu and Shang (2018), with the probability of a correct response $g_{j}$ set at 0.25 for all non-effortful responses. In scenario II, only the first two factors were manipulated. Non-effortful RTs were generated from a uniform distribution with a lower bound of $\exp(-5)$ and an upper bound equal to the 5th percentile of RT on item $j$ with $\tau = 0$. The probability of a correct response for non-effortful responses depended on the ability level of each examinee. In all conditions, sample size was fixed at I = 2,000 and test length at J = 30. For each condition, 30 replications were generated.
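A minimal sketch of how the non-effortful part of scenario I might be generated is given here. The distributions and constants come from the description above; everything else is an assumption made for illustration: the function name `simulate_non_effort`, selecting exactly `round(prevalence * n_persons)` non-effortful persons, and treating each item of a non-effortful person as an independent Bernoulli draw with probability $\pi_{i}^{non}$ (the study may instead have fixed the exact proportion per person).

```python
import random

def simulate_non_effort(n_persons, n_items, prevalence, severity, d_rt, rng):
    """Generate non-effortful flags, log RTs and responses for scenario I.

    Cells that are effortful hold None in the log-RT and response matrices;
    they would be filled by the effortful data-generating model.
    """
    lo, hi = (0.0, 0.25) if severity == "low" else (0.5, 0.75)
    mu = -1.0 if d_rt == "small" else -2.0
    # Assumption: a fixed number of persons is non-effortful.
    non_effortful = set(rng.sample(range(n_persons), round(prevalence * n_persons)))
    flags, log_rts, responses = [], [], []
    for i in range(n_persons):
        # Person-level severity pi_i^non; zero for effortful persons.
        p_i = rng.uniform(lo, hi) if i in non_effortful else 0.0
        # Assumption: independent Bernoulli(p_i) flag for every item.
        row_flags = [rng.random() < p_i for _ in range(n_items)]
        flags.append(row_flags)
        # log RT ~ N(mu, 0.5^2) for non-effortful responses.
        log_rts.append([rng.gauss(mu, 0.5) if f else None for f in row_flags])
        # Correct with probability g_j = 0.25 for non-effortful responses.
        responses.append([int(rng.random() < 0.25) if f else None for f in row_flags])
    return flags, log_rts, responses

rng = random.Random(1)
flags, log_rts, responses = simulate_non_effort(200, 30, 0.2, "high", "large", rng)
```

The small sample size in the usage line is only for a quick check; the study itself used I = 2,000 and J = 30.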
Effortful responses and RTs were simulated from van der Linden’s (2007) hierarchical model. Item parameters were generated with $a_{j} \sim U(1, 2.5)$, $b_{j} \sim N(0, 1)$, $\alpha_{j} \sim U(1.5, 2.5)$ and $\beta_{j} \sim U(-0.2, 0.2)$. The person parameters $(\theta_{i}, \tau_{i})$ of the simulees were generated from a bivariate normal distribution with mean vector $\boldsymbol{\mu} = (0, 0)'$ and covariance matrix $\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}$. Four methods were compared under each condition: the original standard residual method (OSR), the conditional estimate standard residual method (CSR), the conditional estimate with fixed item parameters standard residual method using an iterative purifying procedure (CSRI), and MHM. These methods were implemented in R and JAGS using a Bayesian MCMC sampling method for parameter calibration. Finally, the methods were evaluated in terms of convergence rate, detection accuracy and parameter recovery.
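The effortful data-generating step can be sketched as follows, using the parameter distributions listed above. Two modelling details are assumptions, since the abstract does not spell them out: responses follow a two-parameter logistic form in $a_{j}$ and $b_{j}$, and log RTs follow the lognormal form $\ln t_{ij} \sim N(\beta_{j} - \tau_{i}, \alpha_{j}^{-2})$. The function name `simulate_effortful` is hypothetical.

```python
import math
import random

def simulate_effortful(n_persons, n_items, rng):
    """Simulate effortful responses and log RTs from a hierarchical model."""
    items = [
        {"a": rng.uniform(1.0, 2.5), "b": rng.gauss(0.0, 1.0),
         "alpha": rng.uniform(1.5, 2.5), "beta": rng.uniform(-0.2, 0.2)}
        for _ in range(n_items)
    ]
    # Cholesky factor of Sigma = [[1, 0.25], [0.25, 0.25]]:
    # theta = z1, tau = 0.25 * z1 + sqrt(0.25 - 0.25^2) * z2.
    tau_resid_sd = math.sqrt(0.25 - 0.25 ** 2)
    persons, responses, log_rts = [], [], []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        tau = 0.25 * theta + tau_resid_sd * rng.gauss(0.0, 1.0)
        persons.append((theta, tau))
        resp_row, rt_row = [], []
        for it in items:
            # Assumed 2PL response probability.
            p = 1.0 / (1.0 + math.exp(-it["a"] * (theta - it["b"])))
            resp_row.append(int(rng.random() < p))
            # Assumed lognormal RT model: SD of log RT is 1 / alpha_j.
            rt_row.append(rng.gauss(it["beta"] - tau, 1.0 / it["alpha"]))
        responses.append(resp_row)
        log_rts.append(rt_row)
    return items, persons, responses, log_rts

items, persons, responses, log_rts = simulate_effortful(2000, 30, random.Random(7))
```

Drawing $\tau_{i}$ as a linear function of $\theta_{i}$ plus independent noise reproduces the stated covariance of 0.25 and speed variance of 0.25 without needing a matrix library.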
The results were as follows. First, MHM suffered from convergence issues, especially for the latent variable indicating non-effortful responses, and these issues were more serious in simulation scenario II. In contrast, all the standard residual methods converged successfully. Second, when all responses were effortful, the false positive rate (FPR) of MHM was 0. Although the standard residual methods had an FPR of around 5% (the nominal level), the accuracy of the parameter estimates was similar across all methods. Third, when data were contaminated by non-effortful responses, CSRI had a higher true positive rate (TPR) than the other methods in almost all conditions. MHM showed a lower TPR but also a lower false discovery rate (FDR), and its TPR was even lower in simulation scenario II. When $\pi_{i}^{non}$ was high, CSRI and MHM showed greater advantages over the other methods in terms of parameter recovery. However, when $\pi_{i}^{non}$ was high and $d_{RT}$ was small, MHM generally had a higher root mean square error (RMSE) than CSRI. MHM performed worse in simulation scenario II than in scenario I. The only problem with CSRI was its overestimation of the time discrimination parameter in all conditions except when $\pi = 40\%$ and $d_{RT}$ was large. In a real data example, all the methods were applied to a dataset collected for program assessment and accountability purposes from undergraduates at a mid-sized southeastern university in the USA. Evidence of convergent validity suggested that CSRI and MHM might detect non-effortful responses more accurately and obtain more precise parameter estimates for these data.
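For reference, the detection and recovery criteria named above can be computed as in the following sketch. It uses the standard definitions (TPR = TP / (TP + FN), FPR = FP / (FP + TN), FDR = FP / (FP + TP), and RMSE as the root of the mean squared difference between estimates and true values); the function names and the toy inputs are hypothetical and are not results from the study.

```python
import math

def detection_rates(true_flags, pred_flags):
    """Return TPR, FPR and FDR for binary non-effort flags."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_flags, pred_flags):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif truth and not pred:
            fn += 1
        else:
            tn += 1
    nan = float("nan")  # returned when a rate's denominator is zero
    return {
        "TPR": tp / (tp + fn) if tp + fn else nan,
        "FPR": fp / (fp + tn) if fp + tn else nan,
        "FDR": fp / (fp + tp) if fp + tp else nan,
    }

def rmse(estimates, truths):
    """Root mean square error between estimated and true parameter values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths))

# Toy flags: three truly non-effortful responses, two of them detected,
# plus one effortful response flagged by mistake.
rates = detection_rates([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
```

A method can have a low TPR and a low FDR at the same time, as reported for MHM: it flags few responses, so it misses many non-effortful ones, but most of what it does flag is correct.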
In conclusion, CSRI generally performed better than the other methods across all conditions. It is highly recommended for use in practice because: (1) it showed an acceptable FPR and fairly accurate parameter estimates even when all responses were effortful; (2) it was free of strong assumptions, which suggests that it would be robust in a variety of situations; (3) it showed the greatest advantages when $\pi_{i}^{non}$ was high, in terms of both the detection of non-effortful responses and the improvement of parameter estimation. To improve the estimation of the time discrimination parameter in CSRI, robust estimation methods that down-weight flagged response patterns can be used as an alternative to directly removing non-effortful responses (i.e., the approach taken in the current study). MHM can perform well when all its assumptions are met, $\pi_{i}^{non}$ is high and $d_{RT}$ is large. However, some parameters have difficulty converging under MHM, which will limit its application in practice.