A comparison of standard residual methods and a mixture hierarchical model for detecting non-effortful responses
LIU Yue, LIU Hongyun, YOU Xiaofeng, YANG Jianqin
2022, 54(4): 411-425.
doi: 10.3724/SP.J.1041.2022.00411
Assessment datasets contaminated by non-effortful responses can lead to serious consequences if they are not handled appropriately. Previous research has proposed two strategies: down-weighting and accommodating. Down-weighting limits the influence of aberrant responses on parameter estimation by reducing their weight. Its extreme form is the detection and removal of irregular responses and response times (RTs). Within the down-weighting framework, standard residual-based methods, including a recently developed residual method that uses an iterative purification process, can be used to detect non-effortful responses. Accommodating, on the other hand, extends the model so that it accounts for the contamination directly. For responses and RTs, this amounts to a mixture hierarchical model (MHM). However, to the authors’ knowledge, few studies have compared standard residual methods and MHM under different simulation conditions, so it is unknown which method should be applied in which situation. In addition, MHM makes strong assumptions about the different types of responses, and it would be valuable to examine its performance when these assumptions are violated. The purpose of this study is to compare standard residual methods and MHM under a fully crossed simulation design and to provide specific recommendations for their application. The simulation study included two scenarios. In simulation scenario I, data were generated under the assumptions of MHM. In simulation scenario II, the assumptions of MHM concerning both non-effortful responses and non-effortful RTs were violated. Simulation scenario I had three manipulated factors. (1) Non-effort prevalence (π) was the proportion of individuals with non-effortful responses. It had three levels: 0%, 20% and 40%. (2) Non-effort severity ($\pi_{i}^{non}$) was the proportion of non-effortful responses for each non-effortful individual. 
It varied between two levels: low and high. When $\pi_{i}^{non}$ was low, it was generated from $U(0, 0.25)$; when $\pi_{i}^{non}$ was high, it was generated from $U(0.5, 0.75)$, where $U$ denotes a uniform distribution. (3) The difference between the RTs of non-effortful and effortful responses (dRT) had two levels: small and large. The log RTs of non-effortful responses were generated from a normal distribution $N(\mu, 0.5^{2})$, with $\mu = -1$ when dRT was small and $\mu = -2$ when dRT was large. Non-effortful responses were generated following Wang, Xu and Shang (2018), with the probability of a correct response, $g_j$, set at 0.25 for all non-effortful responses. In simulation scenario II, only the first two factors were considered. Non-effortful RTs were generated from a uniform distribution with a lower bound of exp(-5) and an upper bound equal to the 5th percentile of RT on item $j$ with $\tau = 0$. The probability of a correct response for non-effortful responses depended on the ability level of each examinee. In all conditions, sample size was fixed at I = 2,000 and test length was fixed at J = 30. For each condition, 30 replications were generated. Effortful responses and RTs were simulated from van der Linden’s (2007) hierarchical model. Item parameters were generated with $a_j \sim U(1, 2.5)$, $b_j \sim N(0, 1)$, $\alpha_j \sim U(1.5, 2.5)$ and $\beta_j \sim U(-0.2, 0.2)$. The person parameters $(\theta_i, \tau_i)$ were generated from a bivariate normal distribution with mean vector $\mu = (0, 0)'$ and covariance matrix $\Sigma=\left[\begin{array}{cc}1 & 0.25 \\ 0.25 & 0.25\end{array}\right]$. Four methods were compared under each condition: the original standard residual method (OSR), the conditional-estimate standard residual method (CSR), the conditional-estimate standard residual method with fixed item parameters and an iterative purification procedure (CSRI), and MHM. 
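The scenario I data-generating process described above can be sketched in a few lines. The following is a minimal illustration in Python (standard library only), not the authors' R/JAGS code. It assumes a two-parameter logistic model for effortful responses and a lognormal RT model with mean $\beta_j - \tau_i$ and standard deviation $1/\alpha_j$ on the log scale, which is the usual form of van der Linden's (2007) hierarchical model. It also assumes that non-effortful persons, and the non-effortful responses of each such person, are drawn independently with probabilities π and $\pi_{i}^{non}$. The function name `simulate_scenario_one` and its arguments are hypothetical.

```python
import math
import random


def simulate_scenario_one(n_persons=2000, n_items=30, prevalence=0.2,
                          severity="high", d_rt="large", seed=1):
    """Simulate responses, log RTs and true non-effort flags (scenario I)."""
    rng = random.Random(seed)
    # Item parameters drawn from the distributions given in the text.
    a = [rng.uniform(1.0, 2.5) for _ in range(n_items)]      # discrimination
    b = [rng.gauss(0.0, 1.0) for _ in range(n_items)]        # difficulty
    alpha = [rng.uniform(1.5, 2.5) for _ in range(n_items)]  # time discrimination
    beta = [rng.uniform(-0.2, 0.2) for _ in range(n_items)]  # time intensity

    sev_lo, sev_hi = (0.0, 0.25) if severity == "low" else (0.5, 0.75)
    mu_non = -1.0 if d_rt == "small" else -2.0  # mean log RT when non-effortful

    responses, log_rts, flags = [], [], []
    for _ in range(n_persons):
        # (theta, tau) ~ N(0, [[1, 0.25], [0.25, 0.25]]) via its Cholesky factor.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta = z1
        tau = 0.25 * z1 + math.sqrt(0.25 - 0.25 ** 2) * z2
        # Assumption: each person is non-effortful with probability `prevalence`.
        p_non = rng.uniform(sev_lo, sev_hi) if rng.random() < prevalence else 0.0

        x_row, t_row, f_row = [], [], []
        for j in range(n_items):
            # Assumption: each response of such a person is non-effortful
            # independently with probability p_non.
            if rng.random() < p_non:
                x = int(rng.random() < 0.25)  # correct with probability g_j = 0.25
                log_t = rng.gauss(mu_non, 0.5)
                flag = 1
            else:
                # Assumed 2PL response and lognormal RT for effortful behavior.
                p_correct = 1.0 / (1.0 + math.exp(-a[j] * (theta - b[j])))
                x = int(rng.random() < p_correct)
                log_t = rng.gauss(beta[j] - tau, 1.0 / alpha[j])
                flag = 0
            x_row.append(x)
            t_row.append(log_t)
            f_row.append(flag)
        responses.append(x_row)
        log_rts.append(t_row)
        flags.append(f_row)
    return responses, log_rts, flags
```

Calling `simulate_scenario_one(prevalence=0.0)` yields the uncontaminated baseline condition, in which no response is flagged as non-effortful.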
All four methods were implemented in R and JAGS, with Bayesian MCMC sampling used for parameter calibration. The methods were evaluated in terms of convergence rate, detection accuracy and parameter recovery. The results were as follows. First, MHM suffered from convergence issues, especially for the latent variable indicating non-effortful responses. In contrast, all the standard residual methods converged successfully. The convergence issues of MHM were more serious in simulation scenario II. Second, when all responses were effortful, the false positive rate (FPR) of MHM was 0. Although the standard residual methods had an FPR of around 5% (the nominal level), the accuracy of parameter estimates was similar for all methods. Third, when data were contaminated by non-effortful responses, CSRI had a higher true positive rate (TPR) than the other methods in almost all conditions. MHM showed a lower TPR but also a lower false discovery rate (FDR), and its TPR was lower still in simulation scenario II. When $\pi_{i}^{non}$ was high, CSRI and MHM showed clearer advantages over the other methods in terms of parameter recovery. However, when $\pi_{i}^{non}$ was high and dRT was small, MHM generally had a higher RMSE than CSRI. MHM also performed worse in simulation scenario II than in simulation scenario I. The one weakness of CSRI was that it overestimated the time discrimination parameter in all conditions except when π = 40% and dRT was large. In a real data example, all the methods were applied to a dataset collected for program assessment and accountability purposes from undergraduates at a mid-sized southeastern university in the USA. Evidence of convergent validity showed that CSRI and MHM might detect non-effortful responses more accurately and obtain more precise parameter estimates for these data. In conclusion, CSRI generally performed better than the other methods across all conditions. 
This method is highly recommended for practical use because: (1) it showed an acceptable FPR and fairly accurate parameter estimates even when all responses were effortful; (2) it did not rely on strong assumptions, which suggests that it would be robust across a variety of situations; and (3) its advantages in detecting non-effortful responses and improving parameter estimation were greatest when $\pi_{i}^{non}$ was high. To improve the estimation of the time discrimination parameter in CSRI, robust estimation methods that down-weight flagged response patterns could be used instead of directly removing non-effortful responses, as was done in the current study. MHM can perform well when all its assumptions are met, $\pi_{i}^{non}$ is high and dRT is large. However, some parameters are difficult to bring to convergence under MHM, which limits its practical application.
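The evaluation criteria named in the abstract (TPR, FPR, FDR and RMSE) have standard definitions, which a short sketch can make concrete. The Python below (standard library only) is an illustration rather than the study's code. The function names `detection_rates` and `rmse` are hypothetical, and returning NaN for undefined ratios (for example, the TPR when no response is truly non-effortful) is an assumption of this sketch.

```python
import math


def detection_rates(truth, flagged):
    """Return (TPR, FPR, FDR) for 0/1 sequences of true and detected flags."""
    tp = sum(1 for t, d in zip(truth, flagged) if t and d)          # hits
    fn = sum(1 for t, d in zip(truth, flagged) if t and not d)      # misses
    fp = sum(1 for t, d in zip(truth, flagged) if not t and d)      # false alarms
    tn = sum(1 for t, d in zip(truth, flagged) if not t and not d)  # correct rejections

    def ratio(num, den):
        # Assumption: an undefined ratio is reported as NaN rather than 0.
        return num / den if den > 0 else math.nan

    tpr = ratio(tp, tp + fn)  # share of non-effortful responses detected
    fpr = ratio(fp, fp + tn)  # share of effortful responses wrongly flagged
    fdr = ratio(fp, fp + tp)  # share of flagged responses that were effortful
    return tpr, fpr, fdr


def rmse(estimates, true_values):
    """Root mean squared error between estimated and generating parameters."""
    squared = [(e - v) ** 2 for e, v in zip(estimates, true_values)]
    return math.sqrt(sum(squared) / len(squared))
```

When no response is truly non-effortful, only the FPR is defined, which matches the way the abstract reports the FPR alone for the condition in which all responses were effortful.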