A comparison of standard residual methods and a mixture hierarchical model for detecting non-effortful responses

doi:10.3724/SP.J.1041.2022.00411

Abstract

Assessment datasets contaminated by non-effortful responses may lead to serious consequences if not handled appropriately. Previous research has proposed two strategies: down-weighting and accommodating. Down-weighting limits the influence of aberrant responses on parameter estimation by reducing their weight; its extreme form is the detection and removal of irregular responses and response times (RTs). The standard residual-based methods, including the recently developed residual method using an iterative purification process, can be used to detect non-effortful responses within the down-weighting framework. Accommodating, on the other hand, extends the model to account for the contamination directly, which leads to a mixture hierarchical model (MHM) for responses and RTs. However, to the authors’ knowledge, few studies have compared standard residual methods and MHM under different simulation conditions, so it is unknown which method should be applied in which situation. Moreover, MHM makes strong assumptions about the different types of responses, and it would be valuable to examine how the method performs when these assumptions are violated. The purpose of this study is to compare standard residual methods and MHM under a fully crossed simulation design and to provide specific recommendations for their application.
The simulation study included two scenarios. In scenario I, data were generated under the assumptions of MHM. In scenario II, the assumptions of MHM concerning both non-effortful responses and non-effortful RTs were violated. Scenario I had three manipulated factors. (1) Non-effort prevalence (π), the proportion of individuals with non-effortful responses, had three levels: 0%, 20% and 40%. (2) Non-effort severity ($\pi_{i}^{non}$), the proportion of non-effortful responses for each non-effortful individual, had two levels: low and high. When severity was low, $\pi_{i}^{non}$ was generated from U(0, 0.25); when it was high, $\pi_{i}^{non}$ was generated from U(0.5, 0.75), where “U” denotes a uniform distribution. (3) The difference between the RTs of non-effortful and effortful responses (d_RT) had two levels: small and large. The logarithms of the RTs of non-effortful responses were generated from a normal distribution N(μ, 0.5²), where μ = -1 when d_RT was small and μ = -2 when d_RT was large. Non-effortful responses were generated following Wang, Xu and Shang (2018), with the probability of a correct response, g_j, set at 0.25 for all non-effortful responses. In scenario II, only the first two factors were manipulated. Non-effortful RTs were generated from a uniform distribution with a lower bound of exp(-5) and an upper bound equal to the 5th percentile of the RT on item j with τ = 0. The probability of a correct response for non-effortful responses depended on the ability level of each examinee. In all conditions, the sample size was fixed at I = 2,000 and the test length at J = 30. For each condition, 30 replications were generated.
Effortful responses and RTs were simulated from van der Linden’s (2007) hierarchical model. Item parameters were generated with a_j ~ U(1, 2.5), b_j ~ N(0, 1), α_j ~ U(1.5, 2.5) and β_j ~ U(-0.2, 0.2). The person parameters (θ_i, τ_i) were generated from a bivariate normal distribution with mean vector μ = (0, 0)’ and covariance matrix $\Sigma=\left[\begin{array}{cc}1 & 0.25 \\ 0.25 & 0.25\end{array}\right]$. Four methods were compared under each condition: the original standard residual method (OSR), the conditional-estimate standard residual method (CSR), the conditional-estimate standard residual method with fixed item parameters and an iterative purification procedure (CSRI), and MHM. All methods were implemented in R and JAGS, with parameters calibrated by Bayesian MCMC sampling. The methods were evaluated in terms of convergence rate, detection accuracy and parameter recovery.
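As an illustration of the data-generating design for scenario I, the following is a minimal sketch in Python with NumPy rather than the R and JAGS code used in the study. Several details are assumptions not stated in the abstract: a two-parameter logistic model for effortful responses, a lognormal RT model with mean β_j − τ_i and standard deviation 1/α_j on the log scale, Bernoulli assignment of non-effortful individuals, and the function name `simulate_scenario_one`, which is hypothetical.

```python
import numpy as np


def simulate_scenario_one(n_persons=2000, n_items=30, prevalence=0.20,
                          severity=(0.5, 0.75), mu_non=-2.0, seed=2022):
    """Simulate responses and log-RTs with non-effortful contamination.

    Defaults correspond to pi = 20%, high severity and large d_RT.
    """
    rng = np.random.default_rng(seed)

    # Item parameters as listed in the simulation design.
    a = rng.uniform(1.0, 2.5, n_items)        # item discrimination
    b = rng.normal(0.0, 1.0, n_items)         # item difficulty
    alpha = rng.uniform(1.5, 2.5, n_items)    # time discrimination
    beta = rng.uniform(-0.2, 0.2, n_items)    # time intensity

    # Person parameters (ability, speed) from the bivariate normal.
    sigma = np.array([[1.0, 0.25], [0.25, 0.25]])
    theta, tau = rng.multivariate_normal([0.0, 0.0], sigma, size=n_persons).T

    # Effortful responses: 2PL logistic model (assumed form).
    p_correct = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    y = rng.binomial(1, p_correct)
    # Effortful log-RTs: normal with mean beta_j - tau_i and sd 1 / alpha_j.
    log_t = rng.normal(beta - tau[:, None], 1.0 / alpha)

    # Choose non-effortful persons (Bernoulli draw, an assumption) and
    # their individual proportion of non-effortful responses.
    is_non_person = rng.random(n_persons) < prevalence
    pi_non = np.where(is_non_person, rng.uniform(*severity, n_persons), 0.0)
    non_effort = rng.random((n_persons, n_items)) < pi_non[:, None]

    # Overwrite non-effortful cells: g_j = 0.25 and log-RT ~ N(mu_non, 0.5^2).
    y = np.where(non_effort, rng.binomial(1, 0.25, y.shape), y)
    log_t = np.where(non_effort, rng.normal(mu_non, 0.5, y.shape), log_t)

    return {"y": y, "log_t": log_t, "non_effort": non_effort,
            "theta": theta, "tau": tau}


data = simulate_scenario_one()
```

Changing `severity` to `(0.0, 0.25)` or `mu_non` to `-1.0` gives the low-severity and small-d_RT conditions; scenario II would need a different generator for non-effortful RTs and responses.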
The results were as follows. First, MHM suffered from convergence issues, especially for the latent variable indicating non-effortful responses, and these issues were more serious in scenario II. In contrast, all the standard residual methods converged successfully. Second, when all responses were effortful (π = 0%), the false positive rate (FPR) of MHM was 0. Although the standard residual methods had an FPR of around 5% (the nominal level), the accuracy of the parameter estimates was similar for all methods. Third, when data were contaminated by non-effortful responses, CSRI had a higher true positive rate (TPR) than the other methods in almost all conditions. MHM showed a lower TPR but also a lower false discovery rate (FDR), and its TPR was even lower in scenario II. When $\pi_{i}^{non}$ was high, CSRI and MHM showed clearer advantages over the other methods in parameter recovery. However, when $\pi_{i}^{non}$ was high and d_RT was small, MHM generally had a higher root mean square error (RMSE) than CSRI. Compared with scenario I, MHM performed worse in scenario II. The only weakness of CSRI was that it overestimated the time discrimination parameter in all conditions except when π = 40% and d_RT was large. In a real data example, all the methods were applied to a dataset collected for program assessment and accountability purposes from undergraduates at a mid-sized southeastern university in the USA. Evidence of convergent validity suggested that CSRI and MHM might detect non-effortful responses more accurately and yield more precise parameter estimates for these data.
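The detection criteria reported above (TPR, FPR and FDR) follow the usual confusion-matrix definitions. A minimal Python sketch is given below; the function name `detection_rates` and the toy flags are hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np


def detection_rates(flagged, truth):
    """Return TPR, FPR and FDR for flagged versus truly non-effortful responses."""
    flagged = np.asarray(flagged, dtype=bool)
    truth = np.asarray(truth, dtype=bool)

    tp = int(np.sum(flagged & truth))      # non-effortful and flagged
    fp = int(np.sum(flagged & ~truth))     # effortful but flagged
    fn = int(np.sum(~flagged & truth))     # non-effortful but missed
    tn = int(np.sum(~flagged & ~truth))    # effortful and not flagged

    def ratio(num, den):
        # Undefined rates (empty denominator) are reported as NaN.
        return num / den if den > 0 else float("nan")

    return {
        "TPR": ratio(tp, tp + fn),   # share of non-effortful responses detected
        "FPR": ratio(fp, fp + tn),   # share of effortful responses flagged
        "FDR": ratio(fp, fp + tp),   # share of flags that are wrong
    }


# Toy example: 3 truly non-effortful responses out of 8, 3 flags raised.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
flagged = [1, 1, 0, 1, 0, 0, 0, 0]
rates = detection_rates(flagged, truth)
```

A method can combine a low TPR with a low FDR, as reported for MHM, when it raises few flags but most of them are correct.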
In conclusion, CSRI generally performed better than the other methods across all conditions. It is highly recommended in practice because (1) it showed an acceptable FPR and fairly accurate parameter estimates even when all responses were effortful; (2) it is free of strong assumptions, which suggests that it is robust across situations; and (3) it showed the greatest advantages when $\pi_{i}^{non}$ was high, in terms of both detecting non-effortful responses and improving parameter estimation. To improve the estimation of the time discrimination parameter in CSRI, robust estimation methods that down-weight flagged response patterns can be used as an alternative to directly removing non-effortful responses (the approach taken in the current study). MHM can perform well when all its assumptions are met, $\pi_{i}^{non}$ is high and d_RT is large. However, some parameters are difficult to bring to convergence under MHM, which limits its application in practice.
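To make the idea of residual-based flagging with purification concrete, the sketch below computes standardized log-RT residuals under a lognormal RT model, z_ij = α_j(log t_ij − (β_j − τ_i)), with item parameters held fixed, and re-estimates each person's speed from unflagged responses until the flags stop changing. It is a simplified stand-in, not the CSRI procedure of the study: it uses a closed-form weighted estimate of τ_i instead of Bayesian MCMC, ignores response accuracy, and assumes a one-sided cutoff of -1.645. The function name `flag_rapid_responses` is hypothetical.

```python
import numpy as np


def flag_rapid_responses(log_t, alpha, beta, z_crit=-1.645, max_iter=20):
    """Flag unusually fast responses via standardized log-RT residuals.

    log_t : (persons, items) array of log response times
    alpha : (items,) time discrimination, treated as known and fixed
    beta  : (items,) time intensity, treated as known and fixed
    """
    log_t = np.asarray(log_t, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    weights = alpha ** 2                      # precision of each item's log-RT
    flagged = np.zeros(log_t.shape, dtype=bool)

    for _ in range(max_iter):
        keep = ~flagged
        # Precision-weighted estimate of speed from unflagged responses only.
        denom = np.sum(keep * weights, axis=1)
        denom = np.where(denom > 0, denom, np.nan)
        tau = np.sum(keep * weights * (beta - log_t), axis=1) / denom

        # Standardized residual of every response given the purified speed.
        z = alpha * (log_t - (beta - tau[:, None]))
        new_flagged = z < z_crit

        if np.array_equal(new_flagged, flagged):
            break                             # flags are stable: stop purifying
        flagged = new_flagged

    return flagged, z, tau
```

Re-estimating speed without the flagged responses keeps a block of rapid guesses from inflating the speed estimate, which would otherwise mask those same guesses in the residuals.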

Key words: non-effortful response, standard response time residual, iterative purification, mixture hierarchical model, Bayesian estimation

CLC Number: B841

LIU Yue, LIU Hongyun, YOU Xiaofeng, YANG Jianqin. (2022). A comparison of standard residual methods and a mixture hierarchical model for detecting non-effortful responses. Acta Psychologica Sinica, 54(4), 411-425.

References

[1] Borghans, L., & Schils, T. (2012). The leaning tower of PISA: Decomposing achievement test scores into cognitive and noncognitive components. The Netherlands: School of Business and Economics, Maastricht University.
[2] Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15(2), 223-234.
[3] Feinberg, R., & Jurich, D. (2018, April). Using rapid responses to evaluate test speededness. Paper presented at the meeting of the National Council of Measurement in Education (NCME), New York, NY.
[4] Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457-472.
[5] Hong, M., Rebouças, D. A., & Cheng, Y. (2021). Robust estimation for response time modeling. Journal of Educational Measurement, 58(2), 262-280.
[6] Köhler, C., Pohl, S., & Carstensen, C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54(4), 397-419.
[7] Liu, Y., Cheng, Y., & Liu, H. (2020). Identifying effortful individuals with mixture modeling response accuracy and response time simultaneously to improve item parameter estimation. Educational and Psychological Measurement, 80(4), 775-807.
[8] Liu, Y., & Liu, H. (2021). Detecting noneffortful responses based on a residual method using an iterative purification process. Journal of Educational and Behavioral Statistics, 46(6), 717-752.
[9] Lu, J., Wang, C., Zhang, J., & Tao, J. (2020). A mixture model for responses and response times with a higher‐order ability structure to detect rapid guessing behaviour. British Journal of Mathematical and Statistical Psychology, 73(2), 261-288.
[10] Matzke, D., Love, J., & Heathcote, A. (2017). A Bayesian approach for estimating the probability of trigger failures in the stop-signal paradigm. Behavior Research Methods, 49(1), 267-281.
[11] McHugh, M. L. (2013). The chi-square test of independence. Biochemia Medica, 23(2), 143-149.
[12] Molenaar, D., Bolsinova, M., & Vermunt, J. K. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71(2), 205-228.
[13] Pastor, D. A., Ong, T. Q., & Strickman, S. N. (2019). Patterns of solution behavior across items in low-stakes assessments. Educational Assessment, 24(3), 189-212.
[14] Plummer, M. (2003, March). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. Retrieved from https://www.r-project.org/conferences/DSC-2003/Drafts/Plummer.pdf
[15] Qian, H., Staniewska, D., Reckase, M., & Woo, A. (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38-47.
[16] Ranger, J., Wolgast, A., & Kuhn, J. T. (2019). Robust estimation of the hierarchical model for responses and response times. British Journal of Mathematical and Statistical Psychology, 72(1), 83-107.
[17] R Development Core Team. (2009). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org (ISBN 3-900051-07-0)
[18] Rios, J. A., Guo, H., Mao, L., & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74-104.
[19] Rose, N. (2013). Item nonresponses in educational and psychological measurement (Unpublished doctoral dissertation). Friedrich Schiller University, Jena, Germany.
[20] Setzer, J. C., Wise, S. L., van den Heuvel, J. R., & Ling, G. (2013). An investigation of examinee test-taking effort on a large-scale assessment. Applied Measurement in Education, 26(1), 34-49.
[21] Ulitzsch, E., von Davier, M., & Pohl, S. (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item‐level non‐response. British Journal of Mathematical and Statistical Psychology, 73(S1), 83-112.
[22] van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287-308.
[23] van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365-384.
[24] Wang, C., & Xu, G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456-477.
[25] Wang, C., Xu, G., & Shang, Z. (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83(1), 223-254.
[26] Wang, C., Xu, G., Shang, Z., & Kuncel, N. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469-501.
[27] Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237-252.
[28] Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52-61.
[29] Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19-38.
[30] Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86-105.