Mixture Model Method: A new method to handle aberrant responses in psychological and educational testing

doi:10.3724/SP.J.1042.2021.01696

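Abstract

Abstract (translated from the Chinese):

The mixture model method is a recently proposed approach for handling aberrant responses in psychological and educational tests. Compared with traditional methods such as the response time (RT) threshold method and the RT residual method, the mixture model method identifies aberrant responses and estimates model parameters simultaneously, and it still performs well when the data are heavily contaminated. Its principle is to build, according to the characteristics of normal and aberrant responses, different models for the response and/or RT component under each class of a categorical latent variable (i.e., a classification at the response level), and thereby to estimate the categorical latent variable together with the other item and person parameters in the model. This article describes in detail the mixture model methods proposed so far and compares them with the traditional methods. Future research could examine the robustness and applicability of mixture model methods when model assumptions are violated or when several types of aberrant responses are present, and could improve the efficiency of these methods by fixing some item parameters, adding a selection procedure, and similar means.

Keywords: aberrant responses, response time, threshold, residual method, mixture model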

Abstract:

Aberrant responses have been repeatedly reported in psychological and educational measurement. If traditional measurement models or methods (e.g., item response theory, IRT) are applied to data sets contaminated by aberrant responses, parameter estimates may be biased. Therefore, it is necessary to identify aberrant responses and to reduce their detrimental effects.

In the literature, there are two traditional response time (RT)-based methods for detecting aberrant responses: the RT threshold method and the RT residual method. Both focus on finding a threshold for the RT or the RT residual. If an RT or RT residual falls markedly below the threshold, the response is regarded as an aberrant response with an extremely short RT (e.g., speededness, rapid guessing) that provides no information about the test taker’s latent trait. A down-weighting strategy, which limits the influence of aberrant responses on parameter estimation by reducing their weight in the sample, can then be applied.
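The two traditional rules can be sketched in a few lines. This is a minimal illustration, not the procedure of any specific study: the function names, the 5-second threshold, the -1.96 cutoff, and all parameter values are hypothetical, and the residual rule assumes a lognormal-style RT model in which the expected log RT equals an item time-intensity parameter minus a person speed parameter.

```python
import math


def flag_by_threshold(rts, threshold):
    """Flag responses whose RT (in seconds) falls below a fixed threshold."""
    return [rt < threshold for rt in rts]


def flag_by_residual(log_rts, item_intensity, person_speed, item_sd, cutoff=-1.96):
    """Flag responses whose standardized log-RT residual lies below `cutoff`.

    Assumed model: expected log RT = item_intensity - person_speed.
    """
    flags = []
    for log_rt, beta, sd in zip(log_rts, item_intensity, item_sd):
        z = (log_rt - (beta - person_speed)) / sd
        flags.append(z < cutoff)
    return flags


# Hypothetical RTs (seconds) of one test taker on four items.
rts = [42.0, 3.0, 55.0, 2.5]
threshold_flags = flag_by_threshold(rts, threshold=5.0)
residual_flags = flag_by_residual(
    [math.log(rt) for rt in rts],
    item_intensity=[3.8, 3.6, 4.0, 3.7],  # hypothetical item parameters
    person_speed=0.0,
    item_sd=[0.5, 0.5, 0.5, 0.5],
)
```

Both rules treat a flagged response as an outlier relative to a reference value, which is why their performance depends on how many responses are aberrant.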

The mixture model method (MMM) is a new method proposed to handle data contaminated by aberrant responses. It applies the accommodating strategy, which extends a model so that it accounts for the contamination directly. MMM has two main advantages: (1) it detects aberrant responses and obtains parameter estimates simultaneously, rather than in two steps (detecting and down-weighting); (2) it precisely recovers the severity of aberrant responding. There are two categories of MMM. The first assumes that the classification (i.e., whether an item is answered normally or aberrantly) can be predicted by RT. The second is a natural extension of van der Linden’s (2007) hierarchical model, which models responses and RTs jointly. In this category, the observed RT and the correct-response probability of each item-by-person encounter are decomposed into a part due to normal responding and a part due to aberrant responding, according to the most important difference between the two behaviors. This leads to more precisely estimated item and person parameters, as well as excellent classification of aberrant and normal behavior.
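The core idea of a mixture approach, estimating class membership and model parameters together, can be illustrated with a deliberately simplified sketch. It fits a two-component normal mixture to log RTs by expectation-maximization. It uses RTs only and is not one of the hierarchical models described above; the lognormal components, the function names, and the simulated data are all assumptions made for illustration.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(log_rts, n_iter=200):
    """EM for a two-component normal mixture on log RTs.

    Returns (pi_fast, (mean_fast, sd_fast), (mean_slow, sd_slow), posteriors),
    where posteriors[i] is the probability that log_rts[i] is a fast response.
    """
    n = len(log_rts)
    ordered = sorted(log_rts)
    half = n // 2
    # Crude start: lower half of the data is "fast", upper half is "slow".
    mean_fast = statistics.fmean(ordered[:half])
    mean_slow = statistics.fmean(ordered[half:])
    sd_fast = sd_slow = max(statistics.pstdev(log_rts), 1e-3)
    pi_fast = 0.5
    posteriors = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of the fast class for each response.
        posteriors = []
        for x in log_rts:
            fast = pi_fast * normal_pdf(x, mean_fast, sd_fast)
            slow = (1.0 - pi_fast) * normal_pdf(x, mean_slow, sd_slow)
            total = fast + slow
            posteriors.append(fast / total if total > 0.0 else 0.5)
        # M-step: update class proportion, means, and standard deviations.
        n_fast = max(sum(posteriors), 1e-9)
        n_slow = max(n - n_fast, 1e-9)
        mean_fast = sum(p * x for p, x in zip(posteriors, log_rts)) / n_fast
        mean_slow = sum((1.0 - p) * x for p, x in zip(posteriors, log_rts)) / n_slow
        var_fast = sum(p * (x - mean_fast) ** 2 for p, x in zip(posteriors, log_rts)) / n_fast
        var_slow = sum((1.0 - p) * (x - mean_slow) ** 2 for p, x in zip(posteriors, log_rts)) / n_slow
        sd_fast = max(math.sqrt(var_fast), 1e-3)
        sd_slow = max(math.sqrt(var_slow), 1e-3)
        pi_fast = n_fast / n
    return pi_fast, (mean_fast, sd_fast), (mean_slow, sd_slow), posteriors


# Simulated log RTs: 60 rapid responses followed by 140 regular ones (made-up values).
random.seed(0)
log_rts = [random.gauss(1.0, 0.3) for _ in range(60)]
log_rts += [random.gauss(4.0, 0.4) for _ in range(140)]

pi_fast, fast_params, slow_params, posteriors = fit_two_component_mixture(log_rts)
flagged = [p > 0.5 for p in posteriors]  # responses classified as rapid
```

Here the estimated proportion `pi_fast` plays the role of the severity of aberrant responding, and no threshold has to be chosen in advance. The models summarized above go further by also modeling response accuracy within each class.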

First, this article compares the basic logic of the two traditional RT-based methods and MMM. Both the RT threshold method and the RT residual method regard aberrant responses as outliers, so they depend heavily on the severity of aberrance. If a data set is seriously contaminated by aberrant responses, the observed RT (or RT residual) distribution will differ from the expected distribution, which in turn leads to low power and sometimes a high false detection rate. MMM, by contrast, assumes that both the observed RT and the correct-response probability follow a mixture distribution and treats aberrant and normal responses equally, so it depends little on the severity of aberrance. In addition, MMM can in theory be applied when all respondents actually respond normally; in that situation, all responses are expected to be classified into one category. Second, this article summarizes the disadvantages of the three methods. MMM has three primary limitations: (1) it usually relies on strong assumptions and may not perform well if they are violated; (2) a low proportion of aberrant responses may lead to convergence and model identification problems; (3) it is complex and time-consuming. Overall, practitioners should choose a method according to the characteristics of the test and the type of aberrant response (e.g., rapid guessing, item preknowledge, cheating). Finally, this article suggests that future research investigate the performance of MMM when its assumptions are violated or when the data contain more types of aberrant response patterns. Fixing item parameter estimates and proposing indices that help in choosing a suitable method are encouraged as ways to improve the efficiency of MMM.

Key words: aberrant responses, response time, threshold, residual method, mixture model

CLC number:

B841

LIU Yue, LIU Hongyun. (2021). Mixture Model Method: A new method to handle aberrant responses in psychological and educational testing. Advances in Psychological Science (心理科学进展), 29(9), 1696-1710.

Figures/Tables: 2

References: 70

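[1]	Huang M., Pan Y., & Luo F. (2020). A two-stage cheating detection method combining information from multiple-choice and constructed-response items [in Chinese]. Journal of Psychological Science (心理科学), (1), 75-80.
[2]	Jian X., Jiao C., Reise S. P., & Peng C. (2010). Fitting and correcting aberrant examinee responses with the four-parameter model [in Chinese]. Advances in Psychological Science (心理科学进展), 18(3), 537-544.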
[3]	Baer R. A., Ballenger J., Berry D. T. R., & Wetter M. W. (1997). Detection of random responding on the MMPI-A. Journal of Personality Assessment, 68(1), 139-151. pmid: 16370774
[4]	Berry D. T. R., Wetter M. W., Baer R. A., Larsen L., Clark C., & Monroe K. (1992). MMPI-2 random responding indices: Validation using a self-report methodology. Psychological Assessment, 4(3), 340-345. doi: 10.1037/1040-3590.4.3.340
[5]	Bolsinova M., & Tijmstra J. (2019). Modeling differences between response times of correct and incorrect responses. Psychometrika, 84(4), 1018-1046. doi: 10.1007/s11336-019-09682-5 pmid: 31463656
[6]	Bolt D. M., Cohen A. S., & Wollack J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331-348. doi: 10.1111/jedm.2002.39.issue-4
[7]	Borghans L., & Schils T. (2012). The leaning tower of PISA: Decomposing achievement test scores into cognitive and noncognitive components (Unpublished doctoral dissertation). Maastricht University.
[8]	Bridgeman B., & Cline F. (2004). Effects of differentially time-consuming tests on computer-adaptive test scores. Journal of Educational Measurement, 41(2), 137-148. doi: 10.1111/jedm.2004.41.issue-2
[9]	Clark M. E., Gironda R. J., & Young R. W. (2003). Detection of back random responding: Effectiveness of MMPI-2 and personality assessment inventory validity indices. Psychological Assessment, 15(2), 223-234. doi: 10.1037/1040-3590.15.2.223
[10]	Cousineau D. (2009). Fitting the three-parameter Weibull distribution: Review and evaluation of existing and new methods. IEEE Transactions on Dielectrics and Electrical Insulation, 16(1), 281-288. doi: 10.1109/TDEI.2009.4784578
[11]	Custer M., Sharairi S., & Swift D. (2012, April). A comparison of scoring options for omitted and not-reached items through the recovery of IRT parameters when utilizing the Rasch model and joint maximum likelihood estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, BC, Canada.
[12]	Dolan C. V., van der Maas H. L. J., & Molenaar P. C. M. (2002). A framework for ML estimation of parameters of (mixtures of) common reaction time distributions given optional truncation or censoring. Behavior Research Methods, Instruments & Computers, 34, 304-323. doi: 10.3758/BF03195458
[13]	Feinberg R., & Jurich D. (2018, April). Using rapid responses to evaluate test speededness. Paper presented at the meeting of the National Council on Measurement in Education (NCME), New York, NY.
[14]	Goldhammer F., Martens T., Christoph G., & Lüdtke O. (2016). Test-taking engagement in PIAAC (OECD Education Working Papers, No. 133). Paris, France: OECD Publishing.
[15]	Guo H., Rios J. A., Haberman S., Liu O. L., Wang J., & Paek I. (2016). A new procedure for detection of students’ rapid guessing responses using response time. Applied Measurement in Education, 29(3), 173-183. doi: 10.1080/08957347.2016.1171766
[16]	Hauser C., & Kingsbury G. G. (2009). Individual score validity in a modest-stakes adaptive educational testing setting. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
[17]	Hauser C., Kingsbury G. G., & Wise S. L. (2008). Individual validity: Adding a missing link. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
[18]	Hong M. R., & Cheng Y. (2019a). Robust maximum marginal likelihood (RMML) estimation for item response theory models. Behavior Research Methods, 51(2), 573-588. doi: 10.3758/s13428-018-1150-4
[19]	Hong M. R., & Cheng Y. (2019b). Clarifying the effect of test speededness. Applied Psychological Measurement, 43(8), 611-623. doi: 10.1177/0146621618817783
[20]	Köhler C., Pohl S., & Carstensen C. H. (2017). Dealing with item nonresponse in large-scale cognitive assessments: The impact of missing data methods on estimated explanatory relationships. Journal of Educational Measurement, 54(4), 397-419. doi: 10.1111/jedm.2017.54.issue-4
[21]	Kong X. J., Wise S. L., & Bhola D. S. (2007). Setting the response time threshold parameter to differentiate solution behavior from rapid-guessing behavior. Educational and Psychological Measurement, 67(4), 606-619. doi: 10.1177/0013164406294779
[22]	Lee Y. H., & Jia Y. (2014). Using response time to investigate students’ test-taking behaviors in a NAEP computer-based study. Large-scale Assessments in Education, 2(8), 1-24.
[23]	Liu Y., Cheng Y., & Liu H. (2020). Identifying effortful individuals with mixture modeling response accuracy and response time simultaneously to improve item parameter estimation. Educational and Psychological Measurement, 80(4), 775-807. doi: 10.1177/0013164419895068
[24]	Lu J., Wang C., Zhang J., & Tao J. (2020). A mixture model for responses and response times with a higher-order ability structure to detect rapid guessing behaviour. British Journal of Mathematical and Statistical Psychology, 73(2), 261-288. doi: 10.1111/bmsp.v73.2
[25]	Ma L., Wise S. L., Thum Y. M., & Kingsbury G. (2011, April). Detecting response time threshold under the computer adaptive testing environment. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
[26]	Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174. doi: 10.1007/BF02296272
[27]	Meyer J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34(7), 521-538. doi: 10.1177/0146621609355451
[28]	Michaelides M. P., Ivanova M., & Nicolaou C. (2020). The relationship between response-time effort and accuracy in PISA science multiple choice items. International Journal of Testing, 20(3), 187-205. doi: 10.1080/15305058.2019.1706529
[29]	Molenaar D., Bolsinova M., & Vermunt J. K. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71(2), 205-228. doi: 10.1111/bmsp.2018.71.issue-2
[30]	Molenaar D., Bolsinova M., Rozsa S., & de Boeck P. (2016). Response mixture modeling of intraindividual differences in responses and response times to the Hungarian WISC-IV block design test. Journal of Intelligence, 4(3), 10-29. doi: 10.3390/jintelligence4030010
[31]	Molenaar D., Oberski D., Vermunt J., & de Boeck P. (2016). Hidden Markov item response theory models for responses and response times. Multivariate Behavioral Research, 51(5), 606-626. doi: 10.1080/00273171.2016.1192983 pmid: 27712114
[32]	Molenaar D., & de Boeck P. (2018). Response mixture modeling: Accounting for heterogeneity in item characteristics across response times. Psychometrika, 83(2), 279-297. doi: 10.1007/s11336-017-9602-9 pmid: 29392567
[33]	Morgenthaler S. (2007). A survey of robust statistics. Statistical Methods and Applications, 15, 271-293. doi: 10.1007/s10260-006-0034-4
[34]	Partchev I., & de Boeck P. (2012). Can fast and slow intelligence be differentiated? Intelligence, 40(1), 23-32. doi: 10.1016/j.intell.2011.11.002
[35]	Patton J. M., Cheng Y., Hong M. R., & Diao Q. (2019). Detection and treatment of careless responses to improve item parameter estimation. Journal of Educational and Behavioral Statistics, 44(3), 309-341. doi: 10.3102/1076998618825116
[36]	Pohl S., Haberkorn K., Hardt K., & Wiegand E. (2012). NEPS technical report for reading: Scaling results of starting cohort 3 in fifth grade. NEPS Working Paper No. 15. Bamberg: Otto-Friedrich-Universität, Nationales Bildungspanel.
[37]	Pokropek A. (2016). Grade of membership response time model for detecting guessing behaviors. Journal of Educational and Behavioral Statistics, 41(3), 300-325. doi: 10.3102/1076998616636618
[38]	Qian H., Staniewska D., Reckase M., & Woo A. (2016). Using response time to detect item preknowledge in computer-based licensure examinations. Educational Measurement: Issues and Practice, 35(1), 38-47.
[39]	Ranger J., & Kuhn J. T. (2017). Detecting unmotivated individuals with a new model-selection approach for Rasch models. Psychological Test and Assessment Modeling, 59(3), 269-295.
[40]	Ranger J., Wolgast A., & Kuhn J. T. (2019). Robust estimation of the hierarchical model for responses and response times. British Journal of Mathematical and Statistical Psychology, 72(1), 83-107. doi: 10.1111/bmsp.2019.72.issue-1
[41]	Rios J. A., Guo H., Mao L., & Liu O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74-104. doi: 10.1080/15305058.2016.1231193
[42]	Rose N. (2013). Item nonresponses in educational and psychological measurement (Unpublished doctoral dissertation). Friedrich-Schiller-University, Jena.
[43]	Rose N., von Davier M., & Nagengast B. (2017). Modeling omitted and not-reached items in IRT models. Psychometrika, 82(3), 795-819. doi: 10.1007/s11336-016-9544-7
[44]	Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph Supplement No. 17). Richmond, VA: Psychometric Society.
[45]	Schnipke D. L., & Scrams D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213-232. doi: 10.1111/jedm.1997.34.issue-3
[46]	Schnipke D. L., & Scrams D. J. (2002). Exploring issues of examinee behavior: Insights gained from response-time analyses. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 237-266). Mahwah, NJ: Lawrence Erlbaum.
[47]	Setzer J. C., Wise S. L., van den Heuvel J. R., & Ling G. (2013). An investigation of examinee test-taking effort on a large-scale assessment. Applied Measurement in Education, 26(1), 34-49. doi: 10.1080/08957347.2013.739453
[48]	Shao C., Li J., & Cheng Y. (2016). Detection of test speededness using change-point analysis. Psychometrika, 81(4), 1118-1141. pmid: 26305400
[49]	Silm G., Must O., & Täht K. (2013). Test-taking effort as a predictor of performance in low-stakes tests. TRAMES: A Journal of the Humanities & Social Sciences, 17(4), 433-448.
[50]	Sinharay S., & Johnson M. S. (2019). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73(3), 397-419. doi: 10.1111/bmsp.v73.3
[51]	Ulitzsch E., von Davier M., & Pohl S. (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. British Journal of Mathematical and Statistical Psychology, 73(S1), 83-112. doi: 10.1111/bmsp.v73.s1
[52]	van der Linden W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181-204. doi: 10.3102/10769986031002181
[53]	van der Linden W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308. doi: 10.1007/s11336-006-1478-z
[54]	van der Linden W. J., & Guo F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365-384. doi: 10.1007/s11336-007-9046-8
[55]	Wang C., Chang H. H., & Douglas J. A. (2013). The linear transformation model with frailties for the analysis of item response times. British Journal of Mathematical and Statistical Psychology, 66(1), 144-168. doi: 10.1111/j.2044-8317.2012.02045.x
[56]	Wang C., Fan Z., Chang H. H., & Douglas J. A. (2013). A semiparametric model for jointly analyzing response times and accuracy in computerized testing. Journal of Educational and Behavioral Statistics, 38(4), 381-417. doi: 10.3102/1076998612461831
[57]	Wang C., & Xu G. (2015). A mixture hierarchical model for response times and response accuracy. British Journal of Mathematical and Statistical Psychology, 68(3), 456-477. doi: 10.1111/bmsp.2015.68.issue-3
[58]	Wang C., Xu G., & Shang Z. (2018). A two-stage approach to differentiating normal and aberrant behavior in computer based testing. Psychometrika, 83(1), 223-254. doi: 10.1007/s11336-016-9525-x
[59]	Wang C., Xu G., Shang Z., & Kuncel N. (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469-501. doi: 10.3102/1076998618767123
[60]	Weirich S., Hecht M., Penk C., Roppelt A., & Böhme K. (2017). Item position effects are moderated by changes in test-taking effort. Applied Psychological Measurement, 41(2), 115-129. doi: 10.1177/0146621616676791
[61]	Wise S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237-252. doi: 10.1080/08957347.2015.1042155
[62]	Wise S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52-61. doi: 10.1111/emip.2017.36.issue-4
[63]	Wise S. L. (2019). An information-based approach to identifying rapid-guessing thresholds. Applied Measurement in Education, 32(4), 325-336. doi: 10.1080/08957347.2019.1660350
[64]	Wise S. L., & DeMars C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19-38. doi: 10.1111/jedm.2006.43.issue-1
[65]	Wise S. L., & DeMars C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15(1), 27-41. doi: 10.1080/10627191003673216
[66]	Wise S. L., & Kingsbury G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86-105. doi: 10.1111/jedm.12102
[67]	Wise S. L., & Ma L. (2012, April). Setting response time thresholds for a CAT item pool: The normative threshold method. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.
[68]	Wright B. D., & Stone M. H. (1979). Best test design. Rasch measurement. Chicago, IL: MESA Press.
[69]	Yan T., & Tourangeau R. (2008). Fast times and easy questions: The effects of age, experience and question complexity on web survey response times. Applied Cognitive Psychology, 22(1), 51-68. doi: 10.1002/(ISSN)1099-0720
[70]	Yu X., & Cheng Y. (2019). A change-point analysis procedure based on weighted residuals to detect back random responding. Psychological Methods, 24(5), 658-674. doi: 10.1037/met0000212

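Method type	Specific method	Does not jointly use RT and response information	Not based on a theoretical distribution	Occasional exceptions; cannot be applied in batch	Includes strong assumptions about aberrant responses	Sensitive to a high proportion of aberrant responses	Prone to problems when the proportion of aberrant responses is low	Computationally complex and time-consuming	Flagged responses are not necessarily aberrant	Only applicable when the correct-response probability of aberrant responses is known	Only identifies rapid aberrant responses
RT threshold method	Common threshold method	×	×								×
	Item-feature-based threshold method	×	×								×
	Bimodal-distribution intersection threshold method	×	×	×							×
	Normative threshold method		×								×
	Information-based threshold method		×	×							×
	Conditional distribution method		×	×						×	×
RT residual method	Standardized RT residual method	×				×					×
	Bayesian residual method					×		×			×
Mixture model method	Grade of membership RT model				×					×	×
	Semi-parametric mixture model				×		×	×	×		×
	RT-based response mixture model				×		×	×	×		×
	Mixture hierarchical model for RTs and responses				×		×	×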
