ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society
   Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica ›› 2012, Vol. 44 ›› Issue (8): 1124-1136.

• Articles •

Methods for Testing Measurement Equivalence of Categorical Data and Their Comparison: Testing Between-Group Differences in Item Threshold (Difficulty) Parameters

LIU Hong-Yun; LI Chong; ZHANG Ping-Ping; LUO Fang

  1. (1 School of Psychology, Beijing Normal University; Beijing Key Lab of Applied Experimental Psychology, Beijing 100875, China)
    (2 Learning & Development Center, Beijing New Oriental School, Beijing 100080, China) (3 National Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China)
  • Received: 2011-12-14 Published: 2012-08-28 Online: 2012-08-28
  • Corresponding author: LUO Fang

Testing Measurement Equivalence of Categorical Items’ Threshold/Difficulty Parameters: A Comparison of CCFA and (M)IRT Approaches

LIU Hong-Yun; LI Chong; ZHANG Ping-Ping; LUO Fang

  1. (1 School of Psychology, Beijing Normal University; Beijing Key Lab of Applied Experimental Psychology, Beijing 100875, China)
    (2 Learning & Development Center, Beijing New Oriental School, Beijing 100080, China)
    (3 National Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China)
  • Received: 2011-12-14 Published: 2012-08-28 Online: 2012-08-28
  • Contact: LUO Fang

Abstract: Measurement equivalence of an instrument is a prerequisite for comparing groups on it. The two main families of equivalence tests are CFA-based multi-group comparison and IRT-based DIF detection. This article compares the CCFA-based DIFFTEST method with the IRT-based IRT-LR test for unidimensional tests, and DIFFTEST with the MIRT-based chi-square test (MIRT-MG) for multidimensional tests. Through simulation, the power and Type I error rates of these methods were compared, considering total sample size, balance of group sizes, test length, magnitude of the threshold difference, and the correlation between dimensions. The results showed: (1) For unidimensional tests, IRT-LR is a stricter test than DIFFTEST; for multidimensional tests, MIRT-MG detects item-threshold differences more readily than DIFFTEST when the test is long and the dimensions are highly correlated, whereas DIFFTEST is slightly more powerful than MIRT-MG when the test is short and the inter-dimension correlations are low. (2) The power of DIFFTEST, IRT-LR, and MIRT-MG all increased with the threshold difference; when the difference was medium or large, all three methods effectively detected threshold non-equivalence. (3) The power of all three methods increased with total sample size; with the total sample size fixed, power was higher when the two groups were of equal size than when they were unbalanced. (4) With the number of non-invariant items fixed, the power of DIFFTEST decreased as the test grew longer, whereas the power of IRT-LR and MIRT-MG increased. (5) The mean Type I error rate of DIFFTEST was close to the nominal 0.05, whereas the mean Type I error rates of IRT-LR and MIRT-MG were far below 0.05.
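The power patterns reported in the abstract (rejection rates rising with the size of the threshold difference and with sample size, and a near-nominal Type I error rate when no difference exists) can be illustrated with a toy Monte Carlo sketch. This is not the paper's actual design: a real replication would fit CCFA or (M)IRT models to multi-item data. Here a latent-normal cutoff on a single dichotomous item stands in for the threshold model, a two-proportion z-test stands in for the model-based DIF tests, and all function names are my own.

```python
import numpy as np
from scipy.stats import norm

def power_sim(delta, n_per_group=500, n_rep=1000, tau=0.0, alpha=0.05, seed=0):
    """Toy power estimate for detecting a between-group threshold shift.

    Each simulee's latent trait is standard normal; the item is 'endorsed'
    when the trait exceeds the group's threshold (tau for the reference
    group, tau + delta for the focal group). A two-proportion z-test is
    a simplified stand-in for the model-based DIF tests in the paper.
    """
    rng = np.random.default_rng(seed)
    crit = norm.ppf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_rep):
        p_ref = (rng.standard_normal(n_per_group) > tau).mean()
        p_foc = (rng.standard_normal(n_per_group) > tau + delta).mean()
        pooled = (p_ref + p_foc) / 2
        se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
        if se > 0 and abs(p_ref - p_foc) / se > crit:
            rejections += 1
    return rejections / n_rep

# Power rises with the threshold shift; at delta = 0 the rejection
# rate estimates the Type I error (about the nominal alpha).
```

Even in this stripped-down form, the qualitative pattern matches findings (2) and (3): larger threshold shifts and larger samples both push the rejection rate up, while a zero shift yields rejections near the nominal level.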

Key words: categorical data, confirmatory factor analysis, differential item functioning, (multidimensional) item response theory, measurement equivalence

Abstract: Multiple-group confirmatory factor analysis and differential item functioning (DIF) tests based on unidimensional or multidimensional item response theory are the two most commonly used approaches for assessing the measurement equivalence of categorical items. Unlike traditional linear factor analysis, multiple-group categorical confirmatory factor analysis (CCFA) models categorical measures appropriately through a threshold structure, where the thresholds are comparable to the difficulty parameters in (multidimensional) item response theory [(M)IRT]. In this study, we used the Monte Carlo method to compare CCFA and (M)IRT in terms of their power to detect violations of measurement invariance (i.e., DIF). Moreover, given the restrictive assumptions of the traditional unidimensional IRT model, this study extended the DIF test to the MIRT model. Simulation studies under both unidimensional and multidimensional conditions compared the DIFFTEST method, the IRT-LR method (for unidimensional scales), and the MIRT-MG method (for multidimensional scales) with respect to their power to detect a lack of invariance across groups. Results indicated that the three methods, namely DIFFTEST, IRT-LR, and MIRT-MG, showed reasonable power to identify measurement non-equivalence when the threshold difference was large. For unidimensional scales, the IRT-LR test demonstrated power superior to DIFFTEST. For multidimensional scales, however, the results were not completely consistent across conditions: the power of MIRT-MG was higher than that of DIFFTEST when the test was long and the correlation between dimensions was high, whereas the power of DIFFTEST was higher than that of MIRT-MG when the test was short and the correlations between dimensions were low.
For a fixed number of non-invariant items, the power of DIFFTEST decreased as the test length increased, whereas the power of IRT-LR and MIRT-MG increased. The number of respondents per group (sample size) was one of the most important factors affecting all three approaches: the power of DIFFTEST, IRT-LR, and MIRT-MG increased with sample size. For a fixed total number of observations, the power of all three methods was higher under the balanced design, with two groups of equal size, than under the unbalanced design. The Type I error rate of DIFFTEST was close to the nominal 5% level, while IRT-LR and MIRT-MG produced Type I error rates far below it.
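The model-comparison logic shared by the IRT-LR and MIRT-MG tests described above is: fit a constrained model that holds the studied item's thresholds equal across groups, fit a free model that lets them differ, and refer twice the log-likelihood difference to a chi-square distribution with degrees of freedom equal to the number of parameters freed. A minimal sketch of that final step, assuming the two models have already been fitted elsewhere (the log-likelihood values below are illustrative placeholders, not results from the paper):

```python
from scipy.stats import chi2

def lr_dif_test(ll_constrained, ll_free, df_diff):
    """Likelihood-ratio test of threshold invariance.

    ll_constrained: log-likelihood of the model holding the studied
        item's thresholds equal across groups.
    ll_free: log-likelihood of the model freeing those thresholds.
    df_diff: number of threshold parameters freed between the models.
    """
    g2 = 2.0 * (ll_free - ll_constrained)  # deviance difference
    p = chi2.sf(g2, df_diff)               # upper-tail chi-square p-value
    return g2, p

# Hypothetical fits: freeing two thresholds improves fit by 5 log-likelihood units.
g2, p = lr_dif_test(ll_constrained=-1250.0, ll_free=-1245.0, df_diff=2)
# g2 = 10.0; p ≈ 0.0067, so threshold invariance is rejected at alpha = .05
```

Note that DIFFTEST follows a different route to the same decision: it compares nested CCFA models fitted with a robust (WLSMV) estimator, so the raw chi-square difference must be adjusted rather than computed directly as above.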

Key words: categorical data, confirmatory factor analysis, differential item functioning, (multidimensional) item response theory, measurement equivalence