ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society
   Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica, 2013, Vol. 45, Issue (4): 466-480. doi: 10.3724/SP.J.1041.2013.00466


Comparison of MIRT Linking Methods for Different Common Item Designs

LIU Yue; LIU Hongyun

  1. (1 Sichuan Educational Assessment and Evaluation Centre, Sichuan Institute of Education Sciences, Chengdu 610225, China) (2 School of Psychology, Beijing Normal University, Beijing 100875, China)
  • Received: 2012-09-24  Online: 2013-04-25  Published: 2013-04-25
  • Contact: LIU Hongyun
  • Supported by: the National Natural Science Foundation of China (31100759) and a Key Project of the Ministry of Education under the National Education Sciences "Twelfth Five-Year Plan" (GFA111001)

Abstract: In practice, tests often have a multidimensional structure, and applying unidimensional IRT equating methods to such tests yields inaccurate results. Multidimensional IRT (MIRT) linking methods are therefore needed to place the parameters of multidimensional tests on a common scale. Based on the common-item (anchor) design, this simulation study examined the performance of several MIRT linking methods under different anchor test designs, considering six factors: test length, the ratio of the numbers of items on the two dimensions, anchor test length, the strategy for selecting anchor items, the correlation between the two dimensions, and the difference in ability level between the groups being linked. The MIRT linking methods compared were the Mean/Mean (MM) method, the Mean/Sigma (MS) method, the Stocking-Lord (SL) method, the Haebara (HB) method, and the Least Square (LS) method. The results showed that: (1) the SL, HB, and LS methods yielded the smallest root mean square errors (RMSE) of linking and performed stably across conditions; (2) the MM and MS methods produced very large RMSE under non-equivalent group conditions; (3) the anchor test design had no significant effect on the linking results of the SL, HB, and LS methods; (4) the SL, HB, and LS methods produced the smallest RMSE when the correlation between the two dimensions was high, the test and the anchor test were long, and there was no ability difference between the groups being linked.
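For reference, the scale transformation that these MIRT linking methods estimate can be written for the compensatory MIRT model common in this literature (e.g., Li & Lissitz, 2000). The parameterization below is a minimal sketch and an assumption, not necessarily the authors' exact formulation:

```latex
% Compensatory MIRT (M2PL) response model:
\[
P(U_{ij} = 1 \mid \boldsymbol{\theta}_j)
  = \frac{1}{1 + \exp\!\bigl[-(\mathbf{a}_i^{\prime}\boldsymbol{\theta}_j + d_i)\bigr]}
\]
% Linear transformation between the new-form and base-form ability metrics,
% with dilation/rotation matrix A and translation vector beta:
\[
\boldsymbol{\theta}^{*} = \mathbf{A}\,\boldsymbol{\theta} + \boldsymbol{\beta}
\]
% Invariance of the logit forces the anchor-item parameters to transform as:
\[
\mathbf{a}_i^{*\prime} = \mathbf{a}_i^{\prime}\mathbf{A}^{-1},
\qquad
d_i^{*} = d_i - \mathbf{a}_i^{\prime}\mathbf{A}^{-1}\boldsymbol{\beta}
\]
```

In the unidimensional tradition from which these methods are extended, the MM and MS methods compute the transformation constants from the means (and standard deviations) of the anchor-item parameter estimates, whereas the SL and HB methods choose the transformation that minimizes squared differences between test and item characteristic curves, respectively.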

Keywords: test equating, multidimensional IRT, Mean/Mean method, Mean/Sigma method, Stocking-Lord method, Haebara method

Abstract: Many educational assessments measure more than one trait (Ackerman, 1992; DeMars, 2006; Reckase, 1985). In order to adjust scores across different forms of such tests, multidimensional item response theory (MIRT) and its linking procedures need to be developed. Several researchers have extended unidimensional IRT (UIRT) linking methods to multidimensional structures (Davey et al., 1996; Hirsch, 1989; Li & Lissitz, 2000; Min, 2003; Yon, 2006), and numerous studies have compared MIRT linking methods. However, although the choice of anchor items is of great importance in common item designs, few studies have compared MIRT linking methods under different common item designs, and it remains unclear how common items should be chosen for different MIRT linking methods. The purpose of this study was to compare five MIRT linking methods under two strategies for choosing common items in various situations. The study used a mixed design with simulation conditions as between factors and linking method as a within factor. There were six between factors: (1) 2 test lengths (40 items and 80 items); (2) 2 levels of the ratio of the number of items in one dimension to the other (1:1 and 1:3); (3) 3 anchor lengths (1/20, 1/5, and 1/3 of the total test); (4) 2 strategies for choosing common items (choosing items equally from all dimensions, or in proportion to the number of items in each dimension); (5) 3 correlations between the two ability dimensions (r = 0, 0.5, 0.9); (6) 2 levels of group equivalence (equivalent vs. non-equivalent ability distributions between the two populations). The five MIRT linking methods investigated were the Mean/Mean (MM) method, the Mean/Sigma (MS) method, the Stocking-Lord (SL) method, the Haebara (HB) method, and the Least Square (LS) method. Under each condition, the number of examinees was fixed at 2000, and 30 replications were generated. BMIRT (Yao, 2003) was applied to estimate item and ability parameters using an MCMC method. Following previous equating studies (Kim & Cohen, 1998, 2002), a two-step linking procedure was applied: the first step transformed the scale of the parameters of the new test onto that of the base test, and the second step transformed the scale of all the simulated items onto the generating scale. In each step, the transformation matrices were produced by LinkMIRT (Yao, 2004) and the R package "plink" (Weeks, 2010). Finally, the recovery of the parameters was evaluated by four criteria: bias, mean absolute error (MAE), root mean square error (RMSE), and the correlation between the equated parameters and their true values. The results showed that the RMSEs of the parameters under the SL, HB, and LS methods were smaller and more stable across situations, whereas the RMSEs under the MM and MS methods were very large, especially under non-equivalent group conditions. Therefore, subsequent results are reported only for the SL, HB, and LS methods. These methods were not affected by the common item design: in multidimensional linking, as long as the number of common items exceeded 5% of the total test, the RMSE was acceptable. Meanwhile, the strategy for choosing common items had no significant influence on the linking results of the three methods across the different test-structure conditions.
Moreover, for the other simulation factors: as test length increased, the RMSE of these methods decreased; as the correlation between the two ability dimensions increased, the RMSE of the ability parameters decreased; and the difference in ability level between the two populations had a smaller effect on these methods, with only the intercept parameter showing larger error under the non-equivalent group condition. In conclusion, the SL, HB, and LS methods generally performed better than the other two methods across all conditions, so they are highly recommended for practical use. The performance of the SL, HB, and LS methods was similar under different common item designs, which is encouraging for MIRT linking: once an appropriate method is chosen, a shorter anchor set can be used, which matters because developing good common items for multidimensional tests is quite time-consuming. Meanwhile, the common items can be chosen either in proportion to the number of items in each dimension or equally from all dimensions, which is also more convenient for practitioners. Lastly, as test length had a significant effect on the accuracy of the equated parameters, it is suggested that the test include enough items in every dimension before an MIRT linking is conducted.
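As an illustration of the four recovery criteria named in the abstract (bias, MAE, RMSE, and the correlation with true values), the sketch below shows one way such summaries could be computed across replications. It is not the authors' code: the array shapes, variable names, and the choice to correlate the mean estimates with the true values are assumptions.

```python
import numpy as np

def recovery_criteria(est, true):
    """Summarize parameter recovery for one type of parameter.

    est  : (n_replications, n_parameters) array of estimates after the
           two-step linking back to the generating scale.
    true : (n_parameters,) array of generating (true) values.

    Returns bias, mean absolute error (MAE), root mean square error (RMSE),
    and the correlation between the mean estimates and the true values.
    """
    err = est - true                               # broadcasts over replications
    bias = err.mean()
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    corr = np.corrcoef(est.mean(axis=0), true)[0, 1]
    return bias, mae, rmse, corr

# Hypothetical example: 30 replications of 40 intercept (d) parameters.
rng = np.random.default_rng(0)
true_d = rng.normal(size=40)
est_d = true_d + rng.normal(scale=0.1, size=(30, 40))
print(recovery_criteria(est_d, true_d))
```

Computing the criteria separately for each parameter type (discrimination, intercept, ability) and each simulation condition matches how the results above are reported.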

Key words: test equating, Multidimensional Item Response Theory, Mean/Mean (MM) method, Mean/Sigma (MS) method, Stocking-Lord (SL) method, Haebara (HB) method, Least Square (LS) method