ISSN 0439-755X
CN 11-1911/B

Acta Psychologica Sinica ›› 2013, Vol. 45 ›› Issue (4): 466-480

### Comparison of MIRT Linking Methods for Different Common Item Designs

LIU Yue;LIU Hongyun

(1 Sichuan Educational Assessment and Evaluation Centre, Sichuan Institute of Education Sciences, Chengdu 610225, China) (2 School of Psychology, Beijing Normal University, Beijing 100875, China)
• Received: 2012-09-24 Published: 2013-04-25 Online: 2013-04-25
• Contact: LIU Hongyun

Abstract: Many educational assessments measure more than one trait (Ackerman, 1992; DeMars, 2006; Reckase, 1985). To adjust scores across different forms of such tests, multidimensional item response theory (MIRT) and its linking procedures are needed. Several researchers have extended unidimensional IRT (UIRT) linking methods to the multidimensional case (Davey et al., 1996; Hirsch, 1989; Li & Lissitz, 2000; Min, 2003; Yon, 2006), and numerous studies have compared MIRT linking methods. However, although the choice of anchor items is of great importance in common item designs, few studies have compared MIRT linking methods under different common item designs, and it remains unclear how common items should be chosen for the different MIRT linking methods. The purpose of this study was to compare five MIRT linking methods under two strategies for choosing common items in a variety of situations. The study used a mixed design, with the simulation conditions as between factors and the linking method as the within factor. There were six between factors: (1) two test lengths (40 and 80 items); (2) two ratios of the number of items on one dimension to the number on the other (1:1 and 1:3); (3) three anchor lengths (1/20, 1/5, and 1/3 of the total test); (4) two strategies for choosing common items (choosing the same number of items from each dimension, or choosing items in proportion to the number of items on each dimension); (5) three correlations between the two ability dimensions (r = 0, 0.5, 0.9); and (6) equivalent versus non-equivalent ability levels in the two populations. The five MIRT linking methods investigated were the Mean/Mean (MM), Mean/Sigma (MS), Stocking-Lord (SL), Haebara (HB), and Least Squares (LS) methods. Under each condition, the number of examinees was fixed at I = 2000, and 30 replications were generated. BMIRT (Yao, 2003) was used to estimate the item and ability parameters with an MCMC method.
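As a rough illustration of the size of this design, the six between factors can be crossed in a few lines of Python. This is only a sketch: the dictionary keys and level labels are hypothetical names chosen here, and the three anchor lengths are read as 1/20, 1/5, and 1/3 of the test, which is an interpretation of the abstract rather than something it states item by item.

```python
from itertools import product

# Factor levels as listed in the abstract. The key names and string labels
# are illustrative, not taken from the study's own materials.
factors = {
    "test_length": [40, 80],
    "dimension_ratio": ["1:1", "1:3"],
    "anchor_fraction": [1 / 20, 1 / 5, 1 / 3],  # assumed reading of the three anchor lengths
    "anchor_strategy": ["equal_per_dimension", "proportional_to_dimension"],
    "ability_correlation": [0.0, 0.5, 0.9],
    "groups": ["equivalent", "non_equivalent"],
}

# Fully cross the between factors: one dict per simulation condition.
conditions = [dict(zip(factors, levels)) for levels in product(*factors.values())]

n_conditions = len(conditions)            # 2 * 2 * 3 * 2 * 3 * 2
n_replications = 30                       # replications per condition, from the abstract
n_runs = n_conditions * n_replications    # condition-by-replication combinations
```

Under this reading, the crossing gives 144 conditions and 4320 condition-by-replication combinations, each of which is then linked with all five methods (the within factor).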
Following previous equating studies (Kim & Cohen, 1998; Kim & Cohen, 2002), a two-step linking procedure was applied. The first step transformed the parameter scale of the new test onto that of the base test, and the second step transformed the scale of all simulated items onto the generating scale. In each step, the transformation matrices were produced by LinkMIRT (Yao, 2004) and the R package plink (Weeks, 2010). Parameter recovery was evaluated with four criteria: bias, mean absolute error, root mean square error (RMSE), and the correlation between the linked parameter estimates and the true values. Comparing the five MIRT linking methods, the results showed that the RMSEs of the parameters under the SL, HB, and LS methods were smaller and more stable across conditions, whereas the RMSEs under the MM and MS methods were substantially larger, especially in the non-equivalent group conditions. The remaining results are therefore reported for the SL, HB, and LS methods only. These three methods were not affected by the common item design: in multidimensional linking, the RMSE was acceptable when the common items made up more than 5% of the total test. The strategy for choosing common items likewise had no significant influence on the linking results of the three methods across the different test structures. As for the other simulation factors, the RMSE of these methods decreased as test length increased, and the RMSE of the ability parameters decreased as the correlation between the two ability dimensions increased. The difference in ability level between the two populations had a smaller effect on these methods: only for the intercept parameter did the non-equivalent group condition produce larger error. In conclusion, the SL, HB, and LS methods generally performed better than the other two methods across all conditions, and they are therefore recommended for use in practice.
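The four recovery criteria can be sketched for a single parameter vector as follows. This is a minimal illustration, not the study's own code: the function name `recovery_criteria` and the example numbers are made up, and the abstract does not say how the criteria were averaged over items and replications, so the sketch simply averages over the values it is given.

```python
import math

def recovery_criteria(estimates, truths):
    """Bias, MAE, RMSE and Pearson correlation between linked estimates and true values."""
    n = len(estimates)
    errors = [e - t for e, t in zip(estimates, truths)]

    bias = sum(errors) / n                            # mean signed error
    mae = sum(abs(d) for d in errors) / n             # mean absolute error
    rmse = math.sqrt(sum(d * d for d in errors) / n)  # root mean square error

    # Pearson correlation computed from centered cross-products.
    mean_e = sum(estimates) / n
    mean_t = sum(truths) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimates, truths))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_t = sum((t - mean_t) ** 2 for t in truths)
    corr = cov / math.sqrt(var_e * var_t)

    return {"bias": bias, "mae": mae, "rmse": rmse, "corr": corr}

# Made-up generating values and linked estimates, for illustration only.
truths = [0.8, 1.2, 1.5, 0.6]
estimates = [0.9, 1.1, 1.7, 0.5]
result = recovery_criteria(estimates, truths)
```

Bias can be near zero while RMSE is large when positive and negative errors cancel, which is one reason to report several criteria side by side.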
The SL, HB, and LS methods performed similarly under the different common item designs, which is a notable result for MIRT linking. Once an appropriate method has been chosen, a shorter anchor set can be used, which is helpful because developing good common items for multidimensional tests is time-consuming. The common items can be chosen either in proportion to the number of items on each dimension or in equal numbers from each dimension, which may also be more convenient for practitioners. Lastly, because test length had a significant effect on the accuracy of the equated parameters, the test should contain enough items on every dimension before MIRT linking is conducted.