ISSN 0439-755X
CN 11-1911/B

Acta Psychologica Sinica ›› 2012, Vol. 44 ›› Issue (8): 1124-1136.

Testing Measurement Equivalence of Categorical Items’ Threshold/Difficulty Parameters: A Comparison of CCFA and (M)IRT Approaches

LIU Hong-Yun; LI Chong; ZHANG Ping-Ping; LUO Fang

  • (1) School of Psychology, Beijing Normal University; Beijing Key Lab of Applied Experimental Psychology, Beijing 100875, China
    (2) Beijing New Oriental School, Learning & Development Center, Beijing 100080, China
    (3) National Key Laboratory of Cognitive Neuroscience and Learning, Beijing 100875, China
  • Received: 2011-12-14  Published: 2012-08-28  Online: 2012-08-28
  • Contact: LUO Fang

Abstract: Multiple-group confirmatory factor analysis and differential item functioning (DIF) analysis based on unidimensional or multidimensional item response theory are the two most commonly used approaches to assessing the measurement equivalence of categorical items. Unlike traditional linear factor analysis, multiple-group categorical confirmatory factor analysis (CCFA) can model categorical measures appropriately through a threshold structure, and these thresholds are comparable to the difficulty parameters in (multidimensional) item response theory [(M)IRT]. In this study, we used the Monte Carlo method to compare CCFA and (M)IRT in terms of their power to detect violations of measurement invariance (i.e., DIF). Moreover, given the restrictive assumptions of the traditional unidimensional IRT model, this study extended the DIF test to the (M)IRT model. Simulation studies under both unidimensional and multidimensional conditions were conducted to compare the DIFFTEST method, the IRT-LR method (for unidimensional scales), and the MIRT-MG method (for multidimensional scales) with respect to their power to detect the lack of invariance across groups. Results indicated that all three methods, DIFFTEST, IRT-LR, and MIRT-MG, showed reasonable power to identify measurement non-equivalence when the threshold difference was large. For unidimensional scales, IRT-LR demonstrated greater power than DIFFTEST. For multidimensional scales, however, the results were not completely consistent across conditions: the power of MIRT-MG exceeded that of DIFFTEST when the test was long and the correlations between dimensions were high, whereas DIFFTEST was more powerful than MIRT-MG when the test was short and the correlations between dimensions were low. For a fixed number of non-invariant items, the power of DIFFTEST decreased as test length increased, whereas the power of IRT-LR and MIRT-MG increased with test length. The number of respondents per group (sample size) was one of the most important factors affecting the performance of all three approaches: the power of DIFFTEST, IRT-LR, and MIRT-MG increased with sample size. For a fixed total number of observations, the power of all three methods was higher under a balanced design, in which the two groups were equal in size, than under an unbalanced design with unequal group sizes. Finally, the Type I error rate of DIFFTEST was close to the nominal 5% level, whereas IRT-LR and MIRT-MG produced much lower Type I error rates.
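
Note: The comparability of CCFA thresholds and IRT difficulty parameters mentioned in the abstract can be made concrete for a dichotomous item. A minimal sketch, assuming a standardized latent trait, the delta parameterization (unit-variance latent response variable), and the normal-ogive link, with factor loading \lambda_j and threshold \tau_j:

    P(y_j = 1 \mid \theta) \;=\; \Phi\!\left(\frac{\lambda_j \theta - \tau_j}{\sqrt{1-\lambda_j^2}}\right) \;=\; \Phi\!\big(a_j(\theta - b_j)\big),
    \qquad a_j = \frac{\lambda_j}{\sqrt{1-\lambda_j^2}}, \quad b_j = \frac{\tau_j}{\lambda_j}.

The power comparison summarized above rests on a Monte Carlo design in which some items' thresholds/difficulties differ between a reference and a focal group. The following is a minimal, hypothetical sketch (not the authors' code; the model, item count, and parameter values are illustrative assumptions) of how such non-invariant binary data could be generated under a unidimensional 2PL model before being analyzed with DIFFTEST, IRT-LR, or MIRT-MG:

    # Illustrative data generation for a DIF/measurement-equivalence simulation.
    # All values are assumptions for demonstration, not taken from the paper.
    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_2pl(n, a, b, rng):
        """Generate an n x J binary response matrix from a 2PL model with N(0,1) ability."""
        theta = rng.normal(size=(n, 1))              # latent trait
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # item response probabilities
        return (rng.uniform(size=p.shape) < p).astype(int)

    J = 10
    a = np.full(J, 1.2)               # common discriminations
    b_ref = rng.normal(0.0, 1.0, J)   # reference-group difficulties
    b_foc = b_ref.copy()
    b_foc[0] += 0.5                   # item 1 is non-invariant: difficulty shifted in the focal group

    X_ref = simulate_2pl(1000, a, b_ref, rng)   # reference group (n = 1000)
    X_foc = simulate_2pl(1000, a, b_foc, rng)   # focal group (n = 1000)
    # The two data sets would then be tested for invariance of item 1's
    # threshold/difficulty with DIFFTEST (CCFA), IRT-LR, or MIRT-MG.
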

Key words: categorical data, confirmatory factor analysis, differential item functioning, (multidimensional) item response theory, measurement equivalence