ISSN 0439-755X
CN 11-1911/B

Acta Psychologica Sinica ›› 2017, Vol. 49 ›› Issue (9): 1234-1246. doi: 10.3724/SP.J.1041.2017.01234


 Methods for compositing overall test scores and domain scores based on the bi-factor model

 LIU Yue; LIU Hongyun

  1.  (Faculty of Psychology, Beijing Normal University, Beijing 100875)
  • Received: 2016-07-07 Online: 2017-07-16 Published: 2017-09-25
  • Corresponding author: LIU Hongyun

 Reporting overall scores and domain scores of bi-factor models

 LIU Yue; LIU Hongyun   

  1.  (School of Psychology, Beijing Normal University, Beijing 100875, China)
  • Received:2016-07-07 Online:2017-07-16 Published:2017-09-25
  • Contact: LIU Hongyun

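Summary:  The bi-factor model contains one general factor and several specific factors at the same time, which gives it a distinct advantage in describing multidimensional test structures, and it has been applied increasingly widely in recent years. Based on the bi-factor model, this article proposes four methods for compositing overall scores and domain scores: the original score method, the summed score method, the weighted sum method based on all items (global weighting), and the weighted sum method based on the items of each dimension (local weighting). A simulation study examined how these methods and the traditional multidimensional IRT method perform as sample size, test length, and the correlation between dimensions vary. Finally, the results were verified in an empirical study. The results showed that: (1) the global and local weighted sum methods, especially the local weighted sum method, yield overall scores and domain scores that are closest to the true values and have the highest reliability. (2) When the correlation between dimensions is high and the test is long, the local weighted sum method performs well, and under some conditions it even outperforms the multidimensional IRT method. (3) Only the domain scores composited by the local weighted sum method reflect the true correlations between dimensions.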

Keywords:  bi-factor model, multidimensional IRT, overall scores, domain scores

Abstract:  In large-scale assessments, most tests have a multidimensional structure, and there is increasing interest in reporting overall scores and domain scores simultaneously. Domain scores complement overall scores by providing a finer-grained diagnosis of examinees’ strengths and weaknesses. However, because each dimension contains only a small number of items, insufficient reliability is the primary impediment to generating and reporting domain scores. A number of methods have been developed recently to improve the reliability and optimality of overall scores and domain scores. For overall scores, commonly used procedures include simple or weighted averaging of the scores from different content areas and, under the MIRT framework, using the maximum information method to compute the weights of composite scores. Subscoring methods have also been developed in the CTT and IRT frameworks, such as Kelley’s (1927) regressed score method, the MIRT method, and the higher-order IRT method. The bi-factor model has become increasingly popular in educational measurement, so reporting overall scores and domain scores based on it has become an important topic. The purpose of this study was to investigate several methods for generating overall scores and domain scores based on the bi-factor model, and to compare them with the MIRT method under different conditions. Study 1 used a mixed design, with simulation conditions as between factors and scoring method as the within factor. There were three between factors: (1) three sample sizes (500, 1000, 2000); (2) three test lengths (18, 36, and 60 items); and (3) five correlations between dimensions (0.0, 0.3, 0.5, 0.7, 0.9).
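As a minimal sketch, the fully crossed between-factor design of Study 1 can be enumerated as follows. The factor levels are taken from the abstract; the function name `build_conditions` and the dictionary keys are illustrative assumptions, not names used by the authors or their software.

```python
from itertools import product

# Between-factor levels as stated in the abstract.
SAMPLE_SIZES = (500, 1000, 2000)
TEST_LENGTHS = (18, 36, 60)
CORRELATIONS = (0.0, 0.3, 0.5, 0.7, 0.9)


def build_conditions():
    """Return every crossed simulation condition as a dict (hypothetical helper)."""
    return [
        {"n_examinees": n, "n_items": k, "dim_corr": r}
        for n, k, r in product(SAMPLE_SIZES, TEST_LENGTHS, CORRELATIONS)
    ]


conditions = build_conditions()
print(len(conditions))  # 3 * 3 * 5 = 45 crossed conditions
```

Each scoring method would then be applied within every one of these cells, which is what makes method the within factor.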
The methods for generating overall scores and domain scores were: (1) original scores from the bi-factor model (Bifactor-M1); (2) summed original scores from the bi-factor model (Bifactor-M2); (3) weighted sums of original scores from the bi-factor model based on all items (Bifactor-M3); and (4) weighted sums of original scores from the bi-factor model based on the items of each dimension (Bifactor-M4). The overall scores from Bifactor-M3 and Bifactor-M4 were the same. Because many studies have found that MIRT-based methods provide the best estimates of overall scores and subscores, the MIRT method was also applied and compared with the bi-factor methods. Under each condition, 30 replications were generated using SimuMIRT (Yao, 2015). BMIRT (Yao, 2015) was used to estimate domain ability parameters with an MCMC method, and the overall ability was then generated by the maximum information method. Finally, the results were evaluated by four criteria: root mean square error (RMSE), reliability, correlation between the estimated scores and the true values, and correlation between the estimated domain scores.
Study 2 was a real data example. A total of 4815 responses to the science test of the National College Entrance Examination were collected. The test contained 66 items covering three subjects: Physics (17 items), Chemistry (30 items), and Biology (19 items). The four proposed methods and the MIRT method were applied to estimate overall scores and domain scores. For the real data, the overall ability and domain ability estimates from the MIRT model were used as “true” values to compare the relative performance of the methods. The evaluation criteria were similar to those of the simulation study.
The simulation results showed that, for overall scores: (1) Bifactor-M1 and Bifactor-M2 had larger RMSE than the other methods; when the correlation between dimensions was low, the RMSE of Bifactor-M1 was the largest, and as the correlation increased, the RMSE of Bifactor-M2 became the largest. (2) Bifactor-M3 and the MIRT method had the smallest RMSE. (3) As the correlation between dimensions increased, the RMSE of Bifactor-M3 and the MIRT method decreased. (4) As the test length and the correlation between dimensions increased, Bifactor-M3 tended to report more reliable overall scores (reliability higher than 0.8).
For domain scores: (1) Bifactor-M1 had the largest RMSE. (2) When the test was short, the RMSE of Bifactor-M2 was smaller than that of the MIRT method; when the test was long, the RMSE of Bifactor-M2 increased with the correlation between dimensions and exceeded that of the MIRT method when the correlation was 0.9. (3) The RMSE of Bifactor-M3 and Bifactor-M4 decreased as the correlation between dimensions increased. (4) The RMSE of Bifactor-M4 was equal to or smaller than that of the MIRT method. (5) As the test length and the correlation between dimensions increased, Bifactor-M3 and Bifactor-M4 tended to report more reliable overall scores. Finally, domain scores from Bifactor-M4 recovered the correlations among the true values better than the other methods. For the real data example, the results showed that: (1) the bi-factor model fitted the data better than the UIRT and MIRT models; (2) overall scores from Bifactor-M3 and domain scores from Bifactor-M4 were similar to those from the MIRT method.
In conclusion, overall scores and domain scores from Bifactor-M4 generally performed better than those from the other proposed methods. First, the scores from Bifactor-M4 had smaller RMSE and higher reliability. Second, the correlations between domain scores from Bifactor-M4 were similar to the true values. This method is therefore highly recommended in practice, especially in the following situations: (1) When test designers have a specific definition of the core competencies, the bi-factor model can provide estimates of core competencies, overall scores, and domain scores simultaneously. (2) When a test has a multidimensional structure and the correlations between dimensions are high, the bi-factor model is suggested for calibrating the data. (3) When, in addition to reporting overall scores and domain scores, a study also focuses on the relationships among the general construct, the domain-specific constructs, and a criterion, the bi-factor model is recommended.
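Three of the evaluation criteria named in the abstract (RMSE, correlation between estimated and true scores, and correlation between estimated domain scores) can be sketched with their standard textbook definitions. This is an illustrative sketch only: the function names and the toy data are assumptions, and the abstract does not say how the authors computed reliability, so that criterion is left out.

```python
import math


def rmse(estimates, truth):
    """Root mean square error between estimated and true scores."""
    if len(estimates) != len(truth):
        raise ValueError("estimates and truth must have the same length")
    squared = sum((e - t) ** 2 for e, t in zip(estimates, truth))
    return math.sqrt(squared / len(truth))


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def domain_correlations(domain_scores):
    """Pairwise correlations between domains, given {domain name: scores}."""
    names = list(domain_scores)
    return {
        (a, b): pearson(domain_scores[a], domain_scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy values (hypothetical, not data from the study).
true_theta = [-1.0, 0.0, 1.0, 2.0]
estimated_theta = [-0.5, 0.0, 0.5, 1.5]
print(rmse(estimated_theta, true_theta))
print(pearson(estimated_theta, true_theta))
```

Comparing `domain_correlations` computed on estimated scores with the same quantity computed on true scores is one way to check how well a method recovers the true correlations between dimensions.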

Key words:  bi-factor model, multidimensional item response theory, overall scores, domain scores