Acta Psychologica Sinica  2017, Vol. 49, Issue 9: 1234-1246    DOI: 10.3724/SP.J.1041.2017.01234
 Reporting overall scores and domain scores of bi-factor models
 LIU Yue; LIU Hongyun
 (School of Psychology, Beijing Normal University, Beijing 100875, China)
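Summary  The bi-factor model can include one general factor and several specific factors at the same time, which gives it a distinct advantage in describing the structure of multidimensional tests, and it has been applied more and more widely in recent years. Based on the bi-factor model, this article proposes four methods for composing overall scores and domain scores: the original score method, the summed score method, the weighted sum method based on all items (global weighting), and the weighted sum method based on the items of each dimension (local weighting). A simulation study examined how these methods and the traditional multidimensional IRT method perform as sample size, test length, and the correlation between dimensions vary. Finally, the results were verified with an empirical study. The results show that: (1) the global and local weighted sum methods, especially the local one, produce overall scores and domain scores that are closest to the true values and have the highest reliability. (2) When the correlation between dimensions is high and the test is long, the local weighted sum method performs well, and under some conditions it even outperforms the multidimensional IRT method. (3) Only the domain scores composed by the local weighted sum method reflect the true correlations between dimensions.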
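The difference between global and local weighting can be illustrated with a small sketch. The page does not state how the weights are computed, so the rule below is purely an assumption made for illustration: weights are taken proportional to summed item slopes, with the general-factor slopes summed over all items (global) or over the items of one dimension only (local). The function names and all numbers are hypothetical and are not the authors' formula or data.

```python
def normalized_weights(general_slopes, specific_slopes):
    """Hypothetical rule: weights proportional to summed slopes, scaled to sum to 1."""
    g = sum(general_slopes)
    s = sum(specific_slopes)
    total = g + s
    if total <= 0:
        raise ValueError("slopes must sum to a positive value")
    return g / total, s / total


def weighted_domain_score(theta_general, theta_specific, weights):
    """Weighted sum of a general-factor score and a specific-factor score."""
    w_general, w_specific = weights
    return w_general * theta_general + w_specific * theta_specific


# Made-up slopes for a two-dimension test (illustrative values only).
general_a = [1.0, 1.0]    # general-factor slopes of the items in dimension A
general_b = [1.5, 0.5]    # general-factor slopes of the items in dimension B
specific_a = [0.5, 0.5]   # specific-factor slopes of the items in dimension A

# Assumed "local" weights use only dimension A's items;
# assumed "global" weights use the general slopes of every item.
local_w = normalized_weights(general_a, specific_a)
global_w = normalized_weights(general_a + general_b, specific_a)

# One examinee's factor scores (also made up).
local_score = weighted_domain_score(1.0, -0.5, local_w)
global_score = weighted_domain_score(1.0, -0.5, global_w)
print(local_w, global_w, local_score, global_score)
```

Under this assumed rule the global weights lean more heavily on the general factor, which is one way the two weighted methods could yield different domain scores for the same examinee.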
Abstract: In large-scale assessments, most tests have a multidimensional structure, and there is increasing interest in reporting overall scores and domain scores simultaneously. Domain scores complement overall scores by providing a finer-grained diagnosis of examinees' strengths and weaknesses. However, because each dimension contains only a small number of items, insufficient reliability is the primary impediment to generating and reporting domain scores. A number of methods have been developed recently to improve the reliability and optimality of overall scores and domain scores. For overall scores, commonly used procedures include simple or weighted averaging of the scores from different content areas and, under the MIRT framework, using the maximum information method to compute the weights of composite scores. Subscoring methods have also been proposed in the CTT and IRT frameworks, such as Kelley's (1927) regressed score method, the MIRT method, and the higher-order IRT method. The bi-factor model has become increasingly popular in educational measurement, which makes reporting overall scores and domain scores based on it an important topic.
The purpose of this study was to investigate several methods for generating overall scores and domain scores based on the bi-factor model, and to compare them with the MIRT method under different conditions. Study 1 was a simulation with a mixed design, with simulation conditions as between factors and methods as the within factor. There were three between factors: (1) three sample sizes (500, 1000, 2000); (2) three test lengths (18, 36, and 60 items); and (3) five correlations between dimensions (0.0, 0.3, 0.5, 0.7, 0.9).
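The simulation design can be enumerated in a few lines. This is a minimal sketch that assumes the three between factors are fully crossed, which the abstract implies but does not state outright; the constant and function names are illustrative.

```python
from itertools import product

# Levels of the three between factors, as listed in the abstract.
SAMPLE_SIZES = (500, 1000, 2000)
TEST_LENGTHS = (18, 36, 60)                # number of items
CORRELATIONS = (0.0, 0.3, 0.5, 0.7, 0.9)   # correlation between dimensions


def design_conditions():
    """Return every combination of the three between factors as a list of dicts."""
    return [
        {"sample_size": n, "test_length": k, "correlation": r}
        for n, k, r in product(SAMPLE_SIZES, TEST_LENGTHS, CORRELATIONS)
    ]


conditions = design_conditions()
print(len(conditions))  # 3 * 3 * 5 = 45 conditions
```

Every scoring method would then be applied within each of these conditions, since method is the within factor.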
The methods for generating overall scores and domain scores were: (1) original scores from the bi-factor model (Bifactor-M1); (2) summed original scores from the bi-factor model (Bifactor-M2); (3) weighted sums of the original scores from the bi-factor model based on all items (Bifactor-M3); and (4) weighted sums of the original scores from the bi-factor model based on the items of each dimension (Bifactor-M4). The overall scores from Bifactor-M3 and Bifactor-M4 were the same. Because many studies have found that MIRT-based methods provide the best estimates of overall scores and subscores, the MIRT method was also applied and compared with the methods based on the bi-factor model. Under each condition, 30 replications were generated using SimuMIRT (Yao, 2015). BMIRT (Yao, 2015) was used to estimate the domain ability parameters with an MCMC method, and the overall ability was then generated by the maximum information method. The results were evaluated by four criteria: root mean square error (RMSE), reliability, the correlation between the estimated scores and the true values, and the correlation between the estimated domain scores.
Study 2 was a real data example. A total of 4815 responses to the science test of the National College Entrance Examination were collected. The test contained 66 items covering three subjects: Physics (17 items), Chemistry (30 items), and Biology (19 items). The four proposed methods and the MIRT method were applied to estimate overall scores and domain scores. For the real data, the overall ability and domain ability estimates from the MIRT model were used as "true" values to compare the relative performance of the methods. The evaluation criteria were similar to those of the simulation study.
The simulation results for overall scores showed that: (1) Bifactor-M1 and Bifactor-M2 had larger RMSE than the other methods; when the correlation between dimensions was low, the RMSE of Bifactor-M1 was the largest, and as the correlation increased, the RMSE of Bifactor-M2 became the largest. (2) Bifactor-M3 and the MIRT method had the smallest RMSE. (3) As the correlation between dimensions increased, the RMSE of Bifactor-M3 and the MIRT method decreased. (4) As the test length and the correlation between dimensions increased, Bifactor-M3 tended to report more reliable overall scores (reliability higher than 0.8).
For domain scores: (1) Bifactor-M1 had the largest RMSE. (2) When the test was short, the RMSE of Bifactor-M2 was smaller than that of the MIRT method; when the test was long, the RMSE of Bifactor-M2 increased with the correlation between dimensions and was larger than that of the MIRT method when the correlation was 0.9. (3) The RMSE of Bifactor-M3 and Bifactor-M4 decreased as the correlation between dimensions increased. (4) The RMSE of Bifactor-M4 was equal to or smaller than that of the MIRT method. (5) As the test length and the correlation between dimensions increased, Bifactor-M3 and Bifactor-M4 tended to report more reliable domain scores. Finally, domain scores from Bifactor-M4 recovered the correlations among the true values better than the other methods.
For the real data example, the results showed that: (1) the bi-factor model fitted the data better than the UIRT and MIRT models; and (2) the overall scores from Bifactor-M3 and the domain scores from Bifactor-M4 were similar to those from the MIRT method.
In conclusion, the overall scores and domain scores from Bifactor-M4 generally performed better than those from the other proposed methods. First, the scores from Bifactor-M4 had smaller RMSE and higher reliability. Second, the correlations between domain scores from Bifactor-M4 were similar to the true values. This method is therefore recommended in practice, especially in the following situations: (1) When test designers have a specific definition of the core competencies, the bi-factor model can provide estimates of core competencies, overall scores, and domain scores simultaneously. (2) When a test has a multidimensional structure and the correlations between dimensions are high, the bi-factor model is suggested for calibrating the data. (3) When a study goes beyond reporting overall scores and domain scores and also focuses on the relationships among the general construct, the domain-specific constructs, and a criterion, the bi-factor model is recommended.
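Two of the evaluation criteria named above, RMSE and the correlation between estimated scores and true values, can be computed as in the following minimal sketch. The same correlation function applies to pairs of estimated domain scores. The abstract does not define its reliability index, so none is shown; the function names and the example numbers are illustrative only.

```python
import math


def rmse(estimates, truths):
    """Root mean square error between estimated and true scores."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up true and estimated scores for four examinees.
true_scores = [-1.0, 0.0, 1.0, 2.0]
estimated_scores = [-0.5, 0.0, 0.5, 2.5]

print(rmse(estimated_scores, true_scores))
print(pearson(estimated_scores, true_scores))
```

In a simulation like Study 1, these values would be computed per replication and then averaged within each condition.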
Key words: bi-factor model; multidimensional item response theory; overall scores; domain scores
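Received: 2016-07-07      Published: 2017-07-16
Funding: National Natural Science Foundation of China (31571152); joint construction project of Beijing Municipality and central universities in Beijing (019-105812); Advanced Innovation Center for Future Education; Fundamental Research Funds for the Central Universities.
Corresponding author: LIU Hongyun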
LIU Yue, LIU Hongyun.  Reporting overall scores and domain scores of bi-factor models. Acta Psychologica Sinica, 2017, 49(9): 1234-1246.