%A LIU Yue, LIU Hongyun
%T Reporting overall scores and domain scores of bi-factor models
%0 Journal Article
%D 2017
%J Acta Psychologica Sinica
%R 10.3724/SP.J.1041.2017.01234
%P 1234-1246
%V 49
%N 9
%U {https://journal.psych.ac.cn/acps/CN/abstract/article_4027.shtml}
%8 2017-09-25
%X In large-scale assessments, most tests have a multidimensional structure, and there is increasing interest in reporting overall scores and domain scores simultaneously. Domain scores complement overall scores by providing a finer-grained diagnosis of examinees' strengths and weaknesses. However, because each dimension contains only a small number of items, insufficient reliability is the primary impediment to generating and reporting domain scores. A number of methods have been developed recently to improve the reliability and optimality of overall scores and domain scores. For overall scores, commonly used procedures include simple or weighted averaging of the scores from different content areas and using the maximum information method to compute the weights of composite scores under the MIRT framework. There are also subscoring methods in the CTT and IRT frameworks, such as Kelley's (1927) regressed score method, the MIRT method, and the higher-order IRT method. The bi-factor model has become increasingly popular in educational measurement, and reporting overall scores and domain scores based on it has become an important topic. The purpose of this study was to investigate several methods for generating overall scores and domain scores based on the bi-factor model and to compare them with the MIRT method under different conditions. Study 1 used a mixed design crossing simulation conditions (between factors) with methods (within factor). There were three between factors: (1) three sample sizes (500, 1000, 2000); (2) three test lengths (18, 36, and 60 items); and (3) five correlations between dimensions (0.0, 0.3, 0.5, 0.7, 0.9).
The methods for generating overall scores and domain scores were: (1) original scores from the bi-factor model (Bifactor-M1); (2) summed original scores from the bi-factor model (Bifactor-M2); (3) weighted sums of original scores from the bi-factor model based on all the items (Bifactor-M3); and (4) weighted sums of original scores from the bi-factor model based on the items of each dimension (Bifactor-M4). The overall scores from Bifactor-M3 and Bifactor-M4 were the same. Because many studies have found that MIRT-based methods provide the best estimates of overall scores and subscores, the MIRT method was also applied and compared with the methods based on the bi-factor model. Under each condition, 30 replications were generated using SimuMIRT (Yao, 2015). BMIRT (Yao, 2015) was applied to estimate domain ability parameters using an MCMC method, and the overall ability was then generated by the maximum information method. Finally, the results were evaluated by four criteria: root mean square error (RMSE), reliability, correlation between the estimated scores and true values, and correlation between the estimated domain scores. Study 2 was a real data example. A total of 4,815 responses to the science test of the National College Entrance Examination were collected. The test contained 66 items covering three subjects: Physics (17 items), Chemistry (30 items), and Biology (19 items). The four proposed methods and the MIRT method were applied to estimate overall scores and domain scores. For the real data, the overall ability and domain ability estimates from the MIRT model were used as "true" values to compare the relative performance of the different methods. The evaluation criteria were similar to those of the simulation study.
The results of the simulation showed that, for overall scores: (1) Bifactor-M1 and Bifactor-M2 had larger RMSE than the other methods; when the correlation between dimensions was low, the RMSE of Bifactor-M1 was the largest; as the correlation became larger, the RMSE of Bifactor-M2 became the largest. (2) Bifactor-M3 and the MIRT method had the smallest RMSE. (3) As the correlation between dimensions increased, the RMSE of Bifactor-M3 and the MIRT method decreased. (4) When the test length and the correlation between dimensions increased, Bifactor-M3 tended to report more reliable overall scores (reliability higher than 0.8). For domain scores: (1) Bifactor-M1 had the largest RMSE. (2) When the test was short, the RMSE of Bifactor-M2 was smaller than that of the MIRT method; when the test was long, the RMSE of Bifactor-M2 increased as the correlation between dimensions increased, and it was larger than that of the MIRT method when the correlation was 0.9. (3) The RMSE of Bifactor-M3 and Bifactor-M4 decreased as the correlation between dimensions increased. (4) The RMSE of Bifactor-M4 was equal to or smaller than that of the MIRT method. (5) When the test length and the correlation between dimensions increased, Bifactor-M3 and Bifactor-M4 tended to report more reliable overall scores. Finally, domain scores from Bifactor-M4 recovered the correlations among the true values better than the other methods. For the real data example, the results showed that: (1) the bi-factor model fitted the data better than the UIRT and MIRT models; (2) the overall scores from Bifactor-M3 and the domain scores from Bifactor-M4 were similar to those from the MIRT method. In conclusion, the overall scores and domain scores from Bifactor-M4 generally performed better than those from the other proposed methods. First, the scores from Bifactor-M4 had smaller RMSE and higher reliability. Second, the correlations between domain scores from Bifactor-M4 were similar to the true values.
Therefore, this method is highly recommended in practice, especially in the following situations: (1) When test designers have a specific definition of the core competencies, the bi-factor model can provide estimates of core competencies, overall scores, and domain scores simultaneously. (2) When tests have a multidimensional structure and the correlations between dimensions are high, the bi-factor model is suggested for calibrating the data. (3) When a study, in addition to reporting overall scores and domain scores, also focuses on the relationships among the general construct, the domain-specific constructs, and a criterion, the bi-factor model is recommended.