Research on test reliability in China’s mainland from 2001 to 2020

doi:10.3724/SP.J.1042.2022.01682

Abstract

With the application of confirmatory factor analysis, research on reliability has entered a new stage. In the first two decades of the 21st century, studies on test reliability (including point estimation and interval estimation) in China’s mainland developed along three main lines.

The first line is the shift from research centered on coefficient α to reliability research based on confirmatory factor models, including the homogeneity coefficient, composite reliability, maximum reliability, single-indicator reliability and the reliability of the whole item set scores. Studies have shown that coefficient α is still useful. In most cases, coefficient α is a lower bound of the reliability of the composite score (total or average score), so when coefficient α is high enough, the test reliability is at least as high. But coefficient α cannot be used to measure the homogeneity or the internal consistency of a test. The homogeneity coefficient based on the bi-factor model can be adopted to measure the homogeneity of a multidimensional scale, and the composite reliability can be adopted to measure internal consistency (if consistency is understood as the consistency within each dimension). Furthermore, the Delta method can be employed to estimate confidence intervals for the various reliability coefficients.
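
The lower-bound relation between coefficient α and composite reliability can be illustrated with a minimal Python sketch. It assumes a unidimensional factor model with the factor variance fixed at 1 and uncorrelated item errors; the loadings and error variances are hypothetical values chosen for illustration, and the function names are not taken from the article.

```python
import numpy as np

def coefficient_alpha(cov):
    """Coefficient alpha computed from an item covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    k = cov.shape[0]
    total_var = cov.sum()        # variance of the total score
    item_var = np.trace(cov)     # sum of the item variances
    return k / (k - 1) * (1.0 - item_var / total_var)

def composite_reliability(loadings, error_vars):
    """Composite reliability of the total score under a one-factor model
    (factor variance fixed at 1, uncorrelated errors: both are assumptions)."""
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(error_vars, dtype=float)
    true_var = lam.sum() ** 2    # variance of the total score due to the factor
    return true_var / (true_var + theta.sum())

def implied_covariance(loadings, error_vars):
    """Model-implied item covariance matrix: lambda * lambda' + diag(theta)."""
    lam = np.asarray(loadings, dtype=float)
    return np.outer(lam, lam) + np.diag(np.asarray(error_vars, dtype=float))

# Hypothetical congeneric test: unequal loadings.
loadings = [0.8, 0.7, 0.6, 0.5]
error_vars = [0.36, 0.51, 0.64, 0.75]
sigma = implied_covariance(loadings, error_vars)
alpha = coefficient_alpha(sigma)
omega = composite_reliability(loadings, error_vars)
print(f"alpha = {alpha:.3f}, composite reliability = {omega:.3f}")

# Hypothetical tau-equivalent test: equal loadings, so the two coincide.
eq_loadings = [0.7, 0.7, 0.7, 0.7]
eq_error_vars = [0.51, 0.51, 0.51, 0.51]
alpha_eq = coefficient_alpha(implied_covariance(eq_loadings, eq_error_vars))
omega_eq = composite_reliability(eq_loadings, eq_error_vars)
```

With these values, α (about 0.742) falls slightly below the composite reliability (about 0.749); with equal loadings the two are equal. In practice the loadings and error variances would come from a fitted confirmatory factor model rather than being set by hand.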

The second line is the expansion of the data types collected by scales (or questionnaires), from single-level data to multi-level and longitudinal data. For multi-level data, whether the test is unidimensional or multidimensional, it is recommended to use a multi-level confirmatory factor model to calculate the reliability. For longitudinal data, it is recommended to use the test reliability developed on the basis of the linear mixed model; longitudinal data can also be treated as a special case of two-level data for reliability analysis.

The third line is the extended use of reliability, involving rater reliability, encoder reliability, attribute-level classification consistency in cognitive diagnostic assessment, and the reliability of difference scores. In addition, research on reliability generalization and reliability meta-analysis has appeared.

For a common test whose item errors can reasonably be assumed uncorrelated, the following procedure of reliability analysis is recommended. When coefficient α is high enough, report coefficient α; otherwise calculate the composite reliability on the basis of the factor model. If the composite reliability is high enough, report the composite reliability; otherwise the test reliability is considered unacceptable.
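
The recommended procedure can be sketched as a small decision function. This is only an illustration: the abstract says "high enough" without giving a number, so the default `threshold` of 0.70 is an assumption, and `choose_reliability_report` and its arguments are hypothetical names. The composite reliability is passed as a callable so that the factor model is fitted only when coefficient α falls short.

```python
from typing import Callable, Tuple

def choose_reliability_report(
    alpha: float,
    fit_composite_reliability: Callable[[], float],
    threshold: float = 0.70,  # assumed cutoff; the source only says "high enough"
) -> Tuple[str, float]:
    """Return which coefficient to report and its value.

    Assumes item errors are uncorrelated, as the procedure requires.
    """
    # Step 1: alpha is a lower bound, so a high alpha can be reported directly.
    if alpha >= threshold:
        return ("coefficient alpha", alpha)
    # Step 2: otherwise estimate composite reliability from the factor model.
    composite = fit_composite_reliability()
    if composite >= threshold:
        return ("composite reliability", composite)
    # Step 3: neither coefficient is high enough.
    return ("unacceptable", composite)

# Usage with hypothetical values.
decision_high_alpha = choose_reliability_report(0.85, lambda: 0.90)
decision_low_alpha = choose_reliability_report(0.65, lambda: 0.78)
decision_rejected = choose_reliability_report(0.50, lambda: 0.60)
```

Here `decision_high_alpha` reports coefficient α, `decision_low_alpha` falls back to the composite reliability, and `decision_rejected` flags the test reliability as unacceptable.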

If the composite reliability of every variable in a statistical model is very high (over 0.95), modeling with composite scores does not differ much from modeling with latent variables. Otherwise, it is better to use latent variable modeling.
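
One way to see why very high reliability makes composite-score modeling close to latent-variable modeling is the classical attenuation formula. It is a standard classical test theory result offered here as intuition, assuming uncorrelated measurement errors; it is not taken from the article. The observed correlation between two composite scores X and Y equals the correlation between their true scores shrunk by the square root of the product of the two reliabilities:

```latex
r_{XY} = \rho_{T_X T_Y} \sqrt{\rho_{XX'} \, \rho_{YY'}}
```

If both reliabilities are 0.95, the shrinkage factor is \sqrt{0.95 \times 0.95} = 0.95, so the observed correlation is 95% of the true-score correlation. If both are 0.70, the factor drops to 0.70, and the composite-score results can differ noticeably from the latent-variable results.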

Key words: reliability, coefficient α, homogeneity coefficient, composite reliability, interval estimation

CLC Number: B841

WEN Zhonglin, CHEN Hongxi, FANG Jie, YE Baojuan, CAI Baozhen. Research on test reliability in China’s mainland from 2001 to 2020[J]. Advances in Psychological Science, 2022, 30(8): 1682-1691.

References

[1]	安胜利, 陈平雁. (2001). 量表的信度及其影响因素. 中国临床心理学杂志, 9(4), 315-318.
[2]	陈炳为, 许碧云, 倪宗瓒, 杨惠芳. (2005). 证实性因子分析在量表信度中的应用研究. 中国卫生统计, 22(4), 261-263.
[3]	陈社育, 余嘉元. (2001). 经典真分数理论与概化理论信度观评析. 心理科学进展, 9(3), 258-263.
[4]	陈希镇. (1991). 如何正确使用信度估计公式. 心理学报, 24(1), 41-49.
[5]	陈希镇, 李学娟. (2011). 结构方程模型下的信度估计. 统计与决策, 27(1), 13-15.
[6]	丁树良, 周新莲. (2002). 一种新的信度估计. 江西师范大学学报(自然科学版), 26(3), 222-224.
[7]	方敏. (2009). 结构方程模型下的信度检验. 中国卫生统计, 26(5), 524-526.
[8]	顾海根, 李超. (2005). 同质信度多种指标的比较研究. 心理科学, 28(5), 1196-1198.
[9]	顾红磊, 温忠麟. (2014). 项目表述效应对自陈量表信效度的影响——以核心自我评价量表为例. 心理科学, 37(5), 1245-1252.
[10]	顾红磊, 温忠麟. (2017). 多维测验分数的报告与解释: 基于双因子模型的视角. 心理发展与教育, 33(4), 504-512.
[11]	顾红磊, 温忠麟, 方杰. (2014). 双因子模型: 多维构念测量的新视角. 心理科学, 37(4), 973-979.
[12]	关丹丹, 张厚粲. (2004). 信度的再认识与信度概括化研究. 心理科学, 27(2), 445-448.
[13]	关丹丹, 张厚粲, 李中权. (2005). 差异分数的信度分析. 心理科学, 28(1), 161-163.
[14]	关守义. (2009). 克龙巴赫α系数研究述评. 心理科学, 32(3), 685-687.
[15]	郭磊, 张金明. (2018). 使用Bootstrap方法计算认知诊断评估中的信度. 心理学探新, 38(5), 433-439.
[16]	何佳, 何惧, 席雁, 徐超. (2007). 评分者信度的分析方法简介及比较. 中国现代医生, 45(6), 76-77.
[17]	侯杰泰, 温忠麟, 成子娟. (2004). 结构方程模型及其应用. 北京: 教育科学出版社.
[18]	蒋小花, 沈卓之, 张楠楠, 廖洪秀, 徐海燕. (2010). 问卷的信度和效度分析. 现代预防医学, 37(3), 429-431.
[19]	焦璨, 吴利, 张敏强, 张文怡. (2009). 信度概化研究的新进展评析. 学术研究, 52(2), 54-59.
[20]	焦璨, 张敏强, 黄庆均, 张文怡, 黎光明. (2008). 非正态分布测量数据对克隆巴赫信度α系数的影响. 应用心理学, 14(3), 276-281.
[21]	李斌, 辛涛, 张淑梅, 孙佳楠. (2011). 多评分者多任务情境下评分者信度的模型拟合研究. 湖南师范大学教育科学学报, 10(6), 107-110.
[22]	李春会, 朱永忠. (2012). 基于信度系数与α系数分析结构方程模型. 暨南大学学报(自然科学与医学版), 33(3), 250-252.
[23]	李宇斌, 蔡艳, 涂冬波. (2020). 手机依赖的计算机化自适应测量及其效果评估. 心理科学, 43(3), 748-755.
[24]	刘红云. (2008). α系数与测验的同质性. 心理科学, 31(1), 185-188.
[25]	刘霖芯, 张韬, 杨珉. (2018). 利用多水平模型计算及校正Cronbach alpha系数. 中国卫生统计, 35(6), 838-842.
[26]	刘拓, 戴晓阳. (2011). 不拟合被试对测验信、效度的影响. 中国临床心理学杂志, 19(6), 743-745.
[27]	马文军, 潘波. (2000). 问卷的信度和效度以及如何用SAS软件分析. 中国卫生统计, 17(6), 364-365.
[28]	麦玉娇, 温忠麟. (2013). 探索性结构方程建模(ESEM): EFA和CFA的整合. 心理科学进展, 21(5), 934-939.
[29]	孟庆茂, 刘红云. (2002). α系数在使用中存在的问题. 心理学探新, 22(3), 42-47.
[30]	孙晓敏, 张厚粲. (2005). 表现性评价中评分者信度估计方法的比较研究——从相关法、百分比法到概化理论. 心理科学, 28(3), 646-649.
[31]	田雪垠, 郑蝉金, 郭少阳, 贺冠瑞. (2019). 基于多层验证性因素分析的各种信度系数方法. 心理学探新, 39(5), 461-467.
[32]	屠金路, 金瑜, 王庭照. (2005). bootstrap法在合成分数信度区间估计中的应用. 心理科学, 28(5), 1199-1200.
[33]	屠金路, 王庭照, 金瑜. (2010). 结构方程模型下多因子非同质测量合成分数的信度估计. 心理科学, 33(3), 666-669.
[34]	汪大勋, 涂冬波. (2021). 认知诊断计算机化自适应测量技术在心理障碍诊断与评估中的应用. 江西师范大学学报(自然科学版), 45(2), 111-117.
[35]	王孟成, 叶宝娟. (2014). 通过Mplus计算几种常用的测验信度. 心理学探新, 34(1), 48-52.
[36]	汪文义, 方小婷, 叶宝娟. (2018). 认知诊断属性分类一致性信度区间估计三种方法. 心理科学, 41(6), 1492-1499.
[37]	汪文义, 朱黎君, 叶宝娟, 方小婷. (2020). Bootstrap区间估计在认知诊断模型误设中的应用. 心理科学, 43(6), 1498-1505.
[38]	韦嘉, 郭磊, 张进辅. (2017). 表述效应对平衡量表内部一致性信度的影响. 西南大学学报(自然科学版), 39(8), 133-139.
[39]	温忠麟, 方杰, 沈嘉琦, 谭倚天, 李定欣, 马益铭. (2021). 新世纪20年国内心理统计方法研究回顾. 心理科学进展, 29(8), 1331-1344.
[40]	温忠麟, 黄彬彬, 汤丹丹. (2018). 问卷数据建模前传. 心理科学, 41(1), 204-210.
[41]	温忠麟, 叶宝娟. (2011). 测验信度估计: 从α系数到内部一致性信度. 心理学报, 43(7), 821-829.
[42]	吴瑞林, 袁克海. (2012). 基于结构方程模型的合成信度及其使用问题研究. 统计与信息论坛, 27(12), 14-20.
[43]	席仲恩, 汪顺玉. (2007). 论负克伦巴赫alpha系数和分半信度系数. 重庆邮电大学学报(自然科学版), 19(6), 785-787.
[44]	谢小庆. (1998). 信度估计的γ系数. 心理学报, 30(2), 193-196.
[45]	徐建平, 张厚粲. (2005). 质性研究中编码者信度的多种方法考察. 心理科学, 28(6), 152-154.
[46]	徐万里. (2008). 结构方程模式在信度检验中的应用. 统计与信息论坛, 23(7), 9-13.
[47]	严芳, 李伟明. (2002). 用结构方程建模(SEM)估计概化理论(GT)中的评分者信度. 心理学报, 34(5), 534-539.
[48]	杨强, 叶宝娟, 温忠麟. (2014a). 两种估计多维测验合成信度置信区间方法比较. 心理学探新, 34(1), 43-47.
[49]	杨强, 叶宝娟, 温忠麟. (2014b). 用SPSS软件计算单维测验的合成信度. 中国临床心理学杂志, 22(3), 496-498.
[50]	叶宝娟. (2012). 偏态分布下单维测验合成信度三种区间估计的比较. 教育测量与评价(理论版), 5(10), 28-32.
[51]	叶宝娟, 温忠麟. (2011). 单维测验合成信度三种区间估计的比较. 心理学报, 43(4), 453-461.
[52]	叶宝娟, 温忠麟. (2012a). 用 Delta 法估计多维测验合成信度的置信区间. 心理科学, 35(5), 1213-1217.
[53]	叶宝娟, 温忠麟. (2012b). 测验同质性系数及其区间估计. 心理学报, 44(12), 1687-1694.
[54]	叶宝娟, 温忠麟. (2013a). α系数的区间估计方法比较. 心理科学, 36(1), 215-222.
[55]	叶宝娟, 温忠麟. (2013b). 两水平研究中单维测验信度的估计. 心理科学, 36(3), 728-733.
[56]	叶宝娟, 温忠麟, 陈启山. (2012). 追踪研究中测验信度的估计. 心理科学进展, 20(3), 467-474.
[57]	叶宝娟, 温忠麟, 胡竹菁. (2013). 单维测验合成信度元分析. 心理科学, 36(6), 1464-1469.
[58]	叶宝娟, 杨强. (2011). 用验证性因子分析估计单维测验的信度. 教育测量与评价(理论版), 4(11), 8-12.
[59]	叶宝娟, 杨强. (2014). 偏态分布下多维测验合成信度区间估计的比较. 教育测量与评价(理论版), 7(11), 8-11.
[60]	叶宝娟, 杨强. (2015). 用Delta法估计误差相关测验合成信度的置信区间: 以FAD为例. 心理学探新, 35(3), 251-256.
[61]	张力为. (2002). 信度的正用与误用. 北京体育大学学报, 25(3), 348-350.
[62]	张龙飞, 刘凯, 宋鸽, 涂冬波. (2020). 计算机化自适应测验技术在情绪智力智能测评中的初步应用——基于项目反应理论. 江西师范大学学报(自然科学版), 44(5), 454-461.
[63]	Alonso A., Laenen A., Molenberghs G., Geys H., & Vangeneugden T. (2010). A unified approach to multi-item reliability. Biometrics, 66(4), 1061-1068. doi: 10.1111/j.1541-0420.2009.01373.x pmid: 20070298
[64]	Bentler P. M. (2009). Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika, 74(1), 137-143. doi: 10.1007/s11336-008-9100-1 pmid: 20161430
[65]	Edwards A. A., Joyner K. J., & Schatschneider C. (2021). A simulation study on the performance of different reliability estimation methods. Educational and Psychological Measurement, 81(6), 1089-1117. doi: 10.1177/0013164421994184 pmid: 34565817
[66]	Fu Y., Wen Z., & Wang Y. (2018). The total score with maximal reliability and maximal criterion validity: An illustration using a career satisfaction measure. Educational and Psychological Measurement, 78(6), 1108-1122. doi: 10.1177/0013164417738564
[67]	Fu Y., Wen Z., & Wang Y. (2022). A comparison of reliability estimation based on confirmatory factor analysis and exploratory structural equation models. Educational and Psychological Measurement, 82(2), 205-224. doi: 10.1177/00131644211008953
[68]	Graham J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66(6), 930-944. doi: 10.1177/0013164406288165
[69]	Kelley K., & Pornprasertmanit S. (2016). Confidence intervals for population reliability coefficients: Evaluation of methods, recommendations, and software for composite measures. Psychological Methods, 21(1), 69-92. doi: 10.1037/a0040086 pmid: 26962759
[70]	Lai M. H. C. (2020). Composite reliability of multilevel data: It's about observed scores and construct meanings. Psychological Methods, 26(1), 90-102. doi: 10.1037/met0000287
[71]	Lord F. M., & Novick M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
[72]	Maydeu-Olivares A., Coffman D. L., & Hartmann W. M. (2007). Asymptotically distribution free (ADF) interval estimation of coefficient alpha. Psychological Methods, 12(2), 157-176. pmid: 17563170
[73]	McNeish D. (2018). Thanks coefficient alpha, we’ll take it from here. Psychological Methods, 23(3), 412-433. doi: 10.1037/met0000144 pmid: 28557467
[74]	Padilla M. A., & Divers J. (2016). A comparison of composite reliability estimators: Coefficient omega confidence intervals in the current literature. Educational and Psychological Measurement, 76(3), 436-453. doi: 10.1177/0013164415593776 pmid: 29795872
[75]	Pfadt J. M., van den Bergh D., Sijtsma K., Moshagen M., & Wagenmakers E. (in press). Bayesian estimation of single-test reliability coefficients. Multivariate Behavioral Research.
[76]	Raykov T. (2001). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54(2), 315-323. doi: 10.1348/000711001159582
[77]	Raykov T., & Marcoulides G. A. (2019). Thanks coefficient alpha, we still need you! Educational and Psychological Measurement, 79(1), 200-210. doi: 10.1177/0013164417725127 pmid: 30636788
[78]	Raykov T., & Shrout P. E. (2002). Reliability of scales with general structure: Point and interval estimation using a structural equation modeling approach. Structural Equation Modeling, 9(2), 195-212. doi: 10.1207/S15328007SEM0902_3
[79]	Reise S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667-696. doi: 10.1080/00273171.2012.715555
[80]	Revelle W., & Zinbarg R. E. (2009). Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma. Psychometrika, 74(1), 145-154. doi: 10.1007/s11336-008-9102-z
[81]	Scherer R., & Teo T. (2020). A tutorial on the meta-analytic structural equation modeling of reliability coefficients. Psychological Methods, 25(6), 747-775. doi: 10.1037/met0000261
[82]	Sijtsma K., & Pfadt J. M. (2021). Part Ⅱ: On the use, the misuse, and the very limited usefulness of Cronbach's alpha: Discussing lower bounds and correlated errors. Psychometrika, 86(4), 843-860. doi: 10.1007/s11336-021-09789-8 pmid: 34387809
[83]	ten Hove D., Jorgensen T. D., & van der Ark L. A. (in press). Interrater reliability for multilevel data: A generalizability theory approach. Psychological Methods.
[84]	Zinbarg R. E., Yovel I., Revelle W., & McDonald R. P. (2006). Estimating generalizability to a latent variable common to all of a scale's indicators: A comparison of estimators for ω_h. Applied Psychological Measurement, 30(2), 121-144. doi: 10.1177/0146621605278814

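Category	References
Coefficient α	安胜利 et al. (2001); 孟庆茂 et al. (2002); 陈炳为 et al. (2005); 席仲恩 et al. (2007); 焦璨 et al. (2008); 刘红云 (2008); 关守义 (2009); 蒋小花 et al. (2010); 刘拓 et al. (2011); 温忠麟 et al. (2011); 李春会 et al. (2012); 叶宝娟 & 温忠麟 (2013a); 王孟成 et al. (2014)
Homogeneity coefficient	丁树良 et al. (2002); 孟庆茂 et al. (2002); 顾海根 et al. (2005); 刘红云 (2008); 陈希镇 et al. (2011); 温忠麟 et al. (2011, 2018); 叶宝娟 & 温忠麟 (2012b); 顾红磊 et al. (2014, 2017)
Composite reliability	张力为 (2002); 屠金路 et al. (2005, 2010); 徐万里 (2008); 温忠麟 et al. (2011); 叶宝娟 & 温忠麟 (2011, 2012a); 叶宝娟 et al. (2013, 2014, 2015); 吴瑞林 et al. (2012); 叶宝娟 (2012); 杨强 et al. (2014a, 2014b); 韦嘉 et al. (2017)
Maximum reliability	叶宝娟 & 杨强 (2011); 田雪垠 et al. (2019)
Single-indicator reliability	方敏 (2009); 王孟成 et al. (2014)
Reliability of the whole item set scores	叶宝娟 & 杨强 (2011)
Reliability in two-level studies	叶宝娟 & 温忠麟 (2013b); 刘霖芯 et al. (2018); 田雪垠 et al. (2019)
Reliability in longitudinal studies	叶宝娟 et al. (2012)
Rater reliability	严芳 et al. (2002); 孙晓敏 et al. (2005); 何佳 et al. (2007); 蒋小花 et al. (2010); 李斌 et al. (2011)
Encoder reliability	徐建平 et al. (2005)
Attribute-level classification consistency in cognitive diagnostic assessment	郭磊 et al. (2018); 汪文义 et al. (2018, 2020)
Reliability of difference scores	关丹丹 et al. (2005)
Reliability generalization	关丹丹 et al. (2004); 焦璨 et al. (2009)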
