ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2025, Vol. 33 ›› Issue (8): 1340-1357. doi: 10.3724/SP.J.1042.2025.1340 cstr: 32111.14.2025.1340

• Research Methods •

Classification reliability of psychological and educational tests: Methods for assessing classification consistency

陈静仪1, 宋丽红2, 汪文义1

  1. School of Computer and Information Engineering, Jiangxi Normal University
    2. School of Education, Jiangxi Normal University, Nanchang 330022, China
  • Received: 2024-09-20 Online: 2025-08-15 Published: 2025-05-15
  • Corresponding author: SONG Lihong, E-mail: viviansong1981@163.com
  • Funding:
    National Natural Science Foundation of China (62267004, 62467003, 62067005); Education and Teaching Reform Research Project of Jiangxi Provincial Universities (JXJG-22-2-44, JXJG-23-2-6)

Classification consistency for measuring classification reliability of psychological and educational tests

CHEN Jingyi1, SONG Lihong2, WANG Wenyi1

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China
    2School of Education, Jiangxi Normal University, Nanchang 330022, China
  • Received:2024-09-20 Online:2025-08-15 Published:2025-05-15

Abstract (Chinese):

Psychological, educational, and medical tests are widely used to classify examinees, yet internal consistency and reliability coefficients such as alpha cannot directly evaluate classification reliability. How to assess the classification reliability of criterion-referenced tests has therefore become an important concern for researchers and practitioners. From the perspective of classification consistency methods, this study examines approaches for estimating classification consistency from a single test administration, traces the development and core ideas of each class of representative methods, and, using the software packages and programs associated with each method, analyzes real data from personality tests, academic tests, and diagnostic tests. Combining theoretical analysis with data analysis, it summarizes the strengths, weaknesses, and influencing factors of each class of methods, offers recommendations for choosing among them, and discusses issues such as interval estimation of classification consistency, in order to advance the research, application, and reporting of classification consistency for classification tests.

Keywords: classification reliability, classification consistency, decision rules, cognitive diagnosis, machine learning

Abstract:

The reliability coefficients of norm-referenced tests are not appropriate for classification tests or criterion-referenced tests. Classification consistency is a crucial metric in psychological and educational measurement, reflecting the probability that examinees receive the same classification on two independent administrations of a test or on two parallel forms. It is widely used to evaluate the classification reliability of psychological assessments, educational tests, and medical diagnostic tests. Since administering a test twice, or constructing parallel forms, is often impractical because of the added testing time and test construction expense, many methods in psychological and educational measurement focus on estimating classification consistency from a single test administration. These methods provide important psychometric evidence for assessing and improving the reliability and fairness of tests.
This study first investigates the general framework for estimating classification consistency for criterion-referenced tests. The general procedure for estimating classification consistency from a single test administration can be summarized as follows: (a) determine the probability that an examinee is classified into each category according to the classification criteria; (b) assume that two administrations of the test, or two parallel forms, are independent and identically distributed; (c) compute the sum of the squared classification probabilities across all categories, which gives the conditional classification consistency for that examinee; and (d) obtain the marginal classification consistency with a person-based or distribution-based method.
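Steps (a)-(d) above can be sketched in a few lines. This is a minimal illustration of the general framework, not any specific published method; it assumes the per-category classification probabilities have already been obtained from some psychometric model:

```python
import numpy as np

def conditional_consistency(category_probs):
    """Step (c): conditional classification consistency for one examinee,
    i.e. the probability of the same classification on two i.i.d.
    administrations = sum of squared category probabilities."""
    p = np.asarray(category_probs, dtype=float)
    return float(np.sum(p ** 2))

def marginal_consistency(prob_matrix):
    """Step (d), person-based method: average the conditional
    consistency values over examinees (rows = examinees)."""
    P = np.asarray(prob_matrix, dtype=float)
    return float(np.mean(np.sum(P ** 2, axis=1)))
```

For example, an examinee with category probabilities (0.1, 0.7, 0.2) has conditional consistency 0.01 + 0.49 + 0.04 = 0.54.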
Following this general framework, methods have been developed for estimating single-administration classification consistency by considering measurement error, the conditional standard error of measurement, classification probabilities, and simulated retest classification errors under different psychometric models. This article describes the ideas and procedures of the representative methods in detail under classical test theory (CTT), item response theory (IRT), cognitive diagnostic models (CDM), and machine learning models (MLM). The theoretical foundations, computational steps, and applications of representative methods are systematically introduced under each model.
CTT-based methods estimate the classification consistency of observed test scores. For example, the Livingston and Lewis approach uses the test score distribution and test reliability to estimate classification consistency. The Lee method employs a compound multinomial distribution to establish the conditional distribution of total summed scores and uses it to compute the expected probability of each examinee falling into each performance-level category. A limitation of CTT, however, is that its parameters are sample- and test-dependent.
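The score-distribution idea behind Lee-type methods can be illustrated with a recursive convolution that builds the conditional distribution of the total summed score for dichotomous items (Lee's actual method uses a compound multinomial distribution that also handles polytomous items; the function names and cut-score convention here are illustrative assumptions):

```python
import numpy as np

def summed_score_dist(p_items):
    """Conditional distribution of the total summed score for
    dichotomous items with success probabilities p_items,
    built by convolving one item at a time."""
    dist = np.array([1.0])
    for p in p_items:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - p)   # item answered incorrectly
        new[1:] += dist * p          # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Probability mass in each performance level defined by integer
    cut scores; e.g. cuts=[5] gives P(score < 5) and P(score >= 5)."""
    edges = [0] + list(cuts) + [len(dist)]
    return [float(dist[a:b].sum()) for a, b in zip(edges[:-1], edges[1:])]
```

Squaring and summing the resulting category probabilities for an examinee then yields that examinee's conditional classification consistency, exactly as in the general framework.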
IRT-based methods estimate the classification consistency of observed test scores or latent ability by modeling the probability of an item response as a function of latent ability and item parameters. Rudner's approach estimates conditional classification consistency by incorporating the conditional standard error of measurement, which can be computed from the test information function at an individual's ability estimate. Lee's and Guo's methods employ the conditional distribution of total summed scores and the likelihood function, respectively, to compute each examinee's expected classification probabilities. These methods require relatively large sample sizes to calibrate item parameters.
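The normal-approximation idea behind Rudner's approach can be sketched as follows. The CSEM would come from the test information function, csem = 1/sqrt(I(theta_hat)); the interface below is an illustrative assumption, not a package API:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rudner_conditional_consistency(theta_hat, csem, cuts):
    """Rudner-style conditional consistency: classification
    probabilities from a normal approximation N(theta_hat, csem^2)
    over ability cut points, then the sum of squares under the
    i.i.d. retest assumption."""
    edges = [-math.inf] + list(cuts) + [math.inf]
    probs = []
    for a, b in zip(edges[:-1], edges[1:]):
        pa = 0.0 if a == -math.inf else normal_cdf((a - theta_hat) / csem)
        pb = 1.0 if b == math.inf else normal_cdf((b - theta_hat) / csem)
        probs.append(pb - pa)
    return sum(p * p for p in probs)
```

An examinee sitting exactly on a cut score with csem = 1 has classification probabilities (0.5, 0.5) and hence conditional consistency 0.5, the worst case for two categories.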
CDM-based methods are designed to evaluate the classification consistency of attribute patterns, attribute statuses, and the number of skills mastered. They provide a finer-grained way to report the reliability of cognitive diagnostic assessments: attribute-level consistency indices quantify classification reliability for each attribute, whereas pattern-level indices do so for the whole attribute pattern. MLM-based methods provide data-driven insights into classification reliability. They can learn complex relationships among test items from test data, offering dynamic and potentially more accurate estimates of classification consistency than traditional psychometric approaches.
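In the spirit of these attribute- and pattern-level indices, and under the same i.i.d. retest assumption as the general framework, a minimal sketch given posterior probabilities from a fitted CDM might look like this (the function names are illustrative, not any package's API):

```python
def attribute_consistency(posterior_mastery):
    """Attribute-level consistency for one examinee and one attribute:
    probability of the same mastery/non-mastery classification on two
    i.i.d. administrations, given posterior mastery probability p."""
    p = posterior_mastery
    return p * p + (1 - p) * (1 - p)

def pattern_consistency(pattern_posterior):
    """Pattern-level consistency: sum of squared posterior
    probabilities over all candidate attribute patterns."""
    return sum(q * q for q in pattern_posterior)
```

For instance, a posterior mastery probability of 0.9 gives attribute-level consistency 0.81 + 0.01 = 0.82, while a uniform posterior over four patterns gives the pattern-level floor of 0.25.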
Beyond introducing classification consistency methods, this study presents applications of classification consistency indices in educational, psychological, and diagnostic assessments. Four examples illustrate how to apply the indices to evaluate test reliability. A comparative analysis reveals that CTT-based methods offer simplicity and ease of computation but may lack precision for criterion-referenced tests (CRT); IRT-based methods improve estimation precision but require stronger assumptions; CDM-based methods are well suited to formative assessment; and machine learning methods, though promising, are still in the early stages of integration within psychometrics and require further validation before practical implementation.
Future research should investigate approaches for estimating confidence intervals for classification consistency, as current methods primarily provide point estimates. Additionally, more extensive empirical studies of MLM-based classification consistency estimation are needed. Researchers and practitioners are encouraged to incorporate and report classification consistency more routinely to enhance the overall quality and fairness of CRT. By systematically reviewing existing methodologies and their applications, this study highlights the significance of reporting classification consistency for CRT.
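As one possible direction for the interval-estimation problem, a percentile bootstrap over examinees' conditional consistency values would yield a simple interval for the marginal index. This is a speculative sketch under a person-based framework, not an established method from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(conditional_values, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for marginal classification consistency:
    resample examinees with replacement, recompute the mean of their
    conditional consistency values, and take empirical quantiles."""
    x = np.asarray(conditional_values, dtype=float)
    stats = np.array([
        rng.choice(x, size=x.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Resampling examinees (rather than item responses) treats each examinee's conditional consistency as the unit of analysis; bootstrapping at the response level would additionally propagate item-parameter uncertainty.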

Key words: classification reliability, classification consistency, decision rules, cognitive diagnosis, machine learning

CLC number: