Advances in Psychological Science, 2025, Vol. 33, Issue (8): 1340-1357. doi: 10.3724/SP.J.1042.2025.1340
• Research Method •
CHEN Jingyi1, SONG Lihong2, WANG Wenyi1
Received: 2024-09-20
Online: 2025-08-15
Published: 2025-05-15
CHEN Jingyi, SONG Lihong, WANG Wenyi. Classification consistency for measuring classification reliability of psychological and educational tests[J]. Advances in Psychological Science, 2025, 33(8): 1340-1357.