ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2025, Vol. 33 ›› Issue (8): 1340-1357. doi: 10.3724/SP.J.1042.2025.1340 cstr: 32111.14.2025.1340

• Research Methods •

Classification reliability of psychological and educational tests: Methods for assessing classification consistency

陈静仪1, 宋丽红2, 汪文义1

  1. School of Computer and Information Engineering, Jiangxi Normal University
    2. School of Education, Jiangxi Normal University, Nanchang 330022, China
  • Received: 2024-09-20 Online: 2025-08-15 Published: 2025-05-15
  • Corresponding author: SONG Lihong, E-mail: viviansong1981@163.com
  • Funding:
    National Natural Science Foundation of China (62267004, 62467003, 62067005); Education and Teaching Reform Research Project of Jiangxi Provincial Universities (JXJG-22-2-44, JXJG-23-2-6)

Classification consistency for measuring classification reliability of psychological and educational tests

CHEN Jingyi1, SONG Lihong2, WANG Wenyi1

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China
    2School of Education, Jiangxi Normal University, Nanchang 330022, China
  • Received:2024-09-20 Online:2025-08-15 Published:2025-05-15

Abstract (Chinese):

Psychological, educational, and medical tests are widely used to classify examinees, yet internal consistency and reliability coefficients such as alpha cannot directly evaluate classification reliability. How to assess the classification reliability of criterion-referenced tests has therefore become an important concern for researchers and practitioners. From the perspective of classification consistency methods, this study examines approaches for estimating classification consistency from a single test administration, traces the development and core ideas of each class of representative methods, and, using the software packages and programs associated with each method, analyzes real data from personality tests, academic tests, and diagnostic tests. Combining theoretical analysis with data analysis, it summarizes the strengths, weaknesses, and influencing factors of each class of methods, offers recommendations for choosing among them, and discusses issues such as interval estimation of classification consistency, in order to advance the research, application, and reporting of classification consistency for classification tests.

Keywords: classification reliability, classification consistency, decision rules, cognitive diagnosis, machine learning

Abstract:

The reliability coefficients of norm-referenced tests are not appropriate for classification tests or criterion-referenced tests. Classification consistency is a crucial metric in psychological and educational measurement, reflecting the probability that examinees receive the same classification on two independent administrations of a test or on two parallel forms. It is widely used to evaluate the classification reliability of psychological assessments, educational tests, and medical diagnostic tests. Since administering a test twice, or constructing parallel forms, is often impractical because of the added testing time and test construction expense, many methods in psychological and educational measurement focus on estimating classification consistency from a single test administration. These methods provide important psychometric evidence for assessing and improving the reliability and fairness of tests.
This study first investigates the general framework for estimating classification consistency for criterion-referenced tests. The general procedure for estimating classification consistency from a single test administration can be summarized as follows: (a) determine the probability that an examinee is classified into each category according to the classification criteria; (b) assume that two administrations of the test, or two parallel forms, are independent and identically distributed; (c) compute the sum of the squared classification probabilities across all categories, which gives the conditional classification consistency for that examinee; and (d) obtain the marginal classification consistency with a person-based or distribution-based method.
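Steps (a)-(d) above can be sketched in a few lines. This is a minimal illustration of the general framework, not any specific published method; it assumes the per-category classification probabilities have already been obtained from some psychometric model:

```python
import numpy as np

def conditional_consistency(category_probs):
    """Step (c): conditional classification consistency for one examinee,
    i.e. the probability of the same classification on two i.i.d.
    administrations = sum of squared category probabilities."""
    p = np.asarray(category_probs, dtype=float)
    return float(np.sum(p ** 2))

def marginal_consistency(prob_matrix):
    """Step (d), person-based method: average the conditional
    consistency values over examinees (rows = examinees)."""
    P = np.asarray(prob_matrix, dtype=float)
    return float(np.mean(np.sum(P ** 2, axis=1)))
```

For example, an examinee with category probabilities (0.1, 0.7, 0.2) has conditional consistency 0.01 + 0.49 + 0.04 = 0.54.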
Following this general framework, methods have been developed for estimating single-administration classification consistency by considering measurement error, the conditional standard error of measurement, classification probabilities, and simulated retest classification errors under different psychometric models. This article describes the ideas and procedures of the representative methods in detail under classical test theory (CTT), item response theory (IRT), cognitive diagnostic models (CDM), and machine learning models (MLM). The theoretical foundations, computational steps, and applications of representative methods are systematically introduced under each model.
CTT-based methods estimate the classification consistency of observed test scores. For example, the Livingston and Lewis approach uses the test score distribution and test reliability to estimate classification consistency. The Lee method employs a compound multinomial distribution to establish the conditional distribution of total summed scores and uses it to compute the expected probability of each examinee falling into each performance-level category. A limitation of CTT, however, is that its parameters are sample- and test-dependent.
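The score-distribution idea behind Lee-type methods can be illustrated with a recursive convolution that builds the conditional distribution of the total summed score for dichotomous items (Lee's actual method uses a compound multinomial distribution that also handles polytomous items; the function names and cut-score convention here are illustrative assumptions):

```python
import numpy as np

def summed_score_dist(p_items):
    """Conditional distribution of the total summed score for
    dichotomous items with success probabilities p_items,
    built by convolving one item at a time."""
    dist = np.array([1.0])
    for p in p_items:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - p)   # item answered incorrectly
        new[1:] += dist * p          # item answered correctly
        dist = new
    return dist

def category_probs(dist, cuts):
    """Probability mass in each performance level defined by integer
    cut scores; e.g. cuts=[5] gives P(score < 5) and P(score >= 5)."""
    edges = [0] + list(cuts) + [len(dist)]
    return [float(dist[a:b].sum()) for a, b in zip(edges[:-1], edges[1:])]
```

Squaring and summing the resulting category probabilities for an examinee then yields that examinee's conditional classification consistency, exactly as in the general framework.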
IRT-based methods estimate the classification consistency of observed test scores or latent ability by modeling the probability of an item response as a function of latent ability and item parameters. Rudner's approach estimates conditional classification consistency by incorporating the conditional standard error of measurement, which can be computed from the test information function at an individual's ability estimate. Lee's and Guo's methods employ the conditional distribution of total summed scores and the likelihood function, respectively, to compute each examinee's expected classification probabilities. These methods require relatively large sample sizes to calibrate item parameters.
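The normal-approximation idea behind Rudner's approach can be sketched as follows. The CSEM would come from the test information function, csem = 1/sqrt(I(theta_hat)); the interface below is an illustrative assumption, not a package API:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rudner_conditional_consistency(theta_hat, csem, cuts):
    """Rudner-style conditional consistency: classification
    probabilities from a normal approximation N(theta_hat, csem^2)
    over ability cut points, then the sum of squares under the
    i.i.d. retest assumption."""
    edges = [-math.inf] + list(cuts) + [math.inf]
    probs = []
    for a, b in zip(edges[:-1], edges[1:]):
        pa = 0.0 if a == -math.inf else normal_cdf((a - theta_hat) / csem)
        pb = 1.0 if b == math.inf else normal_cdf((b - theta_hat) / csem)
        probs.append(pb - pa)
    return sum(p * p for p in probs)
```

An examinee sitting exactly on a cut score with csem = 1 has classification probabilities (0.5, 0.5) and hence conditional consistency 0.5, the worst case for two categories.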
CDM-based methods are designed to evaluate the classification consistency of attribute patterns, attribute statuses, and the number of skills mastered. They provide a finer-grained way to report the reliability of cognitive diagnostic assessments: attribute-level consistency indices quantify classification reliability for each attribute, whereas pattern-level indices do so for the whole attribute pattern. MLM-based methods provide data-driven insights into classification reliability. They can learn complex relationships among test items from test data, offering dynamic and potentially more accurate estimates of classification consistency than traditional psychometric approaches.
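In the spirit of these attribute- and pattern-level indices, and under the same i.i.d. retest assumption as the general framework, a minimal sketch given posterior probabilities from a fitted CDM might look like this (the function names are illustrative, not any package's API):

```python
def attribute_consistency(posterior_mastery):
    """Attribute-level consistency for one examinee and one attribute:
    probability of the same mastery/non-mastery classification on two
    i.i.d. administrations, given posterior mastery probability p."""
    p = posterior_mastery
    return p * p + (1 - p) * (1 - p)

def pattern_consistency(pattern_posterior):
    """Pattern-level consistency: sum of squared posterior
    probabilities over all candidate attribute patterns."""
    return sum(q * q for q in pattern_posterior)
```

For instance, a posterior mastery probability of 0.9 gives attribute-level consistency 0.81 + 0.01 = 0.82, while a uniform posterior over four patterns gives the pattern-level floor of 0.25.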
Beyond introducing classification consistency methods, this study presents applications of classification consistency indices in educational, psychological, and diagnostic assessments. Four examples illustrate how to apply the indices to evaluate test reliability. A comparative analysis reveals that CTT-based methods offer simplicity and ease of computation but may lack precision for criterion-referenced tests (CRT); IRT-based methods improve estimation precision but require stronger assumptions; CDM-based methods are well suited to formative assessment; and machine learning methods, though promising, are still in the early stages of integration within psychometrics and require further validation before practical implementation.
Future research should investigate approaches for estimating confidence intervals for classification consistency, as current methods primarily provide point estimates. Additionally, more extensive empirical studies of MLM-based classification consistency estimation are needed. Researchers and practitioners are encouraged to incorporate and report classification consistency more routinely to enhance the overall quality and fairness of CRT. By systematically reviewing existing methodologies and their applications, this study highlights the significance of reporting classification consistency for CRT.
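As one possible direction for the interval-estimation problem, a percentile bootstrap over examinees' conditional consistency values would yield a simple interval for the marginal index. This is a speculative sketch under a person-based framework, not an established method from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(conditional_values, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for marginal classification consistency:
    resample examinees with replacement, recompute the mean of their
    conditional consistency values, and take empirical quantiles."""
    x = np.asarray(conditional_values, dtype=float)
    stats = np.array([
        rng.choice(x, size=x.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Resampling examinees (rather than item responses) treats each examinee's conditional consistency as the unit of analysis; bootstrapping at the response level would additionally propagate item-parameter uncertainty.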

Key words: classification reliability, classification consistency, decision rules, cognitive diagnosis, machine learning

CLC number: