ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2025, Vol. 33 ›› Issue (8): 1340-1357. doi: 10.3724/SP.J.1042.2025.1340

• Research Method •

Classification consistency for measuring classification reliability of psychological and educational tests

CHEN Jingyi1, SONG Lihong2, WANG Wenyi1   

  1. School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China;
    2. School of Education, Jiangxi Normal University, Nanchang 330022, China
  • Received: 2024-09-20    Online: 2025-08-15    Published: 2025-05-15

Abstract: Reliability indices developed for norm-referenced tests are not appropriate for classification tests or criterion-referenced tests. Classification consistency is a crucial metric in psychological and educational measurement, reflecting the probability that examinees receive the same classification category on two independent administrations of a test or on two parallel forms. It is widely used to evaluate the classification reliability of psychological assessments, educational tests, and medical diagnostic tests. Because administering a test twice or constructing parallel forms is often impractical due to increased testing time and test-construction expense, many methods in psychological and educational measurement focus on estimating classification consistency from the results of a single test administration. These methods are designed to provide important psychometric evidence for assessing and improving the reliability and fairness of tests.
This study first examines the general framework for estimating classification consistency for criterion-referenced tests. The general procedure for estimating classification consistency from a single test administration can be summarized as follows: (a) determining the probabilities of an examinee being classified into each category according to the classification criteria, (b) assuming that two administrations of a test, or two parallel forms, are independent and identically distributed, (c) computing the sum of the squared classification probabilities across all categories, which gives the conditional classification consistency for an examinee, and (d) obtaining marginal classification consistency by a person-based method (averaging conditional values over examinees) or a distribution-based method (integrating over an assumed ability distribution), as illustrated below.
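To make steps (a) through (d) concrete, the following is a minimal sketch, not drawn from any of the reviewed papers; the probability matrix `probs` and the use of simple averaging for the person-based method are assumptions for illustration only.

```python
import numpy as np

# Hypothetical classification probabilities: rows are examinees,
# columns are performance categories; each row sums to 1 (step a).
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
])

# Steps (b)-(c): under the independence assumption, the conditional
# classification consistency for an examinee is the sum of the squared
# category probabilities.
conditional_cc = np.sum(probs ** 2, axis=1)

# Step (d): marginal classification consistency via the person-based
# method, i.e., averaging the conditional values over examinees.
marginal_cc = conditional_cc.mean()

print(conditional_cc)  # per-examinee conditional consistency
print(marginal_cc)     # marginal (person-based) consistency
```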
Following this general framework, methods have been developed to estimate single-administration classification consistency by considering measurement error, the conditional standard error of measurement, classification probabilities, and simulated retest classification errors under different psychometric models. This article describes the ideas and procedures of representative methods in detail under classical test theory (CTT), item response theory (IRT), cognitive diagnostic models (CDM), and machine learning models (MLM). The theoretical foundations, computational steps, and applications of the representative methods are systematically introduced under each model.
CTT-based methods provide classification consistency for observed test scores. For example, the Livingston and Lewis approach uses the test score distribution and test reliability to estimate classification consistency. The Lee method employs a compound multinomial distribution to establish the conditional distribution of total summed scores and uses it to compute the expected probability of each examinee falling into each performance-level category. A limitation of CTT, however, is that its parameters are sample and test dependent.
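As a rough illustration of this family of summed-score methods (not the exact Livingston-Lewis or Lee procedure), the sketch below builds a conditional summed-score distribution from assumed per-item correct-response probabilities using a standard Lord-Wingersky style recursion, converts it into performance-level probabilities at a hypothetical cut score, and sums the squared category probabilities.

```python
import numpy as np

def summed_score_distribution(p_items):
    """Conditional distribution of the total summed score for one examinee,
    given per-item probabilities of a correct response (compound binomial),
    built with a Lord-Wingersky style recursion."""
    dist = np.array([1.0])
    for p in p_items:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1 - p)   # item answered incorrectly
        new[1:] += dist * p          # item answered correctly
        dist = new
    return dist

def category_probabilities(dist, cuts):
    """Probability of falling into each performance level defined by
    cut scores on the summed-score scale."""
    scores = np.arange(len(dist))
    bounds = [0] + list(cuts) + [len(dist)]
    return np.array([dist[(scores >= lo) & (scores < hi)].sum()
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

# Hypothetical 5-item test with one cut score at a summed score of 3.
p_items = [0.8, 0.6, 0.7, 0.5, 0.9]
dist = summed_score_distribution(p_items)
p_cat = category_probabilities(dist, cuts=[3])
conditional_cc = np.sum(p_cat ** 2)   # conditional classification consistency
```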
IRT-based methods estimate the classification consistency of observed test scores or latent ability by modeling the probability of an item response as a function of latent ability and item parameters. Rudner's approach estimates conditional classification consistency by incorporating the conditional standard error of measurement, which can be computed from the test information function at an examinee's ability estimate. Lee's and Guo's methods use the conditional distribution of total summed scores or likelihood functions, respectively, to compute the expected classification probabilities for each examinee. These methods require relatively large sample sizes to calibrate item parameters.
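Below is a minimal sketch of the idea behind the Rudner-style approach, under the assumption that the ability estimate is normally distributed around the ability value with standard deviation equal to the conditional standard error of measurement; the ability value, test information, and cut scores are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def rudner_style_conditional_consistency(theta, sem, cuts):
    """Conditional classification consistency under a normal approximation:
    the ability estimate is treated as N(theta, sem), category probabilities
    are obtained from the normal CDF at the cut scores, and consistency is
    the sum of squared category probabilities."""
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    p_cat = np.diff(norm.cdf(bounds, loc=theta, scale=sem))
    return np.sum(p_cat ** 2)

# Hypothetical examinee: ability 0.4, test information 16 at that point
# (so conditional SEM = 1/sqrt(16) = 0.25), cut scores at 0.0 and 1.0.
info = 16.0
cc = rudner_style_conditional_consistency(theta=0.4,
                                          sem=1 / np.sqrt(info),
                                          cuts=[0.0, 1.0])
```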
CDM-based methods are designed to evaluate the classification consistency of attribute patterns, attribute statuses, and the number of skills mastered. These methods provide a finer-grained way to report the reliability of cognitive diagnostic assessments. For example, attribute-level and pattern-level consistency indices quantify classification reliability at the fine-grained and holistic levels, respectively. MLM-based methods provide data-driven insights into classification reliability. These methods can learn complex relationships among test items from test data, offering dynamic and potentially more accurate estimates of classification consistency compared with traditional psychometric approaches.
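The following sketch illustrates the distinction between pattern-level and attribute-level consistency under the independence assumption; the posterior probabilities over attribute patterns are hypothetical and would in practice come from a fitted CDM.

```python
import numpy as np

# Hypothetical posterior probabilities over the four attribute patterns
# of a two-attribute assessment (00, 10, 01, 11) for one examinee.
posterior = np.array([0.05, 0.10, 0.15, 0.70])
patterns = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# Pattern-level consistency: probability of receiving the same attribute
# pattern on two independent administrations.
pattern_cc = np.sum(posterior ** 2)

# Attribute-level consistency: marginal mastery probability per attribute,
# then the probability of the same mastery/non-mastery decision twice.
p_mastery = patterns.T @ posterior                    # P(mastery) per attribute
attribute_cc = p_mastery ** 2 + (1 - p_mastery) ** 2  # per-attribute consistency
```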
Beyond introducing methods for estimating classification consistency, this study presents applications of classification consistency indices, illustrating their use in educational, psychological, and diagnostic assessments. Four examples illustrate how to apply classification consistency indices to evaluate test reliability. A comparative analysis of these methods shows that CTT-based methods offer simplicity and ease of computation but may lack precision for criterion-referenced tests (CRT). IRT-based methods improve estimation precision but rest on stronger assumptions. CDM-based methods are well suited to formative assessment. Machine learning methods, though promising, are still at an early stage of integration within psychometrics and require further validation before practical implementation.
Future research should investigate approaches to estimating confidence intervals for classification consistency, as current methods primarily provide point estimates. Additionally, more extensive empirical studies of MLM-based classification consistency estimation are needed. Researchers and practitioners are encouraged to compute and report classification consistency more routinely to enhance the overall quality and fairness of CRT. By systematically reviewing existing methodologies and their applications, this study highlights the importance of reporting classification consistency for CRT.

Key words: classification reliability, classification consistency, decision rules, cognitive diagnosis, machine learning
