ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2022, Vol. 30 ›› Issue (6): 1410-1428. doi: 10.3724/SP.J.1042.2022.01410

• Research Method •

IRT-based scoring methods for multidimensional forced choice tests

LIU Juan1, ZHENG Chanjin2,3, LI Yunchuan1, LIAN Xu1

    1Beijing Insight Online Management Consulting Co., Ltd., Beijing 100102, China
    2Department of Educational Psychology, East China Normal University, Shanghai 200062, China
    3Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
  • Received: 2021-07-06  Online: 2022-06-15  Published: 2022-04-26
  • Contact: ZHENG Chanjin  E-mail: chjzheng@dep.ecnu.edu.cn

Abstract:

The forced-choice (FC) test is widely used in non-cognitive assessment because it resists faking and the response biases associated with the traditional Likert format. Traditional scoring of forced-choice tests produces ipsative data, which has long been criticized as unsuitable for inter-individual comparison. In recent years, the development of multiple forced-choice IRT models that recover normative information from forced-choice responses has re-ignited the interest of researchers and practitioners in forced-choice IRT models. The six prevailing forced-choice IRT models in the existing literature can be classified by their decision model and their item response model. In terms of the decision model, the TIRT, RIM, and BRB-IRT models are built on Thurstone's Law of Comparative Judgment, whereas the MUPP framework and its derivatives adopt the Luce Choice Axiom. In terms of the item response model, the MUPP-GGUM and GGUM-RANK models apply to items with an unfolding response process, while the other forced-choice models apply to items with a dominance response process. The models can also be distinguished by their estimation algorithm and estimation procedure. MUPP-GGUM uses a two-step strategy in which item parameters are calibrated in advance with Likert-format data, which facilitates subsequent item bank management; the other models estimate item and person parameters jointly. Among the joint-estimation approaches, TIRT relies on the traditional algorithms of weighted least squares (WLS) and diagonally weighted least squares (DWLS), which are readily available in Mplus and relatively fast, but which suffer from poor convergence and heavy memory demands in high-dimensional settings. The other models use the Markov chain Monte Carlo (MCMC) algorithm, which avoids these convergence and memory problems but is considerably slower than the traditional algorithms.
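As a rough illustration of the two decision models named above (notation assumed here rather than taken verbatim from the reviewed papers): under Thurstone's Law of Comparative Judgment, as in TIRT, the probability of preferring statement i to statement k in a block is a normal-ogive function of the difference between two latent utilities; under the Luce Choice Axiom, as in the MUPP framework, it is built from the statements' separate endorsement probabilities.

    % Thurstonian pairwise comparison (TIRT-style): statements i and k load on
    % traits eta_a and eta_b with loadings lambda; gamma_{ik} is a threshold and
    % psi^2 are the uniquenesses of the latent utilities.
    P(y_{ik} = 1 \mid \boldsymbol{\eta})
      = \Phi\!\left( \frac{-\gamma_{ik} + \lambda_i \eta_a - \lambda_k \eta_b}
                          {\sqrt{\psi_i^2 + \psi_k^2}} \right)

    % MUPP-style preference (Luce Choice Axiom): P_s(1) is the probability of
    % endorsing statement s on its own (a GGUM probability in MUPP-GGUM),
    % and P_s(0) = 1 - P_s(1).
    P(s \succ t \mid \theta_{d_s}, \theta_{d_t})
      = \frac{P_s(1)\, P_t(0)}{P_s(1)\, P_t(0) + P_s(0)\, P_t(1)}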
Research on applications of forced-choice IRT models is summarized in three areas: parameter invariance testing, computerized adaptive testing (CAT), and validity studies. Parameter invariance testing can be divided into cross-block invariance and cross-population invariance (also known as differential item functioning, DIF); most current research focuses on the latter, and DIF detection methods already exist for TIRT and RIM. Future research should enrich or upgrade these existing DIF methods and develop DIF detection methods for other forced-choice models so that DIF from multiple sources can be detected more sensitively. Non-cognitive tests are usually high-dimensional, and the test length problems caused by high dimensionality can be naturally addressed by CAT; studies have already explored item selection strategies for the MUPP-GGUM, GGUM-RANK, and RIM models. Future research can continue to explore item selection strategies for other forced-choice IRT models so that forced-choice CAT achieves a balance between measurement precision and test length in high-dimensional contexts. Validity studies examine whether scores obtained from forced-choice IRT models reflect examinees' true characteristics, because unvalidated tests carry serious risks in score interpretation. Some studies have compared IRT scores, traditional (ipsative) scores, and Likert-format scores to examine whether IRT scores yield results similar to Likert scores and whether they recover latent traits better than traditional scores. However, using Likert-format scores as the criterion may itself introduce response bias as a source of error, and future research should seek purer, more convincing criteria.
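To make the CAT point concrete, one widely used block selection rule in multidimensional CAT, offered here only as a generic illustration and not as the specific strategies proposed for MUPP-GGUM, GGUM-RANK, or RIM, is the D-optimality rule: administer next the remaining block whose Fisher information most increases the determinant of the information already accumulated at the current trait estimate.

    % D-optimality (determinant) rule: R is the pool of remaining blocks,
    % I(theta-hat) is the information accumulated from administered blocks,
    % and I_b(theta-hat) is the information of candidate block b.
    b^{*} = \arg\max_{b \in R}
            \det\!\left[ \mathbf{I}\big(\hat{\boldsymbol{\theta}}\big)
                       + \mathbf{I}_b\big(\hat{\boldsymbol{\theta}}\big) \right]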

Key words: forced choice test, ipsative data, TIRT, MUPP, GGUM-RANK

CLC Number: