ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

心理科学进展 (Advances in Psychological Science) ›› 2022, Vol. 30 ›› Issue (6): 1410-1428. doi: 10.3724/SP.J.1042.2022.01410

• Research Methods •


IRT-based scoring methods for multidimensional forced choice tests

LIU Juan1, ZHENG Chanjin2,3(), LI Yunchuan1, LIAN Xu1   

  1. Beijing Insight Online Management Consulting Co., Ltd., Beijing 100102, China
    2. Department of Educational Psychology, East China Normal University, Shanghai 200062, China
    3. Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
  • Received: 2021-07-06  Online: 2022-06-15  Published: 2022-04-26
  • Contact: ZHENG Chanjin  E-mail: chjzheng@dep.ecnu.edu.cn


Abstract:

Forced-choice (FC) tests are widely used in non-cognitive assessment because they resist faking and the response biases associated with the traditional Likert format. Traditional scoring of forced-choice tests, however, produces ipsative data, which have long been criticized as unsuitable for inter-individual comparisons. In recent years, the development of several forced-choice IRT models that allow researchers to obtain normative information from forced-choice tests has re-ignited the interest of researchers and practitioners in these models.

The six prevailing forced-choice IRT models can be classified by the decision model and the item response model they adopt. In terms of the decision model, the TIRT, RIM, and BRB-IRT models build on Thurstone's Law of Comparative Judgment, whereas the MUPP framework and its derivatives adopt the Luce Choice Axiom. In terms of the item response model, MUPP-GGUM and GGUM-RANK apply to items with an unfolding (ideal-point) response process, while the other forced-choice models apply to items with a dominance response process. The models can also be distinguished by how they estimate parameters. MUPP-GGUM uses a two-step strategy in which item parameters are pre-calibrated from Likert-type data, which facilitates subsequent item bank management; the other models estimate item and person parameters jointly. Among the joint approaches, TIRT relies on the traditional weighted least squares (WLS) and diagonally weighted least squares (DWLS) algorithms, which are conveniently available in Mplus and relatively fast, but which suffer from poor convergence and high memory usage in high-dimensional settings.
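The ipsative-data problem mentioned above can be made concrete with a small sketch (our illustration, not code from the review): under traditional rank-based scoring of forced-choice blocks, every respondent's scores sum to the same constant, so total scores carry no between-person information.

```python
# Toy illustration of traditional forced-choice scoring producing ipsative data.
# Each block asks the respondent to rank k statements; statement labels below
# name the dimension each (hypothetical) statement measures.

def score_block(ranking):
    """Traditional scoring: in a block of k statements ranked most-to-least
    like me, the statements receive k-1, k-2, ..., 0 points."""
    k = len(ranking)
    scores = {}
    for points, dim in zip(range(k - 1, -1, -1), ranking):
        scores[dim] = scores.get(dim, 0) + points
    return scores

def score_test(blocks):
    """Sum block scores per dimension for one respondent."""
    totals = {}
    for ranking in blocks:
        for dim, pts in score_block(ranking).items():
            totals[dim] = totals.get(dim, 0) + pts
    return totals

# Two respondents rank the same two triplet blocks differently
alice = score_test([["E", "A", "C"], ["A", "C", "E"]])
bob   = score_test([["C", "E", "A"], ["C", "A", "E"]])

# Dimension profiles differ, but each person's total is fixed by design:
# 2 blocks x (2 + 1 + 0) points = 6, regardless of who responds.
assert sum(alice.values()) == sum(bob.values()) == 6
```

Because every respondent's scores sum to the same constant, a high score on one dimension forces low scores elsewhere, which is exactly why such data cannot support normative, inter-individual comparisons.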
The other models use Markov chain Monte Carlo (MCMC) algorithms, which largely resolve the convergence and memory problems of the traditional algorithms, at the cost of considerably longer estimation time.
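As a toy illustration of the MCMC route (a generic sketch under simplified assumptions — one respondent, binary pairwise blocks, a logistic choice rule, and item parameters treated as known — not the estimation code of any model above), a random-walk Metropolis sampler can recover a single latent trait from forced-choice responses:

```python
import math
import random

random.seed(1)

# Hypothetical item pool: (a1, b1, a2, b2) per pair, utilities u_i = a_i*theta + b_i
PAIRS = [(2.0, 0.0, 0.5, 0.0), (1.5, 0.5, 0.4, -0.2)] * 20
TRUE_THETA = 1.0

def p_first(theta, a1, b1, a2, b2):
    """P(choose statement 1) via a logistic rule on the utility difference."""
    z = (a1 * theta + b1) - (a2 * theta + b2)
    return 1.0 / (1.0 + math.exp(-z))

# Simulate one respondent's choices at the true trait value
data = [1 if random.random() < p_first(TRUE_THETA, *pars) else 0 for pars in PAIRS]

def log_post(theta):
    """Log posterior: N(0, 1) prior plus Bernoulli likelihood over all pairs."""
    lp = -0.5 * theta * theta
    for y, pars in zip(data, PAIRS):
        p = p_first(theta, *pars)
        lp += math.log(p if y else 1.0 - p)
    return lp

# Random-walk Metropolis sampling of theta
theta, lp = 0.0, log_post(0.0)
samples = []
for _ in range(4000):
    prop = theta + random.gauss(0.0, 0.5)
    lp_prop = log_post(prop)
    if math.log(random.random()) < lp_prop - lp:  # accept/reject step
        theta, lp = prop, lp_prop
    samples.append(theta)

# Posterior mean after discarding burn-in; should land near TRUE_THETA
post_mean = sum(samples[1000:]) / len(samples[1000:])
```

The same accept/reject machinery scales to jointly sampling many item and person parameters, which is why MCMC avoids the high-dimensional convergence failures of WLS/DWLS, while also explaining its much longer running time.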
Applied research on forced-choice IRT models falls into three areas: parameter invariance testing, computerized adaptive testing (CAT), and validity studies. Parameter invariance testing can be divided into cross-block invariance and cross-population invariance (i.e., differential item functioning, DIF); most current work addresses the latter, and DIF testing methods already exist for TIRT and RIM. Future research should enrich and upgrade these methods, and develop DIF tests for the other forced-choice models, so that DIF from multiple sources can be detected more sensitively. Non-cognitive tests are usually high-dimensional, and the test-length problems caused by high dimensionality are naturally addressed by CAT. Suitable item selection strategies have already been explored for the MUPP-GGUM, GGUM-RANK, and RIM models; future research can extend such strategies to other forced-choice IRT models so that forced-choice CAT balances measurement precision against test length in high-dimensional contexts. Validity studies examine whether scores from forced-choice IRT models reflect examinees' true characteristics, because unvalidated tests invite serious misinterpretation of results. Several studies have compared IRT scores, traditional scores, and Likert-type scores, asking whether IRT scores yield results similar to Likert scores and recover latent traits better than traditional scores. However, using Likert-scale scores as the criterion may introduce response bias as a source of error, so future research should seek purer, more convincing criteria. Overall, future work can deepen in four directions: model extension, parameter invariance testing, forced-choice CAT, and validity research.
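The CAT idea above can be sketched generically (a hypothetical item pool and a simple logistic information function, not a specific selection strategy from the studies reviewed): at each step, administer the unused pair that maximizes Fisher information at the current trait estimate.

```python
import math

def pair_info(theta, a1, b1, a2, b2):
    """Under a logistic choice rule on the utility difference, a pair behaves
    like a 2PL item with slope d = a1 - a2; information is d^2 * p * (1 - p)."""
    d = a1 - a2
    z = d * theta + (b1 - b2)
    p = 1.0 / (1.0 + math.exp(-z))
    return d * d * p * (1.0 - p)

def select_next(theta_hat, pool, used):
    """Maximum-information selection over the pairs not yet administered."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: pair_info(theta_hat, *pool[i]))

# Hypothetical pool of (a1, b1, a2, b2) pairs
pool = [(1.8, 0.0, 0.4, 0.0),   # moderate slope contrast
        (1.0, 0.2, 0.9, 0.1),   # similar slopes -> nearly uninformative
        (2.2, -0.5, 0.3, 0.5)]  # large slope contrast -> highly informative

first = select_next(0.0, pool, used=set())   # picks the most informative pair
```

A full forced-choice CAT additionally updates the (multidimensional) trait estimate after each block and handles constraints such as dimension balancing, which is where the model-specific strategies in the literature differ.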

Key words: forced-choice test, ipsative data, TIRT, MUPP, GGUM-RANK

CLC Number: