Test Equating with Testlets

Abstract

Research on test equating is important for examination fairness, item banking, teaching-quality assessment, and computerized adaptive testing. As examination research has developed, testlets have appeared increasingly often in examinations, for example in reading comprehension, mathematics, and map-based items. How to equate tests composed of testlets is therefore a practical problem. When item response theory (IRT) models are applied to test equating, a strong statistical assumption, local independence (LI), must be met. However, previous studies have shown that local independence is likely to be violated when a test contains testlets. Hence, when tests composed of testlets are equated with a standard IRT model, ignoring this local dependence can distort the estimated equating coefficients.
To address this problem, we use a testlet-based model, the two-parameter testlet model (2PTM), which extends the IRT two-parameter logistic model (2PLM) by adding a random-effect parameter for each testlet and therefore takes local dependence into account. This paper presents IRT characteristic curve equating methods and the specific procedures for calculating the equating coefficients. A series of Monte Carlo simulation experiments was conducted, evaluating how well the equating coefficients were recovered and testing differences with the Wilcoxon signed-rank test. The effectiveness of equating tests containing testlets was investigated under several conditions: the accuracy of the estimation of item parameters (AEIP), the number of examinees, and the degree of local dependence. The results of equating with 2PTM were compared with those of a standard IRT model, the 2PLM, which does not account for local dependence among items from a common testlet.
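The model described above can be illustrated with a minimal Python sketch. It assumes a common testlet parameterization in which the logit is `a * (theta - b - gamma)`, with one `gamma` drawn from N(0, sigma2) per examinee and testlet; the paper's exact form, sign convention, and scaling constant D may differ. Setting `sigma2 = 0` removes the testlet effect and gives the 2PLM, while a larger `sigma2` means stronger local dependence. The function names `p_2ptm`, `rescale_2ptm`, and `simulate_testlet_responses` are hypothetical.

```python
import math
import random


def logistic(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_2ptm(theta, a, b, gamma, D=1.0):
    """P(correct) under an assumed 2PL testlet model.

    logit = D * a * (theta - b - gamma); gamma is the examinee's effect for the
    testlet containing the item. gamma = 0 reduces to the 2PL model.
    """
    return logistic(D * a * (theta - b - gamma))


def rescale_2ptm(a, b, gamma, A, B):
    """Express parameters on a new scale theta* = A * theta + B.

    a/A, A*b + B and A*gamma leave every response probability unchanged,
    so the variance of gamma scales by A**2.
    """
    return a / A, A * b + B, A * gamma


def simulate_testlet_responses(thetas, items, sigma2, seed=0):
    """Simulate 0/1 responses; items are (a, b, testlet_id) tuples.

    Each examinee draws one gamma ~ N(0, sigma2) per testlet, shared by all
    items in that testlet, which induces local dependence within testlets.
    """
    rng = random.Random(seed)
    testlet_ids = sorted({d for _, _, d in items})
    sd = math.sqrt(sigma2)
    responses = []
    for theta in thetas:
        gamma = {d: rng.gauss(0.0, sd) for d in testlet_ids}
        responses.append([1 if rng.random() < p_2ptm(theta, a, b, gamma[d]) else 0
                          for a, b, d in items])
    return responses
```

The rescaling helper shows why a linear equating transformation must also rescale the testlet effects under this parameterization. A 2PLM-based equating has no such term, which is one way ignored local dependence can leak into the coefficients.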
The results suggest that 2PTM recovers the equating coefficients better than 2PLM and that the differences are significant in most conditions, so 2PTM is suitable for equating tests based on testlets. In addition, six different equating criteria for 2PTM were compared with one another. Generally speaking, SLcrit performs best when the coefficient A is between 0.5 and 0.9, SQRcrit is appropriate for 0.9 < A < 1.5, and Hcrit is appropriate for 1.5 < A < 2.0. The higher the AEIP, the better SQRcrit and SLcrit perform. Hcrit and SQRcrit are appropriate when the testlet effect is large. LCcrit, Wcrit, and SREcrit rarely outperform the others.
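The following Python sketch assumes that SLcrit and Hcrit refer to the Stocking-Lord and Haebara characteristic-curve criteria. The abstract does not define them, and the other four criteria are not sketched here. The code shows how equating coefficients A and B could be found by minimizing either criterion over a set of common items. For simplicity it uses plain 2PL curves with testlet effects set to zero, D = 1, a fixed theta grid, and a damped Gauss-Newton search. All of these choices and every function name are illustrative assumptions, not the paper's procedure.

```python
import math


def logistic(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def icc(theta, a, b):
    """2PL item characteristic curve (assumed D = 1, testlet effects set to 0)."""
    return logistic(a * (theta - b))


def to_base_scale(items, A, B):
    """Map (a, b) pairs from the new form's scale onto the base scale."""
    return [(a / A, A * b + B) for a, b in items]


def sl_residuals(new_items, base_items, A, B, grid):
    """Stocking-Lord: gap between the two test characteristic curves at each theta."""
    moved = to_base_scale(new_items, A, B)
    return [sum(icc(t, a, b) for a, b in base_items)
            - sum(icc(t, a, b) for a, b in moved) for t in grid]


def haebara_residuals(new_items, base_items, A, B, grid):
    """Haebara: gap between each common item's characteristic curves at each theta."""
    moved = to_base_scale(new_items, A, B)
    return [icc(t, ab, bb) - icc(t, am, bm)
            for t in grid
            for (ab, bb), (am, bm) in zip(base_items, moved)]


def equate(residuals, new_items, base_items, grid, A=1.0, B=0.0,
           max_iter=100, h=1e-6):
    """Find (A, B) minimizing the sum of squared residuals by damped Gauss-Newton."""
    def loss(a_, b_):
        return sum(r * r for r in residuals(new_items, base_items, a_, b_, grid))

    for _ in range(max_iter):
        r = residuals(new_items, base_items, A, B, grid)
        # Forward-difference Jacobian columns with respect to A and B.
        jA = [(u - v) / h for u, v in zip(residuals(new_items, base_items, A + h, B, grid), r)]
        jB = [(u - v) / h for u, v in zip(residuals(new_items, base_items, A, B + h, grid), r)]
        s11 = sum(x * x for x in jA)
        s12 = sum(x * y for x, y in zip(jA, jB))
        s22 = sum(y * y for y in jB)
        g1 = sum(x * e for x, e in zip(jA, r))
        g2 = sum(y * e for y, e in zip(jB, r))
        det = s11 * s22 - s12 * s12
        if det <= 0.0:
            break
        # Solve the 2x2 normal equations (J^T J) d = -J^T r.
        dA = (-g1 * s22 + g2 * s12) / det
        dB = (g1 * s12 - g2 * s11) / det
        current, step = sum(e * e for e in r), 1.0
        while step > 1e-10:
            nA, nB = A + step * dA, B + step * dB
            if nA > 0 and loss(nA, nB) < current:
                A, B = nA, nB
                break
            step /= 2.0
        else:
            break  # no step size reduced the loss: treat as converged
    return A, B
```

Both criteria are sums of squared residuals, so one least-squares routine serves for either. Stocking-Lord compares only the summed curves, while Haebara compares the common items one by one.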

Key words: test equating, testlets, item response theory, Monte Carlo simulation, equating criterion

WU Rui, DING Shu-Liang, GAN Deng-Wen. (2010). Test Equating with Testlets. Acta Psychologica Sinica, 42(03), 434-442.
