ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society
              Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica ›› 2005, Vol. 37 ›› Issue (02): 280-284.


A Study of Score Equating in the College English Test (CET-4 and CET-6)

Zhu Zhengcai

  1. School of Foreign Languages, Shanghai Jiao Tong University, Shanghai 200030, China
  • Received: 2004-04-08  Revised: 1900-01-01  Published: 2005-03-30  Online: 2005-03-30
  • Corresponding author: Zhu Zhengcai

A STUDY OF SCORE EQUATING IN THE COLLEGE ENGLISH TEST: A NEW APPROACH BASED ON “ANCHOR ITEMS” AND TWO-PARAMETER IRT MODEL

Zhu Zhengcai   

  1. School of Foreign Languages, Shanghai Jiao Tong University, Shanghai 200030, China
  • Received: 2004-04-08  Revised: 1900-01-01  Published: 2005-03-30  Online: 2005-03-30
  • Contact: Zhu Zhengcai

Abstract: This paper gives an in-depth analysis of several problems in the existing score-equating model of the College English Test (CET-4 and CET-6) and proposes a new solution based on an anchor-item design and the two-parameter IRT model. Its main elements are: (1) replacing the original Rasch model with the two-parameter logistic model to improve the fit between the items and the model; (2) replacing the original common-examinee equating design with a common-item (anchor-item) design, which removes the difficulty, inherent in the common-examinee design, of controlling the motivation of the examinees used for equating; (3) building a dedicated equating item bank whose anchor items are pretested and calibrated once and for all, which eliminates the error accumulation of the original equating model; moreover, because anchor items are comparatively easy to keep confidential, the dedicated equating item bank is also of great importance for guaranteeing the reliability of the equating results; and (4) reporting a score-equating experiment with real examination data under the new scheme, which yielded a satisfactory equating result.
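
The abstract contrasts the Rasch model with the two-parameter logistic (2PL) model without reproducing the equations. For reference, the standard item characteristic curves of the two models are as follows (the scaling constant D ≈ 1.7 is a common convention, not something stated in the paper):

P_i(\theta) = \frac{1}{1 + \exp[-(\theta - b_i)]}    (Rasch)
P_i(\theta) = \frac{1}{1 + \exp[-D\,a_i(\theta - b_i)]}    (two-parameter logistic)

Here \theta is examinee ability, b_i the difficulty of item i, and a_i its discrimination. The Rasch model is the special case a_i = 1 for all items, which is why evidence of unequal item discrimination argues for the 2PL model.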

Keywords: Item Response Theory, score equating, logistic model

Abstract: In China’s College English Test (CET), the Rasch model has been used in the score equating procedure for 15 years, and a large amount of score equating data has been accumulated. This paper discusses in detail some shortcomings of the score equating method based on the Rasch model and introduces a new score equating approach based on “anchor items” and the two-parameter IRT (Item Response Theory) model. The old score equating method based on the Rasch model rests on three assumptions: 1) the students in the control group give equal attention to both the formal and the control papers; 2) there has been no leakage of the items in either paper; 3) all items have the same discrimination index. A failure of assumption 1) would usually occur because the students feel that the control paper is an extra burden and therefore do not give it the same importance as the formal paper. In this case their marks on the control paper would be lower than their true performance. Even if the two papers were in fact equally difficult, the students would score lower marks on the control paper, thus making it appear harder. This would make the formal paper seem relatively easier, and in the process of equating the students’ marks would be reduced. If assumption 2) does not hold and the control paper has not truly been kept confidential, the effect would be in the opposite direction. The candidates would do better than they should on the control paper, causing their marks on that paper to be relatively high in comparison with the formal test. The formal test would therefore appear to the equating algorithm to be harder than it really is, and all the students’ marks would be increased. Note that this would be true even if only a few items were leaked. For example, if just one reading passage were leaked together with its associated items, those five items would be scored correct for students who might otherwise have failed at least some of them. Since reading items carry double weight, this could falsely increase the score of weaker students by up to 10 marks! Of course, the effect on the mean score would be smaller, since many students would have scored on these items anyway. It might also be argued that, since there is evidence that the items do not all have the same discrimination index, a two- or three-parameter IRT model should be used. It has to be accepted that any equating step will increase the standard error of measurement (SEM) of the final score, because the parameters used for equating are themselves estimated with some standard error. However, this increase will usually be small (given the sample of several hundred examinees used for model fitting) and should be more than compensated for by the reduction in the “between-forms” bias that the equating procedure is designed to correct. In this paper, a pilot study with real CET test data is reported, with satisfactory score equating results.
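
The English abstract argues for a common-item (“anchor item”) design in which the anchor items are pretested and calibrated once in a dedicated equating item bank. The paper does not spell out the linking computation, so the sketch below shows one standard way such a design is often implemented under the two-parameter logistic model: mean-sigma linking of a new form onto the bank scale. All function names, variable names, and numbers are illustrative assumptions, not taken from the paper.

# Illustrative sketch (not the paper's actual procedure): mean-sigma linking for a
# common-item (anchor-item) equating design under the two-parameter logistic model.
# The anchor items are assumed to have been calibrated once in the equating item bank;
# the same items are re-estimated on the new form, and the linking constants A and B
# place the new form's item parameters and ability estimates on the bank scale.
import numpy as np

def mean_sigma_link(b_anchor_new, b_anchor_bank):
    """Return linking constants (A, B) from the anchor items' difficulty estimates."""
    b_new = np.asarray(b_anchor_new, dtype=float)
    b_bank = np.asarray(b_anchor_bank, dtype=float)
    A = b_bank.std(ddof=1) / b_new.std(ddof=1)   # slope: ratio of difficulty spreads
    B = b_bank.mean() - A * b_new.mean()         # intercept: aligns the difficulty means
    return A, B

def to_bank_scale(theta_new, a_new, b_new, A, B):
    """Transform new-form abilities and 2PL item parameters onto the bank scale."""
    theta_bank = A * np.asarray(theta_new, dtype=float) + B
    a_bank = np.asarray(a_new, dtype=float) / A
    b_bank = A * np.asarray(b_new, dtype=float) + B
    return theta_bank, a_bank, b_bank

# Hypothetical anchor-item difficulties and examinee abilities, for illustration only.
A, B = mean_sigma_link(b_anchor_new=[-0.8, -0.2, 0.1, 0.7, 1.3],
                       b_anchor_bank=[-0.5, 0.0, 0.3, 1.0, 1.6])
theta_bank, a_bank, b_bank = to_bank_scale(theta_new=[-1.0, 0.0, 1.2],
                                           a_new=[0.9, 1.1, 1.4],
                                           b_new=[-0.3, 0.4, 1.0],
                                           A=A, B=B)

Because the 2PL ability scale is identified only up to a linear transformation, re-expressing a new form’s parameters as a_i / A and A·b_i + B leaves every item characteristic curve unchanged while placing all examinees’ scores on the common bank metric, which is what the equating step requires.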

Key words: College English Test, Item Response Theory, score equating
