Advances in Psychological Science (心理科学进展), 2023, Vol. 31, Issue (10): 1966-1980. doi: 10.3724/SP.J.1042.2023.01966
• Research Methods •
Corresponding author: CHEN Ping, E-mail:
Supported by:
CHEN Ping, DAI Yi, HUANG Yingshi
Received: 2023-01-10
Online: 2023-10-15
Published: 2023-07-25
Abstract: The test mode effect (TME) refers to differences in test functioning that arise when the same test is administered in different modes. Because TME can affect test fairness, selection standards, and test equating, detecting it accurately and interpreting it soundly are of considerable importance. This article systematically reviews the sources of TME, its detection (covering both experimental designs and detection methods), and the accumulated research findings, thereby laying out the methodology of TME research. Important future directions for the field include further explicating TME models, extending the range of test modes examined in TME research, and applying TME research findings to large-scale educational assessment programs in China.
CLC Number:
CHEN Ping, DAI Yi, HUANG Yingshi. (2023). Test mode effect: Sources, detection, and applications. Advances in Psychological Science, 31(10), 1966-1980.
Table 1 Sources of TME and explanations of how they produce TME

| Source of TME | Explanation |
| --- | --- |
| Test level | |
| Response device | PBT uses paper and pencil; CBT uses a screen, mouse, and keyboard |
| Whether answers may be reviewed and revised | PBT allows answers to be reviewed and revised; CBT often does not |
| Whether the test is proctored | PBT is usually proctored; CBT may be unproctored |
| Test timing and item-selection method | Timing and item selection are more flexible in CBT and more fixed in PBT |
| Item level | |
| Item presentation format | The varied presentation formats of CBT make it difficult to present items exactly as in PBT |
| Item type | The complexity of an item type's interaction demands affects performance on CBT |
| Examinee level | |
| Demographic variables | Variables such as age and gender produce TME indirectly by influencing other variables |
| Computer proficiency | Computer proficiency may affect scores on CBT |
| Test-taking motivation | Differences in test-taking motivation between PBT and CBT lead to score differences |
| Rater level | |
| Rater effects | Constructed-response items are susceptible to rater effects |
Table 2 The BIB design used in TME research

| Group | PBT, Booklet A | PBT, Booklet B | CBT, Booklet A | CBT, Booklet B |
| --- | --- | --- | --- | --- |
| Group 1 | Test 1 | | | Test 2 |
| Group 2 | | Test 1 | Test 2 | |
| Group 3 | | Test 2 | Test 1 | |
| Group 4 | Test 2 | | | Test 1 |
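To make the counterbalancing in Table 2 concrete, the following minimal R sketch randomly assigns examinees to the four BIB groups and records which booklet-mode combination each group takes first (Test 1) and second (Test 2). The sample size nperson, the object names, and the group-to-condition mapping (which follows Table 2 as reconstructed above) are illustrative assumptions rather than part of the original study.

```r
# Minimal sketch (illustrative only): assign examinees to the four
# counterbalanced BIB groups of Table 2.
set.seed(123)                                       # for reproducibility
nperson <- 400                                      # hypothetical sample size
group <- sample(rep(1:4, length.out = nperson))     # balanced random assignment

# Lookup table: booklet-mode combination administered first and second per group
design <- data.frame(
  group = 1:4,
  test1 = c("PBT, Booklet A", "PBT, Booklet B", "CBT, Booklet A", "CBT, Booklet B"),
  test2 = c("CBT, Booklet B", "CBT, Booklet A", "PBT, Booklet B", "PBT, Booklet A")
)

# Attach the administration plan to each examinee
plan <- merge(data.frame(id = 1:nperson, group = group), design, by = "group")
head(plan)
```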
Table 3 Summary of four TME detection methods

| Method | Advantages | Disadvantages | Scope of application | Implementation |
| --- | --- | --- | --- | --- |
| ANOVA | Quick and convenient; broadly applicable | Relatively low statistical power | Preliminary detection of TME | SPSS or the TAM package |
| MCFA | Can examine relations between latent and observed variables and among latent variables | Item-level TME detection is relatively cumbersome | Tests in personality and social psychology | lavaan package |
| DIF | High power; includes a variety of methods that can be chosen flexibly | Each DIF method has its own limitations | Achievement tests in educational measurement | mirt package |
| MEM | High power; provides some insight into the sources of TME | Relatively complex model; may run into problems such as model identification | | mdltm software |
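Table 3 points to the mdltm software for MEM, for which no R example is given in the appendix below. As a rough alternative sketch, and not the authors' MEM specification, a multiple-group IRT analysis in the mirt package can play a similar role: constrain item parameters to be equal across modes while freeing the latent means and variances, then test whether releasing individual items' parameters improves fit. The data frame response_b is the one created in Appendix Table 1 below; the dichotomous 2PL setup and object names here are illustrative assumptions.

```r
# Illustrative sketch only: a multiple-group 2PL model as a stand-in for a
# mode-effect analysis (the review itself points to mdltm for MEM).
library(mirt)

mode_label <- ifelse(response_b$mode == 1, "PBT", "CBT")   # grouping variable

# Baseline model: item slopes and intercepts equal across modes,
# latent means and variances free to differ
fit_equal <- multipleGroup(response_b[, -1], model = 1, group = mode_label,
                           invariance = c("slopes", "intercepts",
                                          "free_means", "free_var"))

# Release the slope (a1) and intercept (d) of each item in turn and test the
# change in model fit; flagged items show item-level mode effects
mode_dif <- DIF(fit_equal, which.par = c("a1", "d"), scheme = "drop")
mode_dif
```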
Appendix Table 1 R code examples for the ANOVA, MCFA, and DIF methods

ANOVA

```r
# Purpose: compare the mean score of each item between PBT and CBT

# Load required package -------
library(TAM)

# Data preparation ------------
# 1 = PBT, 0 = CBT
# nperson: number of examinees
# nitem:   number of items
# response_raw: an [nperson, nitem] matrix of all responses under both test modes
# TMEbetween: stores the significance result of each item across test modes

# Create a data frame containing the test-mode label "mode" and the response data
response_b <- data.frame(mode = c(rep(1, nperson/2), rep(0, nperson/2)),
                         response_raw)

# Data analysis ---------------
# Create an empty matrix to store the results
TMEbetween <- matrix(data = NA, nrow = nitem, ncol = 1)
for (j in 1:nitem){
  # Compare scores on item j across the two test modes
  # (the first column is the mode label, so item columns start at j + 1)
  anova_item <- aov(response_b[, j + 1] ~ mode, data = response_b)
  # Store the p value in the corresponding cell
  TMEbetween[j, 1] <- summary(anova_item)[[1]]$`Pr(>F)`[1]
}
```

MCFA

```r
# Purpose: test measurement invariance of the results under PBT and CBT

# Load required package -------
library(lavaan)

# Model testing ---------------
# (this example assumes all items measure a single latent trait)
# 1. Test configural invariance
# 2. Test metric (weak) invariance, i.e., equal loadings
# 3. Test scalar (strong) invariance, i.e., equal intercepts
# 4. Free each item's intercept in turn and store the results in cfa_item
model <- 'trait =~ item1 + item2 + … + itemN'                  # specify the model
fit1 <- cfa(model, data = response_b, group = "mode")          # configural invariance
fit2 <- cfa(model, data = response_b, group = "mode",
            group.equal = "loadings")                          # metric invariance
fit3 <- cfa(model, data = response_b, group = "mode",
            group.equal = c("loadings", "intercepts"))         # scalar invariance
cfa_item <- matrix(data = NA, nrow = nitem, ncol = 1)          # empty matrix for results
for (j in 1:nitem){
  # Free the constraint on item j
  fit4 <- cfa(model, data = response_b, group = "mode",
              group.equal = c("loadings", "intercepts"),
              group.partial = paste("item", j, "~1", sep = ""))
  # Store the p value of the chi-square difference test
  cfa_item[j, 1] <- anova(fit3, fit4)$`Pr(>Chisq)`[2]
}
```

DIF (SIBTEST)

```r
# Purpose: compare results between the reference group and the focal group

# Load required package -------
library(mirt)

# DIF analysis ----------------
# beta_statistic: stores the test statistic for each suspect item
# suspect: the set of items that may exhibit TME
# anchor:  the set of anchor items assumed to be free of TME
# (if no anchor items are specified, all items other than the item under
#  examination can be used as the anchor set)
anchor <- c(1, 2, 3)                  # use items 1, 2, and 3 as anchors
suspect <- c(1:nitem)[-anchor]        # the remaining items may exhibit DIF
beta_statistic <- matrix(data = NA, nrow = length(suspect), ncol = 1)  # empty matrix
for (j in 1:length(suspect)){
  # Run a SIBTEST DIF analysis for each suspect item
  dif_item <- SIBTEST(response_b[, -1], response_b$mode,
                      match_set = anchor, suspect_set = suspect[j])
  # Store the beta statistic in the corresponding cell
  beta_statistic[j, 1] <- dif_item$beta[1]
}
```
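Because the ANOVA and MCFA loops above test every item separately, some control of the false-discovery or familywise error rate is advisable before flagging items as showing TME. The short follow-up sketch below assumes the objects TMEbetween and cfa_item produced by the appendix code; the Benjamini-Hochberg correction is an illustrative choice, not one prescribed by the article.

```r
# Adjust the per-item p values for multiple testing (Benjamini-Hochberg)
anova_flag <- p.adjust(TMEbetween[, 1], method = "BH") < .05
mcfa_flag  <- p.adjust(cfa_item[, 1], method = "BH") < .05

# Items flagged as showing a test mode effect by each method
which(anova_flag)
which(mcfa_flag)
```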
|||||