Test mode effect: Sources, detection, and applications

doi:10.3724/SP.J.1042.2023.01966

Abstract

Abstract:

International large-scale assessment programs (e.g., PISA, TIMSS, and NAEP), as well as small classroom tests, are increasingly using computers to administer tests. The test mode is undergoing a transformation from the traditional “paper-based testing (PBT)” to “computer-based testing (CBT)”. Before transforming the test mode, researchers and practitioners face a key issue: when the same test is administered with different test modes (such as PBT and CBT), the test results are not necessarily the same, so they cannot be blindly compared directly. Such difference in test function caused by the administration of the same test in different test modes is referred to as the test mode effect (TME). The existence of TME will have an impact on test fairness, selection criteria and test equating, so it is of great significance to accurately detect and interpret TME.

This review aims to systematically sort out the whole process of TME from “generation to detection”, and to grasp the research ideas and development trends of TME by summarizing or commenting or comparing the source, experimental design and detection methods of TME. Specifically, firstly, the source of TME was sorted out from the test level, item level, subject level and rater level respectively, and how the differences in these four levels lead to the generation of TME was analyzed; secondly, three experimental designs (between-group design, within-group design, and balanced incomplete block design) used to control for subject characteristics in TME research were outlined, along with their applicable scenarios; thirdly, four TME detection methods were introduced, they were ANOVA (Analysis of Variance), MCFA (multi-group confirmatory factor analysis), DIF (differential item functioning) and MEM (mode effect model); and their pros and cons, scope of application, and implementation methods were summarized and commented as well; finally, the research results in the field of TME over the past 40 years were summarized and analyzed.

Several future directions for research on TME can be identified. First, the interpretability and applicability of the MEM method can be further enhanced by including factors related to the source of TME in the existing MEM. Second, the range of test modes for TME research should be expanded. That is, TME may also occur between PBT and other test modes, including mobile-based assessment, phone or face-to-face interview, game-based assessment, and virtual and augmented reality-based assessments. Third, the rich TME research results can be applied to large-scale educational assessment programs in China to promote the use and development of CBT.

In summary, this article (1) proceeds from practical issues, emphasizing the impact of TME on test fairness, selection criteria and test equating, which will help arouse the attention of test users; (2) systematically introduces and compares four methods for detecting TME, providing references for researchers to choose and use in practice; (3) sorts out the source, detection (including experimental design and detection methods) and future research directions of TME, providing a complete research idea for follow-up research.

Key words: test mode effect, test fairness, measurement invariance, computer-based testing

CLC Number:

B841

CHEN Ping, DAI Yi, HUANG Yingshi. Test mode effect: Sources, detection, and applications[J]. Advances in Psychological Science, 2023, 31(10): 1966-1980.

Figures/Tables 5

TME的来源		TME产生的说明
测验层面
	作答设备	PBT使用纸笔, CBT使用屏幕、鼠标和键盘
	是否允许检查并修改答案	PBT允许检查修改答案, CBT往往不允许
	测验过程有无监督	PBT往往有监督, CBT可能无监督
	测验计时与选题方式	CBT的计时和选题方式更灵活, PBT的更固定
题目层面
	题目呈现方式	CBT的多样形式导致很难与PBT有完全相同的题目呈现方式
	题目类型	题型交互方式的复杂程度影响CBT上的表现
被试层面
	人口学变量	年龄和性别等通过影响其他变量间接导致TME
	计算机的熟练程度	计算机熟练程度可能影响CBT上的成绩
	作答动机	PBT和CBT上的作答动机不同导致得分差异
评分者层面
	评分者效应	主观题易受评分者效应的影响

组别	PBT		CBT
组别	题本A	题本B	题本A	题本B
组1	Test 1			Test 2
组2		Test 1	Test 2
组3		Test 2	Test 1
组4	Test 2			Test 1

	优点	缺点	适用范围	实现方式
ANOVA	方便快捷, 适用范围广	检验力较低	对TME进行初步检测	SPSS或TAM包
MCFA	可探究潜变量和观测变量间以及潜变量间的关系	对题目层面的TME检测过程较为繁琐	人格和社会心理领域内的测验	lavaan包
DIF	检验力高, 包含方法多样, 可灵活选择	各种DIF方法的自身不足	教育测量领域内的成就测验	mirt包
MEM	检验力高, 可在一定程度上了解TME的来源	模型较为复杂, 可能出现模型识别等问题	教育测量领域内的成就测验	mdltm软件

检验方法	代码示例
ANOVA	目的: 比较每一题在PBT和CBT上的平均分 # 加载所需程序包 ------- library(TAM) # 数据准备 ---------------- # 1 = PBT, 0 = CBT # nperson 为被试量(即图1中N) # nitem 为题目数(即图1中I) # response_raw 包含两种测验形式下的所有作答, 是一个[nperson, nitem]的矩阵 # TMEbetween 用于储存每道题在不同测验形式下的显著性结果 # 创建数据框, 包含测验模式标签“mode”与相应的作答数据 response_b <- data.frame(mode = c(rep(1, nperson/2), rep(0, nperson/2)), response_raw) # 数据分析 ---------------- # 创建空矩阵用于结果存储 TMEbetween <- matrix(data = NA, nrow = nitem, ncol = 1) for (j in 1:nitem){ # 对每一题比较两种测验模式下的得分差异(第一列是标签, 因此从j+1开始) anova_item <- aov(response_b[, j+1] ~ mode, data = response_b) # 将结果储存于矩阵相应位置 TMEbetween[j, 1] <- summary(anova_item)[[1]]$`Pr(>F)`[1] }
MCFA	目的: 检验PBT与CBT下结果的测量不变性 # 加载所需程序包 ------- library(lavaan) # 模型检验 ---------------- # (本示例限定所有题目都属于同一个潜在特质) # 1. 检验形态等价(即结构不变性) # 2. 检验载荷等价(即弱不变性) # 3. 检验截距等价(即强不变性) # 4. 依次放松每道题目的载荷限制, 并将结果储存于cfa_item model <- 'trait =~ item1 + item2 + … + itemN' # 建立模型 fit1 <- cfa(model, data = response_b, group = "mode") # 形态等价 fit2 <- cfa(model, data = response_b, group = "mode", group.equal = "loadings") # 载荷等价 fit3 <- cfa(model, data = response_b, group = "mode", group.equal = c("loadings", "intercepts")) # 截距等价 cfa_item <- matrix(data = NA, nrow = nitem, ncol = 1) # 创建空矩阵 for (j in 1:nitem){ # 依次对每一题放松限制 fit4 <- cfa(model, data = response_b, group = "mode", group.equal = c("loadings", "intercepts"), group.partial = paste("item", j, "~1", sep = "")) # 将结果储存于矩阵相应位置 cfa_item[j, 1] <- anova(fit3, fit4)$`Pr(>Chisq)`[2] }
DIF (SIBTEST)	目的: 分析参照组和目标组的结果差异 # 加载所需程序包 ------- library(mirt) # DIF检验 ----------------- # beta_statistic用于储存检验统计量的结果, 并且: # $\beta \in \left( 0,0.05 \right)$ 表示不存在DIF # $\beta \in \left( 0.05,0.1 \right)$ 表示存在中等程度DIF # $\beta$ 大于0.1表示存在较严重DIF (Puhan et al., 2007) # suspect为可能存在TME的题目集合 # anchor为不存在TME的锚题集合 #(当不指定锚题时, 可令除待检题目外的所有题作为锚题集) anchor <- c(1, 2, 3) # 设置锚题为第1、2和3题 suspect <- c(1:nitem)[-anchor] # 除去锚题, 即得到可能存在DIF的题目集合 beta_statistic <- matrix(data = NA, nrow = length(suspect), ncol = 1) # 创建空矩阵 for (j in 1:length(suspect)){ # 对每一题进行DIF检验 dif_item <- SIBTEST(response_b[, -1], response_b$mode, match_set = anchor, suspect_set = suspect[j]) # 将结果储存于矩阵相应位置 beta_statistic[j, 1] <- dif_item$beta[1] }

检验方法	代码示例
ANOVA	目的: 比较每一题在PBT和CBT上的平均分 # 加载所需程序包 ------- library(TAM) # 数据准备 ---------------- # 1 = PBT, 0 = CBT # nperson 为被试量(即图1中N) # nitem 为题目数(即图1中I) # response_raw 包含两种测验形式下的所有作答, 是一个[nperson, nitem]的矩阵 # TMEbetween 用于储存每道题在不同测验形式下的显著性结果 # 创建数据框, 包含测验模式标签“mode”与相应的作答数据 response_b <- data.frame(mode = c(rep(1, nperson/2), rep(0, nperson/2)), response_raw) # 数据分析 ---------------- # 创建空矩阵用于结果存储 TMEbetween <- matrix(data = NA, nrow = nitem, ncol = 1) for (j in 1:nitem){ # 对每一题比较两种测验模式下的得分差异(第一列是标签, 因此从j+1开始) anova_item <- aov(response_b[, j+1] ~ mode, data = response_b) # 将结果储存于矩阵相应位置 TMEbetween[j, 1] <- summary(anova_item)[[1]]$`Pr(>F)`[1] }
MCFA	目的: 检验PBT与CBT下结果的测量不变性 # 加载所需程序包 ------- library(lavaan) # 模型检验 ---------------- # (本示例限定所有题目都属于同一个潜在特质) # 1. 检验形态等价(即结构不变性) # 2. 检验载荷等价(即弱不变性) # 3. 检验截距等价(即强不变性) # 4. 依次放松每道题目的载荷限制, 并将结果储存于cfa_item model <- 'trait =~ item1 + item2 + … + itemN' # 建立模型 fit1 <- cfa(model, data = response_b, group = "mode") # 形态等价 fit2 <- cfa(model, data = response_b, group = "mode", group.equal = "loadings") # 载荷等价 fit3 <- cfa(model, data = response_b, group = "mode", group.equal = c("loadings", "intercepts")) # 截距等价 cfa_item <- matrix(data = NA, nrow = nitem, ncol = 1) # 创建空矩阵 for (j in 1:nitem){ # 依次对每一题放松限制 fit4 <- cfa(model, data = response_b, group = "mode", group.equal = c("loadings", "intercepts"), group.partial = paste("item", j, "~1", sep = "")) # 将结果储存于矩阵相应位置 cfa_item[j, 1] <- anova(fit3, fit4)$`Pr(>Chisq)`[2] }
DIF (SIBTEST)	目的: 分析参照组和目标组的结果差异 # 加载所需程序包 ------- library(mirt) # DIF检验 ----------------- # beta_statistic用于储存检验统计量的结果, 并且: # $\beta \in \left( 0,0.05 \right)$ 表示不存在DIF # $\beta \in \left( 0.05,0.1 \right)$ 表示存在中等程度DIF # $\beta$ 大于0.1表示存在较严重DIF (Puhan et al., 2007) # suspect为可能存在TME的题目集合 # anchor为不存在TME的锚题集合 #(当不指定锚题时, 可令除待检题目外的所有题作为锚题集) anchor <- c(1, 2, 3) # 设置锚题为第1、2和3题 suspect <- c(1:nitem)[-anchor] # 除去锚题, 即得到可能存在DIF的题目集合 beta_statistic <- matrix(data = NA, nrow = length(suspect), ncol = 1) # 创建空矩阵 for (j in 1:length(suspect)){ # 对每一题进行DIF检验 dif_item <- SIBTEST(response_b[, -1], response_b$mode, match_set = anchor, suspect_set = suspect[j]) # 将结果储存于矩阵相应位置 beta_statistic[j, 1] <- dif_item$beta[1] }

References 55

[56]	Kröhne, U., & Martens, T. (2011). 11 Computer-based competence tests in the national educational panel study: The challenge of mode effects. Zeitschrift für Erziehungswissenschaft, 14, 169-186. doi: 10.1007/s11618-011-0185-4 URL
[57]	Kulik, J. A., Kulik, C.-L. C., & Cohen, P. A. (1980). Effectiveness of computer-based college teaching: A meta-analysis of findings. Review of Educational Research, 50(4), 525-544. doi: 10.3102/00346543050004525 URL
[58]	Lee, J. A., Moreno, K. E., & Sympson, J. B. (1986). The effects of mode of test administration on test performance. Educational and Psychological Measurement, 46(2), 467-474. doi: 10.1177/001316448604600224 URL
[59]	Lee, Y.-J. (2002). A comparison of composing processes and written products in timed-essay tests across paper-and- pencil and computer modes. Assessing Writing, 8, 135-157. doi: 10.1016/S1075-2935(03)00003-5 URL
[60]	Li, J. (2006). The mediation of technology in ESL writing and its implications for writing assessment. Assessing Writing, 11(1), 5-21. doi: 10.1016/j.asw.2005.09.001 URL
[61]	Liu, J., Brown, T., Chen, J., Ali, U., Hou, L., & Costanzo, K. (2016). Mode comparability study based on spring 2015 operational test data. Retrieved March 6, 2023, from https://files.eric.ed.gov/fulltext/ED599049.pdf.
[62]	Lynch, S. (2022). Adapting paper-based tests for computer administration: Lessons learned from 30 years of mode effects studies in education. Practical Assessment, Research, and Evaluation, 27, Article 22.
[63]	Ma, W., & Guo, W. (2019). Cognitive diagnosis models for multiple strategies. British Journal of Mathematical and Statistical Psychology, 72(2), 370-392. doi: 10.1111/bmsp.12155
[64]	Magnus, B. E., Liu, Y., He, J., Quinn, H., Thissen, D., Gross, H. E., & Reeve, B. B. (2016). Mode effects between computer self-administration and telephone interviewer- administration of the PROMIS(^®) pediatric measures, self-and proxy report. Quality of Life Research, 25(7), 1655-1665.
[65]	McMullin, J., Varnhagen, C., Heng, P., & Apedoe, X. (2002). Effects of surrounding information and line length on text comprehension from the web. Canadian Journal of Learning and Technology, 28, 19-29.
[66]	OECD. (2014). PISA 2015 Field Trial Goals, Assessment Design, and Analysis Plan for Cognitive Assessment. PISA, OECD Publishing, Paris.
[67]	OECD. (2016). PISA 2015 Results (Volume I): Excellence and Equity in Education. PISA, OECD Publishing, Paris.
[68]	OECD. (2017). PISA 2015 technical report. PISA, OECD Publishing, Paris.
[69]	Paleczek, L., Seifert, S., & Schöfl, M. (2021). Comparing digital to print assessment of receptive vocabulary with GraWo-KiGa in Austrian kindergarten. British Journal of Educational Technology, 52(6), 2145-2161. doi: 10.1111/bjet.v52.6 URL
[70]	Poggio, J., Glasnapp, D. R., Yang, X., & Poggio, A. J. (2005). A comparative evaluation of score results from computerized and paper & pencil mathematics testing in a large scale state assessment program. Journal of Technology, Learning, and Assessment, 3(6), 1-31.
[71]	Pomplun, M. (2007). A bifactor analysis for a mode-of- administration effect. Applied Measurement in Education, 20, 137-152. doi: 10.1080/08957340701301264 URL
[72]	Pomplun, M., Ritchie, T., & Custer, M. (2006). Factors in paper-and-pencil and computer reading score differences at the primary grades. Educational Assessment, 11(2), 127-143. doi: 10.1207/s15326977ea1102_3 URL
[73]	Porion, A., Aparicio, X., Megalakaki, O., Robert, A., & Baccino, T. (2016). The impact of paper-based versus computerized presentation on text comprehension and memorization. Computers in Human Behavior, 54, 569-576. doi: 10.1016/j.chb.2015.08.002 URL
[74]	Powers, D. E. (1999). Test anxiety and test performance: Comparing paper-based and computer-adaptive versions of the GRE general test (ETS Research Report Series, No. 99-15). Princeton, NJ: Educational Testing Service.
[75]	Powers, D. E., Fowles, M. E., Farnum, M., & Ramsey, P. (1994). They think less of my handwritten essay if others word process theirs? Effects on essay scores of intermingling handwritten and word-processed essays. Journal of Educational Measurement, 31(3), 220-233. doi: 10.1111/jedm.1994.31.issue-3 URL
[1]	白新文, 陈毅文. (2004). 测量等价性的概念及其判定条件. 心理科学进展, 12(2), 231-239.
[2]	蔡华俭, 林永佳, 伍秋萍, 严乐, 黄玄凤. (2008). 网络测验和纸笔测验的测量不变性研究——以生活满意度量表为例. 心理学报, 40(2), 228-239.
[3]	蔡晓芬. (2014). SP程序和DFTD策略应用于IRT取向下DIF检测方法的效应比较 (硕士学位论文). 江西师范大学, 南昌.
[4]	陈冠宇, 陈平. (2019). 解释性项目反应理论模型: 理论与应用. 心理科学进展, 27(5), 937-950. doi: 10.3724/SP.J.1042.2019.00937
[5]	陈平, 丁树良. (2008). 允许检查并修改答案的计算机化自适应测验. 心理学报, 40(6), 737-747.
[6]	高旭亮, 涂冬波, 王芳, 张龙, 李雪莹. (2016). 可修改答案的计算机化自适应测验的方法. 心理科学进展, 24(4), 654-664. doi: 10.3724/SP.J.1042.2016.00654
[7]	韩建涛, 刘文令, 庞维国. (2019). 创造力测评中的评分者效应. 心理科学进展, 27(1), 171-180. doi: 10.3724/SP.J.1042.2019.00171
[8]	林喆, 陈平, 辛涛. (2015). 允许CAT题目检查的区块题目袋方法. 心理学报, 47(9), 1188-1198.
[9]	聂旭刚, 陈平, 张缨斌, 何引红. (2018). 题目位置效应的概念及检测. 心理科学进展, 26(2), 368-380. doi: 10.3724/SP.J.1042.2018.00368
[10]	檀慧玲, 李文燕, 万兴睿. (2018). 国际教育评价项目合作问题解决能力测评: 指标框架、评价标准及技术分析. 电化教育研究, 39(9), 123-128.
[11]	汤楚. (2016). 短测验项目功能差异检测方法的比较研究 (硕士学位论文). 江西师范大学, 南昌.
[76]	Prisacari, A. A., & Danielson, J. (2017a). Rethinking testing mode: Should I offer my next chemistry test on paper or computer? Computers & Education, 106, 1-12. doi: 10.1016/j.compedu.2016.11.008 URL
[77]	Prisacari, A. A., & Danielson, J. (2017b). Computer-based versus paper-based testing: Investigating testing mode with cognitive load and scratch paper use. Computers in Human Behavior, 77, 1-10. doi: 10.1016/j.chb.2017.07.044 URL
[78]	Puhan, G., Boughton, K., & Kim, S. (2007). Examining differences in examinee performance in paper and pencil and computerized testing. Journal of Technology, Learning, and Assessment, 6(3), 1-21.
[79]	Raju, N. S., van der Linden, W., & Fleer, P. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19(4), 353-368. doi: 10.1177/014662169501900405 URL
[80]	Ramesh, D., & Sanampudi, S. K. (2022). An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review, 55(3), 2495-2527. doi: 10.1007/s10462-021-10068-2
[81]	Robitzsch, A., Kiefer, T., & Wu, M. (2022). Test Analysis Modules (TAM). R package. Retrieved April 26, 2023, from https://cran.r-project.org/web/packages/TAM/TAM.pdf.
[82]	Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36.
[83]	Rowan, B. (2010). Comparability of paper-and-pencil and computer-based cognitive and non-cognitive measures in a low-stakes testing environment (Unpublished doctorial dissertation). James Madison University, Harrisonburg.
[84]	Russell, M., & Haney, W. (1997). Testing writing on computers: An experiment comparing student performance on tests conducted via computer and via paper-and-pencil. Education Policy Analysis Archives, 5(3), 1-20. doi: 10.14507/epaa.v5n1.1997 URL
[85]	Russell, M., & Plati, T. (2002). Does it matter with what I write? Comparing performance on paper, computer and portable writing devices. Current Issues in Education, 5(4), 1-15. doi: 10.1353/eac.1985.a592029 URL
[12]	Arnold, V., Legas, J., Obler, S., Pacheco, M. A., Russell, C., & Umbdenstock, L. (1990). Do students get higher scoreson their word-processed paper? A study of bias in scoring hand-written vs. word-processed papers. Retrieved March 7, 2023, from https://files.eric.ed.gov/fulltext/ED345818.pdf.
[13]	Backes, B., & Cowan, J. (2019). Is the pen mightier than the keyboard? The effect of online testing on measured student achievement. Economics of Education Review, 68, 89-103. doi: 10.1016/j.econedurev.2018.12.007
[14]	Beatty, A. E., Esco, A., Curtiss, A. B. C., & Ballen, C. J. (2022). Students who prefer face-to-face tests outperform their online peers in organic chemistry. Chemistry Education Research and Practice, 23, 464-474. doi: 10.1039/D1RP00324K URL
[15]	Bennett, R. E., Braswell, J., Oranje, A., Sandene, B., Kaplan, B., & Yan, F. (2008). Does it matter if I take my mathematics test on computer? A second empirical study of mode effects in NAEP. The Journal of Technology, Learning, and Assessment, 6(9), 1-39.
[16]	Bernard, M., Fernandez, M., Hull, S., & Chaparro, B. S. (2003). The effects of line length on children and adults’ perceived and actual online reading performance. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 47(11), 1375-1379.
[17]	Bernard, M., Lida, B., Riley, S., Hackler, T., & Janzen, K. (2002). A comparison of popular online fonts: Which size and type is best. Usability News, 4(1), 1-8.
[18]	Bernard, M., & Mills, M. (2000). So, what size and type of font should I use on my website? Usability News, 2(2), 1-5.
[19]	Blumenthal, S., & Blumenthal, Y. (2020). Tablet or paper and pen? Examining mode effects on German elementary school students’ computation skills with curriculum-based measurements. International Journal of Educational Methodology, 6(4), 669-680. doi: 10.12973/ijem URL
[20]	Bodmann, S. M., & Robinson, D. H. (2004). Speed and performance differences among computer-based and paper-pencil tests. Journal of Educational Computing Research, 31(1), 51-60. doi: 10.2190/GRQQ-YT0F-7LKB-F033 URL
[21]	Bridgeman, B., Lennon, M. L., & Jackenthal, A. (2003). Effects of screen size, screen resolution, and display rate on computer-based test performance. Applied Measurement in Education, 16(3), 191-205. doi: 10.1207/S15324818AME1603_2 URL
[86]	Russell, M, & Tao, W. (2004a). Effects of handwriting and computer-print on composition scores: A follow-up to Powers, Fowles, Farnum, & Ramsey. Practical Assessment, Research, and Evaluation, 9, Article 1.
[87]	Russell, M., & Tao, W. (2004b). The influence of computer-print on rater scores. Practical Assessment, Research, and Evaluation, 9, Article 10.
[88]	Schwarz, R. D., Rich, C., & Podrabsky, T. (2003, April). A DIF analysis of item-level mode effects for computerized and paper-and-pencil tests. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.
[89]	Seifert, S., & Paleczek, L. (2022). Comparing tablet and print mode of a German reading comprehension test in grade 3: Influence of test order, gender and language. International Journal of Educational Research, 113, 1-13.
[90]	Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58(2), 159-194. doi: 10.1007/BF02294572 URL
[91]	Terluin, B., Brouwers, E. P. M., Marchand, M. A. G., & de Vet, H. C. W. (2018). Assessing the equivalence of web-based and paper-and-pencil questionnaires using differential item and test functioning (DIF and DTF) analysis: A case of the Four-Dimensional Symptom Questionnaire (4DSQ). Quality of Life Research, 27(5), 1191-1200. doi: 10.1007/s11136-018-1816-5 pmid: 29468387
[92]	von Davier, M. (2005). A general diagnostic model applied to language testing data (ETS Research Report Series, No. 05-16). Princeton, NJ: Educational Testing Service.
[93]	von Davier, M., Khorramdel, L., He, Q. W., Shin, H. J., & Chen, H. W. (2019). Developments in psychometric population models for technology-based large-scale assessments: An overview of challenges and opportunities. Journal of Educational and Behavioral Statistics, 44(6), 671-705.
[94]	Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12, 15-20. doi: 10.1111/j.1745-3992.1993.tb00519.x URL
[95]	Wang, S., Jiao, H., Young, M. J., Brooks, T., & Olson, J. (2007). A meta-analysis of testing mode effects in grade K-12 mathematics tests. Educational and Psychological Measurement, 67(2), 219-238. doi: 10.1177/0013164406288166 URL
[22]	Brunfaut, T., Harding, L., & Batty, A. O. (2018). Going online: The effect of mode of delivery on performances and perceptions on an English L2 writing test suite. Assessing Writing, 36, 3-18. doi: 10.1016/j.asw.2018.02.003 URL
[23]	Buerger, S., Kroehne, U., & Goldhammer, F. (2016). The transition to computer-based testing in large-scale assessments: Investigating (partial) measurement invariance between modes. Psychological Test and Assessment Modeling, 58, 597-616.
[24]	Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29.
[25]	Chan, K. S., Orlando, M., Ghosh-Dastidar, B., Duan, N., & Sherbourne, C. D. (2004). The interview mode effect on the Center for Epidemiological Studies Depression (CES-D) scale: An item response theory analysis. Medical Care, 42(3), 281-289. pmid: 15076828
[26]	Chan, S., Bax, S., & Weir, C. (2018). Researching the comparability of paper-based and computer-based delivery in a high-stakes writing test. Assessing Writing, 36, 32-48. doi: 10.1016/j.asw.2018.03.008 URL
[27]	Chua, S. L., Chen, D.-T., & Wong, A. F. L. (1999). Computer anxiety and its correlates: A meta-analysis. Computers in Human Behavior, 15(5), 609-623. doi: 10.1016/S0747-5632(99)00039-4 URL
[28]	Chua, Y. P. (2012). Effects of computer-based testing on test performance and testing motivation. Computers in Human Behavior, 28(5), 1580-1586. doi: 10.1016/j.chb.2012.03.020 URL
[29]	Clariana, R., & Wallace, P. (2002). Paper-based versus computer-based assessment: Key factors associated with the test mode effect. British Journal of Educational Technology, 33(5), 593-602. doi: 10.1111/bjet.2002.33.issue-5 URL
[30]	Claudia, P. F., Oshima, T. C., & Nambury, S. R. (1999). A description and demonstration of the polytomous-DFIT framework. Applied Psychological Measurement, 23(4), 309-326. doi: 10.1177/01466219922031437 URL
[31]	de La Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333-353. doi: 10.1007/BF02295640 URL
[96]	Wang, S., Jiao, H., Young, M. J., Brooks, T., & Olson, J. (2008). Comparability of computer-based and paper-and- pencil testing in K-12 reading assessments: A meta-analysis of testing mode effects. Educational and Psychological Measurement, 68(1), 5-24. doi: 10.1177/0013164407305592 URL
[97]	Weigold, A., Weigold, I. K., Drakeford, N. M., Dykema, S. A., & Smith, C. A. (2016). Equivalence of paper-and- pencil and computerized self-report surveys in older adults. Computers in Human Behavior, 54, 407-413. doi: 10.1016/j.chb.2015.08.033 URL
[98]	Wise, S. L., Freeman, S. A., Finney, S. J., Enders, C. K., & Severance, D. D. (1997, March). The accuracy of examinee judgments of relative item difficulty: Implications for computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education. Chicago, IL.
[99]	Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
[100]	Zhi, M., & Huang, B. (2021). Investigating the authenticity of computer-and paper-based ESL writing tests. Assessing Writing, 50, Article 100548.
[101]	Ziefle, M. (1998). Effects of display resolution on visual performance. Human Factors, 40(4), 554-568. pmid: 9974229
[32]	Duchnicky, R. L., & Kolers, P. A. (1983). Readability of text scrolled on visual display terminals as a function of window size. Human Factors, 25(6), 683-692. pmid: 6671649
[33]	Feskens, R., Fox, J.-P., & Zwitser, R. (2019). Differential item functioning in PISA due to mode effects. In B. Veldkamp & C. Sluijter (Eds.), Theoretical and practical advances in computer-based educational measurement (pp. 231-247). Cham, Switzerland: Springer.
[34]	Fouladi, R. T., McCarthy, C. J., & Moller, N. (2002). Paper-and-pencil or online? Evaluating mode effects on measures of emotional functioning and attachment. Assessment, 9(2), 204-215. pmid: 12066835
[35]	Fritts, B. E., & Marszalek, J. M. (2010). Computerized adaptive testing, anxiety levels, and gender differences. Social Psychology of Education, 13, 441-458. doi: 10.1007/s11218-010-9113-3 URL
[36]	Goldberg, A., Russell, M. & Cook, A. (2003). The effect of computers on student writing: A meta-analysis of studies from 1992 to 2002. The Journal of Technology, Learning, and Assessment, 2(1), 1-52.
[37]	Goldberg, A. L., & Pedulla, J. J. (2002). Performance differences according to test mode and computer familiarity on a practice graduate record exam. Educational and Psychological Measurement, 62(6), 1053-1067. doi: 10.1177/0013164402238092 URL
[38]	Gu, L., Ling, G. M., Liu, O. L., Yang, Z. T., Li, G. R., Kardanova, E., & Loyalka, P. (2021). Examining mode effects for an adapted Chinese critical thinking assessment. Assessment & Evaluation in Higher Education, 46(6), 879-893.
[39]	Hamhuis, E., Glas, C., & Meelissen, M. (2020). Tablet assessment in primary education: Are there performance differences between TIMSS’ paper-and-pencil test and tablet test among Dutch grade-four students? British Journal of Educational Technology, 51(6), 2340-2358. doi: 10.1111/bjet.v51.6 URL
[40]	Hox, J. J., De Leeuw, E. D., & Zijlmans, E. A. O. (2015). Measurement equivalence in mixed mode surveys. Frontiers in Psychology, 6, Article 87.
[41]	Hunsu, N. J. (2015). Issues in transitioning from the traditional blue-book to computer-based writing assessment. Computers and Composition, 35, 41-51. doi: 10.1016/j.compcom.2015.01.006 URL
[42]	Jeong, H. (2012). A comparative study of scores on computer-based tests and paper-based tests. Behaviour & Information Technology, 33(4), 410-422.
[43]	Jerrim, J. (2016). PISA 2012: How do results for the paper and computer tests compare? Assessment in Education: Principles, Policy & Practice, 23(4), 495-518.
[44]	Jerrim, J., Micklewright, J., Heine, J.-H., Salzer, C., & McKeown, C. (2018). PISA 2015: How big is the ‘mode effect’ and what has been done about it? Oxford Review of Education, 44(4), 476-493. doi: 10.1080/03054985.2018.1430025 URL
[45]	Jin, Y., & Yan, M. (2017). Computer literacy and the construct validity of a high-stakes computer-based writing assessment. Language Assessment Quarterly, 14(2), 101-119. doi: 10.1080/15434303.2016.1261293 URL
[46]	Johnson, M., & Green, S. (2006). On-Line mathematics assessment: The impact of mode on performance and question answering strategies. Journal of Technology, Learning, and Assessment, 4(5), 1-35.
[47]	Keng, L., McClarty, K. L., & Davis, L. L. (2008). Item-level comparative analysis of online and paper administrations of the Texas Assessment of Knowledge and Skills. Applied Measurement in Education, 21(3), 207-226. doi: 10.1080/08957340802161774 URL
[48]	Khoshsima, H., Hosseini, M., & Toroujeni, S. M. H. (2017). Cross-mode comparability of computer-based testing (CBT) versus paper-pencil based testing (PPT): An investigation of testing administration mode among Iranian intermediate EFL learners. English Language Teaching, 10(2), 23-32. doi: 10.5539/elt.v10n2p23 URL
[49]	Khoshsima, H., & Toroujeni, S. M. H. (2017). Comparability of computer-based testing and paper-based testing: Testing mode effect, testing mode order, computer attitudes and testing mode preference. International Journal of Computer, 24, 80-99.
[50]	Kim, D., & Huynh, H. (2008). Computer-based and paper- and-pencil administration mode effects on a statewide end-of-course English test. Educational and Psychological Measurement, 68(4), 554-570. doi: 10.1177/0013164407310132 URL
[51]	Kim, S., & Walker, M. (2021). Assessing mode effects of at-home testing without a randomized trial (ETS Research Reprot Series, No. 21-10). New Jersey, NJ: Educational Testing Service.
[52]	Kim, Y. J., Dykema, J., Stevenson, J., Black, P., & Moberg, D. P. (2018). Straightlining: Overview of measurement, comparison of indicators, and effects in mail-web mixed-mode surveys. Social Science Computer Review, 37(2), 214-233. doi: 10.1177/0894439317752406 URL
[53]	Kingston, N. M. (2008). Comparability of computer-and paper- administered multiple-choice tests for K-12 populations: A synthesis. Applied Measurement in Education, 22(1), 22-37. doi: 10.1080/08957340802558326 URL
[54]	Kline, R. (2013). Assessing statistical aspects of test fairness with structural equation modelling. Educational Research and Evaluation: Fairness Issue in Educational Assessment, 19(2-3), 204-222.
[55]	Kroehne, U., Gnambs, T., & Goldhammer, F. (2019). Disentangling setting and mode effects for online competence assessment. In H. P. Blossfeld & H. G. Roβbach (2^nd Eds.), Education as a lifelong process (pp. 171-193). Wiesbaden, Germany: Springer VS.