ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society;
   Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica ›› 2024, Vol. 56 ›› Issue (6): 831-844. doi: 10.3724/SP.J.1041.2024.00831

• Research Report •

• Corresponding author: TIAN Xuetao, E-mail: xttian@bnu.edu.cn
• Supported by:
  Young Scientists Fund of the National Natural Science Foundation of China (62207002); General Program of the National Natural Science Foundation of China (62377003); China Postdoctoral Science Foundation Special Funding (pre-station) (2022TQ0040); China Postdoctoral Science Foundation General Funding (2022M720486)

Automated scoring of open-ended situational judgment tests

XU Jing1, LUO Fang1, MA Yanzhen2, HU Luming3, TIAN Xuetao1   

  1School of Psychology, Beijing Normal University, Beijing 100875, China;
    2Collaborative Innovation Center of Assessment toward Basic Education Quality, Beijing Normal University, Beijing 100875, China;
    3Department of Psychology, School of Arts and Sciences, Beijing Normal University at Zhuhai, Zhuhai 519085, China
  • Received: 2022-10-22  Online: 2024-04-08  Published: 2024-06-25

Abstract: Constrained by scoring costs, open-ended situational judgment tests are difficult to use widely. Taking teacher competency assessment as an example, this study explored the application of automated scoring. An open-ended situational judgment test was developed around typical problem scenarios in teaching, response texts were collected from primary and secondary school teachers, and deep neural networks were applied under a supervised learning strategy to identify response categories at both the document and sentence levels. The convolutional neural network (CNN) performed best, with per-item scoring accuracy of 70% to 88% and high agreement with human scoring: the human-machine correlation coefficient r was 0.95 and the quadratic weighted kappa (QWK) was 0.82. The results indicate that machine scoring can achieve stable performance and that automated scoring research can support the wide application of open-ended situational judgment tests.


Abstract: Situational judgment tests (SJTs) have gained popularity for their distinctive testing content and high face validity. However, traditional SJT formats, particularly those with multiple-choice (MC) options, have drawn criticism for their susceptibility to test-taking strategies. Open-ended, constructed-response (CR) formats offer a promising remedy, but their wide adoption is hindered mainly by the cost of manual scoring. To address this challenge, we propose an open-ended SJT with a written constructed-response format for assessing teacher competency. This study built a scoring framework based on natural language processing (NLP) to automate the assessment of response texts and then rigorously evaluated the system's validity. A teacher competency model was constructed with four dimensions: student orientation, problem solving, emotional intelligence, and achievement motivation. An open-ended situational judgment test was then developed to gauge teachers' aptitude in addressing typical teaching dilemmas. Responses were collected from 627 primary and secondary school teachers, and 6,000 response texts from 300 participants were scored manually against predefined criteria. To automate the scoring process, supervised learning strategies were employed to categorize responses at both the document and sentence levels. Several deep learning models, including the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), C-LSTM, RNN+attention, and LSTM+attention, were implemented and compared, and the agreement between human and machine scoring was assessed. The validity of automatic scoring was also verified.
The open-ended situational judgment test achieved a Cronbach's alpha of 0.91 and showed good fit in a confirmatory factor analysis conducted in Mplus. Criterion-related validity was supported by significant correlations between test results and several educational facets, including instructional design, classroom evaluation, homework design, job satisfaction, and teaching philosophy. Among the machine scoring models evaluated, the CNN performed best, with per-item scoring accuracy ranging from 70% to 88% and high consistency with expert scores (r = 0.95, QWK = 0.82). The human-machine correlations for the four dimensions (student orientation, problem solving, emotional intelligence, and achievement motivation) were all approximately 0.9. The model also maintained high predictive accuracy on new text data, evidence of robust generalization.
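The human-machine agreement above is summarized with the quadratic weighted kappa (QWK), which penalizes disagreements by the squared distance between rating levels. As an illustration only (the function and the toy ratings below are our own, not the study's code), a minimal stdlib-only Python sketch of the statistic:

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK between two integer rating lists coded 0..num_levels-1."""
    n = len(rater_a)
    # observed agreement matrix and marginal histograms
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    hist_a = [0] * num_levels
    hist_b = [0] * num_levels
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
        hist_a[a] += 1
        hist_b[b] += 1
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den  # 1 = perfect, 0 = chance-level agreement

# perfect agreement yields QWK = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 2, 1], [0, 1, 2, 2, 1], 3))  # 1.0
```

Unlike raw accuracy, QWK rewards a machine score that is off by one level more than one that is off by two, which suits ordinal rubric scores such as those used here.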
This study explored automated scoring of open-ended situational judgment tests with rigorous psychometric methods, focusing on the assessment of teacher competency traits to establish validity. Fine-grained scoring guidelines were formulated, and state-of-the-art NLP techniques were used for text feature recognition and classification. The main findings are as follows: (1) open-ended SJTs can establish precise scoring criteria grounded in key behavioral response elements; (2) sentence-level text classification outperforms document-level classification, with the CNN achieving the highest accuracy in response categorization; and (3) the scoring model delivers stable performance and agrees closely with human scoring, suggesting that it can partially replace manual scoring.
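The sentence-level CNN pipeline described above (embedding lookup, convolution over word windows, and max-over-time pooling into a fixed-length feature vector for classification) can be sketched in plain Python. This is a toy, untrained illustration under our own assumptions: the vocabulary, dimensions, and random filter weights are invented, whereas the study's actual models were trained deep networks over real response texts.

```python
import math
import random

random.seed(0)

VOCAB = {"teacher": 0, "student": 1, "calm": 2, "talk": 3, "<unk>": 4}
EMB_DIM, WIN, N_FILTERS = 8, 3, 4  # toy sizes, assumptions only

# random embedding table and convolution filters (untrained, for illustration)
emb = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in VOCAB]
filters = [[random.uniform(-1, 1) for _ in range(WIN * EMB_DIM)]
           for _ in range(N_FILTERS)]

def cnn_features(tokens):
    """Max-over-time pooled features from a 1-D convolution over word windows."""
    vecs = [emb[VOCAB.get(t, VOCAB["<unk>"])] for t in tokens]
    while len(vecs) < WIN:          # pad short sentences so one window exists
        vecs.append([0.0] * EMB_DIM)
    feats = []
    for f in filters:
        best = -math.inf
        for start in range(len(vecs) - WIN + 1):
            # flatten the window of WIN word vectors and apply the filter
            window = [x for v in vecs[start:start + WIN] for x in v]
            act = math.tanh(sum(w * x for w, x in zip(f, window)))
            best = max(best, act)   # max pooling over positions
        feats.append(best)
    return feats  # fixed-length vector; a softmax layer would classify it

print(len(cnn_features(["teacher", "student", "calm", "talk"])))  # 4
```

Max-over-time pooling is what lets a CNN map variable-length responses to a fixed-length representation, which is one reason convolutional models suit short-answer classification of this kind.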

Key words: situational judgment tests, automated scoring, teacher competency, open-ended tests, machine learning