ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society;
   Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica ›› 2024, Vol. 56 ›› Issue (6): 831-844. doi: 10.3724/SP.J.1041.2024.00831

• Research Report •

• Corresponding author: TIAN Xuetao, E-mail: xttian@bnu.edu.cn
• Supported by:
  Young Scientists Fund of the National Natural Science Foundation of China (62207002); General Program of the National Natural Science Foundation of China (62377003); China Postdoctoral Science Foundation Special Funding (pre-station) (2022TQ0040); China Postdoctoral Science Foundation General Funding (2022M720486)

Automated scoring of open-ended situational judgment tests

XU Jing1, LUO Fang1, MA Yanzhen2, HU Luming3, TIAN Xuetao1   

  1School of Psychology, Beijing Normal University, Beijing 100875, China;
    2Collaborative Innovation Center of Assessment toward Basic Education Quality, Beijing Normal University, Beijing 100875, China;
    3Department of Psychology, School of Arts and Sciences, Beijing Normal University at Zhuhai, Zhuhai 519085, China
  • Received: 2022-10-22  Online: 2024-04-08  Published: 2024-06-25

Abstract: Constrained by scoring costs, open-ended situational judgment tests are difficult to use widely. Taking teacher competency assessment as an example, this study explored the application of automated scoring. An open-ended situational judgment test was developed around typical problem scenarios in teaching, response texts were collected from primary and secondary school teachers, and deep neural networks were applied under a supervised learning strategy to identify response categories at both the document and sentence levels. The convolutional neural network (CNN) performed best, with per-item scoring accuracy of 70% to 88% and high agreement with human scoring: the human-machine correlation coefficient r was 0.95 and the quadratic weighted kappa (QWK) was 0.82. The results indicate that machine scoring can achieve stable performance and that automated scoring research can support the wide application of open-ended situational judgment tests.


Abstract: Situational judgment tests (SJTs) have gained popularity for their distinctive testing content and high face validity. However, traditional SJT formats, particularly those with multiple-choice (MC) options, have drawn criticism for their susceptibility to test-taking strategies. Open-ended, constructed-response (CR) formats offer a promising remedy, but their wide adoption is hindered mainly by the cost of manual scoring. To address this challenge, we propose an open-ended SJT with a written constructed-response format for assessing teacher competency. This study built a scoring framework based on natural language processing (NLP) to automate the assessment of response texts and then rigorously evaluated the system's validity. A teacher competency model was constructed with four dimensions: student orientation, problem solving, emotional intelligence, and achievement motivation. An open-ended situational judgment test was then developed to gauge teachers' aptitude in addressing typical teaching dilemmas. Responses were collected from 627 primary and secondary school teachers, and 6,000 response texts from 300 participants were scored manually against predefined criteria. To automate the scoring process, supervised learning strategies were employed to categorize responses at both the document and sentence levels. Several deep learning models, including the convolutional neural network (CNN), recurrent neural network (RNN), long short-term memory (LSTM), C-LSTM, RNN+attention, and LSTM+attention, were implemented and compared, and the agreement between human and machine scoring was assessed. The validity of automatic scoring was also verified.
The open-ended situational judgment test achieved a Cronbach's alpha of 0.91 and showed good fit in a confirmatory factor analysis conducted in Mplus. Criterion-related validity was supported by significant correlations between test results and several educational facets, including instructional design, classroom evaluation, homework design, job satisfaction, and teaching philosophy. Among the machine scoring models evaluated, the CNN performed best, with per-item scoring accuracy ranging from 70% to 88% and high consistency with expert scores (r = 0.95, QWK = 0.82). The human-machine correlations for the four dimensions (student orientation, problem solving, emotional intelligence, and achievement motivation) were all approximately 0.9. The model also maintained high predictive accuracy on new text data, evidence of robust generalization.
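The human-machine agreement above is summarized with the quadratic weighted kappa (QWK), which penalizes disagreements by the squared distance between rating levels. As an illustration only (the function and the toy ratings below are our own, not the study's code), a minimal stdlib-only Python sketch of the statistic:

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK between two integer rating lists coded 0..num_levels-1."""
    n = len(rater_a)
    # observed agreement matrix and marginal histograms
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    hist_a = [0] * num_levels
    hist_b = [0] * num_levels
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
        hist_a[a] += 1
        hist_b[b] += 1
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den  # 1 = perfect, 0 = chance-level agreement

# perfect agreement yields QWK = 1.0
print(quadratic_weighted_kappa([0, 1, 2, 2, 1], [0, 1, 2, 2, 1], 3))  # 1.0
```

Unlike raw accuracy, QWK rewards a machine score that is off by one level more than one that is off by two, which suits ordinal rubric scores such as those used here.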
This study explored automated scoring of open-ended situational judgment tests with rigorous psychometric methods, focusing on the assessment of teacher competency traits to establish validity. Fine-grained scoring guidelines were formulated, and state-of-the-art NLP techniques were used for text feature recognition and classification. The main findings are as follows: (1) open-ended SJTs can establish precise scoring criteria grounded in key behavioral response elements; (2) sentence-level text classification outperforms document-level classification, with the CNN achieving the highest accuracy in response categorization; and (3) the scoring model delivers stable performance and agrees closely with human scoring, suggesting that it can partially replace manual scoring.
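The sentence-level CNN pipeline described above (embedding lookup, convolution over word windows, and max-over-time pooling into a fixed-length feature vector for classification) can be sketched in plain Python. This is a toy, untrained illustration under our own assumptions: the vocabulary, dimensions, and random filter weights are invented, whereas the study's actual models were trained deep networks over real response texts.

```python
import math
import random

random.seed(0)

VOCAB = {"teacher": 0, "student": 1, "calm": 2, "talk": 3, "<unk>": 4}
EMB_DIM, WIN, N_FILTERS = 8, 3, 4  # toy sizes, assumptions only

# random embedding table and convolution filters (untrained, for illustration)
emb = [[random.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in VOCAB]
filters = [[random.uniform(-1, 1) for _ in range(WIN * EMB_DIM)]
           for _ in range(N_FILTERS)]

def cnn_features(tokens):
    """Max-over-time pooled features from a 1-D convolution over word windows."""
    vecs = [emb[VOCAB.get(t, VOCAB["<unk>"])] for t in tokens]
    while len(vecs) < WIN:          # pad short sentences so one window exists
        vecs.append([0.0] * EMB_DIM)
    feats = []
    for f in filters:
        best = -math.inf
        for start in range(len(vecs) - WIN + 1):
            # flatten the window of WIN word vectors and apply the filter
            window = [x for v in vecs[start:start + WIN] for x in v]
            act = math.tanh(sum(w * x for w, x in zip(f, window)))
            best = max(best, act)   # max pooling over positions
        feats.append(best)
    return feats  # fixed-length vector; a softmax layer would classify it

print(len(cnn_features(["teacher", "student", "calm", "talk"])))  # 4
```

Max-over-time pooling is what lets a CNN map variable-length responses to a fixed-length representation, which is one reason convolutional models suit short-answer classification of this kind.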

Key words: situational judgment tests, automated scoring, teacher competency, open-ended tests, machine learning