ISSN 1671-3710
CN 11-4766/R
Sponsor: Institute of Psychology, Chinese Academy of Sciences
Publisher: Science Press

Advances in Psychological Science ›› 2026, Vol. 34 ›› Issue (3): 404-423. doi: 10.3724/SP.J.1042.2026.0404 cstr: 32111.14.2026.0404

• Research Methods •

Empowering psychometrics with generative large language models: Advantages, challenges, and applications

TIAN Xuetao1†, ZHOU Wenjie2†, LUO Fang1, QIAO Zhihong1, FENG Yi3

  1. Faculty of Psychology, Beijing Normal University, Beijing 100875, China;
  2. Berkeley School of Education, University of California, Berkeley 94720, US;
  3. Mental Health Center, Central University of Finance and Economics, Beijing 100081, China
  • Received: 2024-12-23 Publication date: 2026-03-15 Release date: 2026-01-07
  • About the authors: †TIAN Xuetao and ZHOU Wenjie are co-first authors of this article.
  • Funding:
    Supported by the Beijing Education Science Planning Youth Special Project (CCFA24122)

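Summary: Generative large language models (LLMs) are artificial intelligence models pre-trained on large-scale corpora, and they bring unprecedented opportunities and challenges to psychometrics. By tracing the development of research at the intersection of artificial intelligence and psychology, this paper summarizes the notable advantages LLMs offer psychometrics, identifies the key challenges of applying LLMs in psychology, and proposes directions for LLM-based psychometric research. Specifically, LLMs can generate coherent natural-language text from context, which gives them the potential to change how traditional tests interact with test-takers; LLMs have made breakthroughs in processing very long texts and multimodal data, and their strong content comprehension allows psychological information about test-takers to be captured and analyzed comprehensively; and LLMs support real-time analysis and personalized feedback, promoting a shift from outcome-based to process-based evaluation. Although practical applications of LLMs face challenges of stability, creativity, and extensibility, they show broad application prospects and research value in situational judgment test generation, collaborative problem-solving assessment, intelligent mental health diagnosis and treatment, and item quality analysis.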

Key words: generative large language models, psychometrics, artificial intelligence, automated assessment, interactive testing

Abstract: Generative Large Language Models (LLMs), pre-trained on vast corpora, are introducing a paradigm shift in psychometrics, moving beyond the capabilities of previous artificial intelligence applications. While earlier machine learning methods enhanced psychometrics through automated item generation and improved measurement models, they were often constrained by the need for large, high-quality labeled datasets and suffered from poor generalization. This paper argues that LLMs offer transformative potential by providing innovative solutions for test interaction, content comprehension, and evaluation methodologies. It systematically outlines the core advantages and pressing challenges of integrating LLMs and proposes four key application areas where they can drive significant progress: situational judgment test generation, collaborative problem-solving assessment, intelligent mental health diagnostics, and automated item quality analysis.
A primary innovation offered by LLMs is the fundamental transformation of the test-taker interaction model. Traditional psychometric assessments have evolved from static paper-and-pencil formats to more dynamic computerized tests. However, LLMs enable a shift towards truly natural and free-form conversational interactions. This allows for the capture of much richer psychological information embedded in natural language, including semantics, tone, and linguistic structure, which are inaccessible through button clicks or fixed-choice responses. Furthermore, by leveraging agent-based simulations, LLMs can create dynamic and adaptive assessment environments. These agents can play various social roles, actively engaging with test-takers to elicit and observe complex psychological traits in ecologically valid contexts, moving assessment from a rigid procedure to an interactive, responsive experience.
This enhanced interaction is powered by LLMs' breakthrough capabilities in content comprehension. Technologically, this represents a leap from traditional natural language processing techniques (e.g., Bag-of-Words, Word2Vec) to models that possess a deep, contextual understanding of language. The massive context windows of modern LLMs (e.g., 128k tokens) allow for the holistic analysis of extremely long texts, such as complete interview transcripts or extensive open-ended responses, without losing semantic coherence. This is crucial for process-oriented evaluation. Another significant advance is in multimodal data understanding. Instead of analyzing text, audio, and visual data in silos, multimodal LLMs map these different data types into a shared semantic vector space. This enables the deep fusion and synergistic analysis of verbal content, vocal tone, facial expressions, and body language, facilitating a more comprehensive and nuanced assessment of an individual's psychological state.
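To make the limitation of Bag-of-Words concrete, the following minimal sketch shows that a pure word-count representation cannot tell apart two responses that use the same words in a different order. The tokenizer and the two example sentences are illustrative assumptions, not material from the paper.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two hypothetical open-ended responses that describe opposite events.
response_a = "the partner helped the candidate"
response_b = "the candidate helped the partner"

# Both responses contain exactly the same words, so their bags are identical
# even though the meaning differs.
same_bag = bag_of_words(response_a) == bag_of_words(response_b)
print(same_bag)
```

A contextual model encodes each word in light of its neighbours, which is the property the paragraph above attributes to LLMs; this sketch only demonstrates what the older representation loses.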
These advancements directly impact scoring and evaluation, enabling a transition from static, outcome-based assessment to dynamic, process-oriented evaluation. In automated scoring, LLMs' superior semantic understanding allows for more accurate and consistent grading of complex, open-ended responses compared to earlier models. More importantly, LLMs facilitate a continuous feedback loop that transforms assessment into a developmental tool. By analyzing process data in real-time, an LLM-powered system can provide instant, personalized feedback, adjust item difficulty dynamically, and guide the test-taker. This creates a “measurement-evaluation-feedback-development” cycle, where the assessment not only measures a trait but also contributes to the individual's growth.
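One way to picture dynamic adjustment of item difficulty is a simple adaptive loop. The sketch below assumes a Rasch (one-parameter logistic) model and a fixed-step ability update; the abstract does not name a specific model, so the model choice, the item bank, and all function names here are hypothetical.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Rasch probability that a person with ability theta answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def next_item(theta: float, difficulties: dict, used: set) -> str:
    """Pick the unused item whose difficulty is closest to the current ability estimate."""
    candidates = {name: b for name, b in difficulties.items() if name not in used}
    return min(candidates, key=lambda name: abs(candidates[name] - theta))


def update_theta(theta: float, b: float, correct: bool, step: float = 0.5) -> float:
    """Take one step along the gradient of the Rasch log-likelihood (observed minus expected)."""
    observed = 1.0 if correct else 0.0
    return theta + step * (observed - p_correct(theta, b))


# Hypothetical item bank (assumption): item name -> difficulty.
item_bank = {"easy": -1.0, "medium": 0.0, "hard": 1.0}

theta = 0.0
used = set()
first = next_item(theta, item_bank, used)
used.add(first)
theta = update_theta(theta, item_bank[first], correct=True)
second = next_item(theta, item_bank, used)
print(first, round(theta, 2), second)
```

After a correct answer the ability estimate rises, so the next item selected is a harder one. An operational system would use a proper ability estimator and stopping rule; the fixed step is only for illustration.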
Despite this potential, the paper identifies critical challenges that must be addressed for responsible implementation. The stability of LLMs is a primary concern; their outputs can be inconsistent, they may suffer from context loss in long dialogues, and they are prone to “hallucinations” or factual errors. Furthermore, the “silent updates” of closed-source models pose a threat to measurement invariance in longitudinal studies. The creativity of LLMs is also limited, as they primarily recombine existing data patterns and may struggle to generate truly novel ideas or psychological constructs. Scalability and extensibility challenges include the models' difficulty in adapting to new psychological constructs, their still-developing ability to deeply integrate multimodal data, and inherent cultural biases from training data that limit cross-cultural applicability. Finally, significant ethical issues regarding data privacy, algorithmic bias, and the high computational cost must be carefully managed.
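The stability concern can be quantified by scoring the same responses several times and checking how often the runs agree. The following sketch assumes the scores have already been collected from repeated runs; the agreement index and the score matrix are illustrative assumptions, not a metric proposed in the paper.

```python
from collections import Counter


def modal_agreement(scores_per_run: list) -> float:
    """Mean share of runs that match the most common score for each response."""
    n_runs = len(scores_per_run)
    # Regroup the data so that each tuple holds all runs' scores for one response.
    per_response = zip(*scores_per_run)
    shares = [Counter(scores).most_common(1)[0][1] / n_runs for scores in per_response]
    return sum(shares) / len(shares)


# Hypothetical scores (assumption): 3 repeated scoring runs over the same 4 responses.
runs = [
    [2, 3, 1, 4],
    [2, 3, 2, 4],
    [2, 2, 1, 4],
]

stability = modal_agreement(runs)
print(round(stability, 3))
```

A value of 1.0 would mean every run gave every response the same score. Repeating the check after a model update is one way a study might detect the effect of a silent update on its scoring.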
Looking forward, the paper highlights four promising applications. First, in Situational Judgment Test generation, LLMs can create a vast number of realistic scenarios and behaviorally distinct response options, mitigating item exposure and reducing reliance on expert time. Second, in collaborative problem-solving assessment, LLMs can act as standardized, yet interactive, partners, allowing for the reliable measurement of communication and teamwork skills in a controlled but realistic setting. Third, for intelligent mental health diagnostics, LLMs can function as automated conversational agents that conduct structured clinical interviews, creating a safe space for disclosure and enabling continuous, dynamic assessment beyond static questionnaires. Last, for test item quality analysis, LLMs can simulate both domain experts and diverse test-taker populations to provide initial evaluations of item clarity, difficulty, and potential bias, streamlining the test development process.
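For the item quality application, responses from simulated test-takers can be summarized with classical item statistics before any human pilot. This sketch assumes a 0/1 response matrix has already been produced by some simulation; the matrix, the function names, and the flagging thresholds are hypothetical and chosen only for illustration.

```python
def item_difficulty(responses: list) -> list:
    """Classical item difficulty: proportion of respondents answering each item correctly."""
    n_respondents = len(responses)
    return [sum(column) / n_respondents for column in zip(*responses)]


def flag_items(p_values: list, low: float = 0.2, high: float = 0.9) -> list:
    """Return indices of items whose proportion correct falls outside [low, high]."""
    return [i for i, p in enumerate(p_values) if p < low or p > high]


# Hypothetical responses (assumption): 5 simulated test-takers x 3 items, 1 = correct.
simulated = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 1],
]

p_values = item_difficulty(simulated)
flagged = flag_items(p_values)
print(p_values, flagged)
```

Here the first item is answered correctly by every simulated respondent, so it is flagged as too easy under the assumed thresholds. Such a screen would only be an initial evaluation, as the paragraph above notes, and would not replace calibration on real test-takers.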

Key words: large language models, psychometrics, artificial intelligence, automated assessment, interactive testing