Empowering psychometrics with generative large language models: Advantages, challenges, and applications
TIAN Xuetao, ZHOU Wenjie, LUO Fang, QIAO Zhihong, FENG Yi
2026, 34(3): 404-423. doi: 10.3724/SP.J.1042.2026.0404
Generative Large Language Models (LLMs), pre-trained on vast corpora, are introducing a paradigm shift in psychometrics, moving beyond the capabilities of previous artificial intelligence applications. While earlier machine learning methods enhanced psychometrics through automated item generation and improved measurement models, they were often constrained by the need for large, high-quality labeled datasets and suffered from poor generalization. This paper argues that LLMs offer transformative potential by providing innovative solutions for test interaction, content comprehension, and evaluation methodologies. It systematically outlines the core advantages and pressing challenges of integrating LLMs and proposes four key application areas where they can drive significant progress: situational judgment test generation, collaborative problem-solving assessment, intelligent mental health diagnostics, and automated item quality analysis.

A primary innovation offered by LLMs is the fundamental transformation of the test-taker interaction model. Traditional psychometric assessments have evolved from static paper-and-pencil formats to more dynamic computerized tests. However, LLMs enable a shift towards truly natural and free-form conversational interactions. This allows for the capture of much richer psychological information embedded in natural language, including semantics, tone, and linguistic structure, which are inaccessible through button clicks or fixed-choice responses. Furthermore, by leveraging agent-based simulations, LLMs can create dynamic and adaptive assessment environments. These agents can play various social roles, actively engaging with test-takers to elicit and observe complex psychological traits in ecologically valid contexts, moving assessment from a rigid procedure to an interactive, responsive experience.

This enhanced interaction is powered by LLMs' breakthrough capabilities in content comprehension. Technologically, this represents a leap from traditional natural language processing techniques (e.g., Bag-of-Words, Word2Vec) to models that possess a deep, contextual understanding of language. The massive context windows of modern LLMs (e.g., 128k tokens) allow for the holistic analysis of extremely long texts, such as complete interview transcripts or extensive open-ended responses, without losing semantic coherence. This is crucial for process-oriented evaluation. Another significant advance is in multimodal data understanding. Instead of analyzing text, audio, and visual data in silos, multimodal LLMs map these different data types into a shared semantic vector space. This enables the deep fusion and synergistic analysis of verbal content, vocal tone, facial expressions, and body language, facilitating a more comprehensive and nuanced assessment of an individual's psychological state.

These advancements directly impact scoring and evaluation, enabling a transition from static, outcome-based assessment to dynamic, process-oriented evaluation. In automated scoring, LLMs' superior semantic understanding allows for more accurate and consistent grading of complex, open-ended responses compared to earlier models. More importantly, LLMs facilitate a continuous feedback loop that transforms assessment into a developmental tool. By analyzing process data in real time, an LLM-powered system can provide instant, personalized feedback, adjust item difficulty dynamically, and guide the test-taker.
This creates a “measurement-evaluation-feedback-development” cycle, in which the assessment not only measures a trait but also contributes to the individual's growth (a minimal code sketch of such a loop follows this abstract).

Despite this potential, the paper identifies critical challenges that must be addressed for responsible implementation. The stability of LLMs is a primary concern: their outputs can be inconsistent, they may suffer from context loss in long dialogues, and they are prone to “hallucinations” or factual errors. Furthermore, the “silent updates” of closed-source models threaten measurement invariance in longitudinal studies. The creativity of LLMs is also limited, as they primarily recombine existing data patterns and may struggle to generate truly novel ideas or psychological constructs. Scalability and extensibility challenges include the models' difficulty in adapting to new psychological constructs, their still-developing ability to deeply integrate multimodal data, and inherent cultural biases from training data that limit cross-cultural applicability. Finally, significant ethical issues regarding data privacy, algorithmic bias, and high computational costs must be carefully managed.

Looking forward, the paper highlights four promising applications. First, in situational judgment test generation, LLMs can create a vast number of realistic scenarios and behaviorally distinct response options, mitigating item exposure and reducing reliance on expert time. Second, in collaborative problem-solving assessment, LLMs can act as standardized yet interactive partners, allowing for the reliable measurement of communication and teamwork skills in a controlled but realistic setting. Third, for intelligent mental health diagnostics, LLMs can function as automated conversational agents that conduct structured clinical interviews, creating a safe space for disclosure and enabling continuous, dynamic assessment beyond static questionnaires. Fourth, for test item quality analysis, LLMs can simulate both domain experts and diverse test-taker populations to provide initial evaluations of item clarity, difficulty, and potential bias, streamlining the test development process.
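To make the feedback loop described above concrete, the following is a minimal illustrative sketch, not the authors' implementation: an LLM rates each open-ended response against a simple rubric, and the difficulty of the next item is adjusted from that score. The item bank, rubric wording, model name, and use of the OpenAI Python client are assumptions introduced here for illustration only.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical item bank, ordered by difficulty (1 = easiest).
ITEM_BANK = {
    1: "Describe a time you resolved a small disagreement with a colleague.",
    2: "How would you mediate a conflict between two team members with opposing goals?",
    3: "Design a plan to rebuild trust in a team after a serious project failure.",
}


def score_response(item: str, response: str) -> dict:
    """Ask the model to rate an open-ended response on a 0-1 rubric and give brief feedback."""
    prompt = (
        "You are a psychometric rater. Score the response to the item below on a 0-1 scale "
        "for quality of reasoning, and give one sentence of formative feedback. "
        'Reply as JSON: {"score": <float>, "feedback": "<string>"}\n'
        f"Item: {item}\nResponse: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)


def run_adaptive_session(get_answer, max_items: int = 3) -> list:
    """Administer items one at a time, adjusting difficulty from each LLM-scored response."""
    difficulty, log = 1, []
    for _ in range(max_items):
        item = ITEM_BANK[difficulty]
        result = score_response(item, get_answer(item))
        log.append({"item": item, **result})
        # Crude up/down rule standing in for a proper ability estimate (e.g., an IRT update).
        if result["score"] >= 0.7:
            difficulty = min(difficulty + 1, max(ITEM_BANK))
        elif result["score"] < 0.4:
            difficulty = max(difficulty - 1, 1)
    return log

A console session could call run_adaptive_session(input) to collect typed answers interactively; in a real system the up/down rule would be replaced by a formal ability-estimation model and the LLM's rubric scores would be validated against human raters.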
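The item quality analysis mentioned in the fourth application could be prototyped in a similar spirit by prompting an LLM to role-play diverse respondents and flag clarity, difficulty, or bias issues before human pretesting. The personas, prompt wording, and model below are hypothetical placeholders, sketched under the same assumptions as above.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical respondent personas for simulated item pretesting.
PERSONAS = [
    "a 19-year-old first-year undergraduate who reads English as a second language",
    "a 45-year-old nurse who works night shifts",
    "a 70-year-old retired engineer from a rural area",
]


def review_item(item: str) -> list[str]:
    """Collect persona-conditioned reviews flagging clarity, difficulty, and possible bias."""
    reviews = []
    for persona in PERSONAS:
        prompt = (
            f"Role-play as {persona}. Read the questionnaire item below, answer it briefly, "
            "then note anything you found unclear, overly difficult, or potentially biased "
            "against people like you.\n"
            f"Item: {item}"
        )
        completion = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        reviews.append(completion.choices[0].message.content)
    return reviews


if __name__ == "__main__":
    for review in review_item("I often feel blue when plans fall through."):
        print(review, "\n---")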