ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

心理科学进展 (Advances in Psychological Science) ›› 2025, Vol. 33 ›› Issue (10): 1783-1793. doi: 10.3724/SP.J.1042.2025.1783 cstr: 32111.14.2025.1783

• Research Frontiers •

Empathy simulation in large language models: Evaluation, enhancement, and challenges

ZHOU Qianyi1,#, CAI Yaqi1,#, ZHANG Ya1,2

  1 Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
    2 Key Laboratory of Philosophy and Social Science of Anhui Province on Adolescent Mental Health and Crisis Intelligence Intervention, Hefei Normal University, Hefei 230601, China
  • Received: 2025-03-09 Online: 2025-10-15 Published: 2025-08-18
  • Corresponding author: ZHANG Ya, E-mail: yzhang@psy.ecnu.edu.cn
  • Author note:

    # ZHOU Qianyi and CAI Yaqi contributed equally to this work as co-first authors

  • Funding:
    Major Project of the Open Fund of the Key Laboratory of Philosophy and Social Science of Anhui Province on Adolescent Mental Health and Crisis Intelligence Intervention (SYS2024XXX)

Empathy in large language models: Evaluation, enhancement, and challenges

ZHOU Qianyi1,#, CAI Yaqi1,#, ZHANG Ya1,2

  1 Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
    2 Key Laboratory of Philosophy and Social Science of Anhui Province on Adolescent Mental Health and Crisis Intelligence Intervention, Hefei Normal University, Hefei 230601, China
  • Received: 2025-03-09 Online: 2025-10-15 Published: 2025-08-18

摘要 (Abstract):

With advances in natural language generation and affective computing, the capacity of large language models (LLMs) to simulate empathy has attracted wide attention in domains such as psychological counseling, physician-patient communication, and customer service. Empathy simulation in LLMs is primarily cognitive rather than affective, and mainly involves emotion recognition, empathetic responding, and contextual adaptation. Current approaches to evaluating LLMs' empathy simulation include human evaluation, automated evaluation, and task-driven evaluation; the three differ in their strengths, weaknesses, and suitable application scenarios. Compared with human empathy, LLMs perform well on empathy generation tasks but remain limited in emotional understanding. Their empathy simulation can be further improved through data augmentation, optimization of model frameworks and architectures, reinforcement learning, and prompt optimization. At the same time, the ethical norms and potential risks involved in deploying these models still require attention.

关键词 (Keywords): large language models, empathy simulation, empathy evaluation, ethical issues

Abstract:

Amid the rapid evolution of artificial intelligence technologies, the application scope of large language models (LLMs) has extended beyond traditional information processing tasks to novel domains involving the simulation of complex human emotions and interactions. Particularly in emotion-intensive contexts such as psychological counseling, physician-patient communication, and customer service, the capacity of LLMs for empathy simulation has emerged as a focal point in academic research and demonstrates substantial potential for real-world application. However, fundamental questions remain: What are the essential differences between LLM-simulated empathy and human empathy? How can we evaluate such capabilities in a scientific and comprehensive manner? What is the current state of development, and what are the core bottlenecks? More critically, how can LLMs’ empathetic performance be effectively enhanced while addressing the associated ethical risks?

While existing studies have explored some of these issues, a systematic integrative framework is still lacking. Therefore, this study conducts a comprehensive analysis of empathy simulation in LLMs across four key dimensions: evaluation methods, current development status, enhancement strategies, and critical challenges. The goal is to provide a theoretical foundation and directional guidance for future research and practical deployment in this domain.

Currently, the evaluation of LLMs’ empathy simulation can be categorized into three main approaches: human-based, automated, and task-driven. Human evaluation relies on subjective ratings or comparative judgments made by human annotators or domain experts, and excels at capturing nuanced emotional perceptions and context-dependent subtleties. However, it suffers from high subjectivity and cost. Automated evaluation employs computational techniques such as sentiment classification and cosine similarity for objective quantification, offering efficiency and reproducibility suitable for large-scale testing. Nonetheless, it often fails to account for contextual or subtle emotional variations and cannot adequately assess the naturalness or perceived empathy of language. Task-driven evaluation involves designing specific tasks such as emotion cause recognition (e.g., RECCON) or leveraging psychological empathy paradigms and standardized scales (e.g., IRI, BES) to assess model performance. This method aligns more closely with real-world applications and yields quantifiable metrics, though its generalizability is constrained by the specific design of tasks. This study compares the strengths and limitations of the three approaches and highlights the lack of a unified evaluation framework. It emphasizes the urgent need to develop a standardized and integrated assessment system, particularly one that incorporates psychological measurement paradigms to probe deeper empathetic response mechanisms within LLMs.
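To make the automated-evaluation approach concrete, the sketch below scores a candidate response by the cosine similarity between its sentence embedding and that of a reference empathetic response. It is a minimal illustration under stated assumptions: the sentence-transformers checkpoint and the use of embedding similarity as a proxy for empathy are choices made for demonstration, not methods prescribed by the studies reviewed here.

import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative only: any sentence encoder could be substituted here.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def empathy_similarity(candidate: str, reference: str) -> float:
    """Cosine similarity between the embeddings of a candidate response
    and a reference empathetic response (higher = closer to the reference)."""
    emb = encoder.encode([candidate, reference])
    a, b = emb[0], emb[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: compare a model reply against a human-written reference reply.
reference = "I'm so sorry you're going through this; that sounds exhausting."
candidate = "That sounds really hard, and it makes sense that you feel worn out."
print(f"cosine similarity = {empathy_similarity(candidate, reference):.3f}")

As the surrounding discussion notes, such scores are efficient and reproducible but say little about perceived warmth or contextual appropriateness, which is why they are usually paired with human or task-driven evaluation.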

Recent studies using these varied evaluation methods suggest that LLMs can generate empathetic responses comparable to, or even surpassing, human outputs in certain scenarios, providing effective emotional support. However, there is still considerable room for improvement on specific empathy-related subtasks and on standardized empathy scales, especially when models confront complex or mixed emotions. To enhance empathy simulation in LLMs, four key strategies are proposed: data augmentation, architectural and framework optimization, reinforcement learning, and prompt engineering. Data augmentation involves constructing larger, higher-quality, and culturally diverse empathetic dialogue datasets for fine-tuning. Architectural and framework optimization refers to developing novel model structures that dynamically capture emotional and personality traits (e.g., Pecer), or hybrid frameworks that combine expert models and chain-of-thought reasoning (e.g., HEF, EBG) to improve the understanding of fine-grained emotions. Reinforcement learning integrates feedback from humans or other AI agents (e.g., the Muffin framework) to reward high-quality responses and guide the generation of empathy-aligned outputs. Prompt engineering embeds psychological theories, such as cognitive behavioral therapy, into prompt design, guiding the model to conduct deeper emotional reasoning and generate contextually appropriate responses at the input level.
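As a minimal sketch of the prompt-engineering strategy, the snippet below embeds a CBT-style reasoning scaffold in the system prompt so that the model identifies the user's situation, emotion, and underlying thought before replying. The scaffold wording and the helper function are hypothetical, intended only to illustrate the idea; they are not a template taken from the studies discussed above.

# Hypothetical CBT-informed scaffold; the exact wording is illustrative.
CBT_SCAFFOLD = (
    "You are an empathetic assistant. Before answering, reason step by step:\n"
    "1. Restate the user's situation in one sentence.\n"
    "2. Name the primary emotion(s) the user is likely feeling.\n"
    "3. Identify the thought or belief that may be driving that emotion.\n"
    "4. Write a warm, validating reply that reflects steps 1-3 without "
    "giving unsolicited advice.\n"
    "Return only the final reply."
)

def build_empathy_messages(user_message: str) -> list[dict]:
    """Assemble chat messages that any chat-style LLM API can consume."""
    return [
        {"role": "system", "content": CBT_SCAFFOLD},
        {"role": "user", "content": user_message},
    ]

# Example: these messages would be sent to whichever LLM is under evaluation.
messages = build_empathy_messages("I failed my exam and feel worthless.")

Because the scaffold operates purely at the input level, it can be combined with any of the other three strategies without retraining the underlying model.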

Nevertheless, LLMs face persistent and, in some cases, insurmountable technical limitations. Their empathy simulation is fundamentally rooted in large-scale statistical pattern matching, rather than genuine emotional experience or intrinsic motivation. As a result, their responses often appear formulaic and lack authenticity. Moreover, LLMs struggle with interpreting complex emotions, sarcasm, and culturally variable expressions of empathy. Their adaptability to diverse cultural contexts and emotionally ambiguous situations remains limited. In addition, the use of LLMs inevitably raises ethical concerns, including the potential generation of harmful or discriminatory content, misuse for information manipulation, and the risk of users developing excessive emotional dependence on AI systems—potentially undermining real-life social interactions.

In conclusion, this study presents a comprehensive investigation into LLMs’ empathy simulation from the perspectives of evaluation, current capabilities, enhancement strategies, and inherent challenges. It identifies key issues and delineates future research directions, offering a conceptual foundation for advancing both academic inquiry and practical implementation in this emerging field.

Key words: large language models, empathy simulation, empathy evaluation, ethical issues
