ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2025, Vol. 33 ›› Issue (10): 1783-1793. doi: 10.3724/SP.J.1042.2025.1783

• Regular Articles •

Empathy in large language models: Evaluation, enhancement, and challenges

ZHOU Qianyi1,#, CAI Yaqi1,#, ZHANG Ya1,2

1 Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
  2 Key Laboratory of Philosophy and Social Science of Anhui Province on Adolescent Mental Health and Crisis Intelligence Intervention, Hefei Normal University, Hefei 230601, China
• Received: 2025-03-09  Online: 2025-10-15  Published: 2025-08-18
• Contact: ZHANG Ya  E-mail: yzhang@psy.ecnu.edu.cn

Abstract:

Amid the rapid evolution of artificial intelligence technologies, the application scope of large language models (LLMs) has extended beyond traditional information processing tasks to novel domains involving the simulation of complex human emotions and interactions. Particularly in emotion-intensive contexts such as psychological counseling, physician-patient communication, and customer service, the capacity of LLMs for empathy simulation has emerged as a focal point in academic research and demonstrates substantial potential for real-world application. However, fundamental questions remain: What are the essential differences between LLM-simulated empathy and human empathy? How can we evaluate such capabilities in a scientific and comprehensive manner? What is the current state of development, and what are the core bottlenecks? More critically, how can LLMs’ empathetic performance be effectively enhanced while addressing the associated ethical risks?

While existing studies have explored some of these issues, a systematic integrative framework is still lacking. Therefore, this study conducts a comprehensive analysis of empathy simulation in LLMs across four key dimensions: evaluation methods, current development status, enhancement strategies, and critical challenges. The goal is to provide a theoretical foundation and directional guidance for future research and practical deployment in this domain.

Currently, the evaluation of LLMs’ empathy simulation can be categorized into three main approaches: human-based, automated, and task-driven. Human evaluation relies on subjective ratings or comparative judgments made by human annotators or domain experts, and excels at capturing nuanced emotional perceptions and context-dependent subtleties; however, it suffers from high subjectivity and cost. Automated evaluation employs computational techniques such as sentiment classification and cosine similarity for objective quantification, offering the efficiency and reproducibility needed for large-scale testing. Nonetheless, it often fails to account for contextual or subtle emotional variations and cannot adequately assess the naturalness or perceived empathy of language. Task-driven evaluation involves designing specific tasks such as emotion-cause recognition (e.g., the RECCON dataset) or leveraging psychological empathy paradigms and standardized scales (e.g., the Interpersonal Reactivity Index, IRI; the Basic Empathy Scale, BES) to assess model performance. This approach aligns more closely with real-world applications and yields quantifiable metrics, though its generalizability is constrained by the specific design of each task. This study compares the strengths and limitations of the three approaches and highlights the lack of a unified evaluation framework. It emphasizes the urgent need to develop a standardized, integrated assessment system, particularly one that incorporates psychological measurement paradigms to probe the deeper empathetic response mechanisms within LLMs.
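As a concrete illustration of the automated approach described above, the following minimal sketch scores a candidate model reply against a reference empathetic reply via embedding cosine similarity. The sentence-transformers encoder name and the example sentences are our own illustrative assumptions, not elements of any specific study reviewed here.

```python
# Minimal sketch of automated empathy evaluation via embedding cosine
# similarity. The encoder ("all-MiniLM-L6-v2") and the example replies
# are illustrative assumptions, not choices prescribed by the literature.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def empathy_similarity(model_response: str, reference_response: str) -> float:
    """Cosine similarity between the embeddings of two replies."""
    vectors = encoder.encode([model_response, reference_response])
    a, b = vectors[0], vectors[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = empathy_similarity(
    "That sounds exhausting. It makes sense that you feel overwhelmed.",
    "I'm sorry you're so worn out; anyone in your situation would feel overwhelmed.",
)
print(f"similarity = {score:.3f}")  # higher = closer to the reference reply
```

As the paragraph above notes, such scores are efficient and reproducible at scale, but a high similarity value does not guarantee that a human reader would perceive the reply as natural or genuinely empathetic.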

Recent studies using these varied evaluation methods suggest that LLMs can generate empathetic responses comparable to, or in certain scenarios even surpassing, human outputs, providing effective emotional support. However, considerable room for improvement remains in specific empathy-related subtasks and in performance on standardized empathy scales, especially when models confront complex or mixed emotions. To enhance empathy simulation in LLMs, four key strategies are proposed: data augmentation, architectural and framework optimization, reinforcement learning, and prompt engineering. Data augmentation involves constructing larger, higher-quality, and culturally diverse empathetic dialogue datasets for fine-tuning. Architectural and framework optimization refers to the development of novel model structures capable of dynamically capturing emotional and personality traits (e.g., Pecer), or hybrid frameworks combining expert models and chain-of-thought reasoning (e.g., HEF, EBG) to improve the understanding of fine-grained emotions. Reinforcement learning integrates feedback from humans or other AI agents (e.g., the Muffin framework) to reward high-quality responses and guide the generation of empathy-aligned outputs. Prompt engineering embeds psychological theories, such as cognitive behavioral therapy, into prompt design, steering the model at the input level toward deeper emotional reasoning and contextually appropriate responses, as sketched below.
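To make the prompt-engineering strategy concrete, the sketch below builds a template that embeds a CBT-style reasoning scaffold (name the emotion, infer the underlying thought, validate before responding) ahead of the model's answer. The template wording and the helper function are our own assumptions for illustration; the studies reviewed each use their own prompt designs.

```python
# Illustrative sketch of theory-informed prompt engineering: a template
# that embeds a CBT-style reasoning scaffold before the model replies.
# The wording below is an assumption for illustration, not a prompt
# taken from any paper discussed in this review.
CBT_EMPATHY_PROMPT = """You are an empathetic listener informed by
cognitive behavioral therapy.

Before replying, reason through these steps:
1. Name the emotion(s) the speaker is expressing.
2. Infer the thought or situation behind that emotion.
3. Write a reply that validates the emotion first, then gently
   reflects the underlying thought back to the speaker.

Speaker: {utterance}
Reply:"""

def build_prompt(utterance: str) -> str:
    """Fill the template with the user's utterance for any chat LLM."""
    return CBT_EMPATHY_PROMPT.format(utterance=utterance)

print(build_prompt("I failed my exam again. I'm just not smart enough."))
```

The design choice here is to move the empathy scaffolding entirely to the input side: no fine-tuning is required, which is what makes prompt engineering the lightest-weight of the four enhancement strategies listed above.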

Nevertheless, LLMs face persistent and, in some cases, insurmountable technical limitations. Their empathy simulation is fundamentally rooted in large-scale statistical pattern matching, rather than genuine emotional experience or intrinsic motivation. As a result, their responses often appear formulaic and lack authenticity. Moreover, LLMs struggle with interpreting complex emotions, sarcasm, and culturally variable expressions of empathy. Their adaptability to diverse cultural contexts and emotionally ambiguous situations remains limited. In addition, the use of LLMs inevitably raises ethical concerns, including the potential generation of harmful or discriminatory content, misuse for information manipulation, and the risk of users developing excessive emotional dependence on AI systems—potentially undermining real-life social interactions.

In conclusion, this study presents a comprehensive investigation into LLMs’ empathy simulation from the perspectives of evaluation, current capabilities, enhancement strategies, and inherent challenges. It identifies key issues and delineates future research directions, offering a conceptual foundation for advancing both academic inquiry and practical implementation in this emerging field.

Key words: large language models, empathy simulation, empathy evaluation, ethical issues
