ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2025, Vol. 33 ›› Issue (10): 1783-1793. doi: 10.3724/SP.J.1042.2025.1783

• Regular Articles •

Empathy in large language models: Evaluation, enhancement, and challenges

ZHOU Qianyi1,#, CAI Yaqi1,#, ZHANG Ya1,2

1 Shanghai Key Laboratory of Mental Health and Psychological Crisis Intervention, School of Psychology and Cognitive Science, East China Normal University, Shanghai 200062, China
  2 Key Laboratory of Philosophy and Social Science of Anhui Province on Adolescent Mental Health and Crisis Intelligence Intervention, Hefei Normal University, Hefei 230601, China
• Received: 2025-03-09  Online: 2025-10-15  Published: 2025-08-18
• Contact: ZHANG Ya  E-mail: yzhang@psy.ecnu.edu.cn

Abstract:

Amid the rapid evolution of artificial intelligence technologies, the application scope of large language models (LLMs) has extended beyond traditional information processing tasks to novel domains involving the simulation of complex human emotions and interactions. Particularly in emotion-intensive contexts such as psychological counseling, physician-patient communication, and customer service, the capacity of LLMs for empathy simulation has emerged as a focal point in academic research and demonstrates substantial potential for real-world application. However, fundamental questions remain: What are the essential differences between LLM-simulated empathy and human empathy? How can we evaluate such capabilities in a scientific and comprehensive manner? What is the current state of development, and what are the core bottlenecks? More critically, how can LLMs’ empathetic performance be effectively enhanced while addressing the associated ethical risks?

While existing studies have explored some of these issues, a systematic integrative framework is still lacking. Therefore, this study conducts a comprehensive analysis of empathy simulation in LLMs across four key dimensions: evaluation methods, current development status, enhancement strategies, and critical challenges. The goal is to provide a theoretical foundation and directional guidance for future research and practical deployment in this domain.

Currently, the evaluation of LLMs’ empathy simulation can be categorized into three main approaches: human-based, automated, and task-driven. Human evaluation relies on subjective ratings or comparative judgments made by human annotators or domain experts, and excels at capturing nuanced emotional perceptions and context-dependent subtleties; however, it suffers from high subjectivity and cost. Automated evaluation employs computational techniques such as sentiment classification and cosine similarity for objective quantification, offering the efficiency and reproducibility needed for large-scale testing. Nonetheless, it often fails to account for contextual or subtle emotional variations and cannot adequately assess the naturalness or perceived empathy of language. Task-driven evaluation involves designing specific tasks such as emotion-cause recognition (e.g., the RECCON dataset) or leveraging psychological empathy paradigms and standardized scales (e.g., the Interpersonal Reactivity Index, IRI; the Basic Empathy Scale, BES) to assess model performance. This approach aligns more closely with real-world applications and yields quantifiable metrics, though its generalizability is constrained by the specific design of each task. This study compares the strengths and limitations of the three approaches and highlights the lack of a unified evaluation framework. It emphasizes the urgent need to develop a standardized, integrated assessment system, particularly one that incorporates psychological measurement paradigms to probe the deeper empathetic response mechanisms within LLMs.
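As a concrete illustration of the automated approach described above, the following minimal sketch scores a candidate model reply against a reference empathetic reply via embedding cosine similarity. The sentence-transformers encoder name and the example sentences are our own illustrative assumptions, not elements of any specific study reviewed here.

```python
# Minimal sketch of automated empathy evaluation via embedding cosine
# similarity. The encoder ("all-MiniLM-L6-v2") and the example replies
# are illustrative assumptions, not choices prescribed by the literature.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def empathy_similarity(model_response: str, reference_response: str) -> float:
    """Cosine similarity between the embeddings of two replies."""
    vectors = encoder.encode([model_response, reference_response])
    a, b = vectors[0], vectors[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

score = empathy_similarity(
    "That sounds exhausting. It makes sense that you feel overwhelmed.",
    "I'm sorry you're so worn out; anyone in your situation would feel overwhelmed.",
)
print(f"similarity = {score:.3f}")  # higher = closer to the reference reply
```

As the paragraph above notes, such scores are efficient and reproducible at scale, but a high similarity value does not guarantee that a human reader would perceive the reply as natural or genuinely empathetic.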

Recent studies using these varied evaluation methods suggest that LLMs can generate empathetic responses comparable to, or in certain scenarios even surpassing, human outputs, providing effective emotional support. However, considerable room for improvement remains in specific empathy-related subtasks and in performance on standardized empathy scales, especially when models confront complex or mixed emotions. To enhance empathy simulation in LLMs, four key strategies are proposed: data augmentation, architectural and framework optimization, reinforcement learning, and prompt engineering. Data augmentation involves constructing larger, higher-quality, and culturally diverse empathetic dialogue datasets for fine-tuning. Architectural and framework optimization refers to the development of novel model structures capable of dynamically capturing emotional and personality traits (e.g., Pecer), or hybrid frameworks combining expert models and chain-of-thought reasoning (e.g., HEF, EBG) to improve the understanding of fine-grained emotions. Reinforcement learning integrates feedback from humans or other AI agents (e.g., the Muffin framework) to reward high-quality responses and guide the generation of empathy-aligned outputs. Prompt engineering embeds psychological theories, such as cognitive behavioral therapy, into prompt design, steering the model at the input level toward deeper emotional reasoning and contextually appropriate responses, as sketched below.
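To make the prompt-engineering strategy concrete, the sketch below builds a template that embeds a CBT-style reasoning scaffold (name the emotion, infer the underlying thought, validate before responding) ahead of the model's answer. The template wording and the helper function are our own assumptions for illustration; the studies reviewed each use their own prompt designs.

```python
# Illustrative sketch of theory-informed prompt engineering: a template
# that embeds a CBT-style reasoning scaffold before the model replies.
# The wording below is an assumption for illustration, not a prompt
# taken from any paper discussed in this review.
CBT_EMPATHY_PROMPT = """You are an empathetic listener informed by
cognitive behavioral therapy.

Before replying, reason through these steps:
1. Name the emotion(s) the speaker is expressing.
2. Infer the thought or situation behind that emotion.
3. Write a reply that validates the emotion first, then gently
   reflects the underlying thought back to the speaker.

Speaker: {utterance}
Reply:"""

def build_prompt(utterance: str) -> str:
    """Fill the template with the user's utterance for any chat LLM."""
    return CBT_EMPATHY_PROMPT.format(utterance=utterance)

print(build_prompt("I failed my exam again. I'm just not smart enough."))
```

The design choice here is to move the empathy scaffolding entirely to the input side: no fine-tuning is required, which is what makes prompt engineering the lightest-weight of the four enhancement strategies listed above.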

Nevertheless, LLMs face persistent and, in some cases, insurmountable technical limitations. Their empathy simulation is fundamentally rooted in large-scale statistical pattern matching, rather than genuine emotional experience or intrinsic motivation. As a result, their responses often appear formulaic and lack authenticity. Moreover, LLMs struggle with interpreting complex emotions, sarcasm, and culturally variable expressions of empathy. Their adaptability to diverse cultural contexts and emotionally ambiguous situations remains limited. In addition, the use of LLMs inevitably raises ethical concerns, including the potential generation of harmful or discriminatory content, misuse for information manipulation, and the risk of users developing excessive emotional dependence on AI systems—potentially undermining real-life social interactions.

In conclusion, this study presents a comprehensive investigation into LLMs’ empathy simulation from the perspectives of evaluation, current capabilities, enhancement strategies, and inherent challenges. It identifies key issues and delineates future research directions, offering a conceptual foundation for advancing both academic inquiry and practical implementation in this emerging field.

Key words: large language models, empathy simulation, empathy evaluation, ethical issues
