ISSN 0439-755X
CN 11-1911/B
Sponsored by: Chinese Psychological Society
              Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Acta Psychologica Sinica ›› 2025, Vol. 57 ›› Issue (11): 1988-2000. doi: 10.3724/SP.J.1041.2025.1988 cstr: 32110.14.2025.1988

• Special Issue on the Psychology and Governance of Artificial Intelligence •

Emotional capabilities evaluation of multimodal large language model in dynamic social interaction scenarios

ZHOU Zisen1, HUANG Qi1, TAN Zehong2, LIU Rui3, CAO Ziheng4, MU Fangman5, FAN Yachun2(), QIN Shaozheng1()

  1. State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China
    2. School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
    3. School of Business Administration, Inner Mongolia University of Finance and Economics, Hohhot, Inner Mongolia 010070, China
    4. Alibaba Group Holding Ltd, Hangzhou, Zhejiang 310020, China
    5. School of Mathematics and Computer Science, Chuxiong Normal University, Chuxiong, Yunnan 675000, China
  • Received: 2024-06-23 Online: 2025-09-24 Published: 2025-11-25
  • Corresponding authors: FAN Yachun, E-mail: fanyachun@bnu.edu.cn;
    QIN Shaozheng, E-mail: szqin@bnu.edu.cn
  • Supported by:
    National Natural Science Foundation of China Key Program (32130045); Inter-organizational Cooperation Project (32361163611)

Emotional capabilities evaluation of multimodal large language model in dynamic social interaction scenarios

ZHOU Zisen1, HUANG Qi1, TAN Zehong2, LIU Rui3, CAO Ziheng4, MU Fangman5, FAN Yachun2(), QIN Shaozheng1()   

  1. State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China
    2. School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
    3. School of Business Administration, Inner Mongolia University of Finance and Economics, Hohhot, Inner Mongolia 010070, China
    4. Alibaba Group Holding Ltd, Hangzhou, Zhejiang 310020, China
    5. School of Mathematics and Computer Science, Chuxiong Normal University, Chuxiong, Yunnan 675000, China
  • Received: 2024-06-23 Online: 2025-09-24 Published: 2025-11-25

Abstract:

Multimodal large language models (MLLMs) can process and integrate multimodal data such as images and text, providing a powerful tool for understanding human psychology and cognitive behavior. Combining classic paradigms from emotion psychology, this study compared the performance of two mainstream MLLMs and human participants in emotion recognition and emotion prediction under dynamic social interaction scenarios, in order to disentangle the distinct roles of the visual features of conversing characters (images) and the conversational content (text) in recognizing and predicting the characters' emotions. The results show that, given character images and conversational content, MLLMs have begun to exhibit preliminary emotion recognition and prediction abilities similar to those of human participants. A further comparison of MLLM performance under three conditions (character images only, conversational content only, and both combined) revealed that the visual features of conversing characters constrain MLLMs' recognition of basic emotions to some extent but effectively facilitate the recognition of compound emotions, while having no significant effect on emotion prediction. By comparing the two mainstream MLLMs and their different versions (GPT-4-vision/turbo vs. Claude-3-haiku), the study found that, compared with simply scaling up training data, innovation in the underlying technical framework is more important for improving MLLMs' emotion recognition and prediction abilities in social interaction. These findings are of scientific value for understanding the psychological mechanisms of emotion recognition and prediction in social interaction and for inspiring human-like affective computing and intelligent algorithms.

Key words: multimodal large language models, social interaction, emotion recognition, emotion prediction

Abstract:

Multimodal Large Language Models (MLLMs) can process and integrate multimodal data, such as images and text, providing a powerful tool for understanding human psychology and behavior. Drawing on classic experimental paradigms from emotion psychology, this study compares the emotion recognition and prediction abilities of human participants and two mainstream MLLMs in dynamic social interaction contexts, aiming to disentangle the distinct roles of the visual features of conversing characters (images) and the conversational content (text) in emotion recognition and prediction.

The results indicate that the emotion recognition and prediction performance of MLLMs, based on character images and conversational content, exhibits moderate or lower correlations with that of human participants. Despite a notable gap, MLLMs have begun to demonstrate preliminary emotion recognition and prediction capabilities similar to those of human participants in dyadic interactions. Using human performance as a benchmark, the study further compares MLLMs under three conditions: integrating both character images and conversational content, using only character images, or relying solely on conversational content. The results suggest that the visual features of character interactions constrain MLLMs’ basic emotion recognition to some extent but effectively facilitate the recognition of complex emotions, while having no significant impact on emotion prediction.
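
As an illustration of the three input conditions compared above, the sketch below shows one way such trials could be posed to a multimodal chat model via the OpenAI Python SDK (image only, dialogue text only, or both). It is not the authors' evaluation code: the model name, prompt wording, emotion label set, file name, and the query_emotion helper are all illustrative assumptions.

```python
# Minimal sketch of the image-only / text-only / combined query conditions,
# assuming the OpenAI Python SDK (>= 1.0) and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical label set; the study's actual basic/compound emotion categories may differ.
EMOTIONS = ["happiness", "sadness", "anger", "fear", "surprise", "disgust"]
PROMPT = (
    "Identify the emotion of the speaker in this social interaction. "
    f"Answer with one word from: {', '.join(EMOTIONS)}."
)

def encode_image(path: str) -> str:
    """Base64-encode a video frame so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def query_emotion(condition: str, image_path: str | None = None,
                  dialogue: str | None = None, model: str = "gpt-4-turbo") -> str:
    """Send one trial under the 'image', 'text', or 'both' condition."""
    content = [{"type": "text", "text": PROMPT}]
    if condition in ("text", "both") and dialogue:
        content.append({"type": "text", "text": f"Dialogue: {dialogue}"})
    if condition in ("image", "both") and image_path:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"},
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        temperature=0,  # deterministic output for scoring
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # One hypothetical trial: an extracted frame plus its transcribed dialogue line.
    for cond in ("image", "text", "both"):
        print(cond, "->", query_emotion(cond, image_path="frame_001.jpg",
                                        dialogue="I can't believe you did that!"))
```

In an actual evaluation of this kind, the labels returned under each condition would then be scored against human annotations of the same clips.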

Additionally, by comparing the emotion recognition and prediction performance of two mainstream MLLMs and different versions of GPT-4, the study finds that, rather than merely increasing the scale of training data, innovations in the underlying technical framework play a more crucial role in enhancing MLLMs’ emotional capabilities in dynamic social interaction contexts. Overall, this study deepens the understanding of the interaction between human visual features and conversational content, fosters interdisciplinary integration between psychology and artificial intelligence, and provides valuable theoretical and practical insights for developing explainable affective computing models and general artificial intelligence.

Key words: multimodal large language model, social interaction, emotion recognition, emotion prediction

CLC Number: