ISSN 0439-755X
CN 11-1911/B

Acta Psychologica Sinica ›› 2025, Vol. 57 ›› Issue (11): 1988-2000. doi: 10.3724/SP.J.1041.2025.1988

• Reports of Empirical Studies •

Emotional capabilities evaluation of multimodal large language model in dynamic social interaction scenarios

ZHOU Zisen1, HUANG Qi1, TAN Zehong2, LIU Rui3, CAO Ziheng4, MU Fangman5, FAN Yachun2, QIN Shaozheng1

  1. State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China
    2School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
    3School of Business Administration, Inner Mongolia University of Finance and Economics, Hohhot, Inner Mongolia 010070, China
    4Alibaba Group Holding Ltd, Hangzhou, Zhejiang 310020, China
    5School of Mathematics and Computer Science, Chuxiong Normal University, Chuxiong, Yunnan 675000, China
  • Published: 2025-11-25; Online: 2025-09-25
  • Contact: Fan Yachun, E-mail: fanyachun@bnu.edu.cn; Qin Shaozheng, E-mail: szqin@bnu.edu.cn

Abstract:

Multimodal Large Language Models (MLLMs) can process and integrate multimodal data, such as images and text, providing a powerful tool for understanding human psychology and behavior. Drawing on classic experimental paradigms from emotion research, this study compares the emotion recognition and prediction abilities of human participants and two mainstream MLLMs in dynamic social interaction contexts, aiming to disentangle the distinct roles of the visual features of conversational characters (images) and the conversational content (text) in emotion recognition and prediction.
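
To make the evaluation setup concrete, the sketch below shows one way a character image and the accompanying dialogue could be submitted to a multimodal model for an emotion judgment. It assumes an OpenAI-style chat endpoint; the model name, prompt wording, and rating scale are illustrative placeholders, not the authors' exact protocol.

```python
"""Minimal sketch: eliciting an emotion judgment from a multimodal model.

Assumes an OpenAI-style chat endpoint; the model name, prompt wording, and
rating scale are illustrative stand-ins for the study's actual protocol.
"""
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_emotion(image_path: str, dialogue: str, model: str = "gpt-4o") -> str:
    # Encode the character image so it can be sent inline with the text.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You will see one character from a two-person conversation and the "
        "dialogue so far.\n"
        f"Dialogue: {dialogue}\n"
        "Rate the character's current emotion on a 1-9 scale for valence and "
        "arousal, and name the most likely discrete emotion."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Dropping either the image block or the dialogue line from the message yields the image-only and text-only conditions discussed below.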

The results indicate that the emotion recognition and prediction performance of MLLMs, based on character images and conversational content, shows only moderate or weaker correlations with that of human participants. Despite this notable gap, MLLMs have begun to demonstrate preliminary, human-like emotion recognition and prediction abilities in dyadic interactions. Using human performance as a benchmark, the study further compares MLLMs under different conditions: integrating both character images and conversational content, using only character images, or relying solely on conversational content. The results suggest that the visual features of interacting characters somewhat constrain MLLMs’ basic emotion recognition but effectively facilitate the recognition of complex emotions, while having no significant impact on emotion prediction.
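
As a concrete illustration of benchmarking against human performance, the sketch below correlates model ratings with human ratings under the three modality conditions. The rating values are hypothetical placeholders; the study's actual items, scales, and statistics may differ.

```python
"""Minimal sketch: correlating model and human emotion ratings per condition.

The rating arrays are hypothetical placeholders, not the study's data.
"""
import numpy as np
from scipy.stats import spearmanr

# Human benchmark ratings and model ratings for the same trials,
# collected under three modality conditions (placeholder values).
human = np.array([7, 3, 5, 8, 2, 6, 4, 7, 5, 3])
model_by_condition = {
    "image+text": np.array([6, 4, 5, 7, 3, 6, 4, 6, 5, 4]),
    "image-only": np.array([5, 5, 4, 6, 4, 5, 5, 5, 4, 5]),
    "text-only":  np.array([6, 3, 5, 8, 2, 6, 3, 7, 5, 3]),
}

for condition, model_ratings in model_by_condition.items():
    rho, p = spearmanr(human, model_ratings)
    print(f"{condition:>10}: rho = {rho:.2f}, p = {p:.3f}")
```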

Additionally, by comparing the emotion recognition and prediction performance of two mainstream MLLMs and of different versions of GPT-4, the study finds that innovations in the underlying technical framework, rather than mere increases in training-data scale, play the more crucial role in enhancing MLLMs’ emotional capabilities in dynamic social interaction contexts. Overall, this study deepens the understanding of the interaction between human visual features and conversational content, fosters interdisciplinary integration between psychology and artificial intelligence, and provides valuable theoretical and practical insights for developing explainable affective computing models and general artificial intelligence.

Key words: multimodal large language model, social interaction, emotion recognition, emotion prediction