ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

心理科学进展 (Advances in Psychological Science) ›› 2023, Vol. 31 ›› Issue (6): 1002-1019. doi: 10.3724/SP.J.1042.2023.01002

• Research Frontiers •

• Funding:
    Young Scientists Fund of the National Natural Science Foundation of China (31900802)

Distributed representation of semantics in the human brain: Evidence from studies using natural language processing techniques

JIANG Jiahao1, ZHAO Guoyu2, MA Yingbo1, DING Guosheng3, LIU Lanfang2,4()   

  1. 1Faculty of Psychology, Beijing Normal University, Beijing 100875, China
    2Faculty of Psychology, School of Arts and Sciences, Beijing Normal University, Zhuhai 519087, China
    3State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University & IDG/McGovern Institute for Brain Research, Beijing 100875, China
    4Center for Cognition and Neuroergonomics at the State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Zhuhai 519087, China
  • Received:2022-11-06 Online:2023-06-15 Published:2023-03-07
  • Contact: LIU Lanfang E-mail:liulanfang21@bnu.edu.cn


Abstract:

How semantics is represented in the human brain is a central question in cognitive neuroscience. Previous studies have typically probed semantic representations by manipulating stimulus properties or task demands, or by asking a group of participants to judge stimuli along several predefined dimensions or features. Although these approaches have yielded valuable insights into the neurobiology of language, they have several limitations. First, the experimental approach provides only a coarse depiction of semantic properties, while human judgment is time-consuming and its results may vary substantially across subjects. Second, the conventional approach has difficulty quantifying the effect of context on word meaning. Third, it cannot extract the topic of a discourse, the semantic relations between the different parts of a discourse, or the semantic distance between discourses.
Recently developed natural language processing (NLP) techniques provide a tool that may overcome these limitations. Grounded in the distributional hypothesis of semantics, NLP models represent the meanings of words, sentences, or documents as computable vectors, which can be derived from word-word or word-document co-occurrence statistics, or from neural networks trained on language tasks.
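The distributional hypothesis underlying these vector representations can be sketched with a toy example: words that occur in similar contexts receive similar count vectors, so context overlap translates into vector similarity. The corpus, window size, and similarity measure below are illustrative assumptions, not taken from any particular study or model.

```python
# Minimal sketch of the distributional hypothesis: build word vectors
# from word-word co-occurrence counts, then compare them with cosine
# similarity. Toy corpus; window size of 2 is an arbitrary choice.
from collections import Counter
import math

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "stocks fell on the news".split(),
]

vocab = sorted({w for sent in corpus for w in sent})

# Count co-occurrences within a +/-2 word window around each token.
counts = {w: Counter() for w in vocab}
window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[w][sent[j]] += 1

def vector(word):
    # A word's vector is its co-occurrence counts over the vocabulary.
    return [counts[word][c] for c in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "cat" and "dog" appear in near-identical contexts, so their vectors
# end up more similar than those of "cat" and "stocks".
sim_cat_dog = cosine(vector("cat"), vector("dog"))
sim_cat_stocks = cosine(vector("cat"), vector("stocks"))
print(sim_cat_dog > sim_cat_stocks)
```

Co-occurrence counting is only the simplest instantiation; the same vector-space idea carries over to factorized co-occurrence models and to contextual embeddings from deep networks.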
Recent studies have applied NLP techniques to model the semantics of stimuli and have mapped the resulting semantic vectors onto brain activity through representational similarity analysis or linear regression. These studies have mainly examined how the brain (i) represents word meanings; (ii) integrates contextual information and represents sentence-level meanings; and (iii) represents the topic and semantic structure of discourses. In addition, a few studies have used NLP to disentangle the syntactic and semantic information of sentences and to identify their respective neural representations. A consistent finding across these studies is that representing the semantic information of words, sentences, and discourses, as well as syntactic information, recruits a widely distributed network covering the frontal, temporal, parietal, and occipital cortices. This observation contrasts with conventional imaging and lesion studies, which typically report localized neural correlates of language processing. One possible explanation for the discrepancy is that NLP models trained on large-scale text corpora capture multiple aspects of semantic information, whereas a conventional experimental manipulation may selectively engage one or a few specific aspects of semantics, so that only a small part of the network is detected.
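The representational similarity analysis mentioned above can be sketched as follows: a model representational dissimilarity matrix (RDM) is computed from the semantic vectors, a neural RDM from the voxel patterns, and the two RDMs are then correlated. All data here are synthetic, and the dimensions, noise level, and use of Pearson correlation (Spearman is also common in practice) are illustrative assumptions.

```python
# Minimal RSA sketch: compare a model RDM built from semantic vectors
# with a "neural" RDM built from simulated voxel patterns.
import numpy as np

rng = np.random.default_rng(0)

n_words, n_dims, n_voxels = 8, 50, 200
semantic_vectors = rng.standard_normal((n_words, n_dims))

# Simulate brain patterns as a noisy linear mapping of the semantic
# vectors, so a shared representational geometry exists by design.
mapping = rng.standard_normal((n_dims, n_voxels))
brain_patterns = (semantic_vectors @ mapping
                  + 0.1 * rng.standard_normal((n_words, n_voxels)))

def rdm(patterns):
    # Representational dissimilarity matrix: 1 - Pearson correlation
    # between each pair of condition patterns.
    return 1.0 - np.corrcoef(patterns)

iu = np.triu_indices(n_words, k=1)  # upper triangle, excluding diagonal
model_rdm = rdm(semantic_vectors)[iu]
neural_rdm = rdm(brain_patterns)[iu]

# Second-order similarity: correlate the two RDM vectors. A high value
# means the two spaces share representational geometry.
rsa_score = np.corrcoef(model_rdm, neural_rdm)[0, 1]
print(round(rsa_score, 2))
```

In the encoding-model (linear regression) variant, the semantic vectors would instead be regressed onto each voxel's responses and evaluated by prediction accuracy on held-out stimuli; the RDM comparison above is the similarity-based counterpart of that mapping.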
Although NLP techniques provide a powerful tool for quantifying semantic information, they still face limitations when applied to the study of semantic representation in the brain. First, embeddings from NLP models (especially those from deep neural networks) are difficult to interpret. Second, models differ in training material, network architecture, number of parameters, training tasks, and so on, which may lead to discrepancies among research results. Finally, model training procedures differ from how humans learn language and semantics, and the internal computational mechanisms of NLP models and the human brain may also be fundamentally different. Researchers therefore need to select a model appropriate to their research question, test the model's validity with experimental designs, and interpret results with caution. In the future, it will be promising to (i) adopt richer semantic representation methods such as knowledge graphs and multimodal models; (ii) apply NLP models to assess the language abilities of clinical populations; and (iii) improve the interpretability and performance of models by drawing on cognitive neuroscience findings about how humans process language.

Key words: semantic representation, brain, natural language processing, language model
