ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science ›› 2023, Vol. 31 ›› Issue (suppl.): 11-11.


Language Decoding for Visual Perception Based on Transformer

Wei Huang, Hengjiang Li, Diwei Wu, Huafu Chen, Hongmei Yan

  1. MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Online: 2023-08-26    Published: 2023-09-11

Abstract: PURPOSE: When we view a scene, the visual cortex extracts and processes the visual information it contains through various kinds of neural activity. Previous studies have decoded this neural activity into single or multiple semantic category tags, which can caption the scene to some extent. However, such tags are isolated words with no grammatical structure and convey only a limited picture of what the scene contains. It is well known that textual language (sentences or phrases) is superior to single words in disclosing the meaning of images and in reflecting people's real understanding of them. Here, based on artificial intelligence technologies, we attempted to build a language decoding model that decodes the neural activities evoked by images into language (phrases or short sentences).
METHODS: We propose a Dual-Channel Language Decoding Model (DC-LDM) that contains five modules: “Image-Extractor”, “Image-Encoder”, “Nerve-Extractor”, “Nerve-Encoder”, and “Language-Decoder”. The first channel (image channel), comprising the “Image-Extractor” and “Image-Encoder”, extracts the semantic features of natural images $I \in \mathbb{R}^{L \times W \times C}$, where L, W, and C denote the length, width, and number of channels of the image, respectively. The second channel (nerve channel), comprising the “Nerve-Extractor” and “Nerve-Encoder”, extracts the semantic features of visual activities $X = [x_1, \ldots, x_T]^{\top} \in \mathbb{R}^{T \times M}$, where T and M denote the time length and the number of voxels of the visual activities, respectively. In the training phase, the outputs of the two channels are weighted by a transfer factor (α) and fed to the “Language-Decoder”. In addition, we employed a progressive-transfer strategy to train the DC-LDM and improve its language-decoding performance, as sketched in the example below.
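The following is a minimal PyTorch sketch of the dual-channel fusion idea described above: an image-channel encoder and a nerve-channel encoder produce feature sequences that are blended with a transfer factor α and passed to a Transformer language decoder. The module sizes, the assumption that both channels share the same sequence length, and the simple α-weighted sum are illustrative assumptions; the paper's actual architecture and hyperparameters are not specified here.

import torch
import torch.nn as nn


class DualChannelFusion(nn.Module):
    """Sketch of DC-LDM-style fusion: image channel + nerve channel -> language decoder."""

    def __init__(self, d_model: int = 256, nhead: int = 4,
                 n_voxels: int = 4000, vocab_size: int = 10000):
        super().__init__()
        # Image-Encoder: Transformer encoder over image-patch features
        # (assumed to come from an upstream Image-Extractor, e.g. a CNN).
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # Nerve-Encoder: project the M voxels of each time step to d_model, then encode.
        self.voxel_proj = nn.Linear(n_voxels, d_model)
        self.nerve_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # Language-Decoder: Transformer decoder attending to the fused memory.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, img_feats, fmri, tokens, alpha: float):
        # img_feats: (B, P, d_model) patch features; fmri: (B, T, M) visual-cortex activity;
        # tokens: (B, S) target word ids. Here P == T is assumed so the memories can be blended.
        img_mem = self.image_encoder(img_feats)
        nerve_mem = self.nerve_encoder(self.voxel_proj(fmri))
        # Progressive transfer (assumed schedule): alpha moves from 0 (image channel only)
        # toward 1 (nerve channel only) over training, shifting the decoder to fMRI input.
        memory = (1.0 - alpha) * img_mem + alpha * nerve_mem
        tgt = self.token_embed(tokens)
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        return self.lm_head(out)  # (B, S, vocab_size) next-word logits


# Usage with random tensors (shapes are illustrative only).
model = DualChannelFusion()
img_feats = torch.randn(2, 16, 256)          # batch of 2, 16 patch features
fmri = torch.randn(2, 16, 4000)              # 16 time points, 4000 voxels (assumed)
tokens = torch.randint(0, 10000, (2, 12))    # 12-token target phrases
logits = model(img_feats, fmri, tokens, alpha=0.5)
print(logits.shape)                          # torch.Size([2, 12, 10000])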
RESULTS: We examined the decoding results for different images in the test set, using VC fMRI activities from a sample subject. The texts decoded by our model describe the natural images reasonably well, although they are not completely consistent with the annotators' texts. These results show that the proposed model can capture semantic information from visual activities and represent it as textual language. We adopted six indices to quantitatively evaluate the difference between the decoded texts and the annotated texts of the corresponding visual images, and found that Word2vec-Cosine Similarity (WCS) was the best indicator of the similarity between the decoded and annotated texts (a computation sketch is given below). In addition, among the different visual cortices, the text decoded from the higher visual cortex was more consistent with the description of the natural image than that decoded from the lower visual cortex.
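As a minimal sketch of how a Word2vec-Cosine Similarity (WCS) score can be computed, the snippet below embeds a decoded text and an annotated text as averaged word2vec vectors and takes their cosine similarity. The use of gensim, the pretrained "word2vec-google-news-300" vectors, and the whitespace tokenization are illustrative assumptions; the paper does not specify the exact embedding model or preprocessing.

import numpy as np
import gensim.downloader as api

# Pretrained word2vec vectors (assumed choice; downloads on first use).
word_vectors = api.load("word2vec-google-news-300")

def sentence_vector(text: str) -> np.ndarray:
    """Average the word2vec vectors of the in-vocabulary words of a text."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(word_vectors.vector_size)

def wcs(decoded: str, annotated: str) -> float:
    """Cosine similarity between the averaged vectors of two texts."""
    a, b = sentence_vector(decoded), sentence_vector(annotated)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Example with hypothetical decoded and annotated captions.
print(wcs("a man riding a wave on a surfboard",
          "a surfer rides a large wave in the ocean"))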
CONCLUSIONS: Comparing different visual areas, we found that the decoding performance of the high-level visual cortex (HVC and VC) is significantly higher than that of the low-level visual areas (V1, V2, V3, and LVC). This again confirms that the high-level visual cortex carries more semantic information than the low-level visual cortex, even when that semantic information is decoded into text. Our decoding model may inform the exploration of language-based brain-computer interfaces.

Key words: Language decoding, functional magnetic resonance imaging, visual cortex, artificial intelligence