ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

心理科学进展 (Advances in Psychological Science) ›› 2023, Vol. 31 ›› Issue (6): 887-904. doi: 10.3724/SP.J.1042.2023.00887

• Research Methods •

  • Funding:
    Major Program of the National Social Science Fund of China, "Cultural and Psychological Changes During Social Transformation in China" (17ZDA324); Independent Research Project of the Institute of Psychology, Chinese Academy of Sciences, "Cultural Change and Social Adaptation: Behavioral and Neuroimaging Studies" (E2CX3935CX)

Using word embeddings to investigate human psychology: Methods and applications

BAO Han-Wu-Shuang1,2,3, WANG Zi-Xi1,2, CHENG Xi1,2, SU Zhan1,2, YANG Ying1,2, ZHANG Guang-Yao1,2,4, WANG Bo5, CAI Hua-Jian1,2()   

  1. CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China
    2. Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China
    3. Manchester China Institute, The University of Manchester, Manchester M13 9PL, United Kingdom
    4. State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, and IDG/McGovern Institute for Brain Research, Beijing 100875, China
    5. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Received: 2022-08-23  Online: 2023-06-15  Published: 2023-03-07
  • Contact: CAI Hua-Jian  E-mail: caihj@psych.ac.cn


Abstract:

As a fundamental technique in natural language processing (NLP), word embedding represents a word as a low-dimensional, dense, and continuous numeric vector (i.e., a word vector). This representation is learned with machine learning algorithms such as neural networks, which extract the semantic features of a word automatically. There are two types of word embeddings: static and dynamic. Static word embeddings aggregate all the contextual information of a word across an entire corpus into a single fixed vector representation; they can be obtained by predicting the surrounding words given a word or vice versa (Word2Vec and FastText) or by modeling word co-occurrence probabilities (GloVe) in large-scale text corpora. Dynamic or contextualized word embeddings, in contrast, derive a word vector from the specific context in which the word occurs; they can be generated by pre-trained language models such as ELMo, GPT, and BERT. In theory, the dimensions of a word vector reflect how the word is predicted from its contexts; in practice, they also encode substantial semantic information about the word. Word embeddings can therefore be used to analyze the semantic content of text.
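The idea that word vectors encode semantics can be illustrated with a minimal sketch. The 4-dimensional vectors below are hand-made toy examples (real embeddings are trained from corpora and typically have hundreds of dimensions); the word list and values are illustrative assumptions only:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: 1 = same direction, ~0 = unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 4-dimensional "word vectors" (hypothetical, for illustration only).
vec = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.5, 0.9, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9, 0.1]),
    "apple": np.array([0.0, 0.1, 0.1, 0.9]),
}

# Semantically related words lie closer together in the vector space.
print(cosine_similarity(vec["king"], vec["queen"]))   # ~0.66 (related)
print(cosine_similarity(vec["king"], vec["apple"]))   # ~0.08 (unrelated)

# Vector arithmetic captures analogies: king - man + woman lands nearest "queen".
analogy = vec["king"] - vec["man"] + vec["woman"]
nearest = max(vec, key=lambda w: cosine_similarity(analogy, vec[w]))
print(nearest)  # queen
```

In actual research, the vectors would come from pre-trained static models (e.g., Word2Vec or GloVe) or, for context-specific meanings, from dynamic models such as BERT.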
In recent years, word embeddings have been increasingly applied to the study of human psychology. Researchers have used them in various forms, including raw word vectors, vector sums or differences, and absolute or relative semantic similarities and distances; among these applications, the Word Embedding Association Test (WEAT) has received the most attention. Based on word embeddings, psychologists have explored a wide range of topics, including semantic processing, cognitive judgment, divergent thinking, social biases and stereotypes, and sociocultural change at the societal or population level. In particular, the WEAT has been widely used to investigate attitudes, stereotypes, and social biases, the relationship between culture and psychology, and their origins, development, and cross-temporal changes.
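As a sketch of the computation behind the WEAT: its effect size is a standardized difference of mean cosine similarities between two target word sets (X, Y) and two attribute word sets (A, B). The toy vectors below are randomly generated so that X-words align with A-words and Y-words with B-words; this construction and the use of the sample standard deviation are our own illustrative assumptions, not any specific package's implementation:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def weat_effect_size(X, Y, A, B):
    """Standardized differential association of target word sets X, Y
    with attribute word sets A, B (each given as a list of word vectors)."""
    def s(w):
        # how much more strongly word w associates with A than with B
        return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])
    sx, sy = [s(x) for x in X], [s(y) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Hypothetical toy vectors: X-words are built to lie near A-words and
# Y-words near B-words, so a large positive effect size is expected.
rng = np.random.default_rng(42)
dim = 50
axis_a, axis_b = rng.normal(size=dim), rng.normal(size=dim)
A = [axis_a + 0.3 * rng.normal(size=dim) for _ in range(4)]
B = [axis_b + 0.3 * rng.normal(size=dim) for _ in range(4)]
X = [axis_a + 0.3 * rng.normal(size=dim) for _ in range(4)]
Y = [axis_b + 0.3 * rng.normal(size=dim) for _ in range(4)]

d = weat_effect_size(X, Y, A, B)
print(round(d, 2))  # large and positive for this toy setup
```

In a real WEAT study, X, Y, A, and B would be curated word lists (e.g., names and attribute terms) whose vectors are looked up in a pre-trained embedding model.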
As a novel methodology, word embeddings offer several unique advantages over traditional approaches in psychology, including lower research costs, higher sample representativeness, stronger objectivity of analysis, and more replicable results. Nonetheless, word embeddings also have limitations, such as their inability to capture deeper psychological processes, limited generalizability of conclusions, and dubious reliability and validity. Future research using word embeddings should address these limitations by (1) distinguishing between implicit and explicit components of social cognition, (2) training fine-grained word vectors in terms of time and region to facilitate cross-temporal and cross-cultural research, and (3) applying contextualized word embeddings and large pre-trained language models such as GPT and BERT. To enhance the application of word embeddings in psychological research, we have developed the R package “PsychWordVec”, an integrated word embedding toolkit for researchers to study human psychology in natural language.

Key words: natural language processing, word embedding, word vector, semantic representation, semantic relatedness, Word Embedding Association Test (WEAT)
