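Can hearing “牛黄” (niu2 huang2, ox bezoar) bring “黄牛” (huang2 niu2, cattle) to mind? The mechanism of phonetic position encoding in spoken word recognition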

doi:10.3724/SP.J.1042.2024.01488

Abstract

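Abstract (Chinese version):

Many languages contain words that still form valid words after their phonetic positions are transposed, such as “牛黄” and “黄牛” in Chinese. Clarifying how such transposable words are encoded during language comprehension is an important research question. In the field of reading, researchers have already discussed the mechanisms of position encoding for words. For spoken language processing, however, the cognitive mechanism of phonetic position encoding remains the subject of a debate between sequential and flexible encoding: early theories of spoken word recognition held that phonetic position encoding is primarily sequential, whereas recent studies have found predominantly flexible phonetic position encoding at the phoneme, syllable, and sentence levels. Future research should further explore the cognitive mechanisms, neural mechanisms, language acquisition, and artificial intelligence issues related to phonetic encoding in spoken word recognition. Because Chinese character words are distinctive in their orthography-to-sound correspondence and in their phonological processing units, subsequent research should pay particular attention to the phonetic position encoding of Chinese character words.

Keywords: spoken word recognition, phonetic position encoding, Chinese character words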

Abstract:

Many languages contain words that still form valid words when their phonetic components are transposed. Typical examples are “牛黄 /niu2 huang2/” and “黄牛 /huang2 niu2/” in Chinese, and “bus” and “sub” in English. How such transposable words are processed during language comprehension is an important research question. In the field of reading, scholars have discussed the mechanisms of position encoding for words. For spoken word recognition, however, the cognitive mechanisms governing phonetic position encoding remain controversial.

Early theories posited that phonetic position encoding is primarily sequential. These models assume that words are represented as sequences of phonemes and that activation depends on linear positional matching as the spoken word unfolds in time, as in the COHORT and TRACE models. The COHORT model holds that word activation follows an all-or-none rule in which only words matching at onset compete for activation. Later models such as TRACE, NAM, and Shortlist diverge from this principle, proposing that word activation and recognition stem from the linear matching of the input speech signal with phonemic segments. These slot-based models postulate that the phonetic information of a word is held in fixed ‘slots’, and that word activation depends on how well each slot’s phonetic and positional features match the input.

Nevertheless, research on reading and visual word recognition has found that words may be encoded in a coarse-grained way. While reading sentences, readers maintain uncertainty about recently encountered words. The noisy-channel model of speech perception proposed by Gibson et al. (2013) likewise explains how language is understood amid ‘noise.’ These findings on ‘uncertainty’ raise the possibility that phonetic information is also encoded in a coarse-grained way during spoken word recognition. Indeed, recent studies have shown flexible phonetic position encoding at the phoneme, syllable, and sentence levels. For instance, researchers have found mutual activation between phoneme-reversed (anadromic) word pairs such as “sub” and “bus,” or “/byt/” and “/tyb/,” demonstrating a transposed-phoneme effect in spoken word recognition. This position-independent phoneme encoding suggests that phonetic encoding is coarse-grained and flexible, independent of positional information.

However, current research focuses mostly on alphabetic writing systems, so the generality of its findings is unclear. Exploring logographic languages like Chinese can provide broader evidence for this flexible encoding mechanism. Chinese characters, as ideographic symbols, have several distinctive features worth investigating. The first is the dissociation between spelling and sound in Chinese, which makes it possible to find word pairs with entirely different forms and meanings that nonetheless share anadromic phonetic sequences, such as “冰锥 /bing1 zhui1/” (ice pick) and “追兵 /zhui1 bing1/” (pursuing soldiers). This property allows phonetic encoding to be examined independently of visual form, offering more precise insights into phonetic position encoding in spoken word recognition. Second, the phonetic processing units in the Chinese lexicon are distinctive. Unlike alphabetic languages, Chinese, with its syllabic nature, possibly operates at the syllable level, with each character corresponding to a syllable. Finally, the unique relationship among syllables, morphemes, and semantics in Chinese spoken word recognition differentiates it from alphabetic scripts, highlighting the importance of investigating phonetic position encoding in logographic languages.

In conclusion, the encoding of phonetic positions in spoken word recognition likely adopts a flexible, position-independent approach. However, additional empirical evidence is required to substantiate this hypothesis. Further exploration is needed on four key questions regarding phonetic position encoding in spoken word recognition: (1) elucidating the universality of flexible encoding across languages by examining the phonetic encoding peculiarities of Chinese characters; (2) unraveling the neural mechanisms underlying the temporal processing of phonetic positions, using techniques such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI); (3) leveraging existing research findings to guide language acquisition and learning across diverse populations; (4) harnessing insights into the mechanisms of human speech signal processing to support the development of more advanced and comprehensive functionality in artificial intelligence, which is rapidly permeating modern life and evolving alongside advances in information technology.

Key words: spoken word recognition, phonetic position encoding, Chinese character

CLC number:

B842.5

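Han, H., & Li, X. (2024). Can hearing “牛黄” bring “黄牛” to mind? The mechanism of phonetic position encoding in spoken word recognition. Advances in Psychological Science, 32(9), 1488-1501.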

HAN Haibin, LI Xingshan. (2024). The mechanism of phonetic position encoding in spoken word recognition. Advances in Psychological Science, 32(9), 1488-1501.

Figures/Tables (3)

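Figure 1. The mental process by which the Cohort model recognizes the spoken word “beaker” /bi:kər/. The first row shows the set of words sharing the same onset that is activated after the phoneme /b/ is heard. The second row shows the words activated after the syllable /bi:/ is heard, by which point words that do not match /bi:/ have been removed. Finally, once the whole word has been heard, all words other than “beaker” have been removed. Later revised versions of the model, however, found that the syllable in rhyme position also activates words sharing the same rhyme.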
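The pruning process described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the Cohort model itself: the toy lexicon, the rough phoneme transcriptions, and the function name `cohort_trace` are assumptions made for illustration, and activation levels, frequency, and context effects are all left out.

```python
# Toy illustration of strictly sequential, onset-anchored pruning.
# LEXICON and its transcriptions are hypothetical, not taken from the article.
LEXICON = {
    "beaker": ("b", "i:", "k", "ər"),
    "beetle": ("b", "i:", "t", "əl"),
    "bell": ("b", "ɛ", "l"),
    "speaker": ("s", "p", "i:", "k", "ər"),
}


def cohort_trace(lexicon, spoken):
    """Return the surviving candidate words after each incoming phoneme."""
    candidates = set(lexicon)
    history = []
    for i, phoneme in enumerate(spoken):
        # All-or-none rule: keep a word only if it matches at this exact position.
        candidates = {
            w for w in candidates
            if len(lexicon[w]) > i and lexicon[w][i] == phoneme
        }
        history.append((phoneme, sorted(candidates)))
    return history


for phoneme, cohort in cohort_trace(LEXICON, LEXICON["beaker"]):
    print(phoneme, cohort)
```

In this sketch “speaker” never enters the candidate set even though it shares the rhyme of “beaker”, which is the kind of rhyme activation that a strictly onset-anchored rule cannot produce.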

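Table 1. Comparison of representative early sequential models of word recognition

Model	Mode of recognition	Recognition process	Key information for recognition	Stage at which top-down information operates
Cohort	Bottom-up	Strictly sequential	Word-onset information	Late integration stage
TRACE	Bottom-up and top-down	Interactive activation	Overall match between the word and the mental lexicon	Throughout
Shortlist	Bottom-up and top-down	Interactive activation	Overall match between the word and the mental lexicon	Selection stage, after the word candidate list stage
NAM	Bottom-up and top-down	Interactive activation	Overall similarity of the word's “neighbors”	Lexical decision stage, after the word's “neighbors” are activated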

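Figure 2. An example of the spoken word recognition process in the TISK model. First, starting from the speech signal input, a set of phoneme units activates the phonemes that make up the signal. Next comes the level of open-diphone units, a set of open diphones that are independent of input position. Finally, at the word level, the words that match are activated (Hannagan et al., 2013).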
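The contrast between slot-based matching and position-independent open diphones can be made concrete with a short Python sketch. This is a simplification under stated assumptions, not the TISK implementation: it treats open diphones as all ordered phoneme pairs (contiguous or not), uses plain set overlap instead of activations and weights, and the helper names (`open_diphones`, `jaccard`, `slot_match`) and rough transcriptions are hypothetical.

```python
from itertools import combinations


def open_diphones(phonemes):
    """All ordered (earlier, later) phoneme pairs, contiguous or not."""
    return set(combinations(phonemes, 2))


def jaccard(a, b):
    """Set overlap: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def slot_match(p, q):
    """Slot-based view: proportion of positions holding the same phoneme."""
    same = sum(x == y for x, y in zip(p, q))
    return same / max(len(p), len(q))


# Rough transcriptions, assumed for illustration only.
basket = ("b", "a", "s", "k", "ɛ", "t")
bakset = ("b", "a", "k", "s", "ɛ", "t")  # adjacent phonemes transposed
bus, sub = ("b", "ʌ", "s"), ("s", "ʌ", "b")  # full reversal

print(round(slot_match(basket, bakset), 2))                    # 0.67
print(jaccard(open_diphones(basket), open_diphones(bakset)))   # 0.875
print(jaccard(open_diphones(bus), open_diphones(sub)))         # 0.0
print(jaccard(set(bus), set(sub)))                             # 1.0
```

In this toy comparison, an adjacent transposition leaves almost all ordered pairs intact, so the diphone overlap is higher than the slot-by-slot match. A full reversal such as “bus” and “sub” shares no ordered pairs at all, so any overlap between them here comes only from the unordered set of phonemes.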

References (78)

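[1]	韩海宾, 许萍萍, 屈青青, 程茜, 李兴珊. (2019). Audiovisual cross-modal integration in language processing. Advances in Psychological Science, 27(3), 475-489. doi: 10.3724/SP.J.1042.2019.00475
[2]	黄伯荣, 廖序东. (2011). Modern Chinese (Vol. 1, 5th rev. ed.). Beijing: Higher Education Press.
[3]	彭聃龄, 丁国盛, 王春茂, Taft, 朱晓平. (1999). The processing of Chinese reversed-order words: The role of morphemes in word processing. Acta Psychologica Sinica, 1, 36-46.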
[4]	Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419-439.
[5]	Andersson, R., Ferreira, F., & Henderson, J. M. (2011). I see what you’re saying: The integration of complex speech and scenes during language comprehension. Acta Psychologica, 137(2), 208-216. doi: 10.1016/j.actpsy.2011.01.007 pmid: 21303711
[6]	Chambers, S. M. (1979). Letter and order information in lexical access. Journal of Verbal Learning and Verbal Behavior, 18(2), 225-241.
[7]	Chen, J. Y., Chen, T. M., & Dell, G. S. (2002). Word-form encoding in Mandarin Chinese as assessed by the implicit priming task. Journal of Memory and Language, 46, 751-781.
[8]	Chen, Q., & Mirman, D. (2012). Competition and cooperation among similar representations: Toward a unified account of facilitative and inhibitory effects of lexical neighbors. Psychological Review, 119, 417-430. doi: 10.1037/a0027175 pmid: 22352357
[9]	Connine, C. M., Blasko, D. G., & Titone, D. (1993). Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2), 193-210.
[10]	Connolly, J. F., & Phillips, N. A. (1994). Event-related potential components reflect phonological and semantic processing of the terminal word of spoken sentences. Journal of Cognitive Neuroscience, 6(3), 256-266. doi: 10.1162/jocn.1994.6.3.256 pmid: 23964975
[11]	Cooper, R. M. (1974). The control of eye fixation by the meaning of spoken language. Cognitive Psychology, 6(1), 84-107.
[12]	Dahan, D., & Magnuson, J. S. (2006). Spoken Word Recognition. In M. J. Traxler & M. A. Gernsbacher (Eds.), Handbook of psycholinguistics (pp. 249-283). Academic Press.
[13]	Davis, C. J. (2010). The spatial coding model of visual word identification. Psychological Review, 117, 713-758. doi: 10.1037/a0019738 pmid: 20658851
[14]	Dufour, S., & Frauenfelder, U. H. (2010). Phonological neighbourhood effects in French spoken-word recognition. Quarterly Journal of Experimental Psychology, 63(2), 226-238.
[15]	Dufour, S., & Grainger, J. (2019). Phoneme‐order encoding during spoken word recognition: A priming investigation. Cognitive Science, 43(10), e12785.
[16]	Dufour, S., & Grainger, J. (2020). The influence of word frequency on the transposed-phoneme priming effect. Attention, Perception, & Psychophysics, 82(6), 2785-2792.
[17]	Dufour, S., & Grainger, J. (2022). When you hear /baksɛt/ do you think /baskɛt/? Evidence for transposed-phoneme effect with multisyllabic words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 48(1), 98-107.
[18]	Dufour, S., Mirault, J., & Grainger, J. (2021). Do you want /ʃoloka/ on a /bistɔk/? On the scope of transposed-phoneme effects with non-adjacent phonemes. Psychonomic Bulletin & Review, 28(5), 1668-1678.
[19]	Dufour, S., Mirault, J., & Grainger, J. (2022). Transposed-word effects in speeded grammatical decisions to sequences of spoken words. Scientific Reports, 12(1), 22035.
[20]	Dufour, S., Mirault, J., & Grainger, J. (2023). When facilitation becomes inhibition: Effects of modality and lexicality on transposed-phoneme priming. Language, Cognition and Neuroscience, 38(2), 147-156.
[21]	Dufour, S., & Peereman, R. (2003). Inhibitory priming effects in auditory word recognition: When the target's competitors conflict with the prime word. Cognition, 88(3), B33-B44.
[22]	Frankish, C., & Turner, E. (2007). SIHGT and SUNOD: The role of orthography and phonology in the perception of transposed letter anagrams. Journal of Memory and Language, 56(2), 189-211.
[23]	Gaskell, M. G., & Marslen-Wilson, W. D. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12, 613-656.
[24]	Gibson, E., Piantadosi, S. T., Brink, K., Bergen, L., Lim, E., & Saxe, R. (2013). A noisy-channel account of crosslinguistic word-order variation. Psychological Science, 24(7), 1079-1088. doi: 10.1177/0956797612463705 pmid: 23649563
[25]	Gomez, P., Ratcliff, R., & Perea, M. (2008). The overlap model: A model of letter position coding. Psychological Review, 115(3), 577-600. doi: 10.1037/a0012667 pmid: 18729592
[26]	Grainger, J., & Van Heuven, W. J. B. (2004). Modeling letter position coding in printed word perception. In P. Bonin (Ed.), Mental lexicon: "Some words to talk about words" (pp. 1-23). Nova Science Publishers.
[27]	Grainger, J., & Whitney, C. (2004). Does the huamn mnid raed wrods as a wlohe? Trends in Cognitive Sciences, 8, 58-59. pmid: 15588808
[28]	Gregg, J., Inhoff, A. W., & Connine, C. M. (2019). Re-reconsidering the role of temporal order in spoken word recognition. Quarterly Journal of Experimental Psychology, 72(11), 2574-2583.
[29]	Guerrara, C., & Forster, K. (2008). Masked form priming with extreme transposition. Language & Cognitive Processes, 23, 117-142.
[30]	Gwilliams, L., King, J. R., Marantz, A., & Poeppel, D. (2022). Neural dynamics of phoneme sequences reveal position-invariant code for content and order. Nature Communications, 13(1), 6606.
[31]	Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of NAACL-2001 (pp. 159-166). Stroudsburg, PA: Association for Computational Linguistics.
[32]	Han, H., & Li, X. (2020). Degree of conceptual overlap affects eye movements in visual world paradigm. Language, Cognition and Neuroscience, 35(10), 1456-1464.
[33]	Hannagan, T., Dupoux, E., & Christophe, A. (2011). Holographic string encoding. Cognitive Science, 35, 79-118. doi: 10.1111/j.1551-6709.2010.01149.x pmid: 21428993
[34]	Hannagan, T., Magnuson, J. S., & Grainger, J. (2013). Spoken word recognition without a TRACE. Frontiers in Psychology, 4, 563. doi: 10.3389/fpsyg.2013.00563 pmid: 24058349
[35]	Hofmann, T., Schölkopf, B., & Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics, 36(3), 1171-1220.
[36]	Huettig, F., & Altmann, G. T. M. (2005). Word meaning and the control of eye fixation: Semantic competitor effects and the visual world paradigm. Cognition, 96(1), 23-32. pmid: 15833303
[37]	Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20(2), 137-194.
[38]	Lahiri, A., & Marslen-Wilson, W. (1991). The mental representation of lexical form: A phonological approach to the recognition lexicon. Cognition, 38(3), 245-294. pmid: 2060271
[39]	Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126-1177.
[40]	Levy, R., Bicknell, K., Slattery, T., & Rayner, K. (2009). Eye movement evidence that readers maintain and act on uncertainty about past linguistic input. Proceedings of the National Academy of Sciences, USA, 106, 21086-21090.
[41]	Liu, Z., Li, Y., Cutter, M. G., Paterson, K. B., & Wang, J. (2022). A transposed-word effect across space and time: Evidence from Chinese. Cognition, 218, 104922.
[42]	Liu, Z., Li, Y., Paterson, K. B., & Wang, J. (2020). A transposed-word effect in Chinese reading. Attention, Perception, & Psychophysics, 82(8), 3788-3794.
[43]	Liu, Z., Li, Y., & Wang, J. (2021). Context but not reading speed modulates transposed-word effects in Chinese reading. Acta Psychologica, 215, 103272.
[44]	Luce, P. A., Goldinger, S. D., Auer, E. T., Jr., & Vitevitch, M. S. (2000). Phonetic priming, neighborhood activation, and PARSYN. Perception & Psychophysics, 62, 615-625.
[45]	Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19, 1-36. doi: 10.1097/00003446-199802000-00001 pmid: 9504270
[46]	Marslen-Wilson, W. D. (1990). Activation, competition, and frequency in lexical access. In G. T. M. Altmann (Ed.), Cognitive models of speech processing: Psycholinguistic and computational perspectives (pp. 148-172). Cambridge, MA: MIT Press.
[47]	Marslen-Wilson, W. D., Moss, H. E., & van Halen, S. (1996). Perceptual distance and competition in lexical access. Journal of Experimental Psychology: Human Perception and Performance, 22(6), 1376-1392.
[48]	Marslen-Wilson, W. D., & Warren, P. (1994). Levels of perceptual representation and process in lexical access: Words, phonemes, and features. Psychological Review, 101(4), 653-675. pmid: 7984710
[49]	Marslen-Wilson, W. D., & Tyler, L. K. (1987). Against modularity. In J. L. Garfield (Ed.), Modularity in knowledge representation and natural-language understanding (pp. 37-62). Cambridge: The MIT Press.
[50]	Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10(1), 29-63.
[51]	Marslen-Wilson, W. D., & Zwitserlood, P. (1989). Accessing spoken words: The importance of word onsets. Journal of Experimental Psychology: Human Perception and Performance, 15(3), 576-585.
[52]	Marslen-Wilson, W. D. (1993). Issues of process and representation in lexical access. In G. T. M. Altmann & R. Shillcock (Eds.), Cognitive models of speech processing: The second Sperlonga meeting (pp.187-210). Lawrence Erlbaum Associates Publishers.
[53]	Marslen-Wilson, W. (1973). Linguistic structure and speech shadowing at very short latencies. Nature, 244, 522-523.
[54]	Marslen-Wilson, W. (1985). Speech shadowing and speech comprehension. Speech Communication, 4, 55-73.
[55]	McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86. pmid: 3753912
[56]	McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition, 86, B33-B42.
[57]	Mirault, J., Snell, J., & Grainger, J. (2018). You that read wrong again! A transposed-word effect in grammaticality judgments. Psychological Science, 29(12), 1922-1929. doi: 10.1177/0956797618806296 pmid: 30355054
[58]	Norris, D. (1994). SHORTLIST: A connectionist model of continuous speech recognition. Cognition, 52, 189-234.
[59]	Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115, 357-395. doi: 10.1037/0033-295X.115.2.357 pmid: 18426294
[60]	O’Connor, R. E., & Forster, K. I. (1981). Criterion bias and search sequence bias in word recognition. Memory & Cognition, 9, 78-92.
[61]	O’Seaghdha, P. G., Chen, J. Y., & Chen, T. M. (2010). Proximate units in word production: Phonological encoding begins with syllables in Mandarin Chinese but with segments in English. Cognition, 115(2), 282-302. doi: 10.1016/j.cognition.2010.01.001 pmid: 20149354
[62]	Perea, M., & Lupker, S. J. (2003). Does jugde activate COURT? Transposed-letter similarity effects in masked associative priming. Memory & Cognition, 31, 829-841.
[63]	Perea, M., & Lupker, S. J. (2004). Can CANISO activate CASINO? Transposed-letter similarity effects with nonadjacent letter positions. Journal of Memory and Language, 51, 231-246.
[64]	Prabhakaran, R., Blumstein, S. E., Myers, E. B., Hutchison, E., & Britton, B. (2006). An event-related fMRI investigation of phonological-lexical competition. Neuropsychologia, 44, 2209-2221. pmid: 16842827
[65]	Qu, Q. Q., Damian, M. F., & Kazanina, N. (2012). Sound-sized segments are significant for Mandarin speakers. Proceedings of the National Academy of Sciences of the United States of America (PNAS), 109, 14265-14270.
[66]	Rayner, K. (1975). The perceptual span and peripheral cues in reading. Cognitive Psychology, 7(1), 65-81.
[67]	Reichle, E. D., Pollatsek, A., Fisher, D. L., & Rayner, K. (1998). Toward a model of eye movement control in reading. Psychological Review, 105(1), 125-157. pmid: 9450374
[68]	Righi, G., Blumstein, S. E., Mertus, J., & Worden, M. S. (2010). Neural systems underlying lexical competition: An eye tracking and fMRI study. Journal of Cognitive Neuroscience, 22(2), 213-224. doi: 10.1162/jocn.2009.21200 pmid: 19301991
[69]	Scott, S. K. (2019). From speech and talkers to the social world: The neural processing of human spoken language. Science, 366(6461), 58-62. doi: 10.1126/science.aax0288 pmid: 31604302
[70]	Sereno, S. C., Brewer, C. C., & O'Donnell, P. J. (2003). Context effects in word recognition: Evidence for early interactive processing. Psychological Science, 14(4), 328-333. pmid: 12807405
[71]	Toscano, J. C., Anderson, N. D., & McMurray, B. (2013). Reconsidering the role of temporal order in spoken word recognition. Psychonomic Bulletin & Review, 20(5), 981-987.
[72]	Van Petten, C., Coulson, S., Rubin, S., Plante, E., & Parks, M. (1999). Time course of word identification and semantic integration in spoken language. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25(2), 394-417.
[73]	Whitney, C. (2001). How the brain encodes the order of letters in a printed word: The SERIOL model and selective literature review. Psychonomic Bulletin & Review, 8, 221-243.
[74]	Yee, E., Blumstein, S., & Sedivy, J. C. (2008). Lexical-semantic activation in Broca's and Wernicke's aphasia: Evidence from eye movements. Journal of Cognitive Neuroscience, 20(4), 592-612.
[75]	Yi, H. G., Leonard, M. K., & Chang, E. F. (2019). The encoding of speech sounds in the superior temporal gyrus. Neuron, 102(6), 1096-1110. doi: S0896-6273(19)30380-0 pmid: 31220442
[76]	You, H., & Magnuson, J. S. (2018). TISK 1.0: An easy-to-use Python implementation of the time-invariant string kernel model of spoken word recognition. Behavior Research Methods, 50, 871-889.
[77]	You, W., Zhang, Q., & Verdonschot, R. G. (2012). Masked syllable priming effects in word and picture naming in Chinese. PLoS ONE, 7(10), e46595.
[78]	Zwitserlood, P. (1989). The locus of the effects of sentential-semantic context in spoken-word processing. Cognition, 32(1), 25-64. doi: 10.1016/0010-0277(89)90013-9 pmid: 2752705