Cross-modal integration of audiovisual information in language processing
Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China;University of Chinese Academy of Sciences, Beijing 100049, China
Received: 2018-02-28 Online: 2019-03-15
In daily life, language use often occurs in a visual context. A large body of cognitive science research has shown that visual and linguistic information processing modules do not work independently but interact in complex ways. Centering on the impact of visual information on language processing, this paper first reviews research progress on how visual information affects speech comprehension, speech production, and verbal communication. It then discusses the mechanisms by which visual information affects language processing. Finally, computational models of visually situated language processing are reviewed and future research directions are discussed.
HAN Haibin, XU Pingping, QU Qingqing, CHENG Xi, LI Xingshan. (2019).
In daily life, people often receive information from different sensory modalities simultaneously. For example, in face-to-face communication, listeners hear speech while seeing related visual information at the same time. Processing information from different modalities typically engages different cognitive modules, and modern cognitive neuroscience has shown that the brain likewise recruits different regions to process information from different modalities (Binder et al., 1997; Grill-Spector & Malach, 2004). However, studies have also found that these modules do not operate independently but influence one another (Beauchamp, 2016; Kuchenbuch, Paraskevopoulos, Herholz, & Pantev, 2014; Marslen-Wilson, 1975; Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995; Eggermont, 2017). Language, for example, is inherently referential: the spoken words we hear often correspond to specific objects in the visual world. Thus, when spoken and visual information are processed simultaneously, language guides visual attention, visual information in turn affects language processing, and the auditory and visual channels influence each other to accomplish integration. In recent years, with the rapid development of computer technology, artificial intelligence has become a research focus, and researchers have begun to integrate information from different modalities into AI systems so that they can perform more complex functions and better serve humans (Ng et al., 2017; Heinrich & Wermter, 2017).
How the cognitive modules that process information from different sensory modalities influence one another, and how they accomplish cross-modal integration, are important questions for cognitive psychology, and research on why and how multimodal information is integrated is still at an early stage. This paper reviews recent progress on how the processing of visual information affects spoken language processing. We first introduce the modularity theory and interactive theories of language processing to frame the debate; we then address two broad questions, how visual information affects language processing and why it does so; finally, we introduce computational models of visually situated language processing and outline directions for future research.
Language processing comprises multiple processes, including comprehension and production, which in turn involve word recognition, syntactic parsing, speech planning, and so on. Whether these processes proceed independently or are influenced by other information has long been debated among psycholinguists. In the early 1980s, Fodor (1983) proposed the modularity theory, according to which the human cognitive system consists of distinct modules. In the language system, for example, there are modules for phonological, lexical, and syntactic processing. Each module is an independent processing unit whose operation and output are unaffected by other information. Take sentence comprehension, which involves accessing semantic information and constructing syntactic structure: on the modular view, the syntactic module is independent of the modules that process context and semantics; it is "encapsulated" and immune to other high-level cognitive or perceptual mechanisms. Modularity theory does not deny that high-level information (e.g., context) can affect syntactic processing; when a word is ambiguous, context is still needed to determine its meaning in the sentence. Its central claim is that high-level information cannot influence the initial stage of syntactic processing: it provides feedback only after that stage is complete, rather than participating directly in it.
The representative model supporting modularity is the garden-path model, proposed by Frazier and Rayner (1982) and supported by their experimental results. They argued that only one possible syntactic structure is initially considered when processing an ambiguous sentence, and that this initial choice is driven purely by the syntactic module; contextual and semantic information provide feedback only after processing difficulty arises. However, Altmann, Garnham, and Dennis (1992), by tightly controlling context so that it favored one of the possible structures of an ambiguous sentence, found that an appropriate context could remove the processing difficulty and lead directly to the context-consistent structure. Their results do not support modularity: the syntactic module is neither impenetrable nor encapsulated, and high-level contextual information can influence initial syntactic choices in a top-down manner.
Moreover, even before modularity theory appeared, studies had found that the processors in the language system can exchange information and that different processes influence one another. For example, Marslen-Wilson (1975) found that syntactic- and semantic-level information can affect word recognition. Using the shadowing task, he examined the influence of context on word recognition and integration: participants listened to sentences and repeated each word as soon as they heard it. Target words were semantically anomalous, syntactically anomalous, or normal. For example, "universe" is semantically anomalous in "The new peace terms have been announced. They call for the unconditional universe of all the enemy forces", and "already" is syntactically anomalous in "He thinks she won't get the letter. He's afraid he forgot to put a stamp on the already before he went to post it." Within each condition, words were presented in four versions: the original word (universe) or nonwords created by replacing the first, second, or third syllable. Shadowing errors were highest for second- and third-syllable replacements in the contextually congruent, non-anomalous condition: when the first syllable was intact and the context was appropriate, participants tended to restore the item to the normal word. This shows that high-level contextual information does affect the process of word recognition and interacts with it.
These studies support an interaction between syntax and semantics during language processing. One interactive view is known as constraint satisfaction theory, or the constraint-based model (MacDonald, 1993; MacDonald, Pearlmutter, & Seidenberg, 1994). The model emphasizes that multiple sources of information interact immediately during language processing: context, the frequency of syntactic structures, and other information can be used by syntactic processing right away, so even initial syntactic choices are affected, and sentence construction is the result of the interaction and mutual constraint of all these cues. In ambiguity resolution, for instance, candidate structures are available in parallel and are constrained by contextual information, structural frequency, and semantics; disambiguation is a constraint-satisfaction process in which these cues provide evidence supporting partially activated structures. The model has received broad support (Chen & Tsai, 2015; Knoeferle & Guerra, 2016; Linzen & Jaeger, 2016; MacDonald, 1993). Beyond linguistic cues such as context and frequency, highly salient non-linguistic information such as the visual scene may also affect syntactic processing, but for reasons of experimental technique this question received little early attention. The widespread adoption of the visual world paradigm has since produced a surge of such studies.
Much evidence from the visual world paradigm supports the claim that language processing is influenced by non-linguistic information such as the visual scene. The visual world paradigm (VWP) opened a door to examining how visual perception interacts with higher-level language processing at the lexical, syntactic, and semantic levels. Its most distinctive feature is that spoken language is presented while participants view visual stimuli; participants select the visual stimulus corresponding to what they hear, manipulate objects accordingly, or simply look and listen. By recording eye movements, researchers assess how visual attention is allocated during language processing and draw inferences about its mechanisms. The paradigm was pioneered by Roger M. Cooper in 1974: he presented pictures of objects while playing recordings of short passages and found that upon hearing a particular word, participants looked more at pictures semantically related to the auditory input. For instance, on hearing "Africa", they fixated semantically related objects such as "zebra", "lion", and "snake" more than unrelated ones, and their eye movements were closely time-locked to the unfolding speech. The VWP thus reveals not only the mechanisms of language processing but also how visual information affects it.
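The analysis logic of the paradigm can be illustrated with a minimal sketch (the data, object labels, and bin width below are hypothetical, not from any cited study): given gaze samples time-stamped relative to word onset and coded by fixated object, one computes the proportion of fixations on each object within successive time bins.

```python
from collections import Counter

# Hypothetical gaze samples: (time in ms relative to target-word onset, fixated object).
samples = [
    (-100, "distractor"), (-50, "distractor"), (0, "distractor"),
    (50, "target"), (100, "target"), (150, "competitor"),
    (200, "target"), (250, "target"), (300, "target"),
]

def fixation_proportions(samples, bin_ms=100):
    """Proportion of samples on each object per time bin, bins aligned to word onset."""
    bins = {}
    for t, obj in samples:
        b = (t // bin_ms) * bin_ms          # left edge of the bin containing t
        bins.setdefault(b, Counter())[obj] += 1
    out = {}
    for b, counts in sorted(bins.items()):
        total = sum(counts.values())
        out[b] = {obj: n / total for obj, n in counts.items()}
    return out

props = fixation_proportions(samples)
print(props[200])  # {'target': 1.0} -> every sample in the 200-299 ms bin is on the target
```

Time-binned proportions of this kind are what VWP plots report: a rising target curve time-locked to the spoken word is taken as evidence that the linguistic input is guiding visual attention.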
This section reviews classic studies of how visual information affects language processing, focusing on its manifestations in spoken language comprehension and in speech production. In addition, because the VWP presents a visual display alongside auditorily presented language, it differs from purely linguistic processing; we discuss such issues as well.
Visual information affects spoken language processing at the syllable level. The McGurk effect demonstrated an audiovisual interaction at this level early on (McGurk & MacDonald, 1976). Participants watched a face repeatedly articulating "ga" while hearing the syllable "ba" synchronized with the mouth movements. Although the auditory input alone is clearly perceived as "ba", the visible articulation led participants to perceive "da", showing that visual perception interferes with the perception of syllables.
Visual information can affect the comprehension of individual words. Tanenhaus et al. (1995) first used the VWP to examine how visual information affects the resolution of temporary ambiguity at the single-word level. Participants heard the word "candy" while viewing displays of objects, which came in two versions: one contained the target ("candy") and distractors; the other contained the target, a competitor sharing its onset ("candle"), and distractors. When no competitor was displayed, the latency of saccades to the target was 145 ms; when both target and competitor were displayed, it rose to 230 ms, significantly longer. The authors argued that hearing "can-" creates a temporary ambiguity, and the presence of the visual competitor affected its resolution at the lexical level. Also at the lexical level, Chambers, Tanenhaus, Eberhard, Carlson, and Filip (1998) found that pragmatic information in the visual scene and the semantics of the preposition jointly narrow the referential domain of the object in a prepositional phrase. Participants viewed a scene containing a big can that could hold a cube, a small can that could not, and distractors, while hearing "Put the cube inside the can." Rather than looking at all container-like objects, participants looked directly at the can big enough to hold the cube. The scene carries the pragmatic information of which container can hold the cube, and this information shaped the interpretation of the preposition "inside" and narrowed the referential domain of the prepositional object.
Beyond single-word comprehension, visual information also affects syntactic processing, the aspect most extensively examined with the VWP. The pioneering study was again Tanenhaus et al. (1995), which examined how visual information affects syntactic choice in ambiguous sentences and provided ample evidence for visual influences on parsing. The auditory materials were locally ambiguous sentences such as "Put the apple on the towel in the box", where the temporarily ambiguous phrase "on the towel" can either modify the noun "apple" (the apple that is on the towel) or name the goal of "put" (put the apple onto the towel). The concurrent visual displays came in two conditions (Figure 1): a one-referent context (left panel) and a two-referent context (right panel). The hypothesis was that the two contexts would induce different syntactic choices, that is, different interpretations of the ambiguous phrase, reflected in different eye-movement patterns. In the one-referent context, with only one candidate object, participants should tend to interpret "on the towel" as the goal of "put" and make more erroneous fixations on the towel; in the two-referent context, with two candidate objects, participants must select one as the recipient of the action and should more often parse "on the towel" as a modifier of "apple", producing fewer erroneous fixations on the towel. The results confirmed the hypothesis: early in parsing, erroneous fixations were significantly more frequent in the one-referent than in the two-referent context, showing that the visual context participated in the early stage of syntactic processing and removed the processing difficulty of the temporarily ambiguous sentence. The authors concluded that parsing is not immune to other information as modularity theory claims: information provided by the visual context can immediately influence the brain's choice of syntactic structure and be applied online to resolving structural ambiguity. This converges with Altmann et al. (1992), with "linguistic context" replaced by "visual context", and both support constraint-based theories.
Figure 1. Visual stimuli used by Tanenhaus et al. (1995). The left panel shows the one-referent context and the right panel the two-referent context; while viewing the display, participants heard the locally ambiguous sentence "Put the apple on the towel in the box."
Many studies have replicated and extended this finding with similar paradigms, some presenting pictures on a computer screen as in the original study and others using real objects instead. For example, a study comparing children and adults in integrating visual and linguistic information found that adults can effectively combine linguistic information (e.g., lexical information) with referent (visual) information to remove temporary ambiguity in a sentence, whereas children rely only on the semantic and syntactic information in the spoken sentence for comprehension and make very limited use of visual information (Snedeker & Trueswell, 2004).
Beyond visual information per se, other non-linguistic information such as object affordances, events, and episodic memory also interacts with language processing (Chambers & Juan, 2008; Chambers, Tanenhaus, & Magnuson, 2004; Lee, Chambers, Huettig, & Ganea, 2017; Leonard & Chang, 2014; Milburn, Warren, & Dickey, 2015). For example, Chambers et al. (2004) used the VWP to examine how non-linguistic information, specifically affordances (the possibilities for action that properties of the environment offer an organism; Eysenck & Keane, 2000), affects the parsing of locally ambiguous sentences. Participants heard the instruction "Pour the egg in the bowl over the flour", where "in the bowl" can either modify the noun "egg" (the egg that is in the bowl) or name the goal of "pour" (pour the egg into the bowl), while viewing one of two real-object scenes and acting on the objects as instructed (Figure 2). In one condition, both the competitor and target eggs were in liquid form (both could be poured onto the flour, affording "pour"); in the other, only one egg was liquid. Erroneous fixations on the bowl were more frequent in the second condition, where participants were more likely to interpret "in the bowl" as the goal of the action, showing that action-related non-linguistic information affects early syntactic processing.
Episodic memory likewise affects language processing (Chambers & Juan, 2008; van Bergen & Flecken, 2017). In Chambers and Juan (2008), participants viewed a display like Figure 3 while hearing three instructions: "Move the chair to area two", "Now move/return the chair to area five", and "Now move the square to area seven." The second, critical instruction had two conditions, "move" and "return"; the "return" condition requires the episodic memory created by the first instruction. On hearing "return", participants made anticipatory saccades toward area five, where the chair had been before being moved; no such saccades occurred in the "move" condition. The experiment shows that listeners' anticipation is based not only on object properties: episodic memories created by the visual scene also affect sentence processing. These findings provide further evidence that non-linguistic information influences sentence comprehension.
Moreover, not only static visual information but also dynamic events affect spoken language comprehension (Hafri, Trueswell, & Strickland, 2018; Knoeferle & Guerra, 2016; Knoeferle, Crocker, Scheepers, & Pickering, 2005). Knoeferle et al. (2005) used the VWP to examine whether events depicted in pictures can affect thematic-role assignment in spoken sentences, that is, the assignment of agent and patient roles during sentence processing. Participants viewed a visual event (Figure 4) in which a princess was both washing a pirate and being painted by a fencer (so the princess could be either agent or patient), while hearing one of two instructions: "The princess is apparently washing the pirate" (princess as agent) or "The princess is apparently painted by the fencer" (princess as patient). In the former condition, on hearing the verb "washing" participants made more anticipatory eye movements toward the pirate; in the latter, they looked more toward the fencer. Participants had thus already extracted the thematic-role structure of the event from the visual scene, so that role assignment was completed as soon as the verb appeared. The authors concluded that thematic-role information extracted from the depicted event accelerates thematic-role assignment in spoken comprehension: a visual scene depicting an event facilitates the process of spoken language comprehension.
Figure 4. Example visual stimulus used by Knoeferle et al. (2005). The scene contains three characters: a pirate on the left; in the middle, a princess holding a bucket and washing the pirate; and on the right, a fencer holding a brush and painting the princess.
In sum, not only static pictures and real scenes but also dynamic event information affects how we process auditorily presented language. The influence operates at the single-word level, on syntactic choice during processing, and even on the assignment of agent and patient thematic roles. The "encapsulation" claimed by modularity theory has thus been challenged by a wide range of VWP studies: language processing is not independent of other information but interacts with it dynamically and in real time. These comprehension findings all support constraint-based theories, in which language processing is influenced and constrained online by many other sources of information.
Just as "listening" to others and understanding their speech is affected by visual information, our "speaking" is likewise affected by the current visual display or scene. Speech production mostly occurs against a particular visual background: speakers must locate objects in the scene while also extracting their visual features and associated linguistic information. Speakers have been found to fixate a relevant object about 900 ms before naming it (Griffin & Bock, 2000); visual and linguistic processing are tightly coupled, and the whole production process requires cross-modal cooperation.
Different features of visual stimuli have been found to affect speech production. For example, low-level visual features influence language processing (Ostarek & Huettig, 2017). Using a picture-naming task, Rossion and Pourtois (2004) found that color affects picture naming: colored pictures were recognized and named faster than black-and-white line drawings, and naming agreement was higher for colored objects. Coco and Keller (2009), using the VWP with real scenes and varying scene complexity and the number of depicted people, examined how the complexity and character of visual information affect production: the more complex the scene and the more people it contained, the more time participants needed to produce a sentence. Subliminal visual stimuli also affect production: Gleitman, January, Nappa, and Trueswell (2007) presented a brief (60-75 ms) attention-capture cue (a black square) at a target location before the scene appeared and found that, although participants reported not noticing the cue, the character appearing at that location was more likely to be realized as the sentence subject.
Coco and Keller (2012) observed the relation between visual scenes and production more directly. Earlier multimodal work had found that two identical scenes elicit very similar scan patterns, whereas two different scenes do not. The authors therefore asked participants to produce a scene-related sentence from a cue (an object contained in the scene) and examined the mechanism coordinating scene scan patterns with sentence production. Across the planning, encoding, and articulation stages of production, similarity of scan patterns correlated highly with similarity of the produced sentences: when the scan paths over a scene were alike, the sentences produced were alike as well. Experiment 4 of Ferreira, Foucart, and Engelhardt (2013) used a production task to ask what information the preview phase of the VWP gives participants: participants viewed the scene and guessed the content of the upcoming instruction within a fixed time, and their guesses were correct significantly more often than chance. These studies all reflect the interaction between visual information and language production.
Another very important question in speech production is how we extract semantic information from pictures. The same question is a challenge for artificial intelligence: how can a computer "describe what it sees"? Vaidyanathan, Prud'hommeaux, Alm, Pelz, and Haake (2015), from the computer science community, tested dermatologists with classic dermatological images, aiming to build a corpus from which computers could "learn" to extract the semantic information in images. Each expert described 29 dermatological images while their eye movements and speech were recorded. The authors aligned the eye-movement and speech data into two strictly matched streams forming a "bimodal corpus", with gaze data as visual units and recordings as linguistic units, and used machine translation techniques to annotate the images semantically. The trained translation module could generate the corresponding diagnosis (linguistic units) from the gaze data (visual information). This shows a semantic link between vision and language: different images yield different semantically informative gaze patterns during viewing, from which the corresponding linguistic units can be predicted well.
In verbal communication, a complex form of language processing, the participation of visual information is especially important. Interlocutors' visual attention not only shifts with the partner's speech and with objects in the visual context but also affects the partner's state, thereby influencing language processing. Visual cues such as interlocutors' mouth movements, facial expressions, feedback, and gaze shifts all affect both parties' perceptual states and interact with their language processing, influencing syntactic processing, thematic-role assignment, and other processes (Carminati & Knoeferle, 2013; Garoufi, Staudte, Koller, & Crocker, 2016; Knoeferle & Kreysa, 2012; Kreysa, Knoeferle, & Nunnemann, 2014). For example, Carminati and Knoeferle (2013) found that a speaker's emotional facial expression affects the listener's visual attention and language comprehension. Capturing the partner's perspective helps interlocutors comprehend the speech better and plan their subsequent utterances better (Tanenhaus & Brown-Schmidt, 2008). Knoeferle and Kreysa (2012) found that listeners can predict which word a speaker is about to mention from shifts in the speaker's gaze.
Perceptual information jointly available to both interlocutors plays an important role in verbal communication. Visually presented objects can be seen by both parties at once, forming a shared visual region, and people have been found to apply this shared visual information immediately to ongoing cognitive processing. For example, Allopenna, Magnuson, and Tanenhaus (1998) had previously found that on hearing the target word "beaker" over headphones, participants looked at the onset competitor "beetle". Interestingly, Tanenhaus and Brown-Schmidt (2008) turned the procedure into an interactive dialogue, with the speaker addressing the listener directly rather than over headphones, and with both parties able to see the same set of objects. Under these conditions, the shared visual information constrained both parties' perceptual states, restricting the referential domain to the displayed objects, and the phonological competition effect disappeared; the authors concluded that the visual information both parties saw during the dialogue shaped their comprehension. Brown-Schmidt, Tanenhaus, and colleagues have studied speaker and listener perspectives extensively, showing that interlocutors' coordinated state facilitates language processing, and that this coordination is in most cases supported by jointly perceived visual information.
Visual information can provide predictive cues for language comprehension in real time and improve communicative efficiency. Setting visual scenes aside, Huettig (2015) argued that an important function of prediction in language processing is precisely to make communication between interlocutors more efficient. For example, interlocutors frequently complete each other's utterances, showing that one party predicts the other's upcoming speech (Clark & Wilkes-Gibbs, 1986). The presence of a visual scene raises this efficiency further: researchers using the VWP to examine prediction in language processing have found eye movements toward the target object before the target word is spoken (Altmann & Kamide, 1999; Altmann, 2004; Altmann & Kamide, 2009; Hintz, Meyer, & Huettig, 2017; Trueswell & Thompson-Schill, 2016; Staub, Abbott, & Bogartz, 2012). For example, Altmann and Kamide (1999) had participants hear "The boy will eat/move the cake" while viewing a scene containing the boy, the target cake, and distractors. Already on hearing "eat", participants launched saccades to the cake, which then received more fixations than the distractors. These target fixations do not stem solely from analyzing the verb's features, that "eat" must be followed by something edible, but from the joint contribution of the visual scene and linguistic representations. The scene first provides cues that establish visual representations of the various objects, which then ground the prediction of specific upcoming words, so that interlocutors can use the visual information in the scene to better anticipate each other's speech.
In sum, during production, speakers extract semantic information from the scene, so production is affected by the visual display. Not only do visual factors such as color and scene complexity affect production; the speaker's emotional face, gaze shifts, and other visual cues also affect the listener's perceptual state and hence language processing. Moreover, the visual scene grounds the interlocutors' exchange, supports prediction of each other's upcoming speech, and improves communicative efficiency.
The studies above show that language processing is not an isolated, unimodal process but the outcome of interacting sources of information. Visual, tactile, and auditory modalities all affect language processing. Vision, the dominant channel through which humans receive external information, assists language processing in real time by resolving the ambiguity in ambiguous sentences, assigning thematic roles, and predicting upcoming words. These influences, however, are not all facilitative.
First, the presence of visual information can change how language would otherwise be understood. Pickering, Garrod, and McElree (2004) pointed out that the pictures presented in the VWP alter the comprehension process. In their example, listeners hear "In the morning Harry let out his dog Fido. In the evening he returned to find a starving beast." For sentence comprehension, "beast" refers to Fido, the dog mentioned in the first sentence; but if a picture of a tiger is also displayed, listeners may look more at the tiger on hearing "beast": the visual display changes the interpretation of the language. The printed-word version of the VWP shows this influence even more clearly. Salverda and Tanenhaus (2010) replaced the objects in the display with printed words, presenting the target "bead", the competitor "bear", and unrelated items while the target "bead" was presented auditorily; fixations on the competitor significantly exceeded those on unrelated items, a very robust competition effect. Pickering et al. (2004) likewise questioned whether such competition effects arise because the visually displayed words affect recognition of the target word, or are simply competition within language processing itself.
Second, because of its high salience, the visual display can affect the mental simulation of linguistic representations. Altmann and Kamide (2009) asked whether mental representations change dynamically as the language changes. Participants heard "The woman will/is too lazy to put the glass onto the table. Then, she will pick up the bottle, and pour the wine carefully into the glass" while viewing a scene (Figure 5). Fixation probability on the table was significantly higher in the moved-glass condition than in the unmoved condition, indicating that the language triggered a mental simulation in which the glass had been relocated to the table. Yet in both conditions the depicted glass attracted more fixations than the table, suggesting a competition between the linguistic and visual representational systems in which the more salient visual display wins out, drawing more fixations than the mentally simulated location of the "table". In their Experiment 2, the scene was replaced by a gray screen while the sentence was presented; fixation probability on the table's location then far exceeded that on the glass's, and the mental simulation of the linguistic representation dominated.
Finally, the visual display narrows the referential scope of particular words in the sentence. In traditional language-processing experiments, spreading-activation models hold that hearing a word can activate all related words in the mental lexicon. With a visual scene or picture present, however, the activated candidates are restricted to the few objects depicted. In Altmann and Kamide (1999), for instance, the instruction was "The boy will eat the cake" and the picture contained only one edible object, the cake, which was fixated immediately upon hearing "eat". Visual information thus constrains which lexical entries can be activated during language processing and cannot reflect the structure of the entire mental lexicon.
The sections above summarized the many ways in which visual information affects language processing: in both spoken comprehension and production there is cross-modal interaction between visual information and language. Why visual information affects language processing, and what role it plays in it, are questions whose exploration can help reveal the mechanisms of cross-modal integration of visual and linguistic information. This section attempts to review and discuss these issues.
First, visual information can serve as external memory for the brain, reducing the cognitive resources that language processing consumes. Findlay and Gilchrist (2003) distinguished two modes of visual representation, passive vision and active vision. On the passive view, understanding a visual image is passive: viewed images enter as visual input and are stored in the brain as internal representations for later use. On the active view, understanding the visual display is active, and its key characteristic is not storage but overt directed fixation: the brain redirects attention to the target location so as to acquire more precise visual information through foveal fixation. The crucial difference between the two views is whether visual attention shifts during subsequent processing. Passive vision stores the display in the brain as an internal representation, so later retrieval requires no refixation and attention shifts are covert; active vision does not store the display, only location information, and later processing retrieves visual information through overt attention shifts. Findlay and colleagues argued that visual processing follows the active mode. Huettig, Gaskell, and Quinlan (2004) likewise held that the active mode fits the economy principle of the cognitive system: the visual system need not store large amounts of visual information but instead treats the external world as the brain's external memory. On this view, the brain stores only an object's spatial location, which serves as a pointer; when language processing needs the corresponding visual information, the pointer directs processing to the specific location to retrieve it, greatly reducing the cognitive resources consumed.
Studies using the VWP corroborate this view. In the VWP there is a preview of the display before the auditory stimulus, and hearing the target word consistently triggers an attention shift to the target object. If the display were stored in the brain, the information could be retrieved without eye movements; this attention shift indicates precisely that the brain treats the external world as external memory. Altmann (2004) modified the paradigm so that the display disappeared after preview and a blank screen was shown while the auditory stimulus played, the "blank screen paradigm". Participants still looked at the location where the target object had been, a result consistent with the hypothesis above: they had stored the objects' spatial locations and used the visual display as the brain's external memory. Moreover, under the blank screen paradigm, objects semantically related to the target word also elicit eye movements toward their former locations (De Groot, Huettig, & Olivers, 2016). Richardson and Spivey (2000) proposed that this storage of spatial location is implemented by the visual system through oculomotor coordinates: rather than recording the whole scene directly, the visual system directs the eyes to the corresponding coordinates to retrieve the corresponding part of the scene. Vision and language processing may thus form a system in which language points to the relevant location, and the specific information about that location is extracted only when the eyes arrive there.
Second, just as linguistic information can affect how we categorize objects, visual information can shape language processing during language acquisition. Many researchers now emphasize the multimodal character of language processing in infants and adults (Mani & Schneider, 2013; Yeung & Nazzi, 2014; Yeung & Werker, 2009). Language processing is a high-level cognitive process; by comparison, visual information occupies an even more central place in early infant development. Studies of children's word recognition show that on hearing a word, young children retrieve perceptual information about the object associated with it (Arias-Trejo & Plunkett, 2009; Johnson & Huettig, 2011; Johnson, McQueen, & Huettig, 2011; Mani, Johnson, McQueen, & Huettig, 2013). Children even activate a target word's shape information before the word occurs, shown by increased fixations on objects similar in shape to the target (Bobb, Huettig, & Mani, 2016). Yeung and Werker (2009) found that merely teaching infants associations between two differently shaped objects and two sounds helped the infants better discriminate the sounds. These studies show that visual information plays an important role in language acquisition and processing: joint activation of perceptual and auditory-linguistic information helps children find the matching object in their scene more quickly when they hear a word. Many VWP studies of the development of children's language comprehension find that although children can use visual information to help discriminate sounds or assist acquisition, their visual-linguistic integration still differs from adults' (Bunger, Skordos, Trueswell, & Papafragou, 2016; Huang & Snedeker, 2009, 2011; Melissa, Snedeker, & Schulz, 2017). For example, children differ markedly from adults in conceptual representation and syntactic disambiguation (Pluciennicka, Coello, & Kalénine, 2016), and in second language acquisition, L2 speakers show patterns of influence different from native speakers' (Ito, Pickering, & Corley, 2018; Noh & Lee, 2017; Pozzan & Trueswell, 2016).
Finally, visual information can remove or reduce processing difficulty in sentence comprehension. Part 2 of this paper listed several demonstrations that visual information can remove or reduce the processing difficulty of ambiguous sentences; for example, Tanenhaus et al. (1995) used the visual context to resolve temporary ambiguity, with fewer erroneous fixations in the two-referent context. We suggest that surprisal theory can account well for the effect of visual information on syntactic disambiguation, and surprisal has already been used to explain syntactic choice strategies in sentence processing (Staub & Clifton, 2006). Surprisal, a concept introduced by the computational linguist Hale (2001), describes the processing difficulty, or cognitive load, incurred on encountering a word during sentence comprehension; the magnitude of surprisal determines the difficulty of syntactic processing. In the locally ambiguous sentence "Put the apple on the towel in the box" used by Tanenhaus and colleagues, the ambiguity of "on the towel" creates processing difficulty at "in". The reduced probability of erroneous fixations in the two-referent display suggests that the visual context lowered the surprisal of the prepositional phrase "in the box"; the visual context's influence on parsing strategy can thus be viewed as a process of reducing processing difficulty. Unfortunately, because visual context is hard to quantify, computing the resulting change in surprisal is a major challenge, and such studies are very rare; to our knowledge, only the influence of world knowledge on sentence surprisal has been examined (Venhuizen, Brouwer, & Crocker, 2016).
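Surprisal itself is straightforward to state: the surprisal of a word is the negative log probability of that word given its context. The sketch below uses made-up probabilities purely for illustration; in Hale's (2001) formulation the probabilities come from a probabilistic grammar, and a visual context favoring the modifier reading would raise P("in" | context) and thus lower the surprisal at "in".

```python
import math

# Hypothetical conditional probabilities P(next word | context), standing in for
# what a probabilistic grammar would supply in Hale's (2001) formulation.
p_next = {
    ("put the apple on the towel", "in"): 0.05,  # goal reading makes "in" unexpected
    ("put the apple on the towel", "."): 0.60,   # ending the sentence is expected
}

def surprisal(context, word, p_table):
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(p_table[(context, word)])

s_in = surprisal("put the apple on the towel", "in", p_next)
print(round(s_in, 2))  # -log2(0.05) = 4.32 bits: a high-surprisal, hard word
```

On this view, the two-referent display acts by redistributing probability mass toward the modifier continuation, so the same word "in" arrives with lower surprisal and causes less difficulty.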
In short, visual information plays a very important role in language processing, not only in adults but across the stages of children's language development. Because of its very high salience, visual information not only supplies other cognitive processes with abundant information to build on but also participates in cognitive processing in real time. For language processing: first, visual information can serve as the brain's external memory and lower the cognitive load of language processing, since hearing a word lets one fetch more precise information from the corresponding location, increasing the interplay between vision and language; second, visual information greatly facilitates children's language acquisition, helping them match words to the objects in their world; third, it helps reduce the processing difficulty encountered in sentence comprehension; finally, as Part 3 showed regarding prediction and verbal communication, it helps us predict the words that will come next in a sentence and helps interlocutors communicate, reducing the burden of production when both are situated in the same scene. These are reasons for integrating visual information with language processing, but there is as yet no definitive explanation of why vision and language interact; this section has aimed to offer some leads toward revealing the mechanism of this cross-modal integration.
Theoretical assumptions and experimental evidence are the basis of practical application. As technology develops, artificial intelligence appears in ever more aspects of life, and how to apply theoretical foundations in technological practice is an important current problem. Many researchers have begun to simulate cross-modal interaction by building computational models in order to advance artificial intelligence. A survey of current computational models of how visual information affects language processing therefore serves both a fuller understanding of the interaction mechanism and greater attention to practical application. Most current models simulate the word-level visual world paradigm, aiming to reveal the time course of the activation of semantic, phonological, orthographic, and visual features during spoken language comprehension. Computational simulation of such studies is relatively mature, and several studies have reproduced VWP findings with existing models and obtained reliable fits (McClelland, Mirman, Bolger, & Khaitan, 2014; Smith, Monaghan, & Huettig, 2013, 2014, 2017). The established models for the word-level VWP include the working memory model (Huettig et al., 2011), the Hub-and-Spoke model (H&S; Dilkina, McClelland, & Plaut, 2010; Smith et al., 2013), and a neural network model, the simple recurrent network (SRN; Elman, 1990).
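The SRN architecture named above is simple enough to sketch. The code below is a minimal, untrained forward pass with arbitrary layer sizes (all dimensions and weights are illustrative assumptions, not from any cited model): the hidden state at each step is copied to a context layer and fed back in at the next step, which is what lets the network carry sequence history.

```python
import numpy as np

rng = np.random.default_rng(0)

class SimpleRecurrentNetwork:
    """Minimal Elman (1990) SRN: the hidden layer at time t-1 is copied to a
    context layer and combined with the new input at time t (forward pass only)."""
    def __init__(self, n_in, n_hidden, n_out):
        self.W_ih = rng.normal(0, 0.1, (n_hidden, n_in))      # input  -> hidden
        self.W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_ho = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
        self.context = np.zeros(n_hidden)

    def step(self, x):
        h = np.tanh(self.W_ih @ x + self.W_ch @ self.context)
        self.context = h                  # copy hidden state for the next time step
        y = np.exp(self.W_ho @ h)
        return y / y.sum()                # softmax over candidate next units

net = SimpleRecurrentNetwork(n_in=4, n_hidden=8, n_out=4)
for t in range(3):                        # feed a 3-step sequence of one-hot "words"
    out = net.step(np.eye(4)[t])
print(round(out.sum(), 6))                # 1.0: output is a probability distribution
```

In VWP simulations such a network is trained so that its output distribution over candidate referents, given the unfolding phonological input, can be compared with fixation proportions; the untrained sketch here only shows the recurrence itself.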
These models, however, focus on how language guides visual attention and do not reveal how visual information affects language processing. To our knowledge, no psycholinguistic model yet exists of how visual context affects the process of syntactic disambiguation. Some computational linguists have tried to simulate such experiments by interfacing visual and auditory linguistic information at the semantic level, but mostly at a descriptive level (Baumgärtner, Beuck, & Menzel, 2012; McCrae, 2009). For example, Venhuizen et al. (2016) built a vector model to examine the influence of world knowledge on processing difficulty in language comprehension, but their analysis mainly converted world knowledge into the temporal order of events in order to estimate the probability of each word in the sentence under different orders, without proposing a systematic model simulating the whole process. The main difficulty for such models is quantifying high-level information like "context", which makes computational simulation relatively hard. Below we briefly introduce one model of how visual information affects language processing.
The ultimate goal of the McCrae model is a parser that incorporates visual context, that is, syntactic parsing performed under the influence of visual information. Building the model requires, first, a parser to analyze the sentence's syntax and, second, an interface on that parser through which visual information can be input. The author and collaborators had previously built a weighted constraint dependency parser (WCDG), which provides a general interface for various kinds of non-linguistic information and is therefore well suited to studying the influence of visual information on syntactic processing. Using the WCDG and taking Jackendoff's theory as the foundation, the author constructed the model to simulate the interaction between visual information and syntactic parsing. The model consists of three modules: a linguistic module, a conceptual structure module, and a visual perception module (Figure 6).
Because visual information is hard to quantify, the model handles the visual module by reducing the event the visual information describes to a thematic-role assignment, that is, the assignment of agent and patient roles. In the linguistic module, the author used ambiguous German sentences of the "who did what to whom" type; the WCDG receives the role-assignment information stream input from the visual module, matches it against the role-assignment information in the linguistic module, and finally outputs the model's parse. In actual simulations, parsing an ambiguous sentence with the linguistic module alone and with the visual-context module added yielded different results: adding visual information changed the syntactic choice itself, successfully simulating this class of phenomena.
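The matching step described above can be caricatured in a few lines (a deliberately simplified sketch with hypothetical data structures; the actual model is a weighted constraint dependency parser, not a hard-match rule): each candidate parse of a role-ambiguous sentence carries an agent/patient assignment, and the parse whose assignment best matches the role assignment extracted from the depicted event is selected.

```python
# Two candidate parses of a role-ambiguous "who did what to whom" sentence,
# e.g. one in which the princess is the agent and one in which she is the patient.
candidate_parses = [
    {"agent": "princess", "patient": "fencer"},  # default subject-first reading
    {"agent": "fencer", "patient": "princess"},  # reading licensed by the scene
]

# Role assignment extracted from the depicted event (hypothetical visual-module output).
visual_event = {"agent": "fencer", "patient": "princess"}

def select_parse(parses, event):
    """Prefer the parse whose thematic-role assignment matches the visual event."""
    def overlap(parse):
        return sum(parse[role] == event.get(role) for role in parse)
    return max(parses, key=overlap)

chosen = select_parse(candidate_parses, visual_event)
print(chosen["agent"])  # fencer: the visual event overrides the default reading
```

In the real model this influence is soft, implemented as weighted constraints that bias rather than dictate the parse, but the sketch captures the direction of the effect: visual role information can flip the syntactic analysis of an ambiguous sentence.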
This review has surveyed research on how visual information affects language processing, summarizing its manifestations in spoken comprehension, speech production, and verbal communication. Overall, language processing does not proceed independently, and the "encapsulated" syntactic module of modularity theory has been challenged. VWP studies of spoken comprehension and production find that visual scenes, action properties, episodic memory, and events can affect language processing immediately: language processing is the outcome of the real-time interaction of information converging from different modalities. The visual scene not only can serve as the brain's external memory and reduce the cognitive resources language processing consumes, but also facilitates language acquisition, lowers the processing difficulties encountered during language processing, raises processing efficiency, and supports the process of verbal communication.
Just as many researchers have studied how visual information affects language processing, many others have examined how language processing guides visual attention; studying and resolving the interaction between vision and language is a key step toward revealing human cross-modal integration. Many questions in this field remain to be solved. Future research should center on three broad problems: revealing the internal mechanisms of visual-linguistic integration, using existing theory to guide audiovisual integration in children's language development, and advancing artificial intelligence. Solving them will greatly deepen our overall understanding of human cognition.
First, revealing the internal mechanisms of visual-linguistic integration. Studies have begun to address the neural mechanisms of cross-modal interaction. For example, Hagoort (2005) proposed a neural architecture of language processing centered on Broca's area, describing the neural integration of phonology, syntax, and semantics, and emphasized the important role of the left inferior frontal gyrus (LIFG) in integrating non-linguistic information (e.g., gesture) with linguistic information. Using event-related fMRI, Peeters, Snijders, Hagoort, and Özyürek (2017) likewise found that the LIFG and bilateral middle temporal gyri play an important role in the interaction between spoken language and visual context. These studies, however, examined only the neural mechanisms of mapping heard words onto objects; the neural mechanisms of the interaction of visual information with syntactic and semantic processing will be a very important topic for future research.
Second, using existing findings to guide audiovisual integration in children's language development. Multiple studies show that infant language processing is likewise multimodal and that visual information can shape language processing, but children's visual-linguistic integration still differs from adults' and they cannot yet exploit visual information well in language processing. How to use existing findings and theory to train and intervene in children's language acquisition, so as to improve their processing efficiency and promote cognitive development, is therefore especially important.
Third, vision is one sensory channel among several, and the dominant channel for human information intake; for blind or visually impaired people, hearing and touch are the most direct and effective. Revealing the mechanism of the interaction between vision and language processing can therefore help advance research on interactions between other sensory modalities and language. For example, in some situations visual information functions as context, and the same information presented through another modality might affect language processing in the same way, to the same effect. Aimed at improving the quality of life of these groups, such research has broad prospects.
Fourth, artificial intelligence is developing rapidly with modern information technology, gradually entering modern life and finding wide application across industries. Through multimodal integration, AI can realize more and fuller functions; yet models of the interaction between visual information and language processing remain a weak point, and current computational models have not solved how to quantify visual information and match it rapidly to language. Revealing the mechanism of visual-linguistic interaction can show how such information jointly achieves audiovisual integration, providing a scientific basis for the further development of artificial intelligence.
Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models,
Language-mediated eye movements in the absence of a visual world: The “blank screen paradigm.”,
Avoiding the garden path: Eye movements in context,
Pragmatic factors, such as referential context, influence the decisions of the syntactic processor. At issue, however, has been whether such effects take place in the first or second pass analysis of the sentence. It has been suggested that eye movement studies are the only appropriate means for deciding between first and second pass effects. In this paper, we report two experiments using ambiguous relative/complement sentences and unambiguous controls. In Experiment 1 we show that referential context eliminates all the first pass reading time differences that are indicative of a garden path to the relative continuation in the null context. We observe, however, that the context does not eliminate the increased proportion of regressions from that disambiguating continuation. We therefore introduce a regression-contingent analysis of the first pass reading times and show that this new measure provides an important tool for aiding in the interpretation of the apparently conflicting data. Experiment 2 investigated whether the results of Experiment 1 were an artifact of the kinds of questions about the contexts that were asked in order to encourage subjects to attend to the contexts. The results demonstrated that the use of explicity referential questions had little effect. There was some small evidence for a garden path effect in this second experiment, but the regression-contingent measure enabled us to locate all garden path effects in only a small proportion of trials and to conclude that context does influence the initial decisions of the syntactic processor.
Incremental interpretation at verbs: Restricting the domain of subsequent reference,
Participants' eye movements were recorded as they inspected a semi-realistic visual scene showing a boy, a cake, and various distractor objects. Whilst viewing this scene, they heard sentences such as 'the boy will move the cake' or 'the boy will eat the cake'. The cake was the only edible object portrayed in the scene. In each of two experiments, the onset of saccadic eye movements to the target object (the cake) was significantly later in the move condition than in the eat condition; saccades to the target were launched after the onset of the spoken word cake in the move condition, but before its onset in the eat condition. The results suggest that information at the verb can be used to restrict the domain within the context to which subsequent reference will be made by the (as yet unencountered) post-verbal grammatical object. The data support a hypothesis in which sentence processing is driven by the predictive relationships between verbs, their syntactic arguments, and the real-world contexts in which they occur.
Discourse- mediation of the mapping between language and the visual world: Eye movements and mental representation,
Lexical-semantic priming effects during infancy,
When and how do infants develop a semantic system of words that are related to each other? We investigated word-word associations in early lexical development using an adaptation of the inter-modal preferential looking task where word pairs (as opposed to single target words) were used to direct infants' attention towards a target picture. Two words (prime and target) were presented in quick succession after which infants were presented with a picture pair (target and distracter). Prime-target word pairs were either semantically and associatively related or unrelated; the targets were either named or unnamed. Experiment 1 demonstrated a lexical-semantic priming effect for 21-month olds but not for 18-month olds: unrelated prime words interfered with linguistic target identification for 21-month olds. Follow-up experiments confirmed the interfering effects of unrelated prime words and identified the existence of repetition priming effects as young as 18 months of age. The results of these experiments indicate that infants have begun to develop semantic-associative links between lexical items as early as 21 months of age.
An architecture for incremental information fusion of cross- modal representations,
We present an architecture for natural language processing that parses an input sentence incrementally and merges information about its structure with a representation of visual input, thereby changing the results of parsing. At each step of incremental processing, the elements in the context representation are judged whether they match the content of the sentence fragment up to that step. The information contained in the best matching subset then influences the result of parsing the subsentence. As processing progresses and the sentence is extended by adding new words, new information is searched in the context to concur with the expanded language input. This incremental approach to information fusion is highly adaptable with regard to the integration of dynamic knowledge extracted from a constantly changing environment.
Chapter 42-Audiovisual speech integration: Neural substrates and behavior, (
Speech perception is multisensory, making use of information from both the auditory modality (the talker’s voice) and the visual modality (the talker’s face). This chapter describes recent advances in our understanding of the neural processing of audiovisual speech, driven by studies using blood-oxygen level-dependent functional magnetic resonance imaging (BOLD fMRI), electrocorticography, causal inference modeling, and transcranial magnetic stimulation. An area of special importance is the posterior superior temporal sulcus and adjacent superior temporal gyrus and middle temporal gyrus. An audiovisual speech illusion known as the McGurk effect, in which incongruent auditory and visual syllables are perceived as a third syllable, has been a useful tool for interrogating the cortical speech network. There is a previously unappreciated level of intersubject and interstimulus variability in the behavioral and neural responses to this illusion.
Human brain language areas identified by functional magnetic resonance imaging,
Abstract Functional magnetic resonance imaging (FMRI) was used to identify candidate language processing areas in the intact human brain. Language was defined broadly to include both phonological and lexical-semantic functions and to exclude sensory, motor, and general executive functions. The language activation task required phonetic and semantic analysis of aurally presented words and was compared with a control task involving perceptual analysis of nonlinguistic sounds. Functional maps of the entire brain were obtained from 30 right-handed subjects. These maps were averaged in standard stereotaxic space to produce a robust "average activation map" that proved reliable in a split-half analysis. As predicted from classical models of language organization based on lesion data, cortical activation associated with language processing was strongly lateralized to the left cerebral hemisphere and involved a network of regions in the frontal, temporal, and parietal lobes. Less consistent with classical models were (1) the existence of left hemisphere temporoparietal language areas outside the traditional "Wernicke area," namely, in the middle temporal, inferior temporal, fusiform, and angular gyri; (2) extensive left prefrontal language areas outside the classical "Broca area"; and (3) clear participation of these left frontal areas in a task emphasizing "receptive" language functions. Although partly in conflict with the classical model of language localization, these findings are generally compatible with reported lesion data and provide additional support for ongoing efforts to refine and extend the classical model.
Predicting visual information during sentence processing: Toddlers activate an object’s shape before it is mentioned,
We examined the contents of language-mediated prediction in toddlers by investigating the extent to which toddlers are sensitive to visual shape representations of upcoming words. Previous studies with adults suggest limits to the degree to which information about the visual form of a referent is predicted during language comprehension in low constraint sentences. Toddlers (30-month-olds) heard either contextually constraining sentences or contextually neutral sentences as they viewed images that were either identical or shape related to the heard target label. We observed that toddlers activate shape information of upcoming linguistic input in contextually constraining semantic contexts; hearing a sentence context that was predictive of the target word activated perceptual information that subsequently influenced visual attention toward shape-related targets. Our findings suggest that visual shape is central to predictive language processing in toddlers.
Real-time investigation of referential domains in unscripted conversation: A targeted language game approach,
Two experiments examined the restriction of referential domains during unscripted conversation by analyzing the modification and online interpretation of referring expressions. Experiment 1 demonstrated that from the earliest moments of processing, addressees interpreted referring expressions with respect to referential domains constrained by the conversation. Analysis of eye movements during the conversation showed elimination of standard competition effects seen with scripted language. Results from Experiment 2 pinpointed two pragmatic factors responsible for restriction of the referential domains used by speakers to design referential expressions and demonstrated that the same factors predict whether addressees consider local competitors to be potential referents during online interpretation of the same expressions. These experiments demonstrate, for the first time, that online interpretation of referring expressions in conversation is facilitated by referential domains constrained by pragmatic factors that predict when addressees are likely to encounter temporary ambiguity in language processing.
How children and adults encode causative events cross-linguistically: Implications for language production and attention,
It is well known that languages differ in how they encode motion. Languages such as English use verbs that communicate the manner of motion (e.g., climb, float), while languages such as Greek often encode the path of motion in verbs (e.g., advance, exit). In two studies with English- and Greek-speaking adults and five-year-olds, we ask how such lexical constraints are used in combination with...
Effects of speaker emotional facial expression and listener age on incremental sentence processing,
We report two visual-world eye-tracking experiments that investigated how and with which time course emotional information from a speaker's face affects younger (N = 32, Mean age = 23) and older (N = 32, Mean age = 64) listeners’ visual attention and language comprehension as they processed emotional sentences in a visual context. The age manipulation tested predictions by socio-emotional selectivity theory of a positivity effect in older adults. After viewing the emotional face of a speaker (happy or sad) on a computer display, participants were presented simultaneously with two pictures depicting opposite-valence events (positive and negative; IAPS database) while they listened to a sentence referring to one of the events. Participants' eye fixations on the pictures while processing the sentence were increased when the speaker's face was (vs. wasn't) emotionally congruent with the sentence. The enhancement occurred from the early stages of referential disambiguation and was modulated by age. For the older adults it was more pronounced with positive faces, and for the younger ones with negative faces. These findings demonstrate for the first time that emotional facial expressions, similarly to previously-studied speaker cues such as eye gaze and gestures, are rapidly integrated into sentence processing. They also provide new evidence for positivity effects in older adults during situated sentence processing.
Perception and presupposition in real-time language comprehension: Insights from anticipatory processing,
Recent studies have shown that listeners use verbs and other predicate terms to anticipate reference to semantic entities during real-time language comprehension. This process involves evaluating the denoted action against relevant properties of potential referents. The current study explored whether action-relevant properties are readily available to comprehension systems as a result of the embodied nature of linguistic and conceptual representations. In three experiments, eye movements were monitored as listeners followed instructions to move depicted objects on a computer screen. Critical instructions contained the verb return (e.g., Now return the block to area 3), which presupposes the previous displacement of its complement object, a property that is not reflected in perceptible or stable characteristics of objects. Experiment 1 demonstrated that predictions for previously displaced objects are generated upon hearing return, ruling out the possibility that anticipatory effects draw directly on static affordances in perceptual symbols. Experiment 2 used a referential communication task to evaluate how communicative relevance constrains the use of perceptually derived information. Results showed that listeners anticipate previously displaced objects as candidates upon hearing return only when their displacement was known to the speaker. Experiment 3 showed that the outcome of the original act of displacement further modulates referential predictions. The results show that the use of perceptually grounded information in language interpretation is subject to communicative constraints, even when language denotes physical actions performed on concrete objects.
Words and worlds: The construction of context for definite reference,
Actions and affordances in syntactic ambiguity resolution,
In 2 experiments, eye movements were monitored as participants followed instructions containing temporary syntactic ambiguities (e.g., "Pour the egg in the bowl over the flour"). The authors varied the affordances of task-relevant objects with respect to the action required by the instruction (e.g., whether 1 or both eggs in the visual workspace were in liquid form, allowing them to be poured). The number of candidate objects that could afford the action was found to determine whether listeners initially misinterpreted the ambiguous phrase ("in the bowl") as specifying a location. The findings indicate that syntactic decisions are guided by the listener's situation-specific evaluation of how to achieve the behavioral goal of an utterance.
The influence of syntactic category and semantic constraints on lexical ambiguity resolution: An eye movement study of processing Chinese homographs,
The purpose of the present study is twofold: (1) to examine whether the syntactic category constraint can determine the semantic resolution of Chinese syntactic category ambiguous words; and (2) to investigate whether the syntactic category of alternative meanings of Chinese homographs can influence the subordinate bias effect (SBE) during lexical ambiguity resolution. In the present study, four types of Chinese biased homographs (NN, VV, VN, and NV) were embedded into syntactically and semantically subordinate-biased sentences. Each homograph was assigned a frequency-matched unambiguous word as control, which could fit into the same sentence frame. Participants' eye movements were recorded as they read each sentence. In general, the results showed that in a subordinate-biased context, (1) the SBE for the four types of homograph was significant only in the second-pass reading on the post-target words and (2) numerically, the NV homographs revealed a larger effect size of SBE than VN homographs on both target and post-target words. Our findings support the constraint-satisfaction models, suggesting that the syntactic category constraint is not the only factor influencing the semantic resolution of syntactic category ambiguous words, which contradicts the prediction of the syntax-first models.
Referring as a collaborative process,
In conversation, speakers and addressees work together in the making of a definite reference. In the model we propose, the speaker initiates the process by presenting or inviting a noun phrase. Before going on to the next contribution, the participants, if necessary, repair, expand on, or replace the noun phrase in an iterative process until they reach a version they mutually accept. In doing so they try to minimize their joint effort. The preferred procedure is for the speaker to present a simple noun phrase and for the addressee to accept it by allowing the next contribution to begin. We describe a communication task in which pairs of people conversed about arranging complex figures and show how the proposed model accounts for many features of the references they produced. The model follows, we suggest, from the mutual responsibility that participants in conversation bear toward the understanding of each utterance.
The impact of visual information on reference assignment in sentence production,
Scan patterns predict sentence production in the cross-modal processing of visual scenes,
Most everyday tasks involve multiple modalities, which raises the question of how the processing of these modalities is coordinated by the cognitive system. In this paper, we focus on the coordination of visual attention and linguistic processing during speaking. Previous research has shown that objects in a visual scene are fixated before they are mentioned, leading us to hypothesize that the scan pattern of a participant can be used to predict what he or she will say. We test this hypothesis using a data set of cued scene descriptions of photo-realistic scenes. We demonstrate that similar scan patterns are correlated with similar sentences, within and between visual scenes; and that this correlation holds for three phases of the language production process (target identification, sentence planning, and speaking). We also present a simple algorithm that uses scan patterns to accurately predict associated sentences by utilizing similarity-based retrieval.
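The similarity-based retrieval idea in this abstract can be sketched as nearest-neighbour lookup over scan patterns. The region labels, sentences, and edit-based similarity measure below are illustrative assumptions, not the paper's actual encoding or algorithm:

```python
from difflib import SequenceMatcher

def scan_similarity(a, b):
    """Similarity (0..1) of two scan patterns, each a sequence of fixated-region labels."""
    return SequenceMatcher(None, a, b).ratio()

def predict_sentence(new_scan, corpus):
    """Retrieve the sentence paired with the most similar stored scan pattern."""
    _, best_sentence = max(corpus, key=lambda pair: scan_similarity(new_scan, pair[0]))
    return best_sentence

# Hypothetical (scan pattern, produced sentence) training pairs.
corpus = [
    (["man", "dog", "man"], "The man chases the dog."),
    (["dog", "man", "dog"], "The dog flees from the man."),
]
predict_sentence(["man", "dog", "man"], corpus)  # retrieves the first sentence
```

The sketch illustrates the general scheme only: a new speaker's scan pattern is matched against stored pattern-sentence pairs and the best match's sentence is returned.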
The control of eye fixation by the meaning of spoken language: A new methodology for the real-time investigation of speech perception, memory, and language processing,
Revisiting the looking at nothing phenomenon: Visual and semantic biases in memory search,
When visual stimuli remain present during search, people spend more time fixating objects that are semantically or visually related to the target instruction than fixating unrelated objects. Are these semantic and visual biases also observable when participants search within memory? We removed the visual display prior to search while continuously measuring eye movements towards locations previously occupied by objects. The target absent trials contained objects that were either visually or semantically related to the target instruction. When the overall mean proportion of fixation time was considered, we found biases towards the location previously occupied by the target, but failed to find biases towards visually or semantically related objects. However, in two experiments, the pattern of biases towards the target over time provided a reliable predictor for biases towards the visually and semantically related objects. We therefore conclude that visual and semantic representations alone can guide eye movements in memory search, but that orienting biases are weak when the stimuli are no longer present.
Are there mental lexicons? The role of semantics in lexical decision,
Highlights: • Lexical decision correlated with naming, WPM, and PPT, but not in item-specific ways. • LD also correlated with word reading and spelling; only the latter is item-specific. • Concept consistency governs semantic performance. • Spelling consistency governs LD performance. • Both spelling consistency and concept consistency affect reading and spelling.
Finding structure in time,
Language processing in the visual world: Effects of preview, visual complexity, and prediction,
This study investigates how people interpret spoken sentences in the context of a relevant visual world by focusing on garden-path sentences, such as Put the book on the chair in the bucket, in which the prepositional phrase on the chair is temporarily ambiguous between a goal and modifier interpretation. In three comprehension experiments, listeners heard these types of sentences (along with disambiguated controls) while viewing arrays of objects. These experiments demonstrate that a classic garden-path effect is obtained only when listeners have a preview of the display and when the visual context contains relatively few objects. Results from a production experiment suggest that listeners accrue knowledge that may allow them to have certain expectations of the upcoming utterance based on visual information. Taken together, these findings have theoretical implications for both the role of prediction as an adaptive comprehension strategy, and for how comprehension tendencies change under variable visual and temporal processing demands.
The modularity of mind
Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences,
Exploiting listener gaze to improve situated communication in dynamic virtual environments,
Beyond the observation that both speakers and listeners rapidly inspect the visual targets of referring expressions, it has been argued that such gaze may constitute part of the communicative signal. In this study, we investigate whether a speaker may, in principle, exploit listener gaze to improve communicative success. In the context of a virtual environment where listeners follow computer-generated instructions, we provide two kinds of support for this claim. First, we show that listener gaze provides a reliable real-time index of understanding even in dynamic and complex environments, and on a per-utterance basis. Second, we show that a language generation system that uses listener gaze to provide rapid feedback improves overall task performance in comparison with two systems that do not use gaze. Aside from demonstrating the utility of listener gaze in situated communication, our findings open the door to new methods for developing and evaluating multi-modal models of situated interaction.
On the give and take between event apprehension and utterance formulation,
Two experiments are reported which examine how manipulations of visual attention affect speakers' linguistic choices regarding word order, verb use and syntactic structure when describing simple pictured scenes. Experiment 1 presented participants with scenes designed to elicit the use of a perspective predicate (The man chases the dog/The dog flees from the man) or a conjoined noun phrase sentential Subject (A cat and a dog/A dog and a cat). Gaze was directed to a particular scene character by way of an attention-capture manipulation. Attention capture increased the likelihood that this character would be the sentential Subject and altered the choice of perspective verb or word order within conjoined NP Subjects accordingly. These effects occurred even though participants reported being unaware that their visual attention had been manipulated. Experiment 2 extended these results to word order choice within Active versus Passive structures (The girl is kicking the boy/The boy is being kicked by the girl) and symmetrical predicates (The girl is meeting the boy/The boy is meeting the girl). Experiment 2 also found that early endogenous shifts in attention influence word order choices. These findings indicate a reliable relationship between initial looking patterns and speaking patterns, reflecting considerable parallelism between the on-line apprehension of events and the on-line construction of descriptive utterances.
What the eyes say about speaking,
To study the time course of sentence formulation, we monitored the eye movements of speakers as they described simple events. The similarity between speakers' initial eye movements and those of observers performing a nonverbal event-comprehension task suggested that response-relevant information was rapidly extracted from scenes, allowing speakers to select grammatical subjects based on comprehended events rather than salience. When speaking extemporaneously, speakers began fixating pictured elements less than a second before naming them within their descriptions, a finding consistent with incremental lexical encoding. Eye movements anticipated the order of mention despite changes in picture orientation, in who-did-what-to-whom, and in sentence structure. The results support Wundt's theory of sentence production.
The human visual cortex,
Extraction of event roles from visual scenes is rapid, automatic, and interacts with higher-level visual processing,
On Broca, brain, and binding: A new framework,
A probabilistic Earley parser as a psycholinguistic model,
In human sentence processing, cognitive load can be defined many ways. This report considers a definition of cognitive load in terms of the total probability of structural options that have been disconfirmed at some point in a sentence: the surprisal of word w_i given its prefix w_0..i-1 on a phrase-structural language model. These loads can be efficiently calculated using a probabilistic Earley parser (Stolcke, 1995) which is interpreted as generating predictions about reading time on a word-by-word basis. Under grammatical assumptions supported by corpus-frequency data, the operation of Stolcke's probabilistic Earley parser correctly predicts processing phenomena associated with garden path structural ambiguity and with the subject/object relative asymmetry.
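The surprisal measure described in this abstract has a simple closed form; the sketch below computes it for illustrative next-word probabilities (the values are hypothetical, not corpus-derived):

```python
import math

def surprisal(p_next: float) -> float:
    """Surprisal of a word given its prefix: -log2 P(w_i | w_0..i-1), in bits."""
    return -math.log2(p_next)

# Hypothetical probabilities for two continuations of the same prefix.
p_expected = 0.8      # continuation consistent with the dominant parse
p_gardenpath = 0.05   # continuation that disconfirms the dominant parse

# The disconfirming word carries far more surprisal, predicting longer reading times.
assert surprisal(p_gardenpath) > surprisal(p_expected)
```

Under this linking hypothesis, per-word processing difficulty is proportional to surprisal; in the paper the probabilities come from the prefix probabilities of the probabilistic Earley parser rather than the toy values shown here.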
Interactive natural language acquisition in a multi-modal recurrent neural architecture,
For the complex human brain that enables us to communicate in natural language, we have gathered a good understanding of the principles underlying language acquisition and processing, knowledge about sociocultural conditions, and insights into activity patterns in the brain. However, we do not yet fully understand the behavioural and mechanistic characteristics of natural language, or how mechanisms in the brain allow us to acquire and process language. In bridging the insights from behavioural psychology and neuroscience, the goal of this paper is to contribute a computational understanding of the characteristics that favour language acquisition. Accordingly, we provide concepts and refinements in cognitive modelling regarding principles and mechanisms in the brain and propose a neurocognitively plausible model for embodied language acquisition from real-world interaction of a humanoid robot with its environment. In particular, the architecture consists of a continuous-time recurrent neural network, where parts have different leakage characteristics and thus operate on multiple timescales for every modality, with the higher-level nodes of all modalities associated into cell assemblies. The model is capable of learning language production grounded in both temporal dynamic somatosensation and vision, and features hierarchical concept abstraction, concept decomposition, multi-modal integration, and self-organisation of latent representations.
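The multiple-timescale mechanism rests on the standard leaky-integrator update of continuous-time recurrent units. The sketch below uses a single unit with arbitrary illustrative weights and time constants, not the paper's actual architecture:

```python
import math

def ctrnn_step(u, x, w, tau, dt=1.0):
    """One Euler step of a leaky-integrator (CTRNN) unit.

    u: internal state, x: external input, w: self-recurrent weight,
    tau: time constant (larger tau = slower-changing, higher-level unit).
    """
    y = math.tanh(u)                 # unit activation
    du = (-u + w * y + x) / tau      # leaky integration toward the driven state
    return u + dt * du

# Identical input, different timescales (illustrative values).
fast = ctrnn_step(u=0.0, x=1.0, w=0.5, tau=2.0)
slow = ctrnn_step(u=0.0, x=1.0, w=0.5, tau=50.0)
assert abs(slow) < abs(fast)  # large-tau units change far more slowly per step
```

Stacking layers whose tau values differ by an order of magnitude is what lets fast layers track raw sensory input while slow layers form the abstract, cross-modal representations described above.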
Predictors of verb-mediated anticipatory eye movements in the visual world,
Many studies have demonstrated that listeners use information extracted from verbs to guide anticipatory eye movements to objects in the visual context that satisfy the selection restrictions of the verb. An important question is what underlies such verb-mediated anticipatory eye gaze. Based on empirical and theoretical suggestions, we investigated the influence of 5 potential predictors of this behavior: functional associations and general associations between verb and target object, as well as the listeners' production fluency, receptive vocabulary knowledge, and nonverbal intelligence. In 3 eye-tracking experiments, participants looked at sets of 4 objects and listened to sentences where the final word was predictable or not predictable (e.g., "The man peels/draws an apple"). On predictable trials only the target object, but not the distractors, were functionally and associatively related to the verb. In Experiments 1 and 2, objects were presented before the verb was heard. In Experiment 3, participants were given a short preview of the display after the verb was heard. Functional associations and receptive vocabulary were found to be important predictors of verb-mediated anticipatory eye gaze independent of the amount of contextual visual input. General word associations did not and nonverbal intelligence was only a very weak predictor of anticipatory eye movements. Participants' production fluency correlated positively with the likelihood of anticipatory eye movements when participants were given the long but not the short visual display preview. These findings fit best with a pluralistic approach to predictive language processing in which multiple mechanisms, mediating factors, and situational context dynamically interact.
Semantic meaning and pragmatic interpretation in 5-year-olds: Evidence from real-time spoken language comprehension,
Recent research on children's inferencing has found that although adults typically adopt the pragmatic interpretation of some (implying not all), 5- to 9-year-olds often prefer the semantic interpretation of the quantifier (meaning possibly all). Do these failures reflect a breakdown of pragmatic competence or the metalinguistic demands of prior tasks? In 3 experiments, the authors used the visual-world eye-tracking paradigm to elicit an implicit measure of adults' and children's abilities to generate scalar implicatures. Although adults' eye-movements indicated that adults had interpreted some with the pragmatic inference, children's looks suggested that children persistently interpreted some as compatible with all (Experiment 1). Nevertheless, both adults and children were able to quickly reject competitors that were inconsistent with the semantics of some; this confirmed the sensitivity of the paradigm (Experiment 2). Finally, adults, but not children, successfully distinguished between situations that violated the scalar implicature and those that did not (Experiment 3). These data demonstrate that children interpret quantifiers on the basis of their semantic content and fail to generate scalar implicatures during online language comprehension.
Logic and conversation revisited: Evidence for a division between semantic and pragmatic content in real-time language comprehension,
The distinction between semantics (linguistically encoded meaning) and pragmatics (inferences about communicative intentions) can often be unclear and counterintuitive. For example, linguistic theories argue that the meaning of some encompasses the meaning of all while the intuition that some implies not all results from an inference. We explored how online interpretation of some evolves using an eye-tracking while listening paradigm. Early eye-movements indicated that while some was initially interpreted as compatible with all, participants began excluding referents compatible with all approximately 800 ms later. These results contrast with recent evidence of immediate inferencing and highlight the presence of bottom-up semantic-pragmatic interactions which necessarily rely on initial access to lexical meanings to trigger inferences.
Four central questions about prediction in language processing,
How speech processing affects our attention to visually similar objects: Shape competitor effects and the visual world paradigm,
Looking, language, and memory: Bridging research from the visual world and visual search paradigms,
In the visual world paradigm as used in psycholinguistics, eye gaze (i.e. visual orienting) is measured in order to draw conclusions about linguistic processing. However, current theories are underspecified with respect to how visual attention is guided on the basis of linguistic representations. In the visual search paradigm as used within the area of visual attention research, investigators have become more and more interested in how visual orienting is affected by higher order representations, such as those involved in memory and language. Within this area more specific models of orienting on the basis of visual information exist, but they need to be extended with mechanisms that allow for language-mediated orienting. In the present paper we review the evidence from these two different but highly related research areas. We arrive at a model in which working memory serves as the nexus in which long-term visual as well as linguistic representations (i.e. types) are bound to specific locations (i.e. tokens or indices). The model predicts that the interaction between language and visual attention is subject to a number of conditions, such as the presence of the guiding representation in working memory, capacity limitations, and cognitive control mechanisms.
Investigating the time-course of phonological prediction in native and non-native speakers of English: A visual world eye-tracking study,
Semantics and cognition,
Eye movements during language-mediated visual search reveal a strong link between overt visual attention and lexical processing in 36-month-olds,
The nature of children's early lexical processing was investigated by asking what information 36-month-olds access and use when instructed to find a known but absent referent. Children readily retrieved stored knowledge about characteristic color, i.e., when asked to find an object with a typical color (e.g., strawberry), children tended to fixate more upon an object that had the same (e.g., red plane) as opposed to a different (e.g., yellow plane) color. They did so regardless of the fact that they had plenty of time to recognize the pictures for what they are, i.e., planes and not strawberries. These data represent the first demonstration that language-mediated shifts of overt attention in young children can be driven by individual stored visual attributes of known words that mismatch on most other dimensions. The finding suggests that lexical processing and overt attention are strongly linked from an early age.
Toddlers’ language-mediated visual search: They need not have the words for it,
Eye movements made by listeners during language-mediated visual search reveal a strong link between visual processing and conceptual processing. For example, upon hearing the word for a missing referent with a characteristic colour (e.g., “strawberry”), listeners tend to fixate a colour-matched distractor (e.g., a red plane) more than a colour-mismatched distractor (e.g., a yellow plane). We ask whether these shifts in visual attention are mediated by the retrieval of lexically stored colour labels. Do children who do not yet possess verbal labels for the colour attribute that spoken and viewed objects have in common exhibit language-mediated eye movements like those made by older children and adults? That is, do toddlers look at a red plane when hearing “strawberry”? We observed that 24-month-olds lacking colour term knowledge nonetheless recognized the perceptual–conceptual commonality between named and seen objects. This indicates that language-mediated visual search need not depend on stored labels for concepts.
The influence of the immediate visual context on incremental thematic role-assignment: Evidence from eye-movements in depicted events,
Studies monitoring eye-movements in scenes containing entities have provided robust evidence for incremental reference resolution processes. This paper addresses the less studied question of whether depicted event scenes can affect processes of incremental thematic role-assignment. In Experiments 1 and 2, participants inspected agent-action-patient events while listening to German verb-second sentences with initial structural and role ambiguity. The experiments investigated the time course with which listeners could resolve this ambiguity by relating the verb to the depicted events. Such verb-mediated visual event information allowed early disambiguation on-line, as evidenced by anticipatory eye-movements to the appropriate agent/patient role filler. We replicated this finding while investigating the effects of intonation. Experiment 3 demonstrated that when the verb was sentence-final and thus did not establish early reference to the depicted events, linguistic cues alone enabled disambiguation before people encountered the verb. Our results reveal the on-line influence of depicted events on incremental thematic role-assignment and disambiguation of local structural and role ambiguity. In consequence, our findings require a notion of reference that includes actions and events in addition to entities (e.g. Semantics and Cognition, 1983), and argue for a theory of on-line sentence comprehension that exploits a rich inventory of semantic categories.
Visually situated language comprehension,
Can speaker gaze modulate syntactic structuring and thematic role assignment during spoken sentence comprehension?,
During comprehension, a listener can rapidly follow a frontally seated speaker’s gaze to an object before its mention, a behavior which can shorten latencies in speeded sentence verification. However, the robustness of gaze-following, its interaction with core comprehension processes such as syntactic structuring, and the persistence of its effects are unclear. In two “visual-world” eye-tracking experiments participants watched a video of a speaker, seated at an angle, describing transitive (non-depicted) actions between two of three Second Life characters on a computer screen. Sentences were in German and had either subject (NP1)-verb-object (NP2) or object (NP1)-verb-subject (NP2) structure; the speaker either shifted gaze to the NP2 character or was obscured. Several seconds later, participants verified either the sentence referents or their role relations. When participants had seen the speaker’s gaze shift, they anticipated the NP2 character before its mention and earlier than when the speaker was obscured. This effect was more pronounced for SVO than OVS sentences in both tasks. Interactions of speaker gaze and sentence structure were more pervasive in role-relations verification: participants verified the role relations faster for SVO than OVS sentences, and faster when they had seen the speaker shift gaze than when the speaker was obscured. When sentence and template role-relations matched, gaze-following even eliminated the SVO-OVS response-time differences. Thus, gaze-following is robust even when the speaker is seated at an angle to the listener; it varies depending on the syntactic structure and thematic role relations conveyed by a sentence; and its effects can extend to delayed post-sentence comprehension processes. These results suggest that speaker gaze effects contribute pervasively to visual attention and comprehension processes and should thus be accommodated by accounts of situated language comprehension.
Effects of speaker gaze versus depicted actions on visual attention during sentence comprehension,
Audio-tactile integration and the influence of musical training,
Perception of our environment is a multisensory experience; information from different sensory systems like the auditory, visual and tactile is constantly integrated. Complex tasks that require high temporal and spatial precision of multisensory integration put strong demands on the underlying networks but it is largely unknown how task experience shapes multisensory processing. Long-term musical training is an excellent model for brain plasticity because it shapes the human brain at functional and structural levels, affecting a network of brain areas. In the present study we used magnetoencephalography (MEG) to investigate how audio-tactile perception is integrated in the human brain and if musicians show enhancement of the corresponding activation compared to non-musicians. Using a paradigm that allowed the investigation of combined and separate auditory and tactile processing, we found a multisensory incongruency response, generated in frontal, cingulate and cerebellar regions, an auditory mismatch response generated mainly in the auditory cortex and a tactile mismatch response generated in frontal and cerebellar regions. The influence of musical training was seen in the audio-tactile as well as in the auditory condition, indicating enhanced higher-order processing in musicians, while the sources of the tactile MMN were not influenced by long-term musical training. Consistent with the predictive coding model, more basic, bottom-up sensory processing was relatively stable and less affected by expertise, whereas areas for top-down models of multisensory expectancies were modulated by training.
Children’s semantic and world knowledge overrides fictional information during anticipatory linguistic processing
Dynamic speech representations in the human temporal lobe,
Speech perception requires rapid integration of acoustic input with context-dependent knowledge. Recent methodological advances have allowed researchers to identify underlying information representations in primary and secondary auditory cortex and to examine how context modulates these representations. We review recent studies that focus on contextual modulations of neural activity in the superior temporal gyrus (STG), a major hub for spectrotemporal encoding. Recent findings suggest a highly interactive flow of information processing through the auditory ventral stream, including influences of higher-level linguistic and metalinguistic knowledge, even within individual areas. Such mechanisms may give rise to more abstract representations, such as those for words. We discuss the importance of characterizing representations of context-dependent and dynamic patterns of neural activity in the approach to speech perception research.
Uncertainty and expectation in sentence processing: Evidence from subcategorization distributions,
There is now considerable evidence that human sentence processing is expectation based: As people read a sentence, they use their statistical experience with their language to generate predictions about upcoming syntactic structure. This study examines how sentence processing is affected by readers' "uncertainty" about those expectations. In a self-paced reading study, we use lexical subcategorization distributions to factorially manipulate both the strength of expectations and the uncertainty about them. We compare two types of uncertainty: uncertainty about the verb's complement, reflecting the next prediction step; and uncertainty about the full sentence, reflecting an unbounded number of prediction steps. We find that uncertainty about the full structure, but not about the next step, was a significant predictor of processing difficulty: Greater reduction in uncertainty was correlated with increased reading times (RTs). We additionally replicated previously observed
The interaction of lexical and syntactic ambiguity,
Two experiments investigated comprehension of noun/verb lexical category ambiguities such as trains, in order to determine whether resolution of these ambiguities was similar to other types of ambiguity resolution. Frazier and Rayner (1987, Journal of Memory and Language, 26, 505-526) argued that these ambiguities were resolved with a delay strategy that is not used for other ambiguities. Experiment 1's self-paced reading data replicated Frazier and Rayner's results but also showed that evidence taken to support delay had other explanations. Experiment 2 investigated the influence of semantic biases on ambiguity resolution and found that three probabilistic factors influenced lexical category ambiguity resolution: (1) the relative frequency of head vs. modifying noun usage of a biasing noun, (2) the frequency of cooccurrence of a biasing noun and category ambiguous word in English, and (3) the combinatorial semantic information in the sentence. The extent to which alternative models account for the use of probabilistic information in ambiguity resolution is discussed.
Lexical nature of syntactic ambiguity resolution,
Ambiguity resolution is a central problem in language comprehension. Lexical and syntactic ambiguities are standardly assumed to involve different types of knowledge representations and be resolved by different mechanisms. An alternative account is provided in which both types of ambiguity derive from aspects of lexical representation and are resolved by the same processing mechanisms. Reinterpreting syntactic ambiguity resolution as a form of lexical ambiguity resolution obviates the need for special parsing principles to account for syntactic interpretation preferences, reconciles a number of apparently conflicting results concerning the roles of lexical and contextual information in sentence processing, explains differences among ambiguities in terms of ease of resolution, and provides a more unified account of language comprehension than was previously available.
How yellow is your banana? Toddlers’ language-mediated visual search in referent-present tasks,
What is the relative salience of different aspects of word meaning in the developing lexicon? The current study examines the time-course of retrieval of semantic and color knowledge associated with words during toddler word recognition: At what point do toddlers orient toward an image of a yellow cup upon hearing color-matching words such as "banana" (typically yellow) relative to unrelated words (e.g., "house")? Do children orient faster to semantic matching images relative to color matching images, for example, orient faster to an image of a cookie relative to a yellow cup upon hearing the word "banana"? The results strongly suggest a prioritization of semantic information over color information in children's word-referent mappings. This indicates that even for natural objects (e.g., food, animals that are more likely to have a prototypical color), semantic knowledge is a more salient aspect of toddler's word meaning than color knowledge. For 24-month-old Dutch toddlers, bananas are thus more edible than they are yellow.
Speaker identity supports phonetic category learning,
Visual cues from the speaker's face, such as the discriminable mouth movements used to produce speech sounds, improve discrimination of these sounds by adults. The speaker's face, however, provides more information than just the mouth movements used to produce speech; it also provides a visual indexical cue to the identity of the speaker. The current article examines the extent to which there is separable encoding of speaker identity in speech processing and asks whether speech discrimination is influenced by speaker identity. Does consistent pairing of different speakers' faces with different sounds (that is, hearing one speaker saying one sound and a second speaker saying the second sound) influence the brain's discrimination of the sounds? ERP data from participants previously exposed to consistent speaker-sound pairing indicated improved detection of the phoneme change relative to participants previously exposed to inconsistent speaker-sound pairing (that is, hearing both speakers say both sounds). The results strongly suggest an influence of visual speaker identity in speech processing.
Sentence perception as an interactive parallel process,
The restoration of disrupted words to their original form in a sentence shadowing task is dependent upon semantic and syntactic context variables, thus demonstrating an on-line interaction between the structural and the lexical and phonetic levels of sentence processing.
Interactive activation and mutual constraint satisfaction in perception and cognition,
In a seminal 1977 article, Rumelhart argued that perception required the simultaneous use of multiple sources of information, allowing perceivers to optimally interpret sensory information at many levels of representation in real time as information arrives. Building on Rumelhart's arguments, we present the Interactive Activation hypothesis: the idea that the mechanism used in perception and comprehension to achieve these feats exploits an interactive activation process implemented through the bidirectional propagation of activation among simple processing units. We then examine the interactive activation model of letter and word perception and the TRACE model of speech perception, as early attempts to explore this hypothesis, and review the experimental evidence relevant to their assumptions and predictions. We consider how well these models address the computational challenge posed by the problem of perception, and we consider how consistent they are with evidence from behavioral experiments. We examine empirical and theoretical controversies surrounding the idea of interactive processing, including a controversy that swirls around the relationship between interactive computation and optimal Bayesian inference. Some of the implementation details of early versions of interactive activation models caused deviation from optimality and from aspects of human performance data. More recent versions of these models, however, overcome these deficiencies. Among these is a model called the multinomial interactive activation model, which explicitly links interactive activation and Bayesian computations. We also review evidence from neurophysiological and neuroimaging studies supporting the view that interactive processing is a characteristic of the perceptual processing machinery in the brain. In sum, we argue that a computational analysis, as well as behavioral and neuroscience evidence, all support the Interactive Activation hypothesis.
The evidence suggests that contemporary versions of models based on the idea of interactive activation continue to provide a basis for efforts to achieve a fuller understanding of the process of perception.
A model for the cross-modal influence of visual context upon language processing,
Hearing lips and seeing voices,
Most verbal communication occurs in contexts where the listener can see the speaker as well as hear him. However, speech perception is normally regarded as a purely auditory process. The study reported here demonstrates a previously unrecognised influence of vision upon speech perception. It stems from an observation that, on being shown a film of a young woman's talking head, in which repeated utterances of the syllable [ba] had been dubbed on to lip movements for [ga], normal adults reported hearing [da]. With the reverse dubbing process, a majority reported hearing [bagba] or [gaba]. When these subjects listened to the soundtrack from the film, without visual input, or when they watched untreated film, they reported the syllables accurately as repetitions of [ba] or [ga]. Subsequent replications confirm the reliability of these findings; they have important implications for the understanding of speech perception.
Linking language and events: Spatiotemporal cues drive children’s expectations about the meanings of novel transitive verbs,
How do children map linguistic representations onto the conceptual structures that they encode? In the present studies, we provided 3-4 year old children with minimal-pair scene contrasts in order to determine the effect of particular event properties on novel verb learning. Specifically, we tested whether spatiotemporal cues to causation also inform children's interpretation of transitive verbs either with or without the causal/inchoative alternation (She broke the lamp/the lamp broke). In Experiment 1, we examined spatiotemporal continuity. Children saw scenes with puppets that approached a toy in a distinctive manner, and toys that lit up or played a sound. In the causal events, the puppet contacted the object, and activation was immediate. In the noncausal events, the puppet stopped short before reaching the object, and the effect occurred after a short pause (apparently spontaneously). Children expected novel verbs used in the inchoative transitive/intransitive alternation to refer to spatiotemporally intact causal interactions rather than to 'gap' control scenes. In Experiment 2, we manipulated the temporal order of sub-events, holding spatial relationships constant, and provided evidence for only one verb frame (either transitive or intransitive). Children mapped transitive verbs to scenes where the agent's action closely preceded the activation of the toy over scenes in which the timing of the two events was switched, but did not do so when they heard an intransitive construction. These studies reveal that children's expectations about transitive verbs are at least partly driven by their nonlinguistic understanding of causal events: children expect transitive syntax to refer to scenes where the agent's action is a plausible cause of the outcome. These findings open a wide avenue for exploration into the relationship between children's linguistic knowledge and their nonlinguistic understanding of events.
World knowledge affects prediction as quickly as selectional restrictions: Evidence from the visual world paradigm,
Abstract There has been considerable debate regarding the question of whether linguistic knowledge and world knowledge are separable and used differently during processing or not (Hagoort, Hald, Bastiaansen, & Petersson, 2004; Matsuki et al., 2011; Paczynski & Kuperberg, 2012; Warren & McConnell, 2007; Warren, McConnell, & Rayner, 2008). Previous investigations into this question have provided mixed evidence as to whether violations of selectional restrictions are detected earlier than violations of world knowledge. We report a visual-world eye-tracking study comparing the timing of facilitation contributed by selectional restrictions versus world knowledge. College-aged adults (n=36) viewed photographs of natural scenes while listening to sentences. Participants anticipated upcoming direct objects similarly regardless of whether facilitation was provided by only world knowledge or a combination of selectional restrictions and world knowledge. These results suggest that selectional restrictions are not available earlier in comprehension than world knowledge.
Hey Robot, why don't you talk to me?
This paper describes the techniques used in the submitted video presenting an interaction scenario, realised using the Neuro-Inspired Companion (NICO) robot. NICO engages the users in a personalised conversation where the robot always tracks the users' face, remembers them and interacts with them using natural language. NICO can also learn to perform tasks such as remembering and recalling objects and thus can assist users in their daily chores. The interaction system helps the users to interact as naturally as possible with the robot, enriching their experience with the robot, making it more interesting and engaging.
The impact of inhibitory controls on anticipatory sentence processing in L2,
The interplay of local attraction, context and domain-general cognitive control in activation and suppression of semantic distractors during sentence comprehension,
During sentence comprehension, real-time identification of a referent is driven both by local, context-independent lexical information and by more global sentential information related to the meaning of the utterance as a whole. This paper investigates the cognitive factors that limit the consideration of referents that are supported by local lexical information but not supported by more global sentential information. In an eye-tracking paradigm, participants heard sentences like “She will eat the red pear” while viewing four black-and-white (colorless) line-drawings. In the experimental condition, the display contained a “local attractor” (e.g., a heart), which was locally compatible with the adjective but incompatible with the context (“eat”). In the control condition, the local attractor was replaced by a picture which was incompatible with the adjective (e.g., “igloo”). A second factor manipulated contextual constraint, by using either a constraining verb (e.g., “eat”), or a non-constraining one (e.g., “see”). Results showed consideration of the local attractor, the magnitude of which was modulated by verb constraint, but also by each subject’s cognitive control abilities, as measured in a separate Flanker task run on the same subjects. The findings are compatible with a processing model in which the interplay between local attraction, context, and domain-general control mechanisms determines the consideration of possible referents.
Spoken words can make the invisible visible: Testing the involvement of low-level visual representations in spoken word processing,
The notion that processing spoken (object) words involves activation of category-specific representations in visual cortex is a key prediction of modality-specific theories of representation that contrasts with theories assuming dedicated conceptual representational systems abstracted away from sensorimotor systems. In the present study, we investigated whether participants can detect otherwise invisible pictures of objects when they are presented with the corresponding spoken word shortly before the picture appears. Our results showed facilitated detection for congruent ("bottle" → picture of a bottle) versus incongruent ("bottle" → picture of a banana) trials. A second experiment investigated the time-course of the effect by manipulating the timing of picture presentation relative to word onset and revealed that it arises as soon as 200–400 ms after word onset and decays at 600 ms after word onset. Together, these data strongly suggest that spoken words can rapidly activate low-level category-specific visual representations that affect the mere detection of a stimulus, that is, what we see. More generally, our findings fit best with the notion that spoken words activate modality-specific visual representations that are low level enough to provide information related to a given token and at the same time abstract enough to be relevant not only for previously seen tokens but also for generalizing to novel exemplars one has never seen before.
Linking language to the visual world: Neural correlates of comprehending verbal reference to objects through pointing and visual cues,
In everyday communication speakers often refer in speech and/or gesture to objects in their immediate environment, thereby shifting their addressee's attention to an intended referent. The neurobiological infrastructure involved in the comprehension of such basic multimodal communicative acts remains unclear. In an event-related fMRI study, we presented participants with pictures of a speaker and two objects while they concurrently listened to her speech. In each picture, one of the objects was singled out, either through the speaker's index-finger pointing gesture or through a visual cue that made the object perceptually more salient in the absence of gesture. A mismatch (compared to a match) between speech and the object singled out by the speaker's pointing gesture led to enhanced activation in left IFG and bilateral pMTG, showing the importance of these areas in conceptual matching between speech and referent. Moreover, a match (compared to a mismatch) between speech and the object made salient through a visual cue led to enhanced activation in the mentalizing system, arguably reflecting an attempt to converge on a jointly attended referent in the absence of pointing. These findings shed new light on the neurobiological underpinnings of the core communicative process of comprehending a speaker's multimodal referential act and stress the power of pointing as an important natural device to link speech to objects.
Interactions of language and vision restrict "visual world" interpretations
Development of implicit processing of thematic and functional similarity relations during manipulable artifact object identification: Evidence from eye-tracking in the Visual World Paradigm,
Second language processing and revision of garden-path sentences: A visual word study,
We asked whether children's well-known difficulties revising initial sentence processing commitments characterize the immature or the learning parser. Adult L2 speakers of English acted out temporarily ambiguous and unambiguous instructions. While online processing patterns indicate that L2 adults experienced garden-paths and were sensitive to referential information to a similar degree as native adults, their act-out patterns indicate increased difficulties revising initial interpretations, at rates similar to those observed for 5-year-old native children (e.g., Trueswell, Sekerina, Hill & Logrip, 1999). We propose that L2 learners' difficulties with revision stem from increased recruitment of cognitive control networks during processing of a not fully proficient language, resulting in the reduced availability of cognitive control for parsing revisions.
Representation, space and Hollywood Squares: Looking at things that aren’t there anymore,
It has been argued that the human cognitive system is capable of using spatial indexes or oculomotor coordinates to relieve working memory load (Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (1997). Behavioral and Brain Sciences, 20(4), 723), track multiple moving items through occlusion (Scholl, D. J., & Pylyshyn, Z. W. (1999). Cognitive Psychology, 38, 259) or link incompatible cognitive and sensorimotor codes (Bridgeman, B., & Huemer, V. (1998). Consciousness and Cognition, 7, 454). Here we examine the use of such spatial information in memory for semantic information. Previous research has often focused on the role of task demands and the level of automaticity in the encoding of spatial location in memory tasks. We present five experiments where location is irrelevant to the task, and participants' encoding of spatial information is measured implicitly by their looking behavior during recall. In a paradigm developed from Spivey and Geng (Spivey, M. J., & Geng, J. (2000). submitted for publication), participants were presented with pieces of auditory, semantic information as part of an event occurring in one of four regions of a computer screen. In front of a blank grid, they were asked a question relating to one of those facts. Under certain conditions it was found that during the question period participants made significantly more saccades to the empty region of space where the semantic information had been previously presented. Our findings are discussed in relation to previous research on memory and spatial location, the dorsal and ventral streams of the visual system, and the notion of a cognitive-perceptual system using spatial indexes to exploit the stability of the external world.
Revisiting Snodgrass and Vanderwart’s object pictorial set: The role of surface detail in basic-level object recognition,
Theories of object recognition differ to the extent that they consider object representations as being mediated only by the shape of the object, or shape and surface details, if surface details are part of the representation. In particular, it has been suggested that color information may be helpful at recognizing objects only in very special cases, but not during basic-level object recognition in good viewing conditions. In this study, we collected normative data (naming agreement, familiarity, complexity, and imagery judgments) for Snodgrass and Vanderwart's object database of 260 black-and-white line drawings, and then compared the data to exactly the same shapes but with added gray-level texture and surface details (set 2), and color (set 3). Naming latencies were also recorded. Whereas the addition of texture and shading without color only slightly improved naming agreement scores for the objects, the addition of color information unambiguously improved naming accuracy and speeded correct response times. As shown in previous studies, the advantage provided by color was larger for objects with a diagnostic color, and structurally similar shapes, such as fruits and vegetables, but was also observed for man-made objects with and without a single diagnostic color. These observations show that basic-level 'everyday' object recognition in normal conditions is facilitated by the presence of color information, and support a 'shape + surface' model of object recognition, for which color is an integral part of the object representation. In addition, the new stimuli (sets 2 and 3) and the corresponding normative data provide valuable materials for a wide range of experimental and clinical studies of object recognition.
Tracking the time course of orthographic information in spoken-word recognition,
Two visual-world experiments evaluated the time course and use of orthographic information in spoken-word recognition using printed words as referents. Participants saw 4 words on a computer screen and listened to spoken sentences instructing them to click on one of the words (e.g., Click on the word bead). The printed words appeared 200 ms before the onset of the spoken target word. In Experiment 1, the display included the target word and a competitor with either a lower degree (e.g., bear) or a higher degree (e.g., bean) of phonological overlap with the target. Both competitors had the same degree of orthographic overlap with the target. There were more fixations to the competitors than to unrelated distractors. Crucially, the likelihood of fixating a competitor did not vary as a function of the amount of phonological overlap between target and competitor. In Experiment 2, the display included the target word and a competitor with either a lower degree (e.g., bare) or a higher degree (e.g., bear) of orthographic overlap with the target. Competitors were homophonous and thus had the same degree of phonological overlap with the target. There were more fixations to higher overlap competitors than to lower overlap competitors, beginning during the temporal interval where initial fixations driven by the vowel are expected to occur. The authors conclude that orthographic information is rapidly activated as a spoken word unfolds and is immediately used in mapping spoken words onto potential printed referents.
The multimodal nature of spoken word processing in the visual world: Testing the predictions of alternative models of multimodal integration,
Ambiguity in natural language is ubiquitous, yet spoken communication is effective due to integration of information carried in the speech signal with information available in the surrounding multimodal landscape. Language-mediated visual attention requires visual and linguistic information integration and has thus been used to examine properties of the architecture supporting multimodal processing during spoken language comprehension. In this paper we test predictions generated by alternative models of this multimodal system. A model (TRACE) in which multimodal information is combined at the point of the lexical representations of words generated predictions of a stronger effect of phonological rhyme relative to semantic and visual information on gaze behaviour, whereas a model in which sub-lexical information can interact across modalities (MIM) predicted a greater influence of visual and semantic information, compared to phonological rhyme. Two visual world experiments designed to test these predictions offer support for sub-lexical multimodal interaction during online language processing.
Modelling language-vision interactions in the Hub-and-Spoke framework,
Multimodal integration is a central characteristic of human cognition. However, our understanding of the interaction between modalities and its influence on behaviour is still in its infancy. This paper examines the value of the Hub-and-Spoke framework as a tool for exploring multimodal interaction in cognition. We present a Hub-and-Spoke model of language-vision information interaction and report the model's ability to replicate a range of phonological, visual and semantic similarity word-level effects reported in the Visual World Paradigm. The model provides an explicit connection between the percepts of language and the distribution of eye gaze and demonstrates the scope of the Hub-and-Spoke architectural framework by modelling new aspects of multimodal cognition.
A comprehensive model of spoken word recognition must be multimodal: Evidence from studies of language-mediated visual attention
When processing language, the cognitive system has access to information from a range of modalities (e.g. auditory, visual) to support language processing. Language-mediated visual attention studies have shown sensitivity of the listener to phonological, visual, and semantic similarity when processing a word. In a computational model of language-mediated visual attention, that models spoken word processing as the parallel integration of information from phonological, semantic and visual processing streams, we simulate such effects of competition within modalities. Our simulations raised untested predictions about stronger and earlier effects of visual and semantic similarity compared to phonological similarity around the rhyme of the word. Two visual world studies confirmed these predictions. The model and behavioral studies suggest that, during spoken word comprehension, multimodal information can be recruited rapidly to constrain lexical selection to the extent that phonological rhyme information may exert little influence on this process.
The developing constraints on parsing decisions: The role of lexical-biases and referential scenes in child and adult sentence processing,
Two striking contrasts currently exist in the sentence processing literature. First, whereas adult readers rely heavily on lexical information in the generation of syntactic alternatives, adult listeners in world-situated eye-gaze studies appear to allow referential evidence to override strong countervailing lexical biases (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Second, in contrast to adults, children in similar listening studies fail to use this referential information and appear to rely exclusively on verb biases or perhaps syntactically based parsing principles (Trueswell, Sekerina, Hill, & Logrip, 1999). We explore these contrasts by fully crossing verb bias and referential manipulations in a study using the eye-gaze listening technique with adults (Experiment 1) and five-year-olds (Experiment 2). Results indicate that adults combine lexical and referential information to determine syntactic choice. Children rely exclusively on verb bias in their ultimate interpretation. However, their eye movements reveal an emerging sensitivity to referential constraints. The observed changes in information use over ontogenetic time best support a constraint-based lexicalist account of parsing development, which posits that highly reliable cues to structure, like lexical biases, will emerge earlier during development and more robustly than less reliable cues.
Linguistically guided anticipatory eye movements in scene viewing,
The present study replicated the well-known demonstration by Altmann and Kamide (1999) that listeners make linguistically guided anticipatory eye movements, but used photographs of scenes rather than clip-art arrays as the visual stimuli. When listeners heard a verb for which a particular object in a visual scene was the likely theme, they made earlier looks to this object (e.g., looks to a cake upon hearing The boy will eat …) than when they heard a control verb (The boy will move …). New data analyses assessed whether these anticipatory effects are due to a linguistic effect on the targeting of saccades (i.e., the where parameter of eye movement control), the duration of fixations (i.e., the when parameter), or both. Participants made fewer fixations before reaching the target object when the verb was selectionally restricting (e.g., will eat). However, verb type had no effect on the duration of individual eye fixations. These results suggest an important constraint on the linkage between spoken language processing and eye movement control: Linguistic input may influence only the decision of where to move the eyes, not the decision of when to move them.
Syntactic prediction in language comprehension: Evidence from either...or,
Readers' eye movements were monitored as they read sentences in which two noun phrases or two independent clauses were connected by the word or (NP-coordination and S-coordination, respectively). The word either could be present or absent earlier in the sentence. When either was present, the material immediately following or was read more quickly, across both sentence types. In addition, there was evidence that readers misanalyzed the S-coordination structure as an NP-coordination structure only when either was absent. The authors interpret the results as indicating that the word either enabled readers to predict the arrival of a coordination structure; this predictive activation facilitated processing of this structure when it ultimately arrived, and in the case of S-coordination sentences, enabled readers to avoid the incorrect NP-coordination analysis. The authors argue that these results support parsing theories according to which the parser can build predictable syntactic structure before encountering the corresponding lexical input.
Language processing in the natural world,
The authors argue that a more complete understanding of how people produce and comprehend language will require investigating real-time spoken-language processing in natural tasks, including those that require goal-oriented unscripted conversation. One promising methodology for such studies is monitoring eye movements as speakers and listeners perform natural tasks. Three lines of research that adopt this approach are reviewed: (i) spoken word recognition in continuous speech, (ii) reference resolution in real-world contexts, and (iii) real-time language processing in interactive conversation. In each domain, results emerge that provide insights which would otherwise be difficult to obtain. These results extend and, in some cases, challenge standard assumptions about language processing.
Integration of visual and linguistic information in spoken language comprehension,
Alignment of eye movements and spoken language for semantic image understanding,
Extracting meaning from images is a challenging task that has generated much interest in recent years. In domains such as medicine, image understanding requires special expertise. Experts' eye movements can act as pointers to important image regions, while their accompanying spoken language descriptions, informed by their knowledge and experience, call attention to the concepts and features associated with those regions. In this paper, we apply an unsupervised alignment technique, widely used in machine translation to align parallel corpora, to align observers' eye movements with the verbal narrations they produce while examining an image. The resulting alignments can then be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. Such a database can in turn be used to automatically annotate new images. Initial results demonstrate the feasibility of a framework that draws on recognized bitext alignment algorithms for performing unsupervised automatic semantic annotation of image regions. Planned enhancements to the methods are also discussed.
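The abstract above describes borrowing a bitext-alignment algorithm from machine translation to align fixated image regions with narration words. As a hedged illustration only (the paper's exact algorithm, features, and data are not given here), a minimal IBM-Model-1-style EM estimator over toy fixation/word pairs might look like this:

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate translation probabilities t(word | region) via EM,
    treating each (fixated-regions, narration-words) pair as a
    parallel 'sentence' pair, as in IBM Model 1."""
    # uniform initialisation over all co-occurring (word, region) pairs
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)   # expected (word, region) counts
        total = defaultdict(float)   # expected counts per region
        for regions, words in pairs:
            for w in words:
                z = sum(t[(w, r)] for r in regions)  # normaliser
                for r in regions:
                    c = t[(w, r)] / z
                    count[(w, r)] += c
                    total[r] += c
        # M-step: renormalise so t(. | r) sums to 1 over words
        t = defaultdict(float, {(w, r): count[(w, r)] / total[r]
                                for (w, r) in count})
    return t

# toy data: fixation sequences paired with spoken descriptions
pairs = [
    (["cake", "boy"], ["the", "boy", "eats", "cake"]),
    (["cake"], ["a", "cake"]),
    (["boy"], ["the", "boy"]),
]
t = ibm_model1(pairs)
# the word "cake" should align most strongly to the cake region
best = max(["cake", "boy"], key=lambda r: t[("cake", r)])
```

After a few EM iterations the co-occurrence statistics pull each content word toward the region it consistently accompanies, which is the mechanism the paper exploits to build region-level semantic annotations.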
Putting things in new places: Linguistic experience modulates the predictive power of placement verb semantics,
A central question regarding predictive language processing concerns the extent to which linguistic experience modulates the process. We approached this question by investigating sentence processing in advanced second language (L2) users with different native language (L1) backgrounds. Using a visual world eye tracking paradigm, we investigated to what extent L1 and L2 participants showed anticipatory eye movements to objects while listening to Dutch placement event descriptions. L2 groups differed in the degree of similarity between Dutch and their L1 with respect to placement verb semantics: German, like Dutch, specifies object position in placement verbs (put.STAND vs. put.LIE), whereas English and French generally leave position underspecified (put). Results showed that German L2 listeners, like native Dutch listeners, anticipate objects that match the verbally encoded position immediately upon encountering the verb. French/English L2 participants, however, did not show any prediction effects, despite proper understanding of Dutch placement verbs. Our findings suggest that prior experience with a specific semantic contrast in one L1 facilitates prediction in L2, and hence adds to the evidence that linguistic experience modulates predictive sentence processing.
When the food arrives before the menu: Modeling event-driven surprisal in language comprehension.
Object labeling influences infant phonetic learning and generalization,
Different kinds of speech sounds are used to signify possible word forms in every language. For example, lexical stress is used in Spanish (/‘be.be/, ‘he/she drinks’ versus /be.’be/, ‘baby’), but not in French (/‘be.be/ and /be.’be/ both mean ‘baby’). Infants learn many such native language phonetic contrasts in their first year of life, likely using a number of cues from parental speech input. One such cue could be parents’ object labeling, which can explicitly highlight relevant contrasts. Here we ask whether phonetic learning from object labeling is abstract—that is, if learning can generalize to new phonetic contexts. We investigate this issue in the prosodic domain, as the abstraction of prosodic cues (like lexical stress) has been shown to be particularly difficult. One group of 10-month-old French-learners was given consistent word labels that contrasted on lexical stress (e.g., Object A was labeled /‘ma.bu/, and Object B was labeled /ma.’bu/). Another group of 10-month-olds was given inconsistent word labels (i.e., mixed pairings), and stress discrimination in both groups was measured in a test phase with words made up of new syllables. Infants trained with consistently contrastive labels showed an earlier effect of discrimination compared to infants trained with inconsistent labels. Results indicate that phonetic learning from object labeling can indeed generalize, and suggest one way infants may learn the sound properties of their native language(s).
Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information,
One of the central themes in the study of language acquisition is the gap between the linguistic knowledge that learners demonstrate, and the apparent inadequacy of linguistic input to support induction of this knowledge. One of the first linguistic abilities in the course of development to exemplify this problem is in speech perception: specifically, learning the sound system of one native language. Native-language sound systems are defined by meaningful contrasts among words in a language, yet infants learn these sound patterns before any significant numbers of words are acquired. Previous approaches to this learning problem have suggested that infants can learn phonetic categories from statistical analysis of auditory input, without regard to word referents. Experimental evidence presented here suggests instead that young infants can use visual cues present in word-labeling situations to categorize phonetic information. In Experiment 1, 9-month-old English-learning infants failed to discriminate two non-native phonetic categories, establishing baseline performance in a perceptual discrimination task. In Experiment 2, these infants succeeded at discrimination after watching contrasting visual cues (i.e., videos of two novel objects) paired consistently with the two non-native phonetic categories. In Experiment 3, these infants failed at discrimination after watching the same visual cues, but paired inconsistently with the two phonetic categories. At an age before which memory of word labels is demonstrated in the laboratory, 9-month-old infants use contrastive pairings between objects and sounds to influence their phonetic sensitivity. Phonetic learning may have a more functional basis than previous statistical learning mechanisms assume: infants may use cross-modal associations inherent in social contexts to learn native-language phonetic categories.
Finally, the presence of a visual display narrows the range of referents a listener considers for particular words in a sentence. In traditional language-processing experiments, spreading-activation models hold that when a word is heard, all words related to it in the mental lexicon may become activated. When a visual scene or picture is presented, however, the activated candidates are restricted to the few objects shown in the display. For example,
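The restriction described above can be sketched in a few lines. The toy lexicon, prefix, and display below are purely illustrative, not from any cited experiment; the point is only that the candidate set activated by the unfolding speech input shrinks once it is intersected with the depicted objects:

```python
lexicon = ["beaker", "beetle", "beach", "speaker", "carriage"]

def cohort(prefix, words):
    """Lexical candidates consistent with the speech heard so far."""
    return [w for w in words if w.startswith(prefix)]

def situated_cohort(prefix, words, display):
    """The same cohort, restricted to objects depicted in the scene."""
    return [w for w in cohort(prefix, words) if w in display]

display = ["beaker", "beetle", "carriage", "dog"]
print(cohort("be", lexicon))                    # ['beaker', 'beetle', 'beach']
print(situated_cohort("be", lexicon, display))  # ['beaker', 'beetle']
```

With no scene, three candidates compete; with a scene, only the two depicted objects remain plausible referents, which is why visual-world studies observe looks converging on displayed competitors.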
Moreover, it is not only static visual information that affects syntactic processing; dynamic events also influence spoken-language comprehension (
In summary, not only can static pictures and real-world settings affect our processing of auditory linguistic information; dynamic event contexts likewise influence language comprehension. This influence appears not only at the level of individual words but also in the syntactic-choice strategies deployed during processing, and it can even affect how we assign the thematic roles of agent and patient. The "encapsulation" advocated by modularity theory has also been challenged by a range of studies using the visual-world paradigm: language processing does not proceed independently of other information but interacts with it dynamically and in real time. These findings from language comprehension support constraint-based theories, on which language processing is influenced and constrained online by many other kinds of information. ...
Because visual information is difficult to quantify, the model handles the visual module by reducing the event depicted in the visual input to an assignment of thematic roles, i.e., the assignment of agent and patient. For the linguistic module, the authors used ambiguous German sentences of the "who did what to whom" type; the role-assignment information stream from the visual module is fed in through WCDG and matched against the role assignments in the linguistic module, and the model finally outputs a syntactic parse. In the authors' simulations, parsing the ambiguous sentences with the linguistic module alone versus with the added visual-context module produced different results: incorporating visual information could change the syntactic-choice strategy itself, successfully modeling this class of phenomena. ...
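The integration described above can be illustrated with a deliberately simplified sketch. The parse candidates, constraint weights, and scene representation below are invented for illustration and are not taken from the original WCDG implementation; the sketch only shows how a weighted visual-role constraint can override a default word-order preference in constraint-based parsing:

```python
def score(parse, constraints):
    """WCDG-style scoring: multiply in the penalty of every violated
    soft constraint; the highest-scoring parse wins."""
    s = 1.0
    for check, penalty in constraints:
        if not check(parse):
            s *= penalty
    return s

# two readings of a case-ambiguous "who did what to whom" clause
parses = [
    {"order": "SVO", "agent": "fencer", "patient": "pirate"},
    {"order": "OVS", "agent": "pirate", "patient": "fencer"},
]

# linguistic soft constraint: prefer subject-first (SVO) readings
linguistic = [(lambda p: p["order"] == "SVO", 0.5)]

# visual soft constraint: the depicted event shows the pirate acting
# on the fencer, so penalise parses whose agent mismatches the scene
scene = {"agent": "pirate", "patient": "fencer"}
visual = [(lambda p: p["agent"] == scene["agent"], 0.2)]

best_no_scene = max(parses, key=lambda p: score(p, linguistic))
best_with_scene = max(parses, key=lambda p: score(p, linguistic + visual))
```

Without the scene, the subject-first preference selects the SVO reading; once the visual-role constraint is added, the mismatch penalty flips the preference to the OVS reading, mirroring how the simulated visual context changed the model's syntactic choice.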
Many studies have replicated and extended this question with similar paradigms. Some presented pictures on a computer screen, as in the original work, while others presented real objects instead. For example, one study examined differences between children and adults in integrating visual and linguistic processing and found that adults could effectively combine linguistic information (e.g., lexical information) with information about the referents (visual information) to resolve temporary ambiguities in sentences, whereas children relied only on the semantic and syntactic information in the spoken sentence and made very limited use of visual information (
Copyright © Editorial Office of Advances in Psychological Science (《心理科学进展》)