The visual world paradigm (VWP) is a widely used tool in psycholinguistics for studying the time course of spoken language processing (Cooper, 1974; Tanenhaus et al., 1995). In this paradigm, eye movements are tracked while participants listen to spoken language and view visual scenes, providing precise temporal information about the processing of words and sentences. As the acoustic input unfolds, comprehenders shift their attention among the entities in their mental representation of the utterance, and their visual attention shifts accordingly (Altmann & Mirković, 2009). These shifts of attention surface in eye movements, which serve as overt behavioral data.
Linking hypotheses in this field relate eye movements in visual contexts to the mental representations of linguistic input. The coordinated interplay account proposed by Knoeferle and Crocker (2006, 2007) defines three phases in visually situated spoken language comprehension: integrating new words, searching for referents in the visual context, and matching linguistic input with objects and actions in the visual context. These three phases may take place sequentially or overlap in time. An alternative linking hypothesis proposed by Altmann and Mirković (2009), however, holds that interpreting linguistic input and comprehending visual scenes are intertwined, as linguistic meanings and non-linguistic information (e.g., visual information and world knowledge) are stored in one unitary system and jointly contribute to the dynamic representation of situations. Salverda et al.’s (2011) goal-based linking hypothesis introduces a task-goal dimension into the theoretical model. That is, the goal of the task also affects language processing: visual objects that are directly related to this goal attract more attention, and additional tasks such as clicking or moving objects contribute to the goal structure of the task and directly influence eye movements.
The visual world paradigm has brought two assets to the field: (i) the possibility of including a visual dimension in linguistic processing, and (ii) a fine-grained time-course measure of eye movements in real-time language comprehension. Together, these have greatly expanded the range of experimental designs for language studies. As the VWP relies primarily on listening tasks and does not require participants to have full literacy skills, it can be applied to examine language processing in young children, second language learners, and people with specific language impairments.
Dependent variables in a VWP experiment include fixation proportions, target ratio, and saccade latency, among others. Factors such as areas of interest, participant groups, and experimental conditions can be included as independent variables. To make use of the fine-grained time-course data provided by the VWP, it is crucial to include a temporal dimension in the analytical models. While traditional analyses evaluate fixation or saccade differences between conditions within a time window (using t-tests, ANOVAs, and mixed-effects models), divergence point analysis and cluster-based permutation analysis are informative for detecting and comparing the onset time of effects (Ito & Knoeferle, 2022). Growth-curve analysis, on the other hand, models changes in looks to an interest area over time (Mirman, 2008).
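As an illustration of the time-course measures described above, the following minimal sketch bins gaze samples into fixed time windows, computes the proportion of looks to each area of interest per bin, and then reports the first bin from which looks to one area exceed looks to another for several consecutive bins. The data layout (tuples of trial, time, and area of interest), the function names, the 50 ms bin width, and the threshold and run-length values are all assumptions made for this example. The onset heuristic is purely descriptive and is not a substitute for the inferential procedures cited above.

```python
from collections import defaultdict


def fixation_proportions(samples, bin_ms=50):
    """Proportion of gaze samples on each area of interest (AOI) per time bin.

    `samples` is an iterable of (trial_id, time_ms, aoi) tuples; this layout
    is a hypothetical one chosen for the example.
    Returns {bin_start_ms: {aoi: proportion}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for _trial, time_ms, aoi in samples:
        bin_start = (time_ms // bin_ms) * bin_ms
        counts[bin_start][aoi] += 1
        totals[bin_start] += 1
    return {
        b: {aoi: n / totals[b] for aoi, n in counts[b].items()}
        for b in sorted(totals)
    }


def first_sustained_divergence(props, aoi_a="target", aoi_b="competitor",
                               threshold=0.1, run_length=3):
    """Start of the first run of `run_length` consecutive bins in which the
    proportion for `aoi_a` exceeds that for `aoi_b` by more than `threshold`.

    A descriptive heuristic only (threshold and run length are arbitrary
    assumptions). Returns None if no such run exists.
    """
    run_start, run = None, 0
    for b in sorted(props):
        diff = props[b].get(aoi_a, 0.0) - props[b].get(aoi_b, 0.0)
        if diff > threshold:
            if run == 0:
                run_start = b
            run += 1
            if run >= run_length:
                return run_start
        else:
            run = 0
    return None


# Toy data: 4 trials sampled every 10 ms from 0 to 390 ms.
# Before 200 ms, two trials look at the target and two at the competitor;
# from 200 ms on, three trials look at the target and one at the competitor.
samples = []
for time_ms in range(0, 400, 10):
    for trial in range(4):
        n_target_trials = 3 if time_ms >= 200 else 2
        aoi = "target" if trial < n_target_trials else "competitor"
        samples.append((trial, time_ms, aoi))

props = fixation_proportions(samples, bin_ms=50)
onset = first_sustained_divergence(props)
print(onset)  # 200
```

In practice, the binned proportions would typically be computed per participant and condition before being passed to a statistical model; the pooled version here only keeps the sketch short.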
Studies using the VWP have revealed that language processing is incremental or even predictive, in contrast to earlier findings that supported delayed integration. At the early stage of word recognition, phonological cohort competitors compete with the target, and listeners may use phonetic information to anticipate upcoming words. The processing of semantic information in verb-argument and classifier-noun structures, for example, is highly incremental or anticipatory. Discourse processing, including referential processing and the comprehension of coherence relations, has also been found to be immediate. In addition, VWP studies have shown that syntactic and pragmatic processing is consistent with the constraint-based account (Trueswell et al., 1994): multiple types of information, including syntactic structures and pragmatic implicatures, constrain language processing from the very earliest stage, alongside other constraints such as contextual features, visual information, and world knowledge.
The VWP is limited in that it does not provide data on processing time and therefore cannot answer questions about processing difficulty in language comprehension. Moreover, VWP experiments can present only a limited number of static objects in the visual space, which differs from the complex visual environment of natural conversation. In experimental settings where only a few objects are presented, listeners may anticipate the linguistic input and look strategically at certain objects (Henderson & Ferreira, 2004; see the counter-argument in Dahan & Tanenhaus, 2004).
Developments in the VWP are driven by both theoretical and technological advances. For future studies, it is important to investigate the role of task goals in real-time language processing situated in visual contexts. Technological innovations such as virtual reality (VR) can create comparatively natural communication scenarios while maintaining precise experimental control, substantially improving the ecological validity of eye-tracking experiments using the VWP.