ISSN 1671-3710
CN 11-4766/R
Sponsored by: Institute of Psychology, Chinese Academy of Sciences
Published by: Science Press

Advances in Psychological Science, 2026, Vol. 34, Issue (6): 1035-1048. doi: 10.3724/SP.J.1042.2026.1035

• Regular Articles •

Comparing the mechanisms of level-1 and level-2 visual perspective taking: Theoretical controversies, behavioral and neuroscientific evidence

WANG Jiayin, LI Jing   

  1. School of Psychology, Nanjing Normal University, Nanjing 210097, China
  • Received: 2025-06-05 Online: 2026-06-15 Published: 2026-04-17

Abstract: Visual Perspective Taking (VPT), the ability to simulate and understand another's visual experience, is traditionally categorized into two levels: Level-1 (judging visibility, i.e., “what” is seen) and Level-2 (judging appearance, i.e., “how” it is seen). Current theories in this field present two opposing views: the two-systems account proposes that the two levels involve separate but complementary cognitive systems, while the single-system account suggests that a unified cognitive system is responsible for both. Both theories, however, struggle to fully explain empirical anomalies. To resolve these inconsistencies, this paper proposes a novel Three-Stage Processing Model. This framework suggests that both levels of VPT undergo three sequential phases: (1) Information Processing, (2) Perspective Simulation, and (3) Information Integration with Response Selection.
Stage 1: Information Processing. In this initial stage, both level-1 and level-2 VPT involve the encoding of spatial relationships between the self, others, and objects. However, the depth and scope of this information processing differ. Behavioral evidence suggests that level-1 VPT primarily involves tracking “line-of-sight” paths, requiring only a relatively shallow representation of whether a physical barrier exists between the agent and the target. In contrast, level-2 VPT demands a more fine-grained spatial representation, including the precise orientation and visual morphology of objects as seen from different angles. While both levels share basic spatial encoding in the occipito-parietal cortex, level-2 VPT triggers more extensive activation in the dorsal attention and frontoparietal control networks to support this greater representational depth.
Stage 2: Perspective Simulation. This stage marks the most significant divergence between the two levels. In level-1 VPT, perspective simulation is a relatively straightforward process that involves quickly tracking the other's line of sight and determining whether an object is visible. This simulation process relies on rapid, non-embodied mechanisms, such as gaze tracking, that do not require significant cognitive resources. In contrast, level-2 VPT engages more complex and embodied processes, often requiring mental rotation or reconfiguration of the reference frame. This embodied simulation involves a shift from the self's reference frame to that of the other, requiring cognitive resources such as body representation and spatial reasoning. Behavioral studies demonstrate that body posture alignment significantly facilitates level-2 VPT but has little effect on level-1 VPT. Neuroscientific data support this, showing that level-2 VPT specifically activates brain regions associated with body representation, such as the Extrastriate Body Area (EBA) and the insula, which are largely inactive during level-1 VPT.
Stage 3: Information Integration with Response Selection. In the final stage, individuals must integrate the information gathered in the first two stages and make a final judgment about the object or the other person's perspective. During this stage, level-1 and level-2 VPT share a common mechanism for integrating information about the other person's intentions and mental states. For instance, when an agent exhibits a goal-directed “reach-to-grasp” action, performance on both level-1 and level-2 VPT is enhanced, suggesting a shared understanding of others' psychological states at the response stage. However, level-2 VPT generally requires stronger cognitive control to resolve more complex perspective conflicts. Neural evidence suggests that “social brain” regions, specifically the right Temporoparietal Junction (rTPJ) and dorsomedial Prefrontal Cortex (dmPFC), play a crucial role in managing these conflicts. Although their role remains debated, current evidence suggests these regions are likely recruited at both levels when tasks explicitly require processing social intent or involve high interference.
In conclusion, we propose the Three-Stage Processing Model by integrating evidence from behavioral and neuroscience research. The model offers a unified framework that accommodates both the similarities and the differences between level-1 and level-2 VPT. To further validate and refine this model, future research should focus on developing experimental paradigms that dissociate the three stages, using high-temporal-resolution techniques to map the model's temporal dynamics, and exploring the triggering conditions for embodied mechanisms in level-2 VPT and their cross-modal integration. This study provides a comprehensive framework that paves the way for a more unified theory of spatial and social cognition.

Key words: visual perspective taking, two-systems account, single-system account, spatial cognition