Seeing Through Their Eyes: Evaluating Visual Perspective Taking in Vision Language Models

Alicja Ziarko; Gracjan Góral; Maciej Wołczyk; Michal Nauman

arxiv: 2409.12969 · v1 · pith:YVLJ5YIBnew · submitted 2024-09-02 · 💻 cs.CL · cs.CV · cs.LG

classification 💻 cs.CL · cs.CV · cs.LG

keywords models · performance · perspective-taking · tasks · language · understand · vision · visual

Visual perspective-taking (VPT), the ability to understand the viewpoint of another person, enables individuals to anticipate the actions of other people. For instance, a driver can avoid accidents by assessing what pedestrians see. Humans typically develop this skill in early childhood, but it remains unclear whether the recently emerging Vision Language Models (VLMs) possess this capability. Furthermore, as these models are increasingly deployed in the real world, understanding how they perform nuanced tasks like VPT becomes essential. In this paper, we introduce two manually curated datasets, Isle-Bricks and Isle-Dots, for testing VPT skills, and we use them to evaluate 12 commonly used VLMs. Across all models, we observe a significant performance drop when perspective-taking is required. Additionally, we find that performance on object detection tasks is poorly correlated with performance on VPT tasks, suggesting that the existing benchmarks might not be sufficient to understand this problem. The code and the dataset will be available at https://sites.google.com/view/perspective-taking

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 conditional novelty 7.0

MLLMs exhibit a large perception-reasoning gap on perspective-conditioned spatial reasoning in omnidirectional images, with accuracy falling from 57% on basic direction tasks to under 1% on compositional reasoning, th...

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

A new benchmark reveals MLLMs achieve only 13% or lower accuracy on advanced perspective-conditioned spatial tasks in omnidirectional images, with RL reward shaping raising a 7B model from 31% to 60% in controlled settings.

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains...

Token Warping Helps MLLMs Look from Nearby Viewpoints
cs.CV 2026-04 unverdicted novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.