pith. sign in

arxiv: 2601.22228 · v2 · submitted 2026-01-29 · 💻 cs.CV · cs.AI· cs.CL

Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Pith reviewed 2026-05-16 09:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords vision-language modelsrelative camera posemulti-view reasoningspatial understandingcamera motiongeometric benchmarkscross-view correspondence
0
0 comments X

The pith

Vision-language models reach only 0.66 accuracy on relative camera pose estimation from image pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether vision-language models can determine the relative motion between two cameras by turning the problem into a multiple-choice question with verbal options. Humans and specialized geometric methods solve it reliably, but the models stay near random guessing even when they handle single images well. The shortfall appears in both full multi-view settings and simplified single-motion cases, especially for roll and forward-backward shifts. The results point to missing abilities in matching features across views and maintaining consistent understanding when the viewpoint changes.

Core claim

Vision-language models struggle to estimate the relative pose between two camera views, reaching a maximum accuracy of 0.66 on the introduced benchmark while humans reach 0.91 and geometric matchers reach 0.99. The models remain inconsistent when the order of the two images is swapped and perform especially poorly on motions along the optical axis. These shortcomings isolate deficits in cross-view correspondence, consistent reasoning across views, and understanding of projective camera motions.

What carries the argument

VRRPI-Bench and VRRPI-Diag, benchmarks that turn real RGB-D image pairs with known object-centric motions into discrete verbal classification questions about relative camera pose.

If this is right

  • VLMs would need explicit mechanisms for cross-view feature matching to close the performance gap with geometric methods.
  • Training objectives must enforce stability under image order reversal to support view-consistent reasoning.
  • Targeted practice on optical-axis motions such as roll and depth translation would be required to address the weakest single-DoF cases.
  • Closing this gap would demonstrate that language-based models can acquire projective geometric understanding without separate 3D modules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmarks could become standard tests for any model claiming multi-view spatial competence.
  • Similar verbal framing of other geometric problems might expose parallel gaps in depth or object tracking tasks.
  • If the gap narrows with scale or new data, it would suggest current training corpora simply lack enough multi-view geometric examples.

Load-bearing premise

That turning relative camera pose into a verbal multiple-choice task from image pairs measures genuine multi-view spatial reasoning rather than prompt or language artifacts.

What would settle it

A vision-language model that scores above 0.85 on VRRPI-Bench while showing over 80 percent consistency when the source and target images are swapped would indicate the claimed deficits have been overcome.

Figures

Figures reproduced from arXiv: 2601.22228 by Ken Deng, Shay B. Cohen, Yftah Ziser, Yifu Qiu, Yoni Kasten.

Figure 1
Figure 1. Figure 1: Examples from VRRPI-Bench. Camera moves from source observation (left) to target observation (right), with the verbalized camera motion for simul￾taneous translation and rotation around a shared object. from visual observations, serving as a structural building block for higher-level perception systems, including navigation (Cassinis et al., 2019), aug￾mented reality (Marchand et al., 2015), and 3D reconst… view at source ↗
Figure 2
Figure 2. Figure 2: Examples from VRRPI-Diag. Camera moves from source viewpoint (left) to target viewpoint (right) and both positive and negative descriptions are provided. tracking, and re-localization. Traditional approaches to RCPE rely on geo￾metric constraints derived from low-level visual correspondences, such as keypoint matching and epipolar geometry (Lowe, 2004; Sun et al., 2021). While effective under moderate view… view at source ↗
Figure 3
Figure 3. Figure 3: Consistency performance. The dashed line denotes the random baseline (50.0%), highlighting that most of VLMs are around or below random level. 5 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: shows models have gains when reason￾ing over isolated DoFs, yet performance is non￾uniform across different axes. While VLMs are relatively proficient at identifying x and y trans￾lations, they struggle with transformations along the z-axis. We hypothesize that this discrepancy stems from a reliance on 2D motion heuristics; for small movements, transformations in the xy-plane often manifest as simple horiz… view at source ↗
Figure 5
Figure 5. Figure 5: Cross-image relational comparison results. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Camera-perspective reasoning. We evaluate [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example prompt for VLM on VRRPI-Bench. Example prompt for VLM on VRRPI-Diag < input > You are given two consecutive frames from a scene . The first image shows the ** source viewpoint ** , and the second image shows the ** target viewpoint **. The camera movement is usually described in 6 degree of freedom (6 DoF ) --- translation and rotation along x , y , and z axes . We now have a constrain on camera mo… view at source ↗
Figure 8
Figure 8. Figure 8: Example prompt for VLM on VRRPI-Diag. 14 [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt for human annotators. Correct Case Prompt 0. Move left while yawing right 1. Move right while yawing left GPT-5 <thinking> Between frames, the nearby chess table and chairs shift right in the image, indicating the camera translated left (near objects move opposite the motion). The chessboard’s perspective shows depth receding toward the upper-right, meaning the camera is now pointed more to the righ… view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative examples from GPT-5 on VRRPI-Bench. Although the model correctly identifies objects, it fails to ground relative shifts and reason about the underlying 3D camera motion (right), revealing the inherent ambiguity between rotation and translation. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Evaluating intra-image spatial reasoning. This stage ensures that models possess the requisite spatial [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
read the original abstract

We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and introduce \texttt{VRRPI-Bench}, built from real RGB-D frames with object-centric camera motion, and \texttt{VRRPI-Diag}, which isolates individual motion degrees of freedom. Humans (0.91) and specialized geometric pipelines such as LoFTR (0.99) solve the task reliably, yet the best VLM reaches only 0.66 and most others remain near random. Our analyses show that this gap is not basic spatial competence: strong VLMs are near ceiling on single-image benchmarks, but most remain near random once reasoning must span views. They are unstable under source-target reversal (best 59.7\% consistency) and remain weak even in simplified single-DoF settings, especially on optical-axis motions such as roll and depth translation (GPT-5: 0.46 on roll). These failures are useful: they localize concrete missing capabilities, namely cross-view correspondence, view-consistent reasoning, and projective camera-motion understanding, making RCPE a targeted diagnostic for improving multi-view spatial reasoning in VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that vision-language models struggle with relative camera pose estimation (RCPE) when cast as a discrete verbal classification task on image pairs. It introduces VRRPI-Bench (real RGB-D frames with object-centric motion) and VRRPI-Diag (isolating single degrees of freedom), reporting humans at 0.91 accuracy, LoFTR at 0.99, and the best VLM at only 0.66 with most near random. Analyses highlight reversal inconsistency (max 59.7% consistency) and persistent weaknesses on single-DoF motions, especially roll and depth translation (e.g., GPT-5 at 0.46 on roll), localizing deficits in cross-view correspondence and projective understanding.

Significance. If the empirical gaps hold under scrutiny, the work supplies a targeted diagnostic benchmark for multi-view spatial reasoning in VLMs, distinguishing single-image competence from cross-view failures. The provision of both a realistic benchmark and controlled single-DoF diagnostics is a constructive contribution that could guide future model improvements.

major comments (2)
  1. [Abstract] Abstract and task formulation: the central claim that the 0.66 VLM ceiling demonstrates missing cross-view correspondence and projective understanding rests on the discrete verbal classification framing; this risks confounding prompt sensitivity, label parsing, and output-format artifacts with true geometric deficits, as geometric baselines like LoFTR avoid these interface issues entirely.
  2. [Abstract] Abstract (results on reversal and single-DoF): the reported 59.7% reversal inconsistency and GPT-5 roll accuracy of 0.46 are presented as evidence of specific missing capabilities, but without explicit details on prompt templates, output parsing rules, and data splits (absent from the provided text), these metrics cannot be isolated from task-formulation artifacts.
minor comments (1)
  1. [Abstract] Abstract: 'GPT-5' appears without clarification of the exact model version or whether it is a stand-in; specify the precise VLM identifiers used in all experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We have revised the manuscript to clarify the task formulation, provide explicit details on prompts and parsing, and better contextualize the empirical gaps within the verbal classification setting. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract] Abstract and task formulation: the central claim that the 0.66 VLM ceiling demonstrates missing cross-view correspondence and projective understanding rests on the discrete verbal classification framing; this risks confounding prompt sensitivity, label parsing, and output-format artifacts with true geometric deficits, as geometric baselines like LoFTR avoid these interface issues entirely.

    Authors: We agree that the verbal classification framing can introduce interface artifacts and that this must be clearly distinguished from pure geometric deficits. Our intent is precisely to evaluate VLMs under the language-based interface through which they are typically queried for reasoning tasks, rather than to claim equivalence to specialized geometric pipelines. The human baseline (0.91) is collected under the identical verbal protocol, providing a fair within-paradigm comparison, while LoFTR demonstrates solvability of the underlying geometry. We have revised the abstract to state that the observed limitations apply to verbal RCPE and have added a new subsection detailing prompt templates, output parsing rules, and controls for format sensitivity. revision: partial

  2. Referee: [Abstract] Abstract (results on reversal and single-DoF): the reported 59.7% reversal inconsistency and GPT-5 roll accuracy of 0.46 are presented as evidence of specific missing capabilities, but without explicit details on prompt templates, output parsing rules, and data splits (absent from the provided text), these metrics cannot be isolated from task-formulation artifacts.

    Authors: We acknowledge that the initial submission text omitted full reproduction details. The complete manuscript contains these in Section 3.2 and Appendix B; we have now moved concise summaries of the prompt templates, parsing procedure (keyword matching followed by majority vote over 5 samples), and train/test splits into the main text. The reversal inconsistency is computed by swapping image order and checking whether the predicted label is the inverse motion; single-DoF accuracies use the controlled VRRPI-Diag subsets. These additions allow the metrics to be isolated from formulation artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation of existing models

full rationale

The paper introduces VRRPI-Bench and VRRPI-Diag as new test sets derived from real RGB-D frames, then measures performance of off-the-shelf VLMs, humans, and geometric baselines like LoFTR on a verbal classification framing of RCPE. No equations, fitted parameters, self-citations, or derivations are present that reduce any claimed result to the inputs by construction. The reported accuracies (best VLM 0.66, humans 0.91, LoFTR 0.99) and consistency metrics are direct measurements on held-out data, not predictions forced by prior fits or definitions within the paper. The framing choice is an explicit methodological decision, not a hidden self-definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the verbal classification formulation and the chosen RGB-D object-centric data faithfully probe multi-view spatial reasoning.

axioms (1)
  • domain assumption RGB-D frames with object-centric camera motion provide a valid proxy for general relative pose estimation
    Stated in the construction of VRRPI-Bench and VRRPI-Diag

pith-pipeline@v0.9.0 · 5538 in / 1103 out tokens · 21945 ms · 2026-05-16T09:29:22.523664+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

  1. [1]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE. Lorenzo Pasqualetto Cassinis, Robert Fonod, and Eber- hard Gill. 2019. Review of the robustness and ap- plicability of monocular pose estimation systems for relative navigation with an uncoop...

  2. [2]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE. Erik Daxberger, Nina Wenzel, David Griffiths, Haim- ing Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and 1 others. 2025. Mm-spatial: Exploring 3d spatial understanding in multimodal llms...

  3. [3]

    Martin A Fischler and Robert C Bolles

    IEEE. Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartog- raphy.Communications of the ACM, 24(6):381–395. Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei- Chiu Ma, and Ranjay Krishna. 2024. Blink: Multi- ...

  4. [4]

    McIntosh-Smith, S

    Pose estimation for augmented reality: a hands-on survey.IEEE transactions on visualiza- tion and computer graphics, 22(12):2633–2651. Simon McIntosh-Smith, Sadaf R Alam, and Christo- pher Woods. 2024. Isambard-ai: a leadership class supercomputer optimised specifically for artificial intelligence.Preprint, arXiv:2410.11199. Etienne Mouragnon, Maxime Lhui...

  5. [5]

    An empirical analysis on spatial reasoning capabilities of large multimodal models.arXiv preprint arXiv:2411.06048, 2024

    The 8-point algorithm as an inductive bias for relative pose prediction by vits. In2022 International Conference on 3D Vision (3DV), pages 1–11. IEEE. Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. 2024. Far: Flexible, accurate and robust 6dof relative camera pose estimation. InCVPR. 10 Dhruv Shah, Bła˙ze...

  6. [7]

    Move right while yawing left </ ans - candidates > < output - format > Please stricly follow the format : Provide your spatial reasoning inside < thinking > </ thinking > XML tags , and provide ** only index of your option ** inside <ans > </ ans > XML tags , e . g . , < thinking >... </ thinking > <ans >... </ ans >. Do not output anything else . </ outp...

  7. [8]

    Rotate left </ ans - candidates > < output - format > Please stricly follow the format : Provide your spatial reasoning inside < thinking > </ thinking > XML tags , and provide ** only index of your option ** inside <ans > </ ans > XML tags , e . g . , < thinking >... </ thinking > <ans >... </ ans >. Do not output anything else . </ output - format > Fig...

  8. [10]

    The chessboard’s perspectiveshows depth receding toward the upper-right, meaning the camera is now pointed more to the right than before (yaw right)

    Move right while yawing left GPT-5 <thinking> Between frames, the nearby chess table and chairs shift right in the image, indicating the camera translated left (near objects move opposite the motion). The chessboard’s perspectiveshows depth receding toward the upper-right, meaning the camera is now pointed more to the right than before (yaw right). Hence:...

  9. [11]

    Move left while yawing right

  10. [12]

    This global shift to the right indicates a yaw to the left

    Move right while yawing left GPT-5 <thinking> In the target frame, more of the sink and left cabinets come into view while the right wall/shelf moves toward the right edge. This global shift to the right indicates a yaw to the left. At the same time, nearby objects ( table edge, chair) shift left relative to the distant background, showing opposite-direct...