Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Pith reviewed 2026-05-16 09:29 UTC · model grok-4.3
The pith
Vision-language models reach only 0.66 accuracy on relative camera pose estimation from image pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision-language models struggle to estimate the relative pose between two camera views, reaching a maximum accuracy of 0.66 on the introduced benchmark while humans reach 0.91 and geometric matchers reach 0.99. The models remain inconsistent when the order of the two images is swapped and perform especially poorly on motions along the optical axis. These shortcomings isolate deficits in cross-view correspondence, consistent reasoning across views, and understanding of projective camera motions.
What carries the argument
VRRPI-Bench and VRRPI-Diag, benchmarks that turn real RGB-D image pairs with known object-centric motions into discrete verbal classification questions about relative camera pose.
If this is right
- VLMs would need explicit mechanisms for cross-view feature matching to close the performance gap with geometric methods.
- Training objectives must enforce stability under image order reversal to support view-consistent reasoning.
- Targeted practice on optical-axis motions such as roll and depth translation would be required to address the weakest single-DoF cases.
- Closing this gap would demonstrate that language-based models can acquire projective geometric understanding without separate 3D modules.
Where Pith is reading between the lines
- The benchmarks could become standard tests for any model claiming multi-view spatial competence.
- Similar verbal framing of other geometric problems might expose parallel gaps in depth or object tracking tasks.
- If the gap narrows with scale or new data, it would suggest current training corpora simply lack enough multi-view geometric examples.
Load-bearing premise
That turning relative camera pose into a verbal multiple-choice task from image pairs measures genuine multi-view spatial reasoning rather than prompt or language artifacts.
What would settle it
A vision-language model that scores above 0.85 on VRRPI-Bench while showing over 80 percent consistency when the source and target images are swapped would indicate the claimed deficits have been overcome.
Figures
read the original abstract
We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and introduce \texttt{VRRPI-Bench}, built from real RGB-D frames with object-centric camera motion, and \texttt{VRRPI-Diag}, which isolates individual motion degrees of freedom. Humans (0.91) and specialized geometric pipelines such as LoFTR (0.99) solve the task reliably, yet the best VLM reaches only 0.66 and most others remain near random. Our analyses show that this gap is not basic spatial competence: strong VLMs are near ceiling on single-image benchmarks, but most remain near random once reasoning must span views. They are unstable under source-target reversal (best 59.7\% consistency) and remain weak even in simplified single-DoF settings, especially on optical-axis motions such as roll and depth translation (GPT-5: 0.46 on roll). These failures are useful: they localize concrete missing capabilities, namely cross-view correspondence, view-consistent reasoning, and projective camera-motion understanding, making RCPE a targeted diagnostic for improving multi-view spatial reasoning in VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that vision-language models struggle with relative camera pose estimation (RCPE) when cast as a discrete verbal classification task on image pairs. It introduces VRRPI-Bench (real RGB-D frames with object-centric motion) and VRRPI-Diag (isolating single degrees of freedom), reporting humans at 0.91 accuracy, LoFTR at 0.99, and the best VLM at only 0.66 with most near random. Analyses highlight reversal inconsistency (max 59.7% consistency) and persistent weaknesses on single-DoF motions, especially roll and depth translation (e.g., GPT-5 at 0.46 on roll), localizing deficits in cross-view correspondence and projective understanding.
Significance. If the empirical gaps hold under scrutiny, the work supplies a targeted diagnostic benchmark for multi-view spatial reasoning in VLMs, distinguishing single-image competence from cross-view failures. The provision of both a realistic benchmark and controlled single-DoF diagnostics is a constructive contribution that could guide future model improvements.
major comments (2)
- [Abstract] Abstract and task formulation: the central claim that the 0.66 VLM ceiling demonstrates missing cross-view correspondence and projective understanding rests on the discrete verbal classification framing; this risks confounding prompt sensitivity, label parsing, and output-format artifacts with true geometric deficits, as geometric baselines like LoFTR avoid these interface issues entirely.
- [Abstract] Abstract (results on reversal and single-DoF): the reported 59.7% reversal inconsistency and GPT-5 roll accuracy of 0.46 are presented as evidence of specific missing capabilities, but without explicit details on prompt templates, output parsing rules, and data splits (absent from the provided text), these metrics cannot be isolated from task-formulation artifacts.
minor comments (1)
- [Abstract] Abstract: 'GPT-5' appears without clarification of the exact model version or whether it is a stand-in; specify the precise VLM identifiers used in all experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We have revised the manuscript to clarify the task formulation, provide explicit details on prompts and parsing, and better contextualize the empirical gaps within the verbal classification setting. Below we respond point by point.
read point-by-point responses
-
Referee: [Abstract] Abstract and task formulation: the central claim that the 0.66 VLM ceiling demonstrates missing cross-view correspondence and projective understanding rests on the discrete verbal classification framing; this risks confounding prompt sensitivity, label parsing, and output-format artifacts with true geometric deficits, as geometric baselines like LoFTR avoid these interface issues entirely.
Authors: We agree that the verbal classification framing can introduce interface artifacts and that this must be clearly distinguished from pure geometric deficits. Our intent is precisely to evaluate VLMs under the language-based interface through which they are typically queried for reasoning tasks, rather than to claim equivalence to specialized geometric pipelines. The human baseline (0.91) is collected under the identical verbal protocol, providing a fair within-paradigm comparison, while LoFTR demonstrates solvability of the underlying geometry. We have revised the abstract to state that the observed limitations apply to verbal RCPE and have added a new subsection detailing prompt templates, output parsing rules, and controls for format sensitivity. revision: partial
-
Referee: [Abstract] Abstract (results on reversal and single-DoF): the reported 59.7% reversal inconsistency and GPT-5 roll accuracy of 0.46 are presented as evidence of specific missing capabilities, but without explicit details on prompt templates, output parsing rules, and data splits (absent from the provided text), these metrics cannot be isolated from task-formulation artifacts.
Authors: We acknowledge that the initial submission text omitted full reproduction details. The complete manuscript contains these in Section 3.2 and Appendix B; we have now moved concise summaries of the prompt templates, parsing procedure (keyword matching followed by majority vote over 5 samples), and train/test splits into the main text. The reversal inconsistency is computed by swapping image order and checking whether the predicted label is the inverse motion; single-DoF accuracies use the controlled VRRPI-Diag subsets. These additions allow the metrics to be isolated from formulation artifacts. revision: yes
Circularity Check
No circularity: purely empirical benchmark evaluation of existing models
full rationale
The paper introduces VRRPI-Bench and VRRPI-Diag as new test sets derived from real RGB-D frames, then measures performance of off-the-shelf VLMs, humans, and geometric baselines like LoFTR on a verbal classification framing of RCPE. No equations, fitted parameters, self-citations, or derivations are present that reduce any claimed result to the inputs by construction. The reported accuracies (best VLM 0.66, humans 0.91, LoFTR 0.99) and consistency metrics are direct measurements on held-out data, not predictions forced by prior fits or definitions within the paper. The framing choice is an explicit methodological decision, not a hidden self-definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption RGB-D frames with object-centric camera motion provide a valid proxy for general relative pose estimation
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We cast RCPE as a discrete verbal classification task... VRRPI-Diag isolates individual motion degrees of freedom... humans (0.91) and LoFTR (0.99) solve reliably, best VLM reaches only 0.66
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Relative camera pose estimation (RCPE)... 6 DoF... pitch, yaw, roll, tx, ty, tz
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
SpatialRGPT: Grounded spatial reasoning in vision-language models
Spatialbot: Precise spatial understanding with vision language models. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9490–9498. IEEE. Lorenzo Pasqualetto Cassinis, Robert Fonod, and Eber- hard Gill. 2019. Review of the robustness and ap- plicability of monocular pose estimation systems for relative navigation with an uncoop...
-
[2]
Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProc. Computer Vision and Pattern Recognition (CVPR), IEEE. Erik Daxberger, Nina Wenzel, David Griffiths, Haim- ing Gang, Justin Lazarow, Gefen Kohavi, Kai Kang, Marcin Eichner, Yinfei Yang, Afshin Dehghan, and 1 others. 2025. Mm-spatial: Exploring 3d spatial understanding in multimodal llms...
-
[3]
Martin A Fischler and Robert C Bolles
IEEE. Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartog- raphy.Communications of the ACM, 24(6):381–395. Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei- Chiu Ma, and Ranjay Krishna. 2024. Blink: Multi- ...
-
[4]
Pose estimation for augmented reality: a hands-on survey.IEEE transactions on visualiza- tion and computer graphics, 22(12):2633–2651. Simon McIntosh-Smith, Sadaf R Alam, and Christo- pher Woods. 2024. Isambard-ai: a leadership class supercomputer optimised specifically for artificial intelligence.Preprint, arXiv:2410.11199. Etienne Mouragnon, Maxime Lhui...
-
[5]
The 8-point algorithm as an inductive bias for relative pose prediction by vits. In2022 International Conference on 3D Vision (3DV), pages 1–11. IEEE. Chris Rockwell, Nilesh Kulkarni, Linyi Jin, Jeong Joon Park, Justin Johnson, and David F. Fouhey. 2024. Far: Flexible, accurate and robust 6dof relative camera pose estimation. InCVPR. 10 Dhruv Shah, Bła˙ze...
-
[7]
Move right while yawing left </ ans - candidates > < output - format > Please stricly follow the format : Provide your spatial reasoning inside < thinking > </ thinking > XML tags , and provide ** only index of your option ** inside <ans > </ ans > XML tags , e . g . , < thinking >... </ thinking > <ans >... </ ans >. Do not output anything else . </ outp...
-
[8]
Rotate left </ ans - candidates > < output - format > Please stricly follow the format : Provide your spatial reasoning inside < thinking > </ thinking > XML tags , and provide ** only index of your option ** inside <ans > </ ans > XML tags , e . g . , < thinking >... </ thinking > <ans >... </ ans >. Do not output anything else . </ output - format > Fig...
-
[10]
Move right while yawing left GPT-5 <thinking> Between frames, the nearby chess table and chairs shift right in the image, indicating the camera translated left (near objects move opposite the motion). The chessboard’s perspectiveshows depth receding toward the upper-right, meaning the camera is now pointed more to the right than before (yaw right). Hence:...
-
[11]
Move left while yawing right
-
[12]
This global shift to the right indicates a yaw to the left
Move right while yawing left GPT-5 <thinking> In the target frame, more of the sink and left cabinets come into view while the right wall/shelf moves toward the right edge. This global shift to the right indicates a yaw to the left. At the same time, nearby objects ( table edge, chair) shift left relative to the distant background, showing opposite-direct...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.