arxiv: 2605.12413 · v3 · pith:JM5BIDNVnew · submitted 2026-05-12 · 💻 cs.CV

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Pith reviewed 2026-05-20 22:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · spatial reasoning · omnidirectional images · perspective-conditioned reasoning · benchmark · egocentric rotation · reinforcement learning · visual perception gap

The pith

Multimodal LLMs show a sharp perception-reasoning gap on spatial tasks from omnidirectional images, with accuracy collapsing as viewpoint demands increase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper diagnoses how current multimodal large language models handle reasoning about space when the viewpoint shifts in full 360-degree scenes. It builds a benchmark with eight tasks that start with basic perception like counting or simple directions and advance to operations that require imagining rotations, distortions, or re-anchoring from an egocentric perspective. Tests on 14 models reveal strong results on foundational tasks but near failure on the advanced ones, exposing a clear gap. A separate experiment applies reinforcement learning to one model and finds that targeted reward shaping can lift performance in controlled conditions, though the gains stay selective and depend on exact reward choices.
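
To make "imagining rotations" concrete, the following minimal sketch shows the kind of computation an egocentric-rotation question asks a model to perform implicitly. The azimuth convention (0 degrees is straight ahead, angles increase clockwise), the four 90-degree sectors, and the function name are illustrative assumptions; they are not the benchmark's actual task definition.

```python
def direction_after_rotation(azimuth_deg: float, yaw_deg: float) -> str:
    """Return the coarse egocentric direction of an object after the viewer turns.

    Assumed convention (hypothetical, not taken from the paper):
    azimuth 0 = straight ahead, angles increase clockwise (to the right).
    """
    # Turning the viewer right by yaw_deg shifts every object left by the same angle.
    relative = (azimuth_deg - yaw_deg) % 360.0
    # Four 90-degree sectors centred on front, right, back, left.
    sectors = ["front", "right", "back", "left"]
    index = int(((relative + 45.0) % 360.0) // 90.0)
    return sectors[index]


# An object 90 degrees to the right is straight ahead after a 90-degree right turn.
print(direction_after_rotation(90.0, 90.0))   # front
# The same object is behind the viewer after a 90-degree left turn (yaw = -90).
print(direction_after_rotation(90.0, -90.0))  # back
```

In an equirectangular panorama, a pure yaw rotation of the viewer corresponds to a horizontal circular shift of the image, so the pixels stay the same while every direction label changes. That is why such questions test reasoning rather than recognition.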

Core claim

The central claim is that MLLMs possess a substantial perception-reasoning gap in perspective-conditioned spatial reasoning from omnidirectional images. Accuracy reaches 57.59 percent on foundational relative direction yet falls to 13.49 percent on egocentric rotation, 7.13 percent on ego-distortion, and 0.64 percent on open-ended compositional reasoning. An RL diagnostic on a 7B-scale model raises a matched baseline from 31.10 percent to 60.06 percent under controlled reward shaping, showing the gap has partial plasticity rather than being fixed, though improvements remain task-selective and sensitive to reward weight, formulation, and evaluation protocol.
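
As a quick arithmetic check on the figures quoted above, the sketch below computes the size of each drop in percentage points and the RL gain. The numbers are copied from the claim; the variable names are illustrative, and the comparison across tasks follows the source's framing without adjusting for possible differences in task format or chance level.

```python
# Accuracies (percent) quoted in the core claim.
foundational = 57.59  # relative direction
advanced = {
    "egocentric rotation": 13.49,
    "ego-distortion": 7.13,
    "open-ended compositional reasoning": 0.64,
}

# Absolute drop in percentage points relative to the foundational task.
drops = {task: round(foundational - acc, 2) for task, acc in advanced.items()}

# RL diagnostic: matched baseline vs. reward-shaped model.
rl_baseline, rl_shaped = 31.10, 60.06
rl_gain_points = round(rl_shaped - rl_baseline, 2)
rl_gain_ratio = round(rl_shaped / rl_baseline, 2)

for task, drop in drops.items():
    print(f"{task}: -{drop} points")
print(f"RL gain: +{rl_gain_points} points ({rl_gain_ratio}x)")
```

The drops come to roughly 44.1, 50.5, and 57.0 points, while the RL gain is about 29 points, a little under a doubling of the matched baseline.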

What carries the argument

PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs drawn from 2,600 omnidirectional images across 26 indoor environments and organized into eight tasks that progressively isolate viewpoint-dependent spatial inference.

If this is right

  • Models perform adequately on basic relative direction and distance tasks but fail on operations that require explicit egocentric rotation or perspective re-anchoring.
  • Reinforcement learning with reward shaping can raise performance on PCSR tasks from roughly 31 percent to 60 percent in controlled settings.
  • Gains from RL remain selective across tasks and depend on the specific allocation of reward weights and the choice of reward formulation.
  • The perception-reasoning gap constitutes a persistent bottleneck that limits reliable spatial understanding in current MLLMs even when basic visual perception is strong.
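
The sensitivity to weight allocation noted in the bullets above can be illustrated with a minimal sketch of a weighted shaped reward. The component names (`correct`, `well_formatted`), the default weights, and the linear combination are all hypothetical: this summary does not specify which reward terms the paper uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's actual allocation is not given here.
    correctness: float = 0.8
    formatting: float = 0.2


def shaped_reward(correct: bool, well_formatted: bool, w: RewardWeights) -> float:
    """Linear combination of binary reward components (illustrative only)."""
    return w.correctness * float(correct) + w.formatting * float(well_formatted)


# Same rollout outcome under two weight allocations: a wrong but well-formatted
# answer earns 0.2 in the first setting and 0.5 in the second.
print(shaped_reward(False, True, RewardWeights(0.8, 0.2)))
print(shaped_reward(False, True, RewardWeights(0.5, 0.5)))
```

Even in this toy form, changing the weights changes how much credit a wrong answer receives, which is one plausible route by which weight allocation could alter what the policy learns.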

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training regimes for MLLMs may require explicit objectives that simulate viewpoint changes rather than relying on pattern matching alone.
  • The benchmark could be extended to outdoor or dynamic scenes to test whether additional visual context narrows the observed gap.
  • Architectures that incorporate explicit mental-rotation modules might close the gap more reliably than reward shaping alone.

Load-bearing premise

The eight tasks isolate perspective-conditioned spatial reasoning without substantial contamination from general visual recognition, language priors, or dataset artifacts.

What would settle it

A replicated RL experiment on the same 7B model that produces no gain above the 31.10 percent baseline when using the reported reward shaping and protocol would indicate the observed plasticity is not reproducible.

Figures

Figures reproduced from arXiv: 2605.12413 by Ioannis Patras, Jiaxing Li, Wai Keung Wong, Xu Zheng, Yuangong Chen. Affiliations: (1) The Hong Kong Polytechnic University, (2) Guangzhou University, (3) Queen Mary University of London, (4) HKUST (Guangzhou).

Figure 1: Diagnostic task structure of PCSR-Bench and examples, with foundational perception tasks (T0–T2, upper part) and …
Figure 2: PCSR-Bench construction pipeline: ➀ a four-stage construction pipeline that programmatically generates diagnostic QA pairs from 3D ground truth; ➁ the resulting PCSR-Data; and ➂ an evaluation protocol for assessing MLLMs on the benchmark.
Figure 3: Task distribution of PCSR-Bench. The benchmark …
  Chart labels: T0 8.9% (Count Class 5.8%, Count Total 3.1%), T1 16.7% (Dist MCQ 6.6%, Dist Open 10.1%), T2 12.7%, T3 8.8%, T4 7.6%, T5 26.9%, T6 6.7%, T7 11.7%.
Original abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PCSR-Bench, a diagnostic benchmark of 84,373 QA pairs from 2,600 omnidirectional images across 26 indoor scenes, to evaluate perspective-conditioned spatial reasoning (PCSR) in MLLMs. It reports a clear perception-reasoning gap across 14 models, with accuracy at 57.59% on foundational relative direction but falling to 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on open-ended compositional reasoning. An RL reward-shaping experiment on a 7B model raises a matched baseline from 31.10% to 60.06%, indicating partial plasticity under targeted optimization.

Significance. If the benchmark tasks cleanly isolate PCSR, the work identifies a concrete and practically relevant bottleneck for MLLMs in viewpoint-dependent spatial inference from omnidirectional imagery, with direct relevance to navigation, AR/VR, and embodied agents. The scale of the benchmark, the breadth of evaluated models, and the controlled RL diagnostic (including sensitivity to reward formulation) supply falsifiable, quantitative evidence that strengthens the contribution.

major comments (2)
  1. [PCSR-Bench task construction] (Section 3 / task definitions): The central gap claim rests on the eight tasks isolating perspective-conditioned reasoning. However, no explicit controls (e.g., matched sentence length, object count, or scene rarity statistics) are reported comparing foundational tasks such as relative direction against advanced tasks such as egocentric rotation and open-ended compositional reasoning. Without these, the observed drops (57.59% to 0.64%) could partly reflect linguistic or parsing confounds rather than reasoning deficits.
  2. [RL diagnostic experiment] (Section 5 / reward-shaping results): The reported lift from 31.10% to 60.06% is described as task-selective and sensitive to both weight allocation and reward formulation. A load-bearing clarification is needed on whether the shaped reward directly penalizes perspective errors versus general answer quality; an ablation that holds visual input fixed while varying only the perspective component would strengthen attribution to PCSR plasticity.
minor comments (2)
  1. [Abstract] Terminology consistency: the abstract alternates between 'ego-distortion' and 'egocentric distortion'; adopt a single term throughout.
  2. [Benchmark description] Reproducibility: while total QA count and image count are given, a per-task breakdown of the 84,373 pairs and the exact prompt templates used for each of the eight tasks would improve clarity.
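
The length control requested in major comment 1 could be reported with a few lines of analysis. The sketch below assumes each QA item is a dict with hypothetical `task` and `question` keys (the benchmark's actual schema is not described here) and computes the mean question length per task in whitespace-separated tokens. The sample records are invented for illustration and are not drawn from PCSR-Bench.

```python
from collections import defaultdict
from statistics import mean


def mean_question_length(items):
    """Mean question length (whitespace tokens) per task label."""
    lengths = defaultdict(list)
    for item in items:
        lengths[item["task"]].append(len(item["question"].split()))
    return {task: mean(vals) for task, vals in lengths.items()}


# Toy records, invented for illustration; not drawn from PCSR-Bench.
sample = [
    {"task": "relative_direction", "question": "Is the lamp left of the sofa?"},
    {"task": "relative_direction", "question": "Is the door behind the table?"},
    {"task": "egocentric_rotation",
     "question": "After turning 90 degrees right, where is the lamp relative to you?"},
]

result = mean_question_length(sample)
print(result)
```

A large difference in mean length between foundational and advanced tasks would not by itself invalidate the gap, but it would indicate that a length-matched comparison is needed before attributing the drop to reasoning alone.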

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications and indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [PCSR-Bench task construction] (Section 3 / task definitions): The central gap claim rests on the eight tasks isolating perspective-conditioned reasoning. However, no explicit controls (e.g., matched sentence length, object count, or scene rarity statistics) are reported comparing foundational tasks such as relative direction against advanced tasks such as egocentric rotation and open-ended compositional reasoning. Without these, the observed drops (57.59% to 0.64%) could partly reflect linguistic or parsing confounds rather than reasoning deficits.

    Authors: We agree that reporting explicit controls would help isolate the contribution of perspective-conditioned reasoning. The tasks were designed with consistent omnidirectional visual inputs and progressively increasing reasoning demands (detailed in Section 3), but comparative statistics on sentence length, object count, and scene rarity were not included in the submission. In the revised version we will add these statistics in a new table or appendix subsection to allow direct comparison between foundational and advanced tasks.
    Revision: yes

  2. Referee: [RL diagnostic experiment] (Section 5 / reward-shaping results): The reported lift from 31.10% to 60.06% is described as task-selective and sensitive to both weight allocation and reward formulation. A load-bearing clarification is needed on whether the shaped reward directly penalizes perspective errors versus general answer quality; an ablation that holds visual input fixed while varying only the perspective component would strengthen attribution to PCSR plasticity.

    Authors: The shaped reward is computed from answer correctness on the PCSR questions; for advanced tasks this correctness signal directly reflects perspective errors because the questions require viewpoint-dependent inference on the same images. We will clarify this formulation and the observed task-selectivity in the revision. A dedicated ablation that holds visual input fixed while isolating only the perspective component of the reward is not present in the current experiments and would require additional runs; we will discuss its desirability and, if space and compute allow, include a limited version or explicit discussion of why the current design already ties gains to PCSR.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark results and RL experiment are direct measurements

full rationale

The paper constructs PCSR-Bench with 84,373 QA pairs across eight tasks and reports accuracy drops (e.g., 57.59% to 0.64%) plus RL reward-shaping gains from direct model evaluations on 14 MLLMs and a 7B-scale experiment. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. Claims rest on observable benchmark performance rather than reducing to self-defined quantities or prior author work by construction. This is a standard empirical diagnostic study with externally falsifiable results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the benchmark tasks validly measure perspective-conditioned spatial reasoning and that accuracy drops reflect reasoning limitations rather than other model deficiencies.

axioms (1)
  • Domain assumption: The eight tasks in PCSR-Bench comprehensively and cleanly isolate perspective-conditioned spatial reasoning capabilities.
    The diagnosis of a perception-reasoning gap and the interpretation of RL gains presuppose that these tasks accurately probe the intended phenomenon.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
    Relation between the paper passage and the cited Recognition theorem: unclear.
    Paper passage: PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
