pith. machine review for the scientific record.

arxiv: 2605.12413 · v2 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal large language models · spatial reasoning · omnidirectional images · perspective-conditioned reasoning · benchmark evaluation · reinforcement learning optimization · visual perception gap

The pith

Multimodal large language models show a large gap between perception and perspective-conditioned spatial reasoning on omnidirectional images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that perspective-conditioned spatial reasoning poses a major challenge for multimodal large language models when processing 360-degree images. It creates a new benchmark with eight tasks to measure this capability across basic perception and more complex viewpoint-dependent reasoning. Tests on fourteen models reveal high performance on simple tasks but near failure on advanced ones involving rotation and composition. An optimization experiment using reinforcement learning demonstrates that some improvement is possible, though it is limited and depends on the specific task and reward setup.

Core claim

The central claim is that current MLLMs have a substantial perception-reasoning gap in handling perspective-conditioned spatial reasoning from omnidirectional images. Foundational tasks like relative direction achieve 57.59% accuracy, while egocentric rotation drops to 13.49%, ego-distortion to 7.13%, and open-ended compositional reasoning to 0.64%. An RL-based study shows a 7B model can be improved from 31.10% to 60.06% with reward shaping, indicating that PCSR represents a key bottleneck with partial plasticity under targeted optimization.

What carries the argument

PCSR-Bench, which provides 84,373 question-answer pairs across eight tasks designed to isolate perspective-conditioned spatial reasoning in 2,600 omnidirectional images from 26 indoor environments.
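The paper does not reproduce its data schema or scoring script here, so the following is a minimal sketch, under assumed field names (task_id, scene_id, answer) and a simple exact-match protocol, of how a PCSR-Bench-style QA record might be stored and scored per task.

```python
# Minimal sketch of a PCSR-Bench-style record and per-task scoring.
# Field names and the exact-match rule are assumptions for illustration;
# the paper's actual schema and protocol are not given in this review.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QAPair:
    task_id: str      # "T0".."T7"; the index-to-task mapping is not spelled out here
    scene_id: str     # one of the 26 indoor environments
    image_path: str   # equirectangular (omnidirectional) panorama
    question: str
    answer: str       # ground truth generated programmatically from 3D ground truth

def per_task_accuracy(pairs, predictions):
    """Aggregate exact-match accuracy per task, as a diagnostic report would."""
    hits, totals = defaultdict(int), defaultdict(int)
    for qa, pred in zip(pairs, predictions):
        totals[qa.task_id] += 1
        hits[qa.task_id] += int(pred.strip().lower() == qa.answer.strip().lower())
    return {t: hits[t] / totals[t] for t in sorted(totals)}
```

Grouping accuracy by task is what makes the reported perception-reasoning gap visible: the same aggregation that yields 57.59% on relative direction yields 0.64% on open-ended composition.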

If this is right

  • Accuracy on basic spatial perception does not translate to success on tasks requiring active viewpoint adjustment or composition of relations.
  • Reinforcement learning with carefully designed rewards can boost performance on PCSR tasks in a 7B model (a hedged reward-shaping sketch follows this list).
  • Improvements from optimization are task-dependent and sensitive to reward formulation and evaluation protocol.
  • The gap persists across multiple representative MLLMs, suggesting a systemic issue.
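The paper reports that the RL gains are sensitive to both weight allocation and reward formulation, but the exact reward is not reproduced in this review. The sketch below is therefore only a generic shaped reward combining answer correctness with format compliance; the "Answer: <choice>" template and the 0.9/0.1 weights are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of a shaped reward for an RL diagnostic on PCSR tasks.
# The correctness/format split, the response template, and the weights
# are assumptions for illustration only.
import re

def shaped_reward(response: str, gold: str,
                  w_correct: float = 0.9, w_format: float = 0.1) -> float:
    """Combine answer correctness with format compliance.

    Assumes responses are asked to end with 'Answer: <choice>'.
    """
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    format_ok = 1.0 if match else 0.0
    pred = match.group(1).lower() if match else ""
    correct = 1.0 if pred == gold.strip().lower() else 0.0
    return w_correct * correct + w_format * format_ok
```

Varying w_correct, w_format, and the extraction template is exactly the kind of reward-design sensitivity the abstract flags.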

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better handling of PCSR could improve model performance in navigation or augmented reality applications that involve changing user perspectives.
  • The benchmark may need validation against human performance to confirm the tasks measure the intended capability.
  • Extending the approach to video sequences could reveal how models handle temporal changes in perspective.

Load-bearing premise

The tasks in the benchmark successfully separate perspective-conditioned spatial reasoning from other factors like image projection distortions or biases in question creation.

What would settle it

Running the same models on variants of the benchmark where questions are reworded to remove potential linguistic cues or where images are converted to different projections and observing if the performance gap remains.
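For the projection half of that test, the sketch below re-renders an equirectangular panorama as a perspective (rectilinear) crop at a chosen yaw, pitch, and field of view. It assumes the input spans 360° by 180°, uses nearest-neighbour sampling for brevity, and is not the paper's own preprocessing.

```python
# Minimal sketch of the perspective-crop control: re-render an equirectangular
# panorama as a rectilinear view at a given yaw/pitch and horizontal FOV.
import numpy as np

def equirect_to_perspective(pano: np.ndarray, yaw_deg: float, pitch_deg: float,
                            fov_deg: float = 90.0, out_hw=(512, 512)) -> np.ndarray:
    H, W = pano.shape[:2]
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)      # focal length in pixels

    # Rays through each output pixel, camera looking down +z.
    xs, ys = np.meshgrid(np.arange(out_w) - out_w / 2 + 0.5,
                         np.arange(out_h) - out_h / 2 + 0.5)
    dirs = np.stack([xs, -ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (around x) then yaw (around y).
    p, y = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray direction -> longitude/latitude -> equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])            # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))           # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = ((0.5 - lat / np.pi) * H).astype(int).clip(0, H - 1)
    return pano[v, u]
```

Re-running the benchmark on such crops (or on distortion-corrected variants) and comparing per-task accuracy is what would separate equirectangular artifacts from genuine perspective-conditioned reasoning failures.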

Figures

Figures reproduced from arXiv: 2605.12413 by Ioannis Patras (3), Jiaxing Li (2), Wai Keung Wong (1), Xu Zheng (3, 4), Yuangong Chen (1); affiliations: (1) The Hong Kong Polytechnic University, (2) Guangzhou University, (3) Queen Mary University of London, (4) HKUST (Guangzhou).

Figure 1
Figure 1: Diagnostic task structure of PCSR-Bench and examples, with foundational perception tasks (T0–T2, upper part) and […]. view at source ↗
Figure 2
Figure 2: PCSR-Bench construction pipeline: ➀ a four-stage construction pipeline that programmatically generates diagnostic QA pairs from 3D ground truth; ➁ the resulting PCSR-Data; and ➂ an evaluation protocol for assessing MLLMs on the benchmark. view at source ↗
Figure 3
Figure 3: Task distribution of PCSR-Bench. The benchmark […] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces PCSR-Bench, a diagnostic benchmark of 84,373 QA pairs from 2,600 omnidirectional images across 26 indoor environments, to evaluate Perspective-Conditioned Spatial Reasoning (PCSR) in MLLMs. It reports a perception-reasoning gap across 14 models (e.g., 57.59% on foundational relative direction vs. 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on compositional reasoning) and shows that RL reward shaping on a 7B model can raise performance from 31.10% to 60.06% under controlled conditions, positioning PCSR as a partially plastic bottleneck.

Significance. If the benchmark tasks isolate PCSR without projection or phrasing artifacts, the work would provide a useful diagnostic tool and empirical evidence that targeted optimization can partially close the gap, informing future MLLM development in viewpoint-dependent spatial reasoning.

major comments (1)
  1. [PCSR-Bench construction and task definitions] The central claim that low accuracies on advanced tasks reflect a genuine PCSR bottleneck (rather than omnidirectional projection artifacts or question-generation biases) is load-bearing for the perception-reasoning gap and the RL recovery interpretation. No ablations are reported, such as perspective-crop controls, distortion-corrected variants, or human baselines on matched questions, to confirm that performance drops survive removal of equirectangular effects.
minor comments (1)
  1. [Abstract and §4] The abstract and results sections would benefit from explicit statements on image sourcing, question-generation procedure, and any statistical controls (e.g., variance across environments) to support replicability of the reported accuracies.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on PCSR-Bench construction. We address the concern regarding potential artifacts point by point below and outline revisions to strengthen the claims.

read point-by-point responses
  1. Referee: [PCSR-Bench construction and task definitions] The central claim that low accuracies on advanced tasks reflect a genuine PCSR bottleneck (rather than omnidirectional projection artifacts or question-generation biases) is load-bearing for the perception-reasoning gap and the RL recovery interpretation. No ablations are reported, such as perspective-crop controls, distortion-corrected variants, or human baselines on matched questions, to confirm that performance drops survive removal of equirectangular effects.

    Authors: We agree that explicit controls are needed to isolate PCSR from equirectangular projection effects and generation biases. The benchmark applies identical omnidirectional inputs across all 14 models and all tasks, with the observed drop (e.g., 57.59% foundational relative direction to 13.49% egocentric rotation) occurring consistently; the RL reward-shaping experiment further shows that performance on advanced tasks can be substantially improved (31.10% to 60.06%) under controlled conditions, which would be unlikely if the gap were driven purely by input artifacts. Nevertheless, we did not include perspective-crop controls, distortion-corrected variants, or human baselines on matched questions in the original submission. In the revised manuscript we will add: (1) a human baseline study on a representative subset of questions, (2) direct comparisons against perspective-cropped and distortion-corrected inputs, and (3) expanded details on question-generation validation. These additions will confirm that the performance drops persist after removal of equirectangular effects and thereby reinforce the PCSR-bottleneck interpretation. revision: yes
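If the promised perspective-crop and distortion-corrected comparisons are run, a simple statistical control of the kind the minor comment requests is a paired bootstrap over per-question correctness. The sketch below assumes aligned 0/1 arrays for the two conditions and an illustrative 95% interval; the variable names are hypothetical.

```python
# Hedged sketch of a paired bootstrap comparing per-question correctness under
# the original equirectangular inputs vs. perspective-cropped variants of the
# same questions. Inputs are assumed to be aligned 0/1 arrays.
import numpy as np

def paired_bootstrap_gap(correct_equirect: np.ndarray, correct_cropped: np.ndarray,
                         n_boot: int = 10_000, seed: int = 0):
    """Return the mean accuracy difference (equirect - cropped) and a 95% CI.

    A difference near zero suggests removing equirectangular distortion does not
    change accuracy, supporting the PCSR-bottleneck reading; a large negative
    difference (crops markedly better) would point to projection artifacts.
    """
    assert correct_equirect.shape == correct_cropped.shape
    rng = np.random.default_rng(seed)
    diffs = correct_equirect.astype(float) - correct_cropped.astype(float)
    n = diffs.size
    boot = diffs[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    return diffs.mean(), (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
```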

Circularity Check

0 steps flagged

No circularity; empirical benchmark and evaluation are self-contained

full rationale

The paper introduces PCSR-Bench as a new diagnostic benchmark with eight tasks on omnidirectional images, evaluates 14 MLLMs to report a perception-reasoning gap (e.g., 57.59% foundational vs. 0.64% compositional), and performs an RL diagnostic showing improvement from 31.10% to 60.06%. No equations, fitted parameters, predictions by construction, self-definitional constructs, or load-bearing self-citations appear in the text. All central claims rest on direct empirical measurements from the newly defined benchmark and controlled RL runs, which are independent of prior inputs and externally falsifiable via the reported accuracies and task definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that the constructed QA pairs validly measure perspective-conditioned spatial reasoning; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: QA pairs generated from omnidirectional images can isolate viewpoint-dependent spatial reasoning.
    Invoked when defining the eight tasks and interpreting accuracy gaps.

pith-pipeline@v0.9.0 · 5677 in / 1289 out tokens · 51413 ms · 2026-05-14T21:34:00.189518+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
