pith. machine review for the scientific record.

arxiv: 2605.12413 · v2 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal large language models · spatial reasoning · omnidirectional images · perspective-conditioned reasoning · benchmark evaluation · reinforcement learning optimization · visual perception gap

The pith

Multimodal large language models show a large gap between perception and perspective-conditioned spatial reasoning on omnidirectional images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that perspective-conditioned spatial reasoning poses a major challenge for multimodal large language models when processing 360-degree images. It creates a new benchmark with eight tasks to measure this capability across basic perception and more complex viewpoint-dependent reasoning. Tests on fourteen models reveal high performance on simple tasks but near failure on advanced ones involving rotation and composition. An optimization experiment using reinforcement learning demonstrates that some improvement is possible, though it is limited and depends on the specific task and reward setup.

Core claim

The central claim is that current MLLMs have a substantial perception-reasoning gap in handling perspective-conditioned spatial reasoning from omnidirectional images. Foundational tasks like relative direction achieve 57.59% accuracy, while egocentric rotation drops to 13.49%, ego-distortion to 7.13%, and open-ended compositional reasoning to 0.64%. An RL-based study shows a 7B model can be improved from 31.10% to 60.06% with reward shaping, indicating that PCSR represents a key bottleneck with partial plasticity under targeted optimization.

What carries the argument

PCSR-Bench, which provides 84,373 question-answer pairs across eight tasks designed to isolate perspective-conditioned spatial reasoning in 2,600 omnidirectional images from 26 indoor environments.
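The paper does not reproduce its data schema or scoring script here, so the following is a minimal sketch, under assumed field names (task_id, scene_id, answer) and a simple exact-match protocol, of how a PCSR-Bench-style QA record might be stored and scored per task.

```python
# Minimal sketch of a PCSR-Bench-style record and per-task scoring.
# Field names and the exact-match rule are assumptions for illustration;
# the paper's actual schema and protocol are not given in this review.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QAPair:
    task_id: str      # "T0".."T7"; the index-to-task mapping is not spelled out here
    scene_id: str     # one of the 26 indoor environments
    image_path: str   # equirectangular (omnidirectional) panorama
    question: str
    answer: str       # ground truth generated programmatically from 3D ground truth

def per_task_accuracy(pairs, predictions):
    """Aggregate exact-match accuracy per task, as a diagnostic report would."""
    hits, totals = defaultdict(int), defaultdict(int)
    for qa, pred in zip(pairs, predictions):
        totals[qa.task_id] += 1
        hits[qa.task_id] += int(pred.strip().lower() == qa.answer.strip().lower())
    return {t: hits[t] / totals[t] for t in sorted(totals)}
```

Grouping accuracy by task is what makes the reported perception-reasoning gap visible: the same aggregation that yields 57.59% on relative direction yields 0.64% on open-ended composition.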

If this is right

  • Accuracy on basic spatial perception does not translate to success on tasks requiring active viewpoint adjustment or composition of relations.
  • Reinforcement learning with carefully designed rewards can boost performance on PCSR tasks in a 7B model (a hedged reward-shaping sketch follows this list).
  • Improvements from optimization are task-dependent and sensitive to reward formulation and evaluation protocol.
  • The gap persists across multiple representative MLLMs, suggesting a systemic issue.
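The paper reports that the RL gains are sensitive to both weight allocation and reward formulation, but the exact reward is not reproduced in this review. The sketch below is therefore only a generic shaped reward combining answer correctness with format compliance; the "Answer: <choice>" template and the 0.9/0.1 weights are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch of a shaped reward for an RL diagnostic on PCSR tasks.
# The correctness/format split, the response template, and the weights
# are assumptions for illustration only.
import re

def shaped_reward(response: str, gold: str,
                  w_correct: float = 0.9, w_format: float = 0.1) -> float:
    """Combine answer correctness with format compliance.

    Assumes responses are asked to end with 'Answer: <choice>'.
    """
    match = re.search(r"Answer:\s*(\S+)\s*$", response.strip())
    format_ok = 1.0 if match else 0.0
    pred = match.group(1).lower() if match else ""
    correct = 1.0 if pred == gold.strip().lower() else 0.0
    return w_correct * correct + w_format * format_ok
```

Varying w_correct, w_format, and the extraction template is exactly the kind of reward-design sensitivity the abstract flags.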

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Better handling of PCSR could improve model performance in navigation or augmented reality applications that involve changing user perspectives.
  • The benchmark may need validation against human performance to confirm the tasks measure the intended capability.
  • Extending the approach to video sequences could reveal how models handle temporal changes in perspective.

Load-bearing premise

The tasks in the benchmark successfully separate perspective-conditioned spatial reasoning from other factors like image projection distortions or biases in question creation.

What would settle it

Running the same models on variants of the benchmark where questions are reworded to remove potential linguistic cues or where images are converted to different projections and observing if the performance gap remains.
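For the projection half of that test, the sketch below re-renders an equirectangular panorama as a perspective (rectilinear) crop at a chosen yaw, pitch, and field of view. It assumes the input spans 360° by 180°, uses nearest-neighbour sampling for brevity, and is not the paper's own preprocessing.

```python
# Minimal sketch of the perspective-crop control: re-render an equirectangular
# panorama as a rectilinear view at a given yaw/pitch and horizontal FOV.
import numpy as np

def equirect_to_perspective(pano: np.ndarray, yaw_deg: float, pitch_deg: float,
                            fov_deg: float = 90.0, out_hw=(512, 512)) -> np.ndarray:
    H, W = pano.shape[:2]
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)      # focal length in pixels

    # Rays through each output pixel, camera looking down +z.
    xs, ys = np.meshgrid(np.arange(out_w) - out_w / 2 + 0.5,
                         np.arange(out_h) - out_h / 2 + 0.5)
    dirs = np.stack([xs, -ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (around x) then yaw (around y).
    p, y = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray direction -> longitude/latitude -> equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])            # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))           # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = ((0.5 - lat / np.pi) * H).astype(int).clip(0, H - 1)
    return pano[v, u]
```

Re-running the benchmark on such crops (or on distortion-corrected variants) and comparing per-task accuracy is what would separate equirectangular artifacts from genuine perspective-conditioned reasoning failures.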

Figures

Figures reproduced from arXiv: 2605.12413 by Ioannis Patras (3), Jiaxing Li (2), Wai Keung Wong (1), Xu Zheng (3, 4), Yuangong Chen (1); affiliations: (1) The Hong Kong Polytechnic University, (2) Guangzhou University, (3) Queen Mary University of London, (4) HKUST (Guangzhou).

Figure 1
Figure 1: Diagnostic task structure of PCSR-Bench and examples, with foundational perception tasks (T0–T2, upper part) and […]. view at source ↗
Figure 2
Figure 2: PCSR-Bench construction pipeline: ➀ a four-stage construction pipeline that programmatically generates diagnostic QA pairs from 3D ground truth; ➁ the resulting PCSR-Data; and ➂ an evaluation protocol for assessing MLLMs on the benchmark. view at source ↗
Figure 3
Figure 3: Task distribution of PCSR-Bench. The benchmark […] view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) show strong visual perception, yet remain limited in reasoning about space under changing viewpoints. We study this challenge as Perspective-Conditioned Spatial Reasoning (PCSR) in 360-degree omnidirectional images, where broad scene coverage reduces ambiguity from partial observations without eliminating the need for viewpoint-dependent inference. To assess this capability, we introduce PCSR-Bench, a diagnostic benchmark of 84,373 question-answer pairs from 2,600 omnidirectional images across 26 indoor environments. PCSR-Bench contains eight tasks spanning foundational perception (e.g., object counting, relative distance, and relative direction) and advanced PCSR, including compositional chains, egocentric rotation, perspective re-anchoring, ego-distortion, and limited-FOV visibility. We evaluate 14 representative MLLMs and observe a substantial perception-reasoning gap: accuracy reaches 57.59% on foundational relative direction, but drops to 13.49% on egocentric rotation, 7.13% on egocentric distortion, and 0.64% on open-ended compositional reasoning. To probe the plasticity of this gap, we conduct an RL-based diagnostic study on a 7B-scale model. Reward shaping improves a matched 7B baseline from 31.10% to 60.06% under a controlled setting, suggesting that PCSR is partial plasticity rather than being fully immutable. Still, the gains are task-selective, sensitive to reward design including both weight allocation and reward formulation, and partially dependent on the evaluation protocol. These results position PCSR as a key bottleneck in current MLLMs and highlight limited but meaningful room for recovery under targeted optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces PCSR-Bench, a diagnostic benchmark of 84,373 QA pairs from 2,600 omnidirectional images across 26 indoor environments, to evaluate Perspective-Conditioned Spatial Reasoning (PCSR) in MLLMs. It reports a perception-reasoning gap across 14 models (e.g., 57.59% on foundational relative direction vs. 13.49% on egocentric rotation, 7.13% on ego-distortion, and 0.64% on compositional reasoning) and shows that RL reward shaping on a 7B model can raise performance from 31.10% to 60.06% under controlled conditions, positioning PCSR as a partially plastic bottleneck.

Significance. If the benchmark tasks isolate PCSR without projection or phrasing artifacts, the work would provide a useful diagnostic tool and empirical evidence that targeted optimization can partially close the gap, informing future MLLM development in viewpoint-dependent spatial reasoning.

major comments (1)
  1. [PCSR-Bench construction and task definitions] The central claim that low accuracies on advanced tasks reflect a genuine PCSR bottleneck (rather than omnidirectional projection artifacts or question-generation biases) is load-bearing for the perception-reasoning gap and the RL recovery interpretation. No ablations are reported, such as perspective-crop controls, distortion-corrected variants, or human baselines on matched questions, to confirm that performance drops survive removal of equirectangular effects.
minor comments (1)
  1. [Abstract and §4] The abstract and results sections would benefit from explicit statements on image sourcing, question-generation procedure, and any statistical controls (e.g., variance across environments) to support replicability of the reported accuracies.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on PCSR-Bench construction. We address the concern regarding potential artifacts point by point below and outline revisions to strengthen the claims.

read point-by-point responses
  1. Referee: [PCSR-Bench construction and task definitions] The central claim that low accuracies on advanced tasks reflect a genuine PCSR bottleneck (rather than omnidirectional projection artifacts or question-generation biases) is load-bearing for the perception-reasoning gap and the RL recovery interpretation. No ablations are reported, such as perspective-crop controls, distortion-corrected variants, or human baselines on matched questions, to confirm that performance drops survive removal of equirectangular effects.

    Authors: We agree that explicit controls are needed to isolate PCSR from equirectangular projection effects and generation biases. The benchmark applies identical omnidirectional inputs across all 14 models and all tasks, with the observed drop (e.g., 57.59% foundational relative direction to 13.49% egocentric rotation) occurring consistently; the RL reward-shaping experiment further shows that performance on advanced tasks can be substantially improved (31.10% to 60.06%) under controlled conditions, which would be unlikely if the gap were driven purely by input artifacts. Nevertheless, we did not include perspective-crop controls, distortion-corrected variants, or human baselines on matched questions in the original submission. In the revised manuscript we will add: (1) a human baseline study on a representative subset of questions, (2) direct comparisons against perspective-cropped and distortion-corrected inputs, and (3) expanded details on question-generation validation. These additions will confirm that the performance drops persist after removal of equirectangular effects and thereby reinforce the PCSR-bottleneck interpretation. revision: yes
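If the promised perspective-crop and distortion-corrected comparisons are run, a simple statistical control of the kind the minor comment requests is a paired bootstrap over per-question correctness. The sketch below assumes aligned 0/1 arrays for the two conditions and an illustrative 95% interval; the variable names are hypothetical.

```python
# Hedged sketch of a paired bootstrap comparing per-question correctness under
# the original equirectangular inputs vs. perspective-cropped variants of the
# same questions. Inputs are assumed to be aligned 0/1 arrays.
import numpy as np

def paired_bootstrap_gap(correct_equirect: np.ndarray, correct_cropped: np.ndarray,
                         n_boot: int = 10_000, seed: int = 0):
    """Return the mean accuracy difference (equirect - cropped) and a 95% CI.

    A difference near zero suggests removing equirectangular distortion does not
    change accuracy, supporting the PCSR-bottleneck reading; a large negative
    difference (crops markedly better) would point to projection artifacts.
    """
    assert correct_equirect.shape == correct_cropped.shape
    rng = np.random.default_rng(seed)
    diffs = correct_equirect.astype(float) - correct_cropped.astype(float)
    n = diffs.size
    boot = diffs[rng.integers(0, n, size=(n_boot, n))].mean(axis=1)
    return diffs.mean(), (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
```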

Circularity Check

0 steps flagged

No circularity; empirical benchmark and evaluation are self-contained

full rationale

The paper introduces PCSR-Bench as a new diagnostic benchmark with eight tasks on omnidirectional images, evaluates 14 MLLMs to report a perception-reasoning gap (e.g., 57.59% foundational vs. 0.64% compositional), and performs an RL diagnostic showing improvement from 31.10% to 60.06%. No equations, fitted parameters, predictions by construction, self-definitional constructs, or load-bearing self-citations appear in the text. All central claims rest on direct empirical measurements from the newly defined benchmark and controlled RL runs, which are independent of prior inputs and externally falsifiable via the reported accuracies and task definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that the constructed QA pairs validly measure perspective-conditioned spatial reasoning; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: QA pairs generated from omnidirectional images can isolate viewpoint-dependent spatial reasoning.
    Invoked when defining the eight tasks and interpreting accuracy gaps.

pith-pipeline@v0.9.0 · 5677 in / 1289 out tokens · 51413 ms · 2026-05-14T21:34:00.189518+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
