PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

arxiv: 2605.13169 · v2 · pith:GC36BX3M · submitted 2026-05-13 · 💻 cs.CV · cs.AI

Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen

Pith reviewed 2026-05-19 16:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords panoramic sensing · spatial understanding · multimodal LLMs · equirectangular panoramas · spherical geometry · visual navigation · 3D reasoning · 360 degree world

The pith

PanoWorld enables direct pano-native spatial reasoning in MLLMs by treating equirectangular panoramas as continuous spherical spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to establish that multimodal large language models can achieve spatial supersensing by reasoning natively over 360-degree panoramas in equirectangular projection format. It defines core pano-native abilities such as semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. A pipeline is built to generate geometry-aware, language-grounded, and depth-aware supervision from mixed-source panoramas for instruction tuning. The PanoWorld model incorporates Spherical Spatial Cross-Attention to embed spherical geometry, resulting in better performance than baselines on spatial reasoning benchmarks. If correct, this would mean AI systems could better understand and navigate full surrounding environments without relying on limited field-of-view images.

Core claim

PanoWorld is a method for pano-native understanding where an MLLM reasons over an ERP panorama as a continuous, observer-centered space. Key abilities are defined and instantiated via a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision signals. The model uses Spherical Spatial Cross-Attention to inject spherical geometry into the visual stream, and experiments show it substantially outperforms proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks.

What carries the argument

The argument rests on Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream so that the model can handle the spherical structure of ERP panoramas for pano-native spatial reasoning.
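
A minimal NumPy sketch of this mechanism, following the formula quoted further down this page: s_i = MLP(γ(λ_i, ϕ_i)) and A = MHA(Q=LN(H(0)), K=LN(S), V=LN(S)). Everything beyond that formula is an assumption made for illustration, not the authors' implementation: the Fourier form of γ, the two-layer MLP, a single attention head standing in for multi-head attention, the residual addition of A to H(0), the random weights, and every function name.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def patch_centers(rows, cols):
    """Longitude/latitude (radians) of ERP patch centers, in row-major order."""
    u = (np.arange(cols) + 0.5) / cols
    v = (np.arange(rows) + 0.5) / rows
    lon = (u - 0.5) * 2.0 * np.pi      # left edge -pi, right edge +pi
    lat = (0.5 - v) * np.pi            # top +pi/2, bottom -pi/2
    lon_grid, lat_grid = np.meshgrid(lon, lat)
    return lon_grid.ravel(), lat_grid.ravel()

def fourier_features(lon, lat, num_bands=4):
    """Assumed form of gamma: sin/cos of both angles at a few frequencies."""
    freqs = 2.0 ** np.arange(num_bands)
    angles = np.concatenate([lon[:, None] * freqs, lat[:, None] * freqs], axis=1)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def ssca_sketch(H0, rows, cols, rng, hidden=32):
    """Visual tokens H0 (N, d) attend to spherical tokens S (N, d)."""
    n, d = H0.shape
    if n != rows * cols:
        raise ValueError("H0 must hold one token per ERP patch")
    lon, lat = patch_centers(rows, cols)
    g = fourier_features(lon, lat)
    # s_i = MLP(gamma(lon_i, lat_i)); random weights stand in for learned ones.
    W1 = rng.normal(0.0, 0.1, (g.shape[1], hidden))
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    S = np.maximum(g @ W1, 0.0) @ W2
    # A = Attention(Q=LN(H0), K=LN(S), V=LN(S)), single head for brevity.
    Wq, Wk, Wv = (rng.normal(0.0, d ** -0.5, (d, d)) for _ in range(3))
    Q, K, V = layer_norm(H0) @ Wq, layer_norm(S) @ Wk, layer_norm(S) @ Wv
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    A = attn @ V
    return H0 + A, attn   # residual injection is an assumption

rng = np.random.default_rng(0)
H0 = rng.normal(size=(8 * 16, 24))          # 8 x 16 patches, 24-dim tokens
H1, attn = ssca_sketch(H0, rows=8, cols=16, rng=rng)
```

The point of the sketch is only the data flow: the keys and values come from spherical coordinates alone, so the attention output carries position on the sphere rather than position in the flat image grid.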

If this is right

Pano-native supervision allows MLLMs to perform semantic anchoring and spherical localization on full panoramas.
Depth-aware 3D spatial reasoning becomes feasible in a single forward pass over the ERP image.
Outperformance is observed on diagnostic benchmarks like PanoSpace-Bench and navigation tasks like R2R-CE Val-Unseen.
The approach supports scalable training using mixed-source ERP panoramas.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Robotic applications could benefit from more efficient full-surround perception using this native panorama processing.
The defined pano-native abilities might generalize to other immersive imaging formats beyond standard ERP.
Creating similar benchmarks for other spatial tasks could accelerate progress in panoramic AI.
Integrating this with video data could extend the supersensing to dynamic environments.

Load-bearing premise

The geometry-aware and depth-aware supervision signals accurately capture continuous observer-centered spatial relationships without artifacts from the equirectangular projection.
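
To make the projection concern concrete, the sketch below maps ERP pixels to directions on the unit sphere and computes the solid angle each pixel covers. The axis convention (x right, y up, z forward, longitude 0 at the image center) and all function names are assumptions chosen for illustration; the paper's own convention is not stated on this page.

```python
import math

def erp_pixel_to_lonlat(u, v, width, height):
    """Map an ERP pixel index (u, v) to the (lon, lat) of its center, in radians."""
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return lon, lat

def lonlat_to_unit_vector(lon, lat):
    """Assumed convention: x right, y up, z forward (lon = 0, lat = 0)."""
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def pixel_solid_angle(v, width, height):
    """Solid angle (steradians) of one pixel in row v, integrated over its latitude band."""
    lat_top = (0.5 - v / height) * math.pi
    lat_bottom = (0.5 - (v + 1) / height) * math.pi
    return (2.0 * math.pi / width) * (math.sin(lat_top) - math.sin(lat_bottom))

W, H = 64, 32
equator = pixel_solid_angle(H // 2, W, H)
pole = pixel_solid_angle(0, W, H)
total = sum(W * pixel_solid_angle(v, W, H) for v in range(H))
```

Every pixel has the same size in the image, but a pixel in the top row covers far less of the sphere than one at the equator, while the total over all pixels is the full 4π steradians. Supervision that treats pixel distances as spatial distances therefore over-counts the polar regions, which is the artifact this premise assumes away.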

What would settle it

Evaluating PanoWorld on a held-out set of real captured 360-degree images with precise 3D ground truth annotations and measuring whether spatial localization accuracy matches the reported gains or reveals projection-related errors.
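
One way such a measurement could be set up is sketched below: a great-circle angular error that respects the wraparound at ±180° longitude, averaged per latitude band so that pole-specific degradation becomes visible. The metric, the band width, and the function names are illustrative assumptions; the page does not say which localization metric the paper reports.

```python
import math

def angular_error_deg(lon_pred, lat_pred, lon_true, lat_true):
    """Great-circle angle in degrees between two directions given in radians (haversine form)."""
    dlon = lon_pred - lon_true
    dlat = lat_pred - lat_true
    a = (math.sin(dlat / 2.0) ** 2
         + math.cos(lat_pred) * math.cos(lat_true) * math.sin(dlon / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))

def mean_error_by_latitude_band(samples, band_deg=30.0):
    """samples: iterable of (lon_pred, lat_pred, lon_true, lat_true) in radians.

    Returns {lower edge of band in degrees: mean angular error in degrees},
    with bands assigned by the ground-truth latitude.
    """
    sums, counts = {}, {}
    for lon_p, lat_p, lon_t, lat_t in samples:
        band = math.floor(math.degrees(lat_t) / band_deg) * band_deg
        sums[band] = sums.get(band, 0.0) + angular_error_deg(lon_p, lat_p, lon_t, lat_t)
        counts[band] = counts.get(band, 0) + 1
    return {band: sums[band] / counts[band] for band in sorted(sums)}

r = math.radians
seam_error = angular_error_deg(r(179), 0.0, r(-179), 0.0)   # across the ERP seam
bands = mean_error_by_latitude_band([
    (r(10), r(5), r(12), r(5)),     # near the equator
    (0.0, r(70), 0.0, r(75)),       # near the pole
])
```

A naive pixel-space difference would report the two seam directions as almost a full image width apart, whereas they are only 2° apart on the sphere. Errors that grow in the high-latitude bands would point to projection-related problems.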

Figures

Figures reproduced from arXiv: 2605.13169 by Changpeng Wang, Donglian Qi, Junhan Liu, Xi Chen, Xin Lin, Yuheng Liu, Yunfeng Yan, Zhen Wang.

**Figure 2.** Verifiable metadata construction pipeline. We collect mixed-source ERP panoramas, per ...

**Figure 3.** The architecture of PanoWorld. After patch embedding, visual tokens ...

**Figure 4.** Case comparison on H* Bench. Perspective-view iterative search is inefficient and may fail due to fragmented local observations, whereas direct ERP input enables holistic reasoning and correct prediction in one step.

**Figure 5.** Object category distribution in the constructed metadata.

**Figure 6.** Instruction format distribution in the generated training data.

**Figure 7.** Case studies of pano-native spatial reasoning. The first two examples show downstream human-centric visual ...

Original abstract

Multimodal large language models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

PanoWorld adds spherical cross-attention and a custom supervision pipeline for direct ERP panorama reasoning, but the gains rest on unverified claims about clean geometry signals.

The short version: this paper proposes handling 360-degree panoramas natively in multimodal large language models rather than splitting them into perspective crops, and it backs that proposal with a new attention module, a data construction pipeline, and a benchmark.

What is new is the Spherical Spatial Cross-Attention component, which tries to bring spherical geometry into the visual stream. The authors also lay out four key abilities for pano-native understanding and create a large-scale pipeline that turns mixed-source equirectangular panoramas into instruction tuning data with semantic, localization, transformation, and depth-aware labels. PanoSpace-Bench is positioned as a diagnostic tool for ERP-native spatial reasoning. These elements are not in the prior work referenced in the abstract.

The paper does well in framing the problem clearly. Standard approaches lose the continuous surrounding space that panoramas provide, which matters for tasks like robotic navigation. By focusing on observer-centered space and building supervision that includes depth and geometry, the authors offer a concrete alternative. The reported outperformance on multiple benchmarks suggests the direction is worth pursuing.

The soft spots are around the soundness of the supervision signals. The central claim requires that the geometry-aware and depth-aware labels accurately reflect continuous spatial relationships without distortion from the equirectangular projection. Near the poles, standard processing can warp angles and depths. If the pipeline does not include explicit corrections such as distortion-aware sampling, the gains could come from other factors such as data volume. The abstract gives no indication of how this is handled, so the stress-test concern about artifacts holds until the methods section shows otherwise. Experimental details are also missing from what is available, which makes the results hard to verify.

This paper is for computer vision researchers working on multimodal models for spatial tasks in robotics or immersive settings. Anyone looking to improve panoramic scene understanding or to use a new benchmark in this area would find value here. It deserves a serious referee because it targets a clear limitation in existing pipelines with specific technical proposals. I recommend sending it to peer review with attention to the pipeline details and any ablations on the attention mechanism.

Referee Report

2 major / 2 minor

Summary. The paper presents PanoWorld, a multimodal large language model for pano-native understanding in 360-degree equirectangular projection (ERP) panoramas. It defines key abilities such as semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. A large-scale metadata construction pipeline is built to generate geometry-aware, language-grounded, and depth-aware supervision from mixed-source ERP panoramas, which is used for instruction tuning. The model uses Spherical Spatial Cross-Attention to incorporate spherical geometry into the visual stream. A new diagnostic benchmark PanoSpace-Bench is introduced, and the model is shown to outperform proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen.

Significance. This work addresses a significant gap in spatial reasoning for panoramic images, which is crucial for applications like navigation and robotic search. By focusing on pano-native methods rather than decomposing into perspective views, it could lead to more robust 3D scene understanding models. The public release of code and data would enhance reproducibility and further research in the field.

major comments (2)

[§3.2] Metadata construction pipeline: the claim that mixed-source ERP panoramas are converted into accurate, continuous observer-centered supervision signals (semantic anchoring, spherical localization, depth-aware 3D reasoning) lacks any description of explicit spherical correction for equirectangular distortion. Standard perspective-derived depth or label transfer without distortion-aware sampling or spherical harmonics would embed pole-compression artifacts, making the reported gains on PanoSpace-Bench potentially attributable to data scale rather than true pano-native geometry injection.
[§5.3, Table 2] Main results: the outperformance on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen is presented as evidence that dedicated pano-native supervision is necessary, yet no ablation isolates the Spherical Spatial Cross-Attention module from the instruction data volume or base model capacity. Without this, the central claim that geometry-aware adaptation (rather than scale) drives the gains remains unverified.

minor comments (2)

[§2] The notation for reference-frame transformation in §2 could be formalized with explicit coordinate mappings to improve clarity.
[Figure 4] The attention visualization would benefit from quantitative metrics on attention distribution across spherical latitudes to demonstrate reduced pole bias.
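
The explicit coordinate mapping requested in the [§2] comment might look like the sketch below, which re-expresses a direction seen in one observer frame in a frame rotated by a yaw angle. The axis convention (x right, y up, z forward), the sign convention (positive yaw means the observer turns right), and the function names are assumptions for illustration; the paper's own notation is not reproduced on this page.

```python
import math

def lonlat_to_vector(lon, lat):
    """Unit direction for (lon, lat) in radians; assumed axes: x right, y up, z forward."""
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))

def vector_to_lonlat(vec):
    """Inverse mapping; atan2 keeps longitude in (-pi, pi]."""
    x, y, z = vec
    return math.atan2(x, z), math.asin(max(-1.0, min(1.0, y)))

def rotate_yaw(vec, yaw):
    """Coordinates of `vec` in a frame whose forward axis is turned right by `yaw` radians."""
    x, y, z = vec
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x - s * z, y, s * x + c * z)

def reframe(lon, lat, yaw):
    """Where a target at (lon, lat) appears after the observer turns right by `yaw`."""
    return vector_to_lonlat(rotate_yaw(lonlat_to_vector(lon, lat), yaw))

r = math.radians
# A target 30 degrees to the right; the observer then turns 90 degrees right.
new_lon, new_lat = reframe(r(30), r(10), r(90))
# A target near the seam; the observer turns 30 degrees left.
seam_lon, _ = reframe(r(170), 0.0, r(-30))
```

Under these conventions a pure yaw rotation reduces to lon' = lon - yaw with wraparound, which corresponds to a circular horizontal shift of the ERP image; latitude is unchanged. Pitch or roll would not reduce to a shift and would need the full rotation applied to the direction vector.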

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on our methodology and experiments while indicating planned revisions where appropriate.

Referee: [§3.2] Metadata construction pipeline: the claim that mixed-source ERP panoramas are converted into accurate, continuous observer-centered supervision signals (semantic anchoring, spherical localization, depth-aware 3D reasoning) lacks any description of explicit spherical correction for equirectangular distortion. Standard perspective-derived depth or label transfer without distortion-aware sampling or spherical harmonics would embed pole-compression artifacts, making the reported gains on PanoSpace-Bench potentially attributable to data scale rather than true pano-native geometry injection.

Authors: We appreciate this observation regarding the need for explicit detail. Our metadata construction pipeline does incorporate distortion-aware processing when generating geometry-aware and depth-aware supervision from ERP sources, including spherical surface projection and interpolation to preserve continuity. However, we acknowledge that §3.2 would benefit from a more explicit description of these steps. In the revised manuscript we will expand this section to detail the equirectangular distortion correction, including distortion-compensated sampling and spherical interpolation methods used for label and depth transfer. This will make clear that the supervision avoids pole-compression artifacts and supports true pano-native geometry rather than relying solely on data scale.

revision: yes

Referee: [§5.3, Table 2] Main results: the outperformance on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen is presented as evidence that dedicated pano-native supervision is necessary, yet no ablation isolates the Spherical Spatial Cross-Attention module from the instruction data volume or base model capacity. Without this, the central claim that geometry-aware adaptation (rather than scale) drives the gains remains unverified.

Authors: We agree that an explicit ablation isolating the Spherical Spatial Cross-Attention module would strengthen verification of our central claim. While our experiments compare PanoWorld against baselines that differ in both architecture and training data, we did not include a controlled ablation holding data volume and base model fixed. In the revised version we will add such an ablation study in §5.3, training variants with and without the Spherical Spatial Cross-Attention module on identical instruction data and base capacity. This will directly demonstrate the contribution of the geometry-aware adaptation to the observed gains on PanoSpace-Bench and other benchmarks.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain is self-contained

The paper defines pano-native abilities (semantic anchoring, spherical localization, reference-frame transformation, depth-aware 3D reasoning), describes a metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware and depth-aware supervision signals, introduces Spherical Spatial Cross-Attention in PanoWorld, builds PanoSpace-Bench, and reports experimental outperformance on multiple benchmarks. No equations, self-citations, or steps in the abstract reduce any claimed result to a fitted parameter or prior self-result by construction. The supervision pipeline and model adaptation are presented as independent engineering contributions whose validity is tested externally via benchmark comparisons rather than tautologically assumed. This is the normal case of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the work appears to build on standard MLLM training assumptions while adding new components whose internal structure is not specified here.

pith-pipeline@v0.9.0 · 5844 in / 1067 out tokens · 39894 ms · 2026-05-19T16:51:51.861702+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: We introduce Spherical Spatial Cross-Attention (SSCA) ... $s_i = \mathrm{MLP}(\gamma(\lambda_i, \phi_i))$ ... $A = \mathrm{MHA}(Q = \mathrm{LN}(H^{(0)}),\ K = \mathrm{LN}(S),\ V = \mathrm{LN}(S))$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: capability families: semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 12 internal anchors

[1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, et al. Qwen3-vl technical report, 2025. URL https://arxiv.org/abs/2511.21631

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, et al. Qwen2.5-vl technical report, 2025. URLhttps://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation

Zidong Cao, Jinjing Zhu, Weiming Zhang, Hao Ai, Haotian Bai, Hengshuang Zhao, and Lin Wang. Panda: Towards panoramic depth anything with unlabeled panoramas and mobius spatial augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 982–992, 2025

work page 2025
[4]

SpatialVLM: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

Boyang Chen, Ruijie Xu, Xinyu Zhang, et al. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities.arXiv preprint arXiv:2401.12168, 2024

work page arXiv 2024
[5]

360+x: A panoptic multi-modal scene understanding dataset

Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini, Xiaohan Hong, and Jianbo Jiao. 360+x: A panoptic multi-modal scene understanding dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[6]

Spatialrgpt: Grounded spatial reasoning in vision-language models

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. In Advances in Neural Information Processing Systems, 2024

work page 2024
[7]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URLhttps://arxiv.org/abs/2507.06261

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Spherenet: Learning spher- ical representations for detection and classification in omnidirectional images

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spher- ical representations for detection and classification in omnidirectional images. InProceedings of the European conference on computer vision (ECCV), pages 518–533, 2018

work page 2018
[9]

Lau-net: Latitude adaptive upscaling network for omnidirectional image super-resolution

Xin Deng, Hao Wang, Mai Xu, Yichen Guo, Yuhang Song, and Li Yang. Lau-net: Latitude adaptive upscaling network for omnidirectional image super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9189–9198, 2021

work page 2021
[10]

Are multimodal large language models ready for omnidirectional spatial reasoning?, 2025

Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, and Xuming Hu. Are multimodal large language models ready for omnidirectional spatial reasoning?, 2025. URLhttps://arxiv.org/abs/2505.11907

work page arXiv 2025
[11]

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, and Rainer Stiefelhagen. More than the sum: Panorama-language models for adverse omni-scenes.arXiv preprint arXiv:2603.09573, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Dit360: High-fidelity panoramic image generation via hybrid training.arXiv preprint arXiv:2510.11712, 2025

Haoran Feng, Dizhe Zhang, Xiangtai Li, Bo Du, and Lu Qi. Dit360: High-fidelity panoramic image generation via hybrid training.arXiv preprint arXiv:2510.11712, 2025

work page arXiv 2025
[13]

Wedetect: Fast open-vocabulary object detection as retrieval.arXiv preprint arXiv:2512.12309, 2025

Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, and Wei-Shi Zheng. Wedetect: Fast open-vocabulary object detection as retrieval.arXiv preprint arXiv:2512.12309, 2025

work page arXiv 2025
[14]

Airsim360: A panoramic simulation platform within drone view.arXiv preprint arXiv:2512.02009, 2025

Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, et al. Airsim360: A panoramic simulation platform within drone view.arXiv preprint arXiv:2512.02009, 2025

work page arXiv 2025
[15]

Panovggt: Feed-forward 3d reconstruction from panoramic imagery, 2026

Yijing Guo, Mengjun Chao, Luo Wang, Tianyang Zhao, Haizhao Dai, Yingliang Zhang, Jingyi Yu, and Yujiao Shi. Panovggt: Feed-forward 3d reconstruction from panoramic imagery, 2026. URLhttps://arxiv.org/abs/2603.17571

work page arXiv 2026
[16]

3d-llm: Injecting the 3d world into large language models

Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[17]

An embodied generalist agent in 3d world

Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. InInternational Conference on Machine Learning, 2024. 11

work page 2024
[18]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, Aaron Ostrow, Alethea Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Sim-2-sim transfer for vision-and-language navigation in continu- ous environments

Jacob Krantz and Stefan Lee. Sim-2-sim transfer for vision-and-language navigation in continu- ous environments. InEuropean Conference on Computer Vision (ECCV), 2022

work page 2022
[20]

Beyond the nav- graph: Vision-and-language navigation in continuous environments

Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. Beyond the nav- graph: Vision-and-language navigation in continuous environments. InEuropean Conference on Computer Vision, pages 104–120. Springer, 2020

work page 2020
[21]

Waypoint models for instruction-guided navigation in continuous environments

Jacob Krantz, Aaron Gokaslan, Dhruv Batra, Stefan Lee, and Oleksandr Maksymets. Waypoint models for instruction-guided navigation in continuous environments. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021

work page 2021
[22]

Da 2: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. Da 2: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025

work page arXiv 2025
[23]

Realsee3d: A large-scale multi-view rgb-d dataset of indoor scenes (version 1.0), 2025

Linyuan Li, Yan Wu, Xi Li, Lingli Wang, Tong Rao, Jie Zhou, Cihui Pan, and Xinchen Hui. Realsee3d: A large-scale multi-view rgb-d dataset of indoor scenes (version 1.0), 2025. URL https://doi.org/10.5281/zenodo.17826243

work page doi:10.5281/zenodo.17826243 2025
[24]

One flight over the gap: A survey from perspective to panoramic vision.arXiv preprint arXiv:2509.04444, 2025

Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, et al. One flight over the gap: A survey from perspective to panoramic vision.arXiv preprint arXiv:2509.04444, 2025

work page arXiv 2025
[25]

Depth any panoramas: A foundation model for panoramic depth estimation.arXiv preprint arXiv:2512.16913, 2025

Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation.arXiv preprint arXiv:2512.16913, 2025

work page arXiv 2025
[26]

PanoEnv: Exploring 3d spatial intelligence in panoramic environments with reinforcement learning.arXiv preprint arXiv:2602.21992, 2026

Zekai Lin and Xu Zheng. PanoEnv: Exploring 3d spatial intelligence in panoramic environments with reinforcement learning.arXiv preprint arXiv:2602.21992, 2026

work page arXiv 2026
[27]

Panoswin: A pano-style swin transformer for panorama understanding

Zhixin Ling, Zhen Xing, Xiangdong Zhou, Manliang Cao, and Guichun Zhou. Panoswin: A pano-style swin transformer for panorama understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[28]

Ministral 3

Alexander H. Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, et al. Ministral 3, 2026. URLhttps://arxiv.org/abs/2601.08584

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods

Weichen Liu, Qiyao Xue, Haoming Wang, Xiangyu Yin, Boyuan Yang, and Wei Gao. Spatial reasoning in multimodal large language models: A survey of tasks, benchmarks and methods. arXiv preprint arXiv:2511.15722, 2025

work page arXiv 2025
[30]

Omniroam: World wandering via long- horizon panoramic video generation.arXiv preprint arXiv:2603.30045, 2026

Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold- Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, et al. Omniroam: World wandering via long- horizon panoramic video generation.arXiv preprint arXiv:2603.30045, 2026

work page arXiv 2026
[31]

Sqa3d: Situated question answering in 3d scenes

Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. Sqa3d: Situated question answering in 3d scenes. InInternational Conference on Learning Representations, 2023

work page 2023
[32]

Openeqa: Embodied question answering in the era of foundation models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Alexander Sax, and Aravind ...

work page 2024
[33]

Panoformer: Panorama transformer for indoor 360◦ depth estimation

Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: Panorama transformer for indoor 360◦ depth estimation. InEuropean Conference on Computer Vision (ECCV), 2022. 12

work page 2022
[34]

Tsaftaris

Ilias Stogiannidis, Steven McDonagh, and Sotirios A. Tsaftaris. Mind the gap: Benchmarking spatial reasoning in vision-language models.arXiv preprint arXiv:2503.19707, 2025

work page arXiv 2025
[35]

Qwen3.5: Accelerating productivity with native multimodal agents, February

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

work page
[36]

URLhttps://qwen.ai/blog?id=qwen3.5

work page
[37]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[38]

From illusion to intention: Visual rationale learning for vision-language reasoning, 2025

Changpeng Wang, Haozhe Wang, Xi Chen, Junhan Liu, Taofeng Xue, Chong Peng, Donglian Qi, Fangzhen Lin, and Yunfeng Yan. From illusion to intention: Visual rationale learning for vision-language reasoning, 2025. URLhttps://arxiv.org/abs/2511.23031

work page arXiv 2025
[39]

Dreamwalker: Mental planning for continuous vision-language navigation

Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, and Jianbing Shen. Dreamwalker: Mental planning for continuous vision-language navigation. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

work page 2023
[40]

VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Haozhe Wang, Alex Su, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025
[42]

Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, and Wenhu Chen. RationalRewards: Reasoning rewards scale visual generation both training and test time. arXiv preprint arXiv:2604.11626, 2026
[43]

Ning-Hsu Wang and Yu-Lun Liu. Depth anywhere: Enhancing 360 monocular depth estimation via perspective distillation and unlabeled data augmentation. Advances in Neural Information Processing Systems, 37:127739–127764, 2024
[44]

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025. URL https://arxiv.org/abs/2508.18265
[45]

Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, and Shuqiang Jiang. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
[46]

Zihan Wang, Seungjun Lee, and Gim Hee Lee. Dynam3d: Dynamic layered 3d tokens empower vlm for vision-and-language navigation. In Advances in Neural Information Processing Systems, 2025
[47]

Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, and Jiangmiao Pang. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240, 2025
[48]

Xiaomi MiMo Team. Mimo-v2.5. https://huggingface.co/collections/XiaomiMiMo/mimo-v25, 2026. Accessed: 2026-05-06
[49]

Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
[50]

Liu Yang, Huiyu Duan, Ran Tao, Juntao Cheng, Sijing Wu, Yunhao Li, Jing Liu, Xiongkuo Min, and Guangtao Zhai. ODI-Bench: Can MLLMs understand immersive omnidirectional environments? arXiv preprint arXiv:2510.11549, 2025
[51]

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025
[52]

Fanghua Yu, Xintao Wang, Mingdeng Cao, Gen Li, Ying Shan, and Chao Dong. Osrt: Omnidirectional image super-resolution with distortion-aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13283–13292, 2023
[53]

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, and Yiming Li. Thinking in 360°: Humanoid visual search in the wild. arXiv preprint arXiv:2511.20351, 2025
[54]

Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, and Gunhee Kim. Pano-avqa: Grounded audio-visual question answering on 360° videos, 2021. URL https://arxiv.org/abs/2110.05122
[55]

Junsheng Zha et al. How to enable llm with 3d capacity? a survey of spatial reasoning in llm. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, 2025
[56]

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, and He Wang. Uni-NaVid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224, 2024
[57]

Xinshen Zhang, Tongxi Fu, and Xu Zheng. Omnidirectional spatial modeling from correlated panoramas. arXiv preprint arXiv:2509.02164, 2025
[58]

Xinshen Zhang, Zhen Ye, and Xu Zheng. Towards omnidirectional reasoning with 360-r1: A dataset, benchmark, and GRPO-based method. arXiv preprint arXiv:2505.14197, 2025
[59]

Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Efficient-vln: A training-efficient vision-language navigation model. arXiv preprint arXiv:2512.10310, 2025
[60]

Duo Zheng, Shijia Huang, and Liwei Wang. Video-3d llm: Learning position-aware video representation for 3d scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8995–9006, 2025
[61]

Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3d: A large photo-realistic dataset for structured 3d modeling. In European Conference on Computer Vision, pages 519–535. Springer, 2020
[62]

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, and Xuming Hu. Multimodal spatial reasoning in the large model era: A survey and benchmarks. arXiv preprint arXiv:2510.25760, 2025
[63]

Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, and Xuming Hu. Omnisam: Omnidirectional segment anything model for uda in panoramic semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23892–23901, 2025
[64]

Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, and Lu Qi. Dense360: Dense understanding from omnidirectional panoramas. arXiv preprint arXiv:2506.14471, 2025
[65]

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4295–4305, 2025
[66]

Weiye Zhu, Zekai Zhang, Xiangchen Wang, Hewei Pan, Teng Wang, Tiantian Geng, Rongtao Xu, and Feng Zheng. Navida: Vision-language navigation with inverse dynamics augmentation. arXiv preprint arXiv:2601.18188, 2026

A Dataset Details

A.1 ERP Corpus Composition

Table 9 summarizes the ERP image sources. Our ERP corpus contains 570,321 full-surround panora...
