pith. machine review for the scientific record.

arxiv: 2603.09573 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: no theorem link

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes


Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords panorama language models · omni-scenes · panoramic VQA · sparse attention · equirectangular images · vision language models · adverse scenes · 360 degree reasoning

The pith

Panorama-language models achieve more complete scene understanding than stitched pinhole views by directly processing equirectangular images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Panorama-Language Modeling paradigm as a way to reason over full 360-degree scenes in one pass rather than assembling multiple narrow images. It argues that stitching overlooks spatial and contextual links that a single panorama keeps intact, especially in difficult conditions such as heavy occlusions or traffic accidents. To make this practical, the authors release the PanoVQA dataset of panoramic question-answer pairs focused on adverse omni-scenes and supply a lightweight sparse attention module that lets existing vision-language models handle equirectangular input without any retraining. Experiments show the resulting models are more robust and produce answers that exceed what separate narrow views can deliver when combined. A sympathetic reader would care because many real applications, from autonomous driving to surveillance, need reliable holistic perception rather than piecemeal reconstruction.

Core claim

The central discovery is that a unified 360-degree vision-language reasoning framework, built on a plug-and-play panoramic sparse attention module, enables existing pinhole-based VLMs to process equirectangular panoramas directly and yields understanding greater than the sum of its narrow parts, with measurable gains in robustness under object occlusions and driving accidents.

What carries the argument

The plug-and-play panoramic sparse attention module that lets existing pinhole VLMs process equirectangular panoramas without retraining while preserving holistic spatial relationships.
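The Figure 4 caption on this page describes Panoramic Sparse Attention as selecting the Top-K most relevant key tokens for each query. As a rough illustration of that general idea only, here is a minimal sketch of top-k sparse attention for a single query, assuming plain scaled dot-product scoring. The function name `topk_sparse_attention` and the toy vectors are hypothetical; this is not the authors' module, and the paper's actual scoring and selection details are not given on this page.

```python
import math

def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys (k >= 1).

    query:  list[float] of length d
    keys:   list of n key vectors, each of length d
    values: list of n value vectors of a common length
    k:      number of keys kept; all other keys get zero weight
    """
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    # Indices of the k largest scores (ties resolved by lower index first).
    kept = sorted(range(len(keys)), key=lambda i: (-scores[i], i))[:k]
    # Softmax restricted to the kept keys, shifted by the max for stability.
    top = max(scores[i] for i in kept)
    exps = {i: math.exp(scores[i] - top) for i in kept}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(keys))]
    # Weighted sum of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights

# Toy example (hypothetical data): four tokens, keep the best two.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
out, weights = topk_sparse_attention(query, keys, values, k=2)
```

In this toy case only the first and third keys receive non-zero weight, so the output is a blend of their two values. A real module would batch this over all queries and heads; the per-query loop here is kept for readability.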

If this is right

  • Existing vision-language models can be used on panoramic data without retraining or new data collection.
  • Reasoning performance improves specifically on scenes with occlusions and accidents where stitching breaks spatial context.
  • A single panoramic input replaces the need to capture and align multiple narrow-field images for complete scene coverage.
  • The approach scales to any current pinhole VLM by swapping in the sparse attention module at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the module works on current models, the same lightweight change could be applied to future VLMs trained on mixed pinhole and panoramic data to remove the need for separate pipelines.
  • The same adaptation technique might extend to other wide-field sensors such as fisheye or multi-camera rigs in robotics without requiring full retraining.
  • Because the dataset targets adverse omni-scenes, follow-up work could test whether the same gains appear in less extreme but still wide-field settings such as indoor navigation or sports analysis.

Load-bearing premise

The sparse attention module can adapt pinhole models to full panoramas without retraining while still preserving the spatial relationships that stitching loses.

What would settle it

A controlled test in which the adapted model receives the same panorama both as native equirectangular input and as stitched narrow views, then shows no improvement in accuracy or robustness on PanoVQA questions, would falsify the central claim.
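The test described above is a paired comparison: the same questions are answered under two input modes and scored against the same ground truth. A minimal sketch of such a tally follows, assuming exact-match scoring. The function `paired_comparison` and the toy answers are hypothetical illustrations, not PanoVQA data or the paper's evaluation protocol.

```python
def paired_comparison(gold, pred_pano, pred_stitched):
    """Compare two input modes on the same questions.

    Returns the accuracy of each mode plus the discordant counts:
    questions only the panorama run answered correctly, and questions
    only the stitched run answered correctly.
    """
    n = len(gold)
    if n == 0 or not (n == len(pred_pano) == len(pred_stitched)):
        raise ValueError("all three lists must cover the same non-empty question set")
    pano_ok = [p == g for p, g in zip(pred_pano, gold)]
    stitched_ok = [s == g for s, g in zip(pred_stitched, gold)]
    return {
        "acc_pano": sum(pano_ok) / n,
        "acc_stitched": sum(stitched_ok) / n,
        "only_pano_correct": sum(a and not b for a, b in zip(pano_ok, stitched_ok)),
        "only_stitched_correct": sum(b and not a for a, b in zip(pano_ok, stitched_ok)),
    }

# Toy answers for four questions (illustrative only).
gold = ["front", "left", "back", "right"]
pred_pano = ["front", "left", "back", "left"]
pred_stitched = ["front left", "left", "back", "right"]
report = paired_comparison(gold, pred_pano, pred_stitched)
```

In the toy data both modes reach the same accuracy while disagreeing on two questions, which is why the discordant counts are reported alongside the headline numbers. Under the falsification criterion above, the central claim would be in trouble if the panorama run showed no advantage on either measure.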

Figures

Figures reproduced from arXiv: 2603.09573 by Jiale Wei, Jiaming Zhang, Junwei Zheng, Linlin Shen, Qiufu Li, Rainer Stiefelhagen, Ruiping Liu, Weijia Fan, Yufan Chen, Zichao Zeng.

Figure 1: Overview of Panorama-Language Modeling (PLM). (a) To enable PLM, we create the first PanoVQA dataset with 653K QA pairs, including normal (N), occluded (O), accidental (D) driving scenarios. (b) Compared to narrow-FoV multi-view VLMs, PLM with 360° spatial semantic consistency can identify the potential risks (e.g., a van in the front-left). (c) Evaluating across PanoVQA, our proposed PLM significantly out…
Figure 2: 1-Pano (41.42%) outperforms 6-Cam (40.22%) on PanoVQA-mini. The panorama’s seamless 360° context is key for spatial awareness. As shown, the 6-cam model fails the query, e.g., misidentifying the direction. In contrast, the 1-Pano model leverages the full context to, e.g., correctly locate the object, matching the GT. More examples can be found in the supplementary. SpatialQA [39] focuses on evaluating spa…
Figure 3: Panorama generation overview. Following […
Figure 4: Left: Structure of our proposed attention block with SWA and PSA. Right: The visualization of attention masks for Sliding Window Attention (SWA), Simplified Sparse Attention (SSA), and Panoramic Sparse Attention (PSA), respectively. Panoramic Sparse Attention. The global head is implemented by Panoramic Sparse Attention (PSA), which dynamically selects the Top-K most relevant key tokens for each query to…
Figure 5: Analysis of the performance-parameter trade-off.
Figure 6: Scaling law study on PanoVQA-mini. … adopt a bottleneck dimension of 196 and a Top-K of 512 for all subsequent experiments. All experiments are conducted using PanoLM-3B on PanoVQA-mini. Scaling law study. We confirm the scaling law through experiments on PanoVQA-mini. The results in …
Figure 7: Attention visualization of our proposed PSA. PSA filters uninformative regions like “sky” and distant backgrounds, while …
Figure 8: Distribution of QA samples in PanoVQA. The dataset features disparate scales to mimic real-world distributions: PanoVQA-N …
Figure 9: Hyperparameters for Qwen2.5-VL Baseline.
Figure 10: Prompt used in PanoVQA generation. Using PanoVQA-N as an example.
Figure 11: Prompt used in evaluation. VLM & Multi-view 1-Pano Outperforms 6-Cam on PanoVQA-N What is the visibility and direction of the closest adult to the ego car? Compute TTC for the car in the back right at about 4 meters moving about 19 km/h relative to a stationary ego. Provide result in seconds rounded to one decimal. What is the visibility and approximate distance of the nearest fully visible child? An adul…
Figure 12: Qualitative comparison on PanoVQA-N. The panoramic model (1-Pano) correctly identifies the spatial location (“front”) and visibility of the pedestrian, whereas the multi-view model (6-Cam) hallucinates a “front left” direction due to fragmented spatial context. shown in …
Figure 13: Qualitative comparison on PanoVQA-O. Facing a cluster of bicycles, the 1-Pano model benefits from the unified view to propose a coherent defensive maneuver, demonstrating the importance of seamless context for planning. VLM & Panorama VLM & Multi-view 6-cam vs. 1-pano on PanoVQA-D Focusing on the two colliding cars from metadata (car in the back at 12 meters, 28 km/h and car in the back at 16 meters, 32 k…
Figure 14: Qualitative comparison on PanoVQA-D. Both models accurately predict the severity and type of collision. This confirms that the panoramic representation retains critical visual details necessary for complex accident reasoning. coherence. This coherence proves advantageous in tasks requiring precise localization and holistic scene understanding, without compromising performance on semantic reasoning task…
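One of the example questions captured with Figure 11 asks for the time-to-collision (TTC) of a car about 4 meters away moving about 19 km/h relative to a stationary ego vehicle. A short worked sketch of that arithmetic follows, assuming constant velocity and that the whole relative speed is directed toward the ego vehicle; the helper name `time_to_collision` is hypothetical and the dataset's own ground-truth procedure is not described on this page.

```python
def time_to_collision(distance_m, relative_speed_kmh):
    """Constant-velocity time-to-collision in seconds."""
    if relative_speed_kmh <= 0:
        raise ValueError("object must be closing on the ego vehicle")
    speed_ms = relative_speed_kmh / 3.6  # convert km/h to m/s
    return distance_m / speed_ms

ttc = time_to_collision(4.0, 19.0)
print(round(ttc, 1))  # 0.8
```

Here 19 km/h is roughly 5.28 m/s, so 4 m is covered in about 0.76 s, which rounds to 0.8 s at one decimal.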
Original abstract

Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM)paradigm, a unified $360^\circ$ vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Panorama-Language Modeling (PLM) paradigm for unified 360° vision-language reasoning on equirectangular panoramas, contrasting it with stitching-based approaches that lose holistic context. It contributes the PanoVQA dataset for adverse omni-scenes (occlusions, accidents) and a plug-and-play panoramic sparse attention module that adapts existing pinhole VLMs to panoramas without retraining. The central claim is that PLM yields superior robustness and holistic reasoning, producing understanding greater than the sum of narrow-field parts.

Significance. If the empirical claims hold, this work could advance VLM deployment in robotics, autonomous driving, and surveillance by enabling direct, context-preserving processing of 360° imagery. The PanoVQA dataset would provide a valuable benchmark for adverse conditions. The plug-and-play module, if shown to generalize without retraining, would lower barriers to adopting panoramic inputs in existing models.

major comments (2)
  1. [Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic's concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.
  2. [Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.
minor comments (1)
  1. [Abstract] Abstract: '(PLM)paradigm' is missing a space; should read '(PLM) paradigm'.
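The two distortions named in the first major comment, non-uniform scaling near the poles and the periodic left/right boundary, follow from the geometry of the equirectangular projection itself. The sketch below illustrates both under a common but not universal convention (longitude spanning the image width, latitude spanning its height, sampled at pixel centers). All function names and the 2048 × 1024 resolution are assumptions for illustration and are not taken from the paper.

```python
import math

def pixel_to_lonlat(u, v, width, height):
    """Map an equirectangular pixel center to (longitude, latitude) in radians.

    Assumes longitude runs from -pi to pi left to right and latitude
    runs from pi/2 to -pi/2 top to bottom.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    return lon, lat

def wrapped_column_distance(u1, u2, width):
    """Horizontal pixel distance that respects the 360-degree wrap-around."""
    d = abs(u1 - u2) % width
    return min(d, width - d)

def horizontal_stretch(v, height):
    """Factor by which row v is stretched horizontally relative to the equator."""
    _, lat = pixel_to_lonlat(0, v, width=1, height=height)
    return 1.0 / math.cos(lat)

width, height = 2048, 1024
seam_gap = wrapped_column_distance(0, width - 1, width)   # columns on opposite image edges
equator_stretch = horizontal_stretch(height // 2, height)  # row next to the equator
near_pole_stretch = horizontal_stretch(0, height)          # topmost row
```

The first and last columns are one pixel apart on the sphere even though they sit at opposite edges of the image, and the topmost row is stretched by a factor of several hundred relative to the equator. Any attention pattern defined purely on the 2D pixel grid has to account for both effects, which is the substance of the referee's request for a connectivity analysis.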

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better present our contributions. We address each major point below and will revise the manuscript to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic's concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.

    Authors: We agree that additional empirical support would strengthen the description of the panoramic sparse attention module. In the revised manuscript we will add (i) an ablation isolating the distortion-compensation components, (ii) side-by-side attention-map visualizations before and after the module to illustrate improved long-range connectivity across poles and periodic boundaries, and (iii) a short failure-case analysis on adverse omni-scenes. These additions will directly address the concern that standard VLM attention may fail to link distant elements reliably. revision: yes

  2. Referee: [Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.

    Authors: The full manuscript already contains quantitative results, stitched-pinhole baselines, and summary tables. To make these findings immediately visible and to address the referee’s request, we will (i) revise the abstract to report the key quantitative metrics and (ii) expand the experiments section with explicit error breakdowns by scene type (occlusion versus accident). These changes will render the superiority claims and cross-method comparisons fully verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on module design and experiments, not self-referential reduction

full rationale

The paper presents a plug-and-play panoramic sparse attention module and PanoVQA dataset as the foundation for the PLM paradigm. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content that would reduce the 'more than the sum' claim or robustness assertions to inputs by construction. The adaptation claim is asserted as a design property rather than derived from prior fitted quantities or uniqueness theorems imported from the same authors. This is a standard non-circular introduction of an architectural module whose validity is left to empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes effective adaptation via sparse attention without detailing any fitted values or new postulated constructs.

pith-pipeline@v0.9.0 · 5520 in / 1142 out tokens · 47242 ms · 2026-05-15T13:52:38.814849+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015. 1, 3

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023. 1, 3, 5, 7

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical repor...

  4. [4]

    nuscenes: A multi-modal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multi-modal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 3, 1

  5. [5]

    Occlusion-aware seamless segmentation

    Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, and Kailun Yang. Occlusion-aware seamless segmentation. In European Conference on Computer Vision (ECCV), 2024. 1, 2, 3, 4

  6. [6]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 1, 3, 7

  7. [7]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 5

  8. [8]

    Spherenet: Learning spherical representations for detection and classification in omnidirectional images

    Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European conference on computer vision (ECCV), pages 518–533, 2018. 2

  9. [9]

    Chatglm: A family of large language models from glm-130b to glm-4 all tools, 2024

    Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

  10. [10]

    Vizwiz grand challenge: Answering visual questions from blind people

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

  11. [11]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022. 7

  12. [12]

    Glm-4.1 v-thinking: Towards versatile multi-modal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi-modal reasoning with scalable reinforcement learning. arXiv e-prints, pages arXiv–2507, 2025. 3, 7

  13. [13]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 8

  14. [14]

    Deformable mamba for wide field of view segmentation. arXiv preprint arXiv:2411.16481, 2024

    Jie Hu, Junwei Zheng, Jiale Wei, Jiaming Zhang, and Rainer Stiefelhagen. Deformable mamba for wide field of view segmentation. arXiv preprint arXiv:2411.16481, 2024. 3

  15. [15]

    6-dof vr videos with a single 360-camera

    Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 6-dof vr videos with a single 360-camera. In 2017 IEEE Virtual Reality (VR), pages 37–44. IEEE, 2017. 1

  16. [16]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019. 1

  17. [17]

    Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters, 6(2):1519–1526, 2021

    Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation. IEEE Robotics and Automation Letters, 6(2):1519–1526, 2021. 2

  18. [18]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  19. [19]

    Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding

    Younggun Kim, Ahmed S Abdelrahman, and Mohamed Abdel-Aty. Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 761–771,

  20. [20]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 5, 7

  21. [21]

    Preference leakage: A contamination problem in llm-as-a-judge. arXiv preprint arXiv:2502.01534, 2025

    Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. Preference leakage: A contamination problem in llm-as-a-judge. arXiv preprint arXiv:2502.01534, 2025. 7

  22. [22]

    LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579, 2024. 7

  23. [23]

    DA2: Depth anything in any direction. arXiv preprint arXiv:2509.26618, 2025

    Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. DA2: Depth anything in any direction. arXiv preprint arXiv:2509.26618, 2025. 2, 4

  24. [24]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023. 1

  25. [25]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022. 3

  26. [26]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023. 1, 3, 5, 7

  27. [27]

    Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1

  28. [28]

    Llavanext: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 7

  29. [29]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525,

  30. [30]

    NuPlanQA: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models

    Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, and Ziran Wang. NuPlanQA: A large-scale dataset and benchmark for multi-view driving scene understanding in multi-modal large language models. In ICCV, 2025. 2, 3

  31. [31]

    Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d

    Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In Proceedings of the European Conference on Computer Vision, 2020. 3

  32. [32]

    NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4542–4550, 2024. 2, 3, 4

  33. [33]

    Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Advances in Neural Information Processing Systems, 2025. 1

  34. [34]

    Panoformer: panorama transformer for indoor 360° depth estimation

    Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: panorama transformer for indoor 360° depth estimation. In European Conference on Computer Vision, pages 195–211. Springer, 2022. 2

  35. [35]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256–274. Springer, 2024. 2, 3

  36. [36]

    Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation

    Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. Horizonnet: Learning room layout with 1d representation and pano stretch data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1047–1056, 2019. 2

  37. [37]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 7

  38. [38]

    Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

    OpenGVLab Team. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

  39. [39]

    NuScenes-spatialQA: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164, 2025

    Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. NuScenes-spatialQA: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164, 2025. 2, 3

  40. [40]

    OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In CVPR,

  41. [41]

    Deepaccident: A motion and accident prediction benchmark for v2x autonomous driving

    Tianqi Wang, Sukmin Kim, Ji Wenxuan, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, and Ping Luo. Deepaccident: A motion and accident prediction benchmark for v2x autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5599–5606, 2024. 3, 4, 1

  42. [42]

    Multi-view panoramic image style transfer with multi-scale attention and global sharing. ACM Transactions on Multimedia Computing, Communications and Applications,

    Weiyu Wang, Chunmei Qing, Junpeng Tan, and XiangMin Xu. Multi-view panoramic image style transfer with multi-scale attention and global sharing. ACM Transactions on Multimedia Computing, Communications and Applications,

  43. [43]

    Onebev: Using one panoramic image for bird's-eye-view semantic mapping

    Jiale Wei, Junwei Zheng, Ruiping Liu, Jie Hu, Jiaming Zhang, and Rainer Stiefelhagen. Onebev: Using one panoramic image for bird's-eye-view semantic mapping. In Proceedings of the Asian Conference on Computer Vision, pages 583–596, 2024. 3, 4, 2

  44. [44]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021. 3

  45. [45]

    Show, attend and tell: Neural image caption generation with visual attention

    Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057. PMLR, 2015. 1, 3

  46. [46]

    Chatbev: A visual language model that understands bev maps. arXiv preprint arXiv:2503.13938, 2025

    Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, and Ya Zhang. Chatbev: A visual language model that understands bev maps. arXiv preprint arXiv:2503.13938, 2025. 3

  47. [47]

    Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision

    Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. Bevformer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17830–17839, 2023. 3

  48. [48]

    Capturing omni-range context for omnidirectional segmentation

    Kailun Yang, Jiaming Zhang, Simon Reiß, Xinxin Hu, and Rainer Stiefelhagen. Capturing omni-range context for omnidirectional segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 1, 2, 3

  49. [49]

    mmwalk: Towards multi-modal multi-view walking assistance

    Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. mmwalk: Towards multi-modal multi-view walking assistance. arXiv preprint arXiv:2510.11520, 2025. 1, 2, 3, 7

  50. [50]

    Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization

    Cheng Zhang, Zhaopeng Cui, Cai Chen, Shuaicheng Liu, Bing Zeng, Hujun Bao, and Yinda Zhang. Deeppanocontext: Panoramic 3d scene understanding with holistic scene context graph and relation-based optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12632–12641, 2021. 2

  51. [51]

    Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation

    Jiaming Zhang, Kailun Yang, Chaoxiang Ma, Simon Reiß, Kunyu Peng, and Rainer Stiefelhagen. Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16917–16927, 2022. 3

  52. [52]

    Jiaming Zhang, Kailun Yang, Hao Shi, Simon Reiß, Kunyu Peng, Chaoxiang Ma, Haodong Fu, Philip H. S. Torr, Kaiwei Wang, and Rainer Stiefelhagen. Behind every domain there is a shift: Adapting distortion-aware vision transformers for panoramic semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):8549–8567, 2024. 3

  53. [53]

    Panocontext: A whole-room 3d context model for panoramic scene understanding

    Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In European conference on computer vision, pages 668–686. Springer, 2014. 2

  54. [54]

    Chameleon: Fast-slow neuro-symbolic lane topology extraction

    Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, et al. Chameleon: Fast-slow neuro-symbolic lane topology extraction. arXiv preprint arXiv:2503.07485, 2025. 7

  55. [55]

    Open panoramic segmentation

    Junwei Zheng, Ruiping Liu, Yufan Chen, Kunyu Peng, Chengzhi Wu, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Open panoramic segmentation. In European Conference on Computer Vision, pages 164–182. Springer,

  56. [56]

    Scene-agnostic pose regression for visual localization

    Junwei Zheng, Ruiping Liu, Yufan Chen, Zhenfang Chen, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Scene-agnostic pose regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27092–27102, 2025. 2

  57. [57]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 7

    More than the Sum: Panorama-Language Models for Adverse Omni-Scenes Supplementary Material A. S...

  58. [58]

    Use vague numbers to express distances, without decimal points

    Analyze the panoramic scene annotations, focusing on: − Use a quadruple tuple (category, direction, distance, visibility) to describe an object (e.g., ‘a fully visible pedestrian in the back right around 9 meters’). Use vague numbers to express distances, without decimal points. − Object attributes and spatial relationships (visibility, distance, and dire...

  59. [59]

    Only describe clear information in the images, do not fabricate or invent in the answers

  60. [60]

    Do not make assumptions or invent details

    Base all answers only on what is actually visible in the provided json data. Do not make assumptions or invent details

  61. [61]

    (Describe exact direction such as ‘front left’, ‘back right’, ‘front’, etc.)

    All positions and absolute coordinates must be described in a directional manner. (Describe exact direction such as ‘front left’, ‘back right’, ‘front’, etc.)

  62. [62]

    Visibility Encoding: 1: Low visibility (0–40%); 2: Medium visibility (40–60%); 3: High visibility (60–80%); 4: Fully visible (80–100%)

  63. [63]

    For multi-item answers, maintain the order relevant to the question (e.g., nearest to farthest)

    The question can be slightly modified to produce different answers. For multi-item answers, maintain the order relevant to the question (e.g., nearest to farthest). One question should correspond to one answer

  64. [64]

    Instructions: Fully consider following levels to generate questions and multiple answers:

    All responses should be written expressions in natural language, avoid using symbols or brackets. Instructions: Fully consider following levels to generate questions and multiple answers:

  65. [65]

    Short Level QA: QA pairs that query the basic information in the json file or single panoramic image, the answer can be completely verified by the ground truth

  66. [66]

    fully visible

    Long Level QA: QA pairs that contain multiple objects, with attributions and their relationships in concern, the answer stems mainly from the combined ground truth feature information. The questions should be short and rough, while the answers should be detailed and comprehensive. The answer can be partially verified. QA Types: −Type N1− Global scene unde...
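The quadruple description and the visibility encoding given in the prompt rules above can be illustrated with a small sketch. This is not the authors' annotation code. The names `visibility_level`, `ObjectQuad`, and `PHRASES` are hypothetical, the wording for levels 1 to 3 is assumed (only "fully visible" appears in the prompt's example), and a shared boundary such as 40% is assumed to fall into the higher level, since the prompt leaves that case open.

```python
from dataclasses import dataclass

# Assumed wording per level; only "fully visible" is attested in the prompt example.
PHRASES = {
    1: "low-visibility",
    2: "medium-visibility",
    3: "high-visibility",
    4: "fully visible",
}


def visibility_level(percent: float) -> int:
    """Map a visible fraction in percent to the 1-4 encoding from the prompt.

    Assumption: boundary values (40, 60, 80) belong to the higher level.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie in [0, 100]")
    if percent < 40:
        return 1
    if percent < 60:
        return 2
    if percent < 80:
        return 3
    return 4


@dataclass(frozen=True)
class ObjectQuad:
    """Hypothetical container for (category, direction, distance, visibility)."""
    category: str
    direction: str      # e.g. "front left", "back right", "front"
    distance_m: float   # metric distance taken from the annotation
    visibility: int     # 1-4, see visibility_level

    def describe(self) -> str:
        # Distances are reported as vague whole numbers, without decimal points.
        return (
            f"a {PHRASES[self.visibility]} {self.category} "
            f"in the {self.direction} around {round(self.distance_m)} meters"
        )


quad = ObjectQuad("pedestrian", "back right", 9.3, visibility_level(92.0))
print(quad.describe())
```

With these assumptions, the sample object reproduces the phrasing of the example given in the prompt.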