arxiv: 2603.09573 · v2 · submitted 2026-03-10 · 💻 cs.CV

Recognition: no theorem link

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen


Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords panorama language models · omni-scenes · panoramic VQA · sparse attention · equirectangular images · vision language models · adverse scenes · 360 degree reasoning


The pith

Panorama-language models achieve more complete scene understanding than stitched pinhole views by directly processing equirectangular images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Panorama-Language Modeling paradigm as a way to reason over full 360-degree scenes in one pass rather than assembling multiple narrow images. It argues that stitching overlooks spatial and contextual links that a single panorama keeps intact, especially in difficult conditions such as heavy occlusions or traffic accidents. To make this practical, the authors release the PanoVQA dataset of panoramic question-answer pairs focused on adverse omni-scenes and supply a lightweight sparse attention module that lets existing vision-language models handle equirectangular input without any retraining. Experiments show the resulting models are more robust and produce answers that exceed what separate narrow views can deliver when combined. A sympathetic reader would care because many real applications, from autonomous driving to surveillance, need reliable holistic perception rather than piecemeal reconstruction.

Core claim

The central discovery is that a unified 360-degree vision-language reasoning framework, built on a plug-and-play panoramic sparse attention module, enables existing pinhole-based VLMs to process equirectangular panoramas directly and yields understanding greater than the sum of its narrow parts, with measurable gains in robustness under object occlusions and driving accidents.

What carries the argument

The plug-and-play panoramic sparse attention module that lets existing pinhole VLMs process equirectangular panoramas without retraining while preserving holistic spatial relationships.

If this is right

Existing vision-language models can be used on panoramic data without retraining or new data collection.
Reasoning performance improves specifically on scenes with occlusions and accidents where stitching breaks spatial context.
A single panoramic input replaces the need to capture and align multiple narrow-field images for complete scene coverage.
The approach scales to any current pinhole VLM by swapping in the sparse attention module at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the module works on current models, the same lightweight change could be applied to future VLMs trained on mixed pinhole and panoramic data to remove the need for separate pipelines.
The same adaptation technique might extend to other wide-field sensors such as fisheye or multi-camera rigs in robotics without requiring full retraining.
Because the dataset targets adverse omni-scenes, follow-up work could test whether the same gains appear in less extreme but still wide-field settings such as indoor navigation or sports analysis.

Load-bearing premise

The sparse attention module can adapt pinhole models to full panoramas without retraining while still preserving the spatial relationships that stitching loses.

What would settle it

A controlled test in which the adapted model receives the same panorama both as native equirectangular input and as stitched narrow views, then shows no improvement in accuracy or robustness on PanoVQA questions, would falsify the central claim.

Figures

Figures reproduced from arXiv: 2603.09573 by Jiale Wei, Jiaming Zhang, Junwei Zheng, Linlin Shen, Qiufu Li, Rainer Stiefelhagen, Ruiping Liu, Weijia Fan, Yufan Chen, Zichao Zeng.

**Figure 1.** Overview of Panorama-Language Modeling (PLM). (a) To enable PLM, we create the first PanoVQA dataset with 653K QA pairs, including normal (N), occluded (O), accidental (D) driving scenarios. (b) Compared to narrow-FoV multi-view VLMs, PLM with 360° spatial semantic consistency can identify the potential risks (e.g., a van in the front-left). (c) Evaluating across PanoVQA, our proposed PLM significantly out…

**Figure 2.** 1-Pano (41.42%) outperforms 6-Cam (40.22%) on PanoVQA-mini. The panorama’s seamless 360° context is key for spatial awareness. As shown, the 6-cam model fails the query, e.g., misidentifying the direction. In contrast, the 1-Pano model leverages the full context to, e.g., correctly locate the object, matching the GT. More examples can be found in the supplementary. SpatialQA [39] focuses on evaluating spa…

**Figure 3.** Panorama generation overview. Following [ …

**Figure 4.** Left: Structure of our proposed attention block with SWA and PSA. Right: The visualization of attention masks for Sliding Window Attention (SWA), Simplified Sparse Attention (SSA), and Panoramic Sparse Attention (PSA), respectively. Panoramic Sparse Attention. The global head is implemented by Panoramic Sparse Attention (PSA), which dynamically selects the Top-K most relevant key tokens for each query to…
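The Top-K selection mentioned in the Figure 4 caption can be illustrated with a small pure-Python sketch. This is not the paper's PSA module: it assumes relevance is the scaled dot product between query and key, uses a single head, and the function name `topk_sparse_attention` is hypothetical. It only shows the general technique of restricting each query's softmax to its `top_k` highest-scoring keys.

```python
import math


def topk_sparse_attention(queries, keys, values, top_k):
    """Single-head attention where each query attends only to its top_k keys.

    queries, keys: lists of equal-length float vectors; values: one float
    vector per key. Illustrative only; not the paper's PSA implementation.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    top_k = min(top_k, len(keys))
    outputs = []
    for q in queries:
        # Scaled dot-product score of this query against every key (assumed relevance).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Keep the indices of the top_k highest-scoring keys and drop the rest.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:top_k]
        # Softmax over the kept scores only, shifted by the max for stability.
        peak = max(scores[j] for j in keep)
        weights = [math.exp(scores[j] - peak) for j in keep]
        total = sum(weights)
        # Weighted sum of the value vectors that belong to the kept keys.
        out = [
            sum(w / total * values[j][d] for w, j in zip(weights, keep))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
sparse = topk_sparse_attention(queries, keys, values, top_k=1)  # best key only
dense = topk_sparse_attention(queries, keys, values, top_k=3)   # all keys kept
```

With `top_k=1` the query returns the value of its single best-matching key; with `top_k` equal to the number of keys the function reduces to ordinary dense softmax attention. The Figure 6 caption reports a Top-K of 512 in the paper's experiments; the toy sizes here are chosen only for readability.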

**Figure 5.** Analysis of the performance-parameter trade-off.

**Figure 6.** Scaling law study on PanoVQA-mini. … adopt a bottleneck dimension of 196 and a Top-K of 512 for all subsequent experiments. All experiments are conducted using PanoLM-3B on PanoVQA-mini. Scaling law study. We confirm the scaling law through experiments on PanoVQA-mini. The results in …

**Figure 7.** Attention visualization of our proposed PSA. PSA filters uninformative regions like “sky” and distant backgrounds, while …

**Figure 8.** Distribution of QA samples in PanoVQA. The dataset features disparate scales to mimic real-world distributions: PanoVQA-N …

**Figure 9.** Hyperparameters for Qwen2.5-VL Baseline.

**Figure 10.** Prompt used in PanoVQA generation. Using PanoVQA-N as an example.

**Figure 11.** Prompt used in evaluation. VLM & Multi-view 1-Pano Outperforms 6-Cam on PanoVQA-N What is the visibility and direction of the closest adult to the ego car? Compute TTC for the car in the back right at about 4 meters moving about 19 km/h relative to a stationary ego. Provide result in seconds rounded to one decimal. What is the visibility and approximate distance of the nearest fully visible child? An adul…
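One of the sample questions quoted in the Figure 11 text asks for a time-to-collision (TTC) from a distance of about 4 meters and a relative speed of about 19 km/h. The arithmetic can be sketched as follows, assuming the simplest definition of TTC (distance divided by a constant closing speed along a straight line). The paper's exact TTC definition is not given in the text here, and the function name is hypothetical.

```python
import math


def time_to_collision(distance_m, closing_speed_kmh):
    """Seconds until contact, assuming a constant closing speed on a straight line."""
    closing_speed_ms = closing_speed_kmh / 3.6  # convert km/h to m/s
    if closing_speed_ms <= 0:
        return math.inf  # not approaching, so no collision under this model
    return distance_m / closing_speed_ms


ttc = time_to_collision(distance_m=4.0, closing_speed_kmh=19.0)
print(round(ttc, 1))  # 0.8
```

Under this assumption, 19 km/h is roughly 5.28 m/s, so 4 m is covered in about 0.76 s, which rounds to 0.8 s at one decimal.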

**Figure 12.** Qualitative comparison on PanoVQA-N. The panoramic model (1-Pano) correctly identifies the spatial location (“front”) and visibility of the pedestrian, whereas the multi-view model (6-Cam) hallucinates a “front left” direction due to fragmented spatial context. … shown in …

**Figure 13.** Qualitative comparison on PanoVQA-O. Facing a cluster of bicycles, the 1-Pano model benefits from the unified view to propose a coherent defensive maneuver, demonstrating the importance of seamless context for planning. VLM & Panorama VLM & Multi-view 6-cam vs. 1-pano on PanoVQA-D Focusing on the two colliding cars from metadata (car in the back at 12 meters, 28 km/h and car in the back at 16 meters, 32 k…

**Figure 14.** Qualitative comparison on PanoVQA-D. Both models accurately predict the severity and type of collision. This confirms that the panoramic representation retains critical visual details necessary for complex accident reasoning. … coherence. This coherence proves advantageous in tasks requiring precise localization and holistic scene understanding, without compromising performance on semantic reasoning task…

Original abstract

Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM)paradigm, a unified $360^\circ$ vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces a PLM paradigm with a new adverse-scene panoramic VQA dataset and a plug-and-play sparse attention module, but the performance claims rest on assertions rather than shown metrics.


The punchline is that this work defines a Panorama-Language Modeling approach to handle full 360-degree scenes directly instead of stitching pinhole views, backed by the PanoVQA dataset for occlusions and accidents plus a sparse attention module meant to adapt existing VLMs without retraining. That combination targets a practical gap in robotics and driving where context across the whole sphere matters. What is actually new is the explicit framing of PLM as more than summed narrow views, the dataset focused on adverse omni-scenes, and the module described as plug-and-play for equirectangular inputs. The paper does well in spelling out how stitching fragments spatial relationships that a single panorama keeps intact, which is a fair observation for safety-critical applications.

The soft spots sit in the evidence. The abstract states superior robustness and holistic reasoning but gives no numbers, baselines, ablations, or error breakdowns, so the central claim cannot be checked from the summary. The module's handling of equirectangular distortions, especially pole stretching and wrap-around boundaries, is load-bearing yet untested in the provided text; if the sparsity pattern and positional encodings do not compensate explicitly, distant elements may still fail to connect reliably. The stress-test concern about preserving global context without retraining lands as a real question until the full experiments are reviewed.

This paper is for people working on VLMs for panoramic or omnidirectional inputs in real-world settings. Readers building or using datasets for adverse conditions would get direct value from PanoVQA. It deserves a serious referee because the contributions are specific and the problem is timely, even though revisions would need to add quantitative validation and distortion-specific checks. I would recommend sending it to peer review with requests for the full results tables and module ablations.
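The two distortions named in the letter, pole stretching and wrap-around boundaries, follow directly from the geometry of the equirectangular projection. The sketch below makes them concrete under a standard layout assumption (longitude along columns, latitude along rows, pixel centers at half-integer offsets). The function names are hypothetical and nothing here is taken from the paper's module.

```python
import math


def pixel_to_lonlat(col, row, width, height):
    """Map a pixel center to (longitude, latitude) in radians."""
    lon = (col + 0.5) / width * 2.0 * math.pi - math.pi   # in (-pi, pi)
    lat = math.pi / 2.0 - (row + 0.5) / height * math.pi  # top row near +pi/2
    return lon, lat


def wrapped_column_distance(col_a, col_b, width):
    """Horizontal pixel distance when the left and right image edges are joined."""
    d = abs(col_a - col_b) % width
    return min(d, width - d)


def horizontal_stretch(row, height):
    """Factor by which a row is stretched horizontally relative to the equator."""
    _, lat = pixel_to_lonlat(0, row, 1, height)
    return 1.0 / math.cos(lat)


width, height = 1024, 512
edge_gap = wrapped_column_distance(0, width - 1, width)  # neighbours across the seam
equator = horizontal_stretch(height // 2, height)        # close to 1
pole = horizontal_stretch(0, height)                     # in the hundreds
```

The first and last columns are 1023 pixels apart in the flat image but only one pixel apart on the sphere, and the top row is stretched by a factor in the hundreds relative to the equator. An attention pattern that measures proximity on the flat grid would treat seam neighbours as distant, which is the concern the letter raises.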

Referee Report

2 major / 1 minor

Summary. The paper introduces the Panorama-Language Modeling (PLM) paradigm for unified 360° vision-language reasoning on equirectangular panoramas, contrasting it with stitching-based approaches that lose holistic context. It contributes the PanoVQA dataset for adverse omni-scenes (occlusions, accidents) and a plug-and-play panoramic sparse attention module that adapts existing pinhole VLMs to panoramas without retraining. The central claim is that PLM yields superior robustness and holistic reasoning, producing understanding greater than the sum of narrow-field parts.

Significance. If the empirical claims hold, this work could advance VLM deployment in robotics, autonomous driving, and surveillance by enabling direct, context-preserving processing of 360° imagery. The PanoVQA dataset would provide a valuable benchmark for adverse conditions. The plug-and-play module, if shown to generalize without retraining, would lower barriers to adopting panoramic inputs in existing models.

major comments (2)

[Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.
[Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.

minor comments (1)

[Abstract] Abstract: 'PLMparadigm' is missing a space; should read 'PLM paradigm'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better present our contributions. We address each major point below and will revise the manuscript to strengthen the evidence for our claims.


Referee: [Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.

Authors: We agree that additional empirical support would strengthen the description of the panoramic sparse attention module. In the revised manuscript we will add (i) an ablation isolating the distortion-compensation components, (ii) side-by-side attention-map visualizations before and after the module to illustrate improved long-range connectivity across poles and periodic boundaries, and (iii) a short failure-case analysis on adverse omni-scenes. These additions will directly address the concern that standard VLM attention may fail to link distant elements reliably. revision: yes
Referee: [Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.

Authors: The full manuscript already contains quantitative results, stitched-pinhole baselines, and summary tables. To make these findings immediately visible and to address the referee’s request, we will (i) revise the abstract to report the key quantitative metrics and (ii) expand the experiments section with explicit error breakdowns by scene type (occlusion versus accident). These changes will render the superiority claims and cross-method comparisons fully verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on module design and experiments, not self-referential reduction


The paper presents a plug-and-play panoramic sparse attention module and PanoVQA dataset as the foundation for the PLM paradigm. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content that would reduce the 'more than the sum' claim or robustness assertions to inputs by construction. The adaptation claim is asserted as a design property rather than derived from prior fitted quantities or uniqueness theorems imported from the same authors. This is a standard non-circular introduction of an architectural module whose validity is left to empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes effective adaptation via sparse attention without detailing any fitted values or new postulated constructs.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
cs.CV 2026-05 unverdicted novelty 6.0

PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425– 2433, 2015. 1, 3

work page 2015
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 2023. 1, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhao- hai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Jun- yang Lin. Qwen2.5-vl technical repor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 3, 1

work page 2020
[5]

Occlusion-aware seamless segmentation

Yihong Cao, Jiaming Zhang, Hao Shi, Kunyu Peng, Yuhongxuan Zhang, Hui Zhang, Rainer Stiefelhagen, and Kailun Yang. Occlusion-aware seamless segmentation. In European Conference on Computer Vision (ECCV), 2024. 1, 2, 3, 4

work page 2024
[6]

Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation mod- els and aligning for generic visual-linguistic tasks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 1, 3, 7

work page 2024
[7]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019. 5

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

Spherenet: Learning spherical representations for detection and classification in omnidirectional images

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European conference on computer vision (ECCV), pages 518–533, 2018. 2

work page 2018
[9]

Chatglm: A family of large language mod- els from glm-130b to glm-4 all tools, 2024

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, Hao Yu, Hongning Wang, Jiadai Sun, Jiajie Zhang, Jiale Cheng, Jiayi Gui, Jie Tang, Jing Zhang, Juanzi Li, Lei Zhao, Lindong Wu, Lucen Zhong, Mingdao Liu, Minlie Huang, Peng Zhang, Qinkai Zheng, Rui Lu, Shuaiqi Duan, Shudan Zhang, Shulin Cao, ...

work page 2024
[10]

Vizwiz grand challenge: Answering visual questions from blind people

Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617,

work page
[11]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language mod- els.arXiv preprint arXiv:2203.15556, 2022. 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guob- ing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Li- hang Pan, et al. Glm-4.1 v-thinking: Towards versatile multi- modal reasoning with scalable reinforcement learning.arXiv e-prints, pages arXiv–2507, 2025. 3, 7

work page 2025
[13]

Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022. 8

work page 2022
[14]

Deformable mamba for wide field of view seg- mentation.arXiv preprint arXiv:2411.16481, 2024

Jie Hu, Junwei Zheng, Jiale Wei, Jiaming Zhang, and Rainer Stiefelhagen. Deformable mamba for wide field of view seg- mentation.arXiv preprint arXiv:2411.16481, 2024. 3

work page arXiv 2024
[15]

6-dof vr videos with a single 360-camera

Jingwei Huang, Zhili Chen, Duygu Ceylan, and Hailin Jin. 6-dof vr videos with a single 360-camera. In2017 IEEE Virtual Reality (VR), pages 37–44. IEEE, 2017. 1

work page 2017
[16]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 6700–6709, 2019. 1

work page 2019
[17]

Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters, 6 (2):1519–1526, 2021

Hualie Jiang, Zhe Sheng, Siyu Zhu, Zilong Dong, and Rui Huang. Unifuse: Unidirectional fusion for 360 panorama depth estimation.IEEE Robotics and Automation Letters, 6 (2):1519–1526, 2021. 2

work page 2021
[18]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,

work page internal anchor Pith review Pith/arXiv arXiv 2001
[19]

Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding

Younggun Kim, Ahmed S Abdelrahman, and Mohamed Abdel-Aty. Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 761–771,

work page
[20]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Preference leakage: A contamination problem in llm- as-a-judge.arXiv preprint arXiv:2502.01534, 2025

Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. Preference leakage: A contamination problem in llm- as-a-judge.arXiv preprint arXiv:2502.01534, 2025. 7

work page arXiv 2025
[22]

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: a com- prehensive survey on llm-based evaluation methods.arXiv preprint arXiv:2412.05579, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

DA2: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025

Haodong Li, Wangguangdong Zheng, Jing He, Yuhao Liu, Xin Lin, Xin Yang, Ying-Cong Chen, and Chunchao Guo. DA2: Depth anything in any direction.arXiv preprint arXiv:2509.26618, 2025. 2, 4

work page arXiv 2025
[24]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 1

work page 2023
[25]

Bev- former: Learning bird’s-eye-view representation from multi-camera im- ages via spatiotemporal transformers,

Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chong- hao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers.arXiv preprint arXiv:2203.17270, 2022. 3

work page arXiv 2022
[26]

Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning.arXiv preprint arXiv:2310.03744, 2023. 1, 3, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1

work page 2023
[28]

Llavanext: Improved reasoning, ocr, and world knowledge, 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llavanext: Improved reasoning, ocr, and world knowledge, 2024. 7

work page 2024
[29]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

NuPlanQA: A large-scale dataset and benchmark for multi- view driving scene understanding in multi-modal large lan- guage models

Sung-Yeon Park, Can Cui, Yunsheng Ma, Ahmadreza Moradipari, Rohit Gupta, Kyungtae Han, and Ziran Wang. NuPlanQA: A large-scale dataset and benchmark for multi- view driving scene understanding in multi-modal large lan- guage models. InICCV, 2025. 2, 3

work page 2025
[31]

Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d

Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unproject- ing to 3d. InProceedings of the European Conference on Computer Vision, 2020. 3

work page 2020
[32]

NuScenes-QA: A multi-modal visual ques- tion answering benchmark for autonomous driving scenario

Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. NuScenes-QA: A multi-modal visual ques- tion answering benchmark for autonomous driving scenario. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 4542–4550, 2024. 2, 3, 4

work page 2024
[33]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InAdvances in Neural In- formation Processing Systems, 2025. 1

work page 2025
[34]

Panoformer: panorama transformer for indoor360 o depth estimation

Zhijie Shen, Chunyu Lin, Kang Liao, Lang Nie, Zishuo Zheng, and Yao Zhao. Panoformer: panorama transformer for indoor360 o depth estimation. InEuropean Conference on Computer Vision, pages 195–211. Springer, 2022. 2

work page 2022
[35]

Drivelm: Driving with graph visual question answering

Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024. 2, 3

work page 2024
[36]

Horizonnet: Learning room layout with 1d represen- tation and pano stretch data augmentation

Cheng Sun, Chi-Wei Hsiao, Min Sun, and Hwann-Tzong Chen. Horizonnet: Learning room layout with 1d represen- tation and pano stretch data augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1047–1056, 2019. 2

work page 2019
[37]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 7

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

OpenGVLab Team. Internvl2: Better than the best—expanding performance boundaries of open-source multimodal models with the progressive scaling strategy,

[39] Kexin Tian, Jingrui Mao, Yunlong Zhang, Jiwan Jiang, Yang Zhou, and Zhengzhong Tu. NuScenes-spatialQA: A spatial understanding and reasoning benchmark for vision-language models in autonomous driving. arXiv preprint arXiv:2504.03164, 2025. 2, 3

[40] Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M. Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In CVPR,

[41] Tianqi Wang, Sukmin Kim, Ji Wenxuan, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, and Ping Luo. DeepAccident: A motion and accident prediction benchmark for V2X autonomous driving. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5599–5606, 2024. 3, 4, 1

[42] Weiyu Wang, Chunmei Qing, Junpeng Tan, and XiangMin Xu. Multi-view panoramic image style transfer with multi-scale attention and global sharing. ACM Transactions on Multimedia Computing, Communications and Applications,

[43] Jiale Wei, Junwei Zheng, Ruiping Liu, Jie Hu, Jiaming Zhang, and Rainer Stiefelhagen. OneBEV: Using one panoramic image for bird’s-eye-view semantic mapping. In Proceedings of the Asian Conference on Computer Vision, pages 583–596, 2024. 3, 4, 2

[44] Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion IQ: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11307–11317, 2021. 3

[45] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057. PMLR, 2015. 1, 3

[46] Qingyao Xu, Siheng Chen, Guang Chen, Yanfeng Wang, and Ya Zhang. ChatBEV: A visual language model that understands BEV maps. arXiv preprint arXiv:2503.13938, 2025. 3

[47] Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. BEVFormer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17830–17839, 2023. 3

[48] Kailun Yang, Jiaming Zhang, Simon Reiß, Xinxin Hu, and Rainer Stiefelhagen. Capturing omni-range context for omnidirectional segmentation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 1, 2, 3

[49] Kedi Ying, Ruiping Liu, Chongyan Chen, Mingzhe Tao, Hao Shi, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. mmWalk: Towards multi-modal multi-view walking assistance. arXiv preprint arXiv:2510.11520, 2025. 1, 2, 3, 7

[50] Cheng Zhang, Zhaopeng Cui, Cai Chen, Shuaicheng Liu, Bing Zeng, Hujun Bao, and Yinda Zhang. DeepPanoContext: Panoramic 3D scene understanding with holistic scene context graph and relation-based optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12632–12641, 2021. 2

[51] Jiaming Zhang, Kailun Yang, Chaoxiang Ma, Simon Reiß, Kunyu Peng, and Rainer Stiefelhagen. Bending reality: Distortion-aware transformers for adapting to panoramic semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16917–16927, 2022. 3

[52] Jiaming Zhang, Kailun Yang, Hao Shi, Simon Reiß, Kunyu Peng, Chaoxiang Ma, Haodong Fu, Philip H. S. Torr, Kaiwei Wang, and Rainer Stiefelhagen. Behind every domain there is a shift: Adapting distortion-aware vision transformers for panoramic semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12): 8549–8567, 2024. 3

[53] Yinda Zhang, Shuran Song, Ping Tan, and Jianxiong Xiao. PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision, pages 668–686. Springer, 2014. 2

[54] Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, et al. Chameleon: Fast-slow neuro-symbolic lane topology extraction. arXiv preprint arXiv:2503.07485, 2025. 7

[55] Junwei Zheng, Ruiping Liu, Yufan Chen, Kunyu Peng, Chengzhi Wu, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Open panoramic segmentation. In European Conference on Computer Vision, pages 164–182. Springer,

[56] Junwei Zheng, Ruiping Liu, Yufan Chen, Zhenfang Chen, Kailun Yang, Jiaming Zhang, and Rainer Stiefelhagen. Scene-agnostic pose regression for visual localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27092–27102, 2025. 2

[57] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 7

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Supplementary Material
A. S...

[58] Analyze the panoramic scene annotations, focusing on:
- Use a quadruple tuple (category, direction, distance, visibility) to describe an object (e.g., ‘a fully visible pedestrian in the back right around 9 meters’). Use vague numbers to express distances, without decimal points.
- Object attributes and spatial relationships (visibility, distance, and dire...

[59] Only describe clear information in the images, do not fabricate or invent in the answers

[60] Base all answers only on what is actually visible in the provided json data. Do not make assumptions or invent details

[61] All positions and absolute coordinates must be described in a directional manner. (Describe exact direction such as ‘front left’, ‘back right’, ‘front’, etc.)

[62] Visibility Encoding:
1: Low visibility (0-40%)
2: Medium visibility (40-60%)
3: High visibility (60-80%)
4: Fully visible (80-100%)
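
The visibility encoding and the (category, direction, distance, visibility) description used in the prompt can be illustrated with a minimal Python sketch. It is not the authors' code: the names `visibility_level`, `SceneObject`, and `describe` are hypothetical, assigning a value that falls exactly on a shared edge (40, 60, 80) to the higher level is an assumption because the listed ranges overlap there, and the wording for objects that are not fully visible is also an assumption, since the prompt gives only a fully visible example.

```python
from dataclasses import dataclass

# Labels copied from the Visibility Encoding item of the prompt.
VISIBILITY_LABELS = {
    1: "Low visibility",
    2: "Medium visibility",
    3: "High visibility",
    4: "Fully visible",
}


def visibility_level(percent: float) -> int:
    """Map a visibility percentage (0-100) to an encoded level 1-4.

    Assumption: a value on a shared edge (40, 60, 80) goes to the
    higher level; the prompt does not specify this.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent < 40:
        return 1
    if percent < 60:
        return 2
    if percent < 80:
        return 3
    return 4


@dataclass
class SceneObject:
    """Hypothetical container for one annotated object."""

    category: str
    direction: str     # e.g. "front left", "back right"
    distance_m: float  # metric distance taken from the annotations
    visibility: int    # encoded level 1-4

    def describe(self) -> str:
        # Distances are reported as whole numbers, without decimal points.
        meters = round(self.distance_m)
        where = f"in the {self.direction} around {meters} meters"
        if self.visibility == 4:
            return f"a fully visible {self.category} {where}"
        # Assumed phrasing for the remaining levels.
        label = VISIBILITY_LABELS[self.visibility].lower()
        return f"a {self.category} with {label} {where}"


pedestrian = SceneObject("pedestrian", "back right", 9.3, visibility_level(95))
print(pedestrian.describe())
```

With these inputs the sketch reproduces the example phrase from the prompt, ‘a fully visible pedestrian in the back right around 9 meters’.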

[63] The question can be slightly modified to produce different answers. For multi-item answers, maintain the order relevant to the question (e.g., nearest to farthest). One question should correspond to one answer

[64] All responses should be written expressions in natural language, avoid using symbols or brackets. Instructions: Fully consider following levels to generate questions and multiple answers:

[65] Short Level QA: QA pairs that query the basic information in the json file or single panoramic image, the answer can be completely verified by the ground truth

[66] Long Level QA: QA pairs that contain multiple objects, with attributions and their relationships in concern, the answer stems mainly from the combined ground truth feature information. The questions should be short and rough, while the answers should be detailed and comprehensive. The answer can be partially verified. QA Types: -Type N1- Global scene unde...
