More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
Pith reviewed 2026-05-15 13:52 UTC · model grok-4.3
The pith
Panorama-language models achieve more complete scene understanding than stitched pinhole views by directly processing equirectangular images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a unified 360-degree vision-language reasoning framework, built on a plug-and-play panoramic sparse attention module, lets existing pinhole-based VLMs process equirectangular panoramas directly. According to the paper, this yields understanding greater than the sum of its narrow parts, with measurable gains in robustness under object occlusions and driving accidents.
What carries the argument
The plug-and-play panoramic sparse attention module that lets existing pinhole VLMs process equirectangular panoramas without retraining while preserving holistic spatial relationships.
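To make the idea of a panorama-aware sparse attention pattern concrete, here is a minimal sketch. It is not the paper's module, whose design is not described on this page. It assumes a hypothetical local-window mask over a grid of image tokens in which the window wraps around the left/right seam of the equirectangular image, so that tokens on opposite edges of the image (which are neighbours on the sphere) can attend to each other. The function name and parameters are illustrative only.

```python
def wraparound_local_mask(rows, cols, radius):
    """Boolean attention mask over a rows x cols token grid.

    Token (r, c) may attend to token (r2, c2) when the two are within
    `radius` rows and within `radius` columns, where column distance is
    measured around the 360-degree seam (column 0 is adjacent to cols - 1).
    Hypothetical illustration, not the module described in the paper.
    """
    n = rows * cols
    mask = [[False] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            q = r * cols + c
            for r2 in range(max(0, r - radius), min(rows, r + radius + 1)):
                for dc in range(-radius, radius + 1):
                    c2 = (c + dc) % cols  # wrap horizontally, never vertically
                    mask[q][r2 * cols + c2] = True
    return mask


mask = wraparound_local_mask(rows=2, cols=8, radius=1)
print(mask[0][7])  # True: column 0 and column 7 are neighbours across the seam
print(mask[0][4])  # False: column 4 is outside the window
```

A window without the modulo step would treat the two image edges as maximally distant, which is one way spatial context can be lost at the seam.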
If this is right
- Existing vision-language models can be used on panoramic data without retraining or new data collection.
- Reasoning performance improves specifically on scenes with occlusions and accidents where stitching breaks spatial context.
- A single panoramic input replaces the need to capture and align multiple narrow-field images for complete scene coverage.
- The approach scales to any current pinhole VLM by swapping in the sparse attention module at inference time.
Where Pith is reading between the lines
- If the module works on current models, the same lightweight change could be applied to future VLMs trained on mixed pinhole and panoramic data to remove the need for separate pipelines.
- The same adaptation technique might extend to other wide-field sensors such as fisheye or multi-camera rigs in robotics without requiring full retraining.
- Because the dataset targets adverse omni-scenes, follow-up work could test whether the same gains appear in less extreme but still wide-field settings such as indoor navigation or sports analysis.
Load-bearing premise
The sparse attention module can adapt pinhole models to full panoramas without retraining while still preserving the spatial relationships that stitching loses.
What would settle it
A controlled test in which the adapted model receives the same panorama both as native equirectangular input and as stitched narrow views, then shows no improvement in accuracy or robustness on PanoVQA questions, would falsify the central claim.
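The controlled test described above amounts to a paired comparison: the same questions are answered under two input conditions and the per-question outcomes are compared. A minimal sketch, assuming per-question correctness flags are already available; the function name and the example flags are made up for illustration and are not results from the paper.

```python
def paired_comparison(correct_pano, correct_stitched):
    """Compare per-question correctness under two input conditions.

    Both arguments are equal-length lists of booleans, one entry per
    question, in the same question order.
    """
    if len(correct_pano) != len(correct_stitched):
        raise ValueError("both conditions must cover the same questions")
    n = len(correct_pano)
    pairs = list(zip(correct_pano, correct_stitched))
    return {
        "acc_pano": sum(correct_pano) / n,
        "acc_stitched": sum(correct_stitched) / n,
        # Discordant pairs: questions where exactly one condition is right.
        "pano_only": sum(p and not s for p, s in pairs),
        "stitched_only": sum(s and not p for p, s in pairs),
    }


# Made-up flags for four questions, purely to show the call.
print(paired_comparison([True, True, False, True], [True, False, False, True]))
```

If the two accuracies match and the discordant counts are balanced, the comparison shows no advantage for native panoramic input, which is the outcome that would count against the central claim.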
Original abstract
Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM)paradigm, a unified $360^\circ$ vision-language reasoning that is more than the sum of its pinhole counterparts. Besides, we present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Panorama-Language Modeling (PLM) paradigm for unified 360° vision-language reasoning on equirectangular panoramas, contrasting it with stitching-based approaches that lose holistic context. It contributes the PanoVQA dataset for adverse omni-scenes (occlusions, accidents) and a plug-and-play panoramic sparse attention module that adapts existing pinhole VLMs to panoramas without retraining. The central claim is that PLM yields superior robustness and holistic reasoning, producing understanding greater than the sum of narrow-field parts.
Significance. If the empirical claims hold, this work could advance VLM deployment in robotics, autonomous driving, and surveillance by enabling direct, context-preserving processing of 360° imagery. The PanoVQA dataset would provide a valuable benchmark for adverse conditions. The plug-and-play module, if shown to generalize without retraining, would lower barriers to adopting panoramic inputs in existing models.
major comments (2)
- [Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic's concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.
- [Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.
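The two equirectangular properties named in the first major comment, non-uniform scaling near the poles and the periodic horizontal boundary, follow directly from the projection's geometry. A minimal sketch, assuming a pixel-centre convention; the function names are illustrative and not taken from the paper.

```python
import math


def pixel_to_lonlat(u, v, width, height):
    """Map pixel centre (u, v) to (longitude, latitude) in radians."""
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi   # -pi at left edge
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi  # +pi/2 at top edge
    return lon, lat


def horizontal_stretch(lat):
    """Horizontal stretch of a row at latitude `lat` relative to the equator."""
    return 1.0 / math.cos(lat)


def wrapped_column_distance(u1, u2, width):
    """Column distance measured around the 360-degree seam."""
    d = abs(u1 - u2) % width
    return min(d, width - d)


print(wrapped_column_distance(0, 1023, 1024))  # 1: the image edges are adjacent
```

The stretch factor grows without bound as latitude approaches the poles, which is why features learned on pinhole imagery may not transfer uniformly across rows of a panorama.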
minor comments (1)
- [Abstract] Abstract: '(PLM)paradigm' is missing a space; should read '(PLM) paradigm'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how to better present our contributions. We address each major point below and will revise the manuscript to strengthen the evidence for our claims.
Point-by-point responses
- Referee: [Abstract / Panoramic sparse attention module] Abstract and method description of the panoramic sparse attention module: the claim that this module preserves holistic spatial relationships across equirectangular distortions (non-uniform scaling near poles, periodic boundaries) without retraining is load-bearing for the central superiority assertion, yet no ablation on distortion compensation, no before/after attention connectivity analysis, and no failure cases on adverse omni-scenes are referenced. The skeptic's concern that standard VLM attention patterns may not link distant elements reliably therefore remains unaddressed.
  Authors: We agree that additional empirical support would strengthen the description of the panoramic sparse attention module. In the revised manuscript we will add (i) an ablation isolating the distortion-compensation components, (ii) side-by-side attention-map visualizations before and after the module to illustrate improved long-range connectivity across poles and periodic boundaries, and (iii) a short failure-case analysis on adverse omni-scenes. These additions will directly address the concern that standard VLM attention may fail to link distant elements reliably.
  Revision: yes
- Referee: [Experiments / Results] Experimental claims: the abstract states that 'extensive experiments demonstrate superior robustness' but supplies no quantitative metrics, baselines (e.g., stitched pinhole VLMs), error breakdowns by scene type (occlusion vs. accident), or tables. Without these, the 'greater than the sum' claim cannot be verified and the cross-method comparison is unsupported.
  Authors: The full manuscript already contains quantitative results, stitched-pinhole baselines, and summary tables. To make these findings immediately visible and to address the referee’s request, we will (i) revise the abstract to report the key quantitative metrics and (ii) expand the experiments section with explicit error breakdowns by scene type (occlusion versus accident). These changes will render the superiority claims and cross-method comparisons fully verifiable.
  Revision: yes
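The error breakdown by scene type requested in the second exchange is a simple grouped accuracy. A minimal sketch, assuming each evaluated question carries a scene-type label; the function name and the example records are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def accuracy_by_scene_type(records):
    """Return {scene_type: accuracy} from (scene_type, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for scene_type, is_correct in records:
        totals[scene_type] += 1
        hits[scene_type] += int(is_correct)
    return {scene_type: hits[scene_type] / totals[scene_type] for scene_type in totals}


# Made-up records, purely to show the call.
example = [("occlusion", True), ("occlusion", False), ("accident", True)]
print(accuracy_by_scene_type(example))
```

Reporting this table for both the panoramic model and a stitched-pinhole baseline would show whether any gain is concentrated in one scene type.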
Circularity Check
No circularity: claims rest on module design and experiments, not self-referential reduction
The paper presents a plug-and-play panoramic sparse attention module and PanoVQA dataset as the foundation for the PLM paradigm. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content that would reduce the 'more than the sum' claim or robustness assertions to inputs by construction. The adaptation claim is asserted as a design property rather than derived from prior fitted quantities or uniqueness theorems imported from the same authors. This is a standard non-circular introduction of an architectural module whose validity is left to empirical validation.
Forward citations
Cited by 1 Pith paper
- PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
  PanoWorld adds spherical geometry to MLLMs via cross-attention and pano-specific instruction data, yielding better performance on panoramic spatial reasoning benchmarks than standard perspective-based pipelines.
Supplementary material: QA generation prompt (excerpt)
- Analyze the panoramic scene annotations, focusing on:
  - Use a quadruple tuple (category, direction, distance, visibility) to describe an object (e.g., ‘a fully visible pedestrian in the back right around 9 meters’). Use vague numbers to express distances, without decimal points.
  - Object attributes and spatial relationships (visibility, distance, and dire...
- Only describe clear information in the images, do not fabricate or invent in the answers.
- Base all answers only on what is actually visible in the provided json data. Do not make assumptions or invent details.
- All positions and absolute coordinates must be described in a directional manner. (Describe exact direction such as ‘front left’, ‘back right’, ‘front’, etc.)
- Visibility Encoding:
  - 1: Low visibility (0-40%)
  - 2: Medium visibility (40-60%)
  - 3: High visibility (60-80%)
  - 4: Fully visible (80-100%)
- The question can be slightly modified to produce different answers. For multi-item answers, maintain the order relevant to the question (e.g., nearest to farthest). One question should correspond to one answer.
- All responses should be written expressions in natural language, avoid using symbols or brackets.
Instructions: Fully consider following levels to generate questions and multiple answers:
- Short Level QA: QA pairs that query the basic information in the json file or single panoramic image, the answer can be completely verified by the ground truth.
- Long Level QA: QA pairs that contain multiple objects, with attributions and their relationships in concern, the answer stems mainly from the combined ground truth feature information. The questions should be short and rough, while the answers should be detailed and comprehensive. The answer can be partially verified.
QA Types: -Type N1- Global scene unde...
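The visibility encoding and the (category, direction, distance, visibility) tuple in the prompt excerpt above can be illustrated with a short sketch. Two things here are assumptions, not taken from the source: values that fall exactly on a boundary such as 40% are assigned to the higher level, and the adjective wording for levels 1 to 3 is invented for the example. The function names are hypothetical.

```python
# Upper bound of each level, the level number, and a label for the description.
# Labels for levels 1-3 are an assumption made for this illustration.
VISIBILITY_LEVELS = [
    (0.40, 1, "low-visibility"),
    (0.60, 2, "medium-visibility"),
    (0.80, 3, "high-visibility"),
    (1.00, 4, "fully visible"),
]


def visibility_level(fraction):
    """Return (level, label) for a visible fraction in [0, 1].

    Assumption: a value exactly on a boundary (e.g. 0.40) falls in the
    higher level; the prompt does not say which side a boundary belongs to.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    for upper, level, label in VISIBILITY_LEVELS:
        if fraction < upper:
            return level, label
    return 4, "fully visible"  # fraction == 1.0


def describe(category, direction, distance_m, fraction):
    """Render the quadruple as text, with the distance as a whole number."""
    _, label = visibility_level(fraction)
    return f"a {label} {category} in the {direction} around {round(distance_m)} meters"


print(describe("pedestrian", "back right", 9.3, 0.95))
# a fully visible pedestrian in the back right around 9 meters
```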