arxiv: 2607.02497 · v1 · pith:T66TNPCYnew · submitted 2026-07-02 · 💻 cs.CV

Seek to Segment: Active Perception for Panoramic Referring Segmentation

Pith reviewed 2026-07-03 14:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords active perception · panoramic segmentation · referring segmentation · embodied AI · spatial memory · vision-language model · search trajectories · reinforcement learning

The pith

PanoSeeker integrates local views into a unified 360° memory to let an agent search for and segment a referred object without redundant scanning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the task of active panoramic referring segmentation, where an embodied agent must adjust its viewpoint in a full 360° setting to locate and mask an object named in natural language. It proposes PanoSeeker, which pairs a vision-language model with an explicit spatial memory that accumulates observations over time. This memory replaces heuristic scanning with planned trajectories that avoid repeated looks at the same region. Once the target appears, the agent aligns its view and produces the segmentation mask. Experiments on a new benchmark show the approach yields shorter search paths and higher accuracy than adapted passive baselines.

Core claim

PanoSeeker maintains an explicit spatial visual memory called EgoSphere that folds successive narrow-field observations into one consistent 360° representation; the vision-language model then reads this memory to select the next viewpoint adjustment, continuing until the referred object is visible, at which point the agent performs final alignment and outputs the mask. The system is first fine-tuned on expert search trajectories and then refined with reinforcement learning that rewards shorter, non-redundant paths.
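
The viewpoint adjustment described here can be pictured with a small sketch. The abstract only says the agent changes its viewing direction by (Δθ, Δϕ); treating Δθ as a horizontal (yaw) change and Δϕ as a vertical (pitch) change, the use of degrees, the pitch range of [-90, 90], and the function name `apply_view_delta` are all assumptions made for illustration, not details taken from the paper.

```python
def apply_view_delta(yaw, pitch, d_yaw, d_pitch):
    """Apply one viewpoint adjustment on a 360° sphere (illustrative only).

    Assumptions (not from the paper): angles are in degrees, yaw wraps
    around at 360, and pitch is clamped to the range [-90, 90].
    """
    new_yaw = (yaw + d_yaw) % 360.0                      # horizontal angle wraps
    new_pitch = max(-90.0, min(90.0, pitch + d_pitch))   # vertical angle saturates
    return new_yaw, new_pitch


# Turning right past the 360° seam lands back near 0°.
print(apply_view_delta(350, 0, 20, 0))
# Looking up beyond the pole is clamped at 90°.
print(apply_view_delta(0, 80, -30, 30))
```

The wrap-around on yaw is the detail that separates a panoramic setting from a flat image: a view at 350° and a view at 10° are neighbours, so any memory of visited directions has to treat them as such.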

What carries the argument

EgoSphere, an explicit spatial visual memory that progressively integrates sequential local observations into a unified 360° representation.
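
To make the idea of folding narrow views into one 360° record concrete, here is a deliberately simplified sketch. It is not EgoSphere: the paper's memory is a visual representation, whereas this toy only tracks which directions have been looked at. The 10° grid, the 90° by 60° field of view, the rectangular yaw/pitch footprint (which ignores spherical distortion), and every name below are hypothetical.

```python
BIN_DEG = 10  # grid resolution in degrees (assumed)
YAW_BINS, PITCH_BINS = 360 // BIN_DEG, 180 // BIN_DEG


def view_cells(yaw, pitch, hfov=90, vfov=60):
    """Grid cells whose centre lies strictly inside a view's yaw/pitch rectangle."""
    cells = set()
    for yb in range(YAW_BINS):
        yaw_centre = yb * BIN_DEG + BIN_DEG / 2
        d_yaw = (yaw_centre - yaw + 180) % 360 - 180  # signed difference, wraps at the seam
        if abs(d_yaw) >= hfov / 2:
            continue
        for pb in range(PITCH_BINS):
            pitch_centre = -90 + pb * BIN_DEG + BIN_DEG / 2
            if abs(pitch_centre - pitch) < vfov / 2:
                cells.add((yb, pb))
    return cells


class CoverageMemory:
    """Toy 360° memory that records only which cells have been observed."""

    def __init__(self):
        self.seen = set()

    def integrate(self, yaw, pitch):
        """Fold one observation into the memory; return how many cells were new."""
        new_cells = view_cells(yaw, pitch) - self.seen
        self.seen |= new_cells
        return len(new_cells)

    def novelty(self, yaw, pitch):
        """Unseen cells a candidate view would add, without committing to it."""
        return len(view_cells(yaw, pitch) - self.seen)

    def coverage(self):
        """Fraction of the full sphere grid observed so far."""
        return len(self.seen) / (YAW_BINS * PITCH_BINS)


memory = CoverageMemory()
memory.integrate(0, 0)
candidates = [(0, 0), (45, 0), (90, 0)]
best = max(candidates, key=lambda view: memory.novelty(*view))
print(best, round(memory.coverage(), 3))
```

Picking the candidate with the highest novelty is a greedy stand-in for the planning step; in the paper that decision is made by the vision-language model reading the memory, not by a coverage count.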

If this is right

  • Referring segmentation can be performed by an agent that must move rather than by a model that receives a single fixed image.
  • Explicit memory of the full panorama removes the need for exhaustive scanning patterns.
  • Reinforcement learning after supervised fine-tuning directly reduces the number of viewpoint changes required.
  • Active viewpoint alignment at the end of search improves mask quality over masks taken from arbitrary angles.
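
The third point above, that reinforcement learning reduces viewpoint changes, depends on how the reward is shaped. The abstract does not give the reward, so the following is only a hypothetical shaping that matches the stated goal of shorter, non-redundant paths; the function name, the three terms, and all weights are assumptions.

```python
def episode_reward(found, num_steps, revisited_views,
                   success_bonus=1.0, step_cost=0.05, revisit_cost=0.1):
    """Hypothetical episode-level reward for an active search agent.

    found           -- whether the referred object was located
    num_steps       -- number of viewpoint changes taken
    revisited_views -- number of steps that looked at an already-seen region
    All weights are illustrative placeholders, not values from the paper.
    """
    reward = success_bonus if found else 0.0
    reward -= step_cost * num_steps           # shorter paths score higher
    reward -= revisit_cost * revisited_views  # redundant looks are penalised
    return reward


# Two successful episodes: the shorter, non-redundant one earns more.
short_path = episode_reward(found=True, num_steps=4, revisited_views=0)
long_path = episode_reward(found=True, num_steps=8, revisited_views=2)
print(short_path > long_path)
```

Under a shaping of this kind, a policy that already finds the target after supervised fine-tuning is pushed toward fewer and less repetitive moves, which is the behaviour the bullet describes.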

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same memory structure could be reused for other embodied tasks that require stitching partial views, such as object rearrangement or navigation to a described landmark.
  • If EgoSphere can be updated in real time on a physical robot, the method would transfer from simulation to hardware without retraining the memory module.
  • The approach suggests that panoramic referring segmentation is best solved by separating the search phase (memory-driven) from the final segmentation phase (view-aligned).

Load-bearing premise

That accumulating local views into one shared 360° memory is sufficient for the agent to choose non-redundant next directions without losing track of the target.

What would settle it

On the APRS benchmark, a version of the agent that discards EgoSphere and uses only the current view plus random or heuristic moves reaches the target in the same or fewer steps than the full PanoSeeker.
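
The comparison proposed above comes down to tallying steps-to-target for two agents over the same episodes. The sketch below shows that tally on a toy ring of views rather than on the APRS benchmark. A non-redundant sweep stands in for the memory-guided agent and a random walk stands in for the memoryless one; the ring size, episode count, and all names are assumptions, and in this one-dimensional toy a fixed sweep is itself a heuristic, so it only illustrates the random-move arm of the test.

```python
import random


def steps_non_redundant(target):
    """Sweep right from view 0, visiting each view once until the target is reached."""
    return target


def steps_random_walk(target, n_views, rng, max_steps=10_000):
    """Memoryless agent: step left or right at random on the ring, starting at view 0."""
    pos, steps = 0, 0
    while pos != target and steps < max_steps:
        pos = (pos + rng.choice((-1, 1))) % n_views
        steps += 1
    return steps


def compare(n_views=12, episodes=200, seed=0):
    """Mean steps-to-target for both agents on the same randomly placed targets."""
    rng = random.Random(seed)
    targets = [rng.randrange(1, n_views) for _ in range(episodes)]
    sweep = sum(steps_non_redundant(t) for t in targets) / episodes
    walk = sum(steps_random_walk(t, n_views, rng) for t in targets) / episodes
    return sweep, walk


sweep_mean, walk_mean = compare()
print(f"non-redundant: {sweep_mean:.1f} steps, random walk: {walk_mean:.1f} steps")
```

On the real benchmark the decisive case is the heuristic arm: if a simple scanning rule without memory matches the full agent's step count, the memory is not what carries the result.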

Original abstract

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper introduces the task of Active Panoramic Referring Segmentation (APRS), requiring an agent to actively adjust viewing directions (Δθ, Δϕ) in a 360° environment to locate and segment a referred object. It proposes PanoSeeker, which augments a Vision-Language Model with EgoSphere (an explicit spatial visual memory) to integrate sequential local observations into a unified 360° representation for planning efficient search trajectories. The pipeline uses supervised fine-tuning on an expert-annotated trajectory dataset with memory timelines, followed by reinforcement learning to optimize exploration efficiency. On a newly curated APRS benchmark, PanoSeeker is reported to outperform adapted state-of-the-art baselines in both search efficiency and segmentation accuracy.

Significance. If the empirical claims hold, the work addresses a clear gap between passive referring segmentation and active embodied perception in panoramic settings. The explicit EgoSphere memory and the SFT-then-RL training regimen constitute a concrete, reproducible pipeline for non-redundant exploration. The release of an expert-trajectory benchmark with memory timelines is a tangible community resource that could support further research in embodied vision-language tasks.

minor comments (1)
  1. The abstract states that 'extensive experiments' demonstrate superior performance, yet provides no quantitative metrics, dataset statistics, ablation results, or descriptions of how baselines were adapted; the full experimental section is required to evaluate these claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our work on Active Panoramic Referring Segmentation and for recognizing the significance of the APRS task, EgoSphere memory, and the expert-trajectory benchmark. The recommendation of 'uncertain' is noted, but the report lists no specific major comments to address. We provide a point-by-point structure below for completeness; if additional comments were intended, we would be glad to respond in a revision.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new task (APRS) and proposes PanoSeeker as a VLM + EgoSphere pipeline trained via SFT on expert trajectories followed by RL, with empirical claims of superior efficiency and accuracy on a newly curated benchmark. No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method description and performance claims are self-contained empirical contributions without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level description of EgoSphere are stated.

invented entities (1)
  • EgoSphere · no independent evidence
    purpose: explicit spatial visual memory that merges sequential local observations into a unified 360° representation
    Introduced as the core memory component of PanoSeeker; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5791 in / 1123 out tokens · 24365 ms · 2026-07-03T14:37:36.899255+00:00 · methodology

