Seek to Segment: Active Perception for Panoramic Referring Segmentation

Henghui Ding; Shuming Hu; Song Tang; Xincheng Shuai; Yu-Gang Jiang

arxiv: 2607.02497 · v1 · pith:T66TNPCYnew · submitted 2026-07-02 · 💻 cs.CV

Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang

Pith reviewed 2026-07-03 14:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: active perception · panoramic segmentation · referring segmentation · embodied AI · spatial memory · vision-language model · search trajectories · reinforcement learning

The pith

PanoSeeker integrates local views into a unified 360° memory to let an agent search for and segment a referred object without redundant scanning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the task of active panoramic referring segmentation, where an embodied agent must adjust its viewpoint in a full 360° setting to locate and mask an object named in natural language. It proposes PanoSeeker, which pairs a vision-language model with an explicit spatial memory that accumulates observations over time. This memory replaces heuristic scanning with planned trajectories that avoid repeated looks at the same region. Once the target appears, the agent aligns its view and produces the segmentation mask. Experiments on a new benchmark show the approach yields shorter search paths and higher accuracy than adapted passive baselines.

Core claim

PanoSeeker maintains an explicit spatial visual memory called EgoSphere that folds successive narrow-field observations into one consistent 360° representation; the vision-language model then reads this memory to select the next viewpoint adjustment, continuing until the referred object is visible, at which point the agent performs final alignment and outputs the mask. The system is first fine-tuned on expert search trajectories and then refined with reinforcement learning that rewards shorter, non-redundant paths.
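The abstract does not specify the reward used in the reinforcement-learning stage. As a rough illustration only, a reward that favors shorter, non-redundant paths might look like the sketch below. The function name `path_reward`, its arguments, and every coefficient are hypothetical and are not taken from the paper.

```python
def path_reward(found, num_steps, revisit_fraction,
                success_bonus=1.0, step_cost=0.05, revisit_cost=0.5):
    """Toy episode reward (all coefficients are assumptions).

    found            -- whether the referred object was located
    num_steps        -- number of viewpoint changes in the episode
    revisit_fraction -- share of observed area that was already in memory (0..1)
    """
    reward = success_bonus if found else 0.0
    reward -= step_cost * num_steps            # shorter paths score higher
    reward -= revisit_cost * revisit_fraction  # re-looking at known regions is penalized
    return reward
```

Under these assumed coefficients, a successful four-step search with no revisits scores higher than a successful eight-step search, which in turn scores higher than an eight-step search that spends half its views on regions already seen.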

What carries the argument

EgoSphere, an explicit spatial visual memory that progressively integrates sequential local observations into a unified 360° representation.
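To make the idea of folding local views into one 360° representation concrete, here is a minimal sketch that tracks only which directions have been observed. It is a stand-in, not the paper's EgoSphere: the class name `SphereMemory`, the bin counts, and the field-of-view values are assumptions, the view footprint is approximated as a yaw/pitch rectangle, and the grid is not equal-area.

```python
import numpy as np


class SphereMemory:
    """Hypothetical coverage map over yaw [0, 360) and pitch [-90, 90] degrees."""

    def __init__(self, yaw_bins=36, pitch_bins=18):
        self.seen = np.zeros((pitch_bins, yaw_bins), dtype=bool)
        # Centre angle of every grid cell, in degrees.
        self.yaw_centers = (np.arange(yaw_bins) + 0.5) * 360.0 / yaw_bins
        self.pitch_centers = -90.0 + (np.arange(pitch_bins) + 0.5) * 180.0 / pitch_bins

    def integrate(self, yaw_deg, pitch_deg, hfov_deg=90.0, vfov_deg=60.0):
        """Mark the cells covered by a view centred at (yaw_deg, pitch_deg)."""
        # Signed yaw difference wrapped into [-180, 180) so views span the 0/360 seam.
        d_yaw = (self.yaw_centers - yaw_deg + 180.0) % 360.0 - 180.0
        yaw_mask = np.abs(d_yaw) <= hfov_deg / 2
        pitch_mask = np.abs(self.pitch_centers - pitch_deg) <= vfov_deg / 2
        self.seen |= np.outer(pitch_mask, yaw_mask)

    def coverage(self):
        """Fraction of grid cells observed so far."""
        return float(self.seen.mean())
```

A planner could consult `seen` to prefer directions that are still unobserved; integrating the same view twice leaves `coverage()` unchanged, which is the property that makes redundant looks detectable.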

If this is right

Referring segmentation can be performed by an agent that must move rather than by a model that receives a single fixed image.
Explicit memory of the full panorama removes the need for exhaustive scanning patterns.
Reinforcement learning after supervised fine-tuning directly reduces the number of viewpoint changes required.
Active viewpoint alignment at the end of search improves mask quality over masks taken from arbitrary angles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same memory structure could be reused for other embodied tasks that require stitching partial views, such as object rearrangement or navigation to a described landmark.
If EgoSphere can be updated in real time on a physical robot, the method would transfer from simulation to hardware without retraining the memory module.
The approach suggests that panoramic referring segmentation is best solved by separating the search phase (memory-driven) from the final segmentation phase (view-aligned).

Load-bearing premise

That accumulating local views into one shared 360° memory is sufficient for the agent to choose non-redundant next directions without losing track of the target.

What would settle it

On the APRS benchmark, a version of the agent that discards EgoSphere and uses only the current view plus random or heuristic moves reaches the target in the same or fewer steps than the full PanoSeeker.

Original abstract

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
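The abstract describes the action as a change in viewing direction ($\Delta\theta, \Delta\phi$) but gives no angle conventions. The sketch below shows one way such an update could be applied, assuming $\theta$ is yaw in degrees wrapped to [0, 360) and $\phi$ is pitch in degrees clamped to [-90, 90]. The function name `apply_view_delta` and both conventions are assumptions, not details from the paper.

```python
def apply_view_delta(theta, phi, d_theta, d_phi):
    """Return the new (yaw, pitch) after a viewpoint adjustment, in degrees."""
    new_theta = (theta + d_theta) % 360.0           # yaw wraps around the panorama
    new_phi = max(-90.0, min(90.0, phi + d_phi))    # pitch stops at the poles
    return new_theta, new_phi
```

Wrapping yaw keeps the agent free to turn past the seam of the panorama, while clamping pitch prevents the view from flipping over the top or bottom of the sphere.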

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a new task, Active Panoramic Referring Segmentation (APRS), and pairs it with a VLM agent that uses an explicit EgoSphere memory and SFT-then-RL training, but the abstract supplies no numbers or ablations by which to judge whether the claimed gains hold.

The core contribution is the task itself: an agent must rotate its camera in a 360° scene to locate and segment the object named in a language instruction. That moves referring segmentation out of static images and into an embodied search setting, which matters for navigation and interaction work.

PanoSeeker adds an explicit spatial memory called EgoSphere that accumulates local views into one unified representation. The pipeline runs a VLM on the current view, updates the memory, plans the next viewpoint, and once the target is located, aligns and segments. They collect expert trajectories for supervised fine-tuning, then apply reinforcement learning to cut redundant moves. On paper this is a clean way to avoid heuristic scanning.
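The loop described above (observe, update memory, plan the next viewpoint, then align once the target is located) can be sketched with a toy yaw-only search. Everything here is a placeholder: `toy_search`, the four candidate headings, the field-of-view test that stands in for the VLM's detection, and the "farthest from anything seen" planning rule are assumptions for illustration, not the paper's method.

```python
def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two yaw angles, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def toy_search(target_yaw, start_yaw=0.0, hfov=90.0, max_steps=8):
    """Rotate until the target falls inside the current field of view."""
    visited = []                               # toy memory: centres of views taken so far
    candidates = [0.0, 90.0, 180.0, 270.0]     # assumed discrete headings
    yaw = start_yaw
    for step in range(1, max_steps + 1):
        visited.append(yaw)                    # update memory with the current view
        # Stand-in for the VLM deciding whether the target is visible.
        if angular_gap(yaw, target_yaw) <= hfov / 2:
            # Final alignment: centre the view on the target before segmenting.
            return {"found": True, "steps": step, "final_yaw": target_yaw}
        # Plan: pick the candidate heading farthest from every view in memory.
        yaw = max(candidates,
                  key=lambda c: min(angular_gap(c, v) for v in visited))
    return {"found": False, "steps": max_steps, "final_yaw": yaw}
```

Because the planner reads the list of visited headings, it never proposes a direction it has already looked in while unseen candidates remain, which is the behaviour the letter contrasts with heuristic scanning.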

The approach is coherent on its own terms. Connecting memory, viewpoint planning, and segmentation in one loop is a reasonable design choice for panoramic search. The decision to separate SFT from RL also makes sense for efficiency optimization.

The main limitation is that the abstract gives no experimental detail. We see no dataset size, no error bars, no ablation on the memory component, and no description of how the expert trajectories were collected or how the new benchmark was constructed. Without those, the claim of "significantly outperforming adapted state-of-the-art baselines" cannot be checked. The weakest link right now is the missing evidence that the memory actually produces non-redundant paths at scale.

This paper is aimed at researchers who already work on embodied vision-language agents or panoramic scene understanding. A reader who needs a concrete new task definition and a memory-augmented baseline will get something usable from it.

It deserves peer review. The task framing is fresh enough and the method logic does not collapse on inspection, so referees can evaluate the experiments and benchmark once they are fully presented.

Referee Report

0 major / 1 minor

Summary. The paper introduces the task of Active Panoramic Referring Segmentation (APRS), requiring an agent to actively adjust viewing directions (Δθ, Δϕ) in a 360° environment to locate and segment a referred object. It proposes PanoSeeker, which augments a Vision-Language Model with EgoSphere (an explicit spatial visual memory) to integrate sequential local observations into a unified 360° representation for planning efficient search trajectories. The pipeline uses supervised fine-tuning on an expert-annotated trajectory dataset with memory timelines, followed by reinforcement learning to optimize exploration efficiency. On a newly curated APRS benchmark, PanoSeeker is reported to outperform adapted state-of-the-art baselines in both search efficiency and segmentation accuracy.

Significance. If the empirical claims hold, the work addresses a clear gap between passive referring segmentation and active embodied perception in panoramic settings. The explicit EgoSphere memory and the SFT-then-RL training regimen constitute a concrete, reproducible pipeline for non-redundant exploration. The release of an expert-trajectory benchmark with memory timelines is a tangible community resource that could support further research in embodied vision-language tasks.

minor comments (1)

The abstract states that 'extensive experiments' demonstrate superior performance, yet provides no quantitative metrics, dataset statistics, ablation results, or descriptions of how baselines were adapted; the full experimental section is required to evaluate these claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of our work on Active Panoramic Referring Segmentation and for recognizing the significance of the APRS task, EgoSphere memory, and the expert-trajectory benchmark. The recommendation of 'uncertain' is noted, but the report lists no specific major comments to address. We provide a point-by-point structure below for completeness; if additional comments were intended, we would be glad to respond in a revision.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new task (APRS) and proposes PanoSeeker as a VLM + EgoSphere pipeline trained via SFT on expert trajectories followed by RL, with empirical claims of superior efficiency and accuracy on a newly curated benchmark. No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method description and performance claims are self-contained empirical contributions without the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities beyond the high-level description of EgoSphere are stated.

invented entities (1)

EgoSphere (no independent evidence)
purpose: explicit spatial visual memory that merges sequential local observations into a unified 360° representation
Introduced as the core memory component of PanoSeeker; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5791 in / 1123 out tokens · 24365 ms · 2026-07-03T14:37:36.899255+00:00 · methodology


[1] [1]

Multimodal referring segmentation: A survey.IJCV, 2026

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey.IJCV, 2026

2026

[2] [2]

ROSE: retrieval-oriented segmentation enhance- ment

Song Tang, Guangquan Jie, Henghui Ding, and Yu-Gang Jiang. ROSE: retrieval-oriented segmentation enhance- ment. InCVPR Findings, 2026

2026

[3] [3]

Vision-language transformer and query generation for referring segmentation

Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. Vision-language transformer and query generation for referring segmentation. InICCV, 2021

2021

[4] [4]

GRES: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. InCVPR, 2023

2023

[5] [5]

GREx: Generalized referring expression segmentation, comprehension, and generation.IJCV, 134(2), 2026

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Yu-Gang Jiang. GREx: Generalized referring expression segmentation, comprehension, and generation.IJCV, 134(2), 2026

2026

[6] Henghui Ding, Chang Liu, Suchen Wang, and Xudong Jiang. VLT: vision-language transformer and query generation for referring segmentation. IEEE TPAMI, 45(6), 2022.

[7] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. In ICCV, 2023.

[8] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36, 2023.

[9] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.

[10] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.

[11] Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, and Dacheng Tao. Psdesigner: Automated graphic design with a human-like creative workflow. In CVPR, 2026.

[12] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In CVPR, 2024.

[13] Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. Gsva: Generalized segmentation via multimodal large language models. In CVPR, 2024.

[14] Rui Qian, Xin Yin, and Dejing Dou. Reasoning to attend: Try to understand how <SEG> token works. In CVPR, 2025.

[15] Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv e-prints, 2025.

[16] Song Wang, Gongfan Fang, Lingdong Kong, Xiangtai Li, Jianyun Xu, Sheng Yang, Qiang Li, Jianke Zhu, and Xinchao Wang. Pixelthink: Towards efficient chain-of-pixel reasoning. arXiv preprint arXiv:2505.23727, 2025.

[17] Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, and Yu-Gang Jiang. MeViS: a multi-modal dataset for referring motion expression video segmentation. IEEE TPAMI, 2025.

[18] Kaining Ying, Henghui Ding, Guangquan Jie, and Yu-Gang Jiang. Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation. In ICCV, 2025.

[19] Hengrui Hu, Weiwei Gao, Zipei Zhang, and Henghui Ding. RVAS: Referring video active exploration and segmentation. In ICML, 2026.

[20] Wansen Wu, Tao Chang, Xinmeng Li, Quanjun Yin, and Yue Hu. Vision-language navigation: a survey and taxonomy. Neural Computing and Applications, 36(7), 2024.

[21] Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. Reverie: Remote embodied visual referring expression in real indoor environments. In CVPR, 2020.

[22] Shuting He, Henghui Ding, Xudong Jiang, and Bihan Wen. Segpoint: Segment any point cloud via large language model. In ECCV. Springer, 2024.

[23] Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Danni Yang, and Xiaoshuai Sun. Ipdn: Image-enhanced prompt decoding network for 3d referring expression segmentation. In AAAI, volume 39, 2025.

[24] Shuting He and Henghui Ding. Refmask3d: Language-guided transformer for 3d referring segmentation. In ACM MM, 2024.

[25] Zhenyuan Qin, Xincheng Shuai, and Henghui Ding. Scenedesigner: Controllable multi-object image generation with 9-dof pose manipulation. In NeurIPS, 2025.

[26] Xincheng Shuai, Henghui Ding, Zhenyuan Qin, Hao Luo, Xingjun Ma, and Dacheng Tao. Free-form motion control: Controlling the 6d poses of camera and objects in video generation. In ICCV, 2025.

[27] Xincheng Shuai, Zhenyuan Qin, Henghui Ding, and Dacheng Tao. Free-form scene editor: Enabling multi-round object manipulation like in a 3d engine. In AAAI, 2025.

[28] Ziye Li, Hao Luo, Xincheng Shuai, and Henghui Ding. Anyi2v: Animating any conditional image with motion control. In ICCV, 2025.

[29] Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, and Henghui Ding. ReferSplat: Referring segmentation in 3d gaussian splatting. In ICML, 2025.

[30] Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, and Xiangyang Xue. Reasongrounder: Lvlm-guided hierarchical feature splatting for open-vocabulary 3d visual grounding and reasoning. In CVPR, 2025.

[31] Yanqi Bao, Tianyu Ding, Jing Huo, Yaoli Liu, Yuxin Li, Wenbin Li, Yang Gao, and Jiebo Luo. 3d gaussian splatting: Survey, technologies, challenges, and opportunities. IEEE TCSVT, 35(7), 2025.

[32] Shilin Yan, Xiaohao Xu, Renrui Zhang, Lingyi Hong, Wenchao Chen, Wenqiang Zhang, and Wei Zhang. Panovos: Bridging non-panoramic and panoramic views with transformer for video segmentation. In ECCV. Springer, 2024.

[33] Yikang Zhou, Tao Zhang, Dizhe Zhang, Shunping Ji, Xiangtai Li, and Lu Qi. Dense360: Dense understanding from omnidirectional panoramas. arXiv preprint arXiv:2506.14471, 2025.

[34] Shih-Han Chou, Wei-Lun Chao, Wei-Sheng Lai, Min Sun, and Ming-Hsuan Yang. Visual question answering on 360° images. In WACV, 2020.

[35] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[36] Zhaoqing Wang, Yu Lu, Qiang Li, Xunqiang Tao, Yandong Guo, Mingming Gong, and Tongliang Liu. Cris: Clip-driven referring image segmentation. In CVPR, 2022.

[37] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from natural language expressions. In ECCV. Springer, 2016.

[38] Chenxi Liu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, and Alan Yuille. Recurrent multimodal interaction for referring image segmentation. In ICCV, 2017.

[39] Ruiyu Li, Kaican Li, Yi-Chun Kuo, Michelle Shu, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Referring image segmentation via recurrent refinement networks. In CVPR, 2018.

[40] Chang Liu, Xudong Jiang, and Henghui Ding. Primitivenet: decomposing the global constraints for referring segmentation. Visual Intelligence, 2(1):16, 2024.

[41] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[42] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

[43] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[44] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR, 2024.

[45] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-Zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.

[46] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023.

[47] Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory os of ai agent. In EMNLP, 2025.

[48] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In AAAI, volume 38, 2024.

[49] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

[50] Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, and Lin Gui. Beyond RAG for agent memory: Retrieval by decoupling and aggregation. arXiv preprint arXiv:2602.02007, 2026.

[51] Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, and Shuicheng Yan. Vismem: Latent vision memory unlocks potential of vision-language models. arXiv preprint arXiv:2511.11007, 2025.

[52] Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Refer360°: A referring expression recognition dataset in 360° images. In ACL, 2020.

[53] Hao Ai, Zidong Cao, and Lin Wang. A survey of representation learning, optimization strategies, and applications for omnidirectional vision. IJCV, 133(8), 2025.

[54] Atsuya Nakata and Takao Yamanaka. 2s-odis: Two-stage omni-directional image synthesis by geometric distortion correction. In ECCV. Springer, 2024.

[55] Fanghua Yu, Xintao Wang, Mingdeng Cao, Gen Li, Ying Shan, and Chao Dong. Osrt: Omnidirectional image super-resolution with distortion-aware transformer. In CVPR, 2023.

[56] Keisuke Tateno, Nassir Navab, and Federico Tombari. Distortion-aware convolutional filters for dense prediction in panoramic images. In ECCV, 2018.

[57] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360 imagery. NeurIPS, 30, 2017.

[58] Yu-Chuan Su and Kristen Grauman. Kernel transformer networks for compact spherical convolution. In CVPR, 2019.

[59] Xincheng Shuai, Henghui Ding, Xingjun Ma, Rongcheng Tu, Yu-Gang Jiang, and Dacheng Tao. A survey of multimodal-guided image editing with text-to-image diffusion models. arXiv preprint arXiv:2406.14555, 2024.

[60] Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, Danda Pani Paudel, Luc Van Gool, Kailun Yang, and Xuming Hu. Are multimodal large language models ready for omnidirectional spatial reasoning? arXiv preprint arXiv:2505.11907, 2025.

[61] Xincheng Shuai, Ziye Li, Henghui Ding, and Dacheng Tao. Glyphprinter: Region-grouped direct preference optimization for glyph-accurate visual text rendering. In CVPR, 2026.

[62] Shih-Han Chou, Cheng Sun, Wen-Yen Chang, Wan-Ting Hsu, Min Sun, and Jianlong Fu. 360-indoor: Towards learning real-world objects in 360° indoor equirectangular images. In WACV, 2020.

[63] Hang Xu, Qiang Zhao, Yike Ma, Xiaodong Li, Peng Yuan, Bailan Feng, Chenggang Yan, and Feng Dai. Pandora: A panoramic detection dataset for object with orientation. In ECCV. Springer, 2022.

[64] Jianxiong Xiao, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR. IEEE, 2012.

[65] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. In ECCV. Springer, 2020.

[66] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2026.

[67] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.

[68] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.

[69] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.

[70] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

[71] Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multi-modal large language model for referring expression segmentation. In ECCV. Springer, 2024.

[72] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.