Seek to Segment: Active Perception for Panoramic Referring Segmentation
Pith reviewed 2026-07-03 14:37 UTC · model grok-4.3
The pith
PanoSeeker integrates local views into a unified 360° memory to let an agent search for and segment a referred object without redundant scanning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PanoSeeker maintains an explicit spatial visual memory called EgoSphere that folds successive narrow-field observations into one consistent 360° representation; the vision-language model then reads this memory to select the next viewpoint adjustment, continuing until the referred object is visible, at which point the agent performs final alignment and outputs the mask. The system is first fine-tuned on expert search trajectories and then refined with reinforcement learning that rewards shorter, non-redundant paths.
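To make the memory-driven search loop concrete, here is a minimal sketch of a coverage memory over a yaw/pitch grid. It is not the paper's EgoSphere, whose internals this review does not describe. The class `CoverageMemory`, the function `next_view`, the 10° cell size, the square 90° field of view, and the greedy selection rule are all assumptions made for illustration. The sketch also ignores spherical distortion near the poles.

```python
class CoverageMemory:
    """Toy stand-in for a 360-degree spatial memory (hypothetical, not EgoSphere)."""

    def __init__(self, yaw_bins=36, pitch_bins=18):
        self.yaw_bins = yaw_bins
        self.pitch_bins = pitch_bins
        # seen[p][y] becomes True once some view has covered that grid cell.
        self.seen = [[False] * yaw_bins for _ in range(pitch_bins)]

    def _cells(self, yaw, pitch, fov):
        """Yield (pitch_idx, yaw_idx) for cells whose centres lie inside a square FOV."""
        half = fov / 2.0
        yaw_step = 360.0 / self.yaw_bins
        pitch_step = 180.0 / self.pitch_bins
        for p in range(self.pitch_bins):
            centre_pitch = -90.0 + (p + 0.5) * pitch_step
            if abs(centre_pitch - pitch) > half:
                continue
            for y in range(self.yaw_bins):
                centre_yaw = (y + 0.5) * yaw_step
                # Signed yaw difference wrapped into [-180, 180).
                diff = (centre_yaw - yaw + 180.0) % 360.0 - 180.0
                if abs(diff) <= half:
                    yield p, y

    def integrate(self, yaw, pitch, fov=90.0):
        """Fold one local observation into the shared 360-degree grid."""
        for p, y in self._cells(yaw, pitch, fov):
            self.seen[p][y] = True

    def new_cells(self, yaw, pitch, fov=90.0):
        """Count cells a candidate view would reveal that are not yet in memory."""
        return sum(1 for p, y in self._cells(yaw, pitch, fov) if not self.seen[p][y])

    def coverage(self):
        """Fraction of the grid observed so far."""
        total = self.yaw_bins * self.pitch_bins
        return sum(row.count(True) for row in self.seen) / total


def next_view(memory, candidates, fov=90.0):
    """Greedy rule (an assumption): pick the candidate revealing the most unseen cells."""
    return max(candidates, key=lambda c: memory.new_cells(c[0], c[1], fov))
```

In the paper, the vision-language model reads the memory and chooses the next adjustment itself; the greedy rule here only illustrates why an explicit record of what has been seen makes redundant views easy to avoid.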
What carries the argument
EgoSphere, an explicit spatial visual memory that progressively integrates sequential local observations into a unified 360° representation.
If this is right
- Referring segmentation can be performed by an agent that must move rather than by a model that receives a single fixed image.
- Explicit memory of the full panorama removes the need for exhaustive scanning patterns.
- Reinforcement learning after supervised fine-tuning directly reduces the number of viewpoint changes required.
- Active viewpoint alignment at the end of search improves mask quality over masks taken from arbitrary angles.
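The third implication above rests on a reward that favours shorter, non-redundant paths. The review does not give the paper's reward, so the following is only a sketch of one plausible shaping. The function name `trajectory_reward`, the penalty weights, and the 30° revisit threshold are hypothetical.

```python
def trajectory_reward(views, found, step_cost=0.05, revisit_cost=0.2, min_sep=30.0):
    """Hypothetical shaped reward: success bonus minus per-step and revisit penalties.

    views: list of (yaw, pitch) in degrees, in the order they were visited.
    found: whether the referred object was located by the end of the trajectory.
    """
    def separation(a, b):
        # Wrapped yaw difference and plain pitch difference; take the larger.
        d_yaw = abs((a[0] - b[0] + 180.0) % 360.0 - 180.0)
        return max(d_yaw, abs(a[1] - b[1]))

    reward = 1.0 if found else 0.0
    for i, view in enumerate(views):
        reward -= step_cost  # every viewpoint change costs something
        if any(separation(view, earlier) < min_sep for earlier in views[:i]):
            reward -= revisit_cost  # extra penalty for looking where we already looked
    return reward
```

Under this shaping, two successful trajectories are ranked by how few steps they take and how rarely they return to an already observed direction, which is the behaviour the reinforcement learning stage is said to encourage.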
Where Pith is reading between the lines
- The same memory structure could be reused for other embodied tasks that require stitching partial views, such as object rearrangement or navigation to a described landmark.
- If EgoSphere can be updated in real time on a physical robot, the method would transfer from simulation to hardware without retraining the memory module.
- The approach suggests that panoramic referring segmentation is best solved by separating the search phase (memory-driven) from the final segmentation phase (view-aligned).
Load-bearing premise
That accumulating local views into one shared 360° memory is sufficient for the agent to choose non-redundant next directions without losing track of the target.
What would settle it
The premise would be falsified if, on the APRS benchmark, an ablated agent that discards EgoSphere and uses only the current view plus random or heuristic moves reached the target in as few steps as the full PanoSeeker, or fewer.
Original abstract
Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
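The action space in the abstract is a pair of viewing-direction adjustments. A minimal sketch of how such an action might be applied is given below. It assumes that θ is yaw (azimuth) and ϕ is pitch (elevation), that angles are in degrees, that yaw wraps around the full circle, and that pitch is clamped at the poles. None of these conventions is stated in the abstract, and `apply_action` is a hypothetical helper.

```python
def apply_action(yaw, pitch, d_yaw, d_pitch):
    """Apply a (d_yaw, d_pitch) adjustment to a viewing direction, in degrees.

    Assumed conventions: yaw lives in [0, 360) and wraps; pitch lives in
    [-90, 90] and is clamped so the view cannot flip over a pole.
    """
    new_yaw = (yaw + d_yaw) % 360.0
    new_pitch = max(-90.0, min(90.0, pitch + d_pitch))
    return new_yaw, new_pitch
```

For example, `apply_action(350.0, 0.0, 20.0, 0.0)` returns `(10.0, 0.0)`: the agent turns past the seam of the panorama and keeps going, which is what makes the environment continuous in yaw.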
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the task of Active Panoramic Referring Segmentation (APRS), requiring an agent to actively adjust viewing directions (Δθ, Δϕ) in a 360° environment to locate and segment a referred object. It proposes PanoSeeker, which augments a Vision-Language Model with EgoSphere (an explicit spatial visual memory) to integrate sequential local observations into a unified 360° representation for planning efficient search trajectories. The pipeline uses supervised fine-tuning on an expert-annotated trajectory dataset with memory timelines, followed by reinforcement learning to optimize exploration efficiency. On a newly curated APRS benchmark, PanoSeeker is reported to outperform adapted state-of-the-art baselines in both search efficiency and segmentation accuracy.
Significance. If the empirical claims hold, the work addresses a clear gap between passive referring segmentation and active embodied perception in panoramic settings. The explicit EgoSphere memory and the SFT-then-RL training regimen constitute a concrete, reproducible pipeline for non-redundant exploration. The release of an expert-trajectory benchmark with memory timelines is a tangible community resource that could support further research in embodied vision-language tasks.
minor comments (1)
- The abstract states that 'extensive experiments' demonstrate superior performance, yet provides no quantitative metrics, dataset statistics, ablation results, or descriptions of how baselines were adapted; the full experimental section is required to evaluate these claims.
Simulated Author's Rebuttal
We thank the referee for their summary of our work on Active Panoramic Referring Segmentation and for recognizing the significance of the APRS task, EgoSphere memory, and the expert-trajectory benchmark. The recommendation of 'uncertain' is noted, but the report lists no specific major comments to address. We provide a point-by-point structure below for completeness; if additional comments were intended, we would be glad to respond in a revision.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper introduces a new task (APRS) and proposes PanoSeeker as a VLM + EgoSphere pipeline trained via SFT on expert trajectories followed by RL, with empirical claims of superior efficiency and accuracy on a newly curated benchmark. No equations, derivations, or load-bearing steps are present that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The method description and performance claims are self-contained empirical contributions without the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
invented entities (1)
- EgoSphere: no independent evidence