textit{Don't Guess, Just Ask}: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
Pith reviewed 2026-05-20 13:33 UTC · model grok-4.3
The pith
A multi-turn clarification framework resolves ambiguity in referring segmentation by asking questions instead of guessing user intent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IC-Seg is a novel agentic framework that proactively clarifies user intent through multi-turn conversation before performing segmentation on images or videos. To train this capability, Hi-GRPO injects dense and informative supervision signals at the trajectory, turn, and step levels to encourage efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. This leads to superior performance in resolving ambiguous queries on the new Ambi-RVOS benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.
What carries the argument
IC-Seg agentic framework for multi-turn intent clarification in referring segmentation, driven by the Hi-GRPO hierarchical optimization strategy that provides dense supervision at trajectory, turn, and step levels.
Load-bearing premise
Users will engage with and benefit from multi-turn clarification in practice, and the Hi-GRPO strategy will provide effective dense supervision without introducing dialogue inefficiencies or new failure modes.
What would settle it
If evaluations on Ambi-RVOS show that IC-Seg does not outperform baselines by a large margin or if dialogue quality metrics indicate more inefficiencies, the central claim would be falsified.
Figures
read the original abstract
Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose \textbf{IC-Seg}, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce \textbf{Hi-GRPO}, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish \textbf{Ambi-RVOS}, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at \url{https://github.com/iSEE-Laboratory/IC-Seg}.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IC-Seg, an agentic framework for referring video object segmentation that proactively resolves ambiguous user queries via multi-turn clarification dialogues instead of guessing. It introduces Hi-GRPO, a hierarchical optimization strategy that supplies dense supervision signals at the trajectory, turn, and step levels to promote efficient intent clarification and reduce redundant interactions. A new benchmark Ambi-RVOS is created to evaluate performance on ambiguous queries, with claims of large-margin outperformance on this benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.
Significance. If the empirical claims hold, the work addresses a practical gap in referring segmentation by moving beyond the assumption of unambiguous queries, which is common in real-world use. The hierarchical reward design and the Ambi-RVOS benchmark could serve as useful tools for developing more robust interactive vision-language models, provided the gains are shown to stem from the agentic clarification mechanism rather than optimization artifacts.
major comments (3)
- §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
- §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
- §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
minor comments (2)
- The abstract and introduction repeatedly use 'large margin' without defining the metric or providing the numerical delta; this should be replaced with concrete numbers (e.g., mIoU improvement) once the tables are referenced.
- Notation for the three reward levels in Hi-GRPO (trajectory, turn, step) is introduced without a compact equation summarizing their weighted combination; adding such an equation would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor that we have addressed through revisions to the manuscript. Below we respond point-by-point to each major comment.
read point-by-point responses
-
Referee: §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
Authors: We agree that isolating the contribution of each hierarchical reward level is necessary to substantiate the claims. In the revised manuscript we have added a dedicated ablation subsection in §4.2 (new Table 4) that systematically removes the trajectory-level, turn-level, and step-level reward terms one at a time. For each variant we report average dialogue turns, clarification success rate on Ambi-RVOS, and segmentation performance on the original non-ambiguous benchmarks (RefCOCO, RefCOCO+, DAVIS). The results show that the full three-level hierarchy yields the highest efficiency and accuracy without increasing dialogue length or introducing new failure modes. revision: yes
-
Referee: §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
Authors: We acknowledge the need for statistical reporting and a controlled baseline. The revised Table 2 now includes mean and standard deviation across three independent runs with different random seeds. We have also added a direct comparison row for a non-hierarchical GRPO baseline (trajectory reward only) trained under identical conditions. The updated results confirm that the performance margin on Ambi-RVOS is attributable to the hierarchical supervision enabling more effective multi-turn clarification rather than optimization or benchmark artifacts alone. revision: yes
-
Referee: §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
Authors: We agree that a controlled side-by-side evaluation is required. We have inserted a new Table 3 in §5.2 that compares IC-Seg against prior state-of-the-art methods using exactly the same backbone, training data, and optimization schedule on the standard benchmarks (RefCOCO, RefCOCO+, DAVIS). The table demonstrates that IC-Seg retains or slightly exceeds prior SOTA numbers, indicating that the hierarchical reward terms do not degrade performance on unambiguous queries. revision: yes
Circularity Check
No circularity; new framework and benchmark are self-contained
full rationale
The paper introduces IC-Seg as a new agentic multi-turn clarification framework and Hi-GRPO as a hierarchical optimization with trajectory/turn/step rewards, plus the Ambi-RVOS benchmark. Claims of outperformance on ambiguous queries and maintained SOTA on standard benchmarks rest on empirical results from these novel elements rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation reduces to its own inputs by construction; the work is independent of prior fitted quantities.
Axiom & Free-Parameter Ledger
invented entities (3)
-
IC-Seg
no independent evidence
-
Hi-GRPO
no independent evidence
-
Ambi-RVOS
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Hi-GRPO ... injects dense ... supervision signals at the trajectory, turn, and step levels ... Rturn = Rent + Reff ... entropy reduction ... Reff = 1/K Σ I(Nk < Nk-1)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
IC-Seg ... multi-turn conversation before segmentation ... Ambi-RVOS benchmark
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
On-policy distillation of language models: Learning from self-generated mistakes
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024
work page 2024
-
[2]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, 37:6833–6859, 2024
work page 2024
-
[4]
End-to-end referring video object segmentation with multimodal transformers
Adam Botach, Evgenii Zheltonozhskii, and Chaim Baskin. End-to-end referring video object segmentation with multimodal transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4985–4995, 2022
work page 2022
-
[5]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020
work page 2020
-
[6]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing
Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, and Chunyuan Li. Llava- interactive: An all-in-one demo for image chat, segmentation, generation and editing.arXiv preprint arXiv:2311.00571, 2023
-
[8]
Sam- wise: Infusing wisdom in sam2 for text-driven video segmentation
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Sam- wise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3395–3405, June 2025
work page 2025
-
[9]
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017
work page 2017
-
[10]
Guesswhat?! visual object discovery through multi-modal dialogue
Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512, 2017
work page 2017
-
[11]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023
work page 2023
-
[12]
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025. 10
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[14]
Reinforcing video reasoning segmentation to think before it segments
Sitong Gong, Yunzhi Zhuge, Lu Zhang, Jiazuo Yu, Xu Jia, Pingping Zhang, and Huchuan Lu. Reinforcing video reasoning segmentation to think before it segments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026
work page 2026
-
[15]
SAM-r1: Leveraging SAM for reward feedback in multimodal segmentation via reinforcement learning
Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. SAM-r1: Leveraging SAM for reward feedback in multimodal segmentation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025
work page 2025
-
[16]
Reinforcement Learning via Self-Distillation
Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Haichao Jiang, Tianming Liang, Wei-Shi Zheng, and Jian-Fang Hu. Refer-agent: A collaborative multi-agent system with reasoning and reflection for referring video object segmentation.arXiv preprint arXiv:2602.03595, 2026
-
[18]
Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning
Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InSecond Conference on Language Modeling, 2025
work page 2025
-
[19]
Shiu-hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Cot-rvs: Zero-shot chain-of-thought reasoning segmentation for videos.arXiv preprint arXiv:2505.18561, 2025
-
[20]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024
work page 2024
-
[21]
Iag: Input-aware backdoor attack on vlm-based visual grounding
Junxian Li, Beining Xu, and Di Zhang. Iag: Input-aware backdoor attack on vlm-based visual grounding. 2025. URLhttps://api.semanticscholar.org/CorpusID:280641739
work page 2025
-
[22]
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[23]
Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, and Jian-Fang Hu. Long-rvos: A comprehensive benchmark for long-term referring video object segmentation.arXiv preprint arXiv:2505.12702, 2025
-
[24]
Referdino: Referring video object segmentation with visual grounding foundations
Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, and Jian-Fang Hu. Referdino: Referring video object segmentation with visual grounding foundations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
work page 2025
-
[25]
Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, and Wei-Shi Zheng. Seg-research: Segmentation with interleaved reasoning and external search.arXiv preprint arXiv:2602.04454, 2026
-
[26]
Glus: Global-local reasoning unified into a single large language model for video segmentation
Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[27]
Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023
work page 2023
-
[28]
Unipixel: Unified object referring and segmentation for pixel-level visual reasoning
Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Shan Ying, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 11
work page 2025
-
[29]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg- zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Soc: Semantic-assisted object cluster for referring video object segmentation
Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation. Advances in Neural Information Processing Systems, 36:26425–26437, 2023
work page 2023
-
[31]
Spectrum-guided multi- granularity referring video object segmentation
Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi- granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023
work page 2023
-
[32]
Xing, Fahad Shahbaz Khan, and Salman H
Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric P. Xing, Fahad Shahbaz Khan, and Salman H. Khan. Videoglamm : A large multimodal model for pixel-level visual grounding in videos.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19036–19046, 2024. URLhttps://api.semanticscholar.org/CorpusID:273878153
work page 2025
-
[33]
A survey on llm-based conversational user simulation
Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4266–4301, 2026
work page 2026
-
[34]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[35]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Urvos: Unified referring video object segmentation network with a large-scale benchmark
Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223. Springer, 2020
work page 2020
-
[37]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.URL https://arxiv. org/abs/2402.03300, 2(3):5, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Object-centric video question answering with visual grounding and referring
Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, and Stratis Gavves. Object-centric video question answering with visual grounding and referring. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22274– 22284, 2025
work page 2025
-
[39]
Fashion iq: A new dataset towards retrieving images by natural language feedback
Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021
work page 2021
-
[40]
Videoseg-r1: reasoning video object segmentation via reinforcement learning
Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li, and Lihua Cai. Videoseg-r1: reasoning video object segmentation via reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11496–11504, 2026
work page 2026
-
[41]
Visa: Reasoning video object segmentation via large language models
Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision (ECCV), pages 98–115. Springer, 2024
work page 2024
-
[42]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026. 12
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[44]
DAPO: An open-source LLM reinforcement learning system at scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...
work page 2025
-
[45]
Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 21805–21830, 2025. 13 A More Implementation Details We implement our method based on the VERL framework....
work page 2025
-
[46]
You are STRICTLY FORBIDDEN from guessing or picking an option when ambiguity exists
-
[47]
Always look at the video frames first before you believe the target is uniquely identified. If after watching the video, you find that the target is not unique and you need to call the ’vlm_tool’
- [48]
-
[49]
Ask about exactly one visual attribute (color / direction / action / relative position / shape). For static questions, especially absolute position,consider including a specific frame number when asking the question so that vlm_tool can answer the question more accurately
-
[50]
You must make this judgment yourself based on the provided video frames
You are FORBIDDEN from using ’vlm_tool’ to ask which frame is the clearest or most visible. You must make this judgment yourself based on the provided video frames. # Note: Always use the viewer’s perspective for left/right orientation System Prompt for User Simulator in Answering Questions You are an expert visual analysis assistant. Your task is to accu...
-
[51]
If there are other similar or identical objects in the image, you MUST IGNORE THEM
You must strictly focus on the object inside the red contour. If there are other similar or identical objects in the image, you MUST IGNORE THEM. Your answer must apply ONLY to the contoured target, never mixing its attributes with others
-
[52]
The query is designed to be answered by observing the entire temporal sequence
You must track the red contour across all provided frames. The query is designed to be answered by observing the entire temporal sequence. 19
-
[53]
Answer ONLY the specific question posed
You are ABSOLUTELY FORBIDDEN from revealing any additional attributes, colors, actions, or context about the target that were not explicitly asked for. Answer ONLY the specific question posed
-
[54]
If a question is ambiguous or cannot be answered definitively, provide a clear indication and request clarification. # Output Format: You must strictly output your reasoning in <thinking> tags, followed by your final concise answer in <answer> tags. <thinking>
-
[55]
Track the object enclosed in the RED CONTOUR from the first frame to the last
-
[56]
Synthesize the object’s action, movement, or interaction across the timeline
-
[57]
neither”. </thinking> <answer> Provide a concise answer (e.g., “it moved to the table
Formulate the absolute minimal text needed to answer the query. If none fit the observed events, conclude “neither”. </thinking> <answer> Provide a concise answer (e.g., “it moved to the table”, “the red one”, “yes”, “no”, “neither”...). - Do NOT mention “red contour”, “red box”, or provide unnecessary explanations in this tag. </answer> # Note: Always us...
-
[58]
Calculate Target Subsets: Determine the remaining candidate count after each dialogue turn
-
[59]
Formulate Holistic Guidance: Abstract the reasoning trajectory into an objective, forward- looking tactical manual. - Extract Principles: Translate successful filtering actions into general declarative rules about candidate space reduction. - Correct Inefficiencies: Translate redundant steps into proactive optimization rules for better information gain (e...
-
[60]
Identify initial candidate objects based on inputs
-
[61]
Sequential Filtering: Iterate through the Dialogue Sequence to identify remaining objects per turn
-
[62]
Analyze trajectory to form holistic advice, strictly adhering to constraints. </thinking> <output> { “sequential_subset_count”: <list of int>, Starts with initial_count, followed by count after each turn. Non-increasing. Length = len(dialogue) + 1. “holistic_guidance”: <string> A comprehensive, objective paragraph of tactical principles focusing on disamb...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.