pith. sign in

arxiv: 2605.17531 · v1 · pith:IWK3ZIQBnew · submitted 2026-05-17 · 💻 cs.CV

textit{Don't Guess, Just Ask}: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

Pith reviewed 2026-05-20 13:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords referring segmentationambiguitymulti-turn dialogueclarificationagentic frameworkvideo object segmentationhierarchical optimizationintent resolution
0
0 comments X

The pith

A multi-turn clarification framework resolves ambiguity in referring segmentation by asking questions instead of guessing user intent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that referring segmentation systems can avoid guessing at ambiguous user queries by proactively engaging in multi-turn conversations to clarify the intended target object. A reader would care if this is true because real users often give imprecise descriptions, leading current models to produce incorrect segmentations. The authors introduce IC-Seg as an agentic system that performs this clarification and Hi-GRPO as a hierarchical optimization strategy to provide dense supervision at trajectory, turn, and step levels for efficiency. They also create the Ambi-RVOS benchmark to evaluate such ambiguous scenarios. If correct, this shifts the paradigm from one-shot guessing to interactive intent resolution in vision-language segmentation tasks.

Core claim

IC-Seg is a novel agentic framework that proactively clarifies user intent through multi-turn conversation before performing segmentation on images or videos. To train this capability, Hi-GRPO injects dense and informative supervision signals at the trajectory, turn, and step levels to encourage efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. This leads to superior performance in resolving ambiguous queries on the new Ambi-RVOS benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.

What carries the argument

IC-Seg agentic framework for multi-turn intent clarification in referring segmentation, driven by the Hi-GRPO hierarchical optimization strategy that provides dense supervision at trajectory, turn, and step levels.

Load-bearing premise

Users will engage with and benefit from multi-turn clarification in practice, and the Hi-GRPO strategy will provide effective dense supervision without introducing dialogue inefficiencies or new failure modes.

What would settle it

If evaluations on Ambi-RVOS show that IC-Seg does not outperform baselines by a large margin or if dialogue quality metrics indicate more inefficiencies, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.17531 by Haichao Jiang, Jian-Fang Hu, Quan Zhang, Tianming Liang, Yuting Yang.

Figure 1
Figure 1. Figure 1: An example of ambiguous referring segmentation. When the user query lacks complete [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the IC-Seg framework. IC-Seg resolves ambiguities via multi-turn dialogues [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparisons among our IC-Seg and two baselines. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: shows the training dynamics of IC-Seg-8B. The localization-related rewards, including RIoU, Rbox , Rpoint, and Rframe, steadily increase during training, indicating that the model gradually improves its final grounding accuracy and keyframe selection. The process reward also rises consis￾tently, suggesting that Hi-GRPO encourages more effective clarification behavior rather than only optimizing the final s… view at source ↗
read the original abstract

Referring segmentation aims to segment the target objects in images or videos based on the textual query. Despite remarkable progress over the past years, existing works always assume that the user-provided queries are already precise and clear. However, this assumption is impractical. In real-world scenarios, it is unrealistic to expect all users to thoroughly review their visual content and carefully ensure their queries are unique and unambiguous. When encountering such cases, existing segmentation models tend to arbitrarily guess the user preferences, often resulting in undesired outcomes. To address this limitation, we propose \textbf{IC-Seg}, a novel agentic framework that proactively clarifies user intent through multi-turn conversation before segmentation. To effectively incentivize this capability, we further introduce \textbf{Hi-GRPO}, a new hierarchical optimization strategy that injects dense and informative supervision signals at the trajectory, turn, and step levels. This strategy encourages efficient intent clarification, effectively eliminating redundant interactions and improving overall dialogue quality. For evaluation, we establish \textbf{Ambi-RVOS}, a referring video object segmentation benchmark with ambiguous user queries. Extensive experiments demonstrate that IC-Seg not only outperforms existing methods by a large margin in resolving ambiguous queries, but also maintains state-of-the-art performance on standard reasoning segmentation benchmarks. Code and data will be released at \url{https://github.com/iSEE-Laboratory/IC-Seg}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes IC-Seg, an agentic framework for referring video object segmentation that proactively resolves ambiguous user queries via multi-turn clarification dialogues instead of guessing. It introduces Hi-GRPO, a hierarchical optimization strategy that supplies dense supervision signals at the trajectory, turn, and step levels to promote efficient intent clarification and reduce redundant interactions. A new benchmark Ambi-RVOS is created to evaluate performance on ambiguous queries, with claims of large-margin outperformance on this benchmark while retaining state-of-the-art results on standard reasoning segmentation benchmarks.

Significance. If the empirical claims hold, the work addresses a practical gap in referring segmentation by moving beyond the assumption of unambiguous queries, which is common in real-world use. The hierarchical reward design and the Ambi-RVOS benchmark could serve as useful tools for developing more robust interactive vision-language models, provided the gains are shown to stem from the agentic clarification mechanism rather than optimization artifacts.

major comments (3)
  1. §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.
  2. §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.
  3. §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.
minor comments (2)
  1. The abstract and introduction repeatedly use 'large margin' without defining the metric or providing the numerical delta; this should be replaced with concrete numbers (e.g., mIoU improvement) once the tables are referenced.
  2. Notation for the three reward levels in Hi-GRPO (trajectory, turn, step) is introduced without a compact equation summarizing their weighted combination; adding such an equation would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects of experimental rigor that we have addressed through revisions to the manuscript. Below we respond point-by-point to each major comment.

read point-by-point responses
  1. Referee: §4.2 (Hi-GRPO description): The central claim that Hi-GRPO delivers dense supervision improving clarification efficiency without new failure modes or dialogue bloat is load-bearing for the large-margin gains on Ambi-RVOS, yet the manuscript provides no ablation that isolates or removes the trajectory/turn/step reward terms individually while reporting turn counts, success rates, and performance on the original non-ambiguous benchmarks.

    Authors: We agree that isolating the contribution of each hierarchical reward level is necessary to substantiate the claims. In the revised manuscript we have added a dedicated ablation subsection in §4.2 (new Table 4) that systematically removes the trajectory-level, turn-level, and step-level reward terms one at a time. For each variant we report average dialogue turns, clarification success rate on Ambi-RVOS, and segmentation performance on the original non-ambiguous benchmarks (RefCOCO, RefCOCO+, DAVIS). The results show that the full three-level hierarchy yields the highest efficiency and accuracy without increasing dialogue length or introducing new failure modes. revision: yes

  2. Referee: §5.1 and Table 2 (Ambi-RVOS results): The reported large-margin outperformance is stated without accompanying quantitative metrics, variance across runs, or direct comparison to a non-hierarchical GRPO baseline, making it impossible to verify that the margin is attributable to the multi-turn clarification policy rather than the new optimization or benchmark construction.

    Authors: We acknowledge the need for statistical reporting and a controlled baseline. The revised Table 2 now includes mean and standard deviation across three independent runs with different random seeds. We have also added a direct comparison row for a non-hierarchical GRPO baseline (trajectory reward only) trained under identical conditions. The updated results confirm that the performance margin on Ambi-RVOS is attributable to the hierarchical supervision enabling more effective multi-turn clarification rather than optimization or benchmark artifacts alone. revision: yes

  3. Referee: §5.2 (standard benchmark retention): The assertion that IC-Seg maintains SOTA performance on existing reasoning segmentation benchmarks while adding clarification capability requires explicit side-by-side tables with the same backbone and training regime; without these, it remains unclear whether the hierarchical terms introduce any degradation on unambiguous queries.

    Authors: We agree that a controlled side-by-side evaluation is required. We have inserted a new Table 3 in §5.2 that compares IC-Seg against prior state-of-the-art methods using exactly the same backbone, training data, and optimization schedule on the standard benchmarks (RefCOCO, RefCOCO+, DAVIS). The table demonstrates that IC-Seg retains or slightly exceeds prior SOTA numbers, indicating that the hierarchical reward terms do not degrade performance on unambiguous queries. revision: yes

Circularity Check

0 steps flagged

No circularity; new framework and benchmark are self-contained

full rationale

The paper introduces IC-Seg as a new agentic multi-turn clarification framework and Hi-GRPO as a hierarchical optimization with trajectory/turn/step rewards, plus the Ambi-RVOS benchmark. Claims of outperformance on ambiguous queries and maintained SOTA on standard benchmarks rest on empirical results from these novel elements rather than any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation reduces to its own inputs by construction; the work is independent of prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 3 invented entities

The central claim rests on the effectiveness of the newly introduced IC-Seg framework, Hi-GRPO optimization, and Ambi-RVOS benchmark rather than on pre-existing axioms or fitted parameters described in the abstract.

invented entities (3)
  • IC-Seg no independent evidence
    purpose: Agentic framework for proactive multi-turn intent clarification before segmentation
    Newly proposed system to address the limitation of ambiguous queries.
  • Hi-GRPO no independent evidence
    purpose: Hierarchical optimization injecting supervision at trajectory, turn, and step levels
    New strategy to incentivize efficient clarification capability.
  • Ambi-RVOS no independent evidence
    purpose: Benchmark for referring video object segmentation with ambiguous user queries
    New dataset established to evaluate performance on ambiguous cases.

pith-pipeline@v0.9.0 · 5789 in / 1259 out tokens · 59714 ms · 2026-05-20T13:33:58.906229+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 11 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations, 2024

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, 37:6833–6859, 2024

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos.Advances in Neural Information Processing Systems, 37:6833–6859, 2024

  4. [4]

    End-to-end referring video object segmentation with multimodal transformers

    Adam Botach, Evgenii Zheltonozhskii, and Chaim Baskin. End-to-end referring video object segmentation with multimodal transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4985–4995, 2022

  5. [5]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

  6. [6]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  7. [7]

    Llava-interactive: An all-in-one demo for image chat, segmentation, generation and editing

    Wei-Ge Chen, Irina Spiridonova, Jianwei Yang, Jianfeng Gao, and Chunyuan Li. Llava- interactive: An all-in-one demo for image chat, segmentation, generation and editing.arXiv preprint arXiv:2311.00571, 2023

  8. [8]

    Sam- wise: Infusing wisdom in sam2 for text-driven video segmentation

    Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, and Giuseppe Averta. Sam- wise: Infusing wisdom in sam2 for text-driven video segmentation. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 3395–3405, June 2025

  9. [9]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017

  10. [10]

    Guesswhat?! visual object discovery through multi-modal dialogue

    Harm De Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, and Aaron Courville. Guesswhat?! visual object discovery through multi-modal dialogue. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5503–5512, 2017

  11. [11]

    Mevis: A large-scale benchmark for video segmentation with motion expressions

    Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. InProceedings of the IEEE/CVF international conference on computer vision, pages 2694–2703, 2023

  12. [12]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025. 10

  13. [13]

    WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

    Xinyu Geng, Peng Xia, Zhen Zhang, Xinyu Wang, Qiuchen Wang, Ruixue Ding, Chenxi Wang, Jialong Wu, Yida Zhao, Kuan Li, et al. Webwatcher: Breaking new frontier of vision-language deep research agent.arXiv preprint arXiv:2508.05748, 2025

  14. [14]

    Reinforcing video reasoning segmentation to think before it segments

    Sitong Gong, Yunzhi Zhuge, Lu Zhang, Jiazuo Yu, Xu Jia, Pingping Zhang, and Huchuan Lu. Reinforcing video reasoning segmentation to think before it segments. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026

  15. [15]

    SAM-r1: Leveraging SAM for reward feedback in multimodal segmentation via reinforcement learning

    Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. SAM-r1: Leveraging SAM for reward feedback in multimodal segmentation via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025

  16. [16]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

  17. [17]

    Refer-agent: A collaborative multi-agent system with reasoning and reflection for referring video object segmentation.arXiv preprint arXiv:2602.03595, 2026

    Haichao Jiang, Tianming Liang, Wei-Shi Zheng, and Jian-Fang Hu. Refer-agent: A collaborative multi-agent system with reasoning and reflection for referring video object segmentation.arXiv preprint arXiv:2602.03595, 2026

  18. [18]

    Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan O Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training LLMs to reason and leverage search engines with reinforcement learning. InSecond Conference on Language Modeling, 2025

  19. [19]

    Cot-rvs: Zero-shot chain-of-thought reasoning segmentation for videos.arXiv preprint arXiv:2505.18561, 2025

    Shiu-hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Cot-rvs: Zero-shot chain-of-thought reasoning segmentation for videos.arXiv preprint arXiv:2505.18561, 2025

  20. [20]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024

  21. [21]

    Iag: Input-aware backdoor attack on vlm-based visual grounding

    Junxian Li, Beining Xu, and Di Zhang. Iag: Input-aware backdoor attack on vlm-based visual grounding. 2025. URLhttps://api.semanticscholar.org/CorpusID:280641739

  22. [22]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

  23. [23]

    Long-rvos: A comprehensive benchmark for long-term referring video object segmentation.arXiv preprint arXiv:2505.12702, 2025

    Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, and Jian-Fang Hu. Long-rvos: A comprehensive benchmark for long-term referring video object segmentation.arXiv preprint arXiv:2505.12702, 2025

  24. [24]

    Referdino: Referring video object segmentation with visual grounding foundations

    Tianming Liang, Kun-Yu Lin, Chaolei Tan, Jianguo Zhang, Wei-Shi Zheng, and Jian-Fang Hu. Referdino: Referring video object segmentation with visual grounding foundations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025

  25. [25]

    Seg-research: Segmentation with interleaved reasoning and external search.arXiv preprint arXiv:2602.04454, 2026

    Tianming Liang, Qirui Du, Jian-Fang Hu, Haichao Jiang, Zicheng Lin, and Wei-Shi Zheng. Seg-research: Segmentation with interleaved reasoning and external search.arXiv preprint arXiv:2602.04454, 2026

  26. [26]

    Glus: Global-local reasoning unified into a single large language model for video segmentation

    Lang Lin, Xueyang Yu, Ziqi Pang, and Yu-Xiong Wang. Glus: Global-local reasoning unified into a single large language model for video segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  28. [28]

    Unipixel: Unified object referring and segmentation for pixel-level visual reasoning

    Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Shan Ying, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 11

  29. [29]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg- zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  30. [30]

    Soc: Semantic-assisted object cluster for referring video object segmentation

    Zhuoyan Luo, Yicheng Xiao, Yong Liu, Shuyan Li, Yitong Wang, Yansong Tang, Xiu Li, and Yujiu Yang. Soc: Semantic-assisted object cluster for referring video object segmentation. Advances in Neural Information Processing Systems, 36:26425–26437, 2023

  31. [31]

    Spectrum-guided multi- granularity referring video object segmentation

    Bo Miao, Mohammed Bennamoun, Yongsheng Gao, and Ajmal Mian. Spectrum-guided multi- granularity referring video object segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 920–930, 2023

  32. [32]

    Xing, Fahad Shahbaz Khan, and Salman H

    Shehan Munasinghe, Hanan Gani, Wenqi Zhu, Jiale Cao, Eric P. Xing, Fahad Shahbaz Khan, and Salman H. Khan. Videoglamm : A large multimodal model for pixel-level visual grounding in videos.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19036–19046, 2024. URLhttps://api.semanticscholar.org/CorpusID:273878153

  33. [33]

    A survey on llm-based conversational user simulation

    Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4266–4301, 2026

  34. [34]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  35. [35]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  36. [36]

    Urvos: Unified referring video object segmentation network with a large-scale benchmark

    Seonguk Seo, Joon-Young Lee, and Bohyung Han. Urvos: Unified referring video object segmentation network with a large-scale benchmark. InEuropean conference on computer vision, pages 208–223. Springer, 2020

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.URL https://arxiv. org/abs/2402.03300, 2(3):5, 2024

  38. [38]

    Object-centric video question answering with visual grounding and referring

    Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, and Stratis Gavves. Object-centric video question answering with visual grounding and referring. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22274– 22284, 2025

  39. [39]

    Fashion iq: A new dataset towards retrieving images by natural language feedback

    Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. Fashion iq: A new dataset towards retrieving images by natural language feedback. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317, 2021

  40. [40]

    Videoseg-r1: reasoning video object segmentation via reinforcement learning

    Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li, and Lihua Cai. Videoseg-r1: reasoning video object segmentation via reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11496–11504, 2026

  41. [41]

    Visa: Reasoning video object segmentation via large language models

    Cilin Yan, Haochen Wang, Shilin Yan, Xiaolong Jiang, Yao Hu, Guoliang Kang, Weidi Xie, and Efstratios Gavves. Visa: Reasoning video object segmentation via large language models. In European Conference on Computer Vision (ECCV), pages 98–115. Springer, 2024

  42. [42]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  43. [43]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026. 12

  44. [44]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, YuYue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao ...

  45. [45]

    Baseline

    Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 21805–21830, 2025. 13 A More Implementation Details We implement our method based on the VERL framework....

  46. [46]

    You are STRICTLY FORBIDDEN from guessing or picking an option when ambiguity exists

  47. [47]

    If after watching the video, you find that the target is not unique and you need to call the ’vlm_tool’

    Always look at the video frames first before you believe the target is uniquely identified. If after watching the video, you find that the target is not unique and you need to call the ’vlm_tool’

  48. [48]

    the target

    Do not put the original query inside your question. Use “the target” instead

  49. [49]

    For static questions, especially absolute position,consider including a specific frame number when asking the question so that vlm_tool can answer the question more accurately

    Ask about exactly one visual attribute (color / direction / action / relative position / shape). For static questions, especially absolute position,consider including a specific frame number when asking the question so that vlm_tool can answer the question more accurately

  50. [50]

    You must make this judgment yourself based on the provided video frames

    You are FORBIDDEN from using ’vlm_tool’ to ask which frame is the clearest or most visible. You must make this judgment yourself based on the provided video frames. # Note: Always use the viewer’s perspective for left/right orientation System Prompt for User Simulator in Answering Questions You are an expert visual analysis assistant. Your task is to accu...

  51. [51]

    If there are other similar or identical objects in the image, you MUST IGNORE THEM

    You must strictly focus on the object inside the red contour. If there are other similar or identical objects in the image, you MUST IGNORE THEM. Your answer must apply ONLY to the contoured target, never mixing its attributes with others

  52. [52]

    The query is designed to be answered by observing the entire temporal sequence

    You must track the red contour across all provided frames. The query is designed to be answered by observing the entire temporal sequence. 19

  53. [53]

    Answer ONLY the specific question posed

    You are ABSOLUTELY FORBIDDEN from revealing any additional attributes, colors, actions, or context about the target that were not explicitly asked for. Answer ONLY the specific question posed

  54. [54]

    # Output Format: You must strictly output your reasoning in <thinking> tags, followed by your final concise answer in <answer> tags

    If a question is ambiguous or cannot be answered definitively, provide a clear indication and request clarification. # Output Format: You must strictly output your reasoning in <thinking> tags, followed by your final concise answer in <answer> tags. <thinking>

  55. [55]

    Track the object enclosed in the RED CONTOUR from the first frame to the last

  56. [56]

    Synthesize the object’s action, movement, or interaction across the timeline

  57. [57]

    neither”. </thinking> <answer> Provide a concise answer (e.g., “it moved to the table

    Formulate the absolute minimal text needed to answer the query. If none fit the observed events, conclude “neither”. </thinking> <answer> Provide a concise answer (e.g., “it moved to the table”, “the red one”, “yes”, “no”, “neither”...). - Do NOT mention “red contour”, “red box”, or provide unnecessary explanations in this tag. </answer> # Note: Always us...

  58. [58]

    Calculate Target Subsets: Determine the remaining candidate count after each dialogue turn

  59. [59]

    Query multi-axis positions simultaneously to eliminate larger subsets, rather than verifying single axes

    Formulate Holistic Guidance: Abstract the reasoning trajectory into an objective, forward- looking tactical manual. - Extract Principles: Translate successful filtering actions into general declarative rules about candidate space reduction. - Correct Inefficiencies: Translate redundant steps into proactive optimization rules for better information gain (e...

  60. [60]

    Identify initial candidate objects based on inputs

  61. [61]

    Sequential Filtering: Iterate through the Dialogue Sequence to identify remaining objects per turn

  62. [62]

    sequential_subset_count

    Analyze trajectory to form holistic advice, strictly adhering to constraints. </thinking> <output> { “sequential_subset_count”: <list of int>, Starts with initial_count, followed by count after each turn. Non-increasing. Length = len(dialogue) + 1. “holistic_guidance”: <string> A comprehensive, objective paragraph of tactical principles focusing on disamb...