
arxiv: 2605.23500 · v1 · pith:NKHY7HFY · submitted 2026-05-22 · 💻 cs.CV · cs.LG

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords referring segmentation · group relative tool optimization · GRPO · reinforcement learning · vision-language models · segmentation decoder · bootstrapped pre-training

The pith

B-GRTO jointly optimizes vision-language policies and segmentation decoders by reusing GRPO rollouts for the auxiliary tool objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces group relative tool optimization (GRTO) to integrate differentiable tool objectives into reinforcement learning for referring segmentation tasks. GRTO reuses rollouts from group relative policy optimization so that segmentation decoder gradients can directly complement policy rewards in a single optimization process. The bootstrapped variant B-GRTO adds a cheap pre-training stage for the tool that speeds convergence. Experiments across three challenging referring segmentation settings show clear gains over plain GRPO while matching or exceeding domain-specific state-of-the-art approaches.
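The "group relative" part of GRPO can be illustrated with a minimal sketch. It assumes the commonly described formulation in which each rollout's reward is normalized against the other rollouts sampled for the same input; the function name, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants exist
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts sampled for the same image and prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.5, 0.8])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed as a baseline.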

Core claim

The authors establish that reusing GRPO rollouts to optimize an auxiliary differentiable tool objective produces a mathematically grounded joint optimization in which decoder gradients complement policy rewards, and that bootstrapping this process with B-GRTO yields faster convergence and superior performance in referring segmentation.

What carries the argument

Group Relative Tool Optimization (GRTO), the framework that reuses GRPO rollouts to jointly optimize the policy reward and the differentiable tool objective.

If this is right

  • Substantial improvements over plain GRPO across three referring segmentation settings.
  • Performance that matches or surpasses domain-specific state-of-the-art methods.
  • Faster convergence from the cheap bootstrapped pre-training stage.
  • A unified treatment of reinforcement learning and differentiable auxiliary objectives for reasoning-intensive segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rollout-reuse mechanism could be tested on other vision-language tools such as object detectors or depth estimators.
  • Applying B-GRTO to larger vision-language backbones would show whether the joint optimization scales beyond current model sizes.
  • The bootstrapping step might reduce dependence on carefully hand-crafted reward functions in new segmentation domains.

Load-bearing premise

Reusing GRPO rollouts allows decoder gradients to complement policy rewards in joint optimization without introducing instability or bias.

What would settle it

A set of training runs in which B-GRTO produces no performance gain over plain GRPO or exhibits clear instability from the combined gradients would falsify the claim.
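Such a comparison could be scripted along the following lines, assuming per-seed evaluation scores are available for both plain GRPO and B-GRTO. The function names and the decision rule (the mean gain must exceed the larger per-seed spread) are hypothetical choices for illustration, not a protocol from the paper.

```python
from statistics import mean, stdev


def compare_runs(baseline_scores, candidate_scores):
    """Return (mean gain, baseline std, candidate std) over per-seed scores."""
    gain = mean(candidate_scores) - mean(baseline_scores)
    return gain, stdev(baseline_scores), stdev(candidate_scores)


def shows_clear_gain(baseline_scores, candidate_scores):
    """Heuristic check: the gain must exceed the larger per-seed spread."""
    gain, s_base, s_cand = compare_runs(baseline_scores, candidate_scores)
    return gain > max(s_base, s_cand)
```

Each list needs at least two seeds, since `statistics.stdev` computes a sample standard deviation. A stricter analysis would use a proper significance test rather than this rule of thumb.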

Figures

Figures reproduced from arXiv: 2605.23500 by Danda Pani Paudel (INSAIT, Sofia University "St. Kliment Ohridski"), Luc Van Gool, Mario Markov, Mohammad Mahdi, Stefan Maria Ailuro.

Figure 1: a) Examples requiring both reasoning and tuned segmentation tool: frozen tool (GRPO)
Figure 2: a) Most tool fine-tuning methods require instruction prompts to be perfectly specified
Figure 3: The Bootstrapped Group Relative Tool Optimization (B-GRTO) pipeline: first, the tool
Figure 4: a) Results of ablation studies conducted on EarthReason. b) B-GRPO gains compared to
Figure 5: Evaluation performance for camouflage trainings. Tracked metric is weighted F-measure.
Figure 6: Evaluation performance for remote sensing trainings. Tracked metric is mean between
Figure 7: Evaluation performance for reasoning segmentation trainings. Tracked metric is mean
Figure 8: COD10K test set qualitative results. The red box in the image shows the ground-truth
Figure 9: ReasonSeg-X test set qualitative results. The red box in the image shows the ground-truth
Figure 10: EarthReason test set qualitative results. The red box in the image shows the ground-truth
Figure 11: COD10K error study
Figure 12: EarthReason error study.

Appendix excerpt (G, Prompts used): In this section, we provide the full prompts used to query InternVL3.5-8B across all three domains. Remote sensing. We use the following template for the remote sensing domain, where "prompt" is replaced with the disentangled raw prompt as provided in the dataset. Please find "{prompt}" with bbox(es). Also provide exactly one referential noun phrase that uniquely id…

Original abstract

Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces group relative tool optimization (GRTO), a framework that reuses GRPO rollouts to jointly optimize a policy with differentiable auxiliary tool objectives (e.g., segmentation decoder gradients) in referring segmentation. It further derives Bootstrapped-GRTO (B-GRTO) as a cheap pre-training step for the tool that accelerates convergence. The central empirical claim is that B-GRTO yields substantial gains over plain GRPO across three challenging referring segmentation settings while matching or surpassing domain-specific state-of-the-art methods.

Significance. If the joint-optimization derivation and experimental results hold, the work would supply a concrete, mathematically grounded route for integrating reinforcement learning with differentiable tool objectives inside vision-language segmentation pipelines—an underexplored direction that could improve reasoning-intensive pixel-level tasks.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (framework description): the claim that GRTO provides a 'mathematically grounded' joint optimization rests on the reuse of GRPO rollouts to complement policy rewards with decoder gradients; without the explicit loss formulation, gradient-flow analysis, or stability argument in the full text, it is impossible to verify whether the auxiliary objective is independent or reduces to a fitted quantity by construction.
  2. [Abstract] Abstract: the reported 'substantial improvements' and 'matching or surpassing SOTA' across three settings cannot be assessed for statistical significance, baseline fairness, or post-hoc hyper-parameter effects because no experimental protocol, error bars, or ablation tables are visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (framework description): the claim that GRTO provides a 'mathematically grounded' joint optimization rests on the reuse of GRPO rollouts to complement policy rewards with decoder gradients; without the explicit loss formulation, gradient-flow analysis, or stability argument in the full text, it is impossible to verify whether the auxiliary objective is independent or reduces to a fitted quantity by construction.

    Authors: We appreciate the referee raising this verification concern. Section 3 of the manuscript explicitly formulates the GRTO objective as L_GRTO = L_GRPO + λ L_aux, where L_aux is the standard segmentation loss (e.g., Dice + BCE) computed on decoder outputs from the identical group rollouts used for the policy gradient. The auxiliary gradients update the decoder parameters independently via backpropagation through the differentiable tool; they are not derived from or fitted to the policy reward. A gradient-flow diagram and short stability note (bounded variance from group-relative baselines) appear in the appendix. We will expand the main-text derivation with these elements for clarity.
    Revision: partial

  2. Referee: [Abstract] Abstract: the reported 'substantial improvements' and 'matching or surpassing SOTA' across three settings cannot be assessed for statistical significance, baseline fairness, or post-hoc hyper-parameter effects because no experimental protocol, error bars, or ablation tables are visible.

    Authors: The full manuscript details the experimental protocol (datasets, training hyperparameters, baseline implementations, and evaluation metrics) in Section 4. Tables 1–3 report means and standard deviations computed over three random seeds; Section 4.3 presents ablation tables isolating the bootstrapping and joint-optimization components. We will add explicit cross-references to these sections and tables directly in the abstract and introduction to improve visibility.
    Revision: yes
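The objective quoted in the first response, L_GRTO = L_GRPO + λ L_aux with L_aux as Dice + BCE, can be sketched in plain Python for intuition. Everything beyond that formula is an assumption made for illustration: masks are flattened lists of probabilities, the auxiliary loss is averaged over the rollouts in a group, and the names `grto_objective`, `lam`, and `smooth` are hypothetical. A real implementation would use an autodiff framework so that gradients reach the decoder.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over a flattened mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap + smooth) / (sizes + smooth)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def grto_objective(policy_loss, rollout_masks, target_mask, lam=1.0):
    """policy_loss + lam * L_aux, with L_aux averaged over the group (assumed)."""
    aux = [dice_loss(m, target_mask) + bce_loss(m, target_mask)
           for m in rollout_masks]
    return policy_loss + lam * sum(aux) / len(aux)
```

Because the auxiliary term is computed on the same rollouts as the policy term, no extra decoder forward passes on fresh samples are needed in this sketch.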

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

Full rationale

The abstract and description introduce GRTO as a new framework reusing GRPO rollouts for joint optimization of policy and differentiable tool objectives, then derive B-GRTO as a bootstrapping pre-training step. No equations, self-citations, or fitted quantities are presented that reduce any claimed prediction or result to the inputs by construction. GRPO is treated as an external base method; the extension to auxiliary objectives and bootstrapping is described as independent. The performance claims rest on experimental results across three settings rather than any definitional or self-referential derivation. This matches the default case of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the approach relies on prior GRPO without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5775 in / 1087 out tokens · 45898 ms · 2026-05-25T04:25:26.197215+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 21 canonical work pages · 10 internal anchors

  1. [1]

    On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

    Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, and Tal Svoray. On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

  2. [2]

    SAM 3: Segment anything with concepts, 2025

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  3. [3]

    SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

    Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding, Deyi Ji, Cheng Chen, Qi Zhu, Chunyan Xu, Papa Mao, and Ying Zang. SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

  4. [4]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  5. [5]

    SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

    Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

  6. [6]

    Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

    Allie Luo Chen Jiang and Martin Jagersand. Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

  7. [7]

    DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

    DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

  9. [9]

    Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

    Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

  10. [10]

    SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation

    Tianyuan Du, Haopeng Li, Zhen Fan, Jiarui Zhang, Panwang Pan, and Yang Zhang. SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation. InThe Fourteenth International Conference on Learning Representations, 2026

  11. [11]

    GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

    Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, and Xihui Liu. GoT-R1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning.arXiv preprint arXiv:2505.17022, 2025

  12. [12]

    Shape and texture recognition in large vision- language models, 2025

    Sagi Eppel, Mor Bismut, and Alona Faktor-Strugatski. Shape and texture recognition in large vision- language models, 2025

  13. [13]

    Camouflaged object detection

    Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2777–2787, 2020

  14. [14]

    Revisiting fundamentals of experience replay

    William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InProceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020. 10

  15. [15]

    Jehanzeb Mirza, Margret Keuper, and Janis Keuper

    Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Can we talk models into seeing the world differently?, 2025

  16. [16]

    LVIS: A dataset for large vocabulary instance segmentation

    Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356– 5364, 2019

  17. [17]

    Cam- ouflaged object detection with feature decomposition and edge reconstruction

    Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Cam- ouflaged object detection with feature decomposition and edge reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22046–22055, 2023

  18. [18]

    RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

    Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, and Wenqiang Zhang. RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

  19. [19]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

  20. [20]

    High-resolution iterative feedback network for camouflaged object detection

    Xiaobin Hu, Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, and Ling Shao. High-resolution iterative feedback network for camouflaged object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 881–889, 2023

  21. [21]

    SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

    Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

  22. [22]

    El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar

    Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar. C3net: Context- contrast network for camouflaged object detection.arXiv preprint arXiv:2511.12627, 2025

  23. [23]

    MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

    Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

  24. [24]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

  25. [25]

    Lisa: Reasoning segmentation via large language model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

  26. [26]

    Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto

    Trung-Nghia Le, Tam V . Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation.Journal of Computer Vision and Image Understanding, 184: 45–56, 2019

  27. [27]

    SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

    Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

  28. [28]

    GRES: Generalized referring expression segmentation

    Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23592– 23601, 2023

  29. [29]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  30. [30]

    UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

    Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, and Wanli Ouyang. UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

  31. [31]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  32. [32]

    VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

    Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025. 11

  33. [33]

    Boosting camouflaged object detection with dual-task interactive transformer

    Zhengyi Liu, Zhili Zhang, Yacheng Tan, and Wei Wu. Boosting camouflaged object detection with dual-task interactive transformer. In2022 26th International Conference on Pattern Recognition (ICPR), pages 140–146. IEEE, 2022

  34. [34]

    Understanding r1-zero-like training: A critical perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InConference on Language Modeling, 2025

  35. [35]

    PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO

    Zelin Liu, Dongdong Chen, Yusong Sun, Yuqi Hu, Huang Jie, Sicheng Dong, Xu Han, Hongmei Yi, Qiyuan Bao, and Lichi Zhang. PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO. In The Fourteenth International Conference on Learning Representations, 2026

  36. [36]

    RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

    Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

  37. [37]

    CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

    Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, and Yaowei Wang. CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

  38. [38]

    Simulta- neously localize, segment and rank the camouflaged objects

    Yunqiu Lyu, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simulta- neously localize, segment and rank the camouflaged objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  39. [39]

    STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

    Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

  40. [40]

    Yuille, and Kevin Murphy

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016

  41. [41]

    FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

    Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, and Danda Pani Paudel. FireScope: Wildfire risk prediction with a chain-of-thought oracle.arXiv preprint arXiv:2511.17171, 2025

  42. [42]

    Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

    Haiyang Mei, Ke Xu, Yunduo Zhou, Yang Wang, Haiyin Piao, Xiaopeng Wei, and Xin Yang. Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

  43. [43]

    Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

    Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, and Jing Zhang. Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

  44. [44]

    Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

    Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, and Yueting Zhuang. Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

  45. [45]

    Zoom in and out: A mixed- scale triplet network for camouflaged object detection

    Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed- scale triplet network for camouflaged object detection. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2160–2170, 2022

  46. [46]

    Manning, and Chelsea Finn

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

  47. [47]

    Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel grounding large multimodal model.The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  48. [48]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

  49. [49]

    PixelLM: Pixel reasoning with large multimodal model, 2023

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model, 2023

  50. [50]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

  51. [51]

    Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025

    Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025. 12

  52. [52]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024

  53. [53]

    Frequency-spatial entanglement learning for camouflaged object detection

    Yanguang Sun, Chunyan Xu, Jian Yang, Hanyu Xuan, and Lei Luo. Frequency-spatial entanglement learning for camouflaged object detection. InEuropean Conference on Computer Vision, pages 343–360. Springer, 2024

  54. [54]

    RL with KL penalties is better viewed as Bayesian inference

    Christopher Buckley Tomasz Korbak, Ethan Perez. RL with KL penalties is better viewed as Bayesian inference. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics

  55. [55]

    X-SAM: From segment anything to any segmentation

    Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-SAM: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026

  56. [56]

    Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  57. [57]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  58. [58]

    SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

    XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

  59. [59]

    Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024

  60. [60]

    GSV A: Generalized segmentation via multimodal large language models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSV A: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, 2024

  61. [61]

    Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model

    Saurabh Yadav, Avi Gupta, and Koteswar Rao Jerripothula. Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model. arXiv preprint arXiv:2507.20186, 2025

  62. [62]

    An improved baseline for reasoning segmentation with large language model

    Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023

  63. [63]

    Remotereasoner: Towards unifying geospatial reasoning workflow

    Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11883–11891, 2026

  64. [64]

    Understanding vs. generation: Navigating optimization dilemma in multimodal models

    Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, and Han Hu. Understanding vs. generation: Navigating optimization dilemma in multimodal models, 2026

  65. [65]

    Text-promptable propagation for referring medical image sequence segmentation

    Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao. Text-promptable propagation for referring medical image sequence segmentation. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 362–371, New York, NY, USA, 2025. Association for Computing Machinery

  66. [66]

    StAR: Segment anything reasoner

    Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, and Youngmin Ro. StAR: Segment anything reasoner. arXiv preprint arXiv:2603.14382, 2026

  67. [67]

    Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled lvlm–sam framework for reasoning segmentation in optical remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 237:217–235, 2026

  68. [68]

    Psalm: Pixelwise segmentation with large multi-modal model

    Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. In European Conference on Computer Vision, pages 74–91. Springer, 2025

  69. [69]

    Focusdiffuser: Perceiving local disparities for camouflaged object detection

    Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Zicheng Jiao, and Hong Cheng. Focusdiffuser: Perceiving local disparities for camouflaged object detection. In European Conference on Computer Vision, pages 181–198. Springer, 2024

  70. [70]

    Bilateral reference for high-resolution dichotomous image segmentation

    Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation. CAAI Artificial Intelligence Research, 3:9150038, 2024

  71. [71]

    I can find you! boundary-guided separated attention network for camouflaged object detection

    Hongwei Zhu, Peng Li, Haoran Xie, Xuefeng Yan, Dong Liang, Dapeng Chen, Mingqiang Wei, and Jing Qin. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI conference on artificial intelligence, pages 3608–3616, 2022

  72. [72]

    POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

    Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30231–30240, 2025

  73. [73]

    LENS: Learning to segment anything with unified reinforced reasoning

    Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, and Xinggang Wang. LENS: Learning to segment anything with unified reinforced reasoning, 2025

  74. [74]

    Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories

    Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories. arXiv preprint arXiv:2503.08625, 2025.

    A Extended Related Works

    Multimodal LLMs for Language-Instructed Segmentation. Early approaches to language-guide...

    over object tokens pre-detected by an MLLM. PixelLM [49] replaces it with a learned codebook of pixel embeddings for multi-target settings, and PSALM [68] adds rejection handling to it. A common limitation across this entire family is that all components are optimized with standard cross-entropy on fixed annotation sets, which can overfit to the label di...

    and Think2Seg [67] in the remote sensing domain, or PathChat-SegR1 [35] in the pathology domain. SAM3 agent [2] suggests interacting with the segmentation decoder in a multi-turn, agentic manner. SAM-Veteran [10] improves these interactions by multi-turn GRPO training, rewarding both mask and box quality across dialogue turns. RSAgent [18] improves it fur...