B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Danda Pani Paudel (INSAIT; Luc Van Gool; Mario Markov; Mohammad Mahdi; Sofia University "St. Kliment Ohridski"); Stefan Maria Ailuro

arxiv: 2605.23500 · v1 · pith:NKHY7HFYnew · submitted 2026-05-22 · 💻 cs.CV · cs.LG

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

Mario Markov , Stefan Maria Ailuro , Mohammad Mahdi , Luc Van Gool , Danda Pani Paudel (INSAIT , Sofia University "St. Kliment Ohridski") This is my paper

Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords referring segmentationgroup relative tool optimizationGRPOreinforcement learningvision-language modelssegmentation decoderbootstrapped pre-training

0 comments

The pith

B-GRTO jointly optimizes vision-language policies and segmentation decoders by reusing GRPO rollouts for the auxiliary tool objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces group relative tool optimization (GRTO) to integrate differentiable tool objectives into reinforcement learning for referring segmentation tasks. GRTO reuses rollouts from group relative policy optimization so that segmentation decoder gradients can directly complement policy rewards in a single optimization process. The bootstrapped variant B-GRTO adds a cheap pre-training stage for the tool that speeds convergence. Experiments across three challenging referring segmentation settings show clear gains over plain GRPO while matching or exceeding domain-specific state-of-the-art approaches.

Core claim

The authors establish that reusing GRPO rollouts to optimize an auxiliary differentiable tool objective produces a mathematically grounded joint optimization in which decoder gradients complement policy rewards, and that bootstrapping this process with B-GRTO yields faster convergence and superior performance in referring segmentation.

What carries the argument

Group Relative Tool Optimization (GRTO), the framework that reuses GRPO rollouts to jointly optimize the policy reward and the differentiable tool objective.

If this is right

Substantial improvements over plain GRPO across three referring segmentation settings.
Performance that matches or surpasses domain-specific state-of-the-art methods.
Faster convergence from the cheap bootstrapped pre-training stage.
A unified treatment of reinforcement learning and differentiable auxiliary objectives for reasoning-intensive segmentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rollout-reuse mechanism could be tested on other vision-language tools such as object detectors or depth estimators.
Applying B-GRTO to larger vision-language backbones would show whether the joint optimization scales beyond current model sizes.
The bootstrapping step might reduce dependence on carefully hand-crafted reward functions in new segmentation domains.

Load-bearing premise

Reusing GRPO rollouts allows decoder gradients to complement policy rewards in joint optimization without introducing instability or bias.

What would settle it

A set of training runs in which B-GRTO produces no performance gain over plain GRPO or exhibits clear instability from the combined gradients would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.23500 by Danda Pani Paudel (INSAIT, Luc Van Gool, Mario Markov, Mohammad Mahdi, Sofia University "St. Kliment Ohridski"), Stefan Maria Ailuro.

**Figure 2.** Figure 2: a) Most tool fine-tuning methods require instruction prompts to be perfectly specified [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The Bootstrapped Group Relative Tool Optimization (B-GRTO) pipeline: first, the tool [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: a) Results of ablation studies conducted on EarthReason. b) B-GRPO gains compared to [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation performance for camouflage trainings. Tracked metric is weighted F-measure. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Evaluation performance for remote sensing trainings. Tracked metric is mean between [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Evaluation performance for reasoning segmentation trainings. Tracked metric is mean [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: COD10K test set qualitative results. The red box in the image shows the ground-truth [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: ReasonSeg-X test set qualitative results. The red box in the image shows the ground-truth [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: EarthReason test set qualitative results. The red box in the image shows the ground-truth [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Code10k error study [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: EarthReason error study. G Prompts used In this section, we provide the full prompts used to query InternVL3.5-8B across all three domains. Remote sensing. We use the following template for the remote sensing domain, where "prompt" is replaced with the disentangled raw prompt as provided in the dataset. Please find "{prompt}" with bbox(es). Also provide exactly one referential noun phrase that uniquely id… view at source ↗

read the original abstract

Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GRTO reuses GRPO rollouts to jointly optimize the policy and a differentiable segmentation decoder, with B-GRTO adding cheap bootstrapped pre-training, but the abstract alone leaves the math and results uncheckable.

read the letter

The main point is that this paper introduces GRTO as a way to fold differentiable tool objectives into group relative policy optimization by reusing the same rollouts for the decoder gradients, plus B-GRTO as a bootstrapped pre-training step that supposedly speeds things up. The abstract frames this as filling a gap where RL is used for the vision-language part but the segmentation head is still trained separately with standard losses. That integration step is the actual new piece, and it is presented as distinct from prior GRPO extensions. If the reuse mechanism works without adding bias or instability, it could be a practical tweak for referring segmentation pipelines that already rely on these models. The claimed outcome is substantial gains over plain GRPO across three settings, reaching or beating specialized SOTA methods. That would matter for anyone running these systems in practice. The obvious limitation is that only the abstract is available here. There are no equations, rollout details, training curves, or ablation numbers to inspect, so it is impossible to judge whether the joint objective is actually stable or if the reported improvements hold up under standard controls. The mathematical grounding claim cannot be evaluated without seeing the derivation. This paper is aimed at researchers who already work on RL-augmented vision-language models for segmentation tasks. A reader who needs concrete ways to tighten the policy-tool loop might extract usable ideas once the full methods are visible. It is worth sending to peer review so the experiments and derivations can be checked properly; the core direction is specific enough that referees could give useful feedback even if revisions are needed.

Referee Report

2 major / 0 minor

Summary. The paper introduces group relative tool optimization (GRTO), a framework that reuses GRPO rollouts to jointly optimize a policy with differentiable auxiliary tool objectives (e.g., segmentation decoder gradients) in referring segmentation. It further derives Bootstrapped-GRTO (B-GRTO) as a cheap pre-training step for the tool that accelerates convergence. The central empirical claim is that B-GRTO yields substantial gains over plain GRPO across three challenging referring segmentation settings while matching or surpassing domain-specific state-of-the-art methods.

Significance. If the joint-optimization derivation and experimental results hold, the work would supply a concrete, mathematically grounded route for integrating reinforcement learning with differentiable tool objectives inside vision-language segmentation pipelines—an underexplored direction that could improve reasoning-intensive pixel-level tasks.

major comments (2)

[Abstract / §3] Abstract and §3 (framework description): the claim that GRTO provides a 'mathematically grounded' joint optimization rests on the reuse of GRPO rollouts to complement policy rewards with decoder gradients; without the explicit loss formulation, gradient-flow analysis, or stability argument in the full text, it is impossible to verify whether the auxiliary objective is independent or reduces to a fitted quantity by construction.
[Abstract] Abstract: the reported 'substantial improvements' and 'matching or surpassing SOTA' across three settings cannot be assessed for statistical significance, baseline fairness, or post-hoc hyper-parameter effects because no experimental protocol, error bars, or ablation tables are visible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below, providing clarifications from the manuscript and indicating where revisions will strengthen the presentation.

read point-by-point responses

Referee: [Abstract / §3] Abstract and §3 (framework description): the claim that GRTO provides a 'mathematically grounded' joint optimization rests on the reuse of GRPO rollouts to complement policy rewards with decoder gradients; without the explicit loss formulation, gradient-flow analysis, or stability argument in the full text, it is impossible to verify whether the auxiliary objective is independent or reduces to a fitted quantity by construction.

Authors: We appreciate the referee raising this verification concern. Section 3 of the manuscript explicitly formulates the GRTO objective as L_GRTO = L_GRPO + λ L_aux, where L_aux is the standard segmentation loss (e.g., Dice + BCE) computed on decoder outputs from the identical group rollouts used for the policy gradient. The auxiliary gradients update the decoder parameters independently via backpropagation through the differentiable tool; they are not derived from or fitted to the policy reward. A gradient-flow diagram and short stability note (bounded variance from group-relative baselines) appear in the appendix. We will expand the main-text derivation with these elements for clarity. revision: partial
Referee: [Abstract] Abstract: the reported 'substantial improvements' and 'matching or surpassing SOTA' across three settings cannot be assessed for statistical significance, baseline fairness, or post-hoc hyper-parameter effects because no experimental protocol, error bars, or ablation tables are visible.

Authors: The full manuscript details the experimental protocol (datasets, training hyperparameters, baseline implementations, and evaluation metrics) in Section 4. Tables 1–3 report means and standard deviations computed over three random seeds; Section 4.3 presents ablation tables isolating the bootstrapping and joint-optimization components. We will add explicit cross-references to these sections and tables directly in the abstract and introduction to improve visibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmarks

full rationale

The abstract and description introduce GRTO as a new framework reusing GRPO rollouts for joint optimization of policy and differentiable tool objectives, then derive B-GRTO as a bootstrapping pre-training step. No equations, self-citations, or fitted quantities are presented that reduce any claimed prediction or result to the inputs by construction. GRPO is treated as an external base method; the extension to auxiliary objectives and bootstrapping is described as independent. The performance claims rest on experimental results across three settings rather than any definitional or self-referential derivation. This matches the default case of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the approach relies on prior GRPO without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5775 in / 1087 out tokens · 45898 ms · 2026-05-25T04:25:26.197215+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 21 canonical work pages · 10 internal anchors

[1]

On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, and Tal Svoray. On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

2026
[2]

SAM 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

2025
[3]

SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding, Deyi Ji, Cheng Chen, Qi Zhu, Chunyan Xu, Papa Mao, and Ying Zang. SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

2025
[4]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

2024
[6]

Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

Allie Luo Chen Jiang and Martin Jagersand. Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

work page arXiv 2024
[7]

DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

2025
[8]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

work page arXiv 2025
[10]

SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation

Tianyuan Du, Haopeng Li, Zhen Fan, Jiarui Zhang, Panwang Pan, and Yang Zhang. SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[11]

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, and Xihui Liu. GoT-R1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning.arXiv preprint arXiv:2505.17022, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Shape and texture recognition in large vision- language models, 2025

Sagi Eppel, Mor Bismut, and Alona Faktor-Strugatski. Shape and texture recognition in large vision- language models, 2025

2025
[13]

Camouflaged object detection

Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2777–2787, 2020

2020
[14]

Revisiting fundamentals of experience replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InProceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020. 10

2020
[15]

Jehanzeb Mirza, Margret Keuper, and Janis Keuper

Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Can we talk models into seeing the world differently?, 2025

2025
[16]

LVIS: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356– 5364, 2019

2019
[17]

Cam- ouflaged object detection with feature decomposition and edge reconstruction

Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Cam- ouflaged object detection with feature decomposition and edge reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22046–22055, 2023

2023
[18]

RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, and Wenqiang Zhang. RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

2025
[19]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

2022
[20]

High-resolution iterative feedback network for camouflaged object detection

Xiaobin Hu, Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, and Ling Shao. High-resolution iterative feedback network for camouflaged object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 881–889, 2023

2023
[21]

SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

2026
[22]

El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar

Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar. C3net: Context- contrast network for camouflaged object detection.arXiv preprint arXiv:2511.12627, 2025

work page arXiv 2025
[23]

MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[24]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

2024
[26]

Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto

Trung-Nghia Le, Tam V . Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation.Journal of Computer Vision and Image Understanding, 184: 45–56, 2019

2019
[27]

SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

2025
[28]

GRES: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23592– 23601, 2023

2023
[29]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, and Wanli Ouyang. UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

2026
[31]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025. 11

work page arXiv 2025
[33]

Boosting camouflaged object detection with dual-task interactive transformer

Zhengyi Liu, Zhili Zhang, Yacheng Tan, and Wei Wu. Boosting camouflaged object detection with dual-task interactive transformer. In2022 26th International Conference on Pattern Recognition (ICPR), pages 140–146. IEEE, 2022

2022
[34]

Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InConference on Language Modeling, 2025

2025
[35]

PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO

Zelin Liu, Dongdong Chen, Yusong Sun, Yuqi Hu, Huang Jie, Sicheng Dong, Xu Han, Hongmei Yi, Qiyuan Bao, and Lichi Zhang. PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO. In The Fourteenth International Conference on Learning Representations, 2026

2026
[36]

RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

2025
[37]

CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, and Yaowei Wang. CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

work page arXiv 2025
[38]

Simulta- neously localize, segment and rank the camouflaged objects

Yunqiu Lyu, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simulta- neously localize, segment and rank the camouflaged objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021
[39]

STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

2025
[40]

Yuille, and Kevin Murphy

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016

2016
[41]

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, and Danda Pani Paudel. FireScope: Wildfire risk prediction with a chain-of-thought oracle.arXiv preprint arXiv:2511.17171, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

Haiyang Mei, Ke Xu, Yunduo Zhou, Yang Wang, Haiyin Piao, Xiaopeng Wei, and Xin Yang. Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

2023
[43]

Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, and Jing Zhang. Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

2025
[44]

Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, and Yueting Zhuang. Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

2025
[45]

Zoom in and out: A mixed- scale triplet network for camouflaged object detection

Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed- scale triplet network for camouflaged object detection. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2160–2170, 2022

2022
[46]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

2024
[47]

Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel grounding large multimodal model.The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[48]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

PixelLM: Pixel reasoning with large multimodal model, 2023

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model, 2023

2023
[50]

U-Net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

2015
[51]

Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025. 12

work page arXiv 2025
[52]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024

2024
[53]

Frequency-spatial entanglement learning for camouflaged object detection

Yanguang Sun, Chunyan Xu, Jian Yang, Hanyu Xuan, and Lei Luo. Frequency-spatial entanglement learning for camouflaged object detection. InEuropean Conference on Computer Vision, pages 343–360. Springer, 2024

2024
[54]

RL with KL penalties is better viewed as Bayesian inference

Christopher Buckley Tomasz Korbak, Ethan Perez. RL with KL penalties is better viewed as Bayesian inference. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics

2022
[55]

X-SAM: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-SAM: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026

2026
[56]

Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023
[57]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[58]

SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

work page arXiv 2024
[59]

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[60]

GSV A: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSV A: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, 2024

2024
[61]

Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model.arXiv preprint arXiv:2507.20186, 2025

Saurabh Yadav, Avi Gupta, and Koteswar Rao Jerripothula. Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model.arXiv preprint arXiv:2507.20186, 2025

work page arXiv 2025
[62]

An improved baseline for reasoning segmentation with large language model

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023

work page arXiv 2023
[63]

Remotereasoner: Towards unifying geospatial reasoning workflow

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow. InProceedings of the AAAI Conference on Artificial Intelligence, pages 11883–11891, 2026

2026
[64]

Understanding vs

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, and Han Hu. Understanding vs. generation: Navigating optimization dilemma in multimodal models, 2026

2026
[65]

Text-promptable propagation for referring medical image sequence segmentation

Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao. Text-promptable propagation for referring medical image sequence segmentation. In Proceedings of the 33rd ACM International Conference on Multimedia, page 362–371, New York, NY , USA, 2025. Association for Computing Machinery

2025
[66]

StAR: Segment anything reasoner.arXiv preprint arXiv:2603.14382, 2026

Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, and Youngmin Ro. StAR: Segment anything reasoner.arXiv preprint arXiv:2603.14382, 2026

work page arXiv 2026
[67]

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled lvlm–sam framework for reasoning segmentation in optical remote sensing.ISPRS Journal of Photogrammetry and Remote Sensing, 237:217–235, 2026

2026
[68]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2025

2025
[69]

Focusdiffuser: Perceiving local disparities for camouflaged object detection

Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Zicheng Jiao, and Hong Cheng. Focusdiffuser: Perceiving local disparities for camouflaged object detection. InEuropean Conference on Computer Vision, pages 181–198. Springer, 2024. 13

2024
[70]

Bilateral reference for high-resolution dichotomous image segmentation.CAAI Artificial Intelligence Research, 3:9150038, 2024

Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation.CAAI Artificial Intelligence Research, 3:9150038, 2024

2024
[71]

I can find you! boundary-guided separated attention network for camouflaged object detection

Hongwei Zhu, Peng Li, Haoran Xie, Xuefeng Yan, Dong Liang, Dapeng Chen, Mingqiang Wei, and Jing Qin. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI conference on artificial intelligence, pages 3608–3616, 2022

2022
[72]

POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30231–30240, 2025

2025
[73]

LENS: Learning to segment anything with unified reinforced reasoning, 2025

Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, and Xinggang Wang. LENS: Learning to segment anything with unified reinforced reasoning, 2025

2025
[74]

Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories.arXiv preprint arXiv:2503.08625, 2025

Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories.arXiv preprint arXiv:2503.08625, 2025. A Extended Related Works Multimodal LLMs for Language-Instructed Segmentation. Early approaches to language-guide...

work page arXiv 2025
[75]

PixelLM [49] replaces it with a learned code-book of pixel embeddings for multi-target settings, and PSALM [68] adds rejection handling to it

over object tokens pre-detected by a MLLM. PixelLM [49] replaces it with a learned code-book of pixel embeddings for multi-target settings, and PSALM [68] adds rejection handling to it. A common limitation across this entire family is that all components are optimized with standard cross- entropy on fixed annotation sets, which can overfit to the label di...
[76]

storage bags

and Think2Seg [67] in the remote sensing domain, or PathChat-SegR1[ 35] in the pathology domain. SAM3 agent [2] suggests interacting with the segmentation decoder in a multi-turn, agentic manner. SAM-Veteran [10] improves these interactions by multi-turn GRPO training, rewarding both mask and box quality across dialogue turns. RSAgent [18] improves it fur...

1974

[1] [1]

On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

Roni Blushtein-Livnon, Osher Rafaeli, David Ioffe, Amir Boger, Karen Sandberg Esquenazi, and Tal Svoray. On the effectiveness of textual prompting with lightweight fine-tuning for sam3 remote sensing segmentation, 2026

2026

[2] [2]

SAM 3: Segment anything with concepts, 2025

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

2025

[3] [3]

SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

Tianrun Chen, Runlong Cao, Xinda Yu, Lanyun Zhu, Chaotao Ding, Deyi Ji, Cheng Chen, Qi Zhu, Chunyan Xu, Papa Mao, and Ying Zang. SAM3-Adapter: Efficient adaptation of segment anything 3 for camouflage object segmentation, shadow detection, and medical image segmentation, 2025

2025

[4] [4]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. SAM4MLLM: Enhance multi-modal large language model for referring expression segmentation, 2024

2024

[6] [6]

Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

Allie Luo Chen Jiang and Martin Jagersand. Robot manipulation in salient vision through referring image segmentation and geometric constraints.arXiv preprint arXiv:2409.11518, 2024

work page arXiv 2024

[7] [7]

DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

DeepSeek-AI. DeepSeek-V3.2: Pushing the frontier of open large language models, 2025

2025

[8] [8]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey.arXiv preprint arXiv:2508.00265, 2025

work page arXiv 2025

[10] [10]

SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation

Tianyuan Du, Haopeng Li, Zhen Fan, Jiarui Zhang, Panwang Pan, and Yang Zhang. SAM-veteran: An MLLM-based human-like SAM agent for reasoning segmentation. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[11] [11]

GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning

Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, and Xihui Liu. GoT-R1: Unleashing reasoning capability of mllm for visual generation with reinforcement learning.arXiv preprint arXiv:2505.17022, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Shape and texture recognition in large vision- language models, 2025

Sagi Eppel, Mor Bismut, and Alona Faktor-Strugatski. Shape and texture recognition in large vision- language models, 2025

2025

[13] [13]

Camouflaged object detection

Deng-Ping Fan, Ge-Peng Ji, Guolei Sun, Ming-Ming Cheng, Jianbing Shen, and Ling Shao. Camouflaged object detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2777–2787, 2020

2020

[14] [14]

Revisiting fundamentals of experience replay

William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, and Will Dabney. Revisiting fundamentals of experience replay. InProceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020. 10

2020

[15] [15]

Jehanzeb Mirza, Margret Keuper, and Janis Keuper

Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, and Janis Keuper. Can we talk models into seeing the world differently?, 2025

2025

[16] [16]

LVIS: A dataset for large vocabulary instance segmentation

Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356– 5364, 2019

2019

[17] [17]

Cam- ouflaged object detection with feature decomposition and edge reconstruction

Chunming He, Kai Li, Yachao Zhang, Longxiang Tang, Yulun Zhang, Zhenhua Guo, and Xiu Li. Cam- ouflaged object detection with feature decomposition and edge reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22046–22055, 2023

2023

[18] [18]

RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

Xingqi He, Yujie Zhang, Shuyong Gao, Wenjie Li, Lingyi Hong, Mingxi Chen, Kaixun Jiang, Jiyuan Fu, and Wenqiang Zhang. RSAgent: Learning to reason and act for text-guided segmentation via multi-turn tool invocations, 2025

2025

[19] [19]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022

2022

[20] [20]

High-resolution iterative feedback network for camouflaged object detection

Xiaobin Hu, Shuo Wang, Xuebin Qin, Hang Dai, Wenqi Ren, Donghao Luo, Ying Tai, and Ling Shao. High-resolution iterative feedback network for camouflaged object detection. InProceedings of the AAAI Conference on Artificial Intelligence, pages 881–889, 2023

2023

[21] [21]

SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. SAM-R1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning, 2026

2026

[22] [22]

El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar

Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, and Saeed Anwar. C3net: Context- contrast network for camouflaged object detection.arXiv preprint arXiv:2511.12627, 2025

work page arXiv 2025

[23] [23]

MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation

Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, and Daeshik Kim. MMR: A large-scale benchmark dataset for multi-target and multi-granularity reasoning segmentation. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[24] [24]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Lisa: Reasoning segmentation via large language model

Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

2024

[26] [26]

Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto

Trung-Nghia Le, Tam V . Nguyen, Zhongliang Nie, Minh-Triet Tran, and Akihiro Sugimoto. Anabranch network for camouflaged object segmentation.Journal of Computer Vision and Image Understanding, 184: 45–56, 2019

2019

[27] [27]

SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial pixel reasoning via large language model, 2025

2025

[28] [28]

GRES: Generalized referring expression segmentation

Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23592– 23601, 2023

2023

[29] [29]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, and Wanli Ouyang. UniGRPO: Unified policy optimization for reasoning-driven visual generation, 2026

2026

[31] [31]

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning- chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. VisionReasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025. 11

work page arXiv 2025

[33] [33]

Boosting camouflaged object detection with dual-task interactive transformer

Zhengyi Liu, Zhili Zhang, Yacheng Tan, and Wei Wu. Boosting camouflaged object detection with dual-task interactive transformer. In2022 26th International Conference on Pattern Recognition (ICPR), pages 140–146. IEEE, 2022

2022

[34] [34]

Understanding r1-zero-like training: A critical perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. InConference on Language Modeling, 2025

2025

[35] [35]

PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO

Zelin Liu, Dongdong Chen, Yusong Sun, Yuqi Hu, Huang Jie, Sicheng Dong, Xu Han, Hongmei Yi, Qiyuan Bao, and Lichi Zhang. PathChat-SegR1: Reasoning segmentation in pathology via SO-GRPO. In The Fourteenth International Conference on Learning Representations, 2026

2026

[36] [36]

RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, and Wenbo Zhu. RSVP: Reasoning segmentation via visual prompting and multi-modal chain-of-thought, 2025

2025

[37] [37]

CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, and Yaowei Wang. CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation.arXiv preprint arXiv:2510.11173, 2025

work page arXiv 2025

[38] [38]

Simulta- neously localize, segment and rank the camouflaged objects

Yunqiu Lyu, Jing Zhang, Yuchao Dai, Aixuan Li, Bowen Liu, Nick Barnes, and Deng-Ping Fan. Simulta- neously localize, segment and rank the camouflaged objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

2021

[39] [39]

STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, and Feng Zhao. STAGE: Stable and generalizable grpo for autoregressive image generation, 2025

2025

[40] [40]

Yuille, and Kevin Murphy

Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016

2016

[41] [41]

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, and Danda Pani Paudel. FireScope: Wildfire risk prediction with a chain-of-thought oracle.arXiv preprint arXiv:2511.17171, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [42]

Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

Haiyang Mei, Ke Xu, Yunduo Zhou, Yang Wang, Haiyin Piao, Xiaopeng Wei, and Xin Yang. Camouflaged object segmentation with omni perception.International Journal of Computer Vision, 131(11):3019–3034, 2023

2023

[43] [43]

Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, and Jing Zhang. Unigeoseg: Towards unified open-world segmentation for geospatial scenes, 2025

2025

[44] [44]

Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, and Yueting Zhuang. Janus-Pro-R1: Advancing collaborative visual comprehension and generation via reinforcement learning, 2025

2025

[45] [45]

Zoom in and out: A mixed- scale triplet network for camouflaged object detection

Youwei Pang, Xiaoqi Zhao, Tian-Zhu Xiang, Lihe Zhang, and Huchuan Lu. Zoom in and out: A mixed- scale triplet network for camouflaged object detection. InProceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 2160–2170, 2022

2022

[46] [46]

Manning, and Chelsea Finn

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2024

2024

[47] [47]

Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel grounding large multimodal model.The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[48] [48]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

PixelLM: Pixel reasoning with large multimodal model, 2023

Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel reasoning with large multimodal model, 2023

2023

[50] [50]

U-Net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. InMedical Image Computing and Computer-Assisted Intervention (MICCAI), 2015

2015

[51] [51]

Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad S Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing.arXiv preprint arXiv:2501.13925, 2025. 12

work page arXiv 2025

[52] [52]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y .K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models, 2024

2024

[53] [53]

Frequency-spatial entanglement learning for camouflaged object detection

Yanguang Sun, Chunyan Xu, Jian Yang, Hanyu Xuan, and Lei Luo. Frequency-spatial entanglement learning for camouflaged object detection. InEuropean Conference on Computer Vision, pages 343–360. Springer, 2024

2024

[54] [54]

RL with KL penalties is better viewed as Bayesian inference

Christopher Buckley Tomasz Korbak, Ethan Perez. RL with KL penalties is better viewed as Bayesian inference. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 1083–1091, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics

2022

[55] [55]

X-SAM: From segment anything to any segmentation

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, and Xiaodan Liang. X-SAM: From segment anything to any segmentation. InProceedings of the AAAI Conference on Artificial Intelligence, pages 26187–26196, 2026

2026

[56] [56]

Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks

Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and Furu Wei. Image as a foreign lan- guage: BEiT pretraining for vision and vision-language tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023

[57] [57]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[58] [58]

SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

XuDong Wang, Shaolun Zhang, Shufan Li, Konstantinos Kallidromitis, Kehan Li, Yusuke Kato, Kazuki Kozuka, and Trevor Darrell. SegLLM: Multi-round reasoning segmentation.arXiv preprint arXiv:2410.18923, 2024

work page arXiv 2024

[59] [59]

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation.arXiv preprint arXiv:2410.13848, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

GSV A: Generalized segmentation via multimodal large language models

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSV A: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869, 2024

2024

[61] [61]

Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model.arXiv preprint arXiv:2507.20186, 2025

Saurabh Yadav, Avi Gupta, and Koteswar Rao Jerripothula. Samwave: Wavelet-driven feature enrichment for effective adaptation of segment anything model.arXiv preprint arXiv:2507.20186, 2025

work page arXiv 2025

[62] [62]

An improved baseline for reasoning segmentation with large language model

Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model.arXiv preprint arXiv:2312.17240, 2023

work page arXiv 2023

[63] [63]

Remotereasoner: Towards unifying geospatial reasoning workflow

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow. InProceedings of the AAAI Conference on Artificial Intelligence, pages 11883–11891, 2026

2026

[64] [64]

Understanding vs

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, and Han Hu. Understanding vs. generation: Navigating optimization dilemma in multimodal models, 2026

2026

[65] [65]

Text-promptable propagation for referring medical image sequence segmentation

Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao. Text-promptable propagation for referring medical image sequence segmentation. In Proceedings of the 33rd ACM International Conference on Multimedia, page 362–371, New York, NY , USA, 2025. Association for Computing Machinery

2025

[66] [66]

StAR: Segment anything reasoner.arXiv preprint arXiv:2603.14382, 2026

Seokju Yun, Dongheon Lee, Noori Bae, Jaesung Jun, Chanseul Cho, and Youngmin Ro. StAR: Segment anything reasoner.arXiv preprint arXiv:2603.14382, 2026

work page arXiv 2026

[67] [67]

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, and Jimin Liang. Bridging semantics and geometry: A decoupled lvlm–sam framework for reasoning segmentation in optical remote sensing.ISPRS Journal of Photogrammetry and Remote Sensing, 237:217–235, 2026

2026

[68] [68]

Psalm: Pixelwise segmentation with large multi-modal model

Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. Psalm: Pixelwise segmentation with large multi-modal model. InEuropean Conference on Computer Vision, pages 74–91. Springer, 2025

2025

[69] [69]

Focusdiffuser: Perceiving local disparities for camouflaged object detection

Jianwei Zhao, Xin Li, Fan Yang, Qiang Zhai, Ao Luo, Zicheng Jiao, and Hong Cheng. Focusdiffuser: Perceiving local disparities for camouflaged object detection. InEuropean Conference on Computer Vision, pages 181–198. Springer, 2024. 13

2024

[70] [70]

Bilateral reference for high-resolution dichotomous image segmentation.CAAI Artificial Intelligence Research, 3:9150038, 2024

Peng Zheng, Dehong Gao, Deng-Ping Fan, Li Liu, Jorma Laaksonen, Wanli Ouyang, and Nicu Sebe. Bilateral reference for high-resolution dichotomous image segmentation.CAAI Artificial Intelligence Research, 3:9150038, 2024

2024

[71] [71]

I can find you! boundary-guided separated attention network for camouflaged object detection

Hongwei Zhu, Peng Li, Haoran Xie, Xuefeng Yan, Dong Liang, Dapeng Chen, Mingqiang Wei, and Jing Qin. I can find you! boundary-guided separated attention network for camouflaged object detection. In Proceedings of the AAAI conference on artificial intelligence, pages 3608–3616, 2022

2022

[72] [72]

POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. POPEN: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30231–30240, 2025

2025

[73] [73]

LENS: Learning to segment anything with unified reinforced reasoning, 2025

Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, and Xinggang Wang. LENS: Learning to segment anything with unified reinforced reasoning, 2025

2025

[74] [74]

Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories.arXiv preprint arXiv:2503.08625, 2025

Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, and Chunhua Shen. Segagent: Exploring pixel understanding capabilities in mllms by imitating human annotator trajectories.arXiv preprint arXiv:2503.08625, 2025. A Extended Related Works Multimodal LLMs for Language-Instructed Segmentation. Early approaches to language-guide...

work page arXiv 2025

[75] [75]

PixelLM [49] replaces it with a learned code-book of pixel embeddings for multi-target settings, and PSALM [68] adds rejection handling to it

over object tokens pre-detected by a MLLM. PixelLM [49] replaces it with a learned code-book of pixel embeddings for multi-target settings, and PSALM [68] adds rejection handling to it. A common limitation across this entire family is that all components are optimized with standard cross- entropy on fixed annotation sets, which can overfit to the label di...

[76] [76]

storage bags

and Think2Seg [67] in the remote sensing domain, or PathChat-SegR1[ 35] in the pathology domain. SAM3 agent [2] suggests interacting with the segmentation decoder in a multi-turn, agentic manner. SAM-Veteran [10] improves these interactions by multi-turn GRPO training, rewarding both mask and box quality across dialogue turns. RSAgent [18] improves it fur...

1974