Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Bei Yu; Bohao Peng; Fanbin Lu; Jiaya Jia; Yuqi Liu; Zhisheng Zhong; Zihao Yue

arxiv: 2503.06520 · v3 · pith:DE5HB5NEnew · submitted 2025-03-09 · 💻 cs.CV · cs.MM

Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia

Pith reviewed 2026-05-16 12:27 UTC · model grok-4.3

classification 💻 cs.CV cs.MM

keywords reasoning segmentation · reinforcement learning · chain-of-thought reasoning · zero-shot generalization · image segmentation · GRPO

The pith

Reinforcement learning with format and accuracy rewards enables explicit reasoning chains to guide image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to show that image segmentation models can learn to produce their own step-by-step reasoning about user intentions through reinforcement learning, without needing supervised examples of reasoning. Traditional methods depend on fine-tuning with fixed labels and short descriptions, which limits how well they handle new kinds of requests. Seg-Zero separates the task into a reasoning component that creates chains of thought and positional hints, and a segmentation component that turns those hints into masks. By rewarding correct output formats and accurate masks during reinforcement learning, the system develops reasoning abilities that help it generalize to unseen domains. A sympathetic reader would care because this points to a way of building more adaptable vision systems that can explain their choices.

Core claim

Seg-Zero introduces a decoupled architecture where a reasoning model interprets user intentions, generates explicit reasoning chains, and creates positional prompts for a separate segmentation model to produce pixel-level masks. The system is trained exclusively with reinforcement learning using the GRPO algorithm and a reward mechanism that combines format rewards for proper output structure with accuracy rewards for correct segmentation results, without any explicit reasoning supervision data.
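
A format-plus-accuracy reward of the kind described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the `<think>`/`<answer>` tag layout, the JSON box format `[x1, y1, x2, y2]`, the 0.5 IoU threshold, and every function name are assumptions introduced here. The accuracy term scores a predicted box against a ground-truth box as a simple stand-in, since the source only says the accuracy reward targets correct segmentation results.

```python
import json
import re

# Hypothetical output layout: reasoning in <think>, a JSON box in <answer>.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(output: str) -> float:
    """Return 1.0 if the output follows the assumed tag layout, else 0.0."""
    return 1.0 if PATTERN.match(output.strip()) else 0.0


def box_iou(a, b) -> float:
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_reward(output: str, gt_box, threshold: float = 0.5) -> float:
    """Return 1.0 if the parsed box overlaps the ground truth enough (assumed rule)."""
    match = PATTERN.match(output.strip())
    if not match:
        return 0.0
    try:
        pred = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0
    is_box = (
        isinstance(pred, list)
        and len(pred) == 4
        and all(isinstance(v, (int, float)) for v in pred)
    )
    if not is_box:
        return 0.0
    return 1.0 if box_iou(pred, gt_box) >= threshold else 0.0


def total_reward(output: str, gt_box) -> float:
    """Unweighted sum of the two terms; the real weighting is not given in the source."""
    return format_reward(output) + accuracy_reward(output, gt_box)
```

Under these assumptions, a well-formed output with a correct box scores 2.0, a well-formed output with a wrong box scores 1.0, and an output that ignores the layout scores 0.0.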

What carries the argument

The decoupled reasoning model and segmentation model pair, trained via GRPO reinforcement learning driven by format and accuracy rewards.
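
GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization, assuming scalar rewards such as a format-plus-accuracy sum. The function name, the use of the population standard deviation, and the `eps` guard are illustrative choices made here; the clipped policy update and KL penalty that GRPO applies on top are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one query.
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

Responses that beat the group mean receive positive advantages and are reinforced; when all samples score the same, every advantage is zero and the group contributes no learning signal.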

If this is right

The model produces visible chain-of-thought reasoning during inference.
Seg-Zero-7B reaches a zero-shot score of 57.5 on the ReasonSeg benchmark, which the abstract reports as 18% above the prior LISA-7B.
Generalization to out-of-domain segmentation tasks improves without additional supervised training.
Reasoning capabilities emerge at test time from the reinforcement training process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar reward-based training could be applied to develop reasoning in other computer vision tasks such as object detection or image captioning.
The separation of reasoning and execution modules might allow easier debugging of where errors occur in complex queries.
Further scaling the reinforcement process could lead to more sophisticated reasoning chains for highly ambiguous segmentation requests.

Load-bearing premise

The combination of format and accuracy rewards through reinforcement learning will consistently generate useful and generalizable reasoning rather than superficial patterns tuned to the training examples.

What would settle it

Run the model on a new set of segmentation queries involving reasoning steps absent from the training distribution, and measure whether the explicit reasoning chains correlate with higher segmentation accuracy or if performance drops to baseline levels.
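
One way to operationalize this check is to compare per-query mask IoU with and without the reasoning chain on held-out queries. The sketch below assumes masks are represented as sets of `(row, col)` pixel coordinates and that the result lists are paired by query. The representation and all function names are hypothetical, and averaging per-query IoU is one choice of metric among several.

```python
def mask_iou(pred: set, gt: set) -> float:
    """IoU of two masks given as sets of (row, col) pixels."""
    union = pred | gt
    # Convention chosen here: two empty masks count as a perfect match.
    return len(pred & gt) / len(union) if union else 1.0


def mean_iou(preds, gts) -> float:
    """Average of per-query IoU scores."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def reasoning_gain(with_reasoning, without_reasoning, gts) -> float:
    """Mean IoU with reasoning chains minus mean IoU without them."""
    return mean_iou(with_reasoning, gts) - mean_iou(without_reasoning, gts)


# Toy example with a single query and a two-pixel ground-truth mask.
gts = [{(0, 0), (0, 1)}]
gain = reasoning_gain([{(0, 0), (0, 1)}], [{(0, 0)}], gts)
```

A gain near zero on queries whose reasoning steps are absent from the training distribution would suggest the chains are not contributing to mask quality.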

Original abstract

Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting its out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precious pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Seg-Zero gets a clean numerical lift on ReasonSeg by training a decoupled reasoning model with GRPO and format-plus-accuracy rewards, but the absence of ablations leaves open whether the chain-of-thought is doing real work or just fitting the reward.

The paper's core move is to split the task into a reasoning model that produces explicit chains and positional prompts and a separate segmentation model that consumes them. It trains the reasoning part end-to-end with GRPO using only a format reward and an accuracy reward on the final mask, with no supervised reasoning traces at all. That setup delivers the reported 57.5 zero-shot score on ReasonSeg, which the abstract states is 18% above LISA-7B. The numerical result is the clearest takeaway, provided the numbers hold up under scrutiny.
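
The margin over LISA-7B can be read either as a relative improvement or as an absolute gap in points, and the two readings imply different baseline scores. The arithmetic below works out what each reading would imply. Neither baseline value is taken from the paper; both are derived only from the 57.5 score and the 18 figure quoted above, and the variable names are illustrative.

```python
SEG_ZERO_SCORE = 57.5  # reported zero-shot ReasonSeg score
MARGIN = 18.0          # the margin quoted in the abstract

# Reading 1: relative improvement, score = baseline * (1 + 18/100).
baseline_if_relative = SEG_ZERO_SCORE / (1 + MARGIN / 100)

# Reading 2: absolute improvement of 18 points.
baseline_if_absolute = SEG_ZERO_SCORE - MARGIN

print(f"relative reading: baseline about {baseline_if_relative:.1f}")
print(f"absolute reading: baseline = {baseline_if_absolute:.1f}")
```

The roughly nine-point difference between the two implied baselines is why the exact baseline implementation matters when checking the headline claim.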

Referee Report

3 major / 1 minor

Summary. The paper proposes Seg-Zero, a decoupled architecture with a reasoning model that generates explicit chain-of-thought and positional prompts, followed by a segmentation model that produces pixel-level masks. The system is trained exclusively via GRPO reinforcement learning using only format and accuracy rewards, with no explicit reasoning supervision or labeled reasoning data. It claims robust zero-shot generalization and reports that Seg-Zero-7B achieves 57.5 on the ReasonSeg benchmark, an 18% improvement over prior LISA-7B.

Significance. If the central performance claim and the emergence of useful reasoning chains are substantiated, the work would demonstrate that simple reward signals in a decoupled RL setup can induce generalizable chain-of-thought for reasoning segmentation without supervised reasoning data. This would be a notable data-efficient advance for zero-shot multimodal tasks. The public code release strengthens reproducibility.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.
[Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed. Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.
[Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.

minor comments (1)

[Abstract] Abstract: 'precious pixel-level masks' is likely a typo for 'precise pixel-level masks'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address each of the major comments below and outline the revisions we will make to the manuscript.

Referee: [Abstract / Experiments] Abstract and Experiments section: The headline result of 57.5 zero-shot on ReasonSeg (18% above LISA-7B) is stated without any description of the training data mixture, exact baseline implementations, number of runs, or statistical significance testing. This absence prevents verification of the central empirical claim.

Authors: We thank the referee for highlighting this. To address the concern, we will expand the Experiments section to include a comprehensive description of the training data mixture, details on how baselines were implemented (including hyperparameters and code references), results from multiple runs with standard deviations, and appropriate statistical tests. This will allow for better verification of the central claims.
revision: yes

Referee: [Method] Method section: The format-plus-accuracy reward is presented as sufficient to induce useful reasoning chains, yet no ablation removes or randomizes the reasoning tokens while keeping the segmentation model fixed. Without such a control, it remains unclear whether the accuracy signal shapes logically grounded prompts or merely optimizes superficial output patterns that score well on the training distribution.

Authors: We appreciate this point and acknowledge the value of such an ablation. We will include an ablation study in the revised version where we replace the reasoning model's output with randomized or fixed positional prompts (keeping the segmentation model unchanged) to demonstrate that the learned reasoning chains contribute to the performance gains beyond superficial patterns.
revision: yes

Referee: [Experiments] Experiments section: No qualitative examples of generated reasoning chains are shown, nor is there a comparison of mask quality when the reasoning model is replaced by a non-reasoning baseline. These omissions leave the claim of emergent test-time reasoning unsupported by direct evidence.

Authors: We will add qualitative visualizations of the reasoning chains generated by Seg-Zero in the Experiments section to provide direct evidence of the emergent reasoning. Furthermore, we will include a comparison experiment replacing the reasoning model with a non-reasoning baseline (e.g., direct instruction without chain-of-thought) and report the resulting mask quality metrics to support the claim.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark result independent of training inputs

The paper reports a measured zero-shot score of 57.5 on the external ReasonSeg benchmark after GRPO training with format+accuracy rewards. This performance number is obtained by direct evaluation on held-out data and does not reduce to any fitted parameter, self-defined quantity, or self-citation chain. No equations, uniqueness theorems, or ansatzes are presented that would make the central claim equivalent to its inputs by construction. The emergence of reasoning chains is an empirical observation rather than a deductive step that loops back on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard reinforcement-learning assumptions about reward shaping and policy optimization; no new physical or mathematical entities are introduced beyond the named GRPO algorithm and the reward design.

free parameters (1)

reward weights for format and accuracy
The abstract describes a 'sophisticated reward mechanism' that combines format and accuracy terms; the relative weighting is not specified and must be chosen or tuned.
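
To see why the weighting counts as a genuine free parameter, consider a toy comparison in which two candidate outputs swap rank as the weights change. Everything here is hypothetical: the linear combination, the weight names, and the candidate scores are illustrative assumptions, not values from the paper.

```python
def combined_reward(format_r, accuracy_r, w_format=1.0, w_accuracy=1.0):
    """Weighted sum of the two reward terms (assumed linear form)."""
    return w_format * format_r + w_accuracy * accuracy_r


# Two hypothetical candidates as (format reward, accuracy reward).
well_formatted_miss = (1.0, 0.2)
malformed_hit = (0.0, 0.9)


def preferred(w_format, w_accuracy):
    """Name the candidate with the higher combined reward under given weights."""
    a = combined_reward(*well_formatted_miss, w_format, w_accuracy)
    b = combined_reward(*malformed_hit, w_format, w_accuracy)
    return "well_formatted_miss" if a > b else "malformed_hit"


equal_weights_choice = preferred(1.0, 1.0)
accuracy_heavy_choice = preferred(0.5, 2.0)
```

With equal weights the well-formatted but inaccurate output wins; with accuracy weighted more heavily the ranking flips, so the policy would be pushed toward different behavior.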

axioms (1)

Domain assumption: Reinforcement learning with format and accuracy rewards can produce explicit, generalizable chain-of-thought reasoning without any supervised reasoning examples.
Invoked when the paper states that training occurs 'exclusively via reinforcement learning with GRPO and without explicit reasoning data'.

pith-pipeline@v0.9.0 · 5521 in / 1354 out tokens · 51293 ms · 2026-05-16T12:27:04.945088+00:00 · methodology

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ConceptSeg-R1 uses Meta-GRPO meta-RL to learn transferable rules from visual demonstrations and apply them via concept translation for generalized concept segmentation across CI, CD, and CR levels.
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
cs.CV 2026-05 unverdicted novelty 7.0

SetCon achieves state-of-the-art open-ended referring segmentation by using LVLM-generated set-level concepts for joint mask decoding, with gains increasing for multi-target cases on image and video benchmarks.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
Vision Harnessing Agent for Open Ad-hoc Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification
cs.CV 2026-05 unverdicted novelty 7.0

IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV 2026-05 unverdicted novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
Topo-R1: Detecting Topological Anomalies via Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects
cs.CV 2026-01 unverdicted novelty 7.0

LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
cs.CV 2026-01 conditional novelty 7.0

IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
SAM 3: Segment Anything with Concepts
cs.CV 2025-11 unverdicted novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.
Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
cs.CV 2025-06 unverdicted novelty 7.0

Proposes CSR task and HalluSegBench using visual counterfactuals to diagnose segmentation hallucinations in VLMs, plus RobustSeg via counterfactual fine-tuning that reduces hallucinations by 30% on FP-RefCOCO.
Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment
cs.CV 2025-05 unverdicted novelty 7.0

Chain-of-Zoom factorizes extreme super-resolution into an autoregressive sequence of intermediate scales using a reused backbone model plus GRPO-tuned multi-scale VLM prompts.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning
cs.CV 2025-05 unverdicted novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
cs.CV 2026-05 unverdicted novelty 6.0

B-GRTO extends GRPO by reusing rollouts to optimize auxiliary segmentation decoder objectives, yielding substantial gains over plain GRPO on referring segmentation tasks.
VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment
cs.CV 2026-05 unverdicted novelty 6.0

VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise sco...
From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding
cs.CV 2026-05 unverdicted novelty 6.0

A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 6.0

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 conditional novelty 6.0

Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.
Image Generators are Generalist Vision Learners
cs.CV 2026-04 unverdicted novelty 6.0

Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
AdaTooler-V: Adaptive Tool-Use for Images and Videos
cs.CV 2025-12 conditional novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
cs.CV 2025-12 unverdicted novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
cs.CV 2025-11 unverdicted novelty 6.0

LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
cs.CV 2025-10 unverdicted novelty 6.0

ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring
cs.CL 2026-05 unverdicted novelty 5.0

ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units
cs.CV 2026-04 unverdicted novelty 5.0

SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
cs.CV 2026-03 unverdicted novelty 5.0

A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
Grounding Everything in Tokens for Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 5.0

GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.
OneThinker: All-in-one Reasoning Model for Image and Video
cs.CV 2025-12 unverdicted novelty 5.0

OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
cs.CV 2025-09 unverdicted novelty 5.0

Geo-R1 uses reasoning-centric reinforcement fine-tuning to improve few-shot performance and generalization in geospatial referring expression understanding over supervised baselines.
RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought
cs.CV 2025-06 unverdicted novelty 5.0

RealSR-R1 introduces VLCoT-GRPO with four rewards to add understanding and reasoning to real-world image super-resolution models.
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
cs.CV 2025-04 unverdicted novelty 5.0

Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 4.0

Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
Improving the Reasoning of Multi-Image Grounding in MLLMs via Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 4.0

A pipeline of chain-of-thought data synthesis, LoRA-based supervised fine-tuning, rejection sampling, and rule-based reinforcement learning raises multi-image grounding accuracy by 9.04% on MIG-Bench and 4.41% on aver...
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.