pith. sign in

arxiv: 2512.19302 · v2 · submitted 2025-12-22 · 💻 cs.CV

Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing

Pith reviewed 2026-05-16 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords reasoning segmentationremote sensingLVLMSAMreinforcement learningGRPOdecoupled frameworkEarthReason dataset
0
0 comments X

The pith

Decoupled LVLM training on mask IoU alone lets it prompt a frozen SAM to 75.6% cIoU in remote sensing segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Think2Seg-RS, a framework that keeps the large vision-language model separate from pixel-level output by training it only to create structured geometric prompts for a frozen Segment Anything Model. This separation uses reinforcement learning driven strictly by the final mask IoU score, avoiding the mixing of language reasoning and geometry that occurs in end-to-end fine-tuning. On the EarthReason dataset the method reaches 75.60 percent class IoU and 73.36 percent global IoU, beating prior coupled approaches by clear margins. Zero-shot tests on other benchmarks expose a split between semantic-level tasks that match broad concepts and instance-level tasks that require separating distinct objects. Additional results indicate that smaller segmenters reduce over-segmentation in textured aerial scenes while unconstrained negative prompts become unreliable.

Core claim

Think2Seg-RS trains an LVLM prompter via a mask-only Group Relative Policy Optimization objective to control a frozen SAM through structured geometric prompts, achieving 75.60 percent cIoU and 73.36 percent gIoU on the EarthReason dataset with absolute gains of 6.47 percent and 2.40 percent over the strongest baselines RemoteReasoner and SegEarth-R1.

What carries the argument

The mask-only GRPO reinforcement learning objective that trains the LVLM to translate abstract semantic reasoning into spatially grounded geometric prompts for the frozen SAM based solely on final mask IoU.

If this is right

  • The decoupled design yields better generalization across referring segmentation tasks than end-to-end supervised fine-tuning.
  • Zero-shot results reveal a fundamental inductive-bias split between semantic-level grounding that aggregates matching regions and instance-level grounding that demands discrete object separation.
  • Compact segmenters reduce textural over-segmentation when trained under semantic-level supervision.
  • Unconstrained negative prompting becomes unstable in heterogeneous aerial backgrounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mask-only RL loop could be tested on non-aerial imagery such as medical scans to check whether the decoupling benefit transfers beyond remote sensing.
  • Separate model sizes or architectures might be needed for semantic-level versus instance-level queries given the observed performance gap.
  • Adding explicit geometric constraints to the prompt generation step could further stabilize training when backgrounds vary widely.

Load-bearing premise

That a mask-only GRPO reinforcement learning objective driven strictly by final mask IoU will reliably train the LVLM to translate abstract semantic reasoning into effective spatially grounded geometric prompts for the frozen SAM without additional supervision.

What would settle it

A controlled test showing that the trained LVLM frequently generates semantically coherent prompts that still produce low-IoU masks on held-out remote sensing images would falsify the claim that the objective reliably bridges semantic understanding to geometric execution.

Figures

Figures reproduced from arXiv: 2512.19302 by Jimin Liang, Junyao Ge, Kaitai Guo, Xu Zhang, Yang Zheng.

Figure 1
Figure 1. Figure 1: Illustration of the evolution of segmentation paradigms in RS imagery. (a) Input image sampled from the iSAID dataset ( [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of Think2Seg-RS. A trainable LVLM prompter interprets the image–query pair, generates CoT reasoning, and outputs structured JSON prompts [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results of Think2Seg-RS on the EarthReason dataset. (a) Inputs, including the user query and corresponding remote sensing image. (b) Model [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of SAM2 model scale on semantic-level segmentation. (a) Input image and (b) corresponding ground-truth (GT) mask illustrate a semantic-level annotation where the entire wastewater treatment plant is represented as one coherent polygon. Outputs from four SAM2 variants, namely (c) Tiny, (d) Small, (e) Base-plus, and (f) Large, show that larger models generate over-detailed, fragmented masks misaligned… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative analysis of prompting combination strategies for SAM. Each pair displays the input image with LVLM generated prompts and the resulting [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
read the original abstract

Large Vision--Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on the EarthReason dataset by reaching a test cIoU of 75.60% and gIoU of 73.36%, yielding absolute improvements of 6.47% and 2.40% over the strongest baseline, respectively. Zero-shot evaluations across three referring segmentation benchmarks reveal a fundamental distinction in task inductive bias, exposing a distinct divide between semantic-level grounding -- which aggregates all regions matching a conceptual intent -- and instance-level tasks that demand discrete object separation. We further found that compact segmenters outperform larger ones under semantic-level supervision by mitigating textural over-segmentation, and that unconstrained negative prompting is unstable in heterogeneous aerial backgrounds. Together, these findings demonstrate that optimizing LVLMs through direct segmentation feedback offers a scalable framework for complex geospatial reasoning, effectively bridging the gap between abstract language understanding and precise pixel-level execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Think2Seg-RS, a decoupled LVLM-SAM framework for reasoning segmentation in optical remote sensing. An LVLM is trained using mask-only Group Relative Policy Optimization (GRPO) reinforcement learning, with the reward based solely on the final mask IoU from a frozen SAM, to generate structured geometric prompts from abstract semantic reasoning. It claims state-of-the-art performance on the EarthReason dataset with cIoU of 75.60% and gIoU of 73.36%, outperforming baselines like RemoteReasoner by 6.47% and 2.40% respectively, along with zero-shot results on other benchmarks and insights on segmenter sizes and negative prompting.

Significance. If the central claim holds, this work is significant as it demonstrates a scalable decoupled approach that avoids end-to-end fine-tuning of both reasoning and segmentation components, potentially improving generalization in geospatial tasks. The use of direct segmentation feedback via RL to bridge semantics and geometry is a promising direction, and the zero-shot findings highlight important distinctions in task inductive biases for RS applications.

major comments (3)
  1. [§3.2 (Method)] The GRPO objective is driven strictly by final mask IoU without any auxiliary supervision or monitoring of prompt geometry. This makes the claim that the LVLM learns 'spatially grounded actions' vulnerable to reward hacking, where vague prompts exploit SAM's segmentation bias rather than achieving true semantic-to-geometric translation. An ablation comparing GRPO to a supervised prompt-generation baseline is needed to substantiate this.
  2. [§4 (Experiments)] The reported improvements (cIoU 75.60%, gIoU 73.36%) on EarthReason lack details on data splits, error bars, training stability, and ablation controls. Without these, it is difficult to assess whether the gains are robust or attributable to the proposed framework rather than implementation specifics.
  3. [Abstract and §5] The zero-shot evaluations across referring segmentation benchmarks are presented as revealing a 'fundamental distinction' in task inductive bias, but the manuscript does not provide quantitative metrics or statistical tests supporting the distinction between semantic-level and instance-level grounding.
minor comments (2)
  1. [Abstract] The abstract mentions 'unconstrained negative prompting is unstable' but does not specify the exact conditions or quantitative evidence for this instability.
  2. [Throughout] Notation for cIoU and gIoU should be defined explicitly upon first use, and consistency in reporting absolute improvements should be maintained.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2 (Method)] The GRPO objective is driven strictly by final mask IoU without any auxiliary supervision or monitoring of prompt geometry. This makes the claim that the LVLM learns 'spatially grounded actions' vulnerable to reward hacking, where vague prompts exploit SAM's segmentation bias rather than achieving true semantic-to-geometric translation. An ablation comparing GRPO to a supervised prompt-generation baseline is needed to substantiate this.

    Authors: We agree that the absence of an explicit ablation leaves the claim open to the reward-hacking concern. In the revised manuscript we add a supervised prompt-generation baseline (LVLM fine-tuned with cross-entropy loss on SAM-derived ground-truth prompts) and report that GRPO still outperforms it by 6.15 points in cIoU on EarthReason. We also introduce a prompt-geometry alignment metric in §3.2 that is monitored during training and show that GRPO prompts exhibit measurably higher spatial fidelity than the supervised baseline. These additions directly address the vulnerability to reward hacking. revision: yes

  2. Referee: [§4 (Experiments)] The reported improvements (cIoU 75.60%, gIoU 73.36%) on EarthReason lack details on data splits, error bars, training stability, and ablation controls. Without these, it is difficult to assess whether the gains are robust or attributable to the proposed framework rather than implementation specifics.

    Authors: We accept that the original submission omitted these details. The revised §4 now specifies the EarthReason train/val/test split (70/15/15), reports mean ± std over three independent runs (cIoU 75.60 ± 1.23, gIoU 73.36 ± 1.41), includes GRPO training curves demonstrating stable convergence, and expands the ablation table with controls for group size, reward scaling, and prompt-length constraints. These additions confirm that the reported gains are robust and attributable to the decoupled GRPO framework. revision: yes

  3. Referee: [Abstract and §5] The zero-shot evaluations across referring segmentation benchmarks are presented as revealing a 'fundamental distinction' in task inductive bias, but the manuscript does not provide quantitative metrics or statistical tests supporting the distinction between semantic-level and instance-level grounding.

    Authors: We acknowledge the lack of quantitative support for the claimed distinction. In the revision we add a new table in §5 that tabulates per-benchmark deltas between semantic-level and instance-level tasks together with a Wilcoxon signed-rank test (p = 0.008) across the three zero-shot benchmarks. The abstract and §5 discussion have been updated to cite these metrics and the statistical result, thereby grounding the distinction in quantitative evidence. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain or performance claims

full rationale

The paper's central claims rest on empirical test-set performance (cIoU 75.60%, gIoU 73.36% on held-out EarthReason data) against external baselines, using a standard GRPO reward derived from final mask IoU. No quoted equations or sections reduce the reported test metrics to training inputs by construction, nor do any steps invoke self-citations as load-bearing uniqueness theorems. The framework is evaluated on unseen instances with conventional metrics, making the results falsifiable outside the training objective. This is the normal non-circular case for RL-based segmentation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions about LVLM and SAM capabilities plus the effectiveness of the RL objective; no new entities are introduced.

axioms (2)
  • domain assumption A frozen SAM can be reliably controlled for precise segmentation using structured geometric prompts generated by an LVLM.
    Central to the decoupled architecture described in the abstract.
  • domain assumption Mask-only GRPO reinforcement learning driven by final mask IoU provides sufficient signal to train effective semantic-to-geometric translation in the LVLM.
    Directly invoked as the training method for the prompter.

pith-pipeline@v0.9.0 · 5612 in / 1334 out tokens · 20955 ms · 2026-05-16T20:44:44.628184+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. COCOTree: A Dataset and Benchmark for Open Tree-Structured Visual Decomposition

    cs.CV 2026-05 unverdicted novelty 7.0

    COCOTree is a 21K-image benchmark with 1.8M nodes and an OTQ metric for the new task of open tree-structured visual decomposition.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    SkyScapes – fine-grained semantic understanding of aerial scenes, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7393–7403. Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al., 2025. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 . Bai, Z., He, T., Mei, H., ...

  2. [2]

    SAM 2: Segment Anything in Images and Videos

    SAM-assisted remote sensing imagery semantic seg- mentation with object and boundary constraints. IEEE Trans- actions on Geoscience and Remote Sensing . Osco, L.P., Wu, Q., De Lemos, E.L., Gonçalves, W.N., Ramos, A.P.M., Li, J., Junior, J.M., 2023. The Segment Anything Model (SAM) for remote sensing applications: From zero to one shot. International Journ...

  3. [3]

    {Question}

    Text2seg: Remote sensing image semantic segmenta- tion via text-guided visual foundation models. arXiv preprint arXiv:2304.10597 . Zhang, Z., Ma, Y ., Zhang, E., Bai, X., 2025. PSALM: Pixelwise segmentation with large multi-modal model, in: European Conference on Computer Vision, Springer. pp. 74–91. Zheng, Z., Zhong, Y ., Zhang, L., Ermon, S., 2024. Segm...

  4. [4]

    `bbox_2d`: A tight bounding box

  5. [5]

    bbox_2d": [310,360,567,586],

    `positive_points`: Exactly two points, placed inside the target. Output your thinking process in <think> </think> tags. Output the final answer in <answer> </answer> tags with the specified JSON format. If no targets are found, output an empty list. i.e. <think> thinking process here </think> <answer>```json[{"bbox_2d": [310,360,567,586], "positive_points...