Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery

Francescopaolo Sica; Islam Mansour; Michael Schmitt

arXiv: 2604.17920 · v1 · submitted 2026-04-20 · cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-10 05:50 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG

keywords: zero-shot segmentation · SAR imagery · ship instance segmentation · foundation models · SAM2 · YOLO detector · maritime surveillance · annotation efficiency

The pith

Bounding boxes from a SAR-trained detector can prompt SAM2 to produce ship instance masks without any mask annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that ships in SAR imagery can be segmented at the instance level using only bounding-box localizations from a detector trained on public SAR data. These boxes serve as prompts to the Segment Anything Model 2, which then outputs the pixel-level masks. The approach avoids the need for expensive pixel-level annotations that normally constrain deep learning on radar images. On the SSDD benchmark the method reaches a mean IoU of 0.637, or 89 percent of a fully supervised baseline, while maintaining an 89.2 percent ship detection rate. This creates an annotation-light route to applying general vision foundation models in maritime SAR analysis.

Core claim

The paper establishes that spatial constraints from bounding boxes produced by a YOLOv11 detector trained on open SAR datasets are sufficient to regularize SAM2 predictions and bridge the optical-SAR domain gap, yielding instance masks without fine-tuning the foundation model or using any mask annotations.

What carries the argument

Bounding-box prompts generated by a SAR-trained YOLOv11 detector that constrain and direct the predictions of the Segment Anything Model 2.
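
As a rough illustration of how the box-prompt mechanism composes, the sketch below wires a detector and a promptable segmenter together as two plain callables. The names `segment_ships`, `toy_detector`, and `toy_segmenter` are hypothetical stand-ins and not the authors' code; in the paper the two roles are played by a SAR-trained YOLOv11 detector and the pre-trained SAM2, neither of which is loaded here.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max), inclusive pixel indices
Mask = List[List[int]]               # binary mask with the same height/width as the image
Image = Sequence[Sequence[float]]    # single-channel SAR amplitude image


def segment_ships(
    image: Image,
    detect_boxes: Callable[[Image], List[Box]],
    segment_with_box: Callable[[Image, Box], Mask],
) -> List[Tuple[Box, Mask]]:
    """Run the detector once, then prompt the segmenter once per detected box."""
    instances = []
    for box in detect_boxes(image):
        instances.append((box, segment_with_box(image, box)))
    return instances


# Hypothetical stand-ins so the sketch runs without any model weights.
def toy_detector(image: Image) -> List[Box]:
    """Pretend detector: always returns one fixed box."""
    return [(1, 1, 3, 2)]


def toy_segmenter(image: Image, box: Box) -> Mask:
    """Pretend segmenter: keeps bright pixels (> 0.5) that lie inside the box."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x <= x1 and y0 <= y <= y1 and value > 0.5) else 0
         for x, value in enumerate(row)]
        for y, row in enumerate(image)
    ]


image = [
    [0.1, 0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.9, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]
instances = segment_ships(image, toy_detector, toy_segmenter)
```

The point of the toy segmenter is only that the box bounds where a mask may appear: the bright pixel at the right edge of the third row is excluded because it falls outside the prompt.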

If this is right

Vessel classification, size estimation, and wake analysis become possible from the generated instance masks.
Only bounding-box annotations are required to train the detector, greatly reducing labeling cost compared with mask-based supervision.
The method reaches 89 percent of fully supervised IoU on the SSDD benchmark while preserving 89.2 percent ship detection rate.
The design provides a scalable, annotation-efficient path for applying foundation models to SAR maritime surveillance.
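
The two headline figures can be combined to back out the supervised baseline they imply. This is an inference from rounded numbers in the abstract, not a value the summary above reports, so the sketch also prints the range consistent with "89 percent" having been rounded to the nearest percent.

```python
zero_shot_miou = 0.637   # reported mean IoU of the zero-shot pipeline
relative_share = 0.89    # "89 percent of a fully supervised baseline" (rounded)

# Implied mean IoU of the fully supervised baseline.
implied_baseline = zero_shot_miou / relative_share

# Bounds if the true share lies anywhere between 88.5% and 89.5%.
low = zero_shot_miou / 0.895
high = zero_shot_miou / 0.885

print(f"implied supervised mIoU ~ {implied_baseline:.3f} "
      f"(range {low:.3f} to {high:.3f})")
```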

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same box-prompting pattern could be tested on other SAR features such as oil spills or icebergs if suitable detectors exist.
The result implies that geometric prompts may reduce reliance on domain-specific fine-tuning for remote-sensing foundation models.
Applying the pipeline to multi-frequency or polarimetric SAR data would test whether the approach generalizes beyond single-channel imagery.

Load-bearing premise

The assumption that bounding boxes from a SAR detector alone supply enough spatial guidance to overcome the differences between SAM2's optical training data and SAR images.

What would settle it

Running the identical pipeline on a held-out SAR dataset but obtaining mean IoU substantially below 0.5 would show that the reported regularization by detector boxes does not reliably bridge the domain gap.
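
For concreteness, a minimal sketch of the metric that test turns on is given below. It assumes mean IoU is averaged over matched predicted/ground-truth mask pairs; whether the paper averages per instance or per image is not stated in the material above, and the function names are illustrative.

```python
def mask_iou(pred, truth):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks score 0.0 rather than raising an error.
    return intersection / union if union else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted mask, ground-truth mask) pairs."""
    scores = [mask_iou(pred, truth) for pred, truth in pairs]
    return sum(scores) / len(scores) if scores else 0.0


pairs = [
    ([[1, 1, 0], [0, 0, 0]], [[1, 1, 1], [0, 0, 0]]),  # 2 shared pixels of 3
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),              # perfect overlap
]
miou = mean_iou(pairs)

# The threshold named in the falsification test above.
below_threshold = miou < 0.5
```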

Original abstract

Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

A simple detector-box prompt to SAM2 delivers usable zero-shot SAR segmentation but the domain adaptation claim needs more backing.

The key takeaway is that this work demonstrates a practical zero-shot pipeline for ship instance segmentation in SAR imagery: a YOLOv11 detector trained on available SAR data provides bounding box prompts to SAM2, yielding instance masks with a mean IoU of 0.637 on the SSDD benchmark, reaching 89% of a fully supervised method's performance, all without mask annotations or any fine-tuning of the foundation model. What is new here is the reliance on detector boxes alone to regularize SAM2 predictions in this domain, avoiding the adapters or fine-tuning common in earlier SAM adaptations for SAR.

The paper does a good job highlighting the annotation efficiency, which addresses a real bottleneck in SAR analysis for maritime surveillance. Using public open datasets for training the detector and the standard SSDD for evaluation makes the setup straightforward to replicate. The reported ship detection rate of 89.2% adds to the usability for downstream tasks like classification or wake analysis.

On the soft spots, the main one is the limited insight into how well the box prompts actually overcome the optical-to-SAR domain gap. SAM2 was trained on natural images with color and texture, while SAR has speckle and intensity variations, so the box provides location but little else for the mask decoder. The IoU number is solid but without ablations on prompt variations, comparisons to other prompting strategies, or detailed error analysis, it's difficult to rule out that the model is mostly box-filling rather than true segmentation. The abstract and results would benefit from more protocol details and perhaps confidence intervals to strengthen the claims.

This kind of paper is aimed at the remote sensing and computer vision community working on SAR imagery, particularly those exploring foundation models for specialized domains with scarce labels. Readers focused on efficient adaptation techniques or maritime applications would get the most out of it.

It has enough concrete results and a clear contribution to warrant a serious referee, even if it requires revisions for deeper analysis. I recommend putting it through peer review rather than desk rejecting it.

Referee Report

2 major / 1 minor

Summary. The paper proposes a zero-shot ship instance segmentation method for SAR imagery that trains a YOLOv11 detector on open SAR datasets to generate bounding-box prompts, which are then fed to the pre-trained Segment Anything Model 2 (SAM2) to produce instance masks. No mask-level annotations or fine-tuning/adapters are used. On the SSDD benchmark the approach reports a mean IoU of 0.637 (89% of a fully supervised baseline) together with an 89.2% ship detection rate, arguing that the spatial constraints supplied by the SAR-trained detector are sufficient to regularize SAM2 across the optical-SAR domain gap.
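
The detection-rate figure depends on a matching rule that the summary does not spell out. One common convention, shown here purely as an assumption, counts a ground-truth ship as detected when some unused predicted box overlaps it with IoU at or above a threshold. The function names, the greedy matching, and the default threshold of 0.5 are illustrative choices, not details taken from the paper.

```python
def box_iou(a, b):
    """IoU of two boxes (x_min, y_min, x_max, y_max) in continuous coordinates."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def detection_rate(truth_boxes, predicted_boxes, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by a distinct predicted box."""
    unmatched = list(predicted_boxes)
    hits = 0
    for truth in truth_boxes:
        # Greedy matching: take the best-overlapping prediction still unused.
        best = max(unmatched, key=lambda pred: box_iou(truth, pred), default=None)
        if best is not None and box_iou(truth, best) >= iou_threshold:
            hits += 1
            unmatched.remove(best)
    return hits / len(truth_boxes) if truth_boxes else 0.0


truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]

rate_default = detection_rate(truth, preds)                    # threshold 0.5
rate_strict = detection_rate(truth, preds, iou_threshold=0.9)  # stricter match
```

In this toy case the first ship overlaps its prediction with IoU 0.81, so it counts at a threshold of 0.5 but not at 0.9, which is the kind of threshold sensitivity the second major comment below asks the authors to report.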

Significance. If the central empirical result holds, the work demonstrates a practical, annotation-light route to instance-level SAR understanding by repurposing existing detection datasets and off-the-shelf foundation models. This could meaningfully lower the barrier to pixel-level maritime surveillance tasks such as vessel sizing and wake analysis.

major comments (2)

[Abstract / Experiments] The central claim that 'spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions' and thereby 'partially mitigate the optical-SAR domain gap' rests on the reported 0.637 mIoU without any ablation that isolates the contribution of the box prompt versus simple box-filling behavior or versus a naïve mask decoder. No comparison to a baseline that directly rasterizes the YOLO boxes is provided, leaving open the possibility that the measured IoU largely reflects the detector's localization rather than SAM2's segmentation capability.
[Experiments] The manuscript states a single scalar mIoU of 0.637 and an 89.2% detection rate on SSDD but supplies neither error bars across multiple runs, nor sensitivity analysis to the detector's IoU threshold, nor qualitative failure cases on speckled or low-contrast ships. These omissions make it impossible to assess whether the 89% relative performance is robust or an artifact of a favorable test split.
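
The box-rasterization baseline requested in the first comment is simple to state. The sketch below fills each box and scores it against the ground-truth mask, using the tight box around the ground truth as an idealized detector. All names are hypothetical and the inclusive pixel-index box convention is an assumption.

```python
def rasterize_box(box, height, width):
    """Fill an inclusive pixel box (x_min, y_min, x_max, y_max) into a binary mask."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x <= x1 and y0 <= y <= y1) else 0 for x in range(width)]
        for y in range(height)
    ]


def tight_box(mask):
    """Smallest inclusive box around the foreground of a non-empty binary mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def mask_iou(pred, truth):
    """IoU of two same-sized binary masks given as nested lists of 0/1."""
    inter = sum(1 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr) if p and t)
    union = sum(1 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr) if p or t)
    return inter / union if union else 0.0


def box_fill_iou(truth_mask):
    """IoU obtained by filling the tight box of the ground truth (idealized detector)."""
    height, width = len(truth_mask), len(truth_mask[0])
    return mask_iou(rasterize_box(tight_box(truth_mask), height, width), truth_mask)


# A ship lying along the image diagonal versus one aligned with the image rows.
diagonal_ship = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
aligned_ship = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]

iou_diagonal = box_fill_iou(diagonal_ship)
iou_aligned = box_fill_iou(aligned_ship)
```

Box filling scores perfectly on the axis-aligned toy ship and poorly on the oblique one, which is why the share of oblique or irregular ships in the test set matters when judging how much of the reported mIoU a filled box alone could reach.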

minor comments (1)

[Abstract] The abstract and introduction repeatedly use 'zero-shot' while the detector itself is trained on SAR data; a brief clarification of the precise sense in which the overall pipeline is zero-shot (i.e., zero mask annotations) would avoid reader confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects for validating our central claims, and we address each point below with plans for revision.

Referee: [Abstract / Experiments] The central claim that 'spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions' and thereby 'partially mitigate the optical-SAR domain gap' rests on the reported 0.637 mIoU without any ablation that isolates the contribution of the box prompt versus simple box-filling behavior or versus a naïve mask decoder. No comparison to a baseline that directly rasterizes the YOLO boxes is provided, leaving open the possibility that the measured IoU largely reflects the detector's localization rather than SAM2's segmentation capability.

Authors: We agree that a direct ablation against a box-rasterization baseline would more rigorously isolate SAM2's contribution. While the 89% relative mIoU to the fully supervised (mask-annotated) baseline already indicates that our zero-shot pipeline captures segmentation detail beyond crude rectangular coverage, this is indirect evidence. In the revised manuscript we will add a baseline that converts each YOLOv11 box into a filled binary mask and reports its mIoU on SSDD; we will also state explicitly that SAM2 is used with its original pre-trained mask decoder and no adapters. These additions will quantify the incremental benefit of the foundation-model segmentation step and better support the claim that SAR-trained box prompts help regularize predictions across the domain gap.

revision: yes

Referee: [Experiments] The manuscript states a single scalar mIoU of 0.637 and an 89.2% detection rate on SSDD but supplies neither error bars across multiple runs, nor sensitivity analysis to the detector's IoU threshold, nor qualitative failure cases on speckled or low-contrast ships. These omissions make it impossible to assess whether the 89% relative performance is robust or an artifact of a favorable test split.

Authors: We acknowledge that the current single-run reporting limits assessment of robustness. In the revision we will retrain the YOLOv11 detector with three different random seeds, report mean and standard deviation for both mIoU and detection rate, and include error bars. We will also add a sensitivity plot varying the box-prompt IoU threshold and a dedicated qualitative section showing both successful and failure cases on speckled or low-contrast ships. These changes will allow readers to evaluate whether the reported 89% relative performance is stable.

revision: yes
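
The multi-seed reporting promised in that response amounts to a mean and a sample standard deviation over per-seed scores. A minimal sketch using Python's standard `statistics` module follows; the helper name is hypothetical and the three scores are made-up placeholders, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(scores), stdev(scores)


# Placeholder numbers for illustration only; NOT results from the paper.
placeholder_miou_per_seed = [0.50, 0.60, 0.70]

avg, spread = summarize_runs(placeholder_miou_per_seed)
print(f"mIoU = {avg:.3f} +/- {spread:.3f} over {len(placeholder_miou_per_seed)} seeds")
```

With only three seeds the standard deviation is itself a noisy estimate, so it indicates spread rather than giving a tight confidence interval.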

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline grounded in external models and benchmark

The paper describes an empirical pipeline: a YOLOv11 detector is trained on open SAR datasets to produce bounding boxes that serve as prompts to the pre-trained SAM2 model, with performance measured on the public SSDD benchmark (mIoU 0.637). No equations, fitted parameters, or derivations are presented whose outputs reduce by construction to the inputs. The central claim relies on externally developed models (the YOLOv11 detector architecture and the SAM2 foundation model) and an independent test set rather than self-definitional relations, self-citation chains, or renamed empirical patterns. This matches the default case of a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that pre-trained foundation models can be steered across the optical-SAR domain gap by box prompts alone; no new free parameters, axioms, or invented entities are introduced beyond standard computer-vision assumptions.

axioms (1)

domain assumption: Box prompts from a domain-specific detector are sufficient to regularize a general foundation model across the optical-SAR domain gap.
Invoked in the method description and in the claim that spatial constraints alone mitigate the domain gap.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1] K. A. Sørensen, “Maritime Surveillance Finding Dark Ships with Satellites and Artificial Intelligence”, Ph.D. dissertation, Technical University of Denmark, 2024
[2] Y. Wang, C. Wang, H. Zhang, Y. Dong, and S. Wei, “Automatic Ship Detection Based on RetinaNet Using Multi-Resolution Gaofen-3 Imagery”, Remote. Sens., vol. 11, no. 5, 2019
[3] Y. Wang, J. Liu, R. W. Liu, Y. Liu, and Z. Yuan, “Data-driven methods for detection of abnormal ship behavior: Progress and trends”, Ocean Engineering, vol. 271, 2023
[4] J. Li, C. Qu, and J. Shao, “Ship detection in SAR images based on an improved faster R-CNN”, in 2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), 2017
[5] S. Qiao, Q. Zhang, and Z. Wang, “A Review of Deep-Learning-Based SAR Image Ship Interpretation Technology: The Latest Advances”, IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., vol. 18, 2025
[6] T. Zhang et al., “LS-SSDD-v1.0: A Deep Learning Dataset Dedicated to Small Ship Detection from Large-Scale Sentinel-1 SAR Images”, Remote. Sens., vol. 12, no. 18, 2020
[7] T. Zhang et al., “SAR Ship Detection Dataset (SSDD): Official Release and Comprehensive Data Analysis”, Remote. Sens., vol. 13, no. 18, 2021
[8] A. Kirillov et al., “Segment Anything”, in IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE, 2023
[9] N. Ravi et al., “SAM 2: Segment Anything in Images and Videos”, CoRR, vol. abs/2408.00714, 2024
[10] N. Inkawhich, “On the Status of Foundation Models for SAR Imagery”, CoRR, vol. abs/2509.21722, 2025
[11] M. Rahimi and S. Sharifian, “SAMSAR: A modified SAM architecture for oceanic ship segmentation of satellite SAR images using CNN-based Cross-Fused Attention”, Expert Syst. Appl., vol. 284, 2025
[12] L. Zheng, X. Pu, S. Zhang, and F. Xu, “Tuning a SAM-Based Model With Multicognitive Visual Adapter to Remote Sensing Instance Segmentation”, IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens., vol. 18, 2025
[13] S. Liu, F. Wang, H. You, N. Jiao, G. Zhou, and T. Zhang, “Context-Aggregated and SAM-Guided Network for ViT-Based Instance Segmentation in Remote Sensing Images”, Remote. Sens., vol. 16, no. 13, 2024
[14] Z. Sun, X. Leng, Y. Lei, B. Xiong, K. Ji, and G. Kuang, “BiFA-YOLO: A Novel YOLO-Based Method for Arbitrary-Oriented Ship Detection in High-Resolution SAR Images”, Remote. Sens., vol. 13, no. 21, 2021
[15] M. Kang, K. Ji, X. Leng, and Z. Lin, “Contextual Region-Based Convolutional Neural Network with Multilayer Fusion for SAR Ship Detection”, Remote. Sens., vol. 9, no. 8, 2017
[16] D. Cheng, Z. Qin, Z. Jiang, S. Zhang, Q. Lao, and K. Li, “SAM on Medical Images: A Comprehensive Study on Three Prompt Modes”, CoRR, vol. abs/2305.00035, 2023
[17] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images”, Nature Communications, vol. 15, no. 1, 2024
[18] S. Wei, X. Zeng, Q. Qu, M. Wang, H. Su, and J. Shi, “HRSID: A High-Resolution SAR Images Dataset for Ship Detection and Instance Segmentation”, IEEE Access, vol. 8, 2020
[19] Y. Jiang, “Prompt Engineering in Segment Anything Model: Methodologies, Applications, and Emerging Challenges”, CoRR, vol. abs/2507.9562, 2025
