S4M: 4-points to Segment Anything

Adrien Meyer; Didier Mutter; Giuseppe Massimiani; Lorenzo Arboit; Nicolas Padoy; Shih-Min Yin

arxiv: 2503.05534 · v3 · pith:MPL5JT5Lnew · submitted 2025-03-07 · 💻 cs.CV


Adrien Meyer, Lorenzo Arboit, Giuseppe Massimiani, Shih-Min Yin, Didier Mutter, Nicolas Padoy

Pith reviewed 2026-05-23 00:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: segment anything model · medical image segmentation · point prompting · ultrasound · endoscopy · structured prompts · annotation efficiency


The pith

S4M augments SAM to treat four points as relational shape cues rather than isolated clicks for more accurate medical segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that point prompts in the Segment Anything Model (SAM) become ambiguous in medical images because of overlapping anatomy and blurred boundaries, which forces repeated manual fixes. It introduces S4M, which modifies SAM with role-specific embeddings for each of the four points and an auxiliary Canvas task that forces the model to sketch coarse masks directly from the prompt, encouraging geometry-aware reasoning. The four points are either extreme points or major and minor axis endpoints, the latter inspired by clinical ultrasound measurement practice. Experiments across eight ultrasound and endoscopy datasets show a 3.42 mIoU gain over a strong SAM baseline at the same prompt budget, while a study with three clinicians finds that the major/minor variant speeds annotation. If correct, this would lower the cost of creating large, precise medical segmentation datasets by making prompting more efficient and clinically natural.
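To make the two four-point variants concrete, the sketch below derives both from a binary mask. It is an illustration only: the abstract does not say how prompts are generated, so the PCA-based axis estimate and the helper names (`foreground`, `extreme_points`, `axis_endpoints`) are assumptions, not the authors' procedure.

```python
import math

def foreground(mask):
    """List foreground pixels of a binary mask as (x, y) tuples."""
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    return pts

def extreme_points(mask):
    """Left-most, right-most, top-most and bottom-most foreground pixels."""
    pts = foreground(mask)
    left = min(pts, key=lambda p: p[0])
    right = max(pts, key=lambda p: p[0])
    top = min(pts, key=lambda p: p[1])
    bottom = max(pts, key=lambda p: p[1])
    return left, right, top, bottom

def axis_endpoints(mask):
    """Approximate major/minor axis endpoints (assumption: PCA of pixel coordinates)."""
    pts = foreground(mask)
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    sxx = sum((x - cx) ** 2 for x, _ in pts) / n
    syy = sum((y - cy) ** 2 for _, y in pts) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    # Orientation of the direction of largest variance (the major axis).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    major = (math.cos(theta), math.sin(theta))
    minor = (-major[1], major[0])

    def ends(axis, other):
        # Farthest pixels along `axis`; ties go to the pixel closest to the axis line.
        def along(p):
            return round((p[0] - cx) * axis[0] + (p[1] - cy) * axis[1], 9)
        def off(p):
            return abs((p[0] - cx) * other[0] + (p[1] - cy) * other[1])
        low = min(pts, key=lambda p: (along(p), off(p)))
        high = min(pts, key=lambda p: (-along(p), off(p)))
        return low, high

    return ends(major, minor) + ends(minor, major)

# Toy, horizontally elongated blob (hypothetical data).
toy_mask = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(extreme_points(toy_mask))
print(axis_endpoints(toy_mask))
```

On irregular shapes the two variants can differ substantially: extreme points follow the image axes, while the axis endpoints follow the object's own orientation.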

Core claim

S4M augments the Segment Anything Model by expanding the prompt space with role-specific embeddings and adding an auxiliary Canvas pretext task that sketches coarse masks directly from prompts, allowing the model to interpret four points as relational cues rather than isolated clicks and thereby producing more accurate instance segmentations on medical images with overlapping anatomy and blurred boundaries.

What carries the argument

Role-specific embeddings for the four points together with the Canvas auxiliary task that sketches coarse masks from prompts to foster geometry-aware reasoning.
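The idea of role-specific embeddings can be illustrated with a toy prompt encoder. In SAM, each point token is a positional encoding plus a learned embedding that only marks foreground versus background, so four positive clicks share one embedding. The sketch below contrasts that with one embedding per role. Everything here is an assumption for illustration: the role names, the 4-dimensional features, and the fixed (rather than learned) vectors are not taken from the released S4M code.

```python
import math

# Hypothetical role names for the major/minor variant.
ROLES = ("major_a", "major_b", "minor_a", "minor_b")

def positional_features(x, y, dim=4):
    """Toy sinusoidal encoding of a normalized (x, y) location."""
    feats = []
    for k in range(dim // 2):
        freq = (2 ** k) * math.pi
        feats += [math.sin(freq * x), math.cos(freq * y)]
    return feats

def make_role_table(dim=4):
    """One vector per role; in a real model these would be learned parameters."""
    return {
        role: [0.1 * (i + 1) if j == i % dim else 0.0 for j in range(dim)]
        for i, role in enumerate(ROLES)
    }

def encode_prompt(points, role_table=None, shared=(0.0, 0.0, 0.0, 0.0)):
    """Return one token per point: positional features plus an embedding.

    With role_table=None every point gets the same `shared` embedding
    (SAM-like behaviour); otherwise each point gets its role's vector.
    """
    tokens = []
    for role, (x, y) in zip(ROLES, points):
        pos = positional_features(x, y)
        offset = role_table[role] if role_table is not None else shared
        tokens.append([p + o for p, o in zip(pos, offset)])
    return tokens

# Four points at the same location: only the role table tells them apart.
same_spot = [(0.5, 0.5)] * 4
print(encode_prompt(same_spot))
print(encode_prompt(same_spot, role_table=make_role_table()))
```

With a shared embedding the four tokens are identical whenever the coordinates coincide, and more generally the decoder cannot tell which click marks which end of which axis. A per-role vector gives it that information.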

If this is right

S4M achieves a 3.42 mIoU improvement over a strong SAM baseline at equal prompt budget across eight ultrasound and surgical endoscopy datasets.
Major and minor axis endpoint prompts enable faster annotation by clinicians compared with standard point prompting.
The four-point strategy aligns prompting with existing clinical measurement practices in ultrasound.
Higher efficiency at equal accuracy supports more scalable development of segmentation datasets in medical imaging.
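For readers checking the headline metric, here is a minimal sketch of a mean IoU computation. It assumes mIoU is the average of per-instance IoU expressed in percentage points; the abstract does not spell out the averaging scheme, and the toy masks are hypothetical.

```python
def iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0

def mean_iou(pairs):
    """Mean IoU over (prediction, target) pairs, in percentage points."""
    return 100.0 * sum(iou(p, t) for p, t in pairs) / len(pairs)

# Toy example: one half-correct prediction and one perfect prediction.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),
]
print(mean_iou(pairs))
```

Under this reading, a "+3.42 mIoU" gain is the difference between two such averages computed for S4M and for the baseline on the same instances.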

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same relational-prompt design could be tested on non-medical images where objects have clear elongated shapes or standard measurement conventions.
If the Canvas task proves effective, similar auxiliary objectives might help other prompt-based foundation models handle structured multi-point inputs.
Faster annotation workflows could allow clinical teams to label larger and more varied datasets without proportional increases in expert time.

Load-bearing premise

Clinicians can identify the endpoints of major and minor axes consistently with low inter-annotator variability across diverse medical images without introducing new sources of error.

What would settle it

A measurement showing high disagreement among clinicians when asked to mark major and minor axis endpoints on the same set of ultrasound or endoscopy images would indicate the prompting strategy cannot be applied reliably in practice.
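One way such a measurement could be set up is sketched below: the mean pairwise distance between annotators' endpoints, computed per axis. The data layout, the function names, and the choice to treat the two endpoints of an axis as unordered are assumptions; the coordinates are made up for illustration.

```python
import math
from itertools import combinations

def axis_distance(a, b):
    """Mean endpoint distance between two annotations of the same axis.

    The two endpoints of an axis have no inherent order, so the cheaper of
    the direct and the swapped matching is used.
    """
    direct = math.dist(a[0], b[0]) + math.dist(a[1], b[1])
    swapped = math.dist(a[0], b[1]) + math.dist(a[1], b[0])
    return min(direct, swapped) / 2

def endpoint_disagreement(annotations):
    """Mean pairwise endpoint distance per axis, in pixels.

    `annotations` holds one dict per annotator:
    {"major": (p, q), "minor": (p, q)} with p, q as (x, y) tuples.
    """
    result = {}
    for axis in ("major", "minor"):
        dists = [
            axis_distance(a[axis], b[axis])
            for a, b in combinations(annotations, 2)
        ]
        result[axis] = sum(dists) / len(dists)
    return result

# Three hypothetical annotators marking the same lesion.
annotators = [
    {"major": ((0, 0), (10, 0)), "minor": ((5, -2), (5, 2))},
    {"major": ((10, 0), (0, 0)), "minor": ((5, -2), (5, 2))},  # endpoints swapped
    {"major": ((0, 3), (10, 3)), "minor": ((5, -2), (5, 2))},
]
print(endpoint_disagreement(annotators))
```

Large values relative to object size, or a large drop in mask overlap when the model is prompted with different annotators' points, would count against the premise.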

Original abstract

Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed. Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning. Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation. Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging. We release our code and pretrained models at https://github.com/CAMMA-public/S4M.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

S4M adds role embeddings and a canvas task to SAM for 4-point medical prompts and reports gains on eight datasets, but the evaluation lacks the details needed to confirm the gains are robust.


The paper's main move is to treat four points as a structured shape cue rather than isolated clicks. They pick major/minor axis endpoints because that matches how clinicians already measure in ultrasound, then add role-specific embeddings so the model knows which point is which and an auxiliary canvas task that forces it to sketch a rough mask from the prompt alone. That combination is the actual novelty over plain SAM prompting work. They test on eight ultrasound and endoscopy datasets, claim +3.42 mIoU over a strong baseline at the same prompt budget, and show a small time-saving study with three clinicians. Code and models are released, which is useful. The practical alignment with existing clinical measurement practice is the part that could matter for annotation workflows.

The soft spots sit in the evaluation. The abstract gives the headline number but no error bars, no significance tests, no split details, and no per-dataset breakdown. The clinician study is only three people and reports time savings without any inter-annotator agreement numbers on point placement or resulting mask overlap. That leaves the stress-test concern standing: if clinicians place the major/minor endpoints inconsistently across images, the structured prompts could add noise rather than remove it. Without those checks the claimed advantage over standard point prompts is hard to trust.

This is for people working on prompt-based medical segmentation tools who want a concrete 4-point recipe and released code. A reader could pull the method and test it themselves, but the current evidence is too thin to treat the gains as settled.

It is worth sending to review so the methods and statistics can be checked properly, though heavy revision on the evaluation side would be expected.

Referee Report

2 major / 1 minor

Summary. The paper introduces S4M, an augmentation to SAM that incorporates role-specific embeddings for 4-point prompts (extreme points or major/minor axis endpoints) plus an auxiliary Canvas pretext task to enable geometry-aware reasoning. It claims a +3.42 mIoU gain over a strong SAM baseline across eight ultrasound and endoscopy datasets at equal prompt budget, plus faster annotation in a three-clinician study, arguing that the structured prompts align with clinical practice and reduce refinement cycles.

Significance. If the reported gains prove robust and the prompting strategy reliable across annotators, the work could meaningfully lower the annotation burden in medical imaging by replacing ambiguous point clicks with clinically standard measurements. Code and model release strengthens reproducibility.

major comments (2)

[Abstract / Results] The headline +3.42 mIoU improvement and the faster-annotation claim both depend on clinicians being able to place major/minor axis endpoints consistently; the manuscript reports only a three-clinician time study and supplies no quantitative inter-annotator agreement (point-distance variance, Dice overlap of resulting masks, or per-dataset breakdown) on the eight target collections. This is load-bearing for the central performance claim.
[Methods / Experiments] No details are given on statistical significance testing, error bars, cross-validation splits, or whether prompt definitions were standardized across the eight datasets, so it is impossible to judge whether the numeric gain is stable or sensitive to post-hoc choices.

minor comments (1)

[Methods] The Canvas pretext task is described only at a high level; a short diagram or pseudocode would clarify how the coarse-mask output is supervised and how it interacts with the role-specific embeddings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and indicate planned revisions to strengthen the manuscript.


Referee: [Abstract / Results] The headline +3.42 mIoU improvement and the faster-annotation claim both depend on clinicians being able to place major/minor axis endpoints consistently; the manuscript reports only a three-clinician time study and supplies no quantitative inter-annotator agreement (point-distance variance, Dice overlap of resulting masks, or per-dataset breakdown) on the eight target collections. This is load-bearing for the central performance claim.

Authors: We agree that quantitative evidence of prompt consistency is important to support the claims. The three-clinician study was designed to measure annotation time rather than inter-annotator agreement. In revision we will add per-dataset mIoU breakdowns to the main results and supplementary material, and we will clarify the prompt standardization protocol used across datasets. If the original annotation logs permit, we will also report point-placement variance and mask-overlap statistics; otherwise we will explicitly note the absence of these metrics as a limitation.
revision: partial

Referee: [Methods / Experiments] No details are given on statistical significance testing, error bars, cross-validation splits, or whether prompt definitions were standardized across the eight datasets, so it is impossible to judge whether the numeric gain is stable or sensitive to post-hoc choices.

Authors: The reported results used the official or commonly adopted splits for each of the eight datasets, but these details, together with error bars and significance testing, were omitted for brevity. In the revised manuscript we will expand the Methods and Experiments sections to: (i) specify the exact train/validation/test splits or cross-validation scheme, (ii) report standard deviations or error bars on all mIoU figures, (iii) include paired statistical tests (e.g., Wilcoxon signed-rank) with p-values, and (iv) document the precise definition and standardization procedure for both extreme-point and major/minor-axis prompts across all datasets.
revision: yes
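As an illustration of the paired test mentioned in the rebuttal, the sketch below computes an exact two-sided Wilcoxon signed-rank test over per-dataset scores by enumerating sign assignments, which is feasible for eight datasets. The scores are invented placeholders, not results from the paper, and the function name is hypothetical.

```python
from itertools import product

def signed_rank_test(baseline, method):
    """Exact two-sided Wilcoxon signed-rank test on paired scores.

    Returns (statistic, p_value), where the statistic is min(W+, W-).
    Zero differences are dropped; tied magnitudes share an average rank.
    """
    diffs = [m - b for b, m in zip(baseline, method) if m != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null every sign pattern is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n

# Hypothetical per-dataset mIoU values (NOT taken from the paper).
baseline_scores = [70.0, 65.0, 80.0, 75.0, 60.0, 72.0, 68.0, 77.0]
method_scores = [73.0, 69.0, 79.0, 80.0, 62.0, 78.0, 75.0, 85.0]
print(signed_rank_test(baseline_scores, method_scores))
```

With only eight paired observations the test has limited resolution, so per-dataset standard deviations or confidence intervals would still be needed alongside the p-value.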

Circularity Check

0 steps flagged

No circularity: empirical architecture and benchmark results are independent of inputs


The paper introduces S4M as an augmentation to SAM via role-specific embeddings and an auxiliary Canvas pretext task, then reports measured mIoU gains on eight external datasets plus a small clinician annotation study. No equations, fitted parameters, or self-citations are presented whose outputs are redefined as predictions or derivations. The +3.42 mIoU figure is an observed performance delta on held-out data, not a quantity forced by construction from the prompt definitions or prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Abstract-only review; the approach rests on standard assumptions of promptable segmentation models and clinical measurement conventions, with no explicit free parameters or invented physical entities listed.

axioms (2)

domain assumption: SAM architecture can be extended with role-specific embeddings while preserving its prompt encoder behavior (implicit in the description of S4M as an augmentation of SAM)
domain assumption: Major/minor axis endpoints are clinically meaningful and consistently identifiable shape descriptors (stated as inspired by ultrasound measurement practice)

invented entities (2)

S4M model (no independent evidence)
purpose: Augmented SAM variant for structured 4-point prompts
New model introduced in the paper

Canvas pretext task (no independent evidence)
purpose: Auxiliary training objective to foster geometry-aware reasoning from prompts
New training component described

pith-pipeline@v0.9.0 · 5793 in / 1403 out tokens · 40048 ms · 2026-05-23T00:19:14.804781+00:00 · methodology
