S4M: 4-points to Segment Anything
Pith reviewed 2026-05-23 00:19 UTC · model grok-4.3
The pith
S4M augments SAM to treat four points as relational shape cues rather than isolated clicks for more accurate medical segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S4M augments the Segment Anything Model (SAM) in two ways: it expands the prompt space with role-specific embeddings, and it adds an auxiliary Canvas pretext task that sketches coarse masks directly from prompts. Together these let the model interpret four points as relational cues rather than isolated clicks, producing more accurate instance segmentations on medical images with overlapping anatomy and blurred boundaries.
What carries the argument
Role-specific embeddings for the four points together with the Canvas auxiliary task that sketches coarse masks from prompts to foster geometry-aware reasoning.
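To make the idea of role-specific embeddings concrete, here is a minimal sketch. It assumes each point token is a positional encoding of the click location plus a vector looked up by the point's role. The role names, the toy embedding size, and the functions `positional_encoding`, `make_role_table`, and `embed_prompt` are all hypothetical; the paper's actual design may differ.

```python
import math
import random

EMBED_DIM = 8  # toy size chosen for illustration only

# Hypothetical role names for the four major/minor axis endpoints.
ROLES = ("major_a", "major_b", "minor_a", "minor_b")


def positional_encoding(x, y, dim=EMBED_DIM):
    """Sinusoidal encoding of normalized (x, y) in [0, 1].

    A stand-in for the positional encoding of a prompt encoder.
    """
    enc = []
    for k in range(dim // 4):
        freq = (2.0 ** k) * math.pi
        enc += [math.sin(freq * x), math.cos(freq * x),
                math.sin(freq * y), math.cos(freq * y)]
    return enc


def make_role_table(roles=ROLES, dim=EMBED_DIM, seed=0):
    """One vector per role. Random here; learned in a real model."""
    rng = random.Random(seed)
    return {role: [rng.gauss(0.0, 1.0) for _ in range(dim)] for role in roles}


def embed_prompt(points, role_table):
    """points: list of (role, x, y). Returns one token per point.

    Each token is the positional encoding plus the role vector, so two
    clicks at the same location with different roles get different tokens.
    """
    tokens = []
    for role, x, y in points:
        pos = positional_encoding(x, y)
        tokens.append([p + r for p, r in zip(pos, role_table[role])])
    return tokens
```

With a single shared embedding for all foreground clicks, the model has no direct signal for which click is which endpoint. A per-role table is one simple way to supply that signal.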
If this is right
- S4M achieves a 3.42 mIoU improvement over a strong SAM baseline at equal prompt budget across eight ultrasound and surgical endoscopy datasets.
- In an annotation study with three clinicians, major and minor axis endpoint prompts enable faster annotation.
- The four-point strategy aligns prompting with existing clinical measurement practices in ultrasound.
- Higher efficiency at equal accuracy supports more scalable development of segmentation datasets in medical imaging.
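For readers unfamiliar with the metric behind the headline number, the following is a minimal sketch of mean IoU. It assumes mIoU is the average of per-instance IoU scores reported on a 0 to 100 scale; the paper's exact averaging scheme is not stated here. The function names `iou` and `mean_iou` are illustrative.

```python
def iou(pred, target):
    """IoU of two binary masks given as equal-sized nested lists of 0/1."""
    inter = 0
    union = 0
    for prow, trow in zip(pred, target):
        for p, t in zip(prow, trow):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match (an assumption).
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (pred, target) pairs, scaled to 0-100."""
    scores = [iou(p, t) for p, t in pairs]
    return 100.0 * sum(scores) / len(scores)
```

Under this reading, a +3.42 gain means the average overlap score rises by 3.42 points on that 0 to 100 scale.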
Where Pith is reading between the lines
- The same relational-prompt design could be tested on non-medical images where objects have clear elongated shapes or standard measurement conventions.
- If the Canvas task proves effective, similar auxiliary objectives might help other prompt-based foundation models handle structured multi-point inputs.
- Faster annotation workflows could allow clinical teams to label larger and more varied datasets without proportional increases in expert time.
Load-bearing premise
Clinicians can identify the endpoints of major and minor axes consistently with low inter-annotator variability across diverse medical images without introducing new sources of error.
What would settle it
A measurement showing high disagreement among clinicians when asked to mark major and minor axis endpoints on the same set of ultrasound or endoscopy images would indicate the prompting strategy cannot be applied reliably in practice.
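One way such a measurement could be computed is sketched below: the mean distance between matching points placed by different annotators on the same image. This is an illustrative metric, not one taken from the paper. The function name `pairwise_point_disagreement` is hypothetical, and the sketch assumes each annotator's four points are already listed in the same role order (the two endpoints of an axis can be swapped, so a real analysis would need to resolve that first).

```python
import math
from itertools import combinations


def pairwise_point_disagreement(annotations):
    """Mean Euclidean distance between matching points across annotators.

    annotations: one list of (x, y) points per annotator, all in the
    same role order. Distances are in pixels.
    """
    dists = []
    for a, b in combinations(annotations, 2):
        for (xa, ya), (xb, yb) in zip(a, b):
            dists.append(math.hypot(xa - xb, ya - yb))
    return sum(dists) / len(dists)
```

A large value relative to the size of the structure would suggest the endpoints are not being placed consistently.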
Original abstract
Purpose: The Segment Anything Model (SAM) promises to ease the annotation bottleneck in medical segmentation, but overlapping anatomy and blurred boundaries make its point prompts ambiguous, leading to cycles of manual refinement to achieve precise masks. Better prompting strategies are needed.
Methods: We propose a structured prompting strategy using 4 points as a compact instance-level shape description. We study two 4-point variants: extreme points and the proposed major/minor axis endpoints, inspired by ultrasound measurement practice. SAM cannot fully exploit such structured prompts because it treats all points identically and lacks geometry-aware reasoning. To address this, we introduce S4M (4-points to Segment Anything), which augments SAM to interpret 4 points as relational cues rather than isolated clicks. S4M expands the prompt space with role-specific embeddings and adds an auxiliary "Canvas" pretext task that sketches coarse masks directly from prompts, fostering geometry-aware reasoning.
Results: Across eight datasets in ultrasound and surgical endoscopy, S4M improves segmentation by +3.42 mIoU over a strong SAM baseline at equal prompt budget. An annotation study with three clinicians further shows that major/minor prompts enable faster annotation.
Conclusion: S4M increases performance, reduces annotation effort, and aligns prompting with clinical practice, enabling more scalable dataset development in medical imaging. We release our code and pretrained models at https://github.com/CAMMA-public/S4M.
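The two 4-point variants can be illustrated by deriving them from a ground-truth mask, as one might when simulating prompts for training. The sketch below is an assumption-laden illustration: extreme points are taken as the leftmost, rightmost, topmost, and bottommost foreground pixels, and the major/minor axes are approximated with a principal component analysis of the pixel coordinates. The PCA choice is ours, not the paper's stated definition, and the names `foreground`, `extreme_points`, and `axis_endpoints` are hypothetical.

```python
import math


def foreground(mask):
    """List of (x, y) coordinates of nonzero pixels in a nested-list mask."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def extreme_points(mask):
    """Leftmost, rightmost, topmost, and bottommost foreground pixels."""
    pts = foreground(mask)
    return {
        "left": min(pts, key=lambda p: p[0]),
        "right": max(pts, key=lambda p: p[0]),
        "top": min(pts, key=lambda p: p[1]),
        "bottom": max(pts, key=lambda p: p[1]),
    }


def axis_endpoints(mask):
    """Approximate major/minor axis endpoints via PCA of pixel coordinates.

    Returns the foreground pixels lying farthest along each principal
    direction. This approximates, but is not identical to, the endpoints
    of axes drawn through the centroid.
    """
    pts = foreground(mask)
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    cxx = sum((x - mx) ** 2 for x, _ in pts) / n
    cyy = sum((y - my) ** 2 for _, y in pts) / n
    cxy = sum((x - mx) * (y - my) for x, y in pts) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    major = (math.cos(theta), math.sin(theta))
    minor = (-major[1], major[0])  # perpendicular direction

    def ends(direction):
        def proj(p):
            return (p[0] - mx) * direction[0] + (p[1] - my) * direction[1]
        return min(pts, key=proj), max(pts, key=proj)

    return {"major": ends(major), "minor": ends(minor)}
```

Extreme points depend on the image axes, whereas axis endpoints follow the object's own orientation, which is closer to how calipers are placed in ultrasound measurement.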
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S4M, an augmentation to SAM that incorporates role-specific embeddings for 4-point prompts (extreme points or major/minor axis endpoints) plus an auxiliary Canvas pretext task to enable geometry-aware reasoning. It claims a +3.42 mIoU gain over a strong SAM baseline across eight ultrasound and endoscopy datasets at equal prompt budget, plus faster annotation in a three-clinician study, arguing that the structured prompts align with clinical practice and reduce refinement cycles.
Significance. If the reported gains prove robust and the prompting strategy reliable across annotators, the work could meaningfully lower the annotation burden in medical imaging by replacing ambiguous point clicks with clinically standard measurements. Code and model release strengthens reproducibility.
major comments (2)
- [Abstract / Results] The headline +3.42 mIoU improvement and the faster-annotation claim both depend on clinicians being able to place major/minor axis endpoints consistently; the manuscript reports only a three-clinician time study and supplies no quantitative inter-annotator agreement (point-distance variance, Dice overlap of resulting masks, or per-dataset breakdown) on the eight target collections. This is load-bearing for the central performance claim.
- [Methods / Experiments] No details are given on statistical significance testing, error bars, cross-validation splits, or whether prompt definitions were standardized across the eight datasets, so it is impossible to judge whether the numeric gain is stable or sensitive to post-hoc choices.
minor comments (1)
- [Methods] The Canvas pretext task is described only at a high level; a short diagram or pseudocode would clarify how the coarse-mask output is supervised and how it interacts with the role-specific embeddings.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and indicate planned revisions to strengthen the manuscript.
- Referee: [Abstract / Results] The headline +3.42 mIoU improvement and the faster-annotation claim both depend on clinicians being able to place major/minor axis endpoints consistently; the manuscript reports only a three-clinician time study and supplies no quantitative inter-annotator agreement (point-distance variance, Dice overlap of resulting masks, or per-dataset breakdown) on the eight target collections. This is load-bearing for the central performance claim.
  Authors: We agree that quantitative evidence of prompt consistency is important to support the claims. The three-clinician study was designed to measure annotation time rather than inter-annotator agreement. In the revision we will add per-dataset mIoU breakdowns to the main results and supplementary material, and we will clarify the prompt standardization protocol used across datasets. If the original annotation logs permit, we will also report point-placement variance and mask-overlap statistics; otherwise we will explicitly note the absence of these metrics as a limitation.
  Revision: partial
- Referee: [Methods / Experiments] No details are given on statistical significance testing, error bars, cross-validation splits, or whether prompt definitions were standardized across the eight datasets, so it is impossible to judge whether the numeric gain is stable or sensitive to post-hoc choices.
  Authors: The reported results used the official or commonly adopted splits for each of the eight datasets, but these details, together with error bars and significance testing, were omitted for brevity. In the revised manuscript we will expand the Methods and Experiments sections to: (i) specify the exact train/validation/test splits or cross-validation scheme, (ii) report standard deviations or error bars on all mIoU figures, (iii) include paired statistical tests (e.g., Wilcoxon signed-rank) with p-values, and (iv) document the precise definition and standardization procedure for both extreme-point and major/minor-axis prompts across all datasets.
  Revision: yes
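As an illustration of the paired test mentioned in item (iii), here is a minimal sketch of an exact two-sided Wilcoxon signed-rank test written with the standard library only. It assumes the inputs are paired per-dataset scores for the baseline and the method; the function name `signed_rank_test` is hypothetical, and exact enumeration is only practical for small samples such as eight datasets.

```python
from itertools import product


def signed_rank_test(baseline, method):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Returns (W+, p_value). Zero differences are dropped and tied
    absolute differences receive their average rank.
    """
    diffs = [m - b for b, m in zip(baseline, method) if m != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = abs(w_plus - total / 2.0)

    # Under the null, every sign pattern is equally likely; count the
    # patterns at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2.0) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)
```

With eight paired datasets and no ties or zero differences, the smallest attainable two-sided p-value is 2/256, so the test has limited resolution at this sample size.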
Circularity Check
No circularity: empirical architecture and benchmark results are independent of inputs
The paper introduces S4M as an augmentation to SAM via role-specific embeddings and an auxiliary Canvas pretext task, then reports measured mIoU gains on eight external datasets plus a small clinician annotation study. No equations, fitted parameters, or self-citations are presented whose outputs are redefined as predictions or derivations. The +3.42 mIoU figure is an observed performance delta on held-out data, not a quantity forced by construction from the prompt definitions or prior author work. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: SAM architecture can be extended with role-specific embeddings while preserving its prompt encoder behavior
- domain assumption: Major/minor axis endpoints are clinically meaningful and consistently identifiable shape descriptors
invented entities (2)
- S4M model: no independent evidence
- Canvas pretext task: no independent evidence