pith. sign in

arxiv: 2605.29429 · v1 · pith:FYOUI6YJnew · submitted 2026-05-28 · 💻 cs.CV

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

Pith reviewed 2026-06-29 08:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords cell instance segmentationinteractive segmentationtraining-free methodSegment Anything Modelgroup promptingChain-of-Promptshistopathology
0
0 comments X

The pith

One click per cell type segments all instances of that type using a frozen foundation model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that interactive cell instance segmentation can shift from one prompt per cell to one prompt per cell type. This works because the frozen encoder in the Segment Anything Model already groups cells of the same type in feature space. Chain-of-Prompts exploits this by recursively finding same-type locations through non-parametric gating of multi-scale features and selecting distant reliable points as next prompts. A reader would care because per-cell prompting becomes impractical on histopathology images with hundreds or thousands of cells. If the approach holds, it makes high-accuracy segmentation feasible without training or per-instance effort.

Core claim

The paper claims that Group Prompting via Chain-of-Prompts reduces interactive segmentation cost from per-instance to per-type by exploiting the frozen SAM image encoder's pre-existing clustering of same-type cells. CoP identifies reliable same-type locations through non-parametric gating of multi-scale encoder features and selects the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks this retains over 90 percent of per-instance performance and surpasses fully-supervised methods without training; on four morphologically homogeneous benchmarks a single click retains over 99 percent.

What carries the argument

Chain-of-Prompts (CoP): recursive expansion of one user click by non-parametric gating of multi-scale SAM encoder features to locate same-type cells and select distant reliable points as subsequent prompts.

If this is right

  • User interactions drop from one per cell instance to one per cell type.
  • Performance stays above 90 percent of full per-instance prompting on cell-type-annotated benchmarks.
  • The method exceeds fully supervised models on out-of-distribution cell types with no training.
  • A single click reaches over 99 percent performance on images containing only one morphological cell type.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feature-clustering property could support group prompting for other dense object categories if their embeddings separate by type.
  • Annotation effort for large cell datasets could fall sharply if one click per type replaces exhaustive per-cell labeling.
  • Applying the recursive gating step to other frozen vision encoders would test whether the clustering behavior is specific to SAM or more general.

Load-bearing premise

The frozen image encoder of the Segment Anything Model already clusters same-type cells in its feature space before any prompt is given.

What would settle it

Measure whether, after an initial click on one cell of a new type, the non-parametric gating of multi-scale encoder features correctly identifies other cells of that type while excluding different types.

Figures

Figures reproduced from arXiv: 2605.29429 by Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim, Sanghyun Jo, Seohyung Hong, Seo Jin Lee, Yoorim Gang.

Figure 1
Figure 1. Figure 1: One Click per Cell Type is All You Need. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: From 245 Clicks to 3: Group Prompting. Manual prompting requires one click per instance; our group prompting propagates each click to all same-type instances, reaching 92.7% of the upper bound with 81.7× fewer prompts. Hierarchical Similarity Gating (HSG) combines SAM’s multi-scale features to non-parametrically identify reliable cell points recursively, achieving precision above 96% without any learnable … view at source ↗
Figure 3
Figure 3. Figure 3: Overview of Chain-of-Prompts (CoP). A frozen SAM encoder extracts Fh and Fl once per image. For each user click px (✩), HSG (Sec. 2.1) produces ini￾tial reliable points R(0) via hierarchical similarity and connected-component labeling (CCL). FPR (Sec. 2.2) then expands R(0) by iteratively prompting the farthest uncov￾ered point (◆) until no new points are found. All propagated points per cell type are fina… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison on CoNIC [6]. Fully-supervised methods miss cell pop￾ulations absent from their training set (red dashed boxes), whereas CoP discovers them from a single click per type. UMAP 2 UMAP 1 (a) Fh (Mixed Types) Type 1 (n=16) Type 2 (n=72) Type 3 (n=157) (b) Fl (Distinct Types) Type 1 Type 2 Type 3 [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: UMAP [17] of SAM’s frozen image encoder features at GT instance centroids. The UMAP embeddings are extracted from the input image used in [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
read the original abstract

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Group Prompting and Chain-of-Prompts (CoP), a training-free interactive segmentation framework for cell instance segmentation. It shifts from per-instance O(N) prompting to per-type O(T) by leveraging the claim that SAM's frozen ViT encoder already clusters same-type cells in multi-scale feature space; a single click per type is recursively expanded via non-parametric gating to identify reliable same-type points and selection of the most distant reliable point as the next prompt. On three cell-type-annotated benchmarks CoP retains >90% of per-instance performance and surpasses fully-supervised baselines; on four homogeneous benchmarks a single click retains >99%.

Significance. If the central empirical observation holds, the result is significant: it demonstrates that a pretrained natural-image foundation model can be used for dense histopathology segmentation with minimal interaction cost and no task-specific training or fine-tuning. The training-free design, O(T) scaling, and reported outperformance of supervised methods on multiple benchmarks would be a notable practical advance for annotation-efficient cell analysis.

major comments (3)
  1. [§3] §3 (Method), the non-parametric gating step: the claim that multi-scale SAM encoder features enable reliable same-type identification without any learned parameters or target-data thresholds is load-bearing for the training-free and O(T) claims, yet the manuscript provides no quantitative verification (e.g., intra- vs. inter-type cosine similarity distributions or cluster purity metrics) on the three cell-type benchmarks.
  2. [Experiments] Experiments section, performance tables: the headline retention figures (>90% on cell-type benchmarks, >99% on homogeneous) are presented without error bars, multiple random click seeds, or ablation of the distant-point selection rule; this makes it impossible to determine whether the reported superiority over fully-supervised methods is robust or sensitive to the single-click initialization.
  3. [§4.2] §4.2 (ablation or feature analysis): if a dedicated subsection examines whether the observed clustering is an accident of SAM pretraining or holds across morphological variations, it is not referenced in the abstract or results; without it the generalization argument for out-of-distribution cell types rests on untested extrapolation.
minor comments (2)
  1. [§3] Notation for multi-scale feature gating is introduced without an explicit equation or pseudocode listing the exact non-parametric criterion (e.g., percentile threshold or distance metric).
  2. [Abstract] The project page link is given but the manuscript does not state whether code and exact prompt seeds used for the reported numbers will be released.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive comments. We address each major point below and will revise the manuscript to incorporate additional analyses and robustness checks.

read point-by-point responses
  1. Referee: [§3] §3 (Method), the non-parametric gating step: the claim that multi-scale SAM encoder features enable reliable same-type identification without any learned parameters or target-data thresholds is load-bearing for the training-free and O(T) claims, yet the manuscript provides no quantitative verification (e.g., intra- vs. inter-type cosine similarity distributions or cluster purity metrics) on the three cell-type benchmarks.

    Authors: We agree that explicit quantitative verification of the clustering property would strengthen the load-bearing claim. In the revised manuscript we will add intra- versus inter-type cosine similarity distributions together with cluster purity metrics computed on the three cell-type benchmarks to directly support the non-parametric gating step. revision: yes

  2. Referee: [Experiments] Experiments section, performance tables: the headline retention figures (>90% on cell-type benchmarks, >99% on homogeneous) are presented without error bars, multiple random click seeds, or ablation of the distant-point selection rule; this makes it impossible to determine whether the reported superiority over fully-supervised methods is robust or sensitive to the single-click initialization.

    Authors: We acknowledge that error bars and ablations are needed to assess robustness. The revised experiments section will report standard deviations over multiple random click seeds and will include an ablation of the distant-point selection rule. revision: yes

  3. Referee: [§4.2] §4.2 (ablation or feature analysis): if a dedicated subsection examines whether the observed clustering is an accident of SAM pretraining or holds across morphological variations, it is not referenced in the abstract or results; without it the generalization argument for out-of-distribution cell types rests on untested extrapolation.

    Authors: The manuscript does not contain a dedicated subsection on the origin of the clustering. Generalization is supported by consistent empirical results across diverse benchmarks. We will add a concise discussion paragraph in the results or conclusion addressing the contribution of SAM pretraining and the empirical evidence for generalization across morphological variations. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external frozen encoder properties

full rationale

The paper's central procedure (Chain-of-Prompts) is explicitly training-free and applies non-parametric gating directly to multi-scale features from the frozen SAM ViT encoder, without fitting any parameters or thresholds to the target cell datasets. Performance retention claims (>90% on cell-type benchmarks) are presented as empirical results on external benchmarks rather than quantities derived by construction from the same data. No equations reduce a prediction to a fitted input, no self-citations bear the load of uniqueness or ansatz, and the key observation about feature clustering is stated as an empirical property of the external pretrained model, not a self-referential definition. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all details on implementation choices remain unavailable.

pith-pipeline@v0.9.1-grok · 5786 in / 1174 out tokens · 32342 ms · 2026-06-29T08:18:43.413436+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Nature Methods22(3), 579–591 (2025)

    Archit, A., Freckmann, L., Nair, S., Pape, C., et al.: Segment anything for microscopy. Nature Methods22(3), 579–591 (2025). https://doi.org/10.1038/ s41592-024-02580-4

  2. [2]

    In: ICLR (2026)

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. In: ICLR (2026)

  3. [3]

    In: MICCAI

    Chen, P., Zhu, C., Shui, Z., Cai, J., Zheng, S., Zhang, S., Yang, L.: Exploring unsupervised cell recognition with prior self-activation maps. In: MICCAI. pp. 559–568. Springer Nature Switzerland, Cham (2023)

  4. [4]

    IJCV88(2), 303–338 (2010) 10 S

    Everingham,M.,VanGool,L.,Williams,C.K.,Winn,J.,Zisserman,A.:Thepascal visual object classes (VOC) challenge. IJCV88(2), 303–338 (2010) 10 S. Jo et al

  5. [5]

    https://doi.org/10.1109/TMI.2021.3113172

    Graham, S., Vu, Q.D., Jahanifar, M., Abraham, A., Durr, N.J., Rajpoot, N., Raza, S.E.A.: A dataset for prostate cancer semantic segmentation and gland detection fromwholeslideimages.IEEETransactionsonMedicalImaging40(12),3923–3933 (2021). https://doi.org/10.1109/TMI.2021.3113172

  6. [6]

    MedIA 91, 103049 (2024)

    Graham, S., Vu, Q.D., Raza, S.E.A., Rajpoot, N., et al.: Conic challenge: Pushing the frontiers of nuclear detection, segmentation, classification and counting. MedIA 91, 103049 (2024). https://doi.org/10.1016/j.media.2023.103049

  7. [7]

    In: MICCAI

    Huang, H., He, H., Xu, L., Zhu, X., Feng, S., Fu, G.: Ca-sam2: Sam2-based context- aware network with auto-prompting for nuclei instance segmentation. In: MICCAI. pp. 86–95. Springer Nature Switzerland (2025)

  8. [8]

    In: MICCAI

    Huang, L., Liang, Y., Liu, J.: DES-SAM: Distillation-Enhanced Semantic SAM for Cervical Nuclear Segmentation with Box Annotation . In: MICCAI. vol. LNCS 15009. Springer Nature Switzerland (October 2024)

  9. [9]

    MedIA94, 103143 (2024)

    Hörst, F., Rempe, M., Heine, L., Seibold, C., Keyl, J., Baldini, G., Ugurel, S., Siveke, J., Grünwald, B., Egger, J., Kleesiek, J.: Cellvit: Vision transformers for precise cell segmentation and classification. MedIA94, 103143 (2024). https://doi. org/https://doi.org/10.1016/j.media.2024.103143

  10. [10]

    In: CVPR (2026)

    Jiang, Q., Huo, J., Chen, X., Xiong, Y., Zeng, Z., Chen, Y., Ren, T., Yu, J., Zhang, L.: Detect anything via next point prediction. In: CVPR (2026)

  11. [11]

    Jo, S., Lee, S.J., Lee, S., Hong, S., Seo, H., Kim, K.: Coin: Confidence score-guided distillationforannotation-freecellsegmentation.In:ICCV.pp.20324–20335(2025)

  12. [12]

    In: ICCV

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp. 4015–4026 (2023)

  13. [13]

    TMI39(5), 1380–1391 (2020)

    Kumar, N., Verma, R., Anand, D., Zhou, Y., Onder, O.F., Tsougenis, E., Chen, H., Heng, P.A., Li, J., Hu, Z., Wang, Y., Koohbanani, N.A., Jahanifar, M., Tajeddin, N.Z., Gooya, A., Rajpoot, N., Ren, X., Zhou, S., Wang, Q., Shen, D., Yang, C.K., Weng, C.H., Yu, W.H., Yeh, C.Y., Yang, S., Xu, S., Yeung, P.H., Sun, P., Mahbod, A., Schaefer, G., Ellinger, I., E...

  14. [14]

    In: ECCV

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)

  15. [15]

    In: ECCV

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: ECCV. pp. 38–55. Springer (2024)

  16. [16]

    Computers in Biology and Medicine 132, 104349 (2021)

    Mahbod, A., Schaefer, G., Bancher, B., Löw, C., Dorffner, G., Ecker, R., Ellinger, I.: Cryonuseg: A dataset for nuclei instance segmentation of cryosec- tioned h&e-stained histological images. Computers in Biology and Medicine 132, 104349 (2021). https://doi.org/https://doi.org/10.1016/j.compbiomed.2021. 104349, https://www.sciencedirect.com/science/artic...

  17. [17]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018) Chain-of-Prompts 11

  18. [18]

    IEEE Transactions on Medical Imaging38(2), 448–459 (2018)

    Naylor, P., Laé, M., Reyal, F., Walter, T.: Segmentation of nuclei in histopathology images by deep regression of the distance map. IEEE Transactions on Medical Imaging38(2), 448–459 (2018)

  19. [19]

    In: ICLR (2025)

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. In: ICLR (2025)

  20. [20]

    In: MICCAI

    Sahasrabudhe, M., Christodoulidis, S., Salgado, R., Michiels, S., Loi, S., André, F., Paragios, N., Vakalopoulou, M.: Self-supervised nuclei segmentation in histopatho- logical images using attention. In: MICCAI. pp. 393–402. Springer (2020)

  21. [21]

    MedIA35, 489–502 (2017)

    Sirinukunwattana, K., Pluim, J.P., Chen, H., Qi, X., Heng, P.A., Guo, Y.B., Wang, L.Y.,Matuszewski,B.J.,Bruni,E.,Sanchez,U.,etal.:Glandsegmentationincolon histology images: The GlaS challenge contest. MedIA35, 489–502 (2017)

  22. [22]

    Nature Methods22(3), 592–599 (2025)

    Stringer, C., Pachitariu, M.: Cellpose3: one-click image restoration for improved cellular segmentation. Nature Methods22(3), 592–599 (2025). https://doi.org/10. 1038/s41592-025-02595-5

  23. [23]

    Fron- tiers in Bioengineering and BiotechnologyV olume 7 - 2019(2019)

    Vu, Q.D., Graham, S., Kurc, T., To, M.N.N., Shaban, M., Qaiser, T., Koohbanani, N.A., Khurram, S.A., Kalpathy-Cramer, J., Zhao, T., Gupta, R., Kwak, J.T., Rajpoot, N., Saltz, J., Farahani, K.: Methods for seg- mentation and classification of digital microscopy tissue images. Fron- tiers in Bioengineering and BiotechnologyV olume 7 - 2019(2019). https://do...

  24. [24]

    In: ICCV

    Wang, A., Liu, L., Chen, H., Lin, Z., Han, J., Ding, G.: Yoloe: Real-time seeing anything. In: ICCV. pp. 24591–24602 (2025)