pith. machine review for the scientific record.

arxiv: 2604.03134 · v2 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot segmentation · medical image segmentation · stable diffusion · diffusion models · support-query interaction · domain generalization · data-efficient learning

The pith

Adapting Stable Diffusion with support-query interaction and visual-to-textual translation enables competitive few-shot medical image segmentation with strong cross-domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large-scale diffusion models contain rich visual priors that can be repurposed for few-shot medical image segmentation to overcome data scarcity and domain shifts. It introduces SD-FSMIS, which adapts the pre-trained Stable Diffusion model by adding a Support-Query Interaction module to handle the few-shot paradigm and a Visual-to-Textual Condition Translator that converts visual cues from support examples into implicit textual embeddings for guiding the generation process. This setup allows the diffusion model to produce accurate segmentations of novel classes from just a few annotated images. Experiments show the method matches state-of-the-art performance in standard settings and performs especially well in more difficult cross-domain tests.

Core claim

SD-FSMIS adapts the conditional generative architecture of Stable Diffusion for few-shot medical image segmentation by introducing Support-Query Interaction to integrate support and query images directly and a Visual-to-Textual Condition Translator to turn support-set visual information into guiding textual embeddings, yielding competitive accuracy in standard benchmarks and notably strong results under cross-domain shifts.

What carries the argument

Support-Query Interaction (SQI) and Visual-to-Textual Condition Translator (VTCT) modules that repurpose Stable Diffusion's pre-trained conditional generation for few-shot segmentation by fusing support-query data and translating visual conditions into textual embeddings.
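The module names map onto a familiar pattern: cross-attention fusion of support and query features, plus a learned projection into the conditioning space. A minimal numerical sketch of that pattern (the shapes, the additive fusion, and the linear translator are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    # query: (Nq, d) query-image tokens; context: (Ns, d) support tokens
    scores = query @ context.T / np.sqrt(d)    # (Nq, Ns) affinity map
    return softmax(scores, axis=-1) @ context  # support-informed tokens

rng = np.random.default_rng(0)
d = 8
query_tokens = rng.standard_normal((16, d))    # hypothetical query latents
support_tokens = rng.standard_normal((4, d))   # hypothetical support latents

# "SQI"-style fusion: inject support information into the query stream.
fused = query_tokens + cross_attention(query_tokens, support_tokens, d)

# "VTCT"-style translation: pool support features and project them into
# the conditioning space that the denoiser's cross-attention consumes
# (12 is an assumed text-embedding dimension, not the paper's).
W_vtct = rng.standard_normal((d, 12))
implicit_text_embedding = support_tokens.mean(axis=0) @ W_vtct

print(fused.shape, implicit_text_embedding.shape)  # (16, 8) (12,)
```

Whether SD-FSMIS fuses additively, via gating, or inside the UNet blocks is exactly the kind of detail the method section, not this sketch, settles.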

If this is right

  • Competitive segmentation accuracy on standard few-shot medical imaging benchmarks.
  • Stronger performance than prior methods when source and target domains differ substantially.
  • Reduced need for large annotated medical datasets by leveraging pre-trained generative priors.
  • A template for adapting other large diffusion models to additional data-scarce imaging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Diffusion-model priors could lower annotation costs for clinical segmentation tools.
  • The same adaptation pattern might apply to 3D volumes or multi-modal scans with minimal changes.
  • Testing on entirely new medical modalities would reveal how far the natural-image priors extend.

Load-bearing premise

The visual priors learned by Stable Diffusion on natural images transfer effectively enough to medical images to support accurate segmentation from only a few examples despite domain differences.

What would settle it

A new cross-domain medical segmentation test set where SD-FSMIS falls substantially below existing state-of-the-art methods would falsify the generalization advantage.

Figures

Figures reproduced from arXiv: 2604.03134 by Hu Qu, Meihua Li, Weizhao He, Yang Zhang, Yisong Li.

Figure 1: Comparison between our proposed method and previous [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

Figure 2: SD-FSMIS overview and training pipeline. Support and query sets are first encoded using the VAE encoder [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

Figure 3: Modified BasicTransformerBlocks architecture. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

Figure 4: Overview of the SD-FSMIS inference process. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

Figure 5: Qualitative comparison between our method and the DiffewS method on the Abd-MRI and Abd-CT datasets. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

Figure 6: Qualitative comparison between our method and the universal-models method on the Abd-MRI and Abd-CT datasets. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]

Figure 7: Visualization of failure cases on the Abd-MRI dataset. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png]

Figure 8: Visualization of the training process.
Original abstract

Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SD-FSMIS, a framework that adapts the pre-trained Stable Diffusion (SD) model for Few-Shot Medical Image Segmentation (FSMIS) by introducing two modules: Support-Query Interaction (SQI) to adapt the conditional generative architecture and Visual-to-Textual Condition Translator (VTCT) to convert support-set visual cues into implicit textual embeddings for guiding the diffusion process. It claims competitive performance versus state-of-the-art FSMIS methods on standard benchmarks together with strong generalization on cross-domain scenarios, attributing the gains to the rich visual priors learned by large-scale diffusion models.

Significance. If the central empirical claims are substantiated, the work would provide evidence that large-scale generative models can serve as effective backbones for data-efficient medical segmentation, particularly under domain shift. The introduction of SQI and VTCT offers a concrete adaptation strategy, but the lack of controls isolating the contribution of the pre-trained priors versus the added modules prevents a clear assessment of novelty or transferability.

major comments (2)
  1. [Experiments] Experiments section: the central claim that SD priors enable competitive results in standard settings and excellent cross-domain generalization rests on an untested assumption. No ablation freezes the pre-trained SD UNet weights while retaining SQI+VTCT, nor compares against an identically structured but randomly initialized backbone with the same modules; without this, it is impossible to attribute performance to the diffusion priors rather than the new conditioning components.
  2. [Method] Method section (SQI and VTCT definitions): the paper introduces SQI and VTCT as key innovations but provides no quantitative isolation of their individual contributions (e.g., performance with SQI alone, VTCT alone, or neither). This omission is load-bearing because the abstract frames the approach as an adaptation of SD priors, yet the added modules could be carrying the reported gains.
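Taken together, the two major comments amount to a small factorial ablation: backbone initialization crossed with module presence. A sketch of the configuration grid being requested (labels are illustrative, not from the paper):

```python
from itertools import product

# Backbone controls: pre-trained SD frozen, pre-trained SD fine-tuned,
# and an identically structured randomly initialized UNet.
backbones = ["sd-pretrained-frozen", "sd-pretrained-finetuned", "random-init"]

# Module controls: both proposed components, each alone, and neither.
modules = [("SQI", "VTCT"), ("SQI",), ("VTCT",), ()]

runs = [{"backbone": b, "modules": m} for b, m in product(backbones, modules)]
print(len(runs))  # 12 configurations, enough to separate prior vs. module effects
```

Reporting all twelve cells on both the standard and cross-domain benchmarks would settle which factor carries the reported gains.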
minor comments (2)
  1. [Abstract] Abstract: the claim of 'competitive results' and 'excellent generalization' is stated without any numerical values, dataset names, or baseline comparisons, reducing the reader's ability to gauge the strength of the empirical evidence before reaching the full experiments.
  2. [Method] Notation: the description of VTCT as translating 'visual cues into an implicit textual embedding' is conceptually clear but lacks a precise equation or diagram reference showing how the embedding is injected into the SD conditioning mechanism.
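For reference, the conditioning pathway such an equation would pin down is the standard latent-diffusion cross-attention (an editorial reconstruction following the latent diffusion formulation; the substitution in the final step is not the paper's notation):

```latex
\operatorname{Attn}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q = W_Q\,\varphi(z_t),\quad K = W_K\,c,\quad V = W_V\,c,
```

with the usual text conditioning $c=\tau_\theta(y)$ replaced by $c=\operatorname{VTCT}(F_s)$, where $F_s$ denotes visual features pooled from the support set. Stating where in the UNet this substitution happens, and whether $W_K, W_V$ are re-trained, is what the minor comment asks for.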

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of empirical validation. We address each major point below and will revise the manuscript to incorporate the suggested ablations.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that SD priors enable competitive results in standard settings and excellent cross-domain generalization rests on an untested assumption. No ablation freezes the pre-trained SD UNet weights while retaining SQI+VTCT, nor compares against an identically structured but randomly initialized backbone with the same modules; without this, it is impossible to attribute performance to the diffusion priors rather than the new conditioning components.

    Authors: We agree that isolating the contribution of the pre-trained SD priors versus the added modules is essential. In the revised manuscript we will add an ablation that freezes the SD UNet weights (keeping SQI and VTCT) and a second control that uses an identically structured but randomly initialized backbone with the same modules. These results will be reported on the standard benchmarks and cross-domain settings to clarify the source of the observed performance. revision: yes

  2. Referee: [Method] Method section (SQI and VTCT definitions): the paper introduces SQI and VTCT as key innovations but provides no quantitative isolation of their individual contributions (e.g., performance with SQI alone, VTCT alone, or neither). This omission is load-bearing because the abstract frames the approach as an adaptation of SD priors, yet the added modules could be carrying the reported gains.

    Authors: We acknowledge that separate ablations of SQI and VTCT are needed to quantify their individual and joint contributions. The revised version will include results for SQI alone, VTCT alone, and the full model on the same datasets, allowing readers to assess the incremental benefit of each component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical adaptation without derivations or self-referential reductions

Full rationale

The paper presents SD-FSMIS as a practical framework that repurposes a pre-trained Stable Diffusion model by adding SQI and VTCT modules for few-shot medical image segmentation. Claims rest on experimental results showing competitive performance and cross-domain generalization rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central premise (rich visual priors from large-scale DMs) is treated as an external starting point justified by the model's established training, not derived internally. This is a standard empirical adaptation paper whose results are independently testable via ablation or replication and therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the transferability of Stable Diffusion priors to medical segmentation via the two new modules; no free parameters are named, but the approach implicitly assumes standard diffusion training objectives and conditioning mechanisms remain effective after adaptation.

axioms (1)
  • domain assumption Pre-trained Stable Diffusion models contain rich visual priors that transfer to medical image segmentation tasks.
    Explicitly stated in the abstract as the foundation for the proposed adaptation.
invented entities (2)
  • Support-Query Interaction (SQI) no independent evidence
    purpose: Adapting Stable Diffusion to the few-shot medical image segmentation paradigm
    New component introduced to enable interaction between support and query images.
  • Visual-to-Textual Condition Translator (VTCT) no independent evidence
    purpose: Translating visual cues from the support set into an implicit textual embedding to guide the diffusion process
    New module to provide conditioning for precise generation in the segmentation task.

pith-pipeline@v0.9.0 · 5549 in / 1317 out tokens · 32179 ms · 2026-05-13T20:59:53.749541+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

  1. [1] Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. SegDiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.

  2. [2] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.

  3. [3] Yuntian Bo, Yazhou Zhu, Lunbo Li, and Haofeng Zhang. FamNet: Frequency-aware matching network for cross-domain few-shot medical image segmentation. arXiv preprint arXiv:2412.09319, 2024.

  4. [4] Victor Ion Butoi, Jose Javier Gonzalez Ortiz, Tianyu Ma, Mert R Sabuncu, John Guttag, and Adrian V Dalca. UniverSeg: Universal medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21438–21451, 2023.

  5. [5] Ziming Cheng, Shidong Wang, Yang Long, Tao Zhou, Haofeng Zhang, and Ling Shao. Dual interspersion and flexible deployment for few-shot medical image segmentation. IEEE Transactions on Medical Imaging, 2025.

  6. [6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

  7. [7] Hao Ding, Changchang Sun, Hao Tang, Dawen Cai, and Yan Yan. Few-shot medical image segmentation with cycle-resemblance attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2488–2497, 2023.

  8. [8] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, page 4, 2018.

  9. [9] Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. Self-support few-shot semantic segmentation. In European Conference on Computer Vision, pages 701–719. Springer, 2022.

  10. [10] Ruiwei Feng, Xiangshang Zheng, Tianxiang Gao, Jintai Chen, Wenzhe Wang, Danny Z Chen, and Jian Wu. Interactive few-shot learning: Limited supervision, better medical image segmentation. IEEE Transactions on Medical Imaging, 40(10):2575–2588, 2021.

  11. [11] Stine Hansen, Srishti Gautam, Robert Jenssen, and Michael Kampffmeyer. Anomaly detection-inspired few-shot medical image segmentation through self-supervision with supervoxels. Medical Image Analysis, 78:102385, 2022.

  12. [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  13. [13] Weizhao He, Yang Zhang, Wei Zhuo, Linlin Shen, Jiaqi Yang, Songhe Deng, and Liang Sun. APSeg: Auto-prompt network for cross-domain few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23762–23772, 2024.

  14. [14] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised semantic correspondence using stable diffusion. Advances in Neural Information Processing Systems, 36:8266–8279, 2023.

  15. [15] Wendong Huang, Jinwu Hu, Junhao Xiao, Yang Wei, Xiuli Bi, and Bin Xiao. Prototype-guided graph reasoning network for few-shot medical image segmentation. IEEE Transactions on Medical Imaging, 2024.

  16. [16] A Emre Kavur, N Sinem Gezer, Mustafa Barış, Sinem Aslan, Pierre-Henri Conze, Vladimir Groza, Duc Duy Pham, Soumick Chatterjee, Philipp Ernst, Savaş Özkan, et al. CHAOS challenge: combined (CT-MR) healthy abdominal organ segmentation. Medical Image Analysis, 69:101950, 2021.

  17. [17] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024.

  18. [18] Bennett Landman, Zhoubing Xu, Juan Igelsias, Martin Styner, Thomas Langerak, and Arno Klein. MICCAI multi-atlas labeling beyond the cranial vault: workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault Workshop Challenge, page 12. Munich, Germany,

  19. [19] Chunbo Lang, Gong Cheng, Binfei Tu, and Junwei Han. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8057–8067, 2022.

  20. [20] Hsin-Ying Lee, Hung-Yu Tseng, and Ming-Hsuan Yang. Exploiting diffusion prior for generalizable dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7861–7871, 2024.

  21. [21] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217,

  22. [22] Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, and Fudan Zheng. Stable diffusion segmentation for biomedical images with single-step reverse process. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 656–666. Springer, 2024.

  23. [23] Yi Lin, Yufan Chen, Kwang-Ting Cheng, and Hao Chen. Few shot medical image segmentation with cross attention transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 233–

  24. [24] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36:47500–47510, 2023.

  25. [25] Xusen Ma, Xiaoqin Wang, Xianxu Hou, Meidan Ding, Zhe Kong, Junliang Chen, and Linlin Shen. OSPA: Enhancing identity-preserving image generation via online self-preference alignment.

  26. [26] Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C Kot, and Shijian Lu. Cross-domain few-shot segmentation via iterative support-query correspondence mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3380–3390, 2024.

  27. [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  28. [28] Cheng Ouyang, Carlo Biffi, Chen Chen, Turkay Kart, Huaqi Qiu, and Daniel Rueckert. Self-supervision with superpixels: Training few-shot medical image segmentation without annotation. In European Conference on Computer Vision, pages 762–780. Springer, 2020.

  29. [29] Bastien Rigaud, Brian M Anderson, H Yu Zhiqian, Maxime Gobeli, Guillaume Cazoulat, Jonas Söderberg, Elin Samuelsson, David Lidberg, Christopher Ward, Nicolette Taku, et al. Automatic segmentation using deep learning to enable online dose optimization during adaptive radiation therapy of cervical cancer. International Journal of Radiation Oncology*Biol...

  30. [30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  31. [31] Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. 'Squeeze & excite' guided few-shot segmentation of volumetric images. Medical Image Analysis, 59:101587, 2020.

  32. [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  33. [33] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.

  34. [34] Qianqian Shen, Yanan Li, Jiyong Jin, and Bin Liu. Q-Net: Query-informed few-shot medical image segmentation. In Proceedings of SAI Intelligent Systems Conference, pages 610–628. Springer, 2023.

  35. [35] Yue Shen, Wanshu Fan, Cong Wang, Wenfei Liu, Wei Wang, Qiang Zhang, and Dongsheng Zhou. Dual-guided prototype alignment network for few-shot medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 2024.

  36. [36] Michael V Sherer, Diana Lin, Sharif Elguindi, Simon Duke, Li-Tee Tan, Jon Cacicedo, Max Dahele, and Erin F Gillespie. Metrics to evaluate the performance of auto-segmentation for radiation treatment planning: A critical review. Radiotherapy and Oncology, 160:185–191, 2021.

  37. [37] Jiapeng Su, Qi Fan, Wenjie Pei, Guangming Lu, and Fanglin Chen. Domain-rectifying adapter for cross-domain few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24036–24045, 2024.

  38. [38] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023.

  39. [39] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2):1050–1065, 2020.

  40. [40] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9197–9206, 2019.

  41. [41] Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, and Linlin Shen. DisFaceRep: Representation disentanglement for co-occurring facial components in weakly supervised face parsing. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 4020–4029, 2025.

  42. [42] Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, and Linlin Shen. FaceBench: A multi-view multi-level facial attribute VQA dataset for benchmarking face perception MLLMs. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9154–9164, 2025.

  43. [43] Hallee E Wong, Jose Javier Gonzalez Ortiz, John Guttag, and Adrian V Dalca. MultiverSeg: Scalable interactive segmentation of biomedical imaging datasets with in-context guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20966–20980, 2025.

  44. [44] Huisi Wu, Fangyan Xiao, and Chongxin Liang. Dual contrastive learning with anatomical auxiliary supervision for few-shot medical image segmentation. In European Conference on Computer Vision, pages 417–434. Springer, 2022.

  45. [45] Junde Wu and Min Xu. One-prompt to segment all medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11302–11312, 2024.

  46. [46] Junde Wu, Rao Fu, Huihui Fang, Yu Zhang, Yehui Yang, Haoyi Xiong, Huiying Liu, and Yanwu Xu. MedSegDiff: Medical image segmentation with diffusion probabilistic model. In Medical Imaging with Deep Learning, pages 1623–1639. PMLR, 2024.

  47. [47] Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, and Yanwu Xu. MedSegDiff-V2: Diffusion-based medical image segmentation with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6030–6038,

  48. [48] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.

  49. [49] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.

  50. [50] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547,

  51. [51] Muzhi Zhu, Yang Liu, Zekai Luo, Chenchen Jing, Hao Chen, Guangkai Xu, Xinlong Wang, and Chunhua Shen. Unleashing the potential of the diffusion model in few-shot semantic segmentation. Advances in Neural Information Processing Systems, 37:42672–42695, 2024.

  52. [52] Yazhou Zhu, Shidong Wang, Tong Xin, and Haofeng Zhang. Few-shot medical image segmentation via a region-enhanced prototypical transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 271–280. Springer, 2023.

  53. [53] Yazhou Zhu, Shidong Wang, Tong Xin, Zheng Zhang, and Haofeng Zhang. Partition-a-Medical-Image: Extracting multiple representative sub-regions for few-shot medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 2024.