pith. machine review for the scientific record.

arxiv: 2604.03134 · v2 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot segmentation · medical image segmentation · stable diffusion · diffusion models · support-query interaction · domain generalization · data-efficient learning

The pith

Adapting Stable Diffusion with support-query interaction and visual-to-textual translation enables competitive few-shot medical image segmentation with strong cross-domain generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large-scale diffusion models contain rich visual priors that can be repurposed for few-shot medical image segmentation to overcome data scarcity and domain shifts. It introduces SD-FSMIS, which adapts the pre-trained Stable Diffusion model by adding a Support-Query Interaction module to handle the few-shot paradigm and a Visual-to-Textual Condition Translator that converts visual cues from support examples into implicit textual embeddings for guiding the generation process. This setup allows the diffusion model to produce accurate segmentations of novel classes from just a few annotated images. Experiments show the method matches state-of-the-art performance in standard settings and performs especially well in more difficult cross-domain tests.

Core claim

SD-FSMIS adapts the conditional generative architecture of Stable Diffusion for few-shot medical image segmentation by introducing Support-Query Interaction to integrate support and query images directly and a Visual-to-Textual Condition Translator to turn support-set visual information into guiding textual embeddings, yielding competitive accuracy in standard benchmarks and notably strong results under cross-domain shifts.

What carries the argument

Support-Query Interaction (SQI) and Visual-to-Textual Condition Translator (VTCT) modules that repurpose Stable Diffusion's pre-trained conditional generation for few-shot segmentation by fusing support-query data and translating visual conditions into textual embeddings.
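The module names map onto a familiar pattern: cross-attention fusion of support and query features, plus a learned projection into the conditioning space. A minimal numerical sketch of that pattern (the shapes, the additive fusion, and the linear translator are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d):
    # query: (Nq, d) query-image tokens; context: (Ns, d) support tokens
    scores = query @ context.T / np.sqrt(d)    # (Nq, Ns) affinity map
    return softmax(scores, axis=-1) @ context  # support-informed tokens

rng = np.random.default_rng(0)
d = 8
query_tokens = rng.standard_normal((16, d))    # hypothetical query latents
support_tokens = rng.standard_normal((4, d))   # hypothetical support latents

# "SQI"-style fusion: inject support information into the query stream.
fused = query_tokens + cross_attention(query_tokens, support_tokens, d)

# "VTCT"-style translation: pool support features and project them into
# the conditioning space that the denoiser's cross-attention consumes
# (12 is an assumed text-embedding dimension, not the paper's).
W_vtct = rng.standard_normal((d, 12))
implicit_text_embedding = support_tokens.mean(axis=0) @ W_vtct

print(fused.shape, implicit_text_embedding.shape)  # (16, 8) (12,)
```

Whether SD-FSMIS fuses additively, via gating, or inside the UNet blocks is exactly the kind of detail the method section, not this sketch, settles.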

If this is right

  • Competitive segmentation accuracy on standard few-shot medical imaging benchmarks.
  • Stronger performance than prior methods when source and target domains differ substantially.
  • Reduced need for large annotated medical datasets by leveraging pre-trained generative priors.
  • A template for adapting other large diffusion models to additional data-scarce imaging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Diffusion-model priors could lower annotation costs for clinical segmentation tools.
  • The same adaptation pattern might apply to 3D volumes or multi-modal scans with minimal changes.
  • Testing on entirely new medical modalities would reveal how far the natural-image priors extend.

Load-bearing premise

The visual priors learned by Stable Diffusion on natural images transfer effectively enough to medical images to support accurate segmentation from only a few examples despite domain differences.

What would settle it

A new cross-domain medical segmentation test set where SD-FSMIS falls substantially below existing state-of-the-art methods would falsify the generalization advantage.

Figures

Figures reproduced from arXiv: 2604.03134 by Hu Qu, Meihua Li, Weizhao He, Yang Zhang, Yisong Li.

Figure 1: Comparison between our proposed method and previous [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

Figure 2: SD-FSMIS overview and training pipeline. Support and query sets are first encoded using the VAE encoder [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

Figure 3: Modified BasicTransformerBlocks architecture. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

Figure 4: Overview of the SD-FSMIS inference process. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

Figure 5: Qualitative comparison between our method and the DiffewS method on the Abd-MRI and Abd-CT datasets. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

Figure 6: Qualitative comparison between our method and the universal-models method on the Abd-MRI and Abd-CT datasets. [PITH_FULL_IMAGE:figures/full_fig_p011_6.png]

Figure 7: Visualization of failure cases on the Abd-MRI dataset. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png]

Figure 8: Visualization of the training process.
Original abstract

Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SD-FSMIS, a framework that adapts the pre-trained Stable Diffusion (SD) model for Few-Shot Medical Image Segmentation (FSMIS) by introducing two modules: Support-Query Interaction (SQI) to adapt the conditional generative architecture and Visual-to-Textual Condition Translator (VTCT) to convert support-set visual cues into implicit textual embeddings for guiding the diffusion process. It claims competitive performance versus state-of-the-art FSMIS methods on standard benchmarks together with strong generalization on cross-domain scenarios, attributing the gains to the rich visual priors learned by large-scale diffusion models.

Significance. If the central empirical claims are substantiated, the work would provide evidence that large-scale generative models can serve as effective backbones for data-efficient medical segmentation, particularly under domain shift. The introduction of SQI and VTCT offers a concrete adaptation strategy, but the lack of controls isolating the contribution of the pre-trained priors versus the added modules prevents a clear assessment of novelty or transferability.

major comments (2)
  1. [Experiments] Experiments section: the central claim that SD priors enable competitive results in standard settings and excellent cross-domain generalization rests on an untested assumption. No ablation freezes the pre-trained SD UNet weights while retaining SQI+VTCT, nor compares against an identically structured but randomly initialized backbone with the same modules; without this, it is impossible to attribute performance to the diffusion priors rather than the new conditioning components.
  2. [Method] Method section (SQI and VTCT definitions): the paper introduces SQI and VTCT as key innovations but provides no quantitative isolation of their individual contributions (e.g., performance with SQI alone, VTCT alone, or neither). This omission is load-bearing because the abstract frames the approach as an adaptation of SD priors, yet the added modules could be carrying the reported gains.
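Taken together, the two major comments amount to a small factorial ablation: backbone initialization crossed with module presence. A sketch of the configuration grid being requested (labels are illustrative, not from the paper):

```python
from itertools import product

# Backbone controls: pre-trained SD frozen, pre-trained SD fine-tuned,
# and an identically structured randomly initialized UNet.
backbones = ["sd-pretrained-frozen", "sd-pretrained-finetuned", "random-init"]

# Module controls: both proposed components, each alone, and neither.
modules = [("SQI", "VTCT"), ("SQI",), ("VTCT",), ()]

runs = [{"backbone": b, "modules": m} for b, m in product(backbones, modules)]
print(len(runs))  # 12 configurations, enough to separate prior vs. module effects
```

Reporting all twelve cells on both the standard and cross-domain benchmarks would settle which factor carries the reported gains.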
minor comments (2)
  1. [Abstract] Abstract: the claim of 'competitive results' and 'excellent generalization' is stated without any numerical values, dataset names, or baseline comparisons, reducing the reader's ability to gauge the strength of the empirical evidence before reaching the full experiments.
  2. [Method] Notation: the description of VTCT as translating 'visual cues into an implicit textual embedding' is conceptually clear but lacks a precise equation or diagram reference showing how the embedding is injected into the SD conditioning mechanism.
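For reference, the conditioning pathway such an equation would pin down is the standard latent-diffusion cross-attention (an editorial reconstruction following the latent diffusion formulation; the substitution in the final step is not the paper's notation):

```latex
\operatorname{Attn}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q = W_Q\,\varphi(z_t),\quad K = W_K\,c,\quad V = W_V\,c,
```

with the usual text conditioning $c=\tau_\theta(y)$ replaced by $c=\operatorname{VTCT}(F_s)$, where $F_s$ denotes visual features pooled from the support set. Stating where in the UNet this substitution happens, and whether $W_K, W_V$ are re-trained, is what the minor comment asks for.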

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of empirical validation. We address each major point below and will revise the manuscript to incorporate the suggested ablations.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim that SD priors enable competitive results in standard settings and excellent cross-domain generalization rests on an untested assumption. No ablation freezes the pre-trained SD UNet weights while retaining SQI+VTCT, nor compares against an identically structured but randomly initialized backbone with the same modules; without this, it is impossible to attribute performance to the diffusion priors rather than the new conditioning components.

    Authors: We agree that isolating the contribution of the pre-trained SD priors versus the added modules is essential. In the revised manuscript we will add an ablation that freezes the SD UNet weights (keeping SQI and VTCT) and a second control that uses an identically structured but randomly initialized backbone with the same modules. These results will be reported on the standard benchmarks and cross-domain settings to clarify the source of the observed performance. revision: yes

  2. Referee: [Method] Method section (SQI and VTCT definitions): the paper introduces SQI and VTCT as key innovations but provides no quantitative isolation of their individual contributions (e.g., performance with SQI alone, VTCT alone, or neither). This omission is load-bearing because the abstract frames the approach as an adaptation of SD priors, yet the added modules could be carrying the reported gains.

    Authors: We acknowledge that separate ablations of SQI and VTCT are needed to quantify their individual and joint contributions. The revised version will include results for SQI alone, VTCT alone, and the full model on the same datasets, allowing readers to assess the incremental benefit of each component. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical adaptation without derivations or self-referential reductions

Full rationale

The paper presents SD-FSMIS as a practical framework that repurposes a pre-trained Stable Diffusion model by adding SQI and VTCT modules for few-shot medical image segmentation. Claims rest on experimental results showing competitive performance and cross-domain generalization rather than any mathematical derivation chain. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central premise (rich visual priors from large-scale DMs) is treated as an external starting point justified by the model's established training, not derived internally. This is a standard empirical adaptation paper whose results are independently testable via ablation or replication and therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the transferability of Stable Diffusion priors to medical segmentation via the two new modules; no free parameters are named, but the approach implicitly assumes standard diffusion training objectives and conditioning mechanisms remain effective after adaptation.

axioms (1)
  • domain assumption Pre-trained Stable Diffusion models contain rich visual priors that transfer to medical image segmentation tasks.
    Explicitly stated in the abstract as the foundation for the proposed adaptation.
invented entities (2)
  • Support-Query Interaction (SQI) no independent evidence
    purpose: Adapting Stable Diffusion to the few-shot medical image segmentation paradigm
    New component introduced to enable interaction between support and query images.
  • Visual-to-Textual Condition Translator (VTCT) no independent evidence
    purpose: Translating visual cues from the support set into an implicit textual embedding to guide the diffusion process
    New module to provide conditioning for precise generation in the segmentation task.

pith-pipeline@v0.9.0 · 5549 in / 1317 out tokens · 32179 ms · 2026-05-13T20:59:53.749541+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

  1. [1] Tomer Amit, Tal Shaharbany, Eliya Nachmani, and Lior Wolf. SegDiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.

  2. [2] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov, Valentin Khrulkov, and Artem Babenko. Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126, 2021.

  3. [3] Yuntian Bo, Yazhou Zhu, Lunbo Li, and Haofeng Zhang. FamNet: Frequency-aware matching network for cross-domain few-shot medical image segmentation. arXiv preprint arXiv:2412.09319, 2024.

  4. [4] Victor Ion Butoi, Jose Javier Gonzalez Ortiz, Tianyu Ma, Mert R Sabuncu, John Guttag, and Adrian V Dalca. UniverSeg: Universal medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21438–21451, 2023.

  5. [5] Ziming Cheng, Shidong Wang, Yang Long, Tao Zhou, Haofeng Zhang, and Ling Shao. Dual interspersion and flexible deployment for few-shot medical image segmentation. IEEE Transactions on Medical Imaging, 2025.

  6. [6] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

  7. [7] Hao Ding, Changchang Sun, Hao Tang, Dawen Cai, and Yan Yan. Few-shot medical image segmentation with cycle-resemblance attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2488–2497, 2023.

  8. [8] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, page 4, 2018.

  9. [9] Qi Fan, Wenjie Pei, Yu-Wing Tai, and Chi-Keung Tang. Self-support few-shot semantic segmentation. In European Conference on Computer Vision, pages 701–719. Springer, 2022.

  10. [10] Ruiwei Feng, Xiangshang Zheng, Tianxiang Gao, Jintai Chen, Wenzhe Wang, Danny Z Chen, and Jian Wu. Interactive few-shot learning: Limited supervision, better medical image segmentation. IEEE Transactions on Medical Imaging, 40(10):2575–2588, 2021.

  11. [11] Stine Hansen, Srishti Gautam, Robert Jenssen, and Michael Kampffmeyer. Anomaly detection-inspired few-shot medical image segmentation through self-supervision with supervoxels. Medical Image Analysis, 78:102385, 2022.

  12. [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  13. [13] Weizhao He, Yang Zhang, Wei Zhuo, Linlin Shen, Jiaqi Yang, Songhe Deng, and Liang Sun. APSeg: Auto-prompt network for cross-domain few-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23762–23772, 2024.

  14. [14] Eric Hedlin, Gopal Sharma, Shweta Mahajan, Hossam Isack, Abhishek Kar, Andrea Tagliasacchi, and Kwang Moo Yi. Unsupervised semantic correspondence using stable diffusion. Advances in Neural Information Processing Systems, 36:8266–8279, 2023.

  15. [15] Wendong Huang, Jinwu Hu, Junhao Xiao, Yang Wei, Xiuli Bi, and Bin Xiao. Prototype-guided graph reasoning network for few-shot medical image segmentation. IEEE Transactions on Medical Imaging, 2024.

  16. [16] A Emre Kavur, N Sinem Gezer, Mustafa Barış, Sinem Aslan, Pierre-Henri Conze, Vladimir Groza, Duc Duy Pham, Soumick Chatterjee, Philipp Ernst, Savaş Özkan, et al. CHAOS challenge: combined (CT-MR) healthy abdominal organ segmentation. Medical Image Analysis, 69:101950, 2021.

  17. [17] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024.

  18. [18] Bennett Landman, Zhoubing Xu, Juan Igelsias, Martin Styner, Thomas Langerak, and Arno Klein. MICCAI multi-atlas labeling beyond the cranial vault: workshop and challenge. In Proc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault Workshop Challenge, page 12. Munich, Germany,

  19. [19] Chunbo Lang, Gong Cheng, Binfei Tu, and Junwei Han. Learning what not to segment: A new perspective on few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8057–8067, 2022.

  20. [20] Hsin-Ying Lee, Hung-Yu Tseng, and Ming-Hsuan Yang. Exploiting diffusion prior for generalizable dense prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7861–7871, 2024.

  21. [21] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, and Deepak Pathak. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2206–2217,

  22. [22] Tianyu Lin, Zhiguang Chen, Zhonghao Yan, Weijiang Yu, and Fudan Zheng. Stable diffusion segmentation for biomedical images with single-step reverse process. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 656–666. Springer, 2024.

  23. [23] Yi Lin, Yufan Chen, Kwang-Ting Cheng, and Hao Chen. Few shot medical image segmentation with cross attention transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 233–

  24. [24] Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, and Trevor Darrell. Diffusion hyperfeatures: Searching through time and space for semantic correspondence. Advances in Neural Information Processing Systems, 36:47500–47510, 2023.

  25. [25] Xusen Ma, Xiaoqin Wang, Xianxu Hou, Meidan Ding, Zhe Kong, Junliang Chen, and Linlin Shen. OSPA: Enhancing identity-preserving image generation via online self-preference alignment.

  26. [26] Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C Kot, and Shijian Lu. Cross-domain few-shot segmentation via iterative support-query correspondence mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3380–3390, 2024.

  27. [27] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  28. [28] Cheng Ouyang, Carlo Biffi, Chen Chen, Turkay Kart, Huaqi Qiu, and Daniel Rueckert. Self-supervision with superpixels: Training few-shot medical image segmentation without annotation. In European Conference on Computer Vision, pages 762–780. Springer, 2020.

  29. [29] Bastien Rigaud, Brian M Anderson, H Yu Zhiqian, Maxime Gobeli, Guillaume Cazoulat, Jonas Söderberg, Elin Samuelsson, David Lidberg, Christopher Ward, Nicolette Taku, et al. Automatic segmentation using deep learning to enable online dose optimization during adaptive radiation therapy of cervical cancer. International Journal of Radiation Oncology*Biol...

  30. [30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  31. [31] Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. 'Squeeze & excite' guided few-shot segmentation of volumetric images. Medical Image Analysis, 59:101587, 2020.

  32. [32] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  33. [33] Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.

  34. [34] Qianqian Shen, Yanan Li, Jiyong Jin, and Bin Liu. Q-Net: Query-informed few-shot medical image segmentation. In Proceedings of SAI Intelligent Systems Conference, pages 610–628. Springer, 2023.

  35. [35] Yue Shen, Wanshu Fan, Cong Wang, Wenfei Liu, Wei Wang, Qiang Zhang, and Dongsheng Zhou. Dual-guided prototype alignment network for few-shot medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 2024.

  36. [36] Michael V Sherer, Diana Lin, Sharif Elguindi, Simon Duke, Li-Tee Tan, Jon Cacicedo, Max Dahele, and Erin F Gillespie. Metrics to evaluate the performance of auto-segmentation for radiation treatment planning: A critical review. Radiotherapy and Oncology, 160:185–191, 2021.

  37. [37] Jiapeng Su, Qi Fan, Wenjie Pei, Guangming Lu, and Fanglin Chen. Domain-rectifying adapter for cross-domain few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24036–24045, 2024.

  38. [38] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. Advances in Neural Information Processing Systems, 36:1363–1389, 2023.

  39. [39] Zhuotao Tian, Hengshuang Zhao, Michelle Shu, Zhicheng Yang, Ruiyu Li, and Jiaya Jia. Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(2):1050–1065, 2020.

  40. [40] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9197–9206, 2019.

  41. [41] Xiaoqin Wang, Xianxu Hou, Meidan Ding, Junliang Chen, Kaijun Deng, Jinheng Xie, and Linlin Shen. DisFaceRep: Representation disentanglement for co-occurring facial components in weakly supervised face parsing. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 4020–4029, 2025.

  42. [42] Xiaoqin Wang, Xusen Ma, Xianxu Hou, Meidan Ding, Yudong Li, Junliang Chen, Wenting Chen, Xiaoyang Peng, and Linlin Shen. FaceBench: A multi-view multi-level facial attribute VQA dataset for benchmarking face perception MLLMs. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9154–9164, 2025.

  43. [43] Hallee E Wong, Jose Javier Gonzalez Ortiz, John Guttag, and Adrian V Dalca. MultiverSeg: Scalable interactive segmentation of biomedical imaging datasets with in-context guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20966–20980, 2025.

  44. [44] Huisi Wu, Fangyan Xiao, and Chongxin Liang. Dual contrastive learning with anatomical auxiliary supervision for few-shot medical image segmentation. In European Conference on Computer Vision, pages 417–434. Springer, 2022.

  45. [45] Junde Wu and Min Xu. One-prompt to segment all medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11302–11312, 2024.

  46. [46] Junde Wu, Rao Fu, Huihui Fang, Yu Zhang, Yehui Yang, Haoyi Xiong, Huiying Liu, and Yanwu Xu. MedSegDiff: Medical image segmentation with diffusion probabilistic model. In Medical Imaging with Deep Learning, pages 1623–1639. PMLR, 2024.

  47. [47] Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, and Yanwu Xu. MedSegDiff-V2: Diffusion-based medical image segmentation with transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6030–6038,

  48. [48] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.

  49. [49] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, and Shalini De Mello. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.

  50. [50] Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36:45533–45547,

  51. [51] Muzhi Zhu, Yang Liu, Zekai Luo, Chenchen Jing, Hao Chen, Guangkai Xu, Xinlong Wang, and Chunhua Shen. Unleashing the potential of the diffusion model in few-shot semantic segmentation. Advances in Neural Information Processing Systems, 37:42672–42695, 2024.

  52. [52] Yazhou Zhu, Shidong Wang, Tong Xin, and Haofeng Zhang. Few-shot medical image segmentation via a region-enhanced prototypical transformer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 271–280. Springer, 2023.

  53. [53] Yazhou Zhu, Shidong Wang, Tong Xin, Zheng Zhang, and Haofeng Zhang. Partition-a-Medical-Image: Extracting multiple representative sub-regions for few-shot medical image segmentation. IEEE Transactions on Instrumentation and Measurement, 2024.