Prompting Diffusion Models for Zero-Shot Instance Segmentation
Pith reviewed 2026-06-26 10:40 UTC · model grok-4.3
The pith
Prompt2Seg conditions diffusion segmentation models on spatial prompts to achieve zero-shot instance segmentation on unseen objects and domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch that accepts spatial prompts represented as 2D Gaussians or confidence maps as explicit input signals. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. It consistently outperforms the underlying diffusion segmentation backbone across seven datasets that range from standard benchmarks to paintings, egocentric views, and X-ray data.
What carries the argument
The conditioning branch added to the frozen diffusion segmentation model that receives spatial prompts (2D Gaussians or confidence maps) and trains the model to follow them directly.
If this is right
- Spatial prompts influence the segmentation output directly rather than only in post-processing.
- Zero-shot transfer occurs to object categories and visual domains absent from the fine-tuning data.
- Consistent gains appear over the unfine-tuned diffusion backbone on every benchmark tested.
- Interactive segmentation becomes possible without large-scale real-world mask supervision.
Where Pith is reading between the lines
- The limited synthetic training data may indicate that only a small number of categories are needed to unlock broad generalization when diffusion priors are already present.
- The same conditioning approach could be tested on other generative architectures for segmentation or related tasks.
- Extending the evaluation to additional modalities such as infrared or depth-only images would probe how far the zero-shot behavior reaches.
Load-bearing premise
Fine-tuning the conditioning branch on a deliberately constrained set of synthetic object categories is sufficient to produce reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data.
What would settle it
A result in which Prompt2Seg fails to outperform the base diffusion model on any one of the seven reported datasets or produces clearly incorrect segments on a domain such as medical or satellite imagery outside the tested set.
Figures
read the original abstract
Several disruptive research directions have recently emerged in computer vision, including foundation models achieving previously unseen zero-shot performance in scene understanding, even interactively, and generative models that synthesize extremely realistic images. The latter have also been shown to be highly effective in scene understanding tasks thanks to their rich priors. However, for promptable segmentation, foundation models struggle with accurately segmenting an object's region, leading to false positives and over-segmentation. Notably, early attempts that leverage generative priors use prompts only during post-processing, yielding suboptimal segments because the process is agnostic to the user input. In this paper, we target these limitations with Prompt2Seg, a spatial conditioning framework for diffusion-based segmentation. Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch. Our approach takes spatial prompts, represented as 2D Gaussians or confidence maps, as explicit input signals, training the model to respond directly to user intent. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. We evaluate on seven datasets ranging from standard benchmarks to more challenging domains, including paintings, egocentric views, and X-ray data. Furthermore, we demonstrate that Prompt2Seg consistently outperforms the underlying diffusion segmentation backbone across all benchmarks. Our results suggest that the rich priors encoded in generative pretraining, combined with principled spatial conditioning, offer a compelling path toward broadly generalizing interactive segmentation without large-scale mask supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Prompt2Seg, a spatial conditioning framework that augments a frozen diffusion-based segmentation model with an additional branch accepting explicit spatial prompts (2D Gaussians or confidence maps). The model is fine-tuned only on a constrained set of object categories from the synthetic datasets Hypersim and Virtual KITTI 2, after which it is claimed to generalize zero-shot to unseen object categories and to out-of-distribution visual domains (paintings, egocentric views, X-ray). It is further claimed to outperform the underlying diffusion backbone consistently across seven evaluation datasets, demonstrating that generative pretraining priors plus principled spatial conditioning can enable broadly generalizing interactive segmentation without large-scale mask supervision.
Significance. If the reported zero-shot transfer holds under rigorous controls, the result would be significant for promptable segmentation: it would show that limited synthetic supervision on a narrow category set can unlock the rich priors already present in diffusion models for interactive use across domains whose appearance statistics differ sharply from the fine-tuning distribution, thereby reducing reliance on large annotated mask corpora.
major comments (3)
- [Abstract and Experiments section] The central claim that fine-tuning on constrained synthetic categories from Hypersim and Virtual KITTI 2 produces reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data is load-bearing for the entire contribution. The abstract and method description provide no quantitative evidence (mIoU, boundary F-measure, or domain-gap metrics with error bars) demonstrating that the domain shift is closed rather than that the frozen backbone already encodes useful structure for those domains; without such numbers or ablations isolating the conditioning branch's contribution, the generalization assertion cannot be evaluated.
- [Method section (conditioning branch and training protocol)] The paper states that the diffusion backbone is frozen while only the conditioning branch is trained. No details are supplied on the loss used for the branch, the precise form of the spatial prompt encoding, or whether any domain-randomization or style-transfer augmentations were applied during the synthetic fine-tuning. These omissions make it impossible to determine whether the reported cross-domain performance arises from the proposed architecture or from incidental robustness already present in the pretrained diffusion model.
- [Experiments and Results section] Evaluation is described on seven datasets spanning standard benchmarks to challenging domains, yet the abstract supplies neither per-dataset scores, comparison tables against the frozen backbone, nor statistics on the number of unseen categories or images per domain. Without these concrete results, the repeated claim of “consistent outperformance across all benchmarks” cannot be verified as robust rather than post-hoc or selective.
minor comments (2)
- [Abstract] The abstract refers to “seven datasets” without naming them or indicating which are held-out; a table listing dataset names, domains, and number of images would improve clarity.
- [Method] Notation for the spatial prompt (2D Gaussians vs. confidence maps) is introduced without an accompanying equation or diagram showing how these inputs are injected into the diffusion U-Net; a small figure or equation would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and indicate revisions where the presentation can be strengthened.
read point-by-point responses
-
Referee: [Abstract and Experiments section] The central claim that fine-tuning on constrained synthetic categories from Hypersim and Virtual KITTI 2 produces reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data is load-bearing for the entire contribution. The abstract and method description provide no quantitative evidence (mIoU, boundary F-measure, or domain-gap metrics with error bars) demonstrating that the domain shift is closed rather than that the frozen backbone already encodes useful structure for those domains; without such numbers or ablations isolating the conditioning branch's contribution, the generalization assertion cannot be evaluated.
Authors: The Experiments section contains quantitative tables reporting mIoU and boundary F-measure on all seven datasets, with direct comparisons to the frozen backbone and standard deviations across runs. These results isolate the contribution of the conditioning branch through controlled ablations. We agree the abstract would be stronger with explicit summary numbers; we will revise it to report average gains and reference the full tables. revision: partial
-
Referee: [Method section (conditioning branch and training protocol)] The paper states that the diffusion backbone is frozen while only the conditioning branch is trained. No details are supplied on the loss used for the branch, the precise form of the spatial prompt encoding, or whether any domain-randomization or style-transfer augmentations were applied during the synthetic fine-tuning. These omissions make it impossible to determine whether the reported cross-domain performance arises from the proposed architecture or from incidental robustness already present in the pretrained diffusion model.
Authors: We will expand the method section to specify that training uses the standard diffusion denoising loss applied exclusively to the conditioning branch parameters. Spatial prompts are encoded by a lightweight convolutional network whose output is concatenated with timestep embeddings before cross-attention layers in the U-Net. No domain-randomization or style-transfer augmentations were applied. These details, plus a supplementary architecture diagram, will be added. revision: yes
-
Referee: [Experiments and Results section] Evaluation is described on seven datasets spanning standard benchmarks to challenging domains, yet the abstract supplies neither per-dataset scores, comparison tables against the frozen backbone, nor statistics on the number of unseen categories or images per domain. Without these concrete results, the repeated claim of “consistent outperformance across all benchmarks” cannot be verified as robust rather than post-hoc or selective.
Authors: The Experiments section already includes per-dataset tables with mIoU scores for Prompt2Seg versus the backbone, notes that every test category is unseen during fine-tuning, and reports image counts per domain. We will revise the abstract to include representative per-domain metrics and the range of improvements to make these results immediately visible. revision: yes
Circularity Check
No circularity: empirical claims rest on cross-dataset evaluation, not self-referential definitions or fitted predictions.
full rationale
The paper describes an empirical architecture (frozen diffusion backbone plus added conditioning branch) that is fine-tuned on synthetic categories from Hypersim and Virtual KITTI 2 and then evaluated zero-shot on seven held-out datasets. No equations, uniqueness theorems, or self-citations are invoked to derive performance; the central assertions are supported by direct benchmark comparisons rather than any quantity that reduces to the training inputs by construction. The generalization claim is therefore falsifiable by the reported experiments and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Diffusion models pre-trained for image generation encode rich priors that are useful for segmentation tasks when properly conditioned.
Reference graph
Works this paper leans on
-
[1]
Segment anything,
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo,et al., “Segment anything,” inICCV, pp. 4015–4026, 2023
2023
-
[2]
gen2seg: Generative models enable generalizable instance segmentation,
O. Khangaonkar and H. Pirsiavash, “gen2seg: Generative models enable generalizable instance segmentation,” inICLR, 2026
2026
-
[3]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inCVPR, pp. 10684–10695, 2022
2022
-
[4]
Repurposing diffusion-based image generators for monocular depth estimation,
B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inCVPR, pp. 9492–9502, 2024
2024
-
[5]
Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,
X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inECCV, pp. 241–258, Springer, 2024
2024
-
[6]
Stablenormal: Reducing diffusion variance for stable and sharp normal,
C. Ye, L. Qiu, X. Gu, Q. Zuo, Y . Wu, Z. Dong, L. Bo, Y . Xiu, and X. Han, “Stablenormal: Reducing diffusion variance for stable and sharp normal,”ACM Transactions on Graphics (ToG), vol. 43, no. 6, pp. 1–18, 2024
2024
-
[7]
Diffpose: Toward more reliable 3d pose estimation,
J. Gong, L. G. Foo, Z. Fan, Q. Ke, H. Rahmani, and J. Liu, “Diffpose: Toward more reliable 3d pose estimation,” inCVPR, pp. 13041–13051, 2023
2023
-
[8]
Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,
J. Wang, C. Rupprecht, and D. Novotny, “Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,” inICCV, pp. 9773–9783, 2023
2023
-
[9]
Flowdiffuser: Advancing optical flow estimation with diffusion models,
A. Luo, X. Li, F. Yang, J. Liu, H. Fan, and S. Liu, “Flowdiffuser: Advancing optical flow estimation with diffusion models,” inCVPR, pp. 19167–19176, 2024
2024
-
[10]
Llamaseg: Image segmentation via autoregressive mask generation,
J. Deng, T. Weng, T. Yang, W. Luo, Z. Li, and W. Jiang, “Llamaseg: Image segmentation via autoregressive mask generation,”arXiv preprint arXiv:2505.19422, 2025
arXiv 2025
-
[11]
Gs: Generative segmentation via label diffusion,
Y . Chen, S. Chen, L. Lin, and G. Wang, “Gs: Generative segmentation via label diffusion,” arXiv preprint arXiv:2508.20020, 2025
arXiv 2025
-
[12]
Generative pretraining from pixels,
M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” inInternational conference on machine learning, pp. 1691–1703, PMLR, 2020
2020
-
[13]
Masked autoencoders are scalable vision learners,
K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022
2022
-
[14]
Label-efficient semantic segmentation with diffusion models,
D. Baranchuk, I. Rubachev, A. V oynov, V . Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” inICLR, 2022
2022
-
[15]
Unleashing text-to-image diffusion models for visual perception,
W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” inICCV, pp. 5729–5739, 2023
2023
-
[16]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”NeurIPS, vol. 33, pp. 6840–6851, 2020. 10
2020
-
[17]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inICCV, pp. 3836–3847, 2023
2023
-
[18]
Exploring pre-trained text-to-video diffusion models for referring video object segmentation,
Z. Zhu, X. Feng, D. Chen, J. Yuan, C. Qiao, and G. Hua, “Exploring pre-trained text-to-video diffusion models for referring video object segmentation,” inECCV, pp. 452–469, Springer, 2024
2024
-
[19]
Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,
Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,”NeurIPS, vol. 36, pp. 76872–76892, 2023
2023
-
[20]
Diffusionseg: Adapting diffusion towards unsupervised object discovery,
C. Ma, Y . Yang, C. Ju, F. Zhang, J. Liu, Y . Wang, Y . Zhang, and Y . Wang, “Diffusionseg: Adapting diffusion towards unsupervised object discovery,”arXiv preprint arXiv:2303.09813, 2023
arXiv 2023
-
[21]
Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,
W. Wu, Y . Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” inICCV, pp. 1206– 1217, 2023
2023
-
[22]
Exploring phrase-level grounding with text-to-image diffusion model,
D. Yang, R. Dong, J. Ji, Y . Ma, H. Wang, X. Sun, and R. Ji, “Exploring phrase-level grounding with text-to-image diffusion model,” inECCV, pp. 161–180, Springer, 2024
2024
-
[23]
Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,
X. Liu, S. Huang, Y . Kang, H. Chen, and D. Wang, “Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,” inICASSP 2024, pp. 2765–2769, IEEE, 2024
2024
-
[24]
Ref-diff: Zero-shot referring image segmentation with generative models,
M. Ni, Y . Zhang, K. Feng, X. Li, Y . Guo, and W. Zuo, “Ref-diff: Zero-shot referring image segmentation with generative models,”arXiv preprint arXiv:2308.16777, 2023
arXiv 2023
-
[25]
Simpleclick: Interactive image segmentation with simple vision transformers,
Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer, “Simpleclick: Interactive image segmentation with simple vision transformers,” inICCV, pp. 22290–22300, 2023
2023
-
[26]
Sam 2: Segment anything in images and videos,
N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson,et al., “Sam 2: Segment anything in images and videos,” inICLR, 2025
2025
-
[27]
Grounded sam: Assembling open-world models for diverse visual tasks,
T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y . Chen, F. Yan,et al., “Grounded sam: Assembling open-world models for diverse visual tasks,”arXiv preprint arXiv:2401.14159, 2024
Pith/arXiv arXiv 2024
-
[28]
Sam 3: Segment anything with concepts,
N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang,et al., “Sam 3: Segment anything with concepts,”ICLR, 2026
2026
-
[29]
Image generators are generalist vision learners,
V . Gabeur, S. Long, S. Peng, P. V oigtlaender, S. Sun, Y . Bao, K. Truong, Z. Wang, W. Zhou, J. T. Barron,et al., “Image generators are generalist vision learners,”arXiv preprint arXiv:2604.20329, 2026
Pith/arXiv arXiv 2026
-
[30]
Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,
M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” inICCV, pp. 10912–10922, 2021
2021
-
[31]
Y . Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,”arXiv preprint arXiv:2001.10773, 2020
Pith/arXiv arXiv 2001
-
[32]
Microsoft coco: Common objects in context,
T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inECCV, pp. 740–755, Springer, 2014
2014
-
[33]
Semantic segmentation in art paintings,
N. Cohen, Y . Newman, and A. Shamir, “Semantic segmentation in art paintings,” inComputer graphics forum, vol. 41, pp. 261–275, Wiley Online Library, 2022
2022
-
[34]
Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,
L. Zhang, S. Zhou, S. Stent, and J. Shi, “Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,” inECCV, pp. 127–145, Springer, 2022
2022
-
[35]
Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,
L. Zhang, L. Jiang, R. Ji, and H. Fan, “Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,”IJCV, vol. 131, no. 12, pp. 3170–3192, 2023. 11
2023
-
[36]
The pascal visual object classes (voc) challenge,
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,”International Journal of Computer Vision, vol. 88, pp. 303–338, June 2010
2010
-
[37]
Towards high-resolution salient object detection,
Y . Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” inICCV, pp. 7234–7243, 2019
2019
-
[38]
Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,
D. Bashkirova, M. Abdelfattah, Z. Zhu, J. Akl, F. Alladkani, P. Hu, V . Ablavsky, B. Calli, S. A. Bargal, and K. Saenko, “Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,” inCVPR, pp. 21147–21157, 2022
2022
-
[39]
Rba: Segmenting unknown regions rejected by all,
N. Nayal, M. Yavuz, J. F. Henriques, and F. Güney, “Rba: Segmenting unknown regions rejected by all,” inICCV, pp. 711–722, 2023
2023
-
[40]
Masked-attention mask transformer for universal image segmentation,
B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” inCVPR, pp. 1290–1299, 2022
2022
-
[41]
Detecting the unexpected via image resynthesis,
K. Lis, K. Nakka, P. Fua, and M. Salzmann, “Detecting the unexpected via image resynthesis,” inICCV, pp. 2152–2161, 2019. 12 A Implementation Details Backbone and conditioning branch.We initialize the segmentation backbone from the official Gen2Seg Stable-Diffusion checkpoint [2]. The V AE, CLIP text encoder, and denoising U-Net are kept frozen throughout...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.