Prompting Diffusion Models for Zero-Shot Instance Segmentation

Andrea Ramazzina; Federico Tombari; Irem Zeynep Alag\"oz; Nassir Navab; Nils Morbitzer; Stefano Gasperini

arxiv: 2606.22660 · v1 · pith:IRDVJZ4Nnew · submitted 2026-06-21 · 💻 cs.CV

Prompting Diffusion Models for Zero-Shot Instance Segmentation

Irem Zeynep Alag\"oz , Nils Morbitzer , Andrea Ramazzina , Nassir Navab , Federico Tombari , Stefano Gasperini This is my paper

Pith reviewed 2026-06-26 10:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords zero-shot instance segmentationdiffusion modelsspatial promptinggenerative priorsinteractive segmentationsynthetic data

0 comments

The pith

Prompt2Seg conditions diffusion segmentation models on spatial prompts to achieve zero-shot instance segmentation on unseen objects and domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Prompt2Seg as a way to fix inaccurate object boundaries and over-segmentation in promptable segmentation tasks. It adds a conditioning branch to a frozen diffusion segmentation model so that spatial prompts, given as 2D Gaussians or confidence maps, become direct inputs during generation. The branch is fine-tuned only on a small set of object categories from two synthetic datasets. After this limited training the model produces accurate segments for many new object types and visual styles it never saw, including paintings, egocentric video, and X-ray images. It beats the original diffusion backbone on every one of seven test datasets.

Core claim

Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch that accepts spatial prompts represented as 2D Gaussians or confidence maps as explicit input signals. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. It consistently outperforms the underlying diffusion segmentation backbone across seven datasets that range from standard benchmarks to paintings, egocentric views, and X-ray data.

What carries the argument

The conditioning branch added to the frozen diffusion segmentation model that receives spatial prompts (2D Gaussians or confidence maps) and trains the model to follow them directly.

If this is right

Spatial prompts influence the segmentation output directly rather than only in post-processing.
Zero-shot transfer occurs to object categories and visual domains absent from the fine-tuning data.
Consistent gains appear over the unfine-tuned diffusion backbone on every benchmark tested.
Interactive segmentation becomes possible without large-scale real-world mask supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The limited synthetic training data may indicate that only a small number of categories are needed to unlock broad generalization when diffusion priors are already present.
The same conditioning approach could be tested on other generative architectures for segmentation or related tasks.
Extending the evaluation to additional modalities such as infrared or depth-only images would probe how far the zero-shot behavior reaches.

Load-bearing premise

Fine-tuning the conditioning branch on a deliberately constrained set of synthetic object categories is sufficient to produce reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data.

What would settle it

A result in which Prompt2Seg fails to outperform the base diffusion model on any one of the seven reported datasets or produces clearly incorrect segments on a domain such as medical or satellite imagery outside the tested set.

Figures

Figures reproduced from arXiv: 2606.22660 by Andrea Ramazzina, Federico Tombari, Irem Zeynep Alag\"oz, Nassir Navab, Nils Morbitzer, Stefano Gasperini.

**Figure 1.** Figure 1: While discriminative-based models [1] and prior work leveraging diffusion priors [2] struggle to segment objects with strong pixel gradients, leading to over-segmentation, our proposed Prompt2Seg conditions the diffusion process on the user input using spatial prompts and delivers semantic-aware, interactive segmentation. The scores in the plot are from our method. Abstract Several disruptive research dire… view at source ↗

**Figure 2.** Figure 2: Our Prompt2Seg framework. Prompt2Seg segments any instance by guiding the diffusion process with a ControlNet conditioned on the user prompts as Gaussian maps. reformulates language-driven segmentation as a label diffusion process, and LlamaSeg [10] treats it as autoregressive mask token generation conditioned on language instructions. Most closely related to our work, Gen2Seg [2] finetunes Stable Diffusio… view at source ↗

**Figure 3.** Figure 3: Qualitative Results from various datasets (left) comparing Prompt2Seg with SAM and Gen2Seg. SAM masks are binary, unlike the others, which output features, shown here masked. 4.2 Qualitative Results [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results on RoadAnomaly [41]. Conditioned on the source uncertainty map, P2Seg suppresses noisy background responses while preserving the true anomalous objects. 4.3 Beyond Sparse Spatial Prompts [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Additional qualitative failure cases. For each example, the left image shows the input [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

read the original abstract

Several disruptive research directions have recently emerged in computer vision, including foundation models achieving previously unseen zero-shot performance in scene understanding, even interactively, and generative models that synthesize extremely realistic images. The latter have also been shown to be highly effective in scene understanding tasks thanks to their rich priors. However, for promptable segmentation, foundation models struggle with accurately segmenting an object's region, leading to false positives and over-segmentation. Notably, early attempts that leverage generative priors use prompts only during post-processing, yielding suboptimal segments because the process is agnostic to the user input. In this paper, we target these limitations with Prompt2Seg, a spatial conditioning framework for diffusion-based segmentation. Prompt2Seg augments a frozen diffusion segmentation model with a conditioning branch. Our approach takes spatial prompts, represented as 2D Gaussians or confidence maps, as explicit input signals, training the model to respond directly to user intent. Fine-tuned on a deliberately constrained set of object categories drawn from Hypersim and Virtual KITTI 2, Prompt2Seg generalizes zero-shot to a wide range of unseen object types and visual domains. We evaluate on seven datasets ranging from standard benchmarks to more challenging domains, including paintings, egocentric views, and X-ray data. Furthermore, we demonstrate that Prompt2Seg consistently outperforms the underlying diffusion segmentation backbone across all benchmarks. Our results suggest that the rich priors encoded in generative pretraining, combined with principled spatial conditioning, offer a compelling path toward broadly generalizing interactive segmentation without large-scale mask supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Prompt2Seg adds an explicit conditioning branch to steer a frozen diffusion segmentation model with spatial prompts, but the zero-shot generalization from narrow synthetic training to X-ray and artistic domains rests on unshown results.

read the letter

The main contribution is a trainable branch that ingests spatial prompts (2D Gaussians or confidence maps) directly into the diffusion process rather than applying them only afterward. This is a clean technical move that addresses the post-processing limitation the authors flag in prior generative-prior work.

They freeze the diffusion backbone and fine-tune the new branch on a deliberately small set of categories from Hypersim and Virtual KITTI 2, then report zero-shot transfer across seven datasets that include paintings, egocentric footage, and X-ray images. The abstract states consistent outperformance over the backbone on all of them.

The approach is coherent on its own terms and the conditioning mechanism looks like a reasonable way to make the generative prior more prompt-responsive. That part earns credit.

The soft spot is the generalization claim. Training on constrained synthetic categories and expecting reliable behavior on real photographs, art, and especially X-ray data is a large leap. Because the backbone stays frozen, any domain robustness must come from the original pretraining plus this limited fine-tuning; the abstract supplies no numbers, error bars, or controls to show the gap is actually closed. The stress-test concern about domain shift therefore lands.

This paper is for groups working on interactive segmentation or ways to steer generative models without large mask datasets. If the full manuscript contains proper quantitative tables, ablations on the conditioning branch, and checks against domain shift, it deserves a serious referee. Based on the abstract alone the evidence is too thin to judge the central result.

Referee Report

3 major / 2 minor

Summary. The paper introduces Prompt2Seg, a spatial conditioning framework that augments a frozen diffusion-based segmentation model with an additional branch accepting explicit spatial prompts (2D Gaussians or confidence maps). The model is fine-tuned only on a constrained set of object categories from the synthetic datasets Hypersim and Virtual KITTI 2, after which it is claimed to generalize zero-shot to unseen object categories and to out-of-distribution visual domains (paintings, egocentric views, X-ray). It is further claimed to outperform the underlying diffusion backbone consistently across seven evaluation datasets, demonstrating that generative pretraining priors plus principled spatial conditioning can enable broadly generalizing interactive segmentation without large-scale mask supervision.

Significance. If the reported zero-shot transfer holds under rigorous controls, the result would be significant for promptable segmentation: it would show that limited synthetic supervision on a narrow category set can unlock the rich priors already present in diffusion models for interactive use across domains whose appearance statistics differ sharply from the fine-tuning distribution, thereby reducing reliance on large annotated mask corpora.

major comments (3)

[Abstract and Experiments section] The central claim that fine-tuning on constrained synthetic categories from Hypersim and Virtual KITTI 2 produces reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data is load-bearing for the entire contribution. The abstract and method description provide no quantitative evidence (mIoU, boundary F-measure, or domain-gap metrics with error bars) demonstrating that the domain shift is closed rather than that the frozen backbone already encodes useful structure for those domains; without such numbers or ablations isolating the conditioning branch's contribution, the generalization assertion cannot be evaluated.
[Method section (conditioning branch and training protocol)] The paper states that the diffusion backbone is frozen while only the conditioning branch is trained. No details are supplied on the loss used for the branch, the precise form of the spatial prompt encoding, or whether any domain-randomization or style-transfer augmentations were applied during the synthetic fine-tuning. These omissions make it impossible to determine whether the reported cross-domain performance arises from the proposed architecture or from incidental robustness already present in the pretrained diffusion model.
[Experiments and Results section] Evaluation is described on seven datasets spanning standard benchmarks to challenging domains, yet the abstract supplies neither per-dataset scores, comparison tables against the frozen backbone, nor statistics on the number of unseen categories or images per domain. Without these concrete results, the repeated claim of “consistent outperformance across all benchmarks” cannot be verified as robust rather than post-hoc or selective.

minor comments (2)

[Abstract] The abstract refers to “seven datasets” without naming them or indicating which are held-out; a table listing dataset names, domains, and number of images would improve clarity.
[Method] Notation for the spatial prompt (2D Gaussians vs. confidence maps) is introduced without an accompanying equation or diagram showing how these inputs are injected into the diffusion U-Net; a small figure or equation would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications from the manuscript and indicate revisions where the presentation can be strengthened.

read point-by-point responses

Referee: [Abstract and Experiments section] The central claim that fine-tuning on constrained synthetic categories from Hypersim and Virtual KITTI 2 produces reliable zero-shot generalization to real photographs, paintings, egocentric views, and X-ray data is load-bearing for the entire contribution. The abstract and method description provide no quantitative evidence (mIoU, boundary F-measure, or domain-gap metrics with error bars) demonstrating that the domain shift is closed rather than that the frozen backbone already encodes useful structure for those domains; without such numbers or ablations isolating the conditioning branch's contribution, the generalization assertion cannot be evaluated.

Authors: The Experiments section contains quantitative tables reporting mIoU and boundary F-measure on all seven datasets, with direct comparisons to the frozen backbone and standard deviations across runs. These results isolate the contribution of the conditioning branch through controlled ablations. We agree the abstract would be stronger with explicit summary numbers; we will revise it to report average gains and reference the full tables. revision: partial
Referee: [Method section (conditioning branch and training protocol)] The paper states that the diffusion backbone is frozen while only the conditioning branch is trained. No details are supplied on the loss used for the branch, the precise form of the spatial prompt encoding, or whether any domain-randomization or style-transfer augmentations were applied during the synthetic fine-tuning. These omissions make it impossible to determine whether the reported cross-domain performance arises from the proposed architecture or from incidental robustness already present in the pretrained diffusion model.

Authors: We will expand the method section to specify that training uses the standard diffusion denoising loss applied exclusively to the conditioning branch parameters. Spatial prompts are encoded by a lightweight convolutional network whose output is concatenated with timestep embeddings before cross-attention layers in the U-Net. No domain-randomization or style-transfer augmentations were applied. These details, plus a supplementary architecture diagram, will be added. revision: yes
Referee: [Experiments and Results section] Evaluation is described on seven datasets spanning standard benchmarks to challenging domains, yet the abstract supplies neither per-dataset scores, comparison tables against the frozen backbone, nor statistics on the number of unseen categories or images per domain. Without these concrete results, the repeated claim of “consistent outperformance across all benchmarks” cannot be verified as robust rather than post-hoc or selective.

Authors: The Experiments section already includes per-dataset tables with mIoU scores for Prompt2Seg versus the backbone, notes that every test category is unseen during fine-tuning, and reports image counts per domain. We will revise the abstract to include representative per-domain metrics and the range of improvements to make these results immediately visible. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on cross-dataset evaluation, not self-referential definitions or fitted predictions.

full rationale

The paper describes an empirical architecture (frozen diffusion backbone plus added conditioning branch) that is fine-tuned on synthetic categories from Hypersim and Virtual KITTI 2 and then evaluated zero-shot on seven held-out datasets. No equations, uniqueness theorems, or self-citations are invoked to derive performance; the central assertions are supported by direct benchmark comparisons rather than any quantity that reduces to the training inputs by construction. The generalization claim is therefore falsifiable by the reported experiments and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that diffusion models pre-trained for generation encode sufficiently rich scene priors that a modest spatial conditioning branch can unlock accurate promptable segmentation; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Diffusion models pre-trained for image generation encode rich priors that are useful for segmentation tasks when properly conditioned.
Invoked throughout the abstract as the justification for leveraging generative pretraining rather than training from scratch on masks.

pith-pipeline@v0.9.1-grok · 5823 in / 1386 out tokens · 40573 ms · 2026-06-26T10:40:43.430287+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 3 linked inside Pith

[1]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo,et al., “Segment anything,” inICCV, pp. 4015–4026, 2023

2023
[2]

gen2seg: Generative models enable generalizable instance segmentation,

O. Khangaonkar and H. Pirsiavash, “gen2seg: Generative models enable generalizable instance segmentation,” inICLR, 2026

2026
[3]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inCVPR, pp. 10684–10695, 2022

2022
[4]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inCVPR, pp. 9492–9502, 2024

2024
[5]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inECCV, pp. 241–258, Springer, 2024

2024
[6]

Stablenormal: Reducing diffusion variance for stable and sharp normal,

C. Ye, L. Qiu, X. Gu, Q. Zuo, Y . Wu, Z. Dong, L. Bo, Y . Xiu, and X. Han, “Stablenormal: Reducing diffusion variance for stable and sharp normal,”ACM Transactions on Graphics (ToG), vol. 43, no. 6, pp. 1–18, 2024

2024
[7]

Diffpose: Toward more reliable 3d pose estimation,

J. Gong, L. G. Foo, Z. Fan, Q. Ke, H. Rahmani, and J. Liu, “Diffpose: Toward more reliable 3d pose estimation,” inCVPR, pp. 13041–13051, 2023

2023
[8]

Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,

J. Wang, C. Rupprecht, and D. Novotny, “Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,” inICCV, pp. 9773–9783, 2023

2023
[9]

Flowdiffuser: Advancing optical flow estimation with diffusion models,

A. Luo, X. Li, F. Yang, J. Liu, H. Fan, and S. Liu, “Flowdiffuser: Advancing optical flow estimation with diffusion models,” inCVPR, pp. 19167–19176, 2024

2024
[10]

Llamaseg: Image segmentation via autoregressive mask generation,

J. Deng, T. Weng, T. Yang, W. Luo, Z. Li, and W. Jiang, “Llamaseg: Image segmentation via autoregressive mask generation,”arXiv preprint arXiv:2505.19422, 2025

arXiv 2025
[11]

Gs: Generative segmentation via label diffusion,

Y . Chen, S. Chen, L. Lin, and G. Wang, “Gs: Generative segmentation via label diffusion,” arXiv preprint arXiv:2508.20020, 2025

arXiv 2025
[12]

Generative pretraining from pixels,

M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” inInternational conference on machine learning, pp. 1691–1703, PMLR, 2020

2020
[13]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022

2022
[14]

Label-efficient semantic segmentation with diffusion models,

D. Baranchuk, I. Rubachev, A. V oynov, V . Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” inICLR, 2022

2022
[15]

Unleashing text-to-image diffusion models for visual perception,

W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” inICCV, pp. 5729–5739, 2023

2023
[16]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”NeurIPS, vol. 33, pp. 6840–6851, 2020. 10

2020
[17]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inICCV, pp. 3836–3847, 2023

2023
[18]

Exploring pre-trained text-to-video diffusion models for referring video object segmentation,

Z. Zhu, X. Feng, D. Chen, J. Yuan, C. Qiao, and G. Hua, “Exploring pre-trained text-to-video diffusion models for referring video object segmentation,” inECCV, pp. 452–469, Springer, 2024

2024
[19]

Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,

Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,”NeurIPS, vol. 36, pp. 76872–76892, 2023

2023
[20]

Diffusionseg: Adapting diffusion towards unsupervised object discovery,

C. Ma, Y . Yang, C. Ju, F. Zhang, J. Liu, Y . Wang, Y . Zhang, and Y . Wang, “Diffusionseg: Adapting diffusion towards unsupervised object discovery,”arXiv preprint arXiv:2303.09813, 2023

arXiv 2023
[21]

Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,

W. Wu, Y . Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” inICCV, pp. 1206– 1217, 2023

2023
[22]

Exploring phrase-level grounding with text-to-image diffusion model,

D. Yang, R. Dong, J. Ji, Y . Ma, H. Wang, X. Sun, and R. Ji, “Exploring phrase-level grounding with text-to-image diffusion model,” inECCV, pp. 161–180, Springer, 2024

2024
[23]

Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,

X. Liu, S. Huang, Y . Kang, H. Chen, and D. Wang, “Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,” inICASSP 2024, pp. 2765–2769, IEEE, 2024

2024
[24]

Ref-diff: Zero-shot referring image segmentation with generative models,

M. Ni, Y . Zhang, K. Feng, X. Li, Y . Guo, and W. Zuo, “Ref-diff: Zero-shot referring image segmentation with generative models,”arXiv preprint arXiv:2308.16777, 2023

arXiv 2023
[25]

Simpleclick: Interactive image segmentation with simple vision transformers,

Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer, “Simpleclick: Interactive image segmentation with simple vision transformers,” inICCV, pp. 22290–22300, 2023

2023
[26]

Sam 2: Segment anything in images and videos,

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson,et al., “Sam 2: Segment anything in images and videos,” inICLR, 2025

2025
[27]

Grounded sam: Assembling open-world models for diverse visual tasks,

T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y . Chen, F. Yan,et al., “Grounded sam: Assembling open-world models for diverse visual tasks,”arXiv preprint arXiv:2401.14159, 2024

Pith/arXiv arXiv 2024
[28]

Sam 3: Segment anything with concepts,

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang,et al., “Sam 3: Segment anything with concepts,”ICLR, 2026

2026
[29]

Image generators are generalist vision learners,

V . Gabeur, S. Long, S. Peng, P. V oigtlaender, S. Sun, Y . Bao, K. Truong, Z. Wang, W. Zhou, J. T. Barron,et al., “Image generators are generalist vision learners,”arXiv preprint arXiv:2604.20329, 2026

Pith/arXiv arXiv 2026
[30]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” inICCV, pp. 10912–10922, 2021

2021
[31]

Virtual kitti 2,

Y . Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,”arXiv preprint arXiv:2001.10773, 2020

Pith/arXiv arXiv 2001
[32]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inECCV, pp. 740–755, Springer, 2014

2014
[33]

Semantic segmentation in art paintings,

N. Cohen, Y . Newman, and A. Shamir, “Semantic segmentation in art paintings,” inComputer graphics forum, vol. 41, pp. 261–275, Wiley Online Library, 2022

2022
[34]

Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,

L. Zhang, S. Zhou, S. Stent, and J. Shi, “Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,” inECCV, pp. 127–145, Springer, 2022

2022
[35]

Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,

L. Zhang, L. Jiang, R. Ji, and H. Fan, “Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,”IJCV, vol. 131, no. 12, pp. 3170–3192, 2023. 11

2023
[36]

The pascal visual object classes (voc) challenge,

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,”International Journal of Computer Vision, vol. 88, pp. 303–338, June 2010

2010
[37]

Towards high-resolution salient object detection,

Y . Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” inICCV, pp. 7234–7243, 2019

2019
[38]

Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,

D. Bashkirova, M. Abdelfattah, Z. Zhu, J. Akl, F. Alladkani, P. Hu, V . Ablavsky, B. Calli, S. A. Bargal, and K. Saenko, “Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,” inCVPR, pp. 21147–21157, 2022

2022
[39]

Rba: Segmenting unknown regions rejected by all,

N. Nayal, M. Yavuz, J. F. Henriques, and F. Güney, “Rba: Segmenting unknown regions rejected by all,” inICCV, pp. 711–722, 2023

2023
[40]

Masked-attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” inCVPR, pp. 1290–1299, 2022

2022
[41]

Detecting the unexpected via image resynthesis,

K. Lis, K. Nakka, P. Fua, and M. Salzmann, “Detecting the unexpected via image resynthesis,” inICCV, pp. 2152–2161, 2019. 12 A Implementation Details Backbone and conditioning branch.We initialize the segmentation backbone from the official Gen2Seg Stable-Diffusion checkpoint [2]. The V AE, CLIP text encoder, and denoising U-Net are kept frozen throughout...

2019

[1] [1]

Segment anything,

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo,et al., “Segment anything,” inICCV, pp. 4015–4026, 2023

2023

[2] [2]

gen2seg: Generative models enable generalizable instance segmentation,

O. Khangaonkar and H. Pirsiavash, “gen2seg: Generative models enable generalizable instance segmentation,” inICLR, 2026

2026

[3] [3]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inCVPR, pp. 10684–10695, 2022

2022

[4] [4]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inCVPR, pp. 9492–9502, 2024

2024

[5] [5]

Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,

X. Fu, W. Yin, M. Hu, K. Wang, Y . Ma, P. Tan, S. Shen, D. Lin, and X. Long, “Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image,” inECCV, pp. 241–258, Springer, 2024

2024

[6] [6]

Stablenormal: Reducing diffusion variance for stable and sharp normal,

C. Ye, L. Qiu, X. Gu, Q. Zuo, Y . Wu, Z. Dong, L. Bo, Y . Xiu, and X. Han, “Stablenormal: Reducing diffusion variance for stable and sharp normal,”ACM Transactions on Graphics (ToG), vol. 43, no. 6, pp. 1–18, 2024

2024

[7] [7]

Diffpose: Toward more reliable 3d pose estimation,

J. Gong, L. G. Foo, Z. Fan, Q. Ke, H. Rahmani, and J. Liu, “Diffpose: Toward more reliable 3d pose estimation,” inCVPR, pp. 13041–13051, 2023

2023

[8] [8]

Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,

J. Wang, C. Rupprecht, and D. Novotny, “Posediffusion: Solving pose estimation via diffusion- aided bundle adjustment,” inICCV, pp. 9773–9783, 2023

2023

[9] [9]

Flowdiffuser: Advancing optical flow estimation with diffusion models,

A. Luo, X. Li, F. Yang, J. Liu, H. Fan, and S. Liu, “Flowdiffuser: Advancing optical flow estimation with diffusion models,” inCVPR, pp. 19167–19176, 2024

2024

[10] [10]

Llamaseg: Image segmentation via autoregressive mask generation,

J. Deng, T. Weng, T. Yang, W. Luo, Z. Li, and W. Jiang, “Llamaseg: Image segmentation via autoregressive mask generation,”arXiv preprint arXiv:2505.19422, 2025

arXiv 2025

[11] [11]

Gs: Generative segmentation via label diffusion,

Y . Chen, S. Chen, L. Lin, and G. Wang, “Gs: Generative segmentation via label diffusion,” arXiv preprint arXiv:2508.20020, 2025

arXiv 2025

[12] [12]

Generative pretraining from pixels,

M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever, “Generative pretraining from pixels,” inInternational conference on machine learning, pp. 1691–1703, PMLR, 2020

2020

[13] [13]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022

2022

[14] [14]

Label-efficient semantic segmentation with diffusion models,

D. Baranchuk, I. Rubachev, A. V oynov, V . Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” inICLR, 2022

2022

[15] [15]

Unleashing text-to-image diffusion models for visual perception,

W. Zhao, Y . Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” inICCV, pp. 5729–5739, 2023

2023

[16] [16]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,”NeurIPS, vol. 33, pp. 6840–6851, 2020. 10

2020

[17] [17]

Adding conditional control to text-to-image diffusion models,

L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inICCV, pp. 3836–3847, 2023

2023

[18] [18]

Exploring pre-trained text-to-video diffusion models for referring video object segmentation,

Z. Zhu, X. Feng, D. Chen, J. Yuan, C. Qiao, and G. Hua, “Exploring pre-trained text-to-video diffusion models for referring video object segmentation,” inECCV, pp. 452–469, Springer, 2024

2024

[19] [19]

Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,

Q. Nguyen, T. Vu, A. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic data generation for pixel-level semantic segmentation,”NeurIPS, vol. 36, pp. 76872–76892, 2023

2023

[20] [20]

Diffusionseg: Adapting diffusion towards unsupervised object discovery,

C. Ma, Y . Yang, C. Ju, F. Zhang, J. Liu, Y . Wang, Y . Zhang, and Y . Wang, “Diffusionseg: Adapting diffusion towards unsupervised object discovery,”arXiv preprint arXiv:2303.09813, 2023

arXiv 2023

[21] [21]

Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,

W. Wu, Y . Zhao, M. Z. Shou, H. Zhou, and C. Shen, “Diffumask: Synthesizing images with pixel-level annotations for semantic segmentation using diffusion models,” inICCV, pp. 1206– 1217, 2023

2023

[22] [22]

Exploring phrase-level grounding with text-to-image diffusion model,

D. Yang, R. Dong, J. Ji, Y . Ma, H. Wang, X. Sun, and R. Ji, “Exploring phrase-level grounding with text-to-image diffusion model,” inECCV, pp. 161–180, Springer, 2024

2024

[23] [23]

Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,

X. Liu, S. Huang, Y . Kang, H. Chen, and D. Wang, “Vgdiffzero: Text-to-image diffusion models can be zero-shot visual grounders,” inICASSP 2024, pp. 2765–2769, IEEE, 2024

2024

[24] [24]

Ref-diff: Zero-shot referring image segmentation with generative models,

M. Ni, Y . Zhang, K. Feng, X. Li, Y . Guo, and W. Zuo, “Ref-diff: Zero-shot referring image segmentation with generative models,”arXiv preprint arXiv:2308.16777, 2023

arXiv 2023

[25] [25]

Simpleclick: Interactive image segmentation with simple vision transformers,

Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer, “Simpleclick: Interactive image segmentation with simple vision transformers,” inICCV, pp. 22290–22300, 2023

2023

[26] [26]

Sam 2: Segment anything in images and videos,

N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson,et al., “Sam 2: Segment anything in images and videos,” inICLR, 2025

2025

[27] [27]

Grounded sam: Assembling open-world models for diverse visual tasks,

T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y . Chen, F. Yan,et al., “Grounded sam: Assembling open-world models for diverse visual tasks,”arXiv preprint arXiv:2401.14159, 2024

Pith/arXiv arXiv 2024

[28] [28]

Sam 3: Segment anything with concepts,

N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang,et al., “Sam 3: Segment anything with concepts,”ICLR, 2026

2026

[29] [29]

Image generators are generalist vision learners,

V . Gabeur, S. Long, S. Peng, P. V oigtlaender, S. Sun, Y . Bao, K. Truong, Z. Wang, W. Zhou, J. T. Barron,et al., “Image generators are generalist vision learners,”arXiv preprint arXiv:2604.20329, 2026

Pith/arXiv arXiv 2026

[30] [30]

Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind, “Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding,” inICCV, pp. 10912–10922, 2021

2021

[31] [31]

Virtual kitti 2,

Y . Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,”arXiv preprint arXiv:2001.10773, 2020

Pith/arXiv arXiv 2001

[32] [32]

Microsoft coco: Common objects in context,

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” inECCV, pp. 740–755, Springer, 2014

2014

[33] [33]

Semantic segmentation in art paintings,

N. Cohen, Y . Newman, and A. Shamir, “Semantic segmentation in art paintings,” inComputer graphics forum, vol. 41, pp. 261–275, Wiley Online Library, 2022

2022

[34] [34]

Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,

L. Zhang, S. Zhou, S. Stent, and J. Shi, “Fine-grained egocentric hand-object segmentation: Dataset, model, and applications,” inECCV, pp. 127–145, Springer, 2022

2022

[35] [35]

Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,

L. Zhang, L. Jiang, R. Ji, and H. Fan, “Pidray: A large-scale x-ray benchmark for real-world prohibited item detection,”IJCV, vol. 131, no. 12, pp. 3170–3192, 2023. 11

2023

[36] [36]

The pascal visual object classes (voc) challenge,

M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,”International Journal of Computer Vision, vol. 88, pp. 303–338, June 2010

2010

[37] [37]

Towards high-resolution salient object detection,

Y . Zeng, P. Zhang, J. Zhang, Z. Lin, and H. Lu, “Towards high-resolution salient object detection,” inICCV, pp. 7234–7243, 2019

2019

[38] [38]

Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,

D. Bashkirova, M. Abdelfattah, Z. Zhu, J. Akl, F. Alladkani, P. Hu, V . Ablavsky, B. Calli, S. A. Bargal, and K. Saenko, “Zerowaste dataset: Towards deformable object segmentation in cluttered scenes,” inCVPR, pp. 21147–21157, 2022

2022

[39] [39]

Rba: Segmenting unknown regions rejected by all,

N. Nayal, M. Yavuz, J. F. Henriques, and F. Güney, “Rba: Segmenting unknown regions rejected by all,” inICCV, pp. 711–722, 2023

2023

[40] [40]

Masked-attention mask transformer for universal image segmentation,

B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” inCVPR, pp. 1290–1299, 2022

2022

[41] [41]

Detecting the unexpected via image resynthesis,

K. Lis, K. Nakka, P. Fua, and M. Salzmann, “Detecting the unexpected via image resynthesis,” inICCV, pp. 2152–2161, 2019. 12 A Implementation Details Backbone and conditioning branch.We initialize the segmentation backbone from the official Gen2Seg Stable-Diffusion checkpoint [2]. The V AE, CLIP text encoder, and denoising U-Net are kept frozen throughout...

2019