pith. machine review for the scientific record.

arxiv: 2604.24575 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Diffusion Model as a Generalist Segmentation Learner

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 04:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · semantic segmentation · open-vocabulary segmentation · text-conditioned segmentation · generalist framework · visual priors · cross-domain transfer

The pith

Pretrained diffusion models can be repurposed as generalist segmentation learners by conditioning on image and mask latents together with text features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion models, though trained for image generation, have denoising processes that capture detailed, spatially aligned visual information. By feeding the input image and its mask into the model as latents, along with CLIP text features, the same model can produce accurate segmentation masks for both semantic and open-vocabulary tasks. This yields a single framework that performs well on standard benchmarks and transfers to medical, remote sensing, and agricultural images without architectural changes. Readers should care because it suggests generation and understanding can share the same backbone, reducing the need for separate specialized models.

Core claim

By encoding the input image and ground-truth mask into the latent space and concatenating them as conditioning signals for the diffusion U-Net, together with a parallel CLIP-aligned text pathway that injects language features at multiple scales, an off-the-shelf diffusion backbone becomes a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. This interface achieves state-of-the-art performance on semantic segmentation and strong generalization across domains.

What carries the argument

The conditioning strategy that concatenates image and mask latents into the diffusion U-Net while adding multi-scale text features from CLIP to guide the denoising toward segmentation outputs.
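At the shape level, this conditioning is just channel-wise concatenation of two latent tensors at the U-Net input, with text features carried on a separate stream. A minimal NumPy sketch, where all dimensions (4-channel latents at 1/8 resolution, 77 tokens of width 768) are illustrative assumptions rather than the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative latent shapes: a VAE commonly maps a 512x512 RGB image to a
# 4-channel latent at 1/8 spatial resolution, i.e. (4, 64, 64).
image_latent = rng.standard_normal((1, 4, 64, 64))  # encoded input image
mask_latent = rng.standard_normal((1, 4, 64, 64))   # encoded ground-truth mask

# Channel-wise concatenation doubles the U-Net's input channels: 4 + 4 = 8.
unet_input = np.concatenate([image_latent, mask_latent], axis=1)
print(unet_input.shape)  # (1, 8, 64, 64)

# The CLIP-aligned text pathway is a separate conditioning stream; here it is
# only a placeholder embedding (hypothetical 77 tokens x 768 dims) that a real
# model would inject into the U-Net at multiple scales.
text_features = rng.standard_normal((1, 77, 768))
```

The only architectural change this implies is widening the U-Net's first convolution to accept the extra latent channels; the text pathway rides on the existing conditioning mechanism.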

If this is right

  • State-of-the-art performance on standard semantic segmentation benchmarks
  • Strong results on open-vocabulary segmentation tasks
  • Effective cross-domain transfer to medical, remote sensing, and agricultural scenarios without any domain-specific changes

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to other visual understanding tasks like object detection or instance segmentation using similar conditioning.
  • If diffusion models unify generation and segmentation, future vision systems might rely on fewer pretrained backbones for multiple purposes.
  • Testing on more diverse tasks could reveal whether the visual priors are truly general or specific to certain image types.

Load-bearing premise

The denoising trajectories from pretrained diffusion models contain rich spatially aligned visual priors that become usable for segmentation when the model is conditioned on concatenated image and mask latents plus parallel text features.

What would settle it

Training the same architecture from scratch without diffusion pretraining and observing whether it still achieves comparable segmentation accuracy on standard benchmarks.

Figures

Figures reproduced from arXiv: 2604.24575 by Antao Xiang, Changhao Pan, Haiyang Sun, Haoxiao Wang, Minjie Hong, Peilin Sun, Shuang Chen, Weijie Wang, Yifu Chen, Yue Chen, Zhou Zhao.

Figure 1: We introduce DiGSeg, a general-purpose segmentation framework built on pretrained diffusion models. By exploiting the strong spatial priors encoded in generative models and fine-tuning with segmentation-aware objectives, DiGSeg produces consistent, high-quality masks across diverse settings, including semantic segmentation, open-vocabulary queries, and cross-domain datasets spanning medical, agricultural… view at source ↗
Figure 2: DiGSeg pipeline overview, presenting the training and inference pipelines of our diffusion-based generation model. In training, paired images are encoded into latent space and, together with text prompts, guide the diffusion U-Net to predict noise under an MSE objective. In inference, noise is sampled and progressively denoised under text conditioning, after which the VAE decoder reconstructs the final… view at source ↗
Figure 3: Impact of hyperparameter τ. To characterize this behavior, we examine how IoU varies with respect to τ for several representative categories on… view at source ↗
Figure 4: Qualitative comparison of semantic and open-vocabulary segmentation across different datasets (input image, ground truth, and prediction on Pheno, DeepGlobe, and BDD100K). view at source ↗
Figure 5: Qualitative results of cross-domain segmentation across different datasets. view at source ↗
Figure 6: Effect of training data ratio. Our model exhibits remarkable data efficiency, maintaining nearly equivalent performance even when trained on only 50% of the total dataset. (Adjacent plot: mIoU (%) vs. ensemble size on ADE20K and COCO.) view at source ↗
Figure 8: Ablation study of denoising schedule. Under the DDIM-trailing setting, one step of denoising can achieve a relatively high effect, while DDIM requires at least 10 steps. (Adjacent table, COCO/ADE20K: standard Gaussian noise 48.9/56.7; w/ annealed noise 49.2/57.1; w/ multi-resolution noise 49.7/57.6; w/ multi-res. + annealed (ours) 50.8/58.6.) view at source ↗
Figure 9: Qualitative results on A-847 for open-vocabulary segmentation. view at source ↗
Figure 10: Qualitative results on PC-459 for open-vocabulary segmentation. view at source ↗
Figure 11: Qualitative results on A-150 for open-vocabulary segmentation. view at source ↗
Figure 12: Qualitative results on PC-59 for open-vocabulary segmentation. view at source ↗
Figure 13: Qualitative results on ADE20K for semantic segmentation. view at source ↗
Figure 14: Qualitative results on COCO for semantic segmentation. view at source ↗
read the original abstract

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios-without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
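The abstract's "standard denoising objective" is the usual epsilon-prediction MSE loss. A minimal NumPy sketch of one training step, assuming DDPM-style forward noising; shapes and the noise-schedule value are illustrative, and the U-Net is stubbed out with a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_diffuse(x0, eps, alpha_bar_t):
    """Standard DDPM forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Illustrative: concatenated image+mask latents as the clean signal x0.
x0 = rng.standard_normal((1, 8, 64, 64))
eps = rng.standard_normal(x0.shape)  # sampled Gaussian noise
alpha_bar_t = 0.5                    # cumulative schedule value at some step t

x_t = forward_diffuse(x0, eps, alpha_bar_t)

# Training minimizes MSE between the U-Net's noise prediction and eps.
# Here the prediction is a stand-in for unet(x_t, t, text_features); a
# perfect predictor drives the loss to zero.
eps_pred = eps
loss = np.mean((eps_pred - eps) ** 2)
print(loss)  # 0.0
```

At inference the process runs in reverse: noise is sampled, progressively denoised under text conditioning, and the VAE decoder maps the resulting latent back to a mask.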

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiGSeg, which repurposes a pretrained diffusion U-Net for text-conditioned semantic and open-vocabulary segmentation. The input image and ground-truth mask are encoded into latent space and concatenated as conditioning signals to the diffusion U-Net; a parallel CLIP-aligned text pathway injects language features at multiple scales. The model is trained under the standard denoising objective to output structured masks conditioned on appearance and arbitrary text prompts. It claims state-of-the-art performance on standard semantic segmentation benchmarks plus strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios without domain-specific architectural changes.

Significance. If the empirical results hold under rigorous validation, the work would be significant for computer vision by showing that denoising trajectories in modern diffusion backbones encode spatially aligned priors usable for discriminative tasks. The unified, architecture-agnostic framework for multiple segmentation variants is a practical strength and could help narrow the generation-understanding divide. Credit is due for the clean repurposing of an off-the-shelf backbone and standard training objective without introducing new invented entities or free parameters.

major comments (2)
  1. [Experiments] Experiments section: the central SOTA and cross-domain claims rest on reported numbers, yet the manuscript provides insufficient ablations isolating the contribution of concatenated mask latents versus the multi-scale CLIP injection; without these, it is difficult to confirm that the performance gains derive from the diffusion priors rather than the added conditioning pathways.
  2. [Experiments] The open-vocabulary and cross-domain transfer results require explicit details on prompt construction, evaluation protocols, and whether any post-hoc hyperparameter tuning was performed per domain; the current description leaves open the possibility that reported generalization is partly attributable to evaluation choices rather than the model itself.
minor comments (2)
  1. The abstract and method description would benefit from a concise statement of the exact diffusion backbone (e.g., Stable Diffusion v1.5 or v2) and latent resolution used, as these choices affect reproducibility.
  2. Qualitative figures should include failure cases or edge examples (e.g., ambiguous text prompts or low-contrast regions) to balance the reported successes and strengthen the generalization narrative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of our experimental validation. We agree that additional details and ablations will strengthen the claims regarding the contributions of individual components and the robustness of the generalization results. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: the central SOTA and cross-domain claims rest on reported numbers, yet the manuscript provides insufficient ablations isolating the contribution of concatenated mask latents versus the multi-scale CLIP injection; without these, it is difficult to confirm that the performance gains derive from the diffusion priors rather than the added conditioning pathways.

    Authors: We agree that isolating the contributions of the concatenated mask latents and the multi-scale CLIP text pathway is valuable for confirming the role of the diffusion priors. In the revised manuscript, we will add targeted ablations: (1) a variant using only image latents without mask concatenation, (2) a variant removing the multi-scale CLIP injection while retaining mask conditioning, and (3) comparisons against a non-diffusion baseline with equivalent conditioning. These will be reported on the same benchmarks to quantify each component's impact. revision: yes

  2. Referee: [Experiments] The open-vocabulary and cross-domain transfer results require explicit details on prompt construction, evaluation protocols, and whether any post-hoc hyperparameter tuning was performed per domain; the current description leaves open the possibility that reported generalization is partly attributable to evaluation choices rather than the model itself.

    Authors: We appreciate this point and will expand the experimental details in the revision. Specifically, we will include: (i) the exact prompt templates and phrasing used for open-vocabulary queries (e.g., class-name-only vs. descriptive), (ii) full evaluation protocols including mIoU computation, dataset splits, and inference settings, and (iii) explicit confirmation that a single set of hyperparameters and the same trained model were used across all domains without per-domain tuning or post-hoc adjustments. This will clarify that the reported cross-domain performance stems from the generalist framework. revision: yes
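The mIoU computation the authors promise to document is standard across these benchmarks. A minimal sketch (dataset-agnostic, independent of the paper's actual inference settings) of per-class IoU from a confusion matrix:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU from a confusion matrix; classes absent from both pred and gt are skipped."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows: ground truth, cols: prediction
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()

# Toy 2x2 "segmentation" with 2 classes.
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
# class 0: TP=1, union=2 -> IoU 0.5; class 1: TP=2, union=3 -> IoU ~0.667
print(round(mean_iou(pred, gt, 2), 3))  # 0.583
```

Pinning down details like whether ignore labels are masked out and whether the confusion matrix is accumulated over the whole split (as above) or averaged per image is exactly the kind of protocol disclosure the referee asks for.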

Circularity Check

0 steps flagged

No significant circularity; empirical training recipe

full rationale

The paper describes a concrete architectural repurposing of a pretrained diffusion U-Net: image and ground-truth mask are encoded to latents and concatenated as conditioning, with parallel multi-scale CLIP text injection, all trained under the standard denoising objective. Central claims of SOTA performance, open-vocabulary generalization, and cross-domain transfer rest on reported experimental outcomes across benchmarks rather than any equation, parameter fit, or uniqueness theorem that reduces to the inputs by construction. No self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain appears; the derivation chain is a training procedure validated externally by results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that diffusion denoising trajectories contain transferable semantic structure; no new entities are postulated and no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption Pretrained diffusion models' denoising trajectories encode rich, spatially aligned visual priors usable for segmentation.
    This premise is stated directly in the abstract as the foundation for repurposing the model.

pith-pipeline@v0.9.0 · 5546 in / 1186 out tokens · 34885 ms · 2026-05-08T04:34:13.789437+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

118 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    Segdiff: Image segmentation with diffusion probabilistic models.arXiv preprint arXiv:2112.00390, 2021

    Amit, T., Shaharbany, T., Nachmani, E., Wolf, L.: Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22861–22872 (2024)

  3. [3]

    The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification

    Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Fara- hani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The rsna-asnr- miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)

  4. [4]

    Label-efficient semantic segmentation with diffusion models.arXiv preprint arXiv:2112.03126, 2021

    Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A.: Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126 (2021)

  5. [5]

    Advances in Neural Information Processing Systems32(2019)

    Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Advances in Neural Information Processing Systems32(2019)

  6. [6]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Cai, L., Zhao, K., Yuan, H., Zhang, Y., Zhang, S., Huang, K.: Freemask: Rethink- ing the importance of attention masks for zero-shot video editing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 1898–1906 (2025)

  7. [7]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Cavagnero, N., Rosi, G., Cuttano, C., Pistilli, F., Ciccone, M., Averta, G., Cer- melli, F.: Pem: Prototype-based efficient maskformer for image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 15804–15813 (2024)

  8. [8]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Celikkan, E., Saberioon, M., Herold, M., Klein, N.: Semantic segmentation of crops and weeds with probabilistic modeling and uncertainty quantification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 582–592 (2023)

  9. [9]

    Rethinking Atrous Convolution for Semantic Image Segmentation

    Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)

  10. [10]

    In: Proceedings of the European conference on computer vision (ECCV)

    Chen,L.C.,Zhu,Y.,Papandreou,G.,Schroff,F.,Adam,H.:Encoder-decoderwith atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)

  11. [11]

    In: International conference on machine learning

    Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Gen- erative pretraining from pixels. In: International conference on machine learning. pp. 1691–1703. PMLR (2020)

  12. [12]

    Unify-agent: A unified multimodal agent for world-grounded image synthesis.arXiv preprint arXiv:2603.29620, 2026

    Chen, S., Shou, Q., Chen, H., Zhou, Y., Feng, K., Hu, W., Zhang, Y.F., Lin, Y., Huang, W., Song, M., et al.: Unify-agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620 (2026)

  13. [13]

    arXiv preprint arXiv:2602.14193 (2026)

    Chen, Y., Jiang, M., Zheng, K., Liang, J., Tie, C., Lu, H., Wu, R., Dong, H.: Learning part-aware dense 3d feature field for generalizable articulated object manipulation. arXiv preprint arXiv:2602.14193 (2026)

  14. [14]

    arXiv preprint arXiv:2408.01953 (2024) 22 H

    Chen, Y., Tie, C., Wu, R., Dong, H.: Eqvafford: Se (3) equivariance for point-level affordance learning. arXiv preprint arXiv:2408.01953 (2024) 22 H. Wang et al

  15. [15]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290– 1299 (2022)

  16. [16]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chiu, M.T., Xu, X., Wei, Y., Huang, Z., Schwing, A.G., Brunner, R., Khacha- trian, H., Karapetyan, H., Dozier, I., Rose, G., et al.: Agriculture-vision: A large aerial image database for agricultural pattern analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2828– 2838 (2020)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: Cat-seg: Cost ag- gregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113– 4123 (2024)

  18. [18]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  19. [19]

    Advances in Neural Information Processing Systems37, 13548–13578 (2024)

    Couairon, P., Shukor, M., Haugeard, J.E., Cord, M., Thome, N.: Diffcut: Catalyz- ing zero-shot semantic segmentation with diffusion features and recursive normal- ized cut. Advances in Neural Information Processing Systems37, 13548–13578 (2024)

  20. [20]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops

    Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., Raskar, R.: Deepglobe 2018: A challenge to parse the earth through satellite images. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 172–181 (2018)

  21. [21]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11583–11592 (2022)

  22. [22]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14084–14093 (2022)

  23. [23]

    In: European Conference on Computer Vision

    Duan, Y., Guo, X., Zhu, Z.: Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In: European Conference on Computer Vision. pp. 432–449. Springer (2024)

  24. [24]

    Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening.arXiv preprint arXiv:2202.08994, 2022

    Fang, H., Li, F., Wu, J., Fu, H., Sun, X., Son, J., Yu, S., Zhang, M., Yuan, C., Bian, C., et al.: Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening. arXiv preprint arXiv:2202.08994 (2022)

  25. [25]

    arXiv preprint arXiv:2510.19400 (2025)

    Feng, Z., Kang, Z., Wang, Q., Du, Z., Yan, J., Shi, S., Yuan, C., Liang, H., Deng, Y., Li, Q., et al.: Seeing across views: Benchmarking spatial reasoning of vision- language models in robotic scenes. arXiv preprint arXiv:2510.19400 (2025)

  26. [26]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fu, Y., Lou, M., Yu, Y.: Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19077–19087 (2025)

  27. [27]

    Image Generators are Generalist Vision Learners

    Gabeur, V., Long, S., Peng, S., Voigtlaender, P., Sun, S., Bao, Y., Truong, K., Wang, Z., Zhou, W., Barron, J.T., et al.: Image generators are generalist vision learners. arXiv preprint arXiv:2604.20329 (2026)

  28. [28]

    IEEE Transactions on Geoscience and Remote Sensing62, 1–19 (2024) Diffusion Model as a Generalist Segmentation Learner 23

    Gao, L., Zhou, Y., Tian, J., Cai, W.: Ddctnet: A deformable and dynamic cross- transformer network for road extraction from high-resolution remote sensing im- ages. IEEE Transactions on Geoscience and Remote Sensing62, 1–19 (2024) Diffusion Model as a Generalist Segmentation Learner 23

  29. [29]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Garcia, G.M., Abou Zeid, K., Schmidt, C., De Geus, D., Hermans, A., Leibe, B.: Fine-tuning image-conditional diffusion models is easier than you think. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 753–762. IEEE (2025)

  30. [30]

    In: European conference on computer vision

    Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmen- tation with image-level labels. In: European conference on computer vision. pp. 540–557. Springer (2022)

  31. [31]

    Open-vocabulary object detection via vision and language knowledge distillation,

    Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)

  32. [32]

    In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Gu, Z., Chen, H., Xu, Z.: Diffusioninst: Diffusion model for instance segmentation. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2730–2734. IEEE (2024)

  33. [33]

    In: Proceedings of the Computer Vision and Pattern Recognition Con- ference

    Hatamizadeh, A., Kautz, J.: Mambavision: A hybrid mamba-transformer vision backbone. In: Proceedings of the Computer Vision and Pattern Recognition Con- ference. pp. 25261–25270 (2025)

  34. [34]

    Lotus: Diffusion-based visual foundation model for high-quality dense prediction.arXiv preprint arXiv:2409.18124, 2024

    He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Zhang, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024)

  35. [35]

    In: Proceedings of the IEEE international conference on computer vision

    He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)

  36. [36]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

  37. [37]

    IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing17, 8370–8396 (2023)

    Huang, L., Jiang, B., Lv, S., Liu, Y., Fu, Y.: Deep-learning-based semantic seg- mentation of remote sensing images: A survey. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing17, 8370–8396 (2023)

  38. [38]

    Huang, J

    Huang, S., Wu, J., Zhou, Q., Miao, S., Long, M.: Vid2world: Crafting video diffu- sion models to interactive world models. arXiv preprint arXiv:2505.14357 (2025)

  39. [39]

    Nature methods18(2), 203–211 (2021)

    Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods18(2), 203–211 (2021)

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Jain, J., Li, J., Chiu, M.T., Hassani, A., Orlov, N., Shi, H.: Oneformer: One trans- former to rule universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2989–2998 (2023)

  41. [41]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 752–761 (2023)

  42. [42]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Ji, W., Yu, S., Wu, J., Ma, K., Bian, C., Bi, Q., Li, J., Liu, H., Cheng, L., Zheng, Y.: Learning calibrated medical image segmentation via multi-rater agreement modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12341–12351 (2021)

  43. [43]

    In: European Conference on Computer Vision

    Jiao, S., Zhu, H., Huang, J., Zhao, Y., Wei, Y., Shi, H.: Collaborative vision- text representation optimizing for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 399–416. Springer (2024)

  44. [44]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition

    Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Re- purposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recog- nition. pp. 9492–9502 (2024)

  45. [45]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Kerssies, T., Cavagnero, N., Hermans, A., Norouzi, N., Averta, G., Leibe, B., Dubbelman, G., de Geus, D.: Your vit is secretly an image segmentation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25303–25313 (2025) 24 H. Wang et al

  46. [46]

    arXiv preprint arXiv:2509.18096 , year=

    Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P.H., Hong, S., Kim, S.: Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096 (2025)

  47. [47]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Kim, C., Ju, D., Han, W., Yang, M.H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15033–15042 (2025)

  48. [48]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6399–6408 (2019)

  49. [49]

    In: Proceedings of the AAAI Conference on Artificial In- telligence

    Le, M.Q., Nguyen, T.V., Le, T.N., Do, T.T., Do, M.N., Tran, M.T.: Maskd- iff: Modeling mask distribution with diffusion probabilistic model for few-shot instance segmentation. In: Proceedings of the AAAI Conference on Artificial In- telligence. vol. 38, pp. 2874–2881 (2024)

  50. [50]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Lee, M., Cho, S., Lee, J., Yang, S., Choi, H., Kim, I.J., Lee, S.: Effective sam combination for open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26081–26090 (2025)

  51. [51]

    Language-driven semantic segmentation,

    Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)

  52. [52]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L.M., Shum, H.Y.: Mask dino: Towards a unified transformer-based framework for object detection and segmen- tation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3041–3050 (2023)

  53. Li, J., Cai, Y., Li, Q., Kou, M., Zhang, T.: A review of remote sensing image segmentation by deep learning methods. International Journal of Digital Earth 17(1), 2328827 (2024)

  54. Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10965–10975 (2022)

  55. Li, Y., Cheng, T., Feng, B., Liu, W., Wang, X.: Mask-adapter: The devil is in the masks for open-vocabulary segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14998–15008 (2025)

  56. Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7061–7070 (2023)

  57. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)

  58. Liu, Y., Bai, S., Li, G., Wang, Y., Tang, Y.: Open-vocabulary segmentation with semantic-assisted calibration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3491–3500 (2024)

  59. Liu, Y., Wu, S.L., Bai, S., Wang, J., Wang, Y., Tang, Y.: Stepping out of similar semantic space for open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22664–22674 (2025)

  60. Mei, J., Li, R.J., Gao, W., Cheng, M.M.: Coanet: Connectivity attention network for road extraction from satellite imagery. IEEE Transactions on Image Processing 30, 8540–8552 (2021)

  61. Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open-vocabulary object detection. In: European Conference on Computer Vision. pp. 728–755. Springer (2022)

  62. Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 891–898 (2014)

  63. Ni, Z., Chen, X., Zhai, Y., Tang, Y., Wang, Y.: Context-guided spatial feature reconstruction for efficient semantic segmentation. In: European Conference on Computer Vision. pp. 239–255. Springer (2024)

  64. Patni, S., Agarwal, A., Arora, C.: Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28285–28295 (2024)

  65. Peng, Z., Xu, Z., Zeng, Z., Wen, C., Huang, Y., Yang, M., Tang, F., Shen, W.: Understanding fine-tuning clip for open-vocabulary semantic segmentation in hyperbolic space. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4562–4572 (2025)

  66. Qorbani, R., Villani, G., Panagiotakopoulos, T., Colomer, M.B., Härenstam-Nielsen, L., Segu, M., Dovesi, P.L., Karlgren, J., Cremers, D., Tombari, F., et al.: Semantic library adaptation: Lora retrieval and fusion for open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9804–9815 (2025)

  67. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

  68. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  69. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)

  70. Saxena, S., Herrmann, C., Hur, J., Kar, A., Norouzi, M., Sun, D., Fleet, D.J.: The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. Advances in Neural Information Processing Systems 36, 39443–39469 (2023)

  71. Saxena, S., Kar, A., Norouzi, M., Fleet, D.J.: Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816 (2023)

  72. Shan, X., Wu, D., Zhu, G., Shao, Y., Sang, N., Gao, C.: Open-vocabulary semantic segmentation with image embedding balancing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28412–28421 (2024)

  73. Shim, J.h., Yu, H., Kong, K., Kang, S.J.: Feedformer: Revisiting transformer decoder for efficient semantic segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2263–2271 (2023)

  74. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)

  75. Song, Y., Ermon, S.: Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems 33, 12438–12448 (2020)

  76. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7262–7272 (2021)

  77. Su, Y., Zhan, X., Fang, H., Li, Y.L., Lu, C., Yang, L.: Motion before action: Diffusing object motion as manipulation condition. IEEE Robotics and Automation Letters (2025)

  78. Su, Y., Zhang, C., Chen, S., Tan, L., Tang, Y., Wang, J., Liu, X.: Dspv2: Improved dense policy for effective and generalizable whole-body mobile manipulation. arXiv preprint arXiv:2509.16063 (2025)

  79. Tao, J., Chen, Z., Sun, Z., Guo, H., Leng, B., Yu, Z., Wang, Y., He, Z., Lei, X., Yang, J.: Seg-road: a segmentation network for road extraction based on transformer and cnn with connectivity structures. Remote Sensing 15(6), 1602 (2023)

  80. Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse attend and segment: Unsupervised zero-shot segmentation using stable diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3554–3563 (2024)

Showing first 80 references.