Recognition: unknown
Diffusion Model as a Generalist Segmentation Learner
Pith reviewed 2026-05-08 04:34 UTC · model grok-4.3
The pith
Pretrained diffusion models can be repurposed as generalist segmentation learners by conditioning on image masks and text features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By encoding the input image and ground-truth mask into the latent space and concatenating them as conditioning signals for the diffusion U-Net, together with a parallel CLIP-aligned text pathway that injects language features at multiple scales, an off-the-shelf diffusion backbone becomes a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. On this basis, the model achieves state-of-the-art performance on semantic segmentation and strong generalization across domains.
What carries the argument
The conditioning strategy that concatenates image and mask latents into the diffusion U-Net while adding multi-scale text features from CLIP to guide the denoising toward segmentation outputs.
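As a concrete illustration, the conditioning step described above can be sketched in a few lines. Everything here is an assumption for illustration: the latent shapes, per-level channel widths, and projection sizes are typical of Stable-Diffusion-style backbones, not values reported by the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's code): image and mask are encoded to
# 4-channel latents, concatenated with the noisy latent along the channel
# axis, and CLIP text tokens are projected to one width per U-Net scale.
rng = np.random.default_rng(0)
B, C, H, W = 2, 4, 32, 32                     # assumed latent shape

img_lat  = rng.standard_normal((B, C, H, W))  # encoded input image
mask_lat = rng.standard_normal((B, C, H, W))  # encoded ground-truth mask
noisy    = rng.standard_normal((B, C, H, W))  # noised latent being denoised

# Channel-wise concatenation: the U-Net's first conv now sees 3*C channels.
unet_input = np.concatenate([noisy, img_lat, mask_lat], axis=1)

# Multi-scale text pathway: project CLIP token features (77 x 768 assumed)
# to a hypothetical channel width per U-Net level.
text = rng.standard_normal((B, 77, 768))
level_widths = [320, 640, 1280]               # assumed per-level widths
proj = {d: rng.standard_normal((768, d)) * 0.02 for d in level_widths}
multi_scale_text = [text @ proj[d] for d in level_widths]
```

The design choice this sketch highlights is that the mask enters as an input latent rather than as a loss target, so the pretrained U-Net's input interface carries all the segmentation-specific signal.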
If this is right
- State-of-the-art performance on standard semantic segmentation benchmarks
- Strong results on open-vocabulary segmentation tasks
- Effective cross-domain transfer to medical, remote sensing, and agricultural scenarios without any domain-specific changes
Where Pith is reading between the lines
- This approach could extend to other visual understanding tasks like object detection or instance segmentation using similar conditioning.
- If diffusion models unify generation and segmentation, future vision systems might rely on fewer pretrained backbones for multiple purposes.
- Testing on more diverse tasks could reveal whether the visual priors are truly general or specific to certain image types.
Load-bearing premise
The denoising trajectories from pretrained diffusion models contain rich spatially aligned visual priors that become usable for segmentation when the model is conditioned on concatenated image and mask latents plus parallel text features.
What would settle it
Training the same architecture from scratch without diffusion pretraining and observing whether it still achieves comparable segmentation accuracy on standard benchmarks.
read the original abstract
Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and open-vocabulary segmentation, and this approach can be generalized to various downstream tasks to make a general-purpose diffusion segmentation framework. Concretely, we introduce DiGSeg (Diffusion Models as a Generalist Segmentation Learner), which repurposes a pretrained diffusion model into a unified segmentation framework. Our approach encodes the input image and ground-truth mask into the latent space and concatenates them as conditioning signals for the diffusion U-Net. A parallel CLIP-aligned text pathway injects language features across multiple scales, enabling the model to align textual queries with evolving visual representations. This design transforms an off-the-shelf diffusion backbone into a universal interface that produces structured segmentation masks conditioned on both appearance and arbitrary text prompts. Extensive experiments demonstrate state-of-the-art performance on standard semantic segmentation benchmarks, as well as strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios, without domain-specific architectural customization. These results indicate that modern diffusion backbones can serve as generalist segmentation learners rather than pure generators, narrowing the gap between visual generation and visual understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiGSeg, which repurposes a pretrained diffusion U-Net for text-conditioned semantic and open-vocabulary segmentation. The input image and ground-truth mask are encoded into latent space and concatenated as conditioning signals to the diffusion U-Net; a parallel CLIP-aligned text pathway injects language features at multiple scales. The model is trained under the standard denoising objective to output structured masks conditioned on appearance and arbitrary text prompts. It claims state-of-the-art performance on standard semantic segmentation benchmarks plus strong open-vocabulary generalization and cross-domain transfer to medical, remote sensing, and agricultural scenarios without domain-specific architectural changes.
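The "standard denoising objective" referenced in the summary is the usual DDPM noise-prediction loss. A minimal sketch follows; the predictor here is a stand-in lambda for the conditioned U-Net, and the noise schedule values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
betas = np.linspace(1e-4, 0.02, T)         # assumed linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)        # cumulative signal fraction

x0  = rng.standard_normal((2, 4, 32, 32))  # clean target latent
t   = int(rng.integers(0, T))              # random training timestep
eps = rng.standard_normal(x0.shape)        # Gaussian noise

# Forward diffusion: noise the clean latent to level t.
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Stand-in for the conditioned U-Net, which in the paper would also receive
# the concatenated image/mask latents and multi-scale text features.
def predict_eps(x, t):
    return np.zeros_like(x)

# Standard denoising objective: MSE between predicted and true noise.
loss = float(np.mean((predict_eps(x_t, t) - eps) ** 2))
```

The point of reusing this objective unchanged is that no segmentation-specific loss is introduced; the mask supervision enters only through the conditioning inputs.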
Significance. If the empirical results hold under rigorous validation, the work would be significant for computer vision by showing that denoising trajectories in modern diffusion backbones encode spatially aligned priors usable for discriminative tasks. The unified, architecture-agnostic framework for multiple segmentation variants is a practical strength and could help narrow the generation-understanding divide. Credit is due for the clean repurposing of an off-the-shelf backbone and standard training objective without introducing new invented entities or free parameters.
major comments (2)
- [Experiments] Experiments section: the central SOTA and cross-domain claims rest on reported numbers, yet the manuscript provides insufficient ablations isolating the contribution of concatenated mask latents versus the multi-scale CLIP injection; without these, it is difficult to confirm that the performance gains derive from the diffusion priors rather than the added conditioning pathways.
- [Experiments] The open-vocabulary and cross-domain transfer results require explicit details on prompt construction, evaluation protocols, and whether any post-hoc hyperparameter tuning was performed per domain; the current description leaves open the possibility that reported generalization is partly attributable to evaluation choices rather than the model itself.
minor comments (2)
- The abstract and method description would benefit from a concise statement of the exact diffusion backbone (e.g., Stable Diffusion v1.5 or v2) and latent resolution used, as these choices affect reproducibility.
- Qualitative figures should include failure cases or edge examples (e.g., ambiguous text prompts or low-contrast regions) to balance the reported successes and strengthen the generalization narrative.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of our experimental validation. We agree that additional details and ablations will strengthen the claims regarding the contributions of individual components and the robustness of the generalization results. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the central SOTA and cross-domain claims rest on reported numbers, yet the manuscript provides insufficient ablations isolating the contribution of concatenated mask latents versus the multi-scale CLIP injection; without these, it is difficult to confirm that the performance gains derive from the diffusion priors rather than the added conditioning pathways.
Authors: We agree that isolating the contributions of the concatenated mask latents and the multi-scale CLIP text pathway is valuable for confirming the role of the diffusion priors. In the revised manuscript, we will add targeted ablations: (1) a variant using only image latents without mask concatenation, (2) a variant removing the multi-scale CLIP injection while retaining mask conditioning, and (3) comparisons against a non-diffusion baseline with equivalent conditioning. These will be reported on the same benchmarks to quantify each component's impact. revision: yes
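The three proposed variants amount to a small ablation grid over conditioning components and pretraining. The flag names below are hypothetical shorthand for the components described in the response, not the authors' configuration keys.

```python
# Hypothetical ablation grid for the variants proposed above. Each variant
# toggles exactly one component relative to the full model.
ABLATIONS = {
    "full":           dict(mask_latents=True,  clip_injection=True,  diffusion_pretrained=True),
    "no_mask_concat": dict(mask_latents=False, clip_injection=True,  diffusion_pretrained=True),
    "no_clip_inject": dict(mask_latents=True,  clip_injection=False, diffusion_pretrained=True),
    "non_diffusion":  dict(mask_latents=True,  clip_injection=True,  diffusion_pretrained=False),
}

def differs_from_full(name):
    """Return the flags a variant toggles relative to the full model."""
    full = ABLATIONS["full"]
    return [k for k, v in ABLATIONS[name].items() if v != full[k]]
```

Reporting all four rows on the same benchmarks would directly attribute any gain to mask concatenation, text injection, or the diffusion priors themselves.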
-
Referee: [Experiments] The open-vocabulary and cross-domain transfer results require explicit details on prompt construction, evaluation protocols, and whether any post-hoc hyperparameter tuning was performed per domain; the current description leaves open the possibility that reported generalization is partly attributable to evaluation choices rather than the model itself.
Authors: We appreciate this point and will expand the experimental details in the revision. Specifically, we will include: (i) the exact prompt templates and phrasing used for open-vocabulary queries (e.g., class-name-only vs. descriptive), (ii) full evaluation protocols including mIoU computation, dataset splits, and inference settings, and (iii) explicit confirmation that a single set of hyperparameters and the same trained model were used across all domains without per-domain tuning or post-hoc adjustments. This will clarify that the reported cross-domain performance stems from the generalist framework. revision: yes
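For point (ii), the mIoU in question is the standard per-class intersection-over-union averaged over classes. A minimal reference implementation, assuming the common `ignore_index` convention for unlabeled pixels (an assumption, not a detail from the paper):

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Standard mIoU over integer label maps; ignore_index pixels excluded."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                 # skip classes absent from pred and gt
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1], [1, 1, 0]])
gt   = np.array([[0, 1, 1], [1, 1, 255]])
# class 0: IoU 1/2; class 1: IoU 3/4; mIoU = 0.625
```

Pinning down whether scores are averaged per image or over the dataset-wide confusion matrix is exactly the kind of protocol detail the referee asks to be stated explicitly.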
Circularity Check
No significant circularity; empirical training recipe
full rationale
The paper describes a concrete architectural repurposing of a pretrained diffusion U-Net: image and ground-truth mask are encoded to latents and concatenated as conditioning, with parallel multi-scale CLIP text injection, all trained under the standard denoising objective. Central claims of SOTA performance, open-vocabulary generalization, and cross-domain transfer rest on reported experimental outcomes across benchmarks rather than any equation, parameter fit, or uniqueness theorem that reduces to the inputs by construction. No self-definitional loop, fitted-input-as-prediction, or load-bearing self-citation chain appears; the derivation chain is a training procedure validated externally by results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pretrained diffusion models' denoising trajectories encode rich, spatially aligned visual priors usable for segmentation.
Reference graph
Works this paper leans on
-
[1]
Amit, T., Shaharbany, T., Nachmani, E., Wolf, L.: Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390 (2021)
-
[2]
Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22861–22872 (2024)
-
[3]
Baid, U., Ghodasara, S., Mohan, S., Bilello, M., Calabrese, E., Colak, E., Farahani, K., Kalpathy-Cramer, J., Kitamura, F.C., Pati, S., et al.: The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 (2021)
-
[4]
Baranchuk, D., Rubachev, I., Voynov, A., Khrulkov, V., Babenko, A.: Label-efficient semantic segmentation with diffusion models. arXiv preprint arXiv:2112.03126 (2021)
-
[5]
Bucher, M., Vu, T.H., Cord, M., Pérez, P.: Zero-shot semantic segmentation. Advances in Neural Information Processing Systems 32 (2019)
-
[6]
Cai, L., Zhao, K., Yuan, H., Zhang, Y., Zhang, S., Huang, K.: Freemask: Rethinking the importance of attention masks for zero-shot video editing. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 1898–1906 (2025)
-
[7]
Cavagnero, N., Rosi, G., Cuttano, C., Pistilli, F., Ciccone, M., Averta, G., Cermelli, F.: Pem: Prototype-based efficient maskformer for image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15804–15813 (2024)
-
[8]
Celikkan, E., Saberioon, M., Herold, M., Klein, N.: Semantic segmentation of crops and weeds with probabilistic modeling and uncertainty quantification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 582–592 (2023)
-
[9]
Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
-
[10]
Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV). pp. 801–818 (2018)
-
[11]
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., Sutskever, I.: Generative pretraining from pixels. In: International conference on machine learning. pp. 1691–1703. PMLR (2020)
-
[12]
Chen, S., Shou, Q., Chen, H., Zhou, Y., Feng, K., Hu, W., Zhang, Y.F., Lin, Y., Huang, W., Song, M., et al.: Unify-agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620 (2026)
-
[13]
Chen, Y., Jiang, M., Zheng, K., Liang, J., Tie, C., Lu, H., Wu, R., Dong, H.: Learning part-aware dense 3d feature field for generalizable articulated object manipulation. arXiv preprint arXiv:2602.14193 (2026)
-
[14]
Chen, Y., Tie, C., Wu, R., Dong, H.: Eqvafford: SE(3) equivariance for point-level affordance learning. arXiv preprint arXiv:2408.01953 (2024)
-
[15]
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1290–1299 (2022)
-
[16]
Chiu, M.T., Xu, X., Wei, Y., Huang, Z., Schwing, A.G., Brunner, R., Khachatrian, H., Karapetyan, H., Dozier, I., Rose, G., et al.: Agriculture-vision: A large aerial image database for agricultural pattern analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2828–2838 (2020)
-
[17]
Cho, S., Shin, H., Hong, S., Arnab, A., Seo, P.H., Kim, S.: Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4113–4123 (2024)
-
[18]
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
-
[19]
Couairon, P., Shukor, M., Haugeard, J.E., Cord, M., Thome, N.: Diffcut: Catalyzing zero-shot semantic segmentation with diffusion features and recursive normalized cut. Advances in Neural Information Processing Systems 37, 13548–13578 (2024)
-
[20]
Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., Raskar, R.: Deepglobe 2018: A challenge to parse the earth through satellite images. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops. pp. 172–181 (2018)
-
[21]
Ding, J., Xue, N., Xia, G.S., Dai, D.: Decoupling zero-shot semantic segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11583–11592 (2022)
-
[22]
Du, Y., Wei, F., Zhang, Z., Shi, M., Gao, Y., Li, G.: Learning to prompt for open-vocabulary object detection with vision-language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14084–14093 (2022)
-
[23]
Duan, Y., Guo, X., Zhu, Z.: Diffusiondepth: Diffusion denoising approach for monocular depth estimation. In: European Conference on Computer Vision. pp. 432–449. Springer (2024)
-
[24]
Fang, H., Li, F., Wu, J., Fu, H., Sun, X., Son, J., Yu, S., Zhang, M., Yuan, C., Bian, C., et al.: Refuge2 challenge: A treasure trove for multi-dimension analysis and evaluation in glaucoma screening. arXiv preprint arXiv:2202.08994 (2022)
-
[25]
Feng, Z., Kang, Z., Wang, Q., Du, Z., Yan, J., Shi, S., Yuan, C., Liang, H., Deng, Y., Li, Q., et al.: Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes. arXiv preprint arXiv:2510.19400 (2025)
-
[26]
Fu, Y., Lou, M., Yu, Y.: Segman: Omni-scale context modeling with state space models and local attention for semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19077–19087 (2025)
-
[27]
Gabeur, V., Long, S., Peng, S., Voigtlaender, P., Sun, S., Bao, Y., Truong, K., Wang, Z., Zhou, W., Barron, J.T., et al.: Image generators are generalist vision learners. arXiv preprint arXiv:2604.20329 (2026)
-
[28]
Gao, L., Zhou, Y., Tian, J., Cai, W.: Ddctnet: A deformable and dynamic cross-transformer network for road extraction from high-resolution remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 62, 1–19 (2024)
-
[29]
Garcia, G.M., Abou Zeid, K., Schmidt, C., De Geus, D., Hermans, A., Leibe, B.: Fine-tuning image-conditional diffusion models is easier than you think. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 753–762. IEEE (2025)
-
[30]
Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: European conference on computer vision. pp. 540–557. Springer (2022)
-
[31]
Gu, X., Lin, T.Y., Kuo, W., Cui, Y.: Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921 (2021)
-
[32]
Gu, Z., Chen, H., Xu, Z.: Diffusioninst: Diffusion model for instance segmentation. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2730–2734. IEEE (2024)
-
[33]
Hatamizadeh, A., Kautz, J.: Mambavision: A hybrid mamba-transformer vision backbone. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25261–25270 (2025)
-
[34]
He, J., Li, H., Yin, W., Liang, Y., Li, L., Zhou, K., Zhang, H., Liu, B., Chen, Y.C.: Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124 (2024)
-
[35]
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)
-
[36]
Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
-
[37]
Huang, L., Jiang, B., Lv, S., Liu, Y., Fu, Y.: Deep-learning-based semantic segmentation of remote sensing images: A survey. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 17, 8370–8396 (2023)
- [38]
-
[39]
Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18(2), 203–211 (2021)
-
[40]
Jain, J., Li, J., Chiu, M.T., Hassani, A., Orlov, N., Shi, H.: Oneformer: One transformer to rule universal image segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2989–2998 (2023)
-
[41]
Jain, J., Singh, A., Orlov, N., Huang, Z., Li, J., Walton, S., Shi, H.: Semask: Semantically masked transformers for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 752–761 (2023)
-
[42]
Ji, W., Yu, S., Wu, J., Ma, K., Bian, C., Bi, Q., Li, J., Liu, H., Cheng, L., Zheng, Y.: Learning calibrated medical image segmentation via multi-rater agreement modeling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12341–12351 (2021)
-
[43]
Jiao, S., Zhu, H., Huang, J., Zhao, Y., Wei, Y., Shi, H.: Collaborative vision-text representation optimizing for open-vocabulary segmentation. In: European Conference on Computer Vision. pp. 399–416. Springer (2024)
-
[44]
Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9492–9502 (2024)
-
[45]
Kerssies, T., Cavagnero, N., Hermans, A., Norouzi, N., Averta, G., Leibe, B., Dubbelman, G., de Geus, D.: Your vit is secretly an image segmentation model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 25303–25313 (2025)
-
[46]
Kim, C., Shin, H., Hong, E., Yoon, H., Arnab, A., Seo, P.H., Hong, S., Kim, S.: Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096 (2025)
-
[47]
Kim, C., Ju, D., Han, W., Yang, M.H., Hwang, S.J.: Distilling spectral graph for object-context aware open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15033–15042 (2025)
-
[48]
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6399–6408 (2019)
-
[49]
Le, M.Q., Nguyen, T.V., Le, T.N., Do, T.T., Do, M.N., Tran, M.T.: Maskdiff: Modeling mask distribution with diffusion probabilistic model for few-shot instance segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 2874–2881 (2024)
-
[50]
Lee, M., Cho, S., Lee, J., Yang, S., Choi, H., Kim, I.J., Lee, S.: Effective sam combination for open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26081–26090 (2025)
-
[51]
Li, B., Weinberger, K.Q., Belongie, S., Koltun, V., Ranftl, R.: Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546 (2022)
-
[52]
Li, F., Zhang, H., Xu, H., Liu, S., Zhang, L., Ni, L.M., Shum, H.Y.: Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3041–3050 (2023)
-
[53]
Li, J., Cai, Y., Li, Q., Kou, M., Zhang, T.: A review of remote sensing image segmentation by deep learning methods. International Journal of Digital Earth 17(1), 2328827 (2024)
-
[54]
Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10965–10975 (2022)
-
[55]
Li, Y., Cheng, T., Feng, B., Liu, W., Wang, X.: Mask-adapter: The devil is in the masks for open-vocabulary segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 14998–15008 (2025)
-
[56]
Liang, F., Wu, B., Dai, X., Li, K., Zhao, Y., Zhang, H., Zhang, P., Vajda, P., Marculescu, D.: Open-vocabulary semantic segmentation with mask-adapted clip. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 7061–7070 (2023)
-
[57]
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
-
[58]
Liu, Y., Bai, S., Li, G., Wang, Y., Tang, Y.: Open-vocabulary segmentation with semantic-assisted calibration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3491–3500 (2024)
-
[59]
Liu, Y., Wu, S.L., Bai, S., Wang, J., Wang, Y., Tang, Y.: Stepping out of similar semantic space for open-vocabulary segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22664–22674 (2025)
-
[60]
Mei, J., Li, R.J., Gao, W., Cheng, M.M.: Coanet: Connectivity attention network for road extraction from satellite imagery. IEEE Transactions on Image Processing 30, 8540–8552 (2021)
-
[61]
Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., et al.: Simple open-vocabulary object detection. In: European conference on computer vision. pp. 728–755. Springer (2022)
-
[62]
Mottaghi, R., Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 891–898 (2014)
-
[63]
Ni, Z., Chen, X., Zhai, Y., Tang, Y., Wang, Y.: Context-guided spatial feature reconstruction for efficient semantic segmentation. In: European Conference on Computer Vision. pp. 239–255. Springer (2024)
-
[64]
Patni, S., Agarwal, A., Arora, C.: Ecodepth: Effective conditioning of diffusion models for monocular depth estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 28285–28295 (2024)
-
[65]
Peng, Z., Xu, Z., Zeng, Z., Wen, C., Huang, Y., Yang, M., Tang, F., Shen, W.: Understanding fine-tuning clip for open-vocabulary semantic segmentation in hyperbolic space. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4562–4572 (2025)
-
[66]
Qorbani, R., Villani, G., Panagiotakopoulos, T., Colomer, M.B., Härenstam-Nielsen, L., Segu, M., Dovesi, P.L., Karlgren, J., Cremers, D., Tombari, F., et al.: Semantic library adaptation: Lora retrieval and fusion for open-vocabulary semantic segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9804–9815 (2025)
-
[67]
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
-
[68]
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
-
[69]
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
-
[70]
Saxena, S., Herrmann, C., Hur, J., Kar, A., Norouzi, M., Sun, D., Fleet, D.J.: The surprising effectiveness of diffusion models for optical flow and monocular depth estimation. Advances in Neural Information Processing Systems 36, 39443–39469 (2023)
-
[71]
Saxena, S., Kar, A., Norouzi, M., Fleet, D.J.: Monocular depth estimation using diffusion models. arXiv preprint arXiv:2302.14816 (2023)
-
[72]
Shan, X., Wu, D., Zhu, G., Shao, Y., Sang, N., Gao, C.: Open-vocabulary semantic segmentation with image embedding balancing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 28412–28421 (2024)
-
[73]
Shim, J.h., Yu, H., Kong, K., Kang, S.J.: Feedformer: Revisiting transformer decoder for efficient semantic segmentation. In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 2263–2271 (2023)
-
[74]
Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
-
[75]
Song, Y., Ermon, S.: Improved techniques for training score-based generative models. Advances in neural information processing systems 33, 12438–12448 (2020)
-
[76]
Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 7262–7272 (2021)
-
[77]
Su, Y., Zhan, X., Fang, H., Li, Y.L., Lu, C., Yang, L.: Motion before action: Diffusing object motion as manipulation condition. IEEE Robotics and Automation Letters (2025)
-
[78]
Su, Y., Zhang, C., Chen, S., Tan, L., Tang, Y., Wang, J., Liu, X.: Dspv2: Improved dense policy for effective and generalizable whole-body mobile manipulation. arXiv preprint arXiv:2509.16063 (2025)
-
[79]
Tao, J., Chen, Z., Sun, Z., Guo, H., Leng, B., Yu, Z., Wang, Y., He, Z., Lei, X., Yang, J.: Seg-road: a segmentation network for road extraction based on transformer and cnn with connectivity structures. Remote Sensing 15(6), 1602 (2023)
-
[80]
Tian, J., Aggarwal, L., Colaco, A., Kira, Z., Gonzalez-Franco, M.: Diffuse attend and segment: Unsupervised zero-shot segmentation using stable diffusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3554–3563 (2024)