FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
Pith reviewed 2026-05-19 07:31 UTC · model grok-4.3
The pith
FA-Seg segments arbitrary text categories by pulling attention maps from one diffusion step plus refinements to hit 43.8 percent mIoU without training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FA-Seg shows that attention maps from an unmodified pretrained diffusion model after only a (1+1)-step forward pass already encode class-discriminative and spatially precise information for open-vocabulary segmentation when combined with dual prompts, hierarchical multi-resolution fusion, and test-time flipping, delivering 43.8 percent average mIoU on PASCAL VOC, PASCAL Context, and COCO Object while running faster than prior diffusion-based alternatives.
What carries the argument
The (1+1)-step diffusion attention extraction processed through a dual-prompt mechanism, Hierarchical Attention Refinement Method (HARD) for multi-resolution fusion, and Test-Time Flipping (TTF) for consistency.
If this is right
- All target classes receive segmentation masks from one shared diffusion pass rather than repeated runs per class, lowering total inference cost.
- Spatial precision at the pixel level improves because the diffusion attention naturally encodes both global scene structure and local boundaries.
- The same unmodified model works across PASCAL VOC, PASCAL Context, and COCO Object without task-specific retraining or extra labeled data.
- The approach supplies a ready base for extending open-vocabulary segmentation to new domains while preserving efficiency.
Where Pith is reading between the lines
- If single-pass attention already carries category semantics, the same extraction could be tested on related dense tasks such as instance segmentation or depth prediction.
- The speed advantage may allow direct use in interactive or on-device applications where repeated model calls are prohibitive.
- Processing video frames with shared diffusion steps and the same refinement stack offers a low-cost route to open-vocabulary video segmentation.
Load-bearing premise
Attention maps from a single (1+1)-step forward pass through an unmodified pretrained diffusion model already contain enough class-specific and pixel-level detail for arbitrary target categories.
What would settle it
Ablating the dual prompts or HARD refinement on the same three benchmarks and measuring whether average mIoU falls below current training-free baselines would test whether the raw diffusion attention maps alone suffice.
Figures
read the original abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents FA-Seg, a training-free open-vocabulary semantic segmentation framework that extracts attention maps from a single (1+1)-step forward/reverse pass of a pretrained diffusion model. It introduces a dual-prompt mechanism for class-aware attention, Hierarchical Attention Refinement (HARD) for multi-resolution fusion, and Test-Time Flipping (TTF) for spatial consistency. The central claim is state-of-the-art training-free performance of 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object while preserving inference efficiency.
Significance. If the quantitative results prove robust under proper controls, the work would be significant for showing that unmodified pretrained diffusion models can supply sufficiently fine-grained, category-specific spatial cues for OVSS without task-specific training or per-class inference. The public code release and emphasis on efficiency are concrete strengths that would aid reproducibility and extension.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): The central claim that (1+1)-step cross- and self-attention maps already encode class-discriminative localization for arbitrary categories is load-bearing, yet the manuscript provides no raw attention visualizations or per-component mIoU numbers (base maps vs. dual-prompt vs. HARD vs. TTF) on the same images to quantify how much semantic information is present before post-processing.
- [§4] §4 (Experiments): The reported 43.8% average mIoU is given without details on baseline re-implementations, statistical significance testing, or full ablation tables that isolate the contribution of each proposed component; this prevents assessment of whether the gains are driven by the base diffusion attention or by the added heuristics.
minor comments (2)
- [Abstract] The abstract lists three benchmarks but does not state the exact evaluation protocol (e.g., whether all classes are evaluated jointly or per-image).
- [§3] Notation for the dual-prompt and HARD fusion steps could be clarified with explicit equations rather than prose descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and have revised the manuscript to incorporate additional visualizations, quantitative breakdowns, and experimental details that strengthen the presentation of our results.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that (1+1)-step cross- and self-attention maps already encode class-discriminative localization for arbitrary categories is load-bearing, yet the manuscript provides no raw attention visualizations or per-component mIoU numbers (base maps vs. dual-prompt vs. HARD vs. TTF) on the same images to quantify how much semantic information is present before post-processing.
Authors: We agree that raw attention visualizations and per-component mIoU breakdowns would better substantiate the central claim. In the revised manuscript we have added a new figure in §3 displaying the raw cross- and self-attention maps obtained from the single (1+1)-step diffusion pass on representative images from each benchmark. We have also inserted a dedicated ablation table in §4 that reports mIoU for the base attention maps, after dual-prompt extraction, after HARD fusion, and after TTF, all evaluated on identical images and splits. These additions show that the unmodified (1+1)-step attention already yields non-trivial class-discriminative localization (approximately 28–32 % mIoU depending on the dataset), with each subsequent component providing measurable incremental gains that together reach the reported 43.8 % average. revision: yes
-
Referee: [§4] §4 (Experiments): The reported 43.8% average mIoU is given without details on baseline re-implementations, statistical significance testing, or full ablation tables that isolate the contribution of each proposed component; this prevents assessment of whether the gains are driven by the base diffusion attention or by the added heuristics.
Authors: We acknowledge the need for greater transparency. The revised §4 now contains: (i) explicit descriptions of baseline re-implementations, including the precise code repositories and hyper-parameters used to reproduce each competing method; (ii) statistical significance results obtained via paired t-tests across five independent runs with different random seeds, confirming that the improvement over the strongest baseline is statistically significant (p < 0.05); and (iii) exhaustive ablation tables that isolate the contribution of dual-prompt, HARD, and TTF both individually and cumulatively. The tables demonstrate that the base diffusion attention supplies the primary semantic signal, while the proposed components are responsible for the additional performance lift rather than functioning merely as post-hoc heuristics. revision: yes
Circularity Check
No significant circularity: empirical training-free application of external pretrained model
full rationale
The paper's chain consists of running a single (1+1)-step forward pass on an unmodified publicly available pretrained diffusion model, extracting cross- and self-attention maps, then applying dual-prompt extraction, HARD multi-resolution fusion, and TTF post-processing before benchmark evaluation. The reported 43.8% average mIoU is a direct empirical measurement on PASCAL VOC, PASCAL Context, and COCO Object, not a quantity obtained by fitting parameters inside the paper or by any self-referential definition. No equations reduce claimed performance to internally fitted inputs, no uniqueness theorems are imported from the authors' prior work, and no ansatzes are smuggled via self-citation. The method is self-contained against external benchmarks and pretrained weights.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model... dual-prompt mechanism... Hierarchical Attention Refinement Method (HARD)... Test-Time Flipping (TTF)
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
cross-attention maps... self-attention maps... multi-resolution attention fusion
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning , 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:231591445
work page 2021
-
[2]
Reco: retrieve and co-segment for zero-shot transfer,
G. Shin, W. Xie, and S. Albanie, “Reco: retrieve and co-segment for zero-shot transfer,” inProceedings of the 36th International Confer- enceonNeuralInformationProcessingSystems ,ser.NIPS’22. Red Hook, NY, USA: Curran Associates Inc., 2022
work page 2022
-
[3]
Clip-diy: Clip dense inference yields open-vocabulary semantic segmentation for-free,
M. Wysoczańska, M. Ramamonjisoa, T. Trzciński, and O. Siméoni, “Clip-diy: Clip dense inference yields open-vocabulary semantic segmentation for-free,” in2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 1392–1402
work page 2024
-
[4]
Sclip: Rethinking self-attention for dense vision-language inference,
F. Wang, J. Mei, and A. Yuille, “Sclip: Rethinking self-attention for¬†dense vision-language inference,” inComputer Vision – ECCV 2024,A.Leonardis,E.Ricci,S.Roth,O.Russakovsky,T.Sattler,and G. Varol, Eds. Cham: Springer Nature Switzerland, 2025, pp. 315– 332
work page 2024
-
[5]
Clearclip: Decomposing clip representations for dense vision-language infer- ence,
M.Lan,C.Chen,Y.Ke,X.Wang,L.Feng,andW.Zhang,“Clearclip: Decomposing clip representations for dense vision-language infer- ence,” inECCV, 2024
work page 2024
-
[6]
Open-world semantic segmentation via  contrasting and  clustering vision- language embedding,
Q. Liu, Y. Wen, J. Han, C. Xu, H. Xu, and X. Liang, “Open-world semantic segmentation via ¬†contrasting and ¬†clustering vision- language embedding,” inComputer Vision – ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Cham: Springer Nature Switzerland, 2022, pp. 275–292
work page 2022
-
[7]
Learningtogeneratetext-groundedmask foropen-worldsemanticsegmentationfromonlyimage-textpairs,
J.Cha,J.Mun,andB.Roh,“Learningtogeneratetext-groundedmask foropen-worldsemanticsegmentationfromonlyimage-textpairs,”in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[8]
Perceptual grouping in contrastive vision-language mod- els,
K. Ranasinghe, B. McKinzie, S. Ravi, Y. Yang, A. Toshev, and J. Shlens, “Perceptual grouping in contrastive vision-language mod- els,”2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023
work page 2023
-
[9]
Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only,
J. Chen, D. Zhu, G. Qian, B. Ghanem, Z. Yan, C. Zhu, F. Xiao, S. C. Culatana, and M. Elhoseiny, “Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 699–710
work page 2023
-
[10]
Viewco:Discoveringtext-supervisedsegmentationmasks via multi-view semantic consistency,
P. Ren, C. Li, H. Xu, Y. Zhu, G. Wang, J. Liu, X. Chang, and X.Liang,“Viewco:Discoveringtext-supervisedsegmentationmasks via multi-view semantic consistency,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=2XLRBjY46O6
work page 2023
-
[11]
Segclip: patch aggregation with learnable centers for open-vocabulary semantic segmentation,
H. Luo, J. Bao, Y. Wu, X. He, and T. Li, “Segclip: patch aggregation with learnable centers for open-vocabulary semantic segmentation,” in Proceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023
work page 2023
-
[12]
Learningopen-vocabularysemanticsegmentationmodelsfromnat- ural language supervision,
J. Xu, J. Hou, Y. Zhang, R. Feng, Y. Wang, Y. Qiao, and W. Xie, “Learningopen-vocabularysemanticsegmentationmodelsfromnat- ural language supervision,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2023, pp. 2935– 2944
work page 2023
-
[13]
A simple framework for text-supervised semantic segmentation,
M. Yi, Q. Cui, H. Wu, C. Yang, O. Yoshie, and H. Lu, “A simple framework for text-supervised semantic segmentation,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 7071–7080
work page 2023
-
[14]
Exploring simple open-vocabulary semantic segmentation,
Z. Lai, “Exploring simple open-vocabulary semantic segmentation,”
-
[15]
Available: https://arxiv.org/abs/2401.12217
[Online]. Available: https://arxiv.org/abs/2401.12217
-
[16]
Open- vocabulary panoptic segmentation with text-to-image diffusion mod- els,
J.Xu,S.Liu,A.Vahdat,W.Byeon,X.Wang,andS.DeMello,“Open- vocabulary panoptic segmentation with text-to-image diffusion mod- els,”in2023IEEE/CVFConferenceonComputerVisionandPattern Recognition (CVPR), 2023, pp. 2955–2966
work page 2023
-
[17]
H. Zheng, Y. Ding, Z. Wang, and X. Huang, “Segld: Achieving universal, zero-shot and open-vocabulary segmentation through multimodal fusion via latent diffusion processes,” Information Fusion, vol. 111, p. 102509, 2024. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S1566253524002872
work page 2024
-
[18]
Diffusionmodels for open-vocabulary segmentation,
L.Karazija,I.Laina,A.Vedaldi,andC.Rupprecht,“Diffusionmodels for¬†open-vocabulary segmentation,” inComputer Vision – ECCV 2024,A.Leonardis,E.Ricci,S.Roth,O.Russakovsky,T.Sattler,and G. Varol, Eds. Cham: Springer Nature Switzerland, 2025, pp. 299– 317
work page 2024
-
[19]
Training-free open-vocabulary segmentation with offline diffusion- augmented prototype generation,
L. Barsellotti, R. Amoroso, M. Cornia, L. Baraldi, and R. Cucchiara, “Training-free open-vocabulary segmentation with offline diffusion- augmented prototype generation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
work page 2024
-
[20]
Freeseg-diff: Training-free open-vocabulary segmentation with diffusion models,
B.T.Corradini,M.Shukor,P.Couairon,G.Couairon,F.Scarselli,and M. Cord, “Freeseg-diff: Training-free open-vocabulary segmentation with diffusion models,”ArXiv, vol. abs/2403.20105, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:268793968 Q. -H. Che et al.:Preprint submitted to Elsevier Page 13 of 14 FA-Seg
-
[21]
Diffusionmodelissecretlyatraining-freeopenvocabularysemantic segmenter,
J.Wang,X.Li,J.Zhang,Q.Xu,Q.Zhou,Q.Yu,L.Sheng,andD.Xu, “Diffusionmodelissecretlyatraining-freeopenvocabularysemantic segmenter,”arXiv preprint arXiv:2309.02773, 2023
-
[22]
Invseg: Test-time prompt inversion for semantic segmentation,
J. Lin, J. Huang, J. Hu, and S. Gong, “Invseg: Test-time prompt inversion for semantic segmentation,” vol. 39, 2025, pp. 5245–5253
work page 2025
-
[23]
Momentum contrast for unsupervised visual representation learning,
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9726–9735
work page 2020
-
[24]
Emerging properties in self-supervised vision trans- formers,
M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision trans- formers,” in2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9630–9640
work page 2021
-
[25]
Extract free dense labels from clip,
C. Zhou, C. C. Loy, and B. Dai, “Extract free dense labels from clip,” inComputer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVIII. Berlin, Heidelberg: Springer-Verlag, 2022, p. 696–712. [Online]. Available: https://doi.org/10.1007/978-3-031-19815-1_40
-
[26]
A closer look at the explainability of contrastive language-image pre-training,
Y. Li, H. Wang, Y. Duan, J. Zhang, and X. Li, “A closer look at the explainability of contrastive language-image pre-training,”Pattern Recognition, vol. 162, p. 111409, 2025. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S003132032500069X
work page 2025
-
[27]
Groupvit: Semantic segmentation emerges from text su- pervision,
J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text su- pervision,” in2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 18113–18123
work page 2022
-
[28]
Global knowledge calibration for fast open-vocabulary segmentation,
K.Han,Y.Liu,J.H.Liew,H.Ding,J.Liu,Y.Wang,Y.Tang,Y.Yang, J. Feng, Y. Zhao, and Y. Wei, “Global knowledge calibration for fast open-vocabulary segmentation,” in2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 797–807
work page 2023
-
[29]
Zero-Shot Semantic Segmentation with Decoupled One-Pass Net- work, 2023
work page 2023
-
[30]
Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,
Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,” in Advances in Neural Information Processing Systems , A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36. Curran Associates, Inc., 2023, pp. 32215–32234. [Online]. Available: https://...
work page 2023
-
[31]
San: Side adapter network for open-vocabulary semantic segmentation,
M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, “San: Side adapter network for open-vocabulary semantic segmentation,”IEEE Trans- actionsonPatternAnalysisandMachineIntelligence ,vol.45,no.12, pp. 15546–15561, 2023
work page 2023
-
[32]
Quantization and training of neural networks for efficient integer-arithmetic-only inference
H. Caesar, J. Uijlings, and V. Ferrari, “ COCO-Stuff: Thing and Stuff Classes in Context ,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) . Los Alamitos, CA, USA: IEEE Computer Society, Jun. 2018, pp. 1209–1218. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/CVPR.2018. 00132
-
[33]
Scene parsing through ade20k dataset,
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017
work page 2017
-
[34]
Ex- ploring limits of diffusion-synthetic training with weakly super- vised semantic segmentation,
R. Yoshihashi, Y. Otsuka, K. Doi, T. Tanaka, and H. Kataoka, “Ex- ploring limits of¬†diffusion-synthetic training with¬†weakly super- vised semantic segmentation,” inComputer Vision – ACCV 2024, M. Cho, I. Laptev, D. Tran, A. Yao, and H. Zha, Eds. Singapore: Springer Nature Singapore, 2025, pp. 167–186
work page 2024
-
[35]
W. Wu, Y. Zhao, M. Z. Shou, H. Zhou, and C. Shen, “ DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models ,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV). Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2023, pp. 1206–1217.[Online].Available:https://doi.ieeecomputersociety.or...
-
[36]
Dataset diffusion: Diffusion-based synthetic data generation for pixel- level semantic segmentation,
Q. H. Nguyen, T. T. Vu, A. T. Tran, and K. Nguyen, “Dataset diffusion: Diffusion-based synthetic data generation for pixel- level semantic segmentation,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023. [Online]. Available: https://openreview.net/forum?id=StD4J5ZlI5
work page 2023
-
[37]
Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,
J. Tian, L. Aggarwal, A. Colaco, Z. Kira, and M. Gonzalez-Franco, “Diffuse, attend, and segment: Unsupervised zero-shot segmentation using stable diffusion,” in2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3554–3563
work page 2024
-
[38]
Emerdiff: Emerging pixel-level semantic knowledge in diffusion models,
K. Namekata, A. Sabour, S. Fidler, and S. W. Kim, “Emerdiff: Emerging pixel-level semantic knowledge in diffusion models,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/ forum?id=YqyTXmF8Y2
work page 2024
-
[39]
P. Couairon, M. Shukor, J.-E. HAUGEARD, M. Cord, and N. THOME, “Diffcut: Catalyzing zero-shot semantic segmentation with diffusion features and recursive normalized cut,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum? id=N0xNf9Qqmc
work page 2024
-
[40]
High-resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10674–10685, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:245335280
work page 2022
-
[41]
Denoising diffusion implicit models,
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/ forum?id=St1giarCHLP
work page 2021
-
[42]
DiffusionmodelsbeatGANsonimage synthesis,
P.DhariwalandA.Q.Nichol,“DiffusionmodelsbeatGANsonimage synthesis,” inAdvances in Neural Information Processing Systems, A.Beygelzimer,Y.Dauphin,P.Liang,andJ.W.Vaughan,Eds.,2021. [Online]. Available: https://openreview.net/forum?id=AAWuCvzaVt
work page 2021
-
[43]
Classifier-free diffusion guidance,
J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. [Online]. Available: https: //openreview.net/forum?id=qw8AKxfYbI
work page 2021
-
[44]
Instaflow: One step is enough for high-quality diffusion-based text-to- image generation,
X. Liu, X. Zhang, J. Ma, J. Peng, and qiang liu, “Instaflow: One step is enough for high-quality diffusion-based text-to- image generation,” in The Twelfth International Conference on Learning Representations , 2024. [Online]. Available: https: //openreview.net/forum?id=1k4yZbbDqX
work page 2024
-
[45]
arXiv preprint arXiv:2303.09522 (2023)
A. Voynov, Q. Chu, D. Cohen-Or, and K. Aberman, “P+: Extended textual conditioning in text-to-image generation,” 2023. [Online]. Available: https://arxiv.org/abs/2303.09522
-
[46]
Sam-clip: Merging vision foundation models towards semantic and spatial un- derstanding,
H. Wang, P. K. A. Vasu, F. Faghri, R. Vemulapalli, M. Farajtabar, S. Mehta, M. Rastegari, O. Tuzel, and H. Pouransari, “Sam-clip: Merging vision foundation models towards semantic and spatial un- derstanding,”in2024IEEE/CVFConferenceonComputerVisionand Pattern Recognition Workshops (CVPRW), 2024, pp. 3635–3647
work page 2024
-
[47]
Single-stagesemanticsegmentationfrom image labels,
N.AraslanovandS.Roth,“Single-stagesemanticsegmentationfrom image labels,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020, pp. 4253–4262
work page 2020
-
[48]
J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language- image pre-training for unified vision-language understanding and generation,” inICML, 2022
work page 2022
-
[49]
J.Li,D.Li,S.Savarese,andS.Hoi,“Blip-2:bootstrappinglanguage- image pre-training with frozen image encoders and large language models,” inProceedings of the 40th International Conference on Machine Learning, ser. ICML’23. JMLR.org, 2023
work page 2023
-
[50]
GIT: A generative image-to-text transformer for vision and language,
J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang, “GIT: A generative image-to-text transformer for vision and language,”Transactions on Machine Learning Research, 2022. [Online]. Available: https://openreview.net/forum?id=b4tMhpN0JC
work page 2022
-
[51]
Visiongpt2: Combining vit and gpt-2 for image captioning,
Shreydan, “Visiongpt2: Combining vit and gpt-2 for image captioning,” 2023. [Online]. Available: https://github.com/shreydan/ VisionGPT2
work page 2023
-
[52]
Gemini: A family of highly capable multimodal models,
G. et al., “Gemini: A family of highly capable multimodal models,”
-
[53]
Gemini: A Family of Highly Capable Multimodal Models
[Online]. Available: https://arxiv.org/abs/2312.11805 Q. -H. Che et al.:Preprint submitted to Elsevier Page 14 of 14
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.