Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
Pith reviewed 2026-05-19 13:14 UTC · model grok-4.3
The pith
Split-text conditioning improves diffusion transformers by processing semantic primitives in separate denoising stages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that DiT-ST mitigates the complete-text comprehension defect of DiTs by converting complete-text captions into split-text captions, a collection of simplified sentences, and injecting tokens of diverse semantic primitive types into input tokens via cross-attention at appropriate timesteps. LLMs are used to parse captions, extract diverse primitives, and hierarchically sort them, while the denoising process is partitioned according to differential sensitivities to these primitive types, enabling incremental injection that enhances representation learning of specific semantic primitive types across different stages.
What carries the argument
Split-text conditioning framework that extracts semantic primitives with LLMs and injects them incrementally into DiT at partitioned denoising timesteps via cross-attention.
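The staged injection described above can be sketched in a few lines. This is only an illustration under stated assumptions: the object, relation, attribute order comes from the passage quoted further down this page, while `STAGE_START`, `SplitCaption`, `active_types`, and `conditioning_text` are hypothetical names and the boundary values are placeholders, not the paper's settings.

```python
from dataclasses import dataclass, field

# Injection order taken from the passage quoted on this page:
# object -> relation -> attribute.
PRIMITIVE_ORDER = ("object", "relation", "attribute")

# Hypothetical stage boundaries, as fractions of the denoising trajectory
# (0.0 = first denoising step, 1.0 = last). The paper's actual values are
# not given on this page.
STAGE_START = {"object": 0.0, "relation": 0.3, "attribute": 0.6}


@dataclass
class SplitCaption:
    """Simplified sentences grouped by semantic primitive type."""
    sentences: dict[str, list[str]] = field(default_factory=dict)


def active_types(step: int, num_steps: int) -> list[str]:
    """Primitive types whose tokens are visible to cross-attention at `step`."""
    progress = step / max(num_steps - 1, 1)
    return [t for t in PRIMITIVE_ORDER if STAGE_START[t] <= progress]


def conditioning_text(caption: SplitCaption, step: int, num_steps: int) -> list[str]:
    """Incremental injection: a type stays active once it has been introduced."""
    out: list[str] = []
    for t in active_types(step, num_steps):
        out.extend(caption.sentences.get(t, []))
    return out


# Made-up example caption, already split by primitive type.
caption = SplitCaption({
    "object": ["There is a cat.", "There is a sofa."],
    "relation": ["The cat sits on the sofa."],
    "attribute": ["The cat is black.", "The sofa is red."],
})
```

With a 10-step sampler, `conditioning_text(caption, 0, 10)` returns only the object sentences, and the last step sees all five sentences. In a real pipeline the returned sentences would be encoded by the text encoder and passed to cross-attention; that part is omitted here.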
Load-bearing premise
The diffusion denoising process can be partitioned according to differential sensitivities to diverse semantic primitive types, and LLMs can reliably extract and hierarchically sort these primitives without introducing parsing errors that affect downstream generation quality.
What would settle it
The claim would be refuted if, on a set of complex captions, DiT-ST performed no better than the baseline DiT on metrics such as CLIP score or human preference.
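One way to make this test concrete is a paired comparison over the same captions. The sketch below assumes per-caption scores (for example CLIP scores) have already been computed for both models; `paired_comparison` and `claim_survives` are hypothetical helpers, and the numbers are placeholders rather than measurements.

```python
from statistics import mean


def paired_comparison(baseline, candidate, margin=0.0):
    """Compare per-caption scores of two models on the same caption set."""
    if len(baseline) != len(candidate):
        raise ValueError("both score lists must cover the same captions")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    wins = sum(d > margin for d in diffs)      # candidate clearly better
    losses = sum(d < -margin for d in diffs)   # candidate clearly worse
    return {
        "mean_diff": mean(diffs),
        "wins": wins,
        "losses": losses,
        "ties": len(diffs) - wins - losses,
    }


def claim_survives(result):
    """The claim is in trouble when the candidate is equal or worse overall."""
    return result["mean_diff"] > 0 and result["wins"] > result["losses"]


# Placeholder scores for four captions; not real results.
baseline_scores = [0.30, 0.28, 0.31, 0.25]
dit_st_scores = [0.29, 0.28, 0.30, 0.25]
result = paired_comparison(baseline_scores, dit_st_scores)
```

A real evaluation would add a significance test or bootstrap interval on `mean_diff`; this sketch only shows the shape of the comparison.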
Original abstract
Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DiT-ST, a split-text conditioning framework for text-to-image Diffusion Transformers. It uses LLMs to parse complete-text captions into hierarchically sorted split-text sentences that isolate semantic primitives, then injects tokens of these primitives into the DiT via cross-attention at selected denoising timesteps chosen according to the process's purported differential sensitivities to each primitive type. The central claim is that this staged, incremental injection mitigates the complete-text comprehension defect of standard DiTs and improves representation learning for specific semantic elements.
Significance. If the differential-sensitivity partitioning and injection schedule can be shown to be stable and independently measurable, the approach would constitute a lightweight, training-free architectural modification that directly targets a known weakness in DiT conditioning. It could be adopted by existing DiT pipelines with minimal overhead and might generalize to other transformer-based diffusion models. The paper does not yet supply the measurements or ablations needed to confirm these benefits.
major comments (3)
- [Method (description of timestep partitioning and injection schedule)] The manuscript states that the diffusion denoising process is partitioned 'according to its differential sensitivities to diverse semantic primitive types' and that 'appropriate timesteps' are thereby determined for incremental injection. No per-timestep attention statistics, FID curves, or ablation that isolates this partitioning from the final generation metric are presented; the schedule therefore remains an unverified heuristic rather than a data-driven choice.
- [Method (LLM parsing and hierarchical construction of split-text)] The central claim that DiT-ST mitigates the complete-text comprehension defect rests on the assumption that LLM-extracted primitives can be reliably sorted and injected without introducing parsing artifacts that degrade downstream quality. No error analysis of the LLM parsing step or comparison against alternative caption-splitting strategies is supplied.
- [Experiments] The abstract asserts that 'extensive experiments validate the effectiveness,' yet the manuscript supplies neither quantitative metrics (FID, CLIP score, human preference), baseline comparisons (standard DiT, other conditioning variants), nor ablation tables that would allow the reader to judge the magnitude or robustness of the reported improvement.
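The ablation requested in the first major comment could start from an explicit grid of injection schedules. The following is a rough sketch, assuming three primitive types and two stage boundaries expressed as fractions of the denoising trajectory; `ablation_grid` and the boundary values are hypothetical and not taken from the manuscript.

```python
from itertools import permutations

PRIMITIVES = ("object", "relation", "attribute")


def ablation_grid(boundary_options):
    """Enumerate injection schedules to compare against complete-text input.

    Each boundary pair (b1, b2) gives the fractions of the denoising
    trajectory at which the second and third primitive types are introduced.
    """
    # Baseline: the whole caption is available at every step.
    configs = [{"name": "complete-text", "order": None, "boundaries": None}]
    for order in permutations(PRIMITIVES):
        for b in boundary_options:
            configs.append({
                "name": "->".join(order) + f"@{b[0]}/{b[1]}",
                "order": order,
                "boundaries": b,
            })
    return configs


# Two hypothetical boundary settings give 1 + 6 * 2 = 13 configurations.
grid = ablation_grid([(0.3, 0.6), (0.5, 0.8)])
```

Holding the model, sampler, and caption set fixed across the grid is what would isolate the effect of the schedule from everything else.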
minor comments (2)
- [Method] Notation for the split-text tokens and the cross-attention injection operator should be defined explicitly with an equation or pseudocode block rather than described only in prose.
- [Figures] Figure captions should state the exact model backbone, resolution, and number of sampling steps used for all qualitative examples.
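The notation asked for in the first minor comment might look like the block below. Every symbol is an assumption introduced for illustration: $z_t$ for the image tokens at denoising timestep $t$ (running from $T$ down to $0$), $c_k$ for the text tokens of primitive type $k$, $\tau_k$ for the timestep at which type $k$ is first injected, and $W_Q, W_K, W_V$ for standard cross-attention projections with width $d$. It is a generic residual cross-attention form, not the manuscript's definition.

```latex
% Hypothetical notation; none of these symbols are taken from the manuscript.
\begin{align}
  \mathcal{K}_t &= \{\, k \in \{\mathrm{obj}, \mathrm{rel}, \mathrm{attr}\} : t \le \tau_k \,\},
    \qquad T \ge \tau_{\mathrm{obj}} > \tau_{\mathrm{rel}} > \tau_{\mathrm{attr}} \ge 0, \\
  C_t &= \operatorname{concat}_{k \in \mathcal{K}_t} c_k, \\
  \operatorname{Inject}(z_t, C_t) &= z_t
    + \operatorname{softmax}\!\left(\frac{(z_t W_Q)(C_t W_K)^{\top}}{\sqrt{d}}\right) C_t W_V .
\end{align}
```

Because $t$ decreases during sampling, a type stays in $\mathcal{K}_t$ once $t$ has dropped to $\tau_k$, which matches the incremental injection described in the summary.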
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript proposing DiT-ST. We address each major comment point by point below and indicate the revisions we will incorporate.
- Referee: [Method (description of timestep partitioning and injection schedule)] The manuscript states that the diffusion denoising process is partitioned 'according to its differential sensitivities to diverse semantic primitive types' and that 'appropriate timesteps' are thereby determined for incremental injection. No per-timestep attention statistics, FID curves, or ablation that isolates this partitioning from the final generation metric are presented; the schedule therefore remains an unverified heuristic rather than a data-driven choice.
  Authors: We appreciate the referee highlighting the need for stronger empirical grounding of the timestep schedule. Our partitioning draws from observed stage-wise sensitivities in the diffusion process, but we agree it would benefit from explicit validation. In the revised manuscript we will add per-timestep attention statistics, FID curves across injection variants, and an ablation that isolates the schedule choice from overall generation quality, thereby converting the current heuristic into a data-supported design choice.
  Revision: yes
- Referee: [Method (LLM parsing and hierarchical construction of split-text)] The central claim that DiT-ST mitigates the complete-text comprehension defect rests on the assumption that LLM-extracted primitives can be reliably sorted and injected without introducing parsing artifacts that degrade downstream quality. No error analysis of the LLM parsing step or comparison against alternative caption-splitting strategies is supplied.
  Authors: We agree that reliability of the LLM parsing step requires explicit verification. The revised version will include an error analysis of primitive extraction accuracy together with comparisons against alternative splitting strategies (rule-based segmentation and varied LLM prompting). These additions will quantify any parsing artifacts and confirm that the hierarchical construction does not degrade downstream image quality.
  Revision: yes
- Referee: [Experiments] The abstract asserts that 'extensive experiments validate the effectiveness,' yet the manuscript supplies neither quantitative metrics (FID, CLIP score, human preference), baseline comparisons (standard DiT, other conditioning variants), nor ablation tables that would allow the reader to judge the magnitude or robustness of the reported improvement.
  Authors: We acknowledge that the experimental section would be strengthened by more prominent and comprehensive reporting. The revised manuscript will expand the experiments with full quantitative tables (FID, CLIP score, human preference), direct baseline comparisons against standard DiT and related conditioning methods, and additional ablation studies. These changes will allow readers to assess both the magnitude and robustness of the observed gains.
  Revision: yes
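The parsing error analysis promised in the second response could be scored per primitive type against hand-annotated captions. The sketch below is one possible setup under stated assumptions: primitives are compared as exact-match strings, and `prf`, `score_parse`, and the example annotations are hypothetical.

```python
def prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of predicted primitives against gold ones."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def score_parse(predicted: dict[str, set[str]], gold: dict[str, set[str]]):
    """Score one caption's LLM parse, one primitive type at a time."""
    return {t: prf(predicted.get(t, set()), gold.get(t, set())) for t in gold}


# Made-up annotations for a single caption; not data from the paper.
gold = {
    "object": {"cat", "sofa", "lamp"},
    "relation": {"cat on sofa"},
    "attribute": {"black cat", "red sofa"},
}
predicted = {
    "object": {"cat", "sofa"},
    "attribute": {"black cat"},
}
scores = score_parse(predicted, gold)
```

Exact string matching is the simplest choice and will undercount paraphrases; a real analysis would likely need normalization or human judgment of matches.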
Circularity Check
No circularity: architectural proposal with independent empirical validation
The paper presents DiT-ST as an empirical architectural modification: LLM-based parsing of captions into split-text primitives, followed by staged cross-attention injection during denoising. No equations, closed-form predictions, or first-principles derivations appear that reduce the claimed gains to fitted parameters or self-referential definitions. The partitioning by 'differential sensitivities' is stated as a design choice supported by experiments rather than a mathematical result derived from the method itself. The framework is self-contained against external benchmarks and does not rely on load-bearing self-citations or ansatzes that loop back to the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large language models can accurately parse captions and hierarchically construct split-text inputs expressing semantic primitives and their interconnections.
- domain assumption: The diffusion denoising process exhibits differential sensitivities to diverse semantic primitive types that can be used to select appropriate injection timesteps.
invented entities (1)
- DiT-ST framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "prioritization order for semantic primitive types is object-relation-attribute"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.