PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Pith reviewed 2026-05-12 20:33 UTC · model grok-4.3
The pith
PIXART-α trains a high-quality text-to-image diffusion transformer in 10.8 percent of the training time reported for Stable Diffusion v1.5.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PIXART-α is a Transformer-based diffusion model for text-to-image synthesis whose generation quality matches state-of-the-art systems. The model is obtained by decomposing training into three successive stages that optimize pixel dependency, text-image alignment, and aesthetic quality in turn; by embedding cross-attention modules inside a Diffusion Transformer to handle text conditions efficiently; and by automatically labeling training pairs with dense pseudo-captions from a large vision-language model. These choices produce a system that trains in 675 A100 GPU days, 10.8 percent of the time reported for Stable Diffusion v1.5, while supporting up to 1024-pixel resolution and competitive or,
What carries the argument
Three-stage training decomposition together with a Diffusion Transformer that uses cross-attention for text conditioning and dense VLM-generated captions for alignment.
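To make the text-conditioning mechanism concrete, here is a minimal sketch of a single cross-attention step in plain Python: image tokens act as queries, text tokens supply keys and values, and the result is added back to the image tokens. This illustrates generic single-head cross-attention, not the paper's implementation. The function names, the tiny dimensions, and the omission of normalization, multiple heads, and the output projection are all assumptions made for brevity.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def random_matrix(rng, rows, cols, scale=0.1):
    """Hypothetical helper: small random weights or tokens for the demo."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
    """Queries come from image tokens; keys and values come from text tokens."""
    q = matmul(image_tokens, w_q)
    k = matmul(text_tokens, w_k)
    v = matmul(text_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)  # one distribution over text tokens per image token
    update = matmul(weights, v)
    # Residual connection: text information is added to the image tokens.
    out = [[x + u for x, u in zip(xr, ur)] for xr, ur in zip(image_tokens, update)]
    return out, weights


rng = random.Random(0)
image_tokens = random_matrix(rng, 4, 6)  # 4 image patches, width 6 (placeholder sizes)
text_tokens = random_matrix(rng, 3, 5)   # 3 text tokens, width 5 (placeholder sizes)
w_q = random_matrix(rng, 6, 6)
w_k = random_matrix(rng, 5, 6)
w_v = random_matrix(rng, 5, 6)
out, weights = cross_attention(image_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of `weights` sums to one, so every image token receives a convex combination of the projected text tokens. The number of text tokens can vary without changing the shape of the image-token output.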
If this is right
- High-resolution text-to-image synthesis up to 1024 pixels becomes practical at far lower compute budgets.
- Training cost is roughly 1 percent of that of the larger state-of-the-art model RAPHAEL, while semantic control and artistic quality are preserved.
- Carbon emissions associated with model training fall by roughly 90 percent relative to Stable Diffusion v1.5, the baseline used in the abstract.
- Startups and research groups can iterate on new generative models without requiring thousands of GPU days.
- The resulting models excel at image quality, artistry, and fine-grained semantic control according to the reported experiments.
Where Pith is reading between the lines
- The same staged-optimization pattern could be tested on video or 3D generation tasks to check whether similar efficiency gains appear.
- If dense captions prove decisive, then future work might focus on improving the captioning model itself rather than scaling raw image data.
- The cost reduction could enable more frequent public releases of updated models, shortening the iteration cycle in the field.
- Open questions remain about whether the efficiency advantage persists when the same decomposition is applied to other transformer variants or non-diffusion backbones.
Load-bearing premise
The three-stage training split and the use of dense pseudo-captions are the main reasons for both the quality and the large reduction in training cost, rather than differences in total data volume or other implementation choices.
What would settle it
A side-by-side training run of the identical architecture and data scale, once with the three-stage schedule and dense captions and once without them, that shows whether the reported quality and speed gains disappear.
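One way to lay out such a comparison is sketched below. The passage describes two runs (both ingredients on, both off). The sketch extends that to all four on/off combinations so each factor can be read separately, which is an assumption beyond the text. The class name, field names, and the example budget values are hypothetical placeholders, not figures from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Arm:
    """One training run in the comparison (all field names are illustrative)."""
    staged_schedule: bool
    dense_captions: bool
    architecture: str
    num_images: int
    gpu_days: float


def build_arms(architecture, num_images, gpu_days):
    """Enumerate every on/off combination of the two factors under one fixed budget."""
    return [
        Arm(staged, dense, architecture, num_images, gpu_days)
        for staged, dense in product([True, False], repeat=2)
    ]


def is_matched(arms):
    """True when all arms share architecture, data scale, and compute."""
    return len({(a.architecture, a.num_images, a.gpu_days) for a in arms}) == 1


# Placeholder budget: the image count is invented for illustration only.
arms = build_arms("DiT + cross-attention", num_images=1_000_000, gpu_days=675.0)
```

The first arm (both factors on) and the last arm (both off) correspond to the two runs named in the passage. The check in `is_matched` encodes the requirement that only the two factors differ between runs.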
Original abstract
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-α only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
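The headline figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The one added assumption, flagged in the comments, is that CO2 emissions scale linearly with GPU days.

```python
# Figures quoted in the abstract.
pixart_gpu_days = 675
sd15_gpu_days = 6250
pixart_cost_usd = 26_000
sd15_cost_usd = 320_000

# Training-time ratio: 675 / 6250 = 0.108, i.e. the stated 10.8%.
time_ratio = pixart_gpu_days / sd15_gpu_days

# Dollar saving: 320,000 - 26,000 = 294,000, i.e. "nearly $300,000".
saving_usd = sd15_cost_usd - pixart_cost_usd

# Cost ratio: 26,000 / 320,000 = 0.08125, lower than the time ratio.
cost_ratio = pixart_cost_usd / sd15_cost_usd

# Implied price per GPU day for each figure (about 38.5 vs. 51.2 USD).
pixart_rate = pixart_cost_usd / pixart_gpu_days
sd15_rate = sd15_cost_usd / sd15_gpu_days

# ASSUMPTION: emissions are proportional to GPU days, giving about 89.2%.
co2_reduction = 1.0 - time_ratio

print(f"time ratio     {time_ratio:.1%}")
print(f"saving         ${saving_usd:,}")
print(f"cost ratio     {cost_ratio:.1%}")
print(f"CO2 reduction  {co2_reduction:.1%} (if proportional to GPU days)")
```

The time ratio and the saving match the abstract. The cost ratio (about 8.1%) is lower than the time ratio (10.8%), which suggests the two dollar figures assume different per-GPU-day prices. The abstract does not say why.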
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PixArt-α, a Transformer-based diffusion model for text-to-image synthesis. It claims competitive photorealistic quality with SOTA models (SDXL, Imagen, RAPHAEL, Midjourney) at up to 1024px resolution, enabled by three designs: (1) three-stage training decomposition optimizing pixel dependency, text-image alignment, and aesthetics separately; (2) efficient DiT with cross-attention for text conditioning instead of class labels; (3) VLM-generated dense pseudo-captions for high-informative data. This yields training in 675 A100 GPU days (10.8% of SD v1.5's 6,250 days), saving ~$300k and 90% CO2, with extensive visual/quantitative experiments.
Significance. If the efficiency and quality claims hold under controlled conditions, the work could substantially lower barriers for training high-quality T2I models, offering cost and environmental benefits plus practical insights on staged training and caption density for the AIGC community. The empirical comparisons to multiple baselines and support for high-res synthesis are notable strengths.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.
- [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.
minor comments (2)
- [Abstract] Abstract: Limited detail on exact quantitative metrics (e.g., specific FID, CLIP, or human preference scores) and evaluation protocols for comparisons to SDXL, Imagen, and RAPHAEL.
- [Figures 1-2 and §4] Figures 1-2 and §4: Visual results support the quality claims, but additional details on conditioning, resolution, and baseline implementation (e.g., whether all models used identical data regimes) would improve clarity and reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our efficiency claims and training strategy that merit clarification. We address each major comment below and indicate planned revisions to improve the manuscript.
Point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.
Authors: We appreciate this observation. The manuscript reports model parameter counts (around 600M for PixArt-α) and total training steps, but we agree that a consolidated hyperparameter table and explicit data volume per stage would strengthen transparency. In the revision we will add such a table, including estimated VLM captioning cost (which is a one-time preprocessing step amortized over training). While direct matched ablations isolating every variable were not feasible within our compute budget, the staged approach demonstrably accelerates convergence on alignment and aesthetics objectives compared to joint training, as evidenced by our internal monitoring of loss curves and downstream metrics. Comparisons to SD v1.5 use the publicly stated training cost for that model.
Revision: partial
-
Referee: [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.
Authors: The referee correctly notes the absence of a full single-stage baseline trained for the same total compute. Such an experiment would require substantial additional resources and was not performed. Instead, we show progressive improvements across stages via FID, CLIP score, and human preference metrics in §4, supporting that each stage contributes distinct gains (pixel-level fidelity in stage 1, semantic alignment in stage 2, aesthetic quality in stage 3). We will expand §3 with a clearer rationale for the decomposition and include any available partial ablations (e.g., stage-wise metric deltas). We maintain that the decomposition enables more efficient use of data and objectives, but acknowledge a direct head-to-head comparison would be ideal.
Revision: partial
Circularity Check
No circularity in derivation or efficiency claims
full rationale
The paper reports empirical training costs (675 A100 GPU days) and quality metrics from direct model training runs, compared against external baselines such as Stable Diffusion v1.5. No equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The three core designs are methodological choices whose effects are measured experimentally rather than derived by construction from the inputs themselves. The efficiency attribution rests on reported wall-clock measurements, not tautological redefinitions or unverified self-references.
Axiom & Free-Parameter Ledger
free parameters (1)
- stage-specific training hyperparameters
axioms (2)
- domain assumption: Diffusion transformers can achieve high image quality when text conditions are properly injected via cross-attention.
- domain assumption: Dense pseudo-captions generated by a VLM improve text-image alignment learning.
Forward citations
Cited by 32 Pith papers
-
CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers
CoReDiT reduces self-attention FLOPs in DiTs by up to 55% via linear-time spatial coherence pruning and neighbor-based reconstruction, delivering 1.33x-1.72x speedups with maintained quality.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers
A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.
-
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
-
SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters
Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.
-
DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
-
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Leveraging Verifier-Based Reinforcement Learning in Image Editing
Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.
-
SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness
SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.
-
The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents
A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.
-
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
-
Generative Refinement Networks for Visual Synthesis
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
-
BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models
BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.
-
Evolutionary Token-Level Prompt Optimization for Diffusion Models
A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
LTX-Video: Realtime Video Latent Diffusion
LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
-
Emu3: Next-Token Prediction is All You Need
Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
On the Limits of Latent Reuse in Diffusion Models
Reusing source latent spaces in diffusion models under distribution shift produces target score error set by principal-angle misalignment and diffusion-time-amplified ambient noise.
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion
Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
-
Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models
Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.
-
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.
-
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.
Reference graph
Works this paper leans on
-
[2]
ediffi: Text-to-image diffusion models with an ensemble of expert denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. In arXiv, 2022
work page 2022
-
[3]
All are worth words: A vit backbone for diffusion models
Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023
work page 2023
-
[4]
A study on the evaluation of generative models
Eyal Betzalel, Coby Penso, Aviv Navon, and Ethan Fetaya. A study on the evaluation of generative models. In arXiv, 2022
work page 2022
-
[5]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020
work page 2020
- [6]
-
[7]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009
work page 2009
-
[8]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34: 0 8780--8794, 2021
work page 2021
-
[9]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020 a
work page 2020
-
[10]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In arXiv, 2020 b
work page 2020
-
[11]
Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, et al. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In CVPR, 2023
work page 2023
-
[12]
Metabev: Solving sensor failures for 3d detection and map segmentation
Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, and Ping Luo. Metabev: Solving sensor failures for 3d detection and map segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 8721--8731, 2023
work page 2023
-
[13]
Imagebind: One embedding space to bind them all
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 15180--15190, 2023
work page 2023
-
[14]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014
work page 2014
-
[15]
Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. NeurIPS, 2021
work page 2021
-
[16]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022
work page 2022
-
[17]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017
work page 2017
-
[19]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020
work page 2020
-
[20]
Lora: Low-rank adaptation of large language models
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021
work page 2021
-
[21]
T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation
Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In ICCV, 2023
work page 2023
-
[22]
Scaling up gans for text-to-image synthesis
Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023
work page 2023
-
[23]
Diffusionclip: Text-guided diffusion models for robust image manipulation
Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2426--2435, June 2022
work page 2022
-
[24]
Auto-encoding variational bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In arXiv, 2013
work page 2013
-
[25]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023
work page 2023
-
[26]
Pick-a-pic: An open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In arXiv, 2023
work page 2023
-
[27]
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022 a
work page 2022
-
[28]
Panoptic segformer: Delving deeper into panoptic segmentation with transformers
Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Ping Luo, and Tong Lu. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In CVPR, 2022 b
work page 2022
-
[29]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll \'a r, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014
work page 2014
-
[30]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In arXiv, 2023
work page 2023
-
[31]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021
work page 2021
-
[32]
Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In CVPR, 2022
work page 2022
-
[33]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In arXiv, 2017
work page 2017
-
[34]
Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35: 0 5775--5787, 2022
work page 2022
- [35]
- [36]
-
[37]
Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In arXiv, 2023
work page 2023
-
[38]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp.\ 8162--8171. PMLR, 2021
work page 2021
- [39]
-
[40]
Getting immediate speedups with a100 and tf32, 2023
NVIDIA. Getting immediate speedups with a100 and tf32, 2023. URL https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32
work page 2023
- [41]
-
[42]
Journeydb: A benchmark for generative image understanding
Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In arXiv, 2023
work page 2023
-
[43]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
work page 2023
-
[44]
Film: Visual reasoning with a general conditioning layer
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018
work page 2018
-
[45]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In arXiv, 2023
work page 2023
-
[46]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022
work page 2022
-
[47]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018
work page 2018
-
[48]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019
work page 2019
-
[49]
Variational inference with normalizing flows
Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015
work page 2015
-
[50]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj \"o rn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022
work page 2022
-
[51]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015
work page 2015
-
[52]
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In arXiv, 2022
work page 2022
-
[53]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022
work page 2022
-
[54]
Laion-400m: Open dataset of clip-filtered 400 million image-text pairs
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In arXiv, 2021
work page 2021
-
[55]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015
work page 2015
-
[56]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019
work page 2019
-
[57]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021
work page 2021
-
[58]
Segmenter: Transformer for semantic segmentation
Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021
work page 2021
-
[59]
Transtrack: Multiple object tracking with transformer
Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. In arXiv, 2020
work page 2020
-
[60]
Training data-efficient image transformers & distillation through attention
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021
work page 2021
-
[61]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017
work page 2017
-
[62]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021
work page 2021
-
[63]
Pvt v2: Improved baselines with pyramid vision transformer
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022
work page 2022
-
[65]
Segformer: Simple and efficient design for semantic segmentation with transformers
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021
work page 2021
-
[66]
Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In ICCV, 2023
work page 2023
-
[67]
Holistically-nested edge detection
Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, 2015
work page 2015
-
[69]
Raphael: Text-to-image generation via large mixture of diffusion paths
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. In arXiv, 2023b
work page 2023
-
[70]
Tokens-to-token vit: Training vision transformers from scratch on imagenet
Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021
work page 2021
-
[71]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023
work page 2023
-
[72]
Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021
work page 2021
-
[73]
Fast training of diffusion models with masked transformers
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. In arXiv, 2023
work page 2023
-
[74]
Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers
Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021
work page 2021
-
[75]
Deepvit: Towards deeper vision transformer
Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. In arXiv, 2021
work page 2021
-
[76]
Understanding the robustness in vision transformers
Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pp. 27378–27394. PMLR, 2022
work page 2022
-
[77]
Getting Immediate Speedups with A100 and TF32
NVIDIA. Getting Immediate Speedups with A100 and TF32. 2023
work page 2023
- [78]
-
[79]
OpenAI. Dalle-2. 2023
work page 2023
- [80]
- [81]
-
[82]
Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[83]
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. arXiv preprint arXiv:2212.11565
-
[84]
Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. In arXiv