pith. machine review for the scientific record.

arxiv: 2310.00426 · v3 · submitted 2023-09-30 · 💻 cs.CV


PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Chongjian Ge, Enze Xie, Huchuan Lu, James Kwok, Jincheng Yu, Junsong Chen, Lewei Yao, Ping Luo, Yue Wu, Zhenguo Li, Zhongdao Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 20:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-image synthesis · diffusion transformer · training efficiency · dense captions · PIXART-α · vision-language model · image generation · high-resolution synthesis

The pith

PIXART-α trains a high-quality text-to-image diffusion transformer in 10.8 percent of the training time of Stable Diffusion v1.5.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a diffusion transformer for text-to-image synthesis can reach competitive quality with models like Imagen and SDXL while using dramatically lower training resources. It achieves this through three specific designs that separate the learning of pixel relationships, text alignment, and visual appeal, replace class conditioning with cross-attention for text, and supply dense captions generated by a vision-language model. A sympathetic reader would care because the resulting model supports 1024-pixel outputs and claims to cut training cost to roughly one percent of larger alternatives like RAPHAEL. If the designs work as described, they would make it feasible for more groups to build capable generative systems from scratch without million-dollar compute budgets.

Core claim

PIXART-α is a Transformer-based diffusion model for text-to-image synthesis whose generation quality matches state-of-the-art systems. The model is obtained by decomposing training into three successive stages that optimize pixel dependency, text-image alignment, and aesthetic quality in turn; by embedding cross-attention modules inside a Diffusion Transformer to handle text conditions efficiently; and by automatically labeling training pairs with dense pseudo-captions from a large vision-language model. These choices produce a system that trains in 675 A100 GPU days, 10.8 percent of the time reported for Stable Diffusion v1.5, while supporting resolution up to 1024 pixels and image quality competitive with state-of-the-art generators such as Imagen, SDXL, and Midjourney.
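
To make the headline efficiency numbers concrete, here is a quick arithmetic check of the figures quoted above and in the abstract. Only the GPU-day and dollar figures come from the paper; the implied RAPHAEL cost is back-calculated from the stated 1 percent ratio rather than reported directly.

    # Sanity check of the cost figures in the abstract. Only the 675 / 6,250
    # GPU-day and $26,000 / $320,000 numbers are reported; the RAPHAEL figure
    # is inferred from the "merely 1%" claim and is therefore an assumption.
    pixart_gpu_days = 675        # A100 GPU days, PIXART-alpha
    sd15_gpu_days = 6_250        # A100 GPU days, Stable Diffusion v1.5
    pixart_cost_usd = 26_000
    sd15_cost_usd = 320_000

    print(f"fraction of SD v1.5 training time: {pixart_gpu_days / sd15_gpu_days:.1%}")  # 10.8%
    print(f"savings vs SD v1.5: ${sd15_cost_usd - pixart_cost_usd:,}")                  # $294,000
    print(f"implied RAPHAEL training cost: ${pixart_cost_usd / 0.01:,.0f}")             # ~$2,600,000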

What carries the argument

Three-stage training decomposition together with a Diffusion Transformer that uses cross-attention for text conditioning and dense VLM-generated captions for alignment.
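
To ground the conditioning mechanism named here, the sketch below shows one way cross-attention to text-encoder tokens can be inserted into a DiT-style block in place of a class-label branch. It is a minimal illustration of the pattern, not the authors' architecture; the dimensions, module names, and the omission of timestep conditioning are all simplifications.

    import torch
    import torch.nn as nn

    class CrossAttnDiTBlock(nn.Module):
        """Illustrative DiT-style block: self-attention over image latents,
        cross-attention to text tokens, then an MLP. Timestep conditioning
        (e.g., adaptive layer norm) is omitted for brevity."""

        def __init__(self, dim: int = 1152, heads: int = 16, text_dim: int = 4096):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.text_proj = nn.Linear(text_dim, dim)  # project text embeddings to model width
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            # x: (batch, image_tokens, dim); text: (batch, text_tokens, text_dim)
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]
            t = self.text_proj(text)
            h = self.norm2(x)
            x = x + self.cross_attn(h, t, t, need_weights=False)[0]  # inject text condition
            return x + self.mlp(self.norm3(x))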

If this is right

  • High-resolution text-to-image synthesis up to 1024 pixels becomes practical at far lower compute budgets.
  • Training cost drops to 1 percent of larger state-of-the-art models such as RAPHAEL while preserving semantic control and artistic quality.
  • Carbon emissions associated with training fall by roughly 90 percent relative to Stable Diffusion v1.5.
  • Startups and research groups can iterate on new generative models without requiring thousands of GPU days.
  • The resulting models excel at image quality, artistry, and fine-grained semantic control according to the reported experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same staged-optimization pattern could be tested on video or 3D generation tasks to check whether similar efficiency gains appear.
  • If dense captions prove decisive, then future work might focus on improving the captioning model itself rather than scaling raw image data.
  • The cost reduction could enable more frequent public releases of updated models, shortening the iteration cycle in the field.
  • Open questions remain about whether the efficiency advantage persists when the same decomposition is applied to other transformer variants or non-diffusion backbones.

Load-bearing premise

The three-stage training split and the use of dense pseudo-captions are the main reasons for both the quality and the large reduction in training cost, rather than differences in total data volume or other implementation choices.

What would settle it

A side-by-side training run of the identical architecture and data scale, once with the three-stage schedule and dense captions and once without them, that shows whether the reported quality and speed gains disappear.
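
As a sketch of what such a settling experiment could look like, the grid below crosses the two proposed factors while holding architecture, data volume, and total compute fixed; the factor names are shorthand for this page's description, not the authors' protocol.

    # Hypothetical 2x2 ablation grid isolating the two load-bearing choices.
    from itertools import product

    factors = {
        "staged_training": [True, False],  # three-stage schedule vs. single joint run
        "dense_captions":  [True, False],  # VLM pseudo-captions vs. raw alt-text
    }
    runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    for i, cfg in enumerate(runs):
        print(f"run {i}: {cfg}")
    # The efficiency claim survives only if the (True, True) cell reaches the
    # reported quality at roughly 675 GPU days and the other cells do not.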

original abstract

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figure 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-$\alpha$'s training speed markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and reducing 90% CO2 emissions. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control. We hope PIXART-$\alpha$ will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PixArt-α, a Transformer-based diffusion model for text-to-image synthesis. It claims competitive photorealistic quality with SOTA models (SDXL, Imagen, RAPHAEL, Midjourney) at up to 1024px resolution, enabled by three designs: (1) three-stage training decomposition optimizing pixel dependency, text-image alignment, and aesthetics separately; (2) efficient DiT with cross-attention for text conditioning instead of class labels; (3) VLM-generated dense pseudo-captions for high-informative data. This yields training in 675 A100 GPU days (10.8% of SD v1.5's 6,250 days), saving ~$300k and 90% CO2, with extensive visual/quantitative experiments.

Significance. If the efficiency and quality claims hold under controlled conditions, the work could substantially lower barriers for training high-quality T2I models, offering cost and environmental benefits plus practical insights on staged training and caption density for the AIGC community. The empirical comparisons to multiple baselines and support for high-res synthesis are notable strengths.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.
  2. [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.
minor comments (2)
  1. [Abstract] Abstract: Limited detail on exact quantitative metrics (e.g., specific FID, CLIP, or human preference scores) and evaluation protocols for comparisons to SDXL, Imagen, and RAPHAEL.
  2. [Figures 1-2 and §4] Figures 1-2 and §4: Visual results support the quality claims, but additional details on conditioning, resolution, and baseline implementation (e.g., whether all models used identical data regimes) would improve clarity and reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of our efficiency claims and training strategy that merit clarification. We address each major comment below and indicate planned revisions to improve the manuscript.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The central efficiency claim (675 vs. 6,250 A100 GPU days, 10.8% of SD v1.5) attributes the speedup to the three core designs, but provides no controls or reporting for total data volume processed, number of optimization steps, model parameter count, or auxiliary VLM captioning compute. Without matched ablations or full hyperparameter tables isolating these factors, differences in data scale or implementation details could drive the reported savings rather than the proposed decomposition.

    Authors: We appreciate this observation. The manuscript reports model parameter counts (around 600M for PixArt-α) and total training steps, but we agree that a consolidated hyperparameter table and explicit data volume per stage would strengthen transparency. In the revision we will add such a table, including estimated VLM captioning cost (which is a one-time preprocessing step amortized over training). While direct matched ablations isolating every variable were not feasible within our compute budget, the staged approach demonstrably accelerates convergence on alignment and aesthetics objectives compared to joint training, as evidenced by our internal monitoring of loss curves and downstream metrics. Comparisons to SD v1.5 use the publicly stated training cost for that model. revision: partial

  2. Referee: [§3] §3 (Training Strategy Decomposition): The three-stage approach is presented as separately optimizing distinct objectives, yet the manuscript lacks quantitative ablations demonstrating the incremental gains of each stage (or the full decomposition) over a single-stage baseline trained with equivalent total compute and data.

    Authors: The referee correctly notes the absence of a full single-stage baseline trained for the same total compute. Such an experiment would require substantial additional resources and was not performed. Instead, we show progressive improvements across stages via FID, CLIP score, and human preference metrics in §4, supporting that each stage contributes distinct gains (pixel-level fidelity in stage 1, semantic alignment in stage 2, aesthetic quality in stage 3). We will expand §3 with a clearer rationale for the decomposition and include any available partial ablations (e.g., stage-wise metric deltas). We maintain that the decomposition enables more efficient use of data and objectives, but acknowledge a direct head-to-head comparison would be ideal. revision: partial
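
For readers trying to picture the schedule the rebuttal defends, here is a hedged outline of how three successive stages could be sequenced, with each stage initialized from the previous checkpoint; the datasets, step counts, and loss function are placeholders, since none of them are specified on this page.

    # Illustrative three-stage schedule matching the decomposition described
    # above. Stage names follow the paper's description; everything else
    # (data sources, step counts, objective) is a placeholder.
    STAGES = [
        {"name": "pixel_dependency",   "data": "class_conditioned_images"},
        {"name": "text_alignment",     "data": "dense_vlm_captions"},
        {"name": "aesthetic_finetune", "data": "curated_high_quality"},
    ]

    def train(model, loaders, diffusion_loss, optimizer, steps_per_stage):
        for stage in STAGES:
            for step, (latents, condition) in enumerate(loaders[stage["data"]]):
                if step >= steps_per_stage[stage["name"]]:
                    break
                loss = diffusion_loss(model, latents, condition)  # denoising objective
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # the checkpoint from each stage seeds the next, so pixel fidelity
            # and alignment learned earlier are inherited downstream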

Circularity Check

0 steps flagged

No circularity in derivation or efficiency claims

full rationale

The paper reports empirical training costs (675 A100 GPU days) and quality metrics from direct model training runs, compared against external baselines such as Stable Diffusion v1.5. No equations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. The three core designs are methodological choices whose effects are measured experimentally rather than derived by construction from the inputs themselves. The efficiency attribution rests on reported wall-clock measurements, not tautological redefinitions or unverified self-references.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on standard diffusion model assumptions and empirical scaling behavior rather than new theoretical derivations.

free parameters (1)
  • stage-specific training hyperparameters
    Learning rates, batch sizes, and iteration counts per training stage are chosen to optimize each phase separately (an illustrative example follows this ledger).
axioms (2)
  • domain assumption Diffusion transformers can achieve high image quality when text conditions are properly injected via cross-attention.
    Inherited from prior DiT and diffusion literature.
  • domain assumption Dense pseudo-captions generated by a VLM improve text-image alignment learning.
    Core justification for the high-informative data component.
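
To illustrate what the stage-specific hyperparameters entry above covers, a per-stage configuration might look like the sketch below; every value is a placeholder, not a setting reported by the authors.

    # Hypothetical per-stage settings; values are illustrative only.
    stage_hparams = {
        "pixel_dependency":   {"lr": 2e-4, "batch_size": 256, "resolution": 256},
        "text_alignment":     {"lr": 1e-4, "batch_size": 128, "resolution": 512},
        "aesthetic_finetune": {"lr": 5e-5, "batch_size": 64,  "resolution": 1024},
    }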

pith-pipeline@v0.9.0 · 5716 in / 1192 out tokens · 40916 ms · 2026-05-12T20:33:40.515106+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  2. What Concepts Lie Within? Detecting and Suppressing Risky Content in Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    A method using attention head vectors detects and suppresses risky content generation in Diffusion Transformers at inference time.

  3. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  4. SycoPhantasy: Quantifying Sycophancy and Hallucination in Small Open Weight VLMs for Vision-Language Scoring of Fantasy Characters

    cs.CV 2026-04 unverdicted novelty 7.0

    Small VLMs show higher sycophancy (22.3% for 450M model) than larger ones (6.0% for 7B) when scoring image-text alignment on 173k fantasy portraits, quantified via a new Bluffing Coefficient metric.

  5. DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

    cs.AR 2026-04 unverdicted novelty 7.0

    DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

  6. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  7. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  8. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  9. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  10. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  11. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  12. Leveraging Verifier-Based Reinforcement Learning in Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Edit-R1 trains a CoT-based reasoning reward model with GCPO and uses it to boost image editing performance over VLMs and models like FLUX.1-kontext via GRPO.

  13. SpatialFusion: Endowing Unified Image Generation with Intrinsic 3D Geometric Awareness

    cs.CV 2026-04 unverdicted novelty 6.0

    SpatialFusion internalizes 3D geometric awareness into unified image generation models by pairing an MLLM with a spatial transformer that produces depth maps to constrain diffusion generation.

  14. The Thinking Pixel: Recursive Sparse Reasoning in Multimodal Diffusion Latents

    cs.CV 2026-04 unverdicted novelty 6.0

    A recursive sparse MoE framework integrated into diffusion models iteratively refines visual tokens via gated module selection to improve structured reasoning and image generation performance.

  15. EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.

  16. Generative Refinement Networks for Visual Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

  17. BiasIG: Benchmarking Multi-dimensional Social Biases in Text-to-Image Models

    cs.CY 2026-04 conditional novelty 6.0

    BiasIG is a multi-dimensional benchmark for social biases in T2I models that shows debiasing interventions frequently cause confounding discrimination effects.

  18. Evolutionary Token-Level Prompt Optimization for Diffusion Models

    cs.AI 2026-04 unverdicted novelty 6.0

    A genetic algorithm evolves CLIP token vectors to optimize aesthetic quality and prompt alignment in diffusion models, outperforming Promptist and random search by up to 23.93% on a combined fitness score.

  19. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  20. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  21. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  22. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  23. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  24. On the Limits of Latent Reuse in Diffusion Models

    stat.ML 2026-05 unverdicted novelty 5.0

    Reusing source latent spaces in diffusion models under distribution shift produces target score error set by principal-angle misalignment and diffusion-time-amplified ambient noise.

  25. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  26. Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

  27. Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.

  28. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  29. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  30. AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

    cs.CV 2026-04 unverdicted novelty 4.0

    Adaptive Head Synthesis (AHS) employs head-reenacted synthetic data augmentation to enable robust head swapping on full upper-body images without paired training data.

  31. Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    cs.AI 2025-01 conditional novelty 3.0

    Scaling data, model size, and training optimization on the Janus architecture yields better multimodal understanding and more stable, instruction-following text-to-image generation.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · cited by 30 Pith papers · 1 internal anchor

  1. [2]

    ediffi: Text-to-image diffusion models with an ensemble of expert denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. In arXiv, 2022

  2. [3]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In CVPR, 2023

  3. [4]

    A study on the evaluation of generative models

    Eyal Betzalel, Coby Penso, Aviv Navon, and Ethan Fetaya. A study on the evaluation of generative models. In arXiv, 2022

  4. [5]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020

  5. [6]

    Deepfloyd, 2023

    DeepFloyd. Deepfloyd, 2023. URL https://www.deepfloyd.ai/

  6. [7]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009

  7. [8]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34: 8780-8794, 2021

  8. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020 a

  9. [10]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In arXiv, 2020 b

  10. [11]

    Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts

    Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, et al. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In CVPR, 2023

  11. [12]

    Metabev: Solving sensor failures for 3d detection and map segmentation

    Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, and Ping Luo. Metabev: Solving sensor failures for 3d detection and map segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8721-8731, 2023

  12. [13]

    Imagebind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180-15190, 2023

  13. [14]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  14. [15]

    Transformer in transformer

    Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. NeurIPS, 2021

  15. [16]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  16. [17]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017

  17. [19]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020

  18. [20]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR, 2021

  19. [21]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. In ICCV, 2023

  20. [22]

    Scaling up gans for text-to-image synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR, 2023

  21. [23]

    Diffusionclip: Text-guided diffusion models for robust image manipulation

    Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2426-2435, June 2022

  22. [24]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In arXiv, 2013

  23. [25]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023

  24. [26]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. In arXiv, 2023

  25. [27]

    Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022 a

  26. [28]

    Panoptic segformer: Delving deeper into panoptic segmentation with transformers

    Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, Ping Luo, and Tong Lu. Panoptic segformer: Delving deeper into panoptic segmentation with transformers. In CVPR, 2022 b

  27. [29]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014

  28. [30]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In arXiv, 2023

  29. [31]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021

  30. [32]

    Video swin transformer

    Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. In CVPR, 2022

  31. [33]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In arXiv, 2017

  32. [34]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35: 5775-5787, 2022

  33. [35]

    Gpu selling, 2023

    Microsoft. Gpu selling, 2023. URL https://www.leadergpu.com/

  34. [36]

    Midjourney, 2023

    Midjourney. Midjourney, 2023. URL https://www.midjourney.com

  35. [37]

    T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In arXiv, 2023

  36. [38]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pp. 8162-8171. PMLR, 2021

  37. [39]

    Nltk, 2023

    NLTK. Nltk, 2023. URL https://www.nltk.org/

  38. [40]

    Getting immediate speedups with a100 and tf32, 2023

    NVIDIA. Getting immediate speedups with a100 and tf32, 2023. URL https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32

  39. [41]

    Dalle-2, 2023

    OpenAI. Dalle-2, 2023. URL https://openai.com/dall-e-2

  40. [42]

    Journeydb: A benchmark for generative image understanding

    Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In arXiv, 2023

  41. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023

  42. [44]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  43. [45]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. In arXiv, 2023

  44. [46]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

  45. [47]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI blog, 2018

  46. [48]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019

  47. [49]

    Variational inference with normalizing flows

    Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015

  48. [50]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022

  49. [51]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015

  50. [52]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In arXiv, 2022

  51. [53]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022

  52. [54]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. In arXiv, 2021

  53. [55]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015

  54. [56]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019

  55. [57]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  56. [58]

    Segmenter: Transformer for semantic segmentation

    Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In ICCV, 2021

  57. [59]

    Transtrack: Multiple object tracking with transformer

    Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. In arXiv, 2020

  58. [60]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021

  59. [61]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

  60. [62]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, 2021

  61. [63]

    Pvt v2: Improved baselines with pyramid vision transformer

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 2022

  62. [65]

    Segformer: Simple and efficient design for semantic segmentation with transformers

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34: 12077-12090, 2021

  63. [66]

    Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning

    Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, and Zhenguo Li. Difffit: Unlocking transferability of large diffusion models via simple parameter-efficient fine-tuning. In ICCV, 2023

  64. [67]

    Holistically-nested edge detection

    Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, 2015

  65. [69]

    Raphael: Text-to-image generation via large mixture of diffusion paths

    Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, and Ping Luo. Raphael: Text-to-image generation via large mixture of diffusion paths. In arXiv, 2023 b

  66. [70]

    Tokens-to-token vit: Training vision transformers from scratch on imagenet

    Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, 2021

  67. [71]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023

  68. [72]

    Point transformer

    Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021

  69. [73]

    Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. In arXiv, 2023

  70. [74]

    Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers

    Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021

  71. [75]

    Deepvit: Towards deeper vision transformer

    Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. In arXiv, 2021

  72. [76]

    Understanding the robustness in vision transformers

    Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pp. 27378-27394. PMLR, 2022

  73. [77]

    Getting Immediate Speedups with A100 and TF32

    NVIDIA. Getting Immediate Speedups with A100 and TF32. 2023

  74. [78]

    GPU selling

    Microsoft. GPU selling. 2023

  75. [79]

    Dalle-2

    OpenAI. Dalle-2. 2023

  76. [80]

    Midjourney

    Midjourney. Midjourney. 2023

  77. [81]

    DeepFloyd

    DeepFloyd. DeepFloyd. 2023

  78. [82]

    Diffusionclip: Text-guided diffusion models for robust image manipulation

    Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

  79. [83]

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

    Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. arXiv preprint arXiv:2212.11565

  80. [84]

    Dreamfusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

Showing first 80 references.