arxiv: 2212.09748 · v2 · submitted 2022-12-19 · 💻 cs.CV · cs.LG

Scalable Diffusion Models with Transformers

William Peebles, Saining Xie

Pith reviewed 2026-05-12 05:55 UTC · model grok-4.3

keywords: diffusion models, transformers, latent diffusion, image generation, scalability, ImageNet, FID

The pith

Diffusion transformers replace U-Nets and improve ImageNet generation quality as Gflops increase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the U-Net backbone in latent diffusion models with a transformer that works on sequences of latent image patches. It measures scalability by counting Gflops in the forward pass and shows that adding depth, width, or more patches reliably lowers the FID score. The biggest models reach a new low of 2.27 FID on class-conditional ImageNet at 256 by 256 resolution and also lead at 512 by 512. Readers care because this points to a simple way to keep getting better image synthesis just by spending more compute on the same architecture.

Core claim

We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

What carries the argument

Diffusion Transformer (DiT) that operates on sequences of latent patches, with scaling behavior tracked directly by Gflops in the forward pass.
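
To make "sequences of latent patches" concrete, here is a minimal sketch of how a square latent could be cut into non-overlapping patches, each flattened into one token. The 32 x 32 x 4 latent shape and the function names are assumptions chosen for illustration; they are not stated in the text above.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent is cut into non-overlapping patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


def patchify(latent, patch_size):
    """Split an H x W x C nested-list latent into a list of flattened patches.

    Each patch becomes one token of length patch_size * patch_size * C.
    """
    height, width = len(latent), len(latent[0])
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])
            tokens.append(token)
    return tokens


# Hypothetical 32 x 32 latent with 4 channels, filled with zeros (assumed shape).
LATENT_SIZE, CHANNELS = 32, 4
latent = [[[0.0] * CHANNELS for _ in range(LATENT_SIZE)] for _ in range(LATENT_SIZE)]
tokens_p2 = patchify(latent, 2)
```

Under these assumptions, halving the patch size quadruples the sequence length, which is one of the ways the paper raises Gflops without changing the transformer itself.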

If this is right

Higher Gflops from greater transformer depth or width produce lower FID scores.
Adding more input tokens from latent patches also improves generation quality.
The largest DiT models surpass all previous diffusion models on ImageNet 256x256 and 512x512.
Scalability can be predicted from forward-pass Gflops without additional architectural changes.
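
The three levers in the list above (depth, width, token count) can be compared with a back-of-the-envelope compute estimate. The sketch below is a rough approximation under stated assumptions, not the paper's accounting: it counts multiply-accumulates per transformer block, assumes a 4x MLP expansion, and ignores embedding, conditioning, and output layers. The function name and the example configurations are hypothetical.

```python
def estimate_forward_gmacs(depth: int, width: int, tokens: int) -> float:
    """Rough multiply-accumulate count (in billions) for one forward pass.

    Per block, assumes:
      - 12 * width^2 weights applied to every token (4 * width^2 for the
        attention projections plus 8 * width^2 for an MLP with 4x expansion);
      - 2 * tokens^2 * width for the attention score and value products.
    Embedding, conditioning, and output layers are ignored.
    """
    per_block = 12 * width**2 * tokens + 2 * tokens**2 * width
    return depth * per_block / 1e9


# Hypothetical configurations, chosen only to show how each lever moves compute.
base = estimate_forward_gmacs(depth=12, width=384, tokens=256)
deeper = estimate_forward_gmacs(depth=24, width=384, tokens=256)
wider = estimate_forward_gmacs(depth=12, width=768, tokens=256)
more_tokens = estimate_forward_gmacs(depth=12, width=384, tokens=1024)
```

In this approximation depth scales compute linearly, width roughly quadratically, and token count slightly faster than linearly because of the attention term. Counting conventions differ between tools (multiply-accumulates versus individual floating-point operations), so absolute numbers from a sketch like this should not be compared directly with reported figures.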

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar scaling may appear in diffusion models for video or 3D data if the same Gflops-FID relationship holds.
Training runs could be budgeted directly in Gflops rather than by guessing depth or width in advance.
Other generative tasks that already use transformers might adopt the same latent-patch approach for consistency.

Load-bearing premise

That raising Gflops by making the transformer deeper, wider, or by using more latent patches will keep reducing FID without training instabilities or diminishing returns.

What would settle it

An experiment in which FID stops falling or starts rising once Gflops exceed the level of the DiT-XL/2 model on the same ImageNet class-conditional benchmarks.
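
The test described above amounts to checking whether FID ever rises as compute grows. A minimal sketch of that check follows; the function name, the tolerance parameter, and all numbers are placeholders invented for illustration and are not results from the paper.

```python
def first_violation(points, tolerance=0.0):
    """Return the first (gflops, fid) pair where FID rises as compute grows.

    points: iterable of (gflops, fid) pairs in any order.
    tolerance: FID increase treated as noise rather than a reversal.
    Returns None when FID never rises by more than the tolerance.
    """
    ordered = sorted(points)
    for (_, prev_fid), (gflops, fid) in zip(ordered, ordered[1:]):
        if fid > prev_fid + tolerance:
            return (gflops, fid)
    return None


# Placeholder numbers invented for illustration; not results from the paper.
still_improving = [(10, 40.0), (50, 20.0), (100, 10.0), (200, 9.0)]
rises_at_the_top = [(10, 40.0), (50, 20.0), (100, 10.0), (200, 12.0)]
```

The tolerance matters in practice: without an estimate of run-to-run FID variance, a small uptick cannot be distinguished from evaluation noise.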

Original abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Diffusion Transformers (DiTs), a new class of latent diffusion models that replace the standard U-Net backbone with a transformer operating on latent patches. The authors analyze scalability through forward-pass complexity measured in Gflops and show that increasing Gflops, via greater transformer depth/width or more input tokens, consistently reduces FID on class-conditional ImageNet. Their largest DiT-XL/2 model achieves a state-of-the-art FID of 2.27 on the 256×256 benchmark and also outperforms prior diffusion models such as ADM and LDM on the 512×512 benchmark. Both comparisons use a standardized 50k-sample evaluation protocol with classifier-free guidance.

Significance. If the reported scaling trends and benchmark results hold under closer scrutiny, the work demonstrates that transformer architectures can serve as scalable, high-performing backbones for diffusion models, offering an alternative to convolutional U-Nets that improves with compute. The monotonic Gflops-vs-FID relationship across multiple DiT variants (S/B/L/XL) and patch sizes supplies concrete empirical support for the central scalability thesis and could influence backbone design choices in future generative modeling research.

major comments (3)

[Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.
[Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.
[Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.
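
For context on the second comment, it may help to recall what a single FID value estimates. FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are notation chosen here, not the paper's: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images and $\mu_g, \Sigma_g$ those for generated samples.

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Because $\mu_g$ and $\Sigma_g$ are estimated from a finite set of samples, the reported value shifts with the sampling seed and the sample count, which is why error bars or repeated evaluations are informative when differences between models are small.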

minor comments (2)

[Figures] Figure captions for the Gflops-vs-FID plots should explicitly state the number of samples used for FID computation and whether classifier-free guidance scale is held constant across all points.
[Model architecture] The notation for model variants (DiT-S/B/L/XL) and patch sizes (DiT-XL/2) is introduced without a dedicated table summarizing parameter counts, Gflops, and layer configurations; adding one would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their positive summary of our work and for the constructive major comments. We address each point below with our responses and planned revisions.

Referee: [Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.

Authors: We agree that expanded details on training procedures will strengthen verifiability. We will revise the Experiments section to explicitly describe the optimizer, learning-rate schedule, total training steps, and data augmentations, noting that these choices are held fixed across all DiT variants. This will clarify that observed FID differences arise from Gflops scaling. We will also release training code for full reproducibility.
Revision: yes
Referee: [Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.

Authors: We acknowledge the value of error bars for robustness. However, multiple independent runs of the largest models are computationally prohibitive. We adhere to the standardized 50k-sample evaluation protocol with classifier-free guidance used by prior works (ADM, LDM) for fair comparison. The consistent monotonic scaling trends across variants support result reliability. We will add a discussion of evaluation variance and practical limitations in the revised Experiments section.
Revision: partial
Referee: [Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.

Authors: We agree a controlled iso-Gflops analysis would be beneficial. Our existing results include multiple architectural paths to similar Gflops levels. We will add a new analysis (derived from current data) that bins models by Gflops and compares FID for different depth/width/patch configurations at matched compute, to better isolate the role of total Gflops.
Revision: yes
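
The binning analysis proposed in this response could look roughly like the sketch below. It is one possible reading, not the authors' procedure: the logarithmic bin width, the record fields, and every number are assumptions invented for illustration.

```python
import math
from collections import defaultdict


def bin_by_compute(records, bins_per_octave=1):
    """Group model records into logarithmic Gflops bins.

    records: iterable of dicts with "name", "gflops", and "fid" keys.
    Returns {bin_index: [records...]}; models in one bin differ in compute
    by less than a factor of 2 ** (1 / bins_per_octave).
    """
    bins = defaultdict(list)
    for record in records:
        index = math.floor(math.log2(record["gflops"]) * bins_per_octave)
        bins[index].append(record)
    return dict(bins)


def fid_spread(bin_records):
    """Largest minus smallest FID inside one compute bin."""
    fids = [r["fid"] for r in bin_records]
    return max(fids) - min(fids)


# Hypothetical records, invented for illustration only.
records = [
    {"name": "deep-narrow", "gflops": 20.0, "fid": 30.0},
    {"name": "shallow-wide", "gflops": 24.0, "fid": 31.5},
    {"name": "small", "gflops": 5.0, "fid": 60.0},
]
bins = bin_by_compute(records)
# Only bins holding more than one configuration say anything about matched compute.
matched = {index: group for index, group in bins.items() if len(group) > 1}
```

A small FID spread inside a bin would support the claim that total compute, rather than the particular depth/width/patch choice, drives quality. Fixed bin edges can separate two models with nearly equal Gflops, so a sliding window or explicit pairing may be a better choice when few configurations are available.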

standing simulated objections not resolved

[Benchmark tables] The request for error bars or results from multiple independent runs on the benchmark FID scores.

Circularity Check

0 steps flagged

No significant circularity

The paper presents empirical results from training and evaluating DiT models on public ImageNet benchmarks, including Gflops-vs-FID scaling curves across model variants and patch sizes plus direct FID comparisons to ADM, LDM and other baselines under identical 50k-sample protocols. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the reported scaling trends and SOTA FID of 2.27 are externally falsifiable through independent training runs and benchmark tables.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Limited to abstract; relies on standard assumptions from diffusion and transformer literature without new postulates.

free parameters (1)

Gflops scaling factors
Depth, width, and token count varied to achieve different compute levels for the scaling study.

axioms (1)

Domain assumption: transformers can effectively replace U-Nets when operating on latent patches for diffusion.
Core premise for exploring DiTs as a new class of models.

Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
cs.LG 2026-05 unverdicted novelty 7.0

Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
cs.LG 2026-05 unverdicted novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
cs.LG 2026-05 unverdicted novelty 7.0

FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.
D-Rex : Diffusion Rendering for Relightable Expressive Avatars
cs.GR 2026-04 conditional novelty 7.0

D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation
cs.GR 2026-04 unverdicted novelty 7.0

MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
Learning-Guided Force-Feedback Model Predictive Control with Obstacle Avoidance for Robotic Deburring
cs.RO 2026-04 unverdicted novelty 7.0

A framework merges diffusion-based motion priors with force-feedback MPC to enable reliable tool insertion, force tracking, and collision-free circular motions in robotic deburring.
GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow
cs.CV 2026-03 unverdicted novelty 7.0

GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
CUBic: Coordinated Unified Bimanual Perception and Control Framework
cs.RO 2026-05 unverdicted novelty 6.0

CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...
The Diffusion Encoder
cs.LG 2026-05 unverdicted novelty 6.0

A diffusion model serves as the encoder in an autoencoder when trained alternately with the decoder to resolve opposing update directions while retaining the standard diffusion training objective.
The two clocks and the innovation window: When and how generative models learn rules
cs.LG 2026-05 unverdicted novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
cs.CV 2026-05 unverdicted novelty 6.0

SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
cs.RO 2026-05 unverdicted novelty 6.0

VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
cs.CV 2026-05 unverdicted novelty 6.0

SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
cs.CV 2026-04 unverdicted novelty 6.0

HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers
cs.CV 2026-04 unverdicted novelty 6.0

A unified diffusion transformer jointly solves single-image relighting and 3D reconstruction via a new isotropic NDC-Orthographic Depth representation and mixed synthetic/real training.
BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
cs.RO 2026-04 unverdicted novelty 6.0

The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
ELT: Elastic Looped Transformers for Visual Generation
cs.CV 2026-04 unverdicted novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
cs.CV 2026-04 unverdicted novelty 6.0

LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
cs.LG 2026-04 unverdicted novelty 6.0

AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
SkyReels-V2: Infinite-length Film Generative Model
cs.CV 2025-04 unverdicted novelty 6.0

SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
cs.RO 2025-04 unverdicted novelty 6.0

Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
cs.CV 2025-03 accept novelty 6.0

VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
cs.CV 2024-10 unverdicted novelty 6.0

Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV 2023-07 conditional novelty 6.0

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
Understanding Asynchronous Inference Methods for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 5.0

Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
cs.CL 2026-04 unverdicted novelty 5.0

Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models
cs.AI 2026-04 unverdicted novelty 5.0

Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.
Gated Memory Policy
cs.RO 2026-04 unverdicted novelty 5.0

GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
cs.CV 2026-05 unverdicted novelty 4.0

Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification
eess.SY 2026-04 unverdicted novelty 4.0

Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.
Show-o2: Improved Native Unified Multimodal Models
cs.CV 2025-06 unverdicted novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results
cs.CV 2026-04 unverdicted novelty 2.0

The NTIRE 2026 Challenge establishes a benchmark for bitstream-corrupted video restoration and summarizes the top methods and observed trends from participating teams.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 40 Pith papers · 12 internal anchors

[1]

JAX: composable transformations of Python+NumPy programs, 2018

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclau- rin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. 6

work page 2018
[2]

Large scale GAN training for high ﬁdelity natural image synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high ﬁdelity natural image synthesis. In ICLR, 2019. 5, 9

work page 2019
[3]

Lan- guage models are few-shot learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. In NeurIPS, 2020. 1

work page 2020
[4]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022. 2

work page 2022
[5]

Decision transformer: Reinforce- ment learning via sequence modeling

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srini- vas, and Igor Mordatch. Decision transformer: Reinforce- ment learning via sequence modeling. In NeurIPS, 2021. 2

work page 2021
[6]

Generative pre- training from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML, 2020. 1, 2

work page 2020
[7]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1904
[8]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In NAACL-HCT, 2019. 1

work page 2019
[9]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021. 1, 2, 3, 5, 6, 9, 12

work page 2021
[10]

An image is worth 16x16 words: Trans- formers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 1, 2, 4, 5

work page 2020
[11]

Taming transformers for high-resolution image synthesis, 2020

Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image synthesis, 2020. 2

work page 2020
[12]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 3

work page 2014
[13]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Doll ´ar, Ross Girshick, Pieter Noord- huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017. 5

work page internal anchor Pith review arXiv 2017
[14]

Vec- tor quantized diffusion model for text-to-image synthesis

Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022. 2

work page 2022
[15]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

work page
[16]

Gaussian Error Linear Units (GELUs)

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 12

work page internal anchor Pith review Pith/arXiv arXiv 2016
[17]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020. 2

work page internal anchor Pith review arXiv 2010
[18]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. 2017. 6

work page 2017
[19]

Denoising diffu- sion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In NeurIPS, 2020. 2, 3

work page 2020
[20]

Cascaded diffusion models for high ﬁdelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cas- caded diffusion models for high ﬁdelity image generation. arXiv:2106.15282, 2021. 3, 9

work page arXiv 2021
[21]

Classiﬁer-free diffusion guidance

Jonathan Ho and Tim Salimans. Classiﬁer-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 3, 4

work page 2021
[22]

Estimation of non- normalized statistical models by score matching

Aapo Hyv ¨arinen and Peter Dayan. Estimation of non- normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. 3

work page 2005
[23]

Image-to-image translation with conditional adver- sarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134,

work page
[24]

Scalable adaptive computation for iterative generation,

Allan Jabri, David Fleet, and Ting Chen. Scalable adap- tive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022. 3

work page arXiv 2022
[25]

Ofﬂine rein- forcement learning as one big sequence modeling problem

Michael Janner, Qiyang Li, and Sergey Levine. Ofﬂine rein- forcement learning as one big sequence modeling problem. In NeurIPS, 2021. 2

work page 2021
[26]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020. 2, 13

work page internal anchor Pith review Pith/arXiv arXiv 2001
[27]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 3

work page 2022
[28]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 5

work page 2019
[29]

Adam: A method for stochastic optimization

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5

work page 2015
[30]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 3, 6

work page internal anchor Pith review Pith/arXiv arXiv 2013
[31]

Imagenet classification with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 5

work page 2012
[32]

Improved precision and recall metric for assessing generative models

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019. 6

work page 2019
[33]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Generating images with sparse representations

Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021. 6

work page arXiv 2021
[35]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[36]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021. 3

work page 2021
[37]

On aliased resizing and surprising subtleties in gan evaluation

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022. 6

work page 2022
[38]

Image transformer

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018. 2

work page 2018
[39]

Learning to learn with generative models of neural network checkpoints

William Peebles, Ilija Radosavovic, Tim Brooks, Alexei Efros, and Jitendra Malik. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022. 2

work page arXiv 2022
[40]

Film: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018. 2, 5

work page 2018
[41]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2

work page 2021
[42]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. 1

work page 2018
[43]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019. 1

work page 2019
[44]

On network design spaces for visual recognition

Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. In ICCV, 2019. 3

work page 2019
[45]

Designing network design spaces

Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020. 3

work page 2020
[46]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022. 1, 2, 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 1, 2

work page 2021
[48]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4, 6, 9

work page 2022
[49]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 2, 3

work page 2015
[50]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022. 3

work page internal anchor Pith review arXiv 2022
[51]

Improved techniques for training GANs

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016. 6

work page 2016
[52]

PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications

Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. 2

work page arXiv 2017
[53]

Stylegan-xl: Scaling stylegan to large diverse datasets

Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH,

work page
[54]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 3

work page 2015
[55]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2010
[56]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019. 3

work page 2019
[57]

How to train your ViT? data, augmentation, and regularization in vision transformers

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? data, augmentation, and regularization in vision transformers. TMLR, 2022. 6

work page 2022
[58]

Conditional image generation with pixelcnn decoders

Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016. 2

work page 2016
[59]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. 2

work page 2017
[60]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1, 2, 5

work page 2017
[61]

Early convolutions help transformers see better

Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021. 6

work page 2021
[62]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[63]

Scaling vision transformers

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022. 2, 5

work page 2022

Figure 11. Additional selected samples from our 512×512 and 256×256 resolution DiT-XL/2 models. We use a classifier-free guidance scale of 6.0 for the 512×512 model and 4.0 for the 256×256 model. Both models use the ft-EMA VAE decoder....