pith. machine review for the scientific record.

arxiv: 2212.09748 · v2 · submitted 2022-12-19 · 💻 cs.CV · cs.LG

Scalable Diffusion Models with Transformers

Pith reviewed 2026-05-12 05:55 UTC · model grok-4.3

keywords: diffusion models, transformers, latent diffusion, image generation, scalability, ImageNet, FID

The pith

Diffusion transformers replace U-Nets and improve ImageNet generation quality as Gflops increase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces the U-Net backbone in latent diffusion models with a transformer that works on sequences of latent image patches. It measures scalability by counting Gflops in the forward pass and shows that adding depth, width, or more patches reliably lowers the FID score. The biggest models reach a new low of 2.27 FID on class-conditional ImageNet at 256 by 256 resolution and also lead at 512 by 512. Readers care because this points to a simple way to keep getting better image synthesis just by spending more compute on the same architecture.

Core claim

We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

What carries the argument

Diffusion Transformer (DiT) that operates on sequences of latent patches, with scaling behavior tracked directly by Gflops in the forward pass.
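
As a rough illustration of what "sequences of latent patches" implies for sequence length, the sketch below counts tokens for a square latent grid cut into p x p patches. The function name `num_tokens` and the 32-wide latent grid are assumptions chosen for illustration; this page does not state the latent size.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens when a square latent grid is cut into p x p patches."""
    if latent_size % patch_size != 0:
        raise ValueError("patch_size must divide latent_size")
    per_side = latent_size // patch_size
    return per_side * per_side


# Assumed example value: a 32 x 32 latent grid (hypothetical, for illustration).
for p in (2, 4, 8):
    print(f"patch size {p}: {num_tokens(32, p)} tokens")
```

Halving the patch size quadruples the token count, which is one of the routes to higher Gflops that the paper describes.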

If this is right

  • Higher Gflops from greater transformer depth or width produce lower FID scores.
  • Adding more input tokens from latent patches also improves generation quality.
  • The largest DiT models surpass all previous diffusion models on ImageNet 256x256 and 512x512.
  • Scalability can be predicted from forward-pass Gflops without additional architectural changes.
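
To make the link between depth, width, token count, and forward-pass compute concrete, here is a minimal sketch of a textbook cost estimate for a stack of standard transformer blocks. It is an approximation under stated assumptions, not the paper's own accounting: it counts one multiply-accumulate as one operation, assumes an MLP expansion ratio of 4, and ignores patch embedding, conditioning layers, normalization, and the output layer. The function names and the example configurations are hypothetical.

```python
def block_macs(tokens: int, width: int, mlp_ratio: int = 4) -> int:
    """Approximate multiply-accumulates for one standard transformer block."""
    attn_proj = 4 * tokens * width * width        # Q, K, V and output projections
    attn_mix = 2 * tokens * tokens * width        # QK^T scores plus weighted sum of V
    mlp = 2 * mlp_ratio * tokens * width * width  # two linear layers of the MLP
    return attn_proj + attn_mix + mlp


def forward_gmacs(depth: int, width: int, tokens: int) -> float:
    """Approximate forward-pass cost of `depth` blocks, in units of 1e9 MACs."""
    return depth * block_macs(tokens, width) / 1e9


# Hypothetical configurations (depth, width, tokens), for illustration only.
for depth, width, tokens in [(12, 384, 64), (12, 384, 256), (24, 1024, 256)]:
    print(depth, width, tokens, round(forward_gmacs(depth, width, tokens), 2))
```

Under this estimate, compute grows linearly with depth, roughly quadratically with width, and slightly faster than linearly with token count because of the attention term.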

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling may appear in diffusion models for video or 3D data if the same Gflops-FID relationship holds.
  • Training runs could be budgeted directly in Gflops rather than by guessing depth or width in advance.
  • Other generative tasks that already use transformers might adopt the same latent-patch approach for consistency.

Load-bearing premise

That raising Gflops by making the transformer deeper, wider, or by using more latent patches will keep reducing FID without training instabilities or diminishing returns.

What would settle it

An experiment in which FID stops falling or starts rising once Gflops exceed the level of the DiT-XL/2 model on the same ImageNet class-conditional benchmarks.
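
One way such a check could be phrased is sketched below: sort measured (Gflops, FID) points by compute and report the first adjacent pair where FID fails to fall. The function name `first_violation` and the sample points are hypothetical placeholders, not results from the paper.

```python
def first_violation(points):
    """Return the first adjacent pair (sorted by Gflops) where FID does not
    decrease, or None if FID falls strictly as Gflops rise."""
    ordered = sorted(points)  # sorts (gflops, fid) tuples by gflops first
    for (g0, f0), (g1, f1) in zip(ordered, ordered[1:]):
        if f1 >= f0:
            return (g0, f0), (g1, f1)
    return None


# Placeholder measurements (gflops, fid), invented for illustration only.
trend_holds = [(1.0, 10.0), (2.0, 8.0), (4.0, 5.0)]
trend_breaks = [(1.0, 10.0), (2.0, 8.0), (4.0, 9.0)]
print(first_violation(trend_holds))
print(first_violation(trend_breaks))
```

A single violating pair would not settle the question on its own, since FID estimates carry sampling noise; it only flags where to look.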

Original abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Diffusion Transformers (DiTs), a new class of latent diffusion models that replace the standard U-Net backbone with a transformer operating on latent patches. The authors analyze scalability through forward-pass complexity measured in Gflops and show that increasing Gflops, via greater transformer depth/width or more input tokens, consistently reduces FID on class-conditional ImageNet. Their largest DiT-XL/2 model achieves a state-of-the-art FID of 2.27 on the 256×256 benchmark and also outperforms prior diffusion models such as ADM and LDM on 512×512, under a standardized 50k-sample evaluation protocol with classifier-free guidance.

Significance. If the reported scaling trends and benchmark results hold under closer scrutiny, the work demonstrates that transformer architectures can serve as scalable, high-performing backbones for diffusion models, offering an alternative to convolutional U-Nets that improves with compute. The monotonic Gflops-vs-FID relationship across multiple DiT variants (S/B/L/XL) and patch sizes supplies concrete empirical support for the central scalability thesis and could influence backbone design choices in future generative modeling research.

major comments (3)
  1. [Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.
  2. [Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.
  3. [Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.
minor comments (2)
  1. [Figures] Figure captions for the Gflops-vs-FID plots should explicitly state the number of samples used for FID computation and whether classifier-free guidance scale is held constant across all points.
  2. [Model architecture] The notation for model variants (DiT-S/B/L/XL) and patch sizes (DiT-XL/2) is introduced without a dedicated table summarizing parameter counts, Gflops, and layer configurations; adding one would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their positive summary of our work and for the constructive major comments. We address each point below with our responses and planned revisions.

Point-by-point responses
  1. Referee: [Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.

    Authors: We agree that expanded details on training procedures will strengthen verifiability. We will revise the Experiments section to explicitly describe the optimizer, learning-rate schedule, total training steps, and data augmentations, noting that these choices are held fixed across all DiT variants. This will clarify that observed FID differences arise from Gflops scaling. We will also release training code for full reproducibility. revision: yes

  2. Referee: [Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.

    Authors: We acknowledge the value of error bars for robustness. However, multiple independent runs of the largest models are computationally prohibitive. We adhere to the standardized 50k-sample evaluation protocol with classifier-free guidance used by prior works (ADM, LDM) for fair comparison. The consistent monotonic scaling trends across variants support result reliability. We will add a discussion of evaluation variance and practical limitations in the revised Experiments section. revision: partial

  3. Referee: [Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.

    Authors: We agree a controlled iso-Gflops analysis would be beneficial. Our existing results include multiple architectural paths to similar Gflops levels. We will add a new analysis (derived from current data) that bins models by Gflops and compares FID for different depth/width/patch configurations at matched compute, to better isolate the role of total Gflops. revision: yes
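
A binned comparison of the kind described in this response could be sketched as follows. This is a minimal illustration under assumptions: models are grouped into log2-spaced Gflops bins, and the spread of FID inside each bin is reported. The function names, the binning rule, and the records are all hypothetical; the records are invented placeholders, not values from the paper.

```python
import math
from collections import defaultdict


def bin_by_gflops(models, bins_per_octave=1):
    """Group (name, gflops, fid) records into log2-spaced Gflops bins."""
    groups = defaultdict(list)
    for name, gflops, fid in models:
        key = math.floor(math.log2(gflops) * bins_per_octave)
        groups[key].append((name, fid))
    return dict(groups)


def fid_spread(groups):
    """Max minus min FID inside each bin that holds at least two models."""
    return {
        key: max(fid for _, fid in members) - min(fid for _, fid in members)
        for key, members in groups.items()
        if len(members) >= 2
    }


# Placeholder records (name, gflops, fid), invented for illustration only.
records = [
    ("deep-narrow", 20.0, 30.0),
    ("shallow-wide", 24.0, 31.5),
    ("small", 5.0, 60.0),
]
print(fid_spread(bin_by_gflops(records)))
```

A small FID spread within a bin would be consistent with total compute, rather than the particular depth/width/patch mix, driving quality; a large spread would point the other way.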

standing simulated objections not resolved
  • [Benchmark tables] The request for error bars or results from multiple independent runs on the benchmark FID scores.
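
If repeated evaluations were available, the requested error bars could be summarized along the lines below. This is a sketch only: the function names are hypothetical, and the FID values are invented placeholders rather than measurements.

```python
import statistics


def summarize_fid(fid_runs):
    """Return (mean, sample standard deviation) of FID over repeated evaluations."""
    if len(fid_runs) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return statistics.mean(fid_runs), statistics.stdev(fid_runs)


def gap_in_std_units(fid_runs, baseline_fid):
    """How many sample standard deviations separate the mean FID from a baseline."""
    mean, std = summarize_fid(fid_runs)
    if std == 0:
        raise ValueError("zero spread: gap is undefined in std units")
    return (baseline_fid - mean) / std


# Placeholder FID values from hypothetical repeated evaluations.
runs = [2.0, 2.2, 2.4]
print(summarize_fid(runs))
print(gap_in_std_units(runs, baseline_fid=3.0))
```

Reporting the gap to a baseline in units of run-to-run spread gives a quick sense of whether a small FID difference is larger than evaluation noise.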

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents empirical results from training and evaluating DiT models on public ImageNet benchmarks, including Gflops-vs-FID scaling curves across model variants and patch sizes plus direct FID comparisons to ADM, LDM and other baselines under identical 50k-sample protocols. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the reported scaling trends and SOTA FID of 2.27 are externally falsifiable through independent training runs and benchmark tables.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Limited to abstract; relies on standard assumptions from diffusion and transformer literature without new postulates.

free parameters (1)
  • Gflops scaling factors
    Depth, width, and token count varied to achieve different compute levels for the scaling study.
axioms (1)
  • domain assumption: Transformers can effectively replace U-Nets when operating on latent patches for diffusion
    Core premise for exploring DiTs as a new class of models.

pith-pipeline@v0.9.0 · 5412 in / 1129 out tokens · 62031 ms · 2026-05-12T05:55:35.042350+00:00 · methodology


Forward citations

Cited by 41 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.

  2. From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.

  3. LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

  4. A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions

    cs.LG 2026-05 unverdicted novelty 7.0

    FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.

  5. D-Rex : Diffusion Rendering for Relightable Expressive Avatars

    cs.GR 2026-04 conditional novelty 7.0

    D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.

  6. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  7. MoZoo:Unleashing Video Diffusion power in animal fur and muscle simulation

    cs.GR 2026-04 unverdicted novelty 7.0

    MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.

  8. Learning-Guided Force-Feedback Model Predictive Control with Obstacle Avoidance for Robotic Deburring

    cs.RO 2026-04 unverdicted novelty 7.0

    A framework merges diffusion-based motion priors with force-feedback MPC to enable reliable tool insertion, force tracking, and collision-free circular motions in robotic deburring.

  9. GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow

    cs.CV 2026-03 unverdicted novelty 7.0

    GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...

  10. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  11. CUBic: Coordinated Unified Bimanual Perception and Control Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...

  12. The Diffusion Encoder

    cs.LG 2026-05 unverdicted novelty 6.0

    A diffusion model serves as the encoder in an autoencoder when trained alternately with the decoder to resolve opposing update directions while retaining the standard diffusion training objective.

  13. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  14. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.

  15. SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...

  16. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  17. SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages

    cs.CV 2026-05 unverdicted novelty 6.0

    SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.

  18. Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.

  19. HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation

    cs.CV 2026-04 unverdicted novelty 6.0

    HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...

  20. GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers

    cs.CV 2026-04 unverdicted novelty 6.0

    A unified diffusion transformer jointly solves single-image relighting and 3D reconstruction via a new isotropic NDC-Orthographic Depth representation and mixed synthetic/real training.

  21. BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving

    cs.RO 2026-04 unverdicted novelty 6.0

    The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.

  22. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  23. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  24. LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

    cs.CV 2026-04 unverdicted novelty 6.0

    LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.

  25. AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling

    cs.LG 2026-04 unverdicted novelty 6.0

    AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...

  26. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  27. SkyReels-V2: Infinite-length Film Generative Model

    cs.CV 2025-04 unverdicted novelty 6.0

    SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.

  28. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  29. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  30. SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    cs.CV 2024-10 unverdicted novelty 6.0

    Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.

  31. Emu3: Next-Token Prediction is All You Need

    cs.CV 2024-09 unverdicted novelty 6.0

    Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.

  32. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  33. Understanding Asynchronous Inference Methods for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...

  34. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  35. Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk

    cs.CL 2026-04 unverdicted novelty 5.0

    Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.

  36. Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.

  37. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  38. Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE

    cs.CV 2026-05 unverdicted novelty 4.0

    Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.

  39. Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification

    eess.SY 2026-04 unverdicted novelty 4.0

    Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.

  40. Show-o2: Improved Native Unified Multimodal Models

    cs.CV 2025-06 unverdicted novelty 4.0

    Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

  41. NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 Challenge establishes a benchmark for bitstream-corrupted video restoration and summarizes the top methods and observed trends from participating teams.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 40 Pith papers · 12 internal anchors

  1. [1]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018.

  2. [2]

    Large scale GAN training for high fidelity natural image synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.

  3. [3]

    Language models are few-shot learners

    Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.

  4. [4]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022.

  5. [5]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021.

  6. [6]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.

  7. [7]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.

  8. [8]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

  9. [9]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.

  10. [10]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.

  11. [11]

    Taming transformers for high-resolution image synthesis, 2020

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2020.

  12. [12]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

  13. [13]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017.

  14. [14]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022.

  15. [15]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

  16. [16]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.

  17. [17]

    Scaling Laws for Autoregressive Generative Modeling

    Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.

  18. [18]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. 2017.

  19. [19]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.

  20. [20]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282, 2021.

  21. [21]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  22. [22]

    Estimation of non-normalized statistical models by score matching

    Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

  23. [23]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134,

  24. [24]

    Scalable adaptive computation for iterative generation

    Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022.

  25. [25]

    Offline reinforcement learning as one big sequence modeling problem

    Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In NeurIPS, 2021.

  26. [26]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.

  27. [27]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 3

  28. [28]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 5

  29. [29]

    Adam: A method for stochastic optimization

    Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5

  30. [30]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. 3, 6

  31. [31]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, 2012. 5

  32. [32]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019. 6

  33. [33]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017. 5

  34. [34]

    Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021. 6

  35. [35]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021. 3, 4

  36. [36]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021. 3

  37. [37]

    On aliased resizing and surprising subtleties in gan evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022. 6

  38. [38]

    Image transformer

    Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018. 2

  39. [39]

    Learning to learn with generative models of neural network checkpoints

    William Peebles, Ilija Radosavovic, Tim Brooks, Alexei Efros, and Jitendra Malik. Learning to learn with generative models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022. 2

  40. [40]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018. 2, 5

  41. [41]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 2

  42. [42]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. 1

  43. [43]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. 2019. 1

  44. [44]

    On network design spaces for visual recognition

    Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Dollár. On network design spaces for visual recognition. In ICCV, 2019. 3

  45. [45]

    Designing network design spaces

    Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020. 3

  46. [46]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022. 1, 2, 3, 4

  47. [47]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 1, 2

  48. [48]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4, 6, 9

  49. [49]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 2, 3

  50. [50]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022. 3

  51. [51]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016. 6

  52. [52]

    PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. 2

  53. [53]

    Stylegan-xl: Scaling stylegan to large diverse datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH,

  54. [54]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 3

  55. [55]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020. 3

  56. [56]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019. 3

  57. [57]

    How to train your ViT? data, augmentation, and regularization in vision transformers

    Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? data, augmentation, and regularization in vision transformers. TMLR, 2022. 6

  58. [58]

    Conditional image generation with pixelcnn decoders

    Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016. 2

  59. [59]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. 2

  60. [60]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1, 2, 5

  61. [61]

    Early convolutions help transformers see better

    Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021. 6

  62. [62]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022. 2

  63. [63]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022. 2, 5

Figure 11. Additional selected samples from our 512×512 and 256×256 resolution DiT-XL/2 models. We use a classifier-free guidance scale of 6.0 for the 512×512 model and 4.0 for the 256×256 model. Both models use the ft-EMA VAE decoder....