Scalable Diffusion Models with Transformers
Pith reviewed 2026-05-12 05:55 UTC · model grok-4.3
The pith
Diffusion transformers replace U-Nets and improve ImageNet generation quality as Gflops increase.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
What carries the argument
The Diffusion Transformer (DiT): a transformer that operates on sequences of latent patches, with scaling behavior tracked directly by forward-pass Gflops.
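To make the latent-patch idea concrete, here is a minimal NumPy sketch of turning a latent feature map into a token sequence. The function name `patchify`, the 32×32×4 latent shape, and the patch sizes tried are illustrative assumptions that are not stated in this review, and the authors' implementation may differ.

```python
import numpy as np


def patchify(latent: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) latent into non-overlapping patch tokens.

    Returns an array of shape (T, patch * patch * C), where
    T = (H // patch) * (W // patch) is the sequence length.
    """
    h, w, c = latent.shape
    if h % patch or w % patch:
        raise ValueError("patch size must divide both spatial dimensions")
    grid_h, grid_w = h // patch, w // patch
    # Separate each spatial axis into (grid index, offset inside the patch).
    x = latent.reshape(grid_h, patch, grid_w, patch, c)
    # Bring the two grid indices to the front so each patch is contiguous.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(grid_h * grid_w, patch * patch * c)


# Assumed latent shape, for illustration only.
latent = np.arange(32 * 32 * 4, dtype=np.float32).reshape(32, 32, 4)
tokens_by_patch = {p: patchify(latent, p).shape for p in (2, 4, 8)}
```

Halving the patch size quadruples the token count, which is one of the routes to higher Gflops named in the core claim.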
If this is right
- Higher Gflops from greater transformer depth or width produce lower FID scores.
- Adding more input tokens from latent patches also improves generation quality.
- The largest DiT models surpass all previous diffusion models on ImageNet 256x256 and 512x512.
- Scalability can be predicted from forward-pass Gflops without additional architectural changes.
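The link between architecture choices and forward-pass compute can be made concrete with a rough per-block cost model. The sketch below counts multiply-accumulates for the attention and MLP sublayers of a standard transformer block and ignores embeddings, conditioning layers, normalization, and the output head. The function name, the MLP ratio of 4, and the example configuration (28 blocks, width 1152, 256 tokens) are assumptions made for illustration; none of them is stated in this review.

```python
def dit_gflops_estimate(depth: int, width: int, tokens: int, mlp_ratio: int = 4) -> float:
    """Rough forward-pass cost in billions of multiply-accumulates."""
    attn_proj = 4 * tokens * width * width        # Q, K, V and output projections
    attn_mix = 2 * tokens * tokens * width        # QK^T scores and weighted sum of V
    mlp = 2 * mlp_ratio * tokens * width * width  # two linear layers of the MLP
    return depth * (attn_proj + attn_mix + mlp) / 1e9


# Hypothetical configuration, used only to show how each knob moves the estimate.
base = dit_gflops_estimate(depth=28, width=1152, tokens=256)
shallower = dit_gflops_estimate(depth=14, width=1152, tokens=256)
narrower = dit_gflops_estimate(depth=28, width=576, tokens=256)
fewer_tokens = dit_gflops_estimate(depth=28, width=1152, tokens=64)
```

In this model depth enters linearly, width roughly quadratically, and token count somewhere between linearly and quadratically, so the routes to more compute listed above do not cost the same per unit of change.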
Where Pith is reading between the lines
- Similar scaling may appear in diffusion models for video or 3D data if the same Gflops-FID relationship holds.
- Training runs could be budgeted directly in Gflops rather than by guessing depth or width in advance.
- Other generative tasks that already use transformers might adopt the same latent-patch approach for consistency.
Load-bearing premise
That raising Gflops by making the transformer deeper, wider, or by using more latent patches will keep reducing FID without training instabilities or diminishing returns.
What would settle it
An experiment in which FID stops falling or starts rising once Gflops exceed the level of the DiT-XL/2 model on the same ImageNet class-conditional benchmarks.
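The test described above amounts to scanning compute-ordered results for the first point at which FID fails to improve. A minimal sketch follows, assuming results are available as (Gflops, FID) pairs. The function name, the `min_gain` tolerance, and the numbers in the example are invented placeholders, not measurements from the paper.

```python
from typing import Iterable, Optional, Tuple

Point = Tuple[float, float]  # (gflops, fid)


def first_non_improving(points: Iterable[Point], min_gain: float = 0.0) -> Optional[Point]:
    """Return the first point, in order of increasing Gflops, whose FID
    improves on the previous point by no more than `min_gain`.

    Returns None when FID keeps falling across the whole range.
    """
    ordered = sorted(points)
    for (_, fid_prev), (gflops, fid) in zip(ordered, ordered[1:]):
        if fid_prev - fid <= min_gain:
            return (gflops, fid)
    return None


# Placeholder values only; they do not come from the paper.
still_scaling = [(1.0, 50.0), (10.0, 20.0), (100.0, 5.0)]
plateaued = [(1.0, 50.0), (10.0, 20.0), (100.0, 21.0)]
verdict_a = first_non_improving(still_scaling)
verdict_b = first_non_improving(plateaued)
```

Setting `min_gain` above zero lets the check treat improvements smaller than the expected evaluation noise as a plateau.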
Original abstract
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Diffusion Transformers (DiTs), a new class of latent diffusion models that replace the standard U-Net backbone with a transformer operating on latent patches. The authors analyze scalability through forward-pass complexity measured in Gflops and show that increasing Gflops, via greater transformer depth/width or more input tokens, consistently reduces FID on class-conditional ImageNet. Their largest model, DiT-XL/2, achieves a state-of-the-art FID of 2.27 on the 256×256 benchmark and also outperforms prior diffusion models on 512×512, including ADM and LDM, under a standardized 50k-sample evaluation protocol with classifier-free guidance.
Significance. If the reported scaling trends and benchmark results hold under closer scrutiny, the work demonstrates that transformer architectures can serve as scalable, high-performing backbones for diffusion models, offering an alternative to convolutional U-Nets that improves with compute. The monotonic Gflops-vs-FID relationship across multiple DiT variants (S/B/L/XL) and patch sizes supplies concrete empirical support for the central scalability thesis and could influence backbone design choices in future generative modeling research.
major comments (3)
- [Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.
- [Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.
- [Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.
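The matched-compute comparison requested in the third major comment could be organized as below: group runs into geometric Gflops bins, then look at the FID spread inside each bin. This is a sketch under stated assumptions. The record fields, the bin ratio of 1.5, and the example runs are all hypothetical and contain no results from the paper.

```python
import math
from collections import defaultdict
from typing import Dict, List


def bin_by_gflops(runs: List[dict], ratio: float = 1.5) -> Dict[int, List[dict]]:
    """Group runs into geometric bins; each bin spans a factor of `ratio` in Gflops."""
    bins: Dict[int, List[dict]] = defaultdict(list)
    for run in runs:
        index = math.floor(math.log(run["gflops"]) / math.log(ratio))
        bins[index].append(run)
    return dict(bins)


def fid_spread(bins: Dict[int, List[dict]]) -> Dict[int, float]:
    """For bins holding at least two runs, report max FID minus min FID."""
    spread = {}
    for index, members in bins.items():
        if len(members) >= 2:
            fids = [m["fid"] for m in members]
            spread[index] = max(fids) - min(fids)
    return spread


# Hypothetical runs: two architectures at similar compute, one at much higher compute.
runs = [
    {"name": "deep-narrow", "gflops": 10.0, "fid": 30.0},
    {"name": "shallow-wide", "gflops": 11.0, "fid": 32.0},
    {"name": "large", "gflops": 100.0, "fid": 10.0},
]
bins = bin_by_gflops(runs)
spread = fid_spread(bins)
```

A small spread inside each bin, next to large gaps between bins, would support the reading that total compute matters more than how it is allocated. Fixed bin edges can separate two runs of nearly equal Gflops, so a pairwise tolerance may be preferable in practice.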
minor comments (2)
- [Figures] Figure captions for the Gflops-vs-FID plots should explicitly state the number of samples used for FID computation and whether classifier-free guidance scale is held constant across all points.
- [Model architecture] The notation for model variants (DiT-S/B/L/XL) and patch sizes (DiT-XL/2) is introduced without a dedicated table summarizing parameter counts, Gflops, and layer configurations; adding one would improve readability.
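The minor comment on guidance scale refers to classifier-free guidance, in which a conditional and an unconditional noise prediction are blended by a single scalar. A minimal sketch of one common form of that blend follows. The function and argument names are hypothetical, and the exact parameterization used in the paper is not stated in this review.

```python
import numpy as np


def guided_noise(eps_cond: np.ndarray, eps_uncond: np.ndarray, scale: float) -> np.ndarray:
    """Blend conditional and unconditional noise predictions.

    scale = 1 returns the conditional prediction unchanged;
    scale > 1 extrapolates away from the unconditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy predictions, for illustration only.
eps_cond = np.array([1.0, 2.0, 3.0])
eps_uncond = np.array([0.0, 0.0, 0.0])
no_guidance = guided_noise(eps_cond, eps_uncond, scale=1.0)
stronger = guided_noise(eps_cond, eps_uncond, scale=1.5)
```

Because the scale changes the sampled distribution, FID values computed at different scales are not directly comparable, which is why the caption should state whether the scale is fixed.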
Simulated Author's Rebuttal
We thank the referee for their positive summary of our work and for the constructive major comments. We address each point below with our responses and planned revisions.
Point-by-point responses
- Referee: [Experiments] The central scalability claim (higher Gflops yields lower FID) is supported by curves across DiT variants, but the Experiments section provides insufficient detail on training procedures, including optimizer settings, learning-rate schedules, total training steps, and data-augmentation choices. Without these, it is difficult to verify that the observed FID gains are attributable to Gflops rather than differences in optimization or regularization.
  Authors: We agree that expanded details on training procedures will strengthen verifiability. We will revise the Experiments section to explicitly describe the optimizer, learning-rate schedule, total training steps, and data augmentations, noting that these choices are held fixed across all DiT variants. This will clarify that observed FID differences arise from Gflops scaling. We will also release training code for full reproducibility.
  Revision: yes
- Referee: [Benchmark tables] Benchmark tables report single-point FID values (e.g., 2.27 for DiT-XL/2) without error bars, standard deviations, or results from multiple independent runs. This omission weakens the strength of the SOTA claim relative to prior models, as small differences in FID can arise from stochasticity in sampling or evaluation.
  Authors: We acknowledge the value of error bars for robustness. However, multiple independent runs of the largest models are computationally prohibitive. We adhere to the standardized 50k-sample evaluation protocol with classifier-free guidance used by prior works (ADM, LDM) for fair comparison. The consistent monotonic scaling trends across variants support result reliability. We will add a discussion of evaluation variance and practical limitations in the revised Experiments section.
  Revision: partial
- Referee: [Ablation studies] While Gflops scaling is examined by varying depth/width and patch size (2/4/8), the manuscript lacks a controlled ablation that isolates the contribution of each factor while holding total Gflops fixed. Such an analysis would strengthen the claim that the improvement is driven by compute rather than architectural specifics.
  Authors: We agree a controlled iso-Gflops analysis would be beneficial. Our existing results include multiple architectural paths to similar Gflops levels. We will add a new analysis (derived from current data) that bins models by Gflops and compares FID for different depth/width/patch configurations at matched compute, to better isolate the role of total Gflops.
  Revision: yes
- [Benchmark tables] The request for error bars or results from multiple independent runs on the benchmark FID scores.
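The error bars at issue could be obtained by repeating the sampling step under several random seeds and reporting the mean and standard deviation of the resulting scores, which does not require retraining. The sketch below computes the Fréchet distance between Gaussian fits of two feature sets and aggregates it over seeds. It is an illustration under stated assumptions: random Gaussian vectors stand in for the Inception features of real and generated images, and the helper names `fid_mean_std` and `fake_features` are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


def fid_mean_std(sample_features, reference: np.ndarray, seeds) -> tuple:
    """Score one sample set per seed; return (mean, sample standard deviation)."""
    scores = [frechet_distance(sample_features(seed), reference) for seed in seeds]
    return float(np.mean(scores)), float(np.std(scores, ddof=1))


def fake_features(seed: int, n: int = 2000, dim: int = 8) -> np.ndarray:
    """Synthetic stand-in for features of generated images (assumption)."""
    return np.random.default_rng(seed).normal(size=(n, dim))


reference = fake_features(seed=0)
mean_fid, std_fid = fid_mean_std(fake_features, reference, seeds=[1, 2, 3, 4])
```

This captures sampling and evaluation variance only. Variance across independent training runs, which the authors describe as computationally prohibitive, would need separate runs.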
Circularity Check
No significant circularity
Full rationale
The paper presents empirical results from training and evaluating DiT models on public ImageNet benchmarks, including Gflops-vs-FID scaling curves across model variants and patch sizes plus direct FID comparisons to ADM, LDM and other baselines under identical 50k-sample protocols. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter, self-citation chain, or definitional tautology; the reported scaling trends and SOTA FID of 2.27 are externally falsifiable through independent training runs and benchmark tables.
Axiom & Free-Parameter Ledger
free parameters (1)
- Gflops scaling factors
axioms (1)
- domain assumption: Transformers can effectively replace U-Nets when operating on latent patches for diffusion
Forward citations
Cited by 41 Pith papers
- CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
  CRePE supplies depth-aware positional distributions along curved rays for stable unified-camera control in frozen video DiT models.
- From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models
  Chirality emerges in SMILES translation models through an abrupt encoder-centered reorganization of representations after a long plateau, identified via checkpoint analysis and ablation.
- LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
  LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
- A Flow Matching Algorithm for Many-Shot Adaptation to Unseen Distributions
  FP-FM adapts flow matching models to unseen distributions via least-squares projection onto basis functions spanning training velocity fields, yielding improved precision and recall without inference-time training.
- D-Rex: Diffusion Rendering for Relightable Expressive Avatars
  D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
- Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
  Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
- MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation
  MoZoo generates high-fidelity animal videos with fur and muscle dynamics from coarse meshes by extending video diffusion with role-aware RoPE and asymmetric decoupled attention, trained on a new synthetic-to-real dataset.
- Learning-Guided Force-Feedback Model Predictive Control with Obstacle Avoidance for Robotic Deburring
  A framework merges diffusion-based motion priors with force-feedback MPC to enable reliable tool insertion, force tracking, and collision-free circular motions in robotic deburring.
- GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow
  GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
  A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
- CUBic: Coordinated Unified Bimanual Perception and Control Framework
  CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordinatio...
- The Diffusion Encoder
  A diffusion model serves as the encoder in an autoencoder when trained alternately with the decoder to resolve opposing update directions while retaining the standard diffusion training objective.
- The two clocks and the innovation window: When and how generative models learn rules
  Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V matches end-to-end 2K I2V quality on VBench while cutting GPU time by 202x via conditional segment-wise generation that bounds token cost and preserves input fidelity.
- SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
  SwiftI2V achieves comparable 2K I2V quality to end-to-end models on VBench-I2V while cutting GPU time by 202x through low-resolution motion planning followed by strongly image-conditioned segment-wise high-resolution ...
- Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
  VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
- SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages
  SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.
- Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
  L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
- HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
  HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt ...
- GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers
  A unified diffusion transformer jointly solves single-image relighting and 3D reconstruction via a new isotropic NDC-Orthographic Depth representation and mixed synthetic/real training.
- BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving
  The primary OL-CL gap in end-to-end autonomous driving arises from objective mismatch creating structural inability to model reactive behaviors, which a test-time adaptation method can mitigate.
- ELT: Elastic Looped Transformers for Visual Generation
  Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
- PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing
  PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.
- LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
  LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
- AE-ViT: Stable Long-Horizon Parametric Partial Differential Equations Modeling
  AE-ViT combines a convolutional autoencoder with a latent-space transformer and multi-stage parameter plus coordinate injection to deliver stable long-horizon predictions for parametric PDEs, cutting relative rollout ...
- Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
  MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
- SkyReels-V2: Infinite-length Film Generative Model
  SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
  Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
- VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
  VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
  Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
- Understanding Asynchronous Inference Methods for Vision-Language-Action Models
  Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while tra...
- Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
  Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
- Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
  Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
- Who Defines Fairness? Target-Based Prompting for Demographic Representation in Generative Models
  Target-based prompting lets users define fairness distributions for skin tones in generative AI, shifting outputs closer to chosen targets across 36 tested prompts for occupations and contexts.
- Gated Memory Policy
  GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
  Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.
- Target Parameterization in Diffusion Models for Nonlinear Spatiotemporal System Identification
  Clean-state prediction in diffusion models for turbulent spatiotemporal systems improves rollout stability and reduces long-horizon error compared to velocity- and noise-based objectives.
- Show-o2: Improved Native Unified Multimodal Models
  Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
- NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results
  The NTIRE 2026 Challenge establishes a benchmark for bitstream-corrupted video restoration and summarizes the top methods and observed trends from participating teams.
Reference graph
Works this paper leans on
-
[1]
JAX: composable transformations of Python+NumPy programs, 2018
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclau- rin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. 6
work page 2018
-
[2]
Large scale GAN training for high fidelity natural image synthesis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019. 5, 9
work page 2019
-
[3]
Lan- guage models are few-shot learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. In NeurIPS, 2020. 1
work page 2020
-
[4]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022. 2
work page 2022
-
[5]
Decision transformer: Reinforce- ment learning via sequence modeling
Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srini- vas, and Igor Mordatch. Decision transformer: Reinforce- ment learning via sequence modeling. In NeurIPS, 2021. 2
work page 2021
-
[6]
Generative pre- training from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML, 2020. 1, 2
work page 2020
-
[7]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019. 2
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[8]
Bert: Pre-training of deep bidirectional trans- formers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In NAACL-HCT, 2019. 1
work page 2019
-
[9]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021. 1, 2, 3, 5, 6, 9, 12
work page 2021
-
[10]
An image is worth 16x16 words: Trans- formers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. In ICLR, 2020. 1, 2, 4, 5
work page 2020
-
[11]
Taming transformers for high-resolution image synthesis, 2020
Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. Taming transformers for high-resolution image synthesis, 2020. 2
work page 2020
-
[12]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 3
work page 2014
-
[13]
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Doll ´ar, Ross Girshick, Pieter Noord- huis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017. 5
work page internal anchor Pith review arXiv 2017
-
[14]
Vec- tor quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec- tor quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696–10706, 2022. 2
work page 2022
-
[15]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,
-
[16]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016. 12
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[17]
Scaling Laws for Autoregressive Generative Modeling
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020. 2
work page internal anchor Pith review arXiv 2010
-
[18]
Gans trained by a two time-scale update rule converge to a local nash equilib- rium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. 2017. 6
work page 2017
-
[19]
Denoising diffu- sion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. In NeurIPS, 2020. 2, 3
work page 2020
-
[20]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cas- caded diffusion models for high fidelity image generation. arXiv:2106.15282, 2021. 3, 9
-
[21]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 3, 4
work page 2021
-
[22]
Estimation of non- normalized statistical models by score matching
Aapo Hyv ¨arinen and Peter Dayan. Estimation of non- normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. 3
work page 2005
-
[23]
Image-to-image translation with conditional adver- sarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adver- sarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134,
-
[24]
Scalable adaptive computation for iterative generation,
Allan Jabri, David Fleet, and Ting Chen. Scalable adap- tive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022. 3
-
[25]
Offline rein- forcement learning as one big sequence modeling problem
Michael Janner, Qiyang Li, and Sergey Levine. Offline rein- forcement learning as one big sequence modeling problem. In NeurIPS, 2021. 2
work page 2021
-
[26]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020. 2, 13
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[27]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022. 3
work page 2022
-
[28]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019. 5
work page 2019
-
[29]
Adam: A method for stochastic optimization
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 5
work page 2015
-
[30]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding varia- tional bayes. arXiv preprint arXiv:1312.6114, 2013. 3, 6
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[31]
Imagenet classification with deep convolutional neural net- works
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural net- works. In NeurIPS, 2012. 5
work page 2012
-
[32]
Improved precision and recall met- ric for assessing generative models
Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall met- ric for assessing generative models. In NeurIPS, 2019. 6
work page 2019
-
[33]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017. 5 10
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [34]
-
[35]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021. 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[36]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021. 3
work page 2021
-
[37]
On aliased resizing and surprising subtleties in gan evaluation
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022. 6
work page 2022
-
[38]
Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im- age transformer. In International conference on machine learning, pages 4055–4064. PMLR, 2018. 2
work page 2018
-
[39]
Learning to learn with genera- tive models of neural network checkpoints
William Peebles, Ilija Radosavovic, Tim Brooks, Alexei Efros, and Jitendra Malik. Learning to learn with genera- tive models of neural network checkpoints. arXiv preprint arXiv:2209.12892, 2022. 2
-
[40]
Film: Visual reasoning with a general conditioning layer
Ethan Perez, Florian Strub, Harm De Vries, Vincent Du- moulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018. 2, 5
work page 2018
-
[41]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. In ICML, 2021. 2
work page 2021
-
[42]
Improving language understanding by generative pre-training
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. 1
work page 2018
-
[43]
Language models are unsu- pervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners. 2019. 1
work page 2019
-
[44]
On network design spaces for visual recog- nition
Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, and Piotr Doll´ar. On network design spaces for visual recog- nition. In ICCV, 2019. 3
work page 2019
-
[45]
Designing network design spaces
Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020. 3
work page 2020
-
[46]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv:2204.06125, 2022. 1, 2, 3, 4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021. 1, 2
work page 2021
-
[48]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2, 3, 4, 6, 9
work page 2022
-
[49]
U-net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 2, 3
work page 2015
-
[50]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. arXiv:2205.11487, 2022. 3
work page internal anchor Pith review arXiv 2022
-
[51]
Improved techniques for training GANs
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016. 6
work page 2016
-
[52]
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017. 2
-
[53]
Stylegan-xl: Scaling stylegan to large diverse datasets
Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In SIGGRAPH,
-
[54]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015. 3
work page 2015
-
[55]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[56]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019. 3
work page 2019
-
[57]
How to train your ViT? data, augmentation, and regularization in vision transformers
Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? data, augmentation, and regularization in vision transformers. TMLR, 2022. 6
work page 2022
-
[58]
Conditional image generation with pixelcnn decoders
Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016. 2
work page 2016
-
[59]
Neural discrete representation learning
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017. 2
work page 2017
-
[60]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 1, 2, 5
work page 2017
-
[61]
Early convolutions help transformers see better
Tete Xiao, Piotr Dollar, Mannat Singh, Eric Mintun, Trevor Darrell, and Ross Girshick. Early convolutions help transformers see better. In NeurIPS, 2021. 6
work page 2021
-
[62]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[63]
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022. 2, 5
Figure 11. Additional selected samples from our 512×512 and 256×256 resolution DiT-XL/2 models. We use a classifier-free guidance scale of 6.0 for the 512×512 model and 4.0 for the 256×256 model. Both models use the ft-EMA VAE decoder....
work page 2022