pith. machine review for the scientific record.

arxiv: 2112.10752 · v2 · submitted 2021-12-20 · 💻 cs.CV

Recognition: no theorem link

High-Resolution Image Synthesis with Latent Diffusion Models


Pith reviewed 2026-05-11 21:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords latent diffusion models · image synthesis · denoising diffusion · autoencoders · conditional generation · image inpainting · super-resolution · cross-attention

The pith

Diffusion models trained in the latent space of pretrained autoencoders generate high-resolution images with substantially lower computational cost than pixel-space versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion models can be moved from raw pixel space into the compressed latent space of a fixed pretrained autoencoder. This shift preserves enough visual structure for high-fidelity synthesis while sharply cutting the cost of training and sampling. Readers care because cross-attention layers added to the denoising network let the same model be conditioned on text, bounding boxes, or masks, making it a flexible conditional generator. The result is practical high-resolution synthesis on limited computational resources and state-of-the-art results on image inpainting.

Core claim

By applying the diffusion process to the latent representations of a pretrained autoencoder rather than to pixels, and by inserting cross-attention layers to accept arbitrary conditioning inputs, latent diffusion models reach a favorable trade-off between model capacity and perceptual fidelity while requiring far fewer resources than pixel-based diffusion models.

What carries the argument

The latent diffusion model (LDM), which runs the forward and reverse diffusion processes on the lower-dimensional latent codes produced by a fixed pretrained autoencoder and uses cross-attention to incorporate conditioning signals such as text or spatial layouts.
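To make the mechanism concrete, here is a minimal sketch of the forward (noising) step applied to a latent code instead of to pixels. Everything named here is an assumption for illustration: `toy_encode` is a hypothetical stand-in (plain average pooling) for the pretrained encoder, and the linear beta schedule in `make_alpha_bar` is a common DDPM-style choice, not necessarily the schedule used in the paper.

```python
import math
import random

def toy_encode(image, f):
    """Hypothetical stand-in for a pretrained encoder: f x f average pooling.
    A real LDM uses a learned autoencoder; this only mimics the downsampling."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i * f + a][j * f + b] for a in range(f) for b in range(f)) / (f * f)
            for j in range(w // f)
        ]
        for i in range(h // f)
    ]

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def noise_latent(z0, t, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    z_t = [[signal * z + sigma * e for z, e in zip(z_row, e_row)]
           for z_row, e_row in zip(z0, eps)]
    return z_t, eps

rng = random.Random(0)
image = [[rng.random() for _ in range(64)] for _ in range(64)]  # toy 64x64 grayscale image
z0 = toy_encode(image, f=8)                                     # 8x8 latent grid
alpha_bar = make_alpha_bar()
z_t, eps = noise_latent(z0, t=500, alpha_bar=alpha_bar, rng=rng)
print(len(image), "x", len(image[0]), "->", len(z_t), "x", len(z_t[0]))
```

The denoising network would then be trained to predict `eps` from `z_t` and `t`; the point of the sketch is only that every diffusion step touches the small latent grid, never the full-resolution image.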

If this is right

  • Training and inference of powerful diffusion models become feasible on limited hardware while retaining visual quality.
  • High-resolution synthesis is performed directly in a convolutional manner without patch-wise processing.
  • Image inpainting reaches state-of-the-art results.
  • Unconditional generation, semantic scene synthesis, and super-resolution remain competitive with prior pixel-space methods.
  • Conditioning on text, bounding boxes, or other inputs is enabled by cross-attention layers in the denoising network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The separation of perceptual compression from the generative diffusion stage suggests similar latent-space training could be tested on other modalities once suitable autoencoders exist.
  • If the autoencoder is kept fixed, future improvements in autoencoder quality would immediately lift the upper bound on LDM fidelity without changing the diffusion architecture.
  • The approach implies that many existing pixel-based diffusion pipelines could be accelerated by first training a domain-specific autoencoder rather than scaling the diffusion model itself.

Load-bearing premise

The latent codes from the pretrained autoencoder already contain enough perceptual detail and spatial structure that the diffusion model can recover high-fidelity images without uncorrectable artifacts.

What would settle it

High-resolution outputs that consistently exhibit uncorrectable artifacts or visible loss of fine detail relative to pixel-based diffusion models of comparable training effort would show the assumption does not hold.

Original abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that applying diffusion models in the latent space of pretrained autoencoders enables efficient high-resolution image synthesis. Latent diffusion models (LDMs) reduce spatial dimensions with a pretrained, regularized autoencoder (KL or VQ regularization; downsampling factors such as f=4/8/16) while preserving detail, incorporate cross-attention for conditioning on text or bounding boxes, and achieve new state-of-the-art inpainting results along with competitive performance on unconditional generation, semantic synthesis, and super-resolution, all at substantially lower computational cost than pixel-space DMs. Public code is released.
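As a quick check on what the downsampling factors in the summary mean in practice, the sketch below counts spatial positions before and after downsampling by f. It is arithmetic only: the 512×512 input size and the helper names are assumptions for illustration, and the ratio covers spatial positions alone, not total FLOPs (channel widths, attention layers, and the encode/decode passes are ignored).

```python
def latent_shape(height, width, f):
    """Spatial size of the latent grid for downsampling factor f."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return height // f, width // f

def spatial_reduction(height, width, f):
    """How many times fewer spatial positions the latent grid has than the image."""
    latent_h, latent_w = latent_shape(height, width, f)
    return (height * width) / (latent_h * latent_w)

for f in (4, 8, 16):
    latent_h, latent_w = latent_shape(512, 512, f)
    ratio = spatial_reduction(512, 512, f)
    print(f"f={f}: latent {latent_h}x{latent_w}, {ratio:.0f}x fewer spatial positions")
# f=4: latent 128x128, 16x fewer spatial positions
# f=8: latent 64x64, 64x fewer spatial positions
# f=16: latent 32x32, 256x fewer spatial positions
```

The f=8 row reproduces the ~1/64 figure quoted in the report: the reduction is simply f squared.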

Significance. If the results hold, this has high significance for making diffusion-based synthesis practical at high resolutions with limited resources. Strengths include the public code release, direct ablations on autoencoder factors, and quantitative FID/LPIPS tables on ImageNet, Places2, and ADE20K that support the efficiency and quality claims. The stress-test concern on latent representation fidelity does not land as a load-bearing issue, since the f=8 model empirically recovers high-frequency detail without uncorrectable artifacts and matches or exceeds pixel DM quality.

major comments (2)
  1. [Ablations on autoencoder downsampling factors] The claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.
  2. [Cross-attention layers] While cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.
minor comments (2)
  1. [Abstract] The abstract's reference to 'hundreds of GPU days' for pixel-space DM optimization would be strengthened by citing the specific prior works being compared.
  2. [Methods] Notation for the latent variable z and the diffusion forward/reverse processes in latent space could be clarified with an explicit equation reference or diagram early in the methods section.
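
For orientation on the notation raised in the second minor comment, the latent-space process can be written in the standard DDPM parameterization as below. This is a sketch, not a quotation of the manuscript's numbered equations: the cumulative noise coefficient \bar{\alpha}_t and the exact form of the forward process are assumed from the usual DDPM setup, with \mathcal{E} the pretrained encoder, \epsilon_\theta the denoising network, and \tau_\theta an encoder for the conditioning input y.

```latex
% Latent code for an image x, produced by the pretrained encoder
z_0 = \mathcal{E}(x)

% Forward (noising) process in latent space, standard DDPM form
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\, I\right)

% Unconditional denoising objective
L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, I),\ t}
  \left[ \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \right]

% Conditional variant: the denoiser also receives the encoded condition
L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\ y,\ \epsilon \sim \mathcal{N}(0, I),\ t}
  \left[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \right]
```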

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on our efficiency claims and conditioning design. We address each major comment below and have incorporated revisions to improve clarity.

Point-by-point responses
  1. Referee: Ablations on autoencoder downsampling factors: the claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.

    Authors: We agree that an explicit derivation would strengthen the presentation. The ~1/64 factor follows directly from reducing the spatial resolution of the UNet input by f=8 in each dimension (latent size H/8 × W/8), which quadratically reduces the number of spatial operations. Accounting for the UNet channel schedule (starting at 320 channels with doubling in down-blocks), the overall computational cost of the diffusion process scales by this factor relative to pixel-space models. In the revised manuscript we will add a short derivation in Section 3.1 (or an appendix table) that computes the reduction from the exact latent resolution and channel dimensions, enabling straightforward verification. revision: yes

  2. Referee: Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.

    Authors: We appreciate the suggestion. Cross-attention is chosen because it supports conditioning inputs of arbitrary length and structure (e.g., variable-length text token sequences or unordered sets of bounding-box embeddings) without requiring fixed-dimensional inputs, which concatenation or FiLM layers would necessitate. This flexibility is central to the high-resolution text-to-image and layout-to-image results. A full retraining ablation is outside the scope of a minor revision, but we will add a concise discussion paragraph in Section 3.2 explaining the architectural rationale and contrasting it with simpler alternatives, thereby addressing the concern without misrepresenting the design. revision: partial
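
The variable-length argument in the second response can be illustrated with a minimal single-head cross-attention sketch in pure Python. All dimensions, the random projection matrices, and the helper names are assumptions for illustration; a real model would use learned weights, multiple heads, and a tensor library. Queries come from the flattened latent features, keys and values from the conditioning tokens.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d)) V with Q from latents, K and V from the condition."""
    q = matmul(latent_tokens, w_q)
    k = matmul(cond_tokens, w_k)
    v = matmul(cond_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k] for q_i in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_latent, d_cond, d_attn = 4, 6, 8          # illustrative sizes only
w_q = random_matrix(d_latent, d_attn, rng)  # random stand-ins for learned projections
w_k = random_matrix(d_cond, d_attn, rng)
w_v = random_matrix(d_cond, d_attn, rng)
latent_tokens = random_matrix(16, d_latent, rng)  # a 4x4 latent grid, flattened

for num_cond in (3, 11):  # e.g. a short and a longer token sequence
    cond_tokens = random_matrix(num_cond, d_cond, rng)
    out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
    print(num_cond, "condition tokens ->", len(out), "x", len(out[0]), "output")
```

The output shape depends only on the number of latent positions, so the same weights accept 3 or 11 conditioning tokens unchanged. Concatenation or FiLM would first need the condition reduced to a fixed-size vector or map.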

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper proposes applying diffusion models in the latent space of a separately pretrained autoencoder, with the central claims of state-of-the-art inpainting and competitive performance on generation tasks supported by direct empirical ablations (e.g., downsampling factors f=4/8/16) and quantitative comparisons to pixel-space baselines on ImageNet, Places2, and ADE20K. No load-bearing step reduces a result or prediction to its own inputs by construction, fitted parameters renamed as outputs, or a self-citation chain; the autoencoder training and latent diffusion training are independent stages, and all performance assertions rest on measured metrics rather than theoretical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a pretrained autoencoder can compress images into a latent space that retains sufficient detail for diffusion-based generation; no free parameters are introduced in the abstract description, and no new entities are postulated.

axioms (1)
  • domain assumption Pretrained autoencoders produce latent representations that preserve perceptual details necessary for high-fidelity image synthesis.
    Invoked to justify operating diffusion in latent space rather than pixels.

pith-pipeline@v0.9.0 · 5539 in / 1297 out tokens · 47273 ms · 2026-05-11T21:56:11.949352+00:00 · methodology


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.

  2. What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

    cs.LG 2026-05 unverdicted novelty 8.0

    Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.

  3. CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

    cs.CV 2026-05 conditional novelty 7.0

    CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...

  4. AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

    cs.CV 2026-05 conditional novelty 7.0

    AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.

  5. Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion...

  6. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  7. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  8. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  9. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

  10. Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes

    cs.CV 2026-04 unverdicted novelty 7.0

    Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.

  11. Advantage-Guided Diffusion for Model-Based Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

  12. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  13. Drifting Fields are not Conservative

    cs.LG 2026-04 conditional novelty 7.0

    Drift fields in single-pass generative models are not conservative except for Gaussian kernels; a sharp kernel normalization makes them conservative for any radial kernel while noting that non-conservative fields offe...

  14. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  15. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  16. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  17. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  18. Network-Efficient World Model Token Streaming

    cs.RO 2026-05 unverdicted novelty 6.0

    An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...

  19. Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Probability-Flow Distillation exactly matches the Wasserstein gradient flow of the target distribution when distilling 2D diffusion priors into 3D models, yielding higher-fidelity results than SDS or SDI.

  20. AIMIP Phase 1: systematic evaluations of AI weather and climate models

    physics.ao-ph 2026-05 unverdicted novelty 6.0

    AIMIP Phase 1 shows AI models simulate historical climate and El Niño responses as well as traditional models, though some underestimate trends and diverge in generalization tests, with a public dataset released for f...

  21. GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

    cs.AI 2026-05 unverdicted novelty 6.0

    GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.

  22. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  23. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  24. Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems

    cs.LG 2026-05 unverdicted novelty 6.0

    A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.

  25. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  26. Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.

  27. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  28. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  29. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  30. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  31. Drifting Fields are not Conservative

    cs.LG 2026-04 unverdicted novelty 6.0

    Drift fields are not conservative except for Gaussian kernels; sharp normalization makes them conservative for any radial kernel by equating them to score differences of kernel density estimates.

  32. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  33. LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems

    cs.LG 2026-04 conditional novelty 6.0

    A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.

  34. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  35. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  36. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  37. Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

    cs.CV 2026-05 unverdicted novelty 5.0

    Kernel interpolation with a constant multiplier scales convolution and fully-connected layers in neural networks to higher resolutions or dimensions without training, producing competitive results on Stable Diffusion ...

  38. Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

    cs.CV 2026-05 unverdicted novelty 5.0

    Kernel interpolation with a constant scaling factor enables Stable Diffusion to produce higher-resolution images without training and extends to general neural networks with small accuracy drops.

  39. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  40. ClayScape: A GenAI-Supported Workflow for Designing Chinese Style Ceramics with Clay 3D Printing

    cs.HC 2026-04 unverdicted novelty 5.0

    ClayScape is a hybrid GenAI and clay 3D printing workflow that makes Chinese ceramic design more accessible to creators, as tested with four users who reported expanded creative options alongside agency challenges.

  41. Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk

    cs.CL 2026-04 unverdicted novelty 5.0

    Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.

  42. Style-Based Neural Architectures for Real-Time Weather Classification

    cs.CV 2026-04 unverdicted novelty 5.0

    Three style-based neural architectures are proposed for real-time weather classification from images, with two truncated ResNet variants claimed to outperform prior methods and generalize across public datasets.

  43. Open-Sora: Democratizing Efficient Video Production for All

    cs.CV 2024-12 unverdicted novelty 5.0

    Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...

  44. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.

  45. Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection

    cs.DC 2026-03 unverdicted novelty 4.0

    TensorRT YOLO pipelines on Jetson Nano keep GPU occupancy, power draw, and temperature stable even under heavy fault-injected inputs for object detection and lane following.

  46. A Real-Calibrated Synthetic-First Data Engine

    eess.IV 2026-05 unverdicted novelty 3.0

    A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.

  47. Generative AI for material design: A mechanics perspective from burgers to matter

    cs.CE 2026-04 unverdicted novelty 3.0

    Diffusion models from generative AI, sharing math with material mechanics, generate new burger recipes from 2,260 examples that some blind tasters prefer over the Big Mac.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 44 Pith papers · 15 internal anchors

  1. [1]

    NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study

    Eirikur Agustsson and Radu Timofte. NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Com- puter Society, 2017. 1

  2. [2]

    Wasserstein gan, 2017

    Martin Arjovsky, Soumith Chintala, and L ´eon Bottou. Wasserstein gan, 2017. 3

  3. [3]

    Large scale GAN training for high fidelity natural image synthe- sis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthe- sis. In Int. Conf. Learn. Represent. , 2019. 1, 2, 7, 8, 22, 28

  4. [4]

    Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recog- nition, CVPR 2018, Salt Lake City, UT, USA, June 18- 22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018. 7, 20, 22

  5. [5]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) , pages 2633–2650, 2021. 9

  6. [6]

    Generative pre- training from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR,

  7. [7]

    Weiss, Mo- hammad Norouzi, and William Chan

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mo- hammad Norouzi, and William Chan. Wavegrad: Estimat- ing gradients for waveform generation. In ICLR. OpenRe- view.net, 2021. 1

  8. [8]

    Fast fourier convolu- tion

    Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolu- tion. In NeurIPS, 2020. 8

  9. [9]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020. 3

  10. [10]

    Generating Long Sequences with Sparse Transformers

    Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. 3

  11. [11]

    Bin Dai and David P. Wipf. Diagnosing and enhancing V AE models. In ICLR (Poster). OpenReview.net, 2019. 2, 3

  12. [12]

    Imagenet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical im- age database. In CVPR, pages 248–255. IEEE Computer Society, 2009. 1, 5, 7, 22

  13. [13]

    Ethical considerations of generative ai

    Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021. 9

  14. [14]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirec- tional transformers for language understanding. CoRR, abs/1810.04805, 2018. 7

  15. [15]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021. 1, 2, 3, 4, 6, 7, 8, 18, 22, 25, 26, 28

  16. [16]

    Musings on typicality, 2020

    Sander Dieleman. Musings on typicality, 2020. 1, 3

  17. [17]

    Cogview: Mastering text-to- image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to- image generation via transformers. CoRR, abs/2105.13290,

  18. [18]

    Nice: Non-linear independent components estimation, 2015

    Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015. 3

  19. [19]

    Density estimation using real NVP

    Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 1, 3

  20. [20]

    Generating images with perceptual similarity metrics based on deep networks

    Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016. 3

  21. [21]

    Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

    Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021. 6, 7, 22

  22. [22]

    A note on data biases in generative models

    Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020. 9

  23. [23]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. CoRR, abs/2012.09841, 2020. 2, 3, 4, 6, 7, 21, 22, 29, 34, 36

  24. [24]

    Sex, lies, and videotape: Deep fakes and free speech delusions

    Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018. 9

  25. [25]

    Clipdraw: Exploring text-to-drawing synthesis through language-image encoders

    Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. ArXiv, abs/2106.14843, 2021. 3

  26. [26]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022. 6, 7, 16

  27. [27]

    Generative adversarial networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014. 1, 2

  28. [28]

    Improved training of wasserstein gans, 2017

    Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017. 3

  29. [29]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017. 1, 5, 26

  30. [30]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 4, 6, 17

  31. [31]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021. 1, 3, 22

  32. [32]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6, 7, 16, 22, 28, 37, 38

  33. [33]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017. 3, 4

  34. [34]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976,

  35. [35]

    Perceiver IO: A general architecture for structured inputs & outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. CoRR, abs/2107.14795, 2021. 4

  36. [36]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Researc...

  37. [37]

    High-resolution complex scene synthesis with transformers

    Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021. 20, 22, 27

  38. [38]

    Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses

    Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020. 9

  39. [39]

    Progressive Growing of GANs for Improved Quality, Stability, and Variation

    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. 5, 6

  40. [40]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019. 1

  41. [41]

    A style-based generator architecture for generative adversarial networks

    T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5, 6

  42. [42]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958,

  43. [43]

    Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation

    Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021. 6

  44. [44]

    Glow: Generative flow with invertible 1x1 convolutions

    Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018. 3

  45. [45]

    Variational diffusion models

    Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021. 1, 3, 16

  46. [46]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014. 1, 3, 4, 29

  47. [47]

    On fast sampling of diffusion probabilistic models

    Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021. 3

  48. [48]

    Diffwave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021. 1

  49. [49]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018. 7, 20, 22

  50. [50]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019. 5, 26

  51. [51]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. 6, 7, 27

  52. [52]

    Region-wise generative adversarial image inpainting for large missing areas

    Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial image inpainting for large missing areas. ArXiv, abs/1909.12507, 2019. 9

  53. [53]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. 1

  54. [54]

    On the convergence properties of GAN training

    Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018. 3

  55. [55]

    Unrolled generative adversarial networks

    Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 3

  56. [56]

    Conditional Generative Adversarial Nets

    Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. 4

  57. [57]

    Symbolic music generation with diffusion models

    Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021. 1

  58. [58]

    Edgeconnect: Generative image inpainting with adversarial edge learning

    Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019. 9

  59. [59]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021. 6, 7, 16

  60. [60]

    High-fidelity performance metrics for generative models in pytorch

    Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. 26, 27

  61. [61]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4, 7

  62. [62]

    Semantic image synthesis with spatially-adaptive normalization

    Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 22

  63. [63]

    Dual contradistinctive generative autoencoder

    Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021. 6

  64. [64]

    On buggy resizing libraries and surprising subtleties in fid calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021. 26

  65. [65]

    Carbon Emissions and Large Neural Network Training

    David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350,

  66. [66]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. 1, 2, 3, 4, 7, 21, 27

  67. [67]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019. 1, 2, 3, 22

  68. [68]

    Generative adversarial text to image synthesis

    Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016. 4

  69. [69]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014. 1, 4, 29

  70. [70]

    Network-to-network translation with conditional invertible neural networks

    Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020. 3

  71. [71]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015. 2, 3, 4

  72. [72]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021. 1, 4, 8, 16, 22, 23, 27

  73. [73]

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017. 1, 3

  74. [74]

    NVIDIA Developer Blog

    Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020. 28

  75. [75]

    Noise estimation for generative diffusion models

    Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021. 3

  76. [76]

    Projected gans converge faster

    Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021. 6

  77. [77]

    A u-net based discriminator for generative adversarial networks

    Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020. 6

  78. [78]

    Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021. 6, 7

  79. [79]

    Very deep convolutional networks for large-scale image recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015. 29, 43, 44, 45

  80. [80]

    D2C: diffusion-denoising models for few-shot conditional generation

    Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021. 3

Showing first 80 references.