Recognition: no theorem link
High-Resolution Image Synthesis with Latent Diffusion Models
Pith reviewed 2026-05-11 21:56 UTC · model grok-4.3
The pith
Diffusion models trained in the latent space of pretrained autoencoders generate high-resolution images at substantially lower computational cost than their pixel-space counterparts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying the diffusion process to the latent representations of a pretrained autoencoder rather than to pixels, and by inserting cross-attention layers to accept arbitrary conditioning inputs, latent diffusion models reach a favorable trade-off between model capacity and perceptual fidelity while requiring far fewer resources than pixel-based diffusion models.
What carries the argument
The latent diffusion model (LDM), which runs the forward and reverse diffusion processes on the lower-dimensional latent codes produced by a fixed variational autoencoder and uses cross-attention to incorporate conditioning signals such as text or spatial layouts.
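The two-stage design can be sketched in a few lines: a frozen encoder maps pixels into a lower-dimensional latent, and the standard DDPM forward (noising) process runs on that latent rather than on pixels. This is an illustrative toy, not the paper's implementation; the `encode` function here (simple average-pooling) merely stands in for the pretrained VAE encoder.

```python
import math
import random

def encode(image, f=8):
    """Stand-in for a frozen pretrained autoencoder: average-pool an
    HxW grayscale image by a factor f in each spatial dimension."""
    H, W = len(image), len(image[0])
    return [[sum(image[i * f + di][j * f + dj]
                 for di in range(f) for dj in range(f)) / (f * f)
             for j in range(W // f)]
            for i in range(H // f)]

def forward_diffuse(z, alpha_bar_t):
    """DDPM forward process applied to the latent z rather than pixels:
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    return [[math.sqrt(alpha_bar_t) * v
             + math.sqrt(1 - alpha_bar_t) * random.gauss(0, 1)
             for v in row]
            for row in z]

image = [[0.5] * 64 for _ in range(64)]   # toy 64x64 "image"
z0 = encode(image)                        # 8x8 latent: 64x fewer positions
zt = forward_diffuse(z0, alpha_bar_t=0.9)
print(len(z0), len(z0[0]))                # 8 8
```

The denoising network then operates entirely on tensors of this reduced size, which is where the efficiency gain comes from.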
If this is right
- Training and inference of powerful diffusion models become feasible on limited hardware while retaining visual quality.
- High-resolution synthesis is performed directly in a convolutional manner without patch-wise processing.
- Image inpainting reaches state-of-the-art results.
- Unconditional generation, semantic scene synthesis, and super-resolution remain competitive with prior pixel-space methods.
- Conditioning on text, bounding boxes, or other inputs is enabled without retraining the core model.
Where Pith is reading between the lines
- The separation of perceptual compression from the generative diffusion stage suggests similar latent-space training could be tested on other modalities once suitable autoencoders exist.
- If the autoencoder is kept fixed, future improvements in autoencoder quality would immediately lift the upper bound on LDM fidelity without changing the diffusion architecture.
- The approach implies that many existing pixel-based diffusion pipelines could be accelerated by first training a domain-specific autoencoder rather than scaling the diffusion model itself.
Load-bearing premise
The latent codes from the pretrained autoencoder already contain enough perceptual detail and spatial structure that the diffusion model can recover high-fidelity images without uncorrectable artifacts.
What would settle it
High-resolution outputs that consistently exhibit uncorrectable artifacts or visible loss of fine detail relative to pixel-based diffusion models of comparable training effort would show the assumption does not hold.
read the original abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that applying diffusion models in the latent space of pretrained autoencoders enables efficient high-resolution image synthesis. Latent diffusion models (LDMs) reduce spatial dimensions via a KL-regularized VAE (with downsampling factors f=4/8/16) while preserving detail, incorporate cross-attention for conditioning on text or bounding boxes, and achieve new state-of-the-art inpainting results along with competitive performance on unconditional generation, semantic synthesis, and super-resolution, all at substantially lower computational cost than pixel-space DMs. Public code is released.
Significance. If the results hold, this has high significance for making diffusion-based synthesis practical at high resolutions on limited resources. Strengths include the public code release, direct ablations on autoencoder factors, and quantitative FID/LPIPS tables on ImageNet, Places2, and ADE20K that support the efficiency and quality claims. The stress-test concern about latent-representation fidelity does not hold up as a load-bearing objection: the f=8 model empirically recovers high-frequency detail without uncorrectable artifacts and matches or exceeds pixel-space DM quality.
major comments (2)
- [Ablations on autoencoder downsampling factors] The claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.
- [Cross-attention layers] Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.
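The ~1/64 figure in the first comment follows from simple arithmetic, sketched here under the assumption that cost is dominated by the number of spatial positions the UNet processes (channel-width effects are ignored):

```python
def spatial_reduction(height, width, f):
    """Ratio of pixel-space to latent-space spatial positions
    for a downsampling factor f per spatial dimension."""
    pixel_positions = height * width
    latent_positions = (height // f) * (width // f)
    return pixel_positions / latent_positions

# f=8 shrinks each spatial dimension 8x, so positions drop 8*8 = 64x.
print(spatial_reduction(512, 512, f=8))  # 64.0
```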
minor comments (2)
- [Abstract] The abstract's reference to 'hundreds of GPU days' for pixel-space DM optimization would be strengthened by citing the specific prior works being compared.
- [Methods] Notation for the latent variable z and the diffusion forward/reverse processes in latent space could be clarified with an explicit equation reference or diagram early in the methods section.
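The notation the second comment asks for can be stated compactly. The following is a sketch in standard DDPM symbols (with $\mathcal{E}$ the frozen encoder and $\beta_t$ the noise schedule), not a verbatim quotation of the manuscript's equations:

```latex
% Forward (noising) process in latent space, z = \mathcal{E}(x):
q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t \mathbf{I}\right)

% Training objective of the latent diffusion model:
L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert_2^2 \right]
```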
Simulated Author's Rebuttal
We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on our efficiency claims and conditioning design. We address each major comment below and have incorporated revisions to improve clarity.
read point-by-point responses
-
Referee: Ablations on autoencoder downsampling factors: the claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.
Authors: We agree that an explicit derivation would strengthen the presentation. The ~1/64 factor follows directly from reducing each spatial dimension of the UNet input by f=8 (latent size H/8 × W/8), which cuts the number of spatial positions by f² = 64. Accounting for the UNet channel schedule (starting at 320 channels and doubling in the down-blocks), the overall computational cost of the diffusion process scales down by roughly this factor relative to pixel-space models. In the revised manuscript we will add a short derivation in Section 3.1 (or an appendix table) that computes the reduction from the exact latent resolution and channel dimensions, enabling straightforward verification. revision: yes
-
Referee: Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.
Authors: We appreciate the suggestion. Cross-attention is chosen because it supports conditioning inputs of arbitrary length and structure (e.g., variable-length text token sequences or unordered sets of bounding-box embeddings) without requiring fixed-dimensional inputs, which concatenation or FiLM layers would necessitate. This flexibility is central to the high-resolution text-to-image and layout-to-image results. A full retraining ablation is outside the scope of a minor revision, but we will add a concise discussion paragraph in Section 3.2 explaining the architectural rationale and contrasting it with simpler alternatives, thereby addressing the concern without misrepresenting the design. revision: partial
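The rationale in the second response can be made concrete: cross-attention maps a conditioning sequence of any length to a fixed-shape update of the latent features, so the UNet never needs a fixed-dimensional conditioning input. A minimal numpy sketch, with illustrative shapes and random projection weights rather than the paper's architecture:

```python
import numpy as np

def cross_attention(latent_tokens, cond_tokens, d):
    """latent_tokens: (N, d) queries from the UNet feature map;
    cond_tokens: (M, d_c) keys/values from a conditioning encoder.
    Output shape is (N, d) regardless of the conditioning length M."""
    rng = np.random.default_rng(0)
    d_c = cond_tokens.shape[1]
    Wq = rng.standard_normal((d, d))    # query projection
    Wk = rng.standard_normal((d_c, d))  # key projection
    Wv = rng.standard_normal((d_c, d))  # value projection
    Q, K, V = latent_tokens @ Wq, cond_tokens @ Wk, cond_tokens @ Wv
    scores = Q @ K.T / np.sqrt(d)                        # (N, M)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
    return attn @ V                                      # (N, d)

latent = np.ones((16, 4))   # 16 latent positions, width 4
short = np.ones((3, 8))     # 3 conditioning tokens (e.g. boxes)
long = np.ones((77, 8))     # 77 tokens (e.g. a text sequence)
assert cross_attention(latent, short, 4).shape == (16, 4)
assert cross_attention(latent, long, 4).shape == (16, 4)
```

Concatenation or FiLM would require padding or pooling these inputs to a fixed size, which is the flexibility argument the authors make.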
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper proposes applying diffusion models in the latent space of a separately pretrained autoencoder, with the central claims of state-of-the-art inpainting and competitive performance on generation tasks supported by direct empirical ablations (e.g., downsampling factors f=4/8/16) and quantitative comparisons to pixel-space baselines on ImageNet, Places2, and ADE20K. No load-bearing step reduces a result or prediction to its own inputs by construction, renames fitted parameters as outputs, or rests on a self-citation chain; the autoencoder training and latent diffusion training are independent stages, and all performance assertions rest on measured metrics rather than theoretical closure.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pretrained autoencoders produce latent representations that preserve the perceptual details necessary for high-fidelity image synthesis.
Forward citations
Cited by 47 Pith papers
-
Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.
-
What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching
Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.
-
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
-
AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters
AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
-
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models
LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion...
-
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
-
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
-
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Drifting Fields are not Conservative
Drift fields in single-pass generative models are not conservative except for Gaussian kernels; a sharp kernel normalization makes them conservative for any radial kernel while noting that non-conservative fields offe...
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
Hierarchical Text-Conditional Image Generation with CLIP Latents
A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Network-Efficient World Model Token Streaming
An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...
-
Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation
Probability-Flow Distillation exactly matches the Wasserstein gradient flow of the target distribution when distilling 2D diffusion priors into 3D models, yielding higher-fidelity results than SDS or SDI.
-
AIMIP Phase 1: systematic evaluations of AI weather and climate models
AIMIP Phase 1 shows AI models simulate historical climate and El Niño responses as well as traditional models, though some underestimate trends and diverge in generalization tests, with a public dataset released for f...
-
GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model
GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
Compared to What? Baselines and Metrics for Counterfactual Prompting
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
-
Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems
A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
-
Drifting Fields are not Conservative
Drift fields are not conservative except for Gaussian kernels; sharp normalization makes them conservative for any radial kernel by equating them to score differences of kernel density estimates.
-
Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
-
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Kernel interpolation with a constant multiplier scales convolution and fully-connected layers in neural networks to higher resolutions or dimensions without training, producing competitive results on Stable Diffusion ...
-
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
Kernel interpolation with a constant scaling factor enables Stable Diffusion to produce higher-resolution images without training and extends to general neural networks with small accuracy drops.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
ClayScape: A GenAI-Supported Workflow for Designing Chinese Style Ceramics with Clay 3D Printing
ClayScape is a hybrid GenAI and clay 3D printing workflow that makes Chinese ceramic design more accessible to creators, as tested with four users who reported expanded creative options alongside agency challenges.
-
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
-
Style-Based Neural Architectures for Real-Time Weather Classification
Three style-based neural architectures are proposed for real-time weather classification from images, with two truncated ResNet variants claimed to outperform prior methods and generalize across public datasets.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
Discrete Meanflow Training Curriculum
A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
-
Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
TensorRT YOLO pipelines on Jetson Nano keep GPU occupancy, power draw, and temperature stable even under heavy fault-injected inputs for object detection and lane following.
-
A Real-Calibrated Synthetic-First Data Engine
A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.
-
Generative AI for material design: A mechanics perspective from burgers to matter
Diffusion models from generative AI, sharing math with material mechanics, generate new burger recipes from 2,260 examples that some blind tasters prefer over the Big Mac.
Reference graph
Works this paper leans on
-
[1]
NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study
Eirikur Agustsson and Radu Timofte. NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Com- puter Society, 2017. 1
work page 2017
-
[2]
Martin Arjovsky, Soumith Chintala, and L ´eon Bottou. Wasserstein gan, 2017. 3
work page 2017
-
[3]
Large scale GAN training for high fidelity natural image synthe- sis
Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthe- sis. In Int. Conf. Learn. Represent. , 2019. 1, 2, 7, 8, 22, 28
work page 2019
-
[4]
Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recog- nition, CVPR 2018, Salt Lake City, UT, USA, June 18- 22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018. 7, 20, 22
work page 2018
-
[5]
Extracting training data from large language models
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) , pages 2633–2650, 2021. 9
work page 2021
-
[6]
Generative pre- training from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR,
-
[7]
Weiss, Mo- hammad Norouzi, and William Chan
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mo- hammad Norouzi, and William Chan. Wavegrad: Estimat- ing gradients for waveform generation. In ICLR. OpenRe- view.net, 2021. 1
work page 2021
-
[8]
Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolu- tion. In NeurIPS, 2020. 8
work page 2020
-
[9]
Very deep vaes generalize autoregressive models and can outperform them on images
Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020. 3
-
[10]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. 3
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[11]
Bin Dai and David P. Wipf. Diagnosing and enhancing V AE models. In ICLR (Poster). OpenReview.net, 2019. 2, 3
work page 2019
-
[12]
Imagenet: A large-scale hierarchical im- age database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical im- age database. In CVPR, pages 248–255. IEEE Computer Society, 2009. 1, 5, 7, 22
work page 2009
-
[13]
Ethical considerations of generative ai
Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021. 9
work page 2021
-
[14]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirec- tional transformers for language understanding. CoRR, abs/1810.04805, 2018. 7
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[15]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021. 1, 2, 3, 4, 6, 7, 8, 18, 22, 25, 26, 28
work page internal anchor Pith review arXiv 2021
- [16]
-
[17]
Cogview: Mastering text-to- image generation via transformers
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to- image generation via transformers. CoRR, abs/2105.13290,
-
[18]
Nice: Non-linear independent components estimation, 2015
Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015. 3
work page 2015
-
[19]
Density estimation using real NVP
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Ben- gio. Density estimation using real NVP. In 5th Inter- national Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 1, 3
work page 2017
-
[20]
Generating images with perceptual similarity metrics based on deep networks
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016. 3
work page 2016
-
[21]
Patrick Esser, Robin Rombach, Andreas Blattmann, and Bj¨orn Ommer. Imagebart: Bidirectional context with multi- nomial diffusion for autoregressive image synthesis.CoRR, abs/2108.08827, 2021. 6, 7, 22
-
[22]
A note on data biases in generative models
Patrick Esser, Robin Rombach, and Bj ¨orn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020. 9
- [23]
-
[24]
Sex, lies, and videotape: Deep fakes and free speech delusions
Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018. 9
work page 2018
-
[25]
Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language- image encoders. ArXiv, abs/2106.14843, 2021. 3
-
[26]
Make-a-scene: Scene-based text-to-image generation with human priors
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene- based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022. 6, 7, 16
-
[27]
Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014. 1, 2
work page 2014
-
[28]
Improved training of wasserstein gans, 2017
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017. 3
work page 2017
-
[29]
Gans trained by a two time-scale update rule converge to a local nash equi- librium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equi- librium. In Adv. Neural Inform. Process. Syst., pages 6626– 6637, 2017. 1, 5, 26
work page 2017
-
[30]
Denoising dif- fusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 4, 6, 17
work page 2020
-
[31]
Cascaded diffusion models for high fidelity image generation
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.CoRR, abs/2106.15282, 2021. 1, 3, 22 10
-
[32]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6, 7, 16, 22, 28, 37, 38
work page 2021
-
[33]
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adver- sarial networks. In CVPR, pages 5967–5976. IEEE Com- puter Society, 2017. 3, 4
work page 2017
-
[34]
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adver- sarial networks. 2017 IEEE Conference on Computer Vi- sion and Pattern Recognition (CVPR) , pages 5967–5976,
work page 2017
-
[35]
Bowen Jing, Bonnie Berger, and Tommi Jaakkola
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. H ´enaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and Jo ˜ao Carreira. Perceiver IO: A general architecture for structured inputs &outputs. CoRR, abs/2107.14795, 2021. 4
-
[36] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, 2021.
[37] Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021. 20, 22, 27
[38] Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect ImaGANation: Implications of GANs exacerbating biases on facial data augmentation and Snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020. 9
[39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. 5, 6
[40] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019. 1
[41]
[42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. CoRR, abs/1912.04958, 2019.
[43] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021. 6
[44] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018. 3
[45] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021. 1, 3, 16
[46] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014. 1, 3, 4, 29
[47] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021. 3
[48] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021. 1
[49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018. 7, 20, 22
[50] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019. 5, 26
[51] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. CoRR, abs/1405.0312, 2014. 6, 7, 27
[52] Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial image inpainting for large missing areas. ArXiv, abs/1909.12507, 2019. 9
[53] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. 1
[54]
[55] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 3
[56] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. 4
[57] Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021. 1
[58] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019. 9
[59] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021. 6, 7, 16
[60] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in PyTorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. 26, 27
[61] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4, 7
[62] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 22
[63] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021. 6
[64] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in FID calculation. arXiv preprint arXiv:2104.11222, 2021. 26
[65] David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350, 2021.
[66] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. 1, 2, 3, 4, 7, 21, 27
[67] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019. 1, 2, 3, 22
[68] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016. 4
[69] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014. 1, 4, 29
[70] Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020. 3
[71] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015. 2, 3, 4
[72] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021. 1, 4, 8, 16, 22, 23, 27
[73]
[74] Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020. 28
[75] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021. 3
[76] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. CoRR, abs/2111.01007, 2021. 6
[77] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A U-Net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020. 6
[78] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs, 2021. 6, 7
[79] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015. 29, 43, 44, 45
[80] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: Diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021. 3