arxiv: 2112.10752 · v2 · submitted 2021-12-20 · 💻 cs.CV


High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer


Pith reviewed 2026-05-11 21:56 UTC · model grok-4.3

classification 💻 cs.CV

keywords latent diffusion models · image synthesis · denoising diffusion · autoencoders · conditional generation · image inpainting · super-resolution · cross-attention


The pith

Diffusion models trained in the latent space of pretrained autoencoders generate high-resolution images with substantially lower computational cost than pixel-space versions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that diffusion models can be moved from raw pixel space into the compressed latent space of a fixed pretrained autoencoder. This shift preserves enough visual structure for high-fidelity synthesis while substantially reducing the cost of training and sampling. Readers care because the same denoising network, once cross-attention layers are added, accepts general conditioning inputs such as text or bounding boxes, which makes it a flexible conditional generator. The result is high-resolution synthesis on limited computational resources and a new state of the art on image inpainting.

Core claim

By applying the diffusion process to the latent representations of a pretrained autoencoder rather than to pixels, and by inserting cross-attention layers to accept arbitrary conditioning inputs, latent diffusion models reach a favorable trade-off between model capacity and perceptual fidelity while requiring far fewer resources than pixel-based diffusion models.

What carries the argument

The latent diffusion model (LDM), which runs the forward and reverse diffusion processes on the lower-dimensional latent codes produced by a fixed variational autoencoder and uses cross-attention to incorporate conditioning signals such as text or spatial layouts.
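
A minimal sketch of this setup, assuming a DDPM-style linear noise schedule and using 8x8 average pooling as a hypothetical stand-in for the pretrained encoder. The real encoder is a learned network; `toy_encoder`, `forward_noise`, and the schedule values are illustrative and not taken from the paper's code.

```python
import numpy as np

def toy_encoder(x, f=8):
    """Hypothetical stand-in for the frozen encoder: f x f average pooling.
    The paper uses a learned autoencoder; this only mimics the shape change."""
    c, h, w = x.shape
    return x.reshape(c, h // f, f, w // f, f).mean(axis=(2, 4))

def forward_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_t) z_0, (1 - a_t) I)."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 2e-2, T)       # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

x = rng.standard_normal((3, 256, 256))   # fake "image"
z0 = toy_encoder(x)                      # latent of shape (3, 32, 32)
zt, eps = forward_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# A denoiser eps_theta(z_t, t) would be trained to minimise this loss;
# here a zero predictor stands in for the network.
loss = np.mean((eps - np.zeros_like(eps)) ** 2)
```

The point of the sketch is only that noising and denoising act on the 32x32 latent rather than on the 256x256 image; the encoder is never updated by the diffusion loss.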

If this is right

Training and inference of powerful diffusion models become feasible on limited hardware while retaining visual quality.
High-resolution synthesis is performed directly in a convolutional manner without patch-wise processing.
Image inpainting reaches state-of-the-art results.
Unconditional generation, semantic scene synthesis, and super-resolution remain competitive with prior pixel-space methods.
Conditioning on text, bounding boxes, or other inputs is handled by a single cross-attention mechanism rather than a task-specific architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The separation of perceptual compression from the generative diffusion stage suggests similar latent-space training could be tested on other modalities once suitable autoencoders exist.
If the autoencoder is kept fixed, future improvements in autoencoder quality would immediately lift the upper bound on LDM fidelity without changing the diffusion architecture.
The approach implies that many existing pixel-based diffusion pipelines could be accelerated by first training a domain-specific autoencoder rather than scaling the diffusion model itself.

Load-bearing premise

The latent codes from the pretrained autoencoder already contain enough perceptual detail and spatial structure that the diffusion model can recover high-fidelity images without uncorrectable artifacts.

What would settle it

High-resolution outputs that consistently exhibit uncorrectable artifacts or visible loss of fine detail relative to pixel-based diffusion models of comparable training effort would show the assumption does not hold.

Original abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LDMs run diffusion in the latents of a pretrained autoencoder and add cross-attention conditioning, delivering competitive quality at far lower compute than pixel-space models.


The core advance is training the diffusion process on the compressed latent codes from a fixed KL-regularized autoencoder rather than raw pixels, then inserting cross-attention layers so the UNet can take arbitrary conditioning signals like text or bounding boxes. That combination lets them train on a single GPU and still produce 256x256 or higher outputs without the hundreds of GPU-days that earlier pixel DMs required. The f=8 latent model matches or beats the pixel baseline on FID while using roughly 1/64 the spatial footprint, and the same architecture sets a new mark on inpainting and stays competitive on semantic synthesis and super-resolution. Public code for both the VAE and the conditioned UNet is a real plus for anyone who wants to reproduce or extend it. Ablations on the downsampling factor and direct quantitative tables across ImageNet, Places2, and ADE20K give the claims a solid empirical footing. The main soft spot is that the autoencoder is pretrained and frozen, so any information lost in the initial compression cannot be recovered later; their results show the diffusion step compensates for most high-frequency detail, but the paper does not quantify how much perceptual structure is permanently discarded. Conditioning works for the tested inputs, yet the experiments do not stress-test against the very long or contradictory prompts that later systems would face. Overall the work is straightforward and reproducible. It is aimed at researchers who need controllable image synthesis without massive hardware budgets. The evidence is strong enough that a serious editor should send it to peer review rather than desk-reject.
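
The "roughly 1/64 the spatial footprint" figure in the letter is simple arithmetic and can be checked directly. The sketch below assumes a 256x256 input and counts only spatial positions. It ignores channel widths and the cost of attention layers, so it is a rough check rather than a full cost model. The helper name `spatial_positions` is illustrative.

```python
def spatial_positions(height, width, f=1):
    """Number of spatial positions after downsampling each side by f."""
    assert height % f == 0 and width % f == 0
    return (height // f) * (width // f)

H = W = 256
pixel = spatial_positions(H, W)        # positions in pixel space
for f in (4, 8, 16):
    latent = spatial_positions(H, W, f)
    print(f"f={f}: {latent} positions, ratio 1/{pixel // latent}")
```

Under these assumptions, f=8 gives 1024 latent positions against 65536 pixels, which is the 1/64 ratio quoted in the letter.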

Referee Report

2 major / 2 minor

Summary. The paper claims that applying diffusion models in the latent space of pretrained autoencoders enables efficient high-resolution image synthesis. Latent diffusion models (LDMs) reduce spatial dimensions via a KL-regularized VAE (with downsampling factors f=4/8/16) while preserving detail, incorporate cross-attention for conditioning on text or bounding boxes, and achieve new state-of-the-art inpainting results along with competitive performance on unconditional generation, semantic synthesis, and super-resolution, all at substantially lower computational cost than pixel-space DMs. Public code is released.

Significance. If the results hold, this has high significance for making diffusion-based synthesis practical at high resolutions with limited resources. Strengths include the public code release, direct ablations on autoencoder factors, and quantitative FID/LPIPS tables on ImageNet, Places2, and ADE20K that support the efficiency and quality claims. The stress-test concern on latent representation fidelity does not land as a load-bearing issue, since the f=8 model empirically recovers high-frequency detail without uncorrectable artifacts and matches or exceeds pixel DM quality.

major comments (2)

[Ablations on autoencoder downsampling factors] The claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.
[Cross-attention layers] While cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.

minor comments (2)

[Abstract] The abstract's reference to 'hundreds of GPU days' for pixel-space DM optimization would be strengthened by citing the specific prior works being compared.
[Methods] Notation for the latent variable z and the diffusion forward/reverse processes in latent space could be clarified with an explicit equation reference or diagram early in the methods section.
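
For reference, the notation requested in the last comment can be written compactly. The following is a sketch in the standard DDPM convention, with encoder $\mathcal{E}$, decoder $\mathcal{D}$, conditioning encoder $\tau_\theta$ applied to a conditioning input $y$, and cumulative noise schedule $\bar{\alpha}_t$. The symbols follow common usage and may differ in detail from the paper's own.

```latex
\begin{align}
z_0 &= \mathcal{E}(x), \qquad \tilde{x} = \mathcal{D}(z_0), \\
q(z_t \mid z_0) &= \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\, I\right), \\
L_{\mathrm{LDM}} &= \mathbb{E}_{z_0,\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y)\right) \right\rVert_2^2 \right].
\end{align}
```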

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, the recommendation of minor revision, and the constructive comments on our efficiency claims and conditioning design. We address each major comment below and have incorporated revisions to improve clarity.


Referee: Ablations on autoencoder downsampling factors: the claim of reaching a 'near-optimal point' between complexity reduction and detail preservation for f=8 rests on FID comparisons, but the exact spatial cost reduction (stated as ~1/64) should be derived explicitly from the UNet channel dimensions and latent resolution to allow verification of the efficiency gain.

Authors: We agree that an explicit derivation would strengthen the presentation. The ~1/64 factor follows directly from reducing the spatial resolution of the UNet input by f=8 in each dimension (latent size H/8 × W/8), which quadratically reduces the number of spatial operations. Accounting for the UNet channel schedule (starting at 320 channels with doubling in down-blocks), the overall computational cost of the diffusion process scales by this factor relative to pixel-space models. In the revised manuscript we will add a short derivation in Section 3.1 (or an appendix table) that computes the reduction from the exact latent resolution and channel dimensions, enabling straightforward verification.
revision: yes

Referee: Cross-attention for conditioning: while cross-attention enables flexible conditioning, the manuscript does not include an ablation against simpler conditioning mechanisms (e.g., concatenation or FiLM), which would isolate whether this architecture choice is necessary for the flexibility and high-resolution claims.

Authors: We appreciate the suggestion. Cross-attention is chosen because it supports conditioning inputs of arbitrary length and structure (e.g., variable-length text token sequences or unordered sets of bounding-box embeddings) without requiring fixed-dimensional inputs, which concatenation or FiLM layers would necessitate. This flexibility is central to the high-resolution text-to-image and layout-to-image results. A full retraining ablation is outside the scope of a minor revision, but we will add a concise discussion paragraph in Section 3.2 explaining the architectural rationale and contrasting it with simpler alternatives, thereby addressing the concern without misrepresenting the design.
revision: partial
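
The property appealed to in this response, that cross-attention accepts a conditioning sequence of any length while the output keeps the shape of the spatial features, can be illustrated with a small NumPy sketch. This is a single-head toy version with random projection matrices; the names `cross_attention`, `Wq`, `Wk`, `Wv` and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cross_attention(x, context, Wq, Wk, Wv):
    """Single-head cross-attention.
    x:       (n, d_x)  flattened spatial features (queries)
    context: (m, d_c)  conditioning tokens (keys/values), any length m
    """
    q = x @ Wq                      # (n, d)
    k = context @ Wk                # (m, d)
    v = context @ Wv                # (m, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, m)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ v              # (n, d), independent of m

rng = np.random.default_rng(0)
d_x, d_c, d = 16, 12, 8
Wq = rng.standard_normal((d_x, d))
Wk = rng.standard_normal((d_c, d))
Wv = rng.standard_normal((d_c, d))
x = rng.standard_normal((32 * 32, d_x))   # a 32x32 latent grid, flattened

short = cross_attention(x, rng.standard_normal((5, d_c)), Wq, Wk, Wv)
long = cross_attention(x, rng.standard_normal((77, d_c)), Wq, Wk, Wv)
```

Both calls return an array of shape (1024, 8): only the key and value projections see the conditioning length, so 5 tokens and 77 tokens are handled by the same weights.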

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper proposes applying diffusion models in the latent space of a separately pretrained autoencoder, with the central claims of state-of-the-art inpainting and competitive performance on generation tasks supported by direct empirical ablations (e.g., downsampling factors f=4/8/16) and quantitative comparisons to pixel-space baselines on ImageNet, Places2, and ADE20K. No load-bearing step reduces a result or prediction to its own inputs by construction, fitted parameters renamed as outputs, or a self-citation chain; the autoencoder training and latent diffusion training are independent stages, and all performance assertions rest on measured metrics rather than theoretical closure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a pretrained autoencoder can compress images into a latent space that retains sufficient detail for diffusion-based generation; no free parameters are introduced in the abstract description, and no new entities are postulated.

axioms (1)

domain assumption: Pretrained autoencoders produce latent representations that preserve perceptual details necessary for high-fidelity image synthesis.
Invoked to justify operating diffusion in latent space rather than pixels.

pith-pipeline@v0.9.0 · 5539 in / 1297 out tokens · 47273 ms · 2026-05-11T21:56:11.949352+00:00 · methodology


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0

Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.
What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching
cs.LG 2026-05 unverdicted novelty 8.0

Data geometry makes time identifiable from noisy interpolants at rate O(1/sqrt(d-k)), rendering the time-blindness gap asymptotically negligible relative to coupling variance.
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters
cs.CV 2026-05 conditional novelty 7.0

AuraMask produces 40 aesthetic anti-facial recognition filters that match or exceed prior adversarial effectiveness and achieve significantly higher user acceptance in a 630-person study.
Deep Dreams Are Made of This: Visualizing Monosemantic Features in Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

LVO applies optimization-based feature visualization to latent diffusion models after disentangling their representations with sparse autoencoders, yielding recognizable concept images on a fine-tuned Stable Diffusion...
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
cs.LG 2026-04 unverdicted novelty 7.0

FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
Beyond Prompts: Unconditional 3D Inversion for Out-of-Distribution Shapes
cs.CV 2026-04 unverdicted novelty 7.0

Text-to-3D models lose prompt sensitivity for out-of-distribution shapes due to sink traps but retain geometric diversity via unconditional priors, enabling a decoupled inversion method for robust editing.
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 7.0

FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
Drifting Fields are not Conservative
cs.LG 2026-04 conditional novelty 7.0

Drift fields in single-pass generative models are not conservative except for Gaussian kernels; a sharp kernel normalization makes them conservative for any radial kernel while noting that non-conservative fields offe...
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
cs.CV 2026-04 unverdicted novelty 7.0

SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
LAION-5B: An open large-scale dataset for training next generation image-text models
cs.CV 2022-10 accept novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
Hierarchical Text-Conditional Image Generation with CLIP Latents
cs.CV 2022-04 accept novelty 7.0

A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
The two clocks and the innovation window: When and how generative models learn rules
cs.LG 2026-05 unverdicted novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
Network-Efficient World Model Token Streaming
cs.RO 2026-05 unverdicted novelty 6.0

An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bit...
Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation
cs.CV 2026-05 unverdicted novelty 6.0

Probability-Flow Distillation exactly matches the Wasserstein gradient flow of the target distribution when distilling 2D diffusion priors into 3D models, yielding higher-fidelity results than SDS or SDI.
AIMIP Phase 1: systematic evaluations of AI weather and climate models
physics.ao-ph 2026-05 unverdicted novelty 6.0

AIMIP Phase 1 shows AI models simulate historical climate and El Niño responses as well as traditional models, though some underestimate trends and diverge in generalization tests, with a public dataset released for f...
GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model
cs.AI 2026-05 unverdicted novelty 6.0

GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.
Velox: Learning Representations of 4D Geometry and Appearance
cs.CV 2026-05 unverdicted novelty 6.0

Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
Compared to What? Baselines and Metrics for Counterfactual Prompting
cs.CL 2026-05 conditional novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
Scale-Aware Adversarial Analysis: A Diagnostic for Generative AI in Multiscale Complex Systems
cs.LG 2026-05 unverdicted novelty 6.0

A new scale-aware diagnostic framework shows that unconstrained diffusion generative models exhibit structural freezing and instability instead of smooth physical responses under multiscale perturbations.
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

L2P trains per-timestep linear weights on feature trajectories in about 20 seconds to enable aggressive caching in DiT models, delivering up to 4.55x FLOPs reduction with maintained visual quality.
Deepfake Detection Generalization with Diffusion Noise
cs.CV 2026-04 unverdicted novelty 6.0

ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
cs.CV 2026-04 unverdicted novelty 6.0

PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
ELT: Elastic Looped Transformers for Visual Generation
cs.CV 2026-04 unverdicted novelty 6.0

Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
cs.AI 2026-04 unverdicted novelty 6.0

VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...
Drifting Fields are not Conservative
cs.LG 2026-04 unverdicted novelty 6.0

Drift fields are not conservative except for Gaussian kernels; sharp normalization makes them conservative for any radial kernel by equating them to score differences of kernel density estimates.
Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
cs.CV 2026-04 unverdicted novelty 6.0

Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
LLM-Generated Fault Scenarios for Evaluating Perception-Driven Lane Following in Autonomous Edge Systems
cs.LG 2026-04 conditional novelty 6.0

A decoupled offline-online framework uses LLMs and latent diffusion models to generate fault scenarios for testing edge-based lane-following models, revealing large robustness drops under conditions like fog.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
cs.CV 2023-07 conditional novelty 6.0

SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
physics.ins-det 2026-05 unverdicted novelty 5.0

CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
cs.CV 2026-05 unverdicted novelty 5.0

Kernel interpolation with a constant multiplier scales convolution and fully-connected layers in neural networks to higher resolutions or dimensions without training, producing competitive results on Stable Diffusion ...
Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods
cs.CV 2026-05 unverdicted novelty 5.0

Kernel interpolation with a constant scaling factor enables Stable Diffusion to produce higher-resolution images without training and extends to general neural networks with small accuracy drops.
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
cs.CV 2026-04 unverdicted novelty 5.0

Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
ClayScape: A GenAI-Supported Workflow for Designing Chinese Style Ceramics with Clay 3D Printing
cs.HC 2026-04 unverdicted novelty 5.0

ClayScape is a hybrid GenAI and clay 3D printing workflow that makes Chinese ceramic design more accessible to creators, as tested with four users who reported expanded creative options alongside agency challenges.
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
cs.CL 2026-04 unverdicted novelty 5.0

Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
Style-Based Neural Architectures for Real-Time Weather Classification
cs.CV 2026-04 unverdicted novelty 5.0

Three style-based neural architectures are proposed for real-time weather classification from images, with two truncated ResNet variants claimed to outperform prior methods and generalize across public datasets.
Open-Sora: Democratizing Efficient Video Production for All
cs.CV 2024-12 unverdicted novelty 5.0

Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
Discrete Meanflow Training Curriculum
cs.LG 2026-04 unverdicted novelty 4.0

A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
Hardware Utilization and Inference Performance of Edge Object Detection Under Fault Injection
cs.DC 2026-03 unverdicted novelty 4.0

TensorRT YOLO pipelines on Jetson Nano keep GPU occupancy, power draw, and temperature stable even under heavy fault-injected inputs for object detection and lane following.
A Real-Calibrated Synthetic-First Data Engine
eess.IV 2026-05 unverdicted novelty 3.0

A data curation pipeline using diffusion-generated synthetic images improves pose estimation when added to real data but underperforms when used without real anchors.
Generative AI for material design: A mechanics perspective from burgers to matter
cs.CE 2026-04 unverdicted novelty 3.0

Diffusion models from generative AI, sharing math with material mechanics, generate new burger recipes from 2,260 examples that some blind tasters prefer over the Big Mac.

Reference graph

Works this paper leans on

109 extracted references · 109 canonical work pages · cited by 44 Pith papers · 15 internal anchors

[1]

NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study

Eirikur Agustsson and Radu Timofte. NTIRE 2017 chal- lenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Com- puter Society, 2017. 1

work page 2017
[2]

Wasserstein gan, 2017

Martin Arjovsky, Soumith Chintala, and L ´eon Bottou. Wasserstein gan, 2017. 3

work page 2017
[3]

Large scale GAN training for high ﬁdelity natural image synthe- sis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high ﬁdelity natural image synthe- sis. In Int. Conf. Learn. Represent. , 2019. 1, 2, 7, 8, 22, 28

work page 2019
[4]

Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recog- nition, CVPR 2018, Salt Lake City, UT, USA, June 18- 22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018. 7, 20, 22

work page 2018
[5]

Extracting training data from large language models

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21) , pages 2633–2650, 2021. 9

work page 2021
[6]

Generative pre- training from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee- woo Jun, David Luan, and Ilya Sutskever. Generative pre- training from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR,

work page
[7]

Weiss, Mo- hammad Norouzi, and William Chan

Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mo- hammad Norouzi, and William Chan. Wavegrad: Estimat- ing gradients for waveform generation. In ICLR. OpenRe- view.net, 2021. 1

work page 2021
[8]

Fast fourier convolu- tion

Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolu- tion. In NeurIPS, 2020. 8

work page 2020
[9]

Very deep vaes generalize autoregressive models and can outperform them on images

Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020. 3

work page arXiv 2011
[10]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019. 3

work page internal anchor Pith review Pith/arXiv arXiv 1904
[11]

Bin Dai and David P. Wipf. Diagnosing and enhancing V AE models. In ICLR (Poster). OpenReview.net, 2019. 2, 3

work page 2019
[12]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009. 1, 5, 7, 22

work page 2009
[13]

Ethical considerations of generative ai

Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021. 9

work page 2021
[14]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018. 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021. 1, 2, 3, 4, 6, 7, 8, 18, 22, 25, 26, 28

work page internal anchor Pith review arXiv 2021
[16]

Musings on typicality, 2020

Sander Dieleman. Musings on typicality, 2020. 1, 3

work page 2020
[17]

Cogview: Mastering text-to-image generation via transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. CoRR, abs/2105.13290,

work page arXiv
[18]

Nice: Non-linear independent components estimation, 2015

Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015. 3

work page 2015
[19]

Density estimation using real NVP

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 1, 3

work page 2017
[20]

Generating images with perceptual similarity metrics based on deep networks

Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016. 3

work page 2016
[21]

Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis

Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021. 6, 7, 22

work page arXiv 2021
[22]

A note on data biases in generative models

Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020. 9

work page arXiv 2012
[23]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. CoRR, abs/2012.09841, 2020. 2, 3, 4, 6, 7, 21, 22, 29, 34, 36

work page arXiv 2012
[24]

Sex, lies, and videotape: Deep fakes and free speech delusions

Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018. 9

work page 2018
[25]

Clipdraw: Exploring text-to-drawing synthesis through language-image encoders

Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. ArXiv, abs/2106.14843, 2021. 3

work page arXiv 2021
[26]

Make-a-scene: Scene-based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022. 6, 7, 16

work page arXiv 2022
[27]

Generative adversarial networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014. 1, 2

work page 2014
[28]

Improved training of wasserstein gans, 2017

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017. 3

work page 2017
[29]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017. 1, 5, 26

work page 2017
[30]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020. 1, 2, 3, 4, 6, 17

work page 2020
[31]

Cascaded diffusion models for high fidelity image generation

Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021. 1, 3, 22

work page arXiv 2021
[32]

Classifier-free diffusion guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6, 7, 16, 22, 28, 37, 38

work page 2021
[33]

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017. 3, 4

work page 2017
[34]

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976,

work page 2017
[35]

Perceiver IO: A general architecture for structured inputs & outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs & outputs. CoRR, abs/2107.14795, 2021. 4

work page arXiv 2021
[36]

Perceiver: General perception with iterative attention

Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Researc...

work page 2021
[37]

High-resolution complex scene synthesis with transformers

Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021. 20, 22, 27

work page arXiv 2021
[38]

Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses

Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020. 9

work page arXiv 2001
[39]

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017. 5, 6

work page internal anchor Pith review arXiv 2017
[40]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019. 1

work page 2019
[41]

A style-based generator architecture for generative adversarial networks

T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 5, 6

work page 2019
[42]

Analyzing and improving the image quality of stylegan

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958,

work page arXiv 1912
[43]

Soft truncation: A universal training technique of score-based diffusion model for high precision score estimation

Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021. 6

work page arXiv 2021
[44]

Glow: Generative flow with invertible 1x1 convolutions

Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018. 3

work page 2018
[45]

Variational diffusion models

Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021. 1, 3, 16

work page arXiv 2021
[46]

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014. 1, 3, 4, 29

work page 2014
[47]

On fast sampling of diffusion probabilistic models

Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021. 3

work page arXiv 2021
[48]

Diffwave: A versatile diffusion model for audio synthesis

Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021. 1

work page 2021
[49]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018. 7, 20, 22

work page arXiv 2018
[50]

Improved precision and recall metric for assessing generative models

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019. 5, 26

work page arXiv 1904
[51]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014. 6, 7, 27

work page internal anchor Pith review arXiv 2014
[52]

Region-wise generative adversarial image inpainting for large missing areas

Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial image inpainting for large missing areas. ArXiv, abs/1909.12507, 2019. 9

work page arXiv 1909
[53]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021. 1

work page internal anchor Pith review arXiv 2021
[54]

On the convergence properties of GAN training

Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018. 3

work page arXiv 2018
[55]

Unrolled generative adversarial networks

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. 3

work page 2017
[56]

Conditional Generative Adversarial Nets

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014. 4

work page internal anchor Pith review Pith/arXiv arXiv 2014
[57]

Symbolic music generation with diffusion models

Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021. 1

work page arXiv 2021
[58]

Edgeconnect: Generative image inpainting with adversarial edge learning

Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019. 9

work page arXiv 1901
[59]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021. 6, 7, 16

work page internal anchor Pith review Pith/arXiv arXiv 2021
[60]

High-fidelity performance metrics for generative models in pytorch

Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738. 26, 27

work page doi:10.5281/zen- 2020
[61]

Semantic image synthesis with spatially-adaptive normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 4, 7

work page 2019
[62]

Semantic image synthesis with spatially-adaptive normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 22

work page 2019
[63]

Dual contradistinctive generative autoencoder

Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021. 6

work page 2021
[64]

On buggy resizing libraries and surprising subtleties in fid calculation

Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021. 26

work page arXiv 2021
[65]

Carbon Emissions and Large Neural Network Training

David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350,

work page internal anchor Pith review arXiv
[66]

Zero-Shot Text-to-Image Generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021. 1, 2, 3, 4, 7, 21, 27

work page internal anchor Pith review arXiv 2021
[67]

Generating diverse high-fidelity images with VQ-VAE-2

Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019. 1, 2, 3, 22

work page 2019
[68]

Generative adversarial text to image synthesis

Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016. 4

work page 2016
[69]

Stochastic backpropagation and approximate inference in deep generative models

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014. 1, 4, 29

work page 2014
[70]

Network-to-network translation with conditional invertible neural networks

Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020. 3

work page 2020
[71]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015. 2, 3, 4

work page 2015
[72]

Image super-resolution via iterative refinement

Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021. 1, 4, 8, 16, 22, 23, 27

work page arXiv 2021
[73]

Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017. 1, 3

work page arXiv 2017
[74]

NVIDIA Developer Blog

Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020. 28

work page 2020
[75]

Noise estimation for generative diffusion models

Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021. 3

work page arXiv 2021
[76]

Projected gans converge faster

Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021. 6

work page arXiv 2021
[77]

A u-net based discriminator for generative adversarial networks

Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020. 6

work page 2020
[78]

Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021

Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021. 6, 7

work page 2021
[79]

Very deep convolutional networks for large-scale image recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015. 29, 43, 44, 45

work page 2015
[80]

D2C: diffusion-denoising models for few-shot conditional generation

Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021. 3

work page arXiv 2021

Showing first 80 references.