arxiv: 2112.10741 · v3 · submitted 2021-12-20 · 💻 cs.CV · cs.GR · cs.LG

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Pith reviewed 2026-05-11 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.LG
keywords text-to-image synthesis · diffusion models · classifier-free guidance · image generation · image editing · photorealistic images · inpainting

The pith

Text-conditional diffusion models using classifier-free guidance generate images that human evaluators prefer over DALL-E's for photorealism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that diffusion models conditioned on text descriptions can synthesize high-quality images. It finds that classifier-free guidance is preferred by human evaluators over CLIP guidance for both photorealism and how well the image matches the caption. A 3.5 billion parameter model trained with this approach produces samples that people rate higher than those from DALL-E, even when DALL-E applies CLIP reranking. The models can also be fine-tuned to support text-guided inpainting for image editing tasks.

Core claim

A 3.5 billion parameter text-conditional diffusion model using classifier-free guidance generates samples that human evaluators favor over DALL-E outputs for both photorealism and caption similarity. The model can be fine-tuned to perform text-driven image inpainting.

What carries the argument

Classifier-free guidance applied to text-conditional diffusion models, which steers the denoising process using the text condition directly to improve fidelity without an external classifier.
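
As a rough illustration of the mechanism described above, classifier-free guidance is commonly written as extrapolating from the model's unconditional noise prediction toward its text-conditional one: eps_hat = eps_uncond + s * (eps_cond - eps_uncond). The sketch below is not the paper's released code; the function name `guided_eps`, the toy vectors, and the scale value are assumptions chosen for illustration, and plain Python lists stand in for image-shaped tensors.

```python
from typing import List, Sequence


def guided_eps(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Combine conditional and unconditional noise predictions.

    Computes eps_uncond + scale * (eps_cond - eps_uncond) element-wise.
    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 pushes further along the text direction.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical toy predictions for a two-element "image".
eps_text = [1.0, 2.0]   # prediction given the caption
eps_empty = [0.0, 1.0]  # prediction given an empty caption
print(guided_eps(eps_text, eps_empty, scale=3.0))
```

Because both predictions come from the same network, with the caption replaced by an empty one for the unconditional pass, no separate classifier or CLIP model is needed at sampling time.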

If this is right

  • Human evaluators consistently prefer classifier-free guidance to CLIP guidance in text-to-image generation tasks.
  • Large-scale diffusion models can achieve superior results to prior text-to-image systems like DALL-E in blind human comparisons.
  • Fine-tuning allows diffusion models to support practical editing capabilities such as text-based inpainting.
  • The open-sourced smaller model enables further research and applications in text-guided image synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Further increases in model scale and data quality could push photorealism even closer to real photographs.
  • Similar guidance techniques might improve conditional generation in related domains like video or 3D synthesis.
  • The success of classifier-free guidance suggests it could simplify training pipelines by removing the need for separate guidance models.

Load-bearing premise

The judgments of the human evaluators accurately capture photorealism and text similarity in a way that generalizes beyond the specific test conditions.

What would settle it

A follow-up study in which DALL-E comes out ahead instead, whether through a new group of raters preferring its samples or through an objective metric such as FID on a held-out test set.
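
One way such a follow-up rater study could be scored is sketched below: a win rate over pairwise votes plus a two-sided exact sign test against a 50/50 null. This is an assumption about a reasonable analysis, not the protocol used in the paper; the function names and the vote counts are hypothetical, and ties are assumed to be excluded before counting.

```python
from math import comb


def win_rate(wins_a: int, wins_b: int) -> float:
    """Fraction of decided pairwise comparisons won by system A."""
    total = wins_a + wins_b
    if total == 0:
        raise ValueError("need at least one decided comparison")
    return wins_a / total


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact binomial test of the null that A and B are tied."""
    n = wins_a + wins_b
    if n == 0:
        raise ValueError("need at least one decided comparison")
    k = max(wins_a, wins_b)
    # Probability of a result at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical vote counts, for illustration only.
print(win_rate(7, 3))
print(sign_test_p_value(10, 0))
```

A small p-value in favor of DALL-E from an independent rater pool would count against the paper's claim; a large one would leave the question open.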

Original abstract

Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces GLIDE, a text-conditional diffusion model for photorealistic image generation and editing. It compares two guidance strategies (CLIP guidance vs. classifier-free guidance) and reports that classifier-free guidance is preferred by human evaluators on both photorealism and caption similarity. A 3.5B-parameter model using classifier-free guidance produces samples that human raters favor over DALL-E outputs even when DALL-E employs CLIP reranking. The work further shows that the model can be fine-tuned for text-driven inpainting and releases code plus weights for a smaller filtered-data variant.

Significance. If the human-study results hold, the paper establishes classifier-free guidance as a strong, parameter-efficient alternative to CLIP-based guidance for text-to-image diffusion, with credible evidence of outperformance versus DALL-E on the same prompt set. The open release of the smaller model and code directly supports reproducibility. The inpainting fine-tuning result demonstrates a practical editing capability that extends the core generation contribution.

minor comments (3)
  1. Abstract: the central human-preference claim is stated without any mention of protocol details (rater count, question wording, or statistical controls). Although these details appear in the main text, a single sentence in the abstract would make the claim self-contained.
  2. Human-evaluation section: the manuscript reports judgments on the same 1000 prompts used by DALL-E and one sample per prompt per model, but does not explicitly state whether prompt order or model identity was blinded to raters; adding this sentence would strengthen the protocol description.
  3. Figure captions and qualitative results: several comparison figures would benefit from explicit indication of which guidance method and sampling steps were used for each panel, to allow readers to map visuals directly to the quantitative claims.
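
The blinding asked for in comment 2 could be implemented along the following lines. This is a hedged sketch of one possible setup, not a description of what the authors did: the function `build_blinded_trials`, the model labels, and the example prompts are all hypothetical. Each trial randomizes which model appears on the left, and the trial order is shuffled, so raters would see only the prompt and two unlabeled images.

```python
import random
from typing import Dict, List, Sequence


def build_blinded_trials(
    prompts: Sequence[str],
    model_a: str = "model_A",
    model_b: str = "model_B",
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Return trials with randomized left/right placement and trial order.

    The "left" and "right" fields are the experimenter's private key;
    raters should be shown only the prompt and the two images.
    """
    rng = random.Random(seed)  # fixed seed keeps the assignment auditable
    trials = []
    for prompt in prompts:
        pair = [model_a, model_b]
        rng.shuffle(pair)  # randomize which model is shown on the left
        trials.append({"prompt": prompt, "left": pair[0], "right": pair[1]})
    rng.shuffle(trials)  # randomize the order in which prompts appear
    return trials


# Hypothetical prompts, for illustration only.
for trial in build_blinded_trials(["a red cube", "a corgi in a field"], seed=1):
    print(trial)
```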

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical ML study describing the training and evaluation of a text-conditional diffusion model (GLIDE) with comparisons to DALL-E via human raters on photorealism and caption similarity. No derivation chain, first-principles predictions, or fitted parameters renamed as outputs exist; the central claims rest on experimental results using standard diffusion model formulations and external baselines, with no self-referential reductions or load-bearing self-citations that collapse the argument to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning paper whose claims rest on training runs and human evaluations rather than new theoretical axioms or invented physical entities.

pith-pipeline@v0.9.0 · 5484 in / 1081 out tokens · 26353 ms · 2026-05-11T05:52:00.061358+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  4. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  5. VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.

  6. Probability-Conserving Flow Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-depende...

  7. SURGE: Approximation-free Training Free Particle Filter for Diffusion Surrogate

    stat.ML 2026-05 unverdicted novelty 7.0

    URGE performs unbiased path-wise importance reweighting via Girsanov estimation for derivative-free inference-time scaling in diffusion models, proving equivalence to particle-wise SMC and outperforming baselines empirically.

  8. Generating HDR Video from SDR Video

    cs.CV 2026-05 unverdicted novelty 7.0

    A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.

  9. Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 7.0

    Text embeddings in MM-DiTs contain a detectable omission signal for missing concepts, and amplifying it via OSI reduces concept omission in generated images on FLUX.1-Dev and SD3.5-Medium.

  10. HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    HIR-ALIGN augments limited target data for hyperspectral restoration by creating proxy clean images, synthesizing aligned HSIs with blur-robust diffusion and warp-based transfer, then finetuning models to lower target...

  11. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  12. From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

  13. Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...

  14. Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

    cs.CV 2026-04 conditional novelty 7.0

    Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

  15. GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

  16. FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

  17. SVG360: Editable Multiview Vector Graphics from a Single SVG

    cs.CV 2025-11 unverdicted novelty 7.0

    SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.

  18. BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion

    cs.RO 2025-08 conditional novelty 7.0

    BeyondMimic combines compact motion tracking with a unified guided latent diffusion model to master diverse agile behaviors from human demos and solve unseen downstream tasks via test-time classifier guidance.

  19. FaSTA$^*$: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

    cs.CV 2025-06 unverdicted novelty 7.0

    FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.

  20. COCO-Inpaint: A Benchmark for Detecting and Localizing Inpainting-Based Image Manipulations

    cs.CV 2025-04 unverdicted novelty 7.0

    COCO-Inpaint supplies a large-scale dataset and evaluation protocol focused on inpainting-based image forgeries to benchmark existing detection methods.

  21. WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.

  22. Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

    cs.CV 2024-10 unverdicted novelty 7.0

    Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

  23. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  24. ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

    cs.CV 2024-04 unverdicted novelty 7.0

    ANCHOR dataset exposes T2I model weaknesses on multi-subject abstractive captions; SAFE uses LLMs for subject extraction and embedding enhancement to improve consistency.

  25. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  26. Learning Interactive Real-World Simulators

    cs.AI 2023-10 conditional novelty 7.0

    UniSim learns a universal real-world simulator from orchestrated diverse datasets, enabling zero-shot deployment of policies trained purely in simulation.

  27. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  28. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  29. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  30. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  31. Phenaki: Variable Length Video Generation From Open Domain Textual Description

    cs.CV 2022-10 unverdicted novelty 7.0

    Phenaki generates arbitrary-length videos from sequences of text prompts by tokenizing videos with causal temporal attention and generating tokens with a text-conditioned masked transformer, trained jointly on images ...

  32. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  33. Human Motion Diffusion Model

    cs.CV 2022-09 unverdicted novelty 7.0

    MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.

  34. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  35. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  36. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  37. Video Diffusion Models

    cs.CV 2022-04 unverdicted novelty 7.0

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...

  38. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  39. Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

    cs.CV 2026-05 unverdicted novelty 6.0

    ABSS ranks diffusion seeds by early cross-attention strength to prompt core tokens and retains only the top-k for full generation, yielding consistent gains in alignment and quality on Stable Diffusion variants.

  40. Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures

    stat.ML 2026-05 unverdicted novelty 6.0

    URGE performs unbiased inference-time scaling for diffusion models by attaching multiplicative path weights from Girsanov estimation and resampling trajectories, with a proven equivalence to prior particle-wise SMC schemes.

  41. Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?

    cs.CV 2026-05 unverdicted novelty 6.0

    AdaScope adaptively selects optimal RL intervention points during diffusion denoising by monitoring structural and semantic changes, delivering 66% higher performance at 59% lower cost than full-trajectory RL baselines.

  42. Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

    cs.CV 2026-05 conditional novelty 6.0

    SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.

  43. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...

  44. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.

  45. Intermediate Representations are Strong AI-Generated Image Detectors

    cs.CV 2026-05 unverdicted novelty 6.0

    Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.

  46. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  47. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  48. Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.

  49. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  50. MuPPet: Multi-person 2D-to-3D Pose Lifting

    cs.CV 2026-04 unverdicted novelty 6.0

    MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...

  51. Controllable Image Generation with Composed Parallel Token Prediction

    cs.LG 2026-04 unverdicted novelty 6.0

    A new formulation for composing discrete generative processes enables precise control over novel condition combinations in image generation, cutting error rates by 63% and speeding up inference.

  52. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  53. MPDiT: Multi-Patch Global-to-Local Transformer Architecture For Efficient Flow Matching and Diffusion Model

    cs.CV 2026-03 unverdicted novelty 6.0

    MPDiT uses a hierarchical multi-patch design in transformers to lower computation in diffusion models by handling coarse global features first then fine local details, plus faster-converging embeddings.

  54. Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking

    cs.CV 2026-01 unverdicted novelty 6.0

    FIA uses contrastive concept saliency and temporal-spatial neuron identification to build unified masks that erase multiple target concepts while preserving general generation quality in diffusion models.

  55. Scaling Up AI-Generated Image Detection with Generator-Aware Prototypes

    cs.CV 2025-12 unverdicted novelty 6.0

    GAPL learns a compact set of canonical forgery prototypes and applies two-stage LoRA training to build a low-variance feature space that improves generalization across GAN and diffusion generators.

  56. Evaluating Concept Filtering Defenses against Child Sexual Abuse Material Generation by Text-to-Image Models

    cs.CR 2025-12 unverdicted novelty 6.0

    Concept filtering of child images from training data offers only limited protection against CSAM generation in text-to-image models, as prompting strategies and fine-tuning can bypass filters even when most child imag...

  57. How Noise Benefits AI-generated Image Detection

    cs.CV 2025-11 unverdicted novelty 6.0

    PiN-CLIP jointly trains a noise generator and detector under a variational positive-incentive principle to inject feature-space noise that suppresses shortcut directions and improves out-of-distribution accuracy by 5....

  58. From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection

    cs.CV 2025-10 conditional novelty 6.0

    A multi-agent forensic system integrates multiple evidence sources and debate to detect AI-generated images, reporting 97.05% accuracy on a 6,000-image benchmark while outperforming traditional classifiers.

  59. RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

    cs.CV 2025-10 unverdicted novelty 6.0

    RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

  60. LeakyCLIP: Extracting Training Data from CLIP

    cs.CR 2025-08 conditional novelty 6.0

    LeakyCLIP reconstructs images from CLIP embeddings with over 258% SSIM gain versus baselines and enables membership inference from reconstruction metrics on LAION-2B data.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 107 Pith papers · 12 internal anchors

  1. [1]

    Blended diffusion for text-driven editing of natural images

    Avrahami, O., Lischinski, D., and Fried, O. Blended diffusion for text-driven editing of natural images. arXiv:2111.14818,

  2. [2]

    2021 , journal =

    Bau, D., Andonian, A., Cui, A., Park, Y ., Jahanian, A., Oliva, A., and Torralba, A. Paint by word. arXiv:2103.10951,

  3. [3]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096,

  4. [4]

    Diffusion Models Beat GANs on Image Synthesis

    URL https://proceedings.mlr.press/v81/ buolamwini18a.html. Crowson, K. Clip guided diffusion hq 256x256. https: //colab.research.google.com/drive/ 12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a. Crowson, K. Clip guided diffusion 512x512, secondary model method. https:// twitter.com/RiversHaveWings/status/ 1462859669454536711, 2021b. Dhariwal, P. and Nichol, A. ...

  5. [5]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929,

  6. [6]

    Stylegan-nada: Clip-guided domain adaptation of image generators

    Gal, R., Patashnik, O., Maron, H., Chechik, G., and Cohen- Or, D. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv:2108.00946,

  7. [7]

    Galatolo, Mario G

    Galatolo, F. A., Cimino, M. G. C. A., and Vaglini, G. Gener- ating images from caption and vice versa via clip-guided generative latent space search. arXiv:2102.01645,

  8. [8]

    Vector quantized diffusion model for text-to-image synthesis, 2022

    Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., and Guo, B. Vector quantized diffusion model for text-to-image synthesis. arXiv:2111.14822,

  9. [9]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017),

  10. [10]

    and Salimans, T

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,

  11. [11]

    Denoising Diffusion Probabilistic Models

    URL https:// openreview.net/forum?id=qw8AKxfYbI. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion proba- bilistic models. arXiv:2006.11239,

  12. [12]

    Cascaded diffusion models for high fidelity image generation

    Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282,

  13. [13]

    A Style-Based Generator Architecture for Generative Adversarial Networks

    Karras, T., Laine, S., and Aila, T. A style-based gen- erator architecture for generative adversarial networks. arXiv:arXiv:1812.04948, 2019a. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. arXiv:1912.04958, 2019b. Kim, G. and Ye, J. C. Diffusionclip: Text-guided image ma...

  14. [14]

    Improved precision and recall met- ric for assessing generative models.CoRR, abs/1904.06991,

    Kynk¨a¨anniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assess- ing generative models. arXiv:1904.06991,

  15. [15]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., Song, Y ., Song, J., Wu, J., Zhu, J.-Y ., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv:2108.01073,

  16. [16]

    Mixed Precision Training

    Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. arXiv:1710.03740,

  17. [17]

    The big sleep

    Murdock, R. The big sleep. https://twitter.com/ advadnoun/status/1351038053033406468,

  18. [18]

    Improved Denoising Diffusion Probabilistic Models

    Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. arXiv:2102.09672,

  19. [19]

    Styleclip: Text-driven manipulation of stylegan imagery.arXiv preprint arXiv:2103.17249, 2021

    Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. arXiv:2103.17249,

  20. [20]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv:2103.00020,

  21. [21]

    Zero-Shot Text-to-Image Generation

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv:2102.12092,

  22. [22]

    Generating Diverse High-Fidelity Images with VQ-VAE-2

    Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446,

  23. [23]

    Palette: Image-to-image diffusion models

    Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models. arXiv:2111.05826, 2021a. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021b. Salimans, T., Goodfellow, I., Zarem...

  24. [24]

    Image synthesis with a single (robust) classifier

    Santurkar, S., Tsipras, D., Tran, B., Ilyas, A., Engstrom, L., and Madry, A. Image synthesis with a single (robust) classifier. arXiv:1906.09453,

  25. [25]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585,

  26. [26]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502, 2020a. Song, Y. and Ermon, S. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020b. Song, Y., Sohl-Dicks...

  27. [27]

    Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis

    Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., and Bao, B. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865,

  28. [28]

    Neural Discrete Representation Learning

    van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv:1711.00937,

  29. [29]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv:1706.03762,

  30. [30]

    Attngan: Fine-grained text to image generation with attentional generative adversarial networks

    Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv:1711.10485,

  31. [31]

    Improving text-to-image synthesis using contrastive learning

    Ye, H., Yang, X., Takac, M., Sunderraman, R., and Ji, S. Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423,

  32. [32]

    Cross-modal contrastive learning for text-to-image generation

    Zhang, H., Koh, J. Y., Baldridge, J., Lee, H., and Yang, Y. Cross-modal contrastive learning for text-to-image generation. arXiv:2101.04702,

  33. [33]

    Hype: A benchmark for human eye perceptual evaluation of generative models

    URL https://doi.org/10.1145/3394171.3414017. Zhou, S., Gordon, M. L., Krishna, R., Narcomey, A., Fei-Fei, L., and Bernstein, M. S. Hype: A benchmark for human eye perceptual evaluation of generative models,

  34. [34]

    Lafite: Towards language-free training for text-to-image generation

    Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., and Sun, T. Lafite: Towards language-free training for text-to-image generation. arXiv:2111.13792,

  35. [35]

    Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis

    Zhu, M., Pan, P., Chen, W., and Yang, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. arXiv:1904.01310,

  36. [36]

    When computing wins and Elo scores, we count a tie as half of a win for each model. By doing this, ties effectively dilute the wins of each model. To compute Elo scores, we construct a matrix $A$ such that entry $A_{ij}$ is the number of times model $i$ beats model $j$. We initialize Elo scores for all $N$ models as $\sigma_i = 0$, $i \in [1, N]$. We compute Elo scores by minimizing the ...
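
The procedure described above can be sketched as follows. The objective itself is elided in the text here, so this sketch assumes the standard Elo model, in which model i beats model j with probability 1 / (1 + 10^((σ_j − σ_i) / 400)), and minimizes the negative log-likelihood of the win matrix by plain gradient descent. The function names `build_wins` and `fit_elo`, the step count, and the learning rate are illustrative choices, not taken from the paper.

```python
import math

# Elo points per unit of natural-log odds (base-10 logistic with scale 400).
ELO_SCALE = 400.0 / math.log(10.0)


def build_wins(n_models, outcomes):
    """Build A where A[i][j] counts wins of model i over model j.

    `outcomes` is a list of (i, j, winner) tuples; winner is i, j, or None
    for a tie. A tie adds half a win to each side, as described in the text.
    """
    wins = [[0.0] * n_models for _ in range(n_models)]
    for i, j, winner in outcomes:
        if winner is None:
            wins[i][j] += 0.5
            wins[j][i] += 0.5
        elif winner == i:
            wins[i][j] += 1.0
        else:
            wins[j][i] += 1.0
    return wins


def fit_elo(wins, steps=5000, lr=1.0):
    """Gradient descent on -sum_ij A_ij * log P(i beats j); scores start at 0.

    Assumption: P(i beats j) is the standard Elo logistic probability.
    """
    n = len(wins)
    total = sum(sum(row) for row in wins)
    theta = [0.0] * n  # scores in natural-log-odds units
    for _ in range(steps):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i == j or wins[i][j] == 0.0:
                    continue
                p_ij = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                g = wins[i][j] * (1.0 - p_ij) / total
                grad[i] -= g  # raising theta_i lowers the loss
                grad[j] += g  # raising theta_j raises the loss
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return [t * ELO_SCALE for t in theta]
```

For two models where the first wins 28 comparisons, loses 8, and ties 4, the matrix holds 30 and 10 effective wins, and the fitted gap approaches 400 · log10(3) Elo points. If a model never loses, the likelihood has no finite optimum, so a fixed step count (as here) or a regularizer is needed.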

  37. [37]

    We trained our CLIP models for 390K iterations with batch size 32K on a 50%-50% mixture of the datasets used by Radford et al. (2021) and Ramesh et al. (2021). For our final CLIP model, we trained a ViT-L with weight decay 0.0125. After training, we fine-tuned the final ViT-L for 30K iterations on an even broader dataset of internet images. We pre-trained GL...

  38. [38]

    a corgi in a field

    Comparing classifier-free guided samples from our large model (first row), a small version trained on the same data (second row), and our released small model trained on a smaller, filtered dataset. In the final row, we show samples using our small model guided by a CLIP model trained on filtered data. Samples are not cherry-picked. D. Comparison to Unnoised C...

  39. [39]

    Comparison of GLIDE to two CLIP guidance strategies applied to pre-trained ImageNet diffusion models. On the left, we use a vanilla CLIP model to guide the 256 × 256 diffusion model from Dhariwal & Nichol (2021), using a combination of engineered perceptual losses and data augmentations (Crowson, 2021a). In the middle, we use our noised ViT-B CLIP model t...

  40. [40]

    pink yarn ball

    is not yet available, we evaluate our model on a few of the prompts shown in the paper (Figure 11). We find that our fine-tuned model sometimes chooses to ignore the given text prompt and instead produces an image that seems influenced only by the surrounding context. To mitigate this phenomenon, we also evaluate our model with the context fully masked out. ...

  41. [41]

    weapon”, “violence

    Comparison of image inpainting quality on real images. (1) Local CLIP-guided diffusion (Crowson, 2021a), (2) PaintByWord++ (Bau et al., 2021; Avrahami et al., 2021), (3) Blended Diffusion (Avrahami et al., 2021). For our results, we follow Avrahami et al. (2021) and use CLIP to select the best of 64 samples. Our fine-tuned samples have more realistic light...
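
Selecting the best of a batch of samples with CLIP can be sketched as below. This is a minimal illustration, assuming each candidate image and the prompt have already been embedded by a CLIP-like encoder and that candidates are ranked by cosine similarity; the embeddings, the function names `cosine` and `select_best`, and the scoring rule are assumptions for illustration, not details given in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_best(text_embedding, image_embeddings):
    """Return (index, scores) for the candidate most similar to the text.

    Both inputs are hypothetical precomputed embeddings; in practice they
    would come from the text and image towers of a CLIP model.
    """
    scores = [cosine(text_embedding, e) for e in image_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With 64 samples per prompt, `image_embeddings` would hold 64 vectors and the returned index identifies the sample to keep.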