SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3
The pith
SDXL scales up the UNet and adds conditioning plus refinement to make latent diffusion competitive with closed image generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.
What carries the argument
The three-times-larger UNet backbone with added attention blocks and a second text encoder for expanded cross-attention context, together with novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model that performs image-to-image enhancement.
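The expanded cross-attention context comes from running two text encoders and concatenating their per-token outputs along the channel axis. A minimal sketch of that idea, using randomly initialized stand-in encoders (the real model uses pretrained CLIP encoders; the 768/1280 widths here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, dim):
    # Stand-in for a frozen text encoder: one embedding vector per token.
    return rng.standard_normal((len(tokens), dim))

tokens = ["a", "photo", "of", "a", "cat"]
emb_a = encode(tokens, 768)   # illustrative width of the first encoder
emb_b = encode(tokens, 1280)  # illustrative width of the second encoder

# Channel-wise concatenation yields the larger cross-attention context
# that the UNet's attention blocks attend over.
context = np.concatenate([emb_a, emb_b], axis=-1)
print(context.shape)  # (5, 2048)
```

The UNet itself is unchanged in kind; it simply cross-attends to this wider context matrix at each attention block.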
If this is right
- Image synthesis quality improves markedly over earlier open Stable Diffusion releases.
- The model handles variable aspect ratios without retraining.
- A lightweight post-processing step further raises fidelity of base outputs.
- Open release of weights and code enables community inspection and extension.
- Performance reaches parity with certain closed commercial generators on visual metrics.
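The post-processing step in the third bullet is an SDEdit-style image-to-image pass: partially re-noise the base model's output latent, then denoise from that intermediate point with the refiner. A toy sketch under that assumption; `denoise_step` is a hypothetical placeholder, not the refiner's learned network:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, t):
    # Placeholder for one learned denoising step of the refiner;
    # a real model predicts and removes noise conditioned on text.
    return 0.9 * latent

def refine(base_latent, noise_frac=0.3, total_steps=10):
    # Re-noise only partially: the refiner polishes details,
    # it does not regenerate the image from scratch.
    start = int(total_steps * noise_frac)
    latent = base_latent + noise_frac * rng.standard_normal(base_latent.shape)
    for t in reversed(range(start)):
        latent = denoise_step(latent, t)
    return latent

base = rng.standard_normal((4, 64, 64))  # latent from the base model
refined = refine(base)
print(refined.shape)  # (4, 64, 64)
```

Because only a fraction of the noise schedule is traversed, the refinement cost is a small add-on to base sampling.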
Where Pith is reading between the lines
- Open models may narrow the gap with proprietary systems through targeted architectural scaling rather than data secrecy alone.
- The refinement stage could be adapted as a modular add-on for other diffusion pipelines.
- Wider availability might shift user workflows away from paid API calls toward local or fine-tuned open alternatives.
Load-bearing premise
The reported gains in image quality come chiefly from the architectural scaling, conditioning additions, and refinement step rather than from undisclosed increases in training data volume, curation quality, or total compute.
What would settle it
A controlled re-training of SDXL and a prior Stable Diffusion baseline on identical data and hardware, followed by direct side-by-side evaluation on the same prompts, would show whether architecture alone explains the quality jump.
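Operationally, such a comparison reduces to a paired evaluation over a fixed prompt list with matched seeds. The skeleton below assumes two hypothetical generator callables and a stand-in quality metric (a real study would plug in FID or CLIP score):

```python
import numpy as np

def toy_metric(image):
    # Stand-in for a real quality metric such as FID or CLIP score.
    return float(np.mean(image))

def compare(model_a, model_b, prompts, seed=0):
    scores_a, scores_b = [], []
    for i, prompt in enumerate(prompts):
        # Same prompt and same seed for both models removes sampling luck.
        scores_a.append(toy_metric(model_a(prompt, seed + i)))
        scores_b.append(toy_metric(model_b(prompt, seed + i)))
    return np.mean(scores_a), np.mean(scores_b)

# Hypothetical generators standing in for the two retrained models;
# model A is constructed to score slightly higher on the toy metric.
gen_a = lambda p, s: np.random.default_rng(s).random((8, 8)) + 0.1
gen_b = lambda p, s: np.random.default_rng(s).random((8, 8))

a, b = compare(gen_a, gen_b, ["a cat", "a dog"])
print(a > b)  # True
```

Holding data, hardware, and seeds fixed is what lets any score gap be attributed to architecture rather than to the confounds the premise above rules out.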
read the original abstract
We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SDXL, a latent diffusion model for text-to-image synthesis. It employs a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly through additional attention blocks and a second text encoder enabling larger cross-attention context. Novel conditioning schemes are proposed, the model is trained across multiple aspect ratios, and a refinement model is added for post-hoc image-to-image fidelity improvement. The central claim is that SDXL achieves drastically improved performance over previous Stable Diffusion versions while remaining competitive with black-box state-of-the-art generators; code and model weights are released.
Significance. If the empirical performance claims hold under independent verification, this constitutes a meaningful open contribution to high-resolution text-to-image synthesis by providing a transparent, reproducible baseline that can accelerate community research. The explicit release of code and weights is a clear strength that directly supports falsifiability of the reported gains.
major comments (1)
- [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.
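For reference, the FID the comment asks for is the Fréchet distance between Gaussians fitted to feature statistics of real and generated images. A minimal NumPy/SciPy version, with the Inception feature extraction step omitted:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two feature sets."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce tiny imaginary components.
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))

rng = np.random.default_rng(0)
same = rng.standard_normal((500, 16))
print(fid(same, same) < 1e-6)  # True: identical feature sets score ~0
```

On a fixed benchmark with fixed feature extraction, this gives exactly the kind of reproducible table the report requests.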
minor comments (3)
- [Methods] Clarify the exact training data composition and aspect-ratio sampling strategy in the methods section to allow readers to assess potential data-related confounds.
- [Figures] Add captions to all qualitative figures that explicitly state the prompt, sampling parameters, and which model variant is shown in each panel.
- [Abstract] Verify that the released GitHub repository contains the exact model weights, inference code, and evaluation scripts referenced in the paper.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and recommendation of minor revision. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.
read point-by-point responses
-
Referee: [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.
Authors: We agree that more explicit quantitative details would help clarify the contributions of the architectural and conditioning changes. The manuscript presents extensive qualitative results and some supporting metrics demonstrating the performance gains, but we acknowledge that additional tables with FID and CLIP scores on fixed benchmarks (e.g., MS-COCO), together with more detailed ablation controls, would strengthen the isolation of effects from the larger UNet, second text encoder, and novel conditioning. In the revised version we will add these tables and expand the description of training data aspects (including multi-aspect-ratio sampling) to the extent feasible. We note that the release of code and model weights directly enables independent verification, further ablations, and community evaluation on any desired benchmarks, which addresses the core concern of reproducibility.
Revision: yes
Circularity Check
No significant circularity in empirical claims
full rationale
The paper describes an empirical engineering effort: a larger UNet with additional attention blocks, a second text encoder, novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model. The central claim of improved performance over prior Stable Diffusion versions and competitiveness with closed SOTA models rests on released weights, code, and external visual/qualitative comparisons rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction; the work is self-contained against verifiable external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- UNet backbone scale
axioms (1)
- domain assumption Latent diffusion models generate images by iteratively denoising in a compressed latent space conditioned on text embeddings
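The domain assumption above can be made concrete with a toy reverse-diffusion loop over a small latent tensor. The noise predictor here is a hypothetical linear stand-in for the text-conditioned UNet, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
STEPS = 50
betas = np.linspace(1e-4, 0.02, STEPS)   # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(z, t, text_emb):
    # Hypothetical stand-in for the UNet's text-conditioned noise prediction.
    return 0.1 * z + 0.01 * text_emb.mean()

def sample_latent(shape, text_emb):
    z = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(STEPS)):
        eps = predict_noise(z, t, text_emb)
        # DDPM-style ancestral update: remove predicted noise, re-inject a bit.
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            z += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return z

text_emb = rng.standard_normal((77, 2048))  # illustrative prompt embedding
z0 = sample_latent((4, 8, 8), text_emb)     # small latent, decoded to pixels by a VAE
print(z0.shape)  # (4, 8, 8)
```

The compressed latent space is what keeps high-resolution synthesis tractable: the loop runs over a small tensor and a decoder maps the result to pixels afterward.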
Forward citations
Cited by 60 Pith papers
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression
OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.
-
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
-
ImageAttributionBench: How Far Are We from Generalizable Attribution?
ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
-
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
-
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
Dependency-Aware Discrete Diffusion for Scene Graph Generation
A new discrete diffusion model for scene graph generation from text captures object-relation dependencies via hierarchical constraints and training-free conditioning, yielding better graph metrics and downstream image...
-
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
-
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...
-
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
-
D-Rex : Diffusion Rendering for Relightable Expressive Avatars
D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
-
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
-
Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution
IDaS-SR achieves one-step real-world super-resolution by bridging restoration and generation manifolds via adaptive inversion noise estimation and continuous trajectory steering.
-
Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation
Pose-LDM generates occluded in-bed images from keypoints to augment training data, achieving top accuracy under severe occlusion compared to other augmentation methods.
-
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion
DCMorph generates face morphs via decoupled cross-attention in identity-conditioned diffusion and DDIM spherical interpolation, achieving higher attack success rates on four face recognition systems than prior methods...
-
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
-
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
-
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
-
Long-Text-to-Image Generation via Compositional Prompt Decomposition
PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
-
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
-
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
-
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.
-
Thermodynamic Diffusion Inference with Minimal Digital Conditioning
Thermodynamic diffusion inference at production scale is shown using hierarchical bilinear coupling for U-Net skips and a 2,560-parameter digital bottleneck, attaining 0.9906 cosine similarity with theoretical 10^7x e...
-
DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization
DiffusionPrint learns robust forensic feature maps via MoCo-style contrastive training on diffusion inpainting fingerprints, boosting localization accuracy by up to 28% when fused into existing IFL systems and general...
-
SEED: A Large-Scale Benchmark for Provenance Tracing in Sequential Deepfake Facial Edits
SEED is a new benchmark for sequential provenance tracing in diffusion-edited deepfake faces, with the FAITH baseline showing that wavelet-based high-frequency signals aid detection of accumulated editing artifacts.
-
Image-Guided Geometric Stylization of 3D Meshes
A coarse-to-fine pipeline deforms 3D meshes to reflect geometric features from an image using diffusion model representations while preserving topology and part-level semantics.
-
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.
-
VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion
FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.
-
Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors
Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.
-
Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization
HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
-
Your Pre-trained Diffusion Model Secretly Knows Restoration
Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on ...
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
VOSR: A Vision-Only Generative Model for Image Super-Resolution
VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...
-
From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion
A diffusion model trained on synthetically damaged teeth from public datasets completes crowns with 81.8% IoU and 0.00034 Chamfer distance, and produces real-world restorations with minimal opposing-tooth interference.
-
SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
-
CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models
CSF is the first black-box method to attribute fine-tuned text-to-image models to original lineages via compositional semantic probes and Bayesian decisions across multiple model families.
-
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
-
Transfer between Modalities with MetaQueries
MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
-
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.
-
InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.
-
PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
-
L2P: Unlocking Latent Potential for Pixel Generation
L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.
-
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...
-
Filtering Memorization from Parameter-Space in Diffusion Models
BAF reduces memorization in diffusion LoRAs by filtering spectral channels of the adaptation weights that show weak alignment with the base model's principal subspace.
-
Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.
-
DiffATS: Diffusion in Aligned Tensor Space
DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...
Reference graph
Works this paper leans on
-
[1]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324, 2022
work page internal anchor Pith review arXiv 2022
-
[2]
Tract: Denoising diffusion models with transitive closure time-distillation
David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation. arXiv:2303.04248, 2023
-
[3]
Align your latents: High-resolution video synthesis with latent diffusion models, 2023
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023
-
[4]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
work page 2009
-
[5]
Diffusion Models Beat GANs on Image Synthesis
Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021
work page internal anchor Pith review arXiv 2021
-
[6]
Distilling the Knowledge in Diffusion Models
Tim Dockhorn, Robin Rombach, Andreas Blattmann, and Yaoliang Yu. Distilling the Knowledge in Diffusion Models. CVPR Workshop on Generative Models for Computer Vision, 2023
work page 2023
-
[7]
Structure and content-guided video synthesis with diffusion models, 2023
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models, 2023
work page 2023
-
[8]
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv:2212.05032, 2023
-
[9]
Riffusion - Stable diffusion for real-time music generation, 2022
Seth Forsgren and Hayk Martiros. Riffusion - Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about
work page 2022
-
[10]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv:2208.01618, 2022
work page internal anchor Pith review arXiv 2022
-
[11]
Diffusion with offset noise, 2023
Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023. URL https://www.crosslabs. org/blog/diffusion-with-offset-noise
work page 2023
-
[12]
GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500, 2017
work page Pith review arXiv 2017
-
[13]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239, 2020
work page internal anchor Pith review arXiv 2006
-
[15]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303, 2022
work page internal anchor Pith review arXiv 2022
-
[16]
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023
-
[17]
Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. arXiv:2301.12661, 2023
-
[18]
Estimation of Non-Normalized Statistical Models by Score Matching
Aapo Hyvärinen and Peter Dayan. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(4), 2005
work page 2005
-
[19]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773
-
[20]
Distribution Augmentation for Generative Modeling
Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever. Distribution Augmentation for Generative Modeling. In International Conference on Machine Learning, pages 5006–5019. PMLR, 2020
work page 2020
-
[21]
Elucidating the Design Space of Diffusion-Based Generative Models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364, 2022
work page internal anchor Pith review arXiv 2022
-
[22]
2023.On Architectural Compression of Text-to-Image Diffusion Models
Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On Architectural Compression of Text-to-Image Diffusion Models. arXiv:2305.15798, 2023. 19
-
[23]
Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv:2305.01569, 2023
-
[24]
2023.SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. arXiv:2306.00980, 2023
-
[25]
Common Diffusion Noise Schedules and Sample Steps are Flawed
Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common Diffusion Noise Schedules and Sample Steps are Flawed. arXiv:2305.08891, 2023
-
[26]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context, 2015
-
[27]
Character-aware models improve visual text rendering
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering, 2023
-
[28]
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations
Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073, 2021
-
[29]
On distillation of guided diffusion models
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023
-
[30]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021
-
[31]
NovelAI improvements on stable diffusion
NovelAI. NovelAI improvements on stable diffusion, 2023. URL https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac
-
[32]
PyTorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library, 2019
-
[33]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv:2212.09748, 2022
-
[34]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021
-
[35]
How DALL·E 2 works
Aditya Ramesh. How DALL·E 2 works, 2022. URL http://adityaramesh.com/posts/dalle2/dalle2.html
-
[36]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021
-
[37]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022
-
[38]
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021
-
[39]
U-Net: Convolutional Networks for Biomedical Image Segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015
-
[40]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487, 2022
-
[41]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv:2202.00512, 2022
-
[42]
Improved Techniques for Training GANs
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. arXiv:1606.03498, 2016
-
[43]
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification. arXiv:2305.15957, 2023
-
[44]
Make-A-Video: Text-to-Video Generation without Text-Video Data
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792, 2022
-
[45]
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015
-
[46]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv:2010.02502, 2020
-
[47]
Score-Based Generative Modeling through Stochastic Differential Equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456, 2020
-
[48]
Evaluating a synthetic image dataset generated with stable diffusion
Andreas Stöckl. Evaluating a synthetic image dataset generated with stable diffusion. arXiv:2211.01777, 2022
-
[49]
High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity
Yu Takagi and Shinji Nishimoto. High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023
-
[50]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023
-
[51]
Boosting GUI prototyping with diffusion models
Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, and Gérard Dray. Boosting GUI prototyping with diffusion models. arXiv:2306.06233, 2023
-
[52]
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models, 2022
-
[53]
Scaling autoregressive models for content-rich text-to-image generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022
-
[54]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv:2302.05543, 2023
-
[55]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018