pith. machine review for the scientific record. sign in

arxiv: 2307.01952 · v1 · submitted 2023-07-04 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords SDXLlatent diffusiontext-to-image synthesisUNet scalingrefinement modelStable Diffusionconditioning schemeshigh-resolution generation
0
0 comments X

The pith

SDXL scales up the UNet and adds conditioning plus refinement to make latent diffusion competitive with closed image generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDXL as an upgraded latent diffusion model for text-to-image generation that uses a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly by adding attention blocks and a second text encoder. It introduces new conditioning methods, trains across multiple aspect ratios, and pairs the base model with a separate refinement network that runs post-hoc image-to-image processing to raise visual quality. These changes produce outputs that exceed earlier open Stable Diffusion models and reach levels comparable to proprietary black-box systems. The work releases code and weights to support open research. Readers would care because it offers a transparent path to high-fidelity image synthesis without depending on closed commercial services.

Core claim

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results that

What carries the argument

The three-times-larger UNet backbone with added attention blocks and a second text encoder for expanded cross-attention context, together with novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model that performs image-to-image enhancement.

If this is right

  • Image synthesis quality improves markedly over earlier open Stable Diffusion releases.
  • The model handles variable aspect ratios without retraining.
  • A lightweight post-processing step further raises fidelity of base outputs.
  • Open release of weights and code enables community inspection and extension.
  • Performance reaches parity with certain closed commercial generators on visual metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open models may narrow the gap with proprietary systems through targeted architectural scaling rather than data secrecy alone.
  • The refinement stage could be adapted as a modular add-on for other diffusion pipelines.
  • Wider availability might shift user workflows away from paid API calls toward local or fine-tuned open alternatives.

Load-bearing premise

The reported gains in image quality come chiefly from the architectural scaling, conditioning additions, and refinement step rather than from undisclosed increases in training data volume, curation quality, or total compute.

What would settle it

A controlled re-training of SDXL and a prior Stable Diffusion baseline on identical data and hardware, followed by direct side-by-side evaluation on the same prompts, would show whether architecture alone explains the quality jump.

read the original abstract

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces SDXL, a latent diffusion model for text-to-image synthesis. It employs a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly through additional attention blocks and a second text encoder enabling larger cross-attention context. Novel conditioning schemes are proposed, the model is trained across multiple aspect ratios, and a refinement model is added for post-hoc image-to-image fidelity improvement. The central claim is that SDXL achieves drastically improved performance over previous Stable Diffusion versions while remaining competitive with black-box state-of-the-art generators; code and model weights are released.

Significance. If the empirical performance claims hold under independent verification, this constitutes a meaningful open contribution to high-resolution text-to-image synthesis by providing a transparent, reproducible baseline that can accelerate community research. The explicit release of code and weights is a clear strength that directly supports falsifiability of the reported gains.

major comments (1)
  1. [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.
minor comments (3)
  1. [Methods] Clarify the exact training data composition and aspect-ratio sampling strategy in the methods section to allow readers to assess potential data-related confounds.
  2. [Figures] Add captions to all qualitative figures that explicitly state the prompt, sampling parameters, and which model variant is shown in each panel.
  3. [Abstract] Verify that the released GitHub repository contains the exact model weights, inference code, and evaluation scripts referenced in the paper.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful review and recommendation of minor revision. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.

    Authors: We agree that more explicit quantitative details would help clarify the contributions of the architectural and conditioning changes. The manuscript presents extensive qualitative results and some supporting metrics demonstrating the performance gains, but we acknowledge that additional tables with FID and CLIP scores on fixed benchmarks (e.g., MS-COCO), together with more detailed ablation controls, would strengthen the isolation of effects from the larger UNet, second text encoder, and novel conditioning. In the revised version we will add these tables and expand the description of training data aspects (including multi-aspect-ratio sampling) to the extent feasible. We note that the release of code and model weights directly enables independent verification, further ablations, and community evaluation on any desired benchmarks, which addresses the core concern of reproducibility. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

full rationale

The paper describes an empirical engineering effort: a larger UNet with additional attention blocks, a second text encoder, novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model. The central claim of improved performance over prior Stable Diffusion versions and competitiveness with closed SOTA models rests on released weights, code, and external visual/qualitative comparisons rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction; the work is self-contained against verifiable external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The performance claims rest on standard latent diffusion assumptions plus empirical training; no new physical entities or ad-hoc constants are introduced beyond typical model hyperparameters.

free parameters (1)
  • UNet backbone scale
    Model size increased by factor of three via more attention blocks; chosen through design and training to improve capacity.
axioms (1)
  • domain assumption Latent diffusion models generate images by iteratively denoising in a compressed latent space conditioned on text embeddings
    Core modeling assumption underlying the entire SDXL architecture and training procedure.

pith-pipeline@v0.9.0 · 5469 in / 1277 out tokens · 78062 ms · 2026-05-10T15:16:09.227947+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  2. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  3. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  4. OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

    cs.CV 2026-05 unverdicted novelty 7.0

    OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.

  5. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  6. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  7. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  8. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  9. LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

    cs.CV 2026-05 unverdicted novelty 7.0

    LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

  10. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  11. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  12. Dependency-Aware Discrete Diffusion for Scene Graph Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    A new discrete diffusion model for scene graph generation from text captures object-relation dependencies via hierarchical constraints and training-free conditioning, yielding better graph metrics and downstream image...

  13. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  14. Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...

  15. From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

  16. D-Rex : Diffusion Rendering for Relightable Expressive Avatars

    cs.GR 2026-04 conditional novelty 7.0

    D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.

  17. GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models

    cs.LG 2026-04 unverdicted novelty 7.0

    GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.

  18. Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

    cs.CV 2026-04 unverdicted novelty 7.0

    IDaS-SR achieves one-step real-world super-resolution by bridging restoration and generation manifolds via adaptive inversion noise estimation and continuous trajectory steering.

  19. Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    Pose-LDM generates occluded in-bed images from keypoints to augment training data, achieving top accuracy under severe occlusion compared to other augmentation methods.

  20. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  21. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  22. DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    DCMorph generates face morphs via decoupled cross-attention in identity-conditioned diffusion and DDIM spherical interpolation, achieving higher attack success rates on four face recognition systems than prior methods...

  23. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  24. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  25. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  26. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  27. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  28. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  29. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

  30. Thermodynamic Diffusion Inference with Minimal Digital Conditioning

    cs.LG 2026-04 unverdicted novelty 7.0

    Thermodynamic diffusion inference at production scale is shown using hierarchical bilinear coupling for U-Net skips and a 2,560-parameter digital bottleneck, attaining 0.9906 cosine similarity with theoretical 10^7x e...

  31. DiffusionPrint: Learning Generative Fingerprints for Diffusion-Based Inpainting Localization

    cs.CV 2026-04 unverdicted novelty 7.0

    DiffusionPrint learns robust forensic feature maps via MoCo-style contrastive training on diffusion inpainting fingerprints, boosting localization accuracy by up to 28% when fused into existing IFL systems and general...

  32. SEED: A Large-Scale Benchmark for Provenance Tracing in Sequential Deepfake Facial Edits

    cs.CR 2026-04 unverdicted novelty 7.0

    SEED is a new benchmark for sequential provenance tracing in diffusion-edited deepfake faces, with the FAITH baseline showing that wavelet-based high-frequency signals aid detection of accumulated editing artifacts.

  33. Image-Guided Geometric Stylization of 3D Meshes

    cs.CV 2026-04 unverdicted novelty 7.0

    A coarse-to-fine pipeline deforms 3D meshes to reflect geometric features from an image using diffusion model representations while preserving topology and part-level semantics.

  34. RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

    cs.CV 2026-04 unverdicted novelty 7.0

    RefineAnything is a multimodal diffusion model using Focus-and-Refine crop-and-resize with blended paste-back to achieve high-fidelity local image refinement and near-perfect background preservation.

  35. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  36. Graph-PiT: Enhancing Structural Coherence in Part-Based Image Synthesis via Graph Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    Graph-PiT adds graph priors and a hierarchical GNN to part-based image synthesis to enforce relational constraints and improve structural coherence over vanilla PiT.

  37. Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

    cs.CR 2026-04 unverdicted novelty 7.0

    HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.

  38. Your Pre-trained Diffusion Model Secretly Knows Restoration

    cs.CV 2026-04 unverdicted novelty 7.0

    Pre-trained diffusion models inherently support image restoration that can be unlocked by optimizing prompt embeddings at the text encoder output using a diffusion bridge formulation, achieving competitive results on ...

  39. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  40. VOSR: A Vision-Only Generative Model for Image Super-Resolution

    cs.CV 2026-04 conditional novelty 7.0

    VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...

  41. From Synthetic Data to Real Restorations: Diffusion Model for Patient-specific Dental Crown Completion

    cs.CV 2026-03 unverdicted novelty 7.0

    A diffusion model trained on synthetically damaged teeth from public datasets completes crowns with 81.8% IoU and 0.00034 Chamfer distance, and produces real-world restorations with minimal opposing-tooth interference.

  42. SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

    cs.CV 2026-03 conditional novelty 7.0

    SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.

  43. CSF: Black-box Fingerprinting via Compositional Semantics for Text-to-Image Models

    cs.CR 2026-03 unverdicted novelty 7.0

    CSF is the first black-box method to attribute fine-tuned text-to-image models to original lineages via compositional semantic probes and Bayesian decisions across multiple model families.

  44. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  45. Transfer between Modalities with MetaQueries

    cs.CV 2025-04 unverdicted novelty 7.0

    MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

  46. Unified Reward Model for Multimodal Understanding and Generation

    cs.CV 2025-03 unverdicted novelty 7.0

    UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.

  47. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  48. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  49. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  50. Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

    cs.CV 2026-05 conditional novelty 6.0

    SEF introduces GAN upsampling for diverse artifacts and expert fusion to reduce domain interference, yielding stronger generalization on 13 benchmarks for AI-generated image detection.

  51. InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 6.0

    InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

  52. PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

    cs.CV 2026-05 unverdicted novelty 6.0

    PRISM improves text image super-resolution by rectifying global priors with flow-matching and modeling local structural uncertainty in a single diffusion pass, achieving SOTA results at millisecond inference.

  53. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 6.0

    V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

  54. When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

    cs.CV 2026-05 unverdicted novelty 6.0

    Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...

  55. L2P: Unlocking Latent Potential for Pixel Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

  56. HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

    cs.CV 2026-05 unverdicted novelty 6.0

    A pixel-space Diffusion Transformer with Unified Transformer architecture unifies image generation, editing, and personalization in an end-to-end model that maps all inputs to a shared token space and scales from 8B t...

  57. Filtering Memorization from Parameter-Space in Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    BAF reduces memorization in diffusion LoRAs by filtering spectral channels of the adaptation weights that show weak alignment with the base model's principal subspace.

  58. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts to generate more consistent fashion outfits than prior state-of-the-art methods.

  59. Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

    cs.CV 2026-05 unverdicted novelty 6.0

    Fashion130K dataset and UMC framework align text and visual prompts with embedding refiner, Fusion Transformer, and redesigned attention to generate more consistent outfits than prior methods.

  60. DiffATS: Diffusion in Aligned Tensor Space

    cs.LG 2026-05 unverdicted novelty 6.0

    DiffATS trains diffusion models directly on aligned Tucker tensor primitives that are proven to be homeomorphisms, delivering efficient unconditional and conditional generation across images, videos, and PDE data with...

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 160 Pith papers · 20 internal anchors

  1. [1]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324, 2022

  2. [2]

    Tract: Denoising diffusion models with transitive closure time-distillation

    David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation. arXiv:2303.04248, 2023

  3. [3]

    Align your latents: High-resolution video synthesis with latent diffusion models, 2023

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023

  4. [4]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  5. [5]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

  6. [6]

    Distilling the Knowledge in Diffusion Models

    Tim Dockhorn, Robin Rombach, Andreas Blattmann, and Yaoliang Yu. Distilling the Knowledge in Diffusion Models. CVPR Workshop on Generative Models for Computer Vision, 2023

  7. [7]

    Structure and content-guided video synthesis with diffusion models, 2023

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models, 2023

  8. [8]

    E., and Wang, W

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv:2212.05032, 2023

  9. [9]

    Riffusion - Stable diffusion for real-time music generation, 2022

    Seth Forsgren and Hayk Martiros. Riffusion - Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about

  10. [10]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv:2208.01618, 2022

  11. [11]

    Diffusion with offset noise, 2023

    Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023. URL https://www.crosslabs. org/blog/diffusion-with-offset-noise

  12. [12]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500, 2017

  13. [13]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022

  14. [14]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239, 2020

  15. [15]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303, 2022

  16. [16]

    Simple diffusion: End-to-end diffusion for high resolution images.arXiv preprint arXiv:2301.11093, 2023

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023

  17. [17]

    Make-an-audio: Text-to-audio genera- tion with prompt-enhanced diffusion models.arXiv preprint arXiv:2301.12661, 2023

    Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. arXiv:2301.12661, 2023

  18. [18]

    Estimation of Non-Normalized Statistical Models by Score Matching

    Aapo Hyvärinen and Peter Dayan. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(4), 2005

  19. [19]

    2021.OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773

  20. [20]

    Distribution Augmentation for Generative Modeling

    Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever. Distribution Augmentation for Generative Modeling. In International Conference on Machine Learning, pages 5006–5019. PMLR, 2020

  21. [21]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364, 2022

  22. [22]

    2023.On Architectural Compression of Text-to-Image Diffusion Models

    Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On Architectural Compression of Text-to-Image Diffusion Models. arXiv:2305.15798, 2023. 19

  23. [23]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation, 2023

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv:2305.01569, 2023

  24. [24]

    2023.SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

    Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. arXiv:2306.00980, 2023

  25. [25]

    arXiv:2305.08891 , year=

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common Diffusion Noise Schedules and Sample Steps are Flawed. arXiv:2305.08891, 2023

  26. [26]

    Lawrence Zitnick, and Piotr Dollár

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015

  27. [27]

    Character-aware models improve visual text rendering, 2023

    Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering, 2023

  28. [28]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073, 2021

  29. [29]

    Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023

  30. [30]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text- Guided Diffusion Models. arXiv:2112.10741, 2021

  31. [31]

    Novelai improvements on stable diffusion, 2023

    NovelAI. Novelai improvements on stable diffusion, 2023. URL https://blog.novelai.net/ novelai-improvements-on-stable-diffusion-e10d38db82ac

  32. [32]

    Pytorch: An imperative style, high-performance deep learning library, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

  33. [33]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv:2212.09748, 2022

  34. [34]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

  35. [35]

    How dall·e 2 works, 2022

    Aditya Ramesh. How dall·e 2 works, 2022. URL http://adityaramesh.com/posts/dalle2/dalle2. html

  36. [36]

    Zero-shot text-to-image generation, 2021

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea V oss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022

  38. [38]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752, 2021

  39. [39]

    U-Net: Convolutional Networks for Biomedical Image Segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015

  40. [40]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487, 2022

  41. [41]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv preprint arXiv:2202.00512, 2022

  42. [42]

    Improved Techniques for Training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. arXiv:1606.03498, 2016

  43. [43]

    DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

    Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification. arXiv:2305.15957, 2023

  44. [44]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792, 2022

  45. [45]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015. 20

  46. [46]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

  47. [47]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456, 2020

  48. [48]

    Evaluating a synthetic image dataset generated with stable diffusion

    Andreas Stöckl. Evaluating a synthetic image dataset generated with stable diffusion. arXiv:2211.01777, 2022

  49. [49]

    High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity

    Yu Takagi and Shinji Nishimoto. High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

  50. [50]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

  51. [51]

    Boosting gui prototyping with diffusion models

    Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, and Gérard Dray. Boosting gui prototyping with diffusion models. arXiv preprint arXiv:2306.06233, 2023

  52. [52]

    Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

  53. [53]

    Scaling autoregressive models for content-rich text-to-image generation, 2022

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022

  54. [54]

    Adding conditional control to text-to-image diffusion models,

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv:2302.05543, 2023

  55. [55]

    Efros, Eli Shechtman, and Oliver Wang

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018. 21