pith. machine review for the scientific record.

arxiv: 2605.11494 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 02:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · diversity guidance · feature perturbation · PCA · single-step generation · training-free · image synthesis · transformer features

The pith

Projecting perturbations onto the principal components of a diffusion model's activations enables controlled diversity gains in single-step image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Single-step and few-step diffusion models generate images quickly but suffer from reduced variety compared to slower multi-step versions. The paper argues that simply adding noise to features fails because it ignores the structure the model has learned. STRIDE addresses this by adding spatially coherent pink noise to transformer features after projecting it onto the principal components of the model's activations, keeping changes within the learned representation space. This leads to better diversity on standard benchmarks while keeping text-image alignment strong, and it requires no extra training or optimization steps. The approach shows that respecting the model's internal geometry is key to effective diversity in fast generation settings.

Core claim

STRIDE injects spatially coherent pink noise into intermediate transformer features, projected onto the principal components of the model's own activations. This ensures perturbations lie on the learned feature manifold and enables controlled variation along meaningful directions, improving diversity in one-step and few-step diffusion models without training or iterative refinement.
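The mechanism can be sketched in a few lines. This is an illustrative reconstruction from the summary above, not the authors' code; the layer choice, the number of retained components `k`, and the scale `alpha` are placeholder values, not the paper's settings.

```python
import numpy as np

def pca_directed_perturbation(features, noise, k=8, alpha=0.1):
    """Project a noise field onto the top-k principal components of one
    forward pass's activations, then add it at strength alpha.
    features, noise: (tokens, channels). k and alpha are illustrative."""
    centered = features - features.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions in channel space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                        # (k, channels)
    projected = noise @ basis.T @ basis   # keep only on-manifold components
    return features + alpha * projected

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))  # stand-in for one layer's token features
noise = rng.normal(size=(256, 64))  # would be spatially coherent pink noise
out = pca_directed_perturbation(feats, noise)
```

Because the perturbation is confined to the span of the top components, its effective rank is at most `k`, which is what keeps it aligned with a locally linear approximation of the learned manifold rather than pushing features off it.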

What carries the argument

PCA-directed feature perturbation: noise projected onto principal components of model activations to align with learned manifold.

If this is right

  • STRIDE reduces intra-batch similarity on COCO, DrawBench, PartiPrompts, and GenEval while maintaining CLIP scores.
  • It Pareto-dominates existing training-free baselines on the diversity-fidelity frontier for FLUX.1-schnell and SD3.5 Turbo.
  • The method operates in a single forward pass, enabling real-time use.
  • Diversity gains arise from alignment with representation structure rather than increased perturbation strength.
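The intra-batch similarity these claims rest on is, in generic form, the mean pairwise cosine similarity among one prompt's samples in some embedding space. The summary does not pin down the backbone (DreamSim and DINOv2 both appear in the references), so the embedding below is a placeholder:

```python
import numpy as np

def intra_batch_similarity(embeddings):
    """Mean pairwise cosine similarity over a batch of image embeddings,
    one row per sample. Lower values mean a more diverse batch."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T
    mask = ~np.eye(len(e), dtype=bool)    # drop the self-similarity diagonal
    return sim[mask].mean()

rng = np.random.default_rng(1)
batch = rng.normal(size=(4, 512))         # four samples from one prompt
score = intra_batch_similarity(batch)     # near 0 for random embeddings
```

Identical samples score 1.0 and mutually orthogonal ones score 0.0, so "reduces intra-batch similarity while maintaining CLIP" means moving this number down without paying in text alignment.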

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection approach could be tested on other transformer-based generative models to check for similar diversity benefits.
  • Combining STRIDE with existing single-pass techniques might produce further gains in variety without added computation.
  • The emphasis on spatial coherence in the noise suggests experiments swapping noise types could clarify which properties drive the gains.

Load-bearing premise

That projecting perturbations onto the principal components of the model's activations will align them with meaningful directions for diversity without introducing artifacts or reducing text alignment.

What would settle it

Running STRIDE on FLUX.1-schnell with the same prompts as the baselines and finding no statistically significant diversity gain (no reduction in intra-batch similarity) and no improvement on the diversity-fidelity frontier.

Figures

Figures reproduced from arXiv: 2605.11494 by Ankit Yadav, Arpit Garg, Lingqiao Liu, Ta Duc Huy.

Figure 1. STRIDE motivation. (a) STRIDE vs. random-seed baseline on FLUX.1-schnell: each block shows four samples from one DrawBench prompt; STRIDE produces diverse poses and styles. (b) Unstructured noise pushes features off-manifold and is treated as corruption, while PCA-directed perturbation keeps noise on the learned manifold M. These methods fundamentally rely on temporal degrees of freedom (schedules, traject…
Figure 2. Pareto frontiers for diversity-quality trade-off. We compare STRIDE against baselines using InBSim↓ and CLIP↑ on FLUX.1-schnell and SD3.5 Turbo. STRIDE achieves the strongest Pareto frontier on both architectures, improving diversity while preserving image-text alignment. DrawBench (−9.8% InBSim) but at 15× the CLIP cost (−1.86 vs. STRIDE’s −0.12), reflecting the steep quality penalty of repulsion-based me…
Figure 3. Diversity-quality Pareto frontier on DrawBench. We compare STRIDE No-PCA (undirected pink noise) and STRIDE (PCA-directed pink noise) across frequency exponent fα and perturbation strength α on FLUX.1-schnell and SD3.5 Turbo using 199 prompts with 4 images per prompt. STRIDE offers a better diversity-quality trade-off, achieving lower InBSim at comparable CLIP across both backbones…
Original abstract

Distilled one-step (T=1) or few-step (T ≤ 4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
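The abstract's "spatially coherent (pink) noise" is conventionally produced by shaping white noise in the Fourier domain. A minimal 2D sketch follows; note that whether the exponent applies to the amplitude or the power spectrum is a convention the abstract does not fix, so this applies it to the amplitude.

```python
import numpy as np

def pink_noise_2d(h, w, f_alpha=1.0, rng=None):
    """Noise whose Fourier amplitude falls off as 1/f**f_alpha, giving
    spatial coherence (neighboring pixels correlate). f_alpha plays the
    role of the frequency exponent swept in the paper's Figure 3."""
    if rng is None:
        rng = np.random.default_rng()
    white = np.fft.fft2(rng.normal(size=(h, w)))
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    freq = np.sqrt(fy**2 + fx**2)
    freq[0, 0] = 1.0                     # avoid divide-by-zero at DC
    field = np.fft.ifft2(white / freq**f_alpha).real
    return (field - field.mean()) / field.std()  # zero mean, unit variance

field = pink_noise_2d(32, 32, f_alpha=1.0, rng=np.random.default_rng(2))
```

Setting `f_alpha=0` recovers ordinary white noise, which is the undirected baseline the No-PCA ablation in Figure 3 contrasts against.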

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces STRIDE, a training-free, single-forward-pass method for boosting sample diversity in one-step and few-step diffusion models (e.g., FLUX.1-schnell, SD3.5 Turbo). It injects spatially coherent pink noise into intermediate transformer features after projecting the noise onto the principal components of the model's own activations, with the goal of placing perturbations along the learned feature manifold to enable controlled, meaningful variation. Experiments on COCO, DrawBench, PartiPrompts, and GenEval report consistent reductions in intra-batch similarity with minimal CLIP-score degradation and Pareto dominance over training-free baselines.

Significance. If the reported gains hold under closer scrutiny, STRIDE provides a lightweight, optimization-free technique that directly addresses the diversity bottleneck in distilled diffusion models used for real-time generation. The emphasis on aligning perturbations with internal representation geometry rather than simply increasing noise strength is a useful conceptual contribution, and the single-pass constraint makes the approach immediately deployable.

major comments (2)
  1. [§3] §3 (Method): The central design choice—that projecting pink noise onto the top principal components of per-forward-pass activations places perturbations on the 'learned feature manifold' and yields semantically meaningful diversity directions—is asserted without a supporting argument or diagnostic. PCA maximizes variance irrespective of text conditioning; nothing in the formulation prevents the leading components from capturing prompt-irrelevant factors (global illumination, low-level texture), which could still degrade alignment even if average CLIP scores remain stable. This assumption is load-bearing for the claim of controlled variation.
  2. [§4] §4 (Experiments): The Pareto-dominance claim over baselines rests on reported intra-batch similarity and CLIP scores, yet no ablation is described that isolates the contribution of the PCA projection versus the pink-noise spatial coherence alone, nor is the sensitivity to the number of retained components or the single free parameter (perturbation scale) quantified. Without these controls, it is difficult to determine whether the gains are robust or tied to unstated implementation choices.
minor comments (2)
  1. [§3] The notation for the PCA projection and the exact definition of the pink-noise covariance should be stated explicitly with an equation rather than described in prose.
  2. [§4] Figure captions and axis labels in the diversity-fidelity plots would benefit from explicit mention of the exact metrics (e.g., which similarity measure is used for intra-batch diversity).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing our honest assessment and planned revisions.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central design choice—that projecting pink noise onto the top principal components of per-forward-pass activations places perturbations on the 'learned feature manifold' and yields semantically meaningful diversity directions—is asserted without a supporting argument or diagnostic. PCA maximizes variance irrespective of text conditioning; nothing in the formulation prevents the leading components from capturing prompt-irrelevant factors (global illumination, low-level texture), which could still degrade alignment even if average CLIP scores remain stable. This assumption is load-bearing for the claim of controlled variation.

    Authors: We appreciate the referee highlighting the need for stronger justification here. The PCA is computed on activations from the specific conditioned forward pass for each prompt, so the resulting components reflect variance directions within the prompt-dependent feature distribution rather than an unconditional global basis. This is a key distinction from standard PCA on unconditioned data. Our experiments show that this yields diversity gains with only minimal CLIP-score impact, suggesting that prompt-irrelevant factors (if present in lower components) do not dominate the top directions used. We acknowledge that the original submission could have included more explicit discussion or a diagnostic (e.g., correlation of PC directions with semantic attributes). We will revise §3 to elaborate on this per-prompt conditioning rationale and add a supporting analysis or visualization in the appendix demonstrating the semantic nature of the top components. revision: partial

  2. Referee: [§4] §4 (Experiments): The Pareto-dominance claim over baselines rests on reported intra-batch similarity and CLIP scores, yet no ablation is described that isolates the contribution of the PCA projection versus the pink-noise spatial coherence alone, nor is the sensitivity to the number of retained components or the single free parameter (perturbation scale) quantified. Without these controls, it is difficult to determine whether the gains are robust or tied to unstated implementation choices.

    Authors: We agree that these ablations and sensitivity analyses would strengthen the experimental section and help isolate the sources of improvement. The current results emphasize end-to-end comparisons across benchmarks, but we will add the requested controls in the revision: (i) an ablation of PCA-projected pink noise versus pink noise without the PCA step, (ii) results for varying numbers of retained principal components, and (iii) a sweep over the perturbation scale parameter. These will be reported in §4 with additional details in the supplementary material to confirm robustness of the Pareto dominance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained with external validation

full rationale

The paper's core proposal is a training-free single-pass procedure: compute PCA on intermediate transformer activations from the current forward pass, project spatially coherent pink noise onto those principal components, and add the result as a perturbation. This is motivated by an insight about feature geometry but does not define any quantity in terms of itself, rename a fitted parameter as a prediction, or rely on a load-bearing self-citation whose validity is internal to the present work. The claim that the resulting perturbations produce controlled, meaningful diversity is supported by comparative experiments (CLIP scores, intra-batch similarity, Pareto dominance on multiple benchmarks) rather than following tautologically from the construction. The assumption that top PCs align with semantically useful directions is acknowledged as empirical and is not smuggled in via prior self-work or uniqueness theorems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only review limits visibility into exact parameters; the approach assumes PCA directions are meaningful for diversity and that pink noise preserves spatial coherence, with likely unstated choices for perturbation scale.

free parameters (1)
  • perturbation scale
    The strength of the injected noise is almost certainly chosen or tuned to balance diversity and quality, though not quantified in the abstract.
axioms (1)
  • domain assumption: Principal components of intermediate activations capture directions that allow controlled and meaningful diversity injection
    This is the core design choice stated in the abstract as the reason naive perturbation fails while STRIDE succeeds.

pith-pipeline@v0.9.0 · 5601 in / 1310 out tokens · 62165 ms · 2026-05-13T02:44:04.895784+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 6 internal anchors

  1. [1]

    Self-rectifying diffusion sampling with perturbed-attention guidance

    Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin, and Seungryong Kim. Self-rectifying diffusion sampling with perturbed-attention guidance. In European Conference on Computer Vision, pages 1–17. Springer, 2024

  2. [2]

    Where and how to perturb: On the design of perturbation guidance in diffusion and flow models

    Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, and Seungryong Kim. Where and how to perturb: On the design of perturbation guidance in diffusion and flow models. In Advances in Neural Information Processing Systems (NeurIPS), 2025

  3. [3]

    Demystifying MMD GANs

    Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018

  4. [4]

    Announcing Black Forest Labs and the FLUX.1 suite of models

    Black Forest Labs. Announcing Black Forest Labs and the FLUX.1 suite of models. https://bfl.ai/blog/24-08-01-bfl, August 2024. Accessed: 2026-04-27

  5. [5]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16185–16195, 2025

  6. [6]

    Particle guidance: non-iid diverse sampling with diffusion models. arXiv preprint arXiv:2310.13102, 2023

    Gabriele Corso, Yilun Xu, Valentin De Bortoli, Regina Barzilay, and Tommi Jaakkola. Particle guidance: non-iid diverse sampling with diffusion models. arXiv preprint arXiv:2310.13102, 2023

  7. [7]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024

  8. [8]

    The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022

    Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning. arXiv preprint arXiv:2210.02410, 2022

  9. [9]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  10. [10]

    Unleashing diffusion transformers for visual correspondence by modulating massive activations. arXiv preprint arXiv:2505.18584, 2025

    Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, and Weiyao Lin. Unleashing diffusion transformers for visual correspondence by modulating massive activations. arXiv preprint arXiv:2505.18584, 2025

  11. [11]

    Distilling diversity and control in diffusion models

    Rohit Gandikota and David Bau. Distilling diversity and control in diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1304–1313, 2026

  12. [12]

    Concept sliders: Lora adaptors for precise control in diffusion models

    Rohit Gandikota, Joanna Materzyńska, Tingrui Zhou, Antonio Torralba, and David Bau. Concept sliders: Lora adaptors for precise control in diffusion models. In European Conference on Computer Vision, pages 172–188. Springer, 2024

  13. [13]

    Sliderspace: Decomposing the visual capabilities of diffusion models. arXiv preprint arXiv:2502.01639, 2025

    Rohit Gandikota, Zongze Wu, Richard Zhang, David Bau, Eli Shechtman, and Nick Kolkin. Sliderspace: Decomposing the visual capabilities of diffusion models. arXiv preprint arXiv:2502.01639, 2025

  14. [14]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  15. [15]

    Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

    Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

  16. [16]

    It's Never Too Late: Noise Optimization for Collapse Recovery in Trained Diffusion Models

    Anne Harrington, A Koepke, Shyamgopal Karthik, Trevor Darrell, and Alexei A Efros. It’s never too late: Noise optimization for collapse recovery in trained diffusion models. arXiv preprint arXiv:2601.00090, 2025

  17. [17]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7514–7528, 2021

  18. [18]

    Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention

    Susung Hong. Smoothed energy guidance: Guiding diffusion models with reduced energy curvature of attention. In Advances in Neural Information Processing Systems (NeurIPS), 2024

  19. [19]

    Improving sample quality of diffusion models using self-attention guidance

    Susung Hong, Gyuseong Lee, Wooseok Jang, and Seungryong Kim. Improving sample quality of diffusion models using self-attention guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7462–7471, 2023

  20. [20]

    Sparke: Scalable prompt-aware diversity guidance in diffusion models via rke score. arXiv preprint arXiv:2506.10173, 2025

    Mohammad Jalali, Haoyu Lei, Amin Gohari, and Farzan Farnia. Sparke: Scalable prompt-aware diversity guidance in diffusion models via rke score. arXiv preprint arXiv:2506.10173, 2025

  21. [21]

    No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831, 2025

    Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, and Jingdong Wang. No other representation component is needed: Diffusion transformers can provide representation guidance by themselves. arXiv preprint arXiv:2505.02831, 2025

  22. [22]

    Shielded diffusion: Generating novel and diverse images using sparse repellency. arXiv preprint arXiv:2410.06025, 2024

    Michael Kirchhof, James Thornton, Louis Béthune, Pierre Ablin, Eugene Ndiaye, and Marco Cuturi. Shielded diffusion: Generating novel and diverse images using sparse repellency. arXiv preprint arXiv:2410.06025, 2024

  23. [23]

    Diffusion models already have a semantic latent space

    Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. In International Conference on Learning Representations (ICLR), 2023

  24. [24]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. Advances in Neural Information Processing Systems, 37:122458–122483, 2024

  25. [25]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014

  26. [26]

    Instaflow: One step is enough for high-quality diffusion-based text-to-image generation

    Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, et al. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023

  27. [27]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  28. [28]

    Training-free generation of diverse and high-fidelity images via prompt semantic space optimization. arXiv preprint arXiv:2511.19811, 2025

    Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, and Georgios Tzimiropoulos. Training-free generation of diverse and high-fidelity images via prompt semantic space optimization. arXiv preprint arXiv:2511.19811, 2025

  29. [29]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  30. [30]

    Scaling group inference for diverse and high-quality generation

    Gaurav Parmar, Or Patashnik, Daniil Ostashev, Kuan-Chieh Wang, Kfir Aberman, Srinivasa Narasimhan, and Jun-Yan Zhu. Scaling group inference for diverse and high-quality generation. arXiv preprint arXiv:2508.15773, 2025

  31. [31]

    Token perturbation guidance for diffusion models. arXiv preprint arXiv:2506.10036, 2025

    Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, and Babak Taati. Token perturbation guidance for diffusion models. arXiv preprint arXiv:2506.10036, 2025

  32. [32]

    Cads: Unleashing the diversity of diffusion models through condition-annealed sampling. arXiv preprint arXiv:2310.17347, 2023

    Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, and Romann M Weber. Cads: Unleashing the diversity of diffusion models through condition-annealed sampling. arXiv preprint arXiv:2310.17347, 2023

  33. [33]

    Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  34. [34]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  35. [35]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024

  36. [36]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

  37. [37]

    Introducing Stable Diffusion 3.5

    Stability AI. Introducing Stable Diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5, October 2024. Accessed: 2026-04-27

  38. [38]

    Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139, 2026

    Tianhe Wu, Ruibin Li, Lei Zhang, and Kede Ma. Diversity-preserved distribution matching distillation for fast visual synthesis. arXiv preprint arXiv:2602.03139, 2026

  39. [39]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023

  40. [40]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  41. [41]

    Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022