STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 02:44 UTC · model grok-4.3
The pith
Projecting perturbations onto the principal components of a diffusion model's activations enables controlled diversity gains in single-step image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STRIDE injects spatially coherent pink noise into intermediate transformer features, projected onto the principal components of the model's own activations. This ensures perturbations lie on the learned feature manifold and enables controlled variation along meaningful directions, improving diversity in one-step and few-step diffusion models without training or iterative refinement.
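To make the mechanism concrete, here is a minimal NumPy sketch of the procedure as the claim describes it: generate spatially coherent 1/f (pink) noise over the token grid, project it onto the top principal components of the current forward pass's activations, and add it to the features. This is an illustration of the idea, not the authors' implementation; names such as `num_pcs` and `scale` are hypothetical stand-ins for the paper's parameters.

```python
# Minimal sketch of PCA-directed pink-noise perturbation (illustrative,
# not the authors' code). `h` holds one forward pass's intermediate
# activations, shaped (tokens, channels); the token grid is assumed square.
import numpy as np

def pink_noise_2d(side, rng):
    """Spatially coherent noise whose amplitude spectrum falls off as 1/f."""
    fy = np.fft.fftfreq(side)[:, None]
    fx = np.fft.fftfreq(side)[None, :]
    f = np.sqrt(fy**2 + fx**2)
    f[0, 0] = 1.0                       # avoid dividing by zero at DC
    spectrum = (rng.standard_normal((side, side))
                + 1j * rng.standard_normal((side, side))) / f
    noise = np.fft.ifft2(spectrum).real
    return (noise - noise.mean()) / noise.std()

def stride_like_perturbation(h, num_pcs=8, scale=0.1, seed=0):
    """Add pink noise restricted to the top-K PC subspace of `h`."""
    rng = np.random.default_rng(seed)
    tokens, channels = h.shape
    side = int(round(np.sqrt(tokens)))
    assert side * side == tokens, "token grid assumed square"
    # Top-K principal directions of this forward pass's activations.
    centered = h - h.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_pcs]                # (num_pcs, channels)
    # Spatially coherent noise over tokens, random mixing across channels.
    spatial = pink_noise_2d(side, rng).reshape(tokens, 1)
    noise = spatial * rng.standard_normal((1, channels))
    # Keep only the component of the noise inside the PC subspace.
    projected = (noise @ basis.T) @ basis
    return h + scale * projected
```

Note that the PCA basis is recomputed from the activations of the current pass, which is what lets the method remain training-free and prompt-dependent.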
What carries the argument
PCA-directed feature perturbation: noise projected onto principal components of model activations to align with learned manifold.
If this is right
- STRIDE reduces intra-batch similarity on COCO, DrawBench, PartiPrompts, and GenEval while maintaining CLIP scores.
- It Pareto-dominates existing training-free baselines on the diversity-fidelity frontier for FLUX.1-schnell and SD3.5 Turbo.
- The method operates in a single forward pass, enabling real-time use (a hook-style sketch follows this list).
- Diversity gains arise from alignment with representation structure rather than increased perturbation strength.
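The single-pass point above can be made concrete: a standard way to perturb intermediate features without extra sampling steps is a forward hook on a transformer block. A minimal PyTorch sketch under that assumption; the hook mechanism is standard, but the block choice and this integration style are assumptions, not the paper's stated implementation.

```python
# Illustrative single-pass injection via a PyTorch forward hook (assumed
# integration, not the paper's code). `perturb` stands in for the
# PCA-projected pink-noise step sketched earlier; the block index is
# hypothetical, and the block is assumed to return a plain tensor.
import torch

def attach_perturbation(block, perturb):
    """Swap one transformer block's output for a perturbed copy."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook overrides the block output.
        return perturb(output)
    return block.register_forward_hook(hook)

# Usage sketch: attach before sampling, remove afterwards.
# handle = attach_perturbation(model.blocks[12], my_perturb)
# images = model(latents, prompt_embeds)  # one forward pass, no refinement
# handle.remove()
```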
Where Pith is reading between the lines
- The same projection approach could be tested on other transformer-based generative models to check for similar diversity benefits.
- Combining STRIDE with existing single-pass techniques might produce further gains in variety without added computation.
- The emphasis on spatial coherence in the noise suggests experiments swapping noise types could clarify which properties drive the gains.
Load-bearing premise
That projecting perturbations onto the principal components of the model's activations will align them with meaningful directions for diversity without introducing artifacts or reducing text alignment.
What would settle it
Running STRIDE on FLUX.1-schnell with the same prompts as the baselines: no statistically significant reduction in intra-batch similarity, or no improvement on the diversity-fidelity frontier, would refute the core claim.
Original abstract
Distilled one-step (T = 1) or few-step (T ≤ 4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces STRIDE, a training-free, single-forward-pass method for boosting sample diversity in one-step and few-step diffusion models (e.g., FLUX.1-schnell, SD3.5 Turbo). It injects spatially coherent pink noise into intermediate transformer features after projecting the noise onto the principal components of the model's own activations, with the goal of placing perturbations along the learned feature manifold to enable controlled, meaningful variation. Experiments on COCO, DrawBench, PartiPrompts, and GenEval report consistent reductions in intra-batch similarity with minimal CLIP-score degradation and Pareto dominance over training-free baselines.
Significance. If the reported gains hold under closer scrutiny, STRIDE provides a lightweight, optimization-free technique that directly addresses the diversity bottleneck in distilled diffusion models used for real-time generation. The emphasis on aligning perturbations with internal representation geometry rather than simply increasing noise strength is a useful conceptual contribution, and the single-pass constraint makes the approach immediately deployable.
major comments (2)
- [§3] §3 (Method): The central design choice—that projecting pink noise onto the top principal components of per-forward-pass activations places perturbations on the 'learned feature manifold' and yields semantically meaningful diversity directions—is asserted without a supporting argument or diagnostic. PCA maximizes variance irrespective of text conditioning; nothing in the formulation prevents the leading components from capturing prompt-irrelevant factors (global illumination, low-level texture), which could still degrade alignment even if average CLIP scores remain stable. This assumption is load-bearing for the claim of controlled variation.
- [§4] §4 (Experiments): The Pareto-dominance claim over baselines rests on reported intra-batch similarity and CLIP scores, yet no ablation is described that isolates the contribution of the PCA projection versus the pink-noise spatial coherence alone, nor is the sensitivity to the number of retained components or the single free parameter (perturbation scale) quantified. Without these controls, it is difficult to determine whether the gains are robust or tied to unstated implementation choices.
minor comments (2)
- [§3] The notation for the PCA projection and the exact definition of the pink-noise covariance should be stated explicitly with an equation rather than described in prose (a plausible form is sketched after this list).
- [§4] Figure captions and axis labels in the diversity-fidelity plots would benefit from explicit mention of the exact metrics (e.g., which similarity measure is used for intra-batch diversity).
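On the first minor comment, the review does not reproduce the paper's notation; a plausible form of the requested equation, assumed here rather than quoted from the paper, with activations H, top-K PC basis U_K, pink noise Ξ, and scale λ:

```latex
% Assumed formalization, not quoted from the paper: H \in \mathbb{R}^{n \times d}
% holds n token activations, U_K \in \mathbb{R}^{d \times K} stacks the top-K
% principal directions of the centered H, \Xi \in \mathbb{R}^{n \times d} is
% spatially coherent noise, and \lambda is the perturbation scale.
\tilde{H} = H + \lambda \, \Xi \, U_K U_K^{\top},
\qquad
\mathbb{E}\bigl[ |\hat{\Xi}(f)|^{2} \bigr] \propto \|f\|^{-1}
```

where the second condition states the pink-noise property: the 2D Fourier transform of the spatial noise field has a 1/f power spectrum.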
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing our honest assessment and planned revisions.
Point-by-point responses
Referee: [§3] §3 (Method): The central design choice—that projecting pink noise onto the top principal components of per-forward-pass activations places perturbations on the 'learned feature manifold' and yields semantically meaningful diversity directions—is asserted without a supporting argument or diagnostic. PCA maximizes variance irrespective of text conditioning; nothing in the formulation prevents the leading components from capturing prompt-irrelevant factors (global illumination, low-level texture), which could still degrade alignment even if average CLIP scores remain stable. This assumption is load-bearing for the claim of controlled variation.
Authors: We appreciate the referee highlighting the need for stronger justification here. The PCA is computed on activations from the specific conditioned forward pass for each prompt, so the resulting components reflect variance directions within the prompt-dependent feature distribution rather than an unconditional global basis. This is a key distinction from standard PCA on unconditioned data. Our experiments show that this yields diversity gains with only minimal CLIP-score impact, suggesting that prompt-irrelevant factors (if present in lower components) do not dominate the top directions used. We acknowledge that the original submission could have included more explicit discussion or a diagnostic (e.g., correlation of PC directions with semantic attributes). We will revise §3 to elaborate on this per-prompt conditioning rationale and add a supporting analysis or visualization in the appendix demonstrating the semantic nature of the top components.
revision: partial
Referee: [§4] §4 (Experiments): The Pareto-dominance claim over baselines rests on reported intra-batch similarity and CLIP scores, yet no ablation is described that isolates the contribution of the PCA projection versus the pink-noise spatial coherence alone, nor is the sensitivity to the number of retained components or the single free parameter (perturbation scale) quantified. Without these controls, it is difficult to determine whether the gains are robust or tied to unstated implementation choices.
Authors: We agree that these ablations and sensitivity analyses would strengthen the experimental section and help isolate the sources of improvement. The current results emphasize end-to-end comparisons across benchmarks, but we will add the requested controls in the revision: (i) an ablation of PCA-projected pink noise versus pink noise without the PCA step, (ii) results for varying numbers of retained principal components, and (iii) a sweep over the perturbation scale parameter. These will be reported in §4 with additional details in the supplementary material to confirm robustness of the Pareto dominance.
revision: yes
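These promised ablations will be read through the diversity metric, which the review never defines. A minimal sketch under the common assumption that InBSim is the mean pairwise cosine similarity of per-prompt image embeddings; the embedding model (e.g. CLIP or DINOv2 features) is an assumption here, not the paper's stated choice.

```python
# Assumed reading of InBSim (intra-batch similarity): mean pairwise cosine
# similarity of embeddings for the images generated from one prompt; lower
# means more diverse.
import numpy as np

def intra_batch_similarity(embeddings: np.ndarray) -> float:
    """embeddings: (batch, dim) image embeddings for a single prompt."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T                          # pairwise cosine matrix
    mask = ~np.eye(len(z), dtype=bool)      # drop self-similarities
    return float(sims[mask].mean())

# Usage sketch: compare the metric with and without the PCA projection
# at matched noise energy, per the ablation promised above.
```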
Circularity Check
No significant circularity; derivation is self-contained with external validation
Full rationale
The paper's core proposal is a training-free single-pass procedure: compute PCA on intermediate transformer activations from the current forward pass, project spatially coherent pink noise onto those principal components, and add the result as a perturbation. This is motivated by an insight about feature geometry but does not define any quantity in terms of itself, rename a fitted parameter as a prediction, or rely on a load-bearing self-citation whose validity is internal to the present work. The claim that the resulting perturbations produce controlled, meaningful diversity is supported by comparative experiments (CLIP scores, intra-batch similarity, Pareto dominance on multiple benchmarks) rather than following tautologically from the construction. The assumption that top PCs align with semantically useful directions is acknowledged as empirical and is not smuggled in via prior self-work or uniqueness theorems.
Axiom & Free-Parameter Ledger
free parameters (1)
- perturbation scale
axioms (1)
- domain assumption: Principal components of intermediate activations capture directions that allow controlled and meaningful diversity injection.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "STRIDE projects pink noise onto the top-K principal components of the model's own activations... ensuring that perturbations lie on the learned feature manifold"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "identical noise energy reduces InBSim by 7.5% when projected onto the model's principal components, but increases it by 3.3% when applied unstructured"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.