pith. machine review for the scientific record.

arxiv: 2605.07253 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 Lean theorem links

LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords diffusion models · image generation · noise modulation · efficient sampling · low-frequency components · hypernetworks · distilled diffusion · generative models

The pith

Restricting noise modulation to low-frequency components lets distilled diffusion models match prior image quality with hundreds of times less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LENS as a way to speed up image generation in distilled diffusion models without the usual drop in quality. It rests on the observation that low-frequency noise mostly sets the global structure and overall look of the output image. Instead of modulating noise across the full high-dimensional space, LENS uses a small dedicated network to adjust only those low-frequency parts. This design comes with a training objective derived from a theoretical argument for staying in the low-frequency subspace. The result is image quality on par with heavier methods but with orders-of-magnitude lower FLOPs, far fewer parameters, and much smaller inference overhead.
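The mechanism can be sketched in a few lines. This is an illustrative reading, not the paper's implementation: a DCT stands in for the paper's eigen basis, and a simple scaling lambda stands in for the learned modulation network; `shape_noise`, `d=8`, and the 64×64 size are all hypothetical choices.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; row k is the k-th frequency, row 0 rescaled.
    k, i = np.arange(n)[:, None], np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def shape_noise(noise, d, modulate):
    # Transform to a frequency basis, edit only the d x d low-frequency block,
    # transform back. High-frequency content passes through untouched.
    h, w = noise.shape
    Dh, Dw = dct_matrix(h), dct_matrix(w)
    coeffs = Dh @ noise @ Dw.T
    coeffs[:d, :d] = modulate(coeffs[:d, :d])  # stand-in for the small network
    return Dh.T @ coeffs @ Dw                  # inverse of an orthonormal map

rng = np.random.default_rng(0)
eps = rng.standard_normal((64, 64))
shaped = shape_noise(eps, d=8, modulate=lambda c: 1.1 * c)
# Only d*d = 64 of the 4096 coefficients are touched, so the modulator's
# input and output live in a 64-dimensional subspace, not the full latent.
```

Only the d² low-frequency coefficients ever reach the modulator, which is where the parameter and FLOP savings come from on this reading.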

Core claim

LENS is an efficient noise modulation framework that works in a low-dimensional low-frequency subspace. The authors observe that low-frequency noise components largely control global image structure and visual fidelity, supply a theoretical reason to limit modulation to that subspace, and derive a corresponding training objective. They then train a lightweight standalone network to perform the modulation, yielding competitive image quality together with 400-700× fewer FLOPs, 25-75× fewer parameters, and 10-20× lower inference overhead than earlier hypernetwork or test-time optimization baselines.

What carries the argument

A lightweight standalone network that selectively modulates low-frequency components of the noise inside a reduced eigen subspace.

If this is right

  • Distilled diffusion models become practical for real-time or on-device image generation.
  • The computational cost of amortizing test-time optimization drops enough to support wider deployment.
  • Inference latency falls by an order of magnitude while quality stays competitive.
  • Model storage and memory requirements shrink substantially compared with full-dimensional hypernetworks.
  • The same efficiency pattern can be applied to other distilled generative pipelines that currently rely on high-dimensional noise shaping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-frequency restriction might transfer to video or audio generation where global structure is also carried by lower frequencies.
  • Combining LENS with quantization or pruning could produce even larger efficiency gains on edge hardware.
  • Varying the cutoff frequency or subspace dimension offers a tunable knob for trading quality against speed that future work could optimize.
  • If the low-frequency principle holds across architectures, it could reduce reliance on ever-larger hypernetworks in the broader generative-modeling literature.
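The subspace-dimension knob trades quality for speed steeply because modulator cost tends to scale quadratically with input size. A hypothetical back-of-envelope, not the paper's accounting: the `mlp_flops` model and the 2-layer-MLP assumption are invented for illustration, and the paper's reported 400-700× reflects real architectural details this toy ignores.

```python
# Hypothetical back-of-envelope, not the paper's accounting: if both a
# full-space hypernetwork and a subspace modulator were 2-layer MLPs with
# hidden width proportional to their input size, cost would scale
# quadratically in that size.
def mlp_flops(dim, width_mult=4):
    hidden = width_mult * dim
    return 2 * dim * hidden + 2 * hidden * dim  # two dense layers, 2 FLOPs/MAC

full = mlp_flops(4 * 64 * 64)  # D = 16384, an SD-style latent flattened
low = mlp_flops(64)            # d = 64 low-frequency coefficients (illustrative)
print(full // low)             # 65536, i.e. (D / d) ** 2
```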

Load-bearing premise

Low-frequency components of the noise largely determine the global structure and visual fidelity of generated images.
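This premise can be illustrated, though not proven, with a toy signal: a spatially smooth image concentrates nearly all its energy in the lowest DCT frequencies, so edits confined to that block steer global structure. The Gaussian-blob image, the 64×64 size, and the 8×8 cutoff below are all illustrative choices, not the paper's setup.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis; row k is the k-th frequency, row 0 rescaled.
    k, i = np.arange(n)[:, None], np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

n = 64
y, x = np.mgrid[0:n, 0:n]
# A smooth blob: global structure, no fine texture.
img = np.exp(-((x - 31.5) ** 2 + (y - 31.5) ** 2) / (2 * 10.0 ** 2))

D = dct_matrix(n)
C = D @ img @ D.T
frac = np.sum(C[:8, :8] ** 2) / np.sum(C ** 2)
# For smooth content, the 8x8 low-frequency block holds almost all the energy.
assert frac > 0.95
```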

What would settle it

Measure FID and human preference scores on standard benchmarks when high-frequency noise modulation is added on top of LENS; if quality improves by more than a small margin, the claim that low-frequency modulation alone suffices is weakened.
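The proposed test reduces to a paired comparison with a margin. A sketch of the decision rule only: the score arrays below are synthetic placeholders, and real inputs would be per-prompt FID or human-preference scores from the benchmarks named above.

```python
import numpy as np

# Decision-rule sketch, not a benchmark: given per-prompt quality scores for
# LENS alone vs. LENS plus high-frequency modulation (both synthetic here),
# ask whether the added channel improves quality by more than a small margin.
rng = np.random.default_rng(2)
lens = rng.normal(0.70, 0.05, size=500)                 # placeholder scores
lens_plus_hf = lens + rng.normal(0.002, 0.03, size=500)

margin = 0.01                                           # "small margin"
diff = lens_plus_hf - lens
# Simple bootstrap confidence interval on the mean improvement.
boot = [np.mean(rng.choice(diff, size=len(diff))) for _ in range(2000)]
lo_ci, hi_ci = np.percentile(boot, [2.5, 97.5])
claim_weakened = lo_ci > margin  # improvement clearly exceeds the margin
print(claim_weakened)
```

With these placeholder scores the improvement is indistinguishable from the margin, so the low-frequency-suffices claim would stand; real data could of course go the other way.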

Figures

Figures reproduced from arXiv: 2605.07253 by Haewon Jeon, Si-Hyeon Lee.

Figure 1: Performance vs. inference time relative to unmodified noise input. The x-axis shows inference time normalized by the case where the input noise is sampled directly from a standard Gaussian distribution without any modification, plotted in log scale. The y-axis shows the increase in GenEval2 [15] mean over this reference. Hence, the point (1, 0) corresponds to this unmodified noise setting. We compare three…

Figure 2: Overview of the proposed Low-frequency Eigen Noise Shaping (LENS) framework. The model operates on low-frequency noise coefficients that capture structurally meaningful components. These coefficients are processed via self-attention, while the text prompt is incorporated through cross-attention for conditioning, and the network predicts coefficient updates.

Figure 3: Reward gradient energy distribution in the PCA basis. (a) Normalized energy spectrum…

Figure 4: One-step generation results using SD-Turbo and SANA-Sprint. Starting from the same…

Figure 5: Effect of the number of low-frequency coefficients…

Figure 6: Effect of the patch size s on GenEval2 mean.
Original abstract

Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700$\times$, model parameters by 25-75$\times$, and inference-time overhead by 10-20$\times$ compared to prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LENS (Low-Frequency Eigen Noise Shaping), a framework for efficient noise modulation in distilled diffusion models. It restricts modulation to a low-dimensional low-frequency eigen-subspace based on the observation that low-frequency noise components largely determine global structure and visual fidelity. The authors provide a theoretical justification for this restriction, derive a principled training objective, and employ a lightweight standalone network for selective modulation. Experiments claim that LENS achieves competitive image quality while reducing FLOPs by 400-700×, model parameters by 25-75×, and inference-time overhead by 10-20× relative to prior hypernetwork and test-time optimization baselines.

Significance. If the low-frequency subspace restriction and derived objective are rigorously justified without degrading perceptual quality, LENS would represent a substantial advance in amortizing test-time optimization costs for diffusion sampling. The reported efficiency gains could enable practical high-quality generation on edge devices, addressing a key bottleneck in current distilled models.

major comments (3)
  1. [Abstract] The theoretical justification for restricting modulation to the low-frequency eigen-subspace is asserted but not derived in detail; it must explicitly address whether the non-linearity of the U-Net denoiser allows high-frequency noise to be safely omitted without affecting the score function or fine textures, as this directly underpins the claimed 400-700× FLOPs reduction.
  2. [Experiments] The claimed 400-700× FLOPs, 25-75× parameter, and 10-20× overhead reductions require explicit tables comparing against full-space hypernetwork baselines with error bars, dataset-specific frequency analysis, and an ablation on subspace dimensionality to confirm that global metrics do not mask local fidelity losses.
  3. [Method / Training Objective] The principled training objective derived from the low-frequency observation needs to be shown to be independent of fitted parameters in the subspace choice; otherwise the efficiency claims risk circularity with the external frequency-content observation.
minor comments (2)
  1. [Method] Clarify notation for the eigen-subspace construction and how the lightweight modulator is architecturally separated from the main denoiser.
  2. [Introduction] Add missing references to prior frequency-domain analyses in diffusion models to better situate the low-frequency observation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for clarification and expansion, particularly regarding the depth of the theoretical derivation, the rigor of experimental reporting, and the independence of the training objective. We will revise the manuscript to address these points directly while preserving the core contributions of LENS.

Point-by-point responses
  1. Referee: [Abstract] The theoretical justification for restricting modulation to the low-frequency eigen-subspace is asserted but not derived in detail; it must explicitly address whether the non-linearity of the U-Net denoiser allows high-frequency noise to be safely omitted without affecting the score function or fine textures, as this directly underpins the claimed 400-700× FLOPs reduction.

    Authors: We agree that the current presentation of the theoretical justification is high-level and would benefit from greater detail. The manuscript motivates the restriction via the observation that low-frequency noise components dominate global structure, but does not fully derive the implications for the nonlinear U-Net. In the revision we will expand the relevant section with an explicit analysis of the denoiser's frequency response, showing that high-frequency perturbations have limited impact on the score function in the distilled setting and do not materially affect fine textures. This expanded derivation will directly support the efficiency claims. revision: yes

  2. Referee: [Experiments] The claimed 400-700× FLOPs, 25-75× parameter, and 10-20× overhead reductions require explicit tables comparing against full-space hypernetwork baselines with error bars, dataset-specific frequency analysis, and an ablation on subspace dimensionality to confirm that global metrics do not mask local fidelity losses.

    Authors: We acknowledge that the efficiency numbers are currently summarized without the requested supporting tables and analyses. The revised manuscript will include (i) explicit comparison tables against full-space hypernetwork baselines with standard-error bars from repeated runs, (ii) dataset-specific frequency-content breakdowns (CIFAR-10 and ImageNet), and (iii) ablations over subspace dimensionality. These additions will demonstrate that the reported gains hold while local fidelity, measured by both FID and perceptual patch metrics, remains competitive. revision: yes

  3. Referee: [Method / Training Objective] The principled training objective derived from the low-frequency observation needs to be shown to be independent of fitted parameters in the subspace choice; otherwise the efficiency claims risk circularity with the external frequency-content observation.

    Authors: The subspace is fixed a priori via eigen-decomposition of the noise covariance and is not altered by the parameters of the modulation network. The training objective is derived solely from this fixed low-frequency restriction and does not depend on the fitted weights. To eliminate any appearance of circularity we will add a dedicated paragraph and short proof sketch in the Method section clarifying this independence. revision: yes
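The independence argument in this response can be sketched directly: a basis fixed by eigen-decomposition before training cannot depend on the fitted weights, and a modulator that writes only through that basis leaves the orthogonal complement untouched for any parameter value. The covariance source below is synthetic and the 32-dimensional, top-4 setup is illustrative; the paper derives its subspace from noise and reward-gradient statistics.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the statistics that define the subspace.
mix = rng.standard_normal((32, 32))
grads = rng.standard_normal((5000, 32)) @ mix

# Step 1, before any training: eigen-decompose the empirical covariance once.
cov = grads.T @ grads / len(grads)
eigvals, eigvecs = np.linalg.eigh(cov)
U = eigvecs[:, np.argsort(eigvals)[::-1][:4]]  # fixed top-4 basis, never refit

# Step 2, during training: the modulator only sees and edits U.T @ eps.
def modulated(eps, theta):
    z = U.T @ eps                  # 4 coefficients
    return eps + U @ (theta * z)   # update confined to span(U) by construction

eps = rng.standard_normal(32)
out = modulated(eps, theta=0.5)
# Whatever theta training fits, the orthogonal complement is untouched:
assert np.allclose(out - U @ (U.T @ out), eps - U @ (U.T @ eps))
```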

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in external observation and independent theoretical steps.

Full rationale

The paper's chain begins with an external observation on low-frequency noise components (not derived from its own fitted parameters or equations), followed by a claimed theoretical justification for subspace restriction and derivation of a training objective. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The lightweight modulator and efficiency claims rest on this independent foundation rather than reducing to the inputs by construction. This is the most common honest outcome for papers with external motivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that low-frequency noise controls global image structure; no explicit free parameters or invented entities are described in the abstract, though the training objective and subspace dimension are likely chosen or fitted.

axioms (1)
  • Domain assumption: Low-frequency components of the noise largely determine the global structure and visual fidelity of generated images.
    This observation is explicitly stated as the motivation for restricting modulation to the low-frequency subspace.

pith-pipeline@v0.9.0 · 5504 in / 1237 out tokens · 79693 ms · 2026-05-11T01:40:14.290853+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 6 internal anchors

  1. [1] Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Jaewon Min, Minjae Kim, Wooseok Jang, Hyoungwon Cho, Sayak Paul, SeonHwa Kim, Eunju Cha, Kyong Hwan Jin, and Seungryong Kim. A noise is worth diffusion guidance. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=xEWooSOgaz
  2. [2] Yuanhao Ban, Ruochen Wang, Tianyi Zhou, Boqing Gong, Cho-Jui Hsieh, and Minhao Cheng. The crystal ball hypothesis in diffusion models: Anticipating object positions from initial noise. arXiv preprint arXiv:2406.01970, 2024.
  3. [3] Heli Ben-Hamu, Omri Puny, Itai Gat, Brian Karrer, Uriel Singer, and Yaron Lipman. D-Flow: Differentiating through flows for controlled generation. arXiv preprint arXiv:2402.14017, 2024.
  4. [4] Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. SANA-Sprint: One-step diffusion with continuous-time consistency distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16185–16195, 2025.
  5. [5] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
  6. [6] Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. ReNO: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems, 37:125487–125519, 2024.
  7. [7] Luca Eyring, Shyamgopal Karthik, Alexey Dosovitskiy, Nataniel Ruiz, and Zeynep Akata. Noise hypernetworks: Amortizing test-time compute in diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=DbzREoPwmM
  8. [8] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
  9. [9] Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. InitNO: Boosting text-to-image diffusion models via initial noise optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9380–9389, 2024.
  10. [10] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  11. [11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  12. [12] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  13. [13] Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024.
  14. [14] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2I-CompBench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
  15. [15] Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. GenEval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025.
  16. [16] Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan, Thabo Beeler, Supasorn Suwajanakorn, and Siyu Tang. Optimizing diffusion noise can serve as universal motion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1334–1345, 2024.
  17. [17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  18. [18] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.
  19. [19] Shuangqi Li, Hieu Le, Jingyi Xu, and Mathieu Salzmann. Enhancing compositional text-to-image generation with reliable random seeds. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=5BSlakturs
  20. [20] Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, and Heung-Yeung Shum. NoiseAR: Autoregressing initial noise prior for diffusion models. arXiv preprint arXiv:2506.01337, 2025.
  21. [21] Yongzhe Lyu, Yu Wu, Yutian Lin, and Bo Du. IS-Diff: Improving diffusion-based inpainting with better initial seed. arXiv preprint arXiv:2509.11638, 2025.
  22. [22] Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. The lottery ticket hypothesis in denoising: Towards semantic-driven initialization. In European Conference on Computer Vision, pages 93–109. Springer, 2024.
  23. [23] Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, and Yao Zhu. Noise diffusion for enhancing semantic faithfulness in text-to-image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23575–23584, 2025.
  24. [24] Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, and Nicholas J Bryan. DITTO: Diffusion inference-time T-optimization for music generation. arXiv preprint arXiv:2401.12179, 2024.
  25. [25] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  26. [26] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pages 5389–5400. PMLR, 2019.
  27. [27] Yuxi Ren, Xin Xia, Yanzuo Lu, Jiacheng Zhang, Jie Wu, Pan Xie, Xing Wang, and Xuefeng Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems, 37:117340–117362, 2024.
  28. [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  29. [29] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  30. [30] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
  31. [31] Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. In European Conference on Computer Vision, pages 87–103. Springer, 2024.
  32. [32] Vikash Sehwag, Xianghao Kong, Jingtao Li, Michael Spranger, and Lingjuan Lyu. Stretching each dollar: Diffusion training from scratch on a micro-budget. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28596–28608, 2025.
  33. [33] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  34. [34] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.
  35. [35] Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Tuning-free alignment of diffusion models with direct noise optimization. In ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024. URL https://openreview.net/forum?id=Dqpa8rbL39
  36. [36] Masatoshi Uehara, Xingyu Su, Yulai Zhao, Xiner Li, Aviv Regev, Shuiwang Ji, Sergey Levine, and Tommaso Biancalani. Reward-guided iterative refinement in diffusion models at test-time with applications to protein and DNA design. arXiv preprint arXiv:2502.14944, 2025.
  37. [37] Masatoshi Uehara, Yulai Zhao, Chenyu Wang, Xiner Li, Aviv Regev, Sergey Levine, and Tommaso Biancalani. Inference-time alignment in diffusion models with reward-guided generation: Tutorial and review. arXiv preprint arXiv:2501.09685, 2025.
  38. [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  39. [39] Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. End-to-end diffusion latent optimization improves classifier guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7280–7290, 2023.
  40. [40] Qingsong Wang, Zhengchao Wan, Mikhail Belkin, and Yusu Wang. Seeds of structure: Patch PCA reveals universal compositional cues in diffusion models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=EgH5WYB6my
  41. [41] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.
  42. [42] Qingsong Xie, Zhenyi Liao, Zhijie Deng, Haonan Lu, et al. TLCM: Training-efficient latent consistency model for image generation with 2-8 steps. arXiv preprint arXiv:2406.05768, 2024.
  43. [43] Sirui Xie, Zhisheng Xiao, Diederik P Kingma, Tingbo Hou, Ying N Wu, Kevin Murphy, Tim Salimans, Ben Poole, and Ruiqi Gao. EM distillation for one-step diffusion models. Advances in Neural Information Processing Systems, 37:45073–45104, 2024.
  44. [44] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
  45. [45] Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3024–3034. IEEE, 2025.
  46. [46] Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.
  47. [47] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, In So Kweon, and Junmo Kim. Text-to-image diffusion models in generative AI: A survey. arXiv preprint arXiv:2303.07909, 2023.
  48. [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
  49. [49] Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025.