Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

Gang Dai; Guohao Chen; Shuaicheng Niu; YiMing Xia; Yining Huang

arxiv: 2605.21907 · v1 · pith:HY6L4YSJnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 07:41 UTC · model grok-4.3

classification 💻 cs.CV

keywords diffusion models · test-time scaling · reward-guided optimization · sparse scaling · PCA curvature · image generation · denoising trajectory

The pith

Reward-guided noise optimization and sparse PCA-based scaling improve diffusion model image generation at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome two limitations of current test-time scaling for diffusion models: reliance on a fixed noise pool and inflexible exploration along the denoising path. It introduces RTS (Reward-guided Trajectory Scaling), which uses reward signals to steer noise choices toward better regions and applies sparse scaling, using curvature analysis to select only the most important denoising steps. This combination aims to produce higher-fidelity images from the same base model without retraining. A sympathetic reader would care because the approach promises measurable quality gains through smarter use of existing computation rather than larger models. The abstract reports gains over baselines of 15.6% on GenEval score and 60.4% on ImageReward score.

Core claim

RTS facilitates the synthesis of refined, high-fidelity images via a reward-guided noise optimization strategy to actively direct the search towards promising regions and a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space.
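
To make the first component concrete, here is a minimal sketch of a reward-guided search over a noise vector. It assumes a simple greedy perturbation scheme; the paper's actual optimizer is not described on this page, so `reward_guided_noise_search`, `toy_reward`, `target`, and every hyperparameter are hypothetical. In a real pipeline the reward would score the image generated from the noise, not the noise vector itself.

```python
import random


def reward_guided_noise_search(reward_fn, dim, n_iters=30, n_candidates=8,
                               sigma=0.3, seed=0):
    """Greedy search over a noise vector, keeping perturbations that raise the reward.

    Illustrative stand-in only; this is not the optimizer used in the paper.
    """
    rng = random.Random(seed)
    best = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # initial Gaussian noise
    best_reward = reward_fn(best)
    history = [best_reward]
    for _ in range(n_iters):
        for _ in range(n_candidates):
            # Propose a nearby noise vector around the current best.
            candidate = [z + sigma * rng.gauss(0.0, 1.0) for z in best]
            r = reward_fn(candidate)
            if r > best_reward:  # accept only improvements
                best, best_reward = candidate, r
        history.append(best_reward)
    return best, best_reward, history


# Toy reward (assumption): higher when the noise is closer to a fixed target.
# A real reward model would score the decoded image instead.
target = [0.5, -1.0, 2.0, 0.0]


def toy_reward(z):
    return -sum((a - b) ** 2 for a, b in zip(z, target))


best, best_reward, history = reward_guided_noise_search(toy_reward, dim=len(target))
print(f"initial reward {history[0]:.3f} -> final reward {best_reward:.3f}")
```

Because only improving candidates are accepted, the recorded reward never decreases across iterations. That property belongs to this greedy sketch and says nothing about how RTS itself behaves.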

What carries the argument

Reward-guided noise optimization strategy that directs search to promising regions, together with a sparse test-time scaling framework and PCA-driven curvature analysis to compress the search space by prioritizing key denoising steps.
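
The key-step selection can be sketched as follows, under stated assumptions. The trajectory is projected onto its top principal components via SVD, a local curvature value is computed at each projected point, and the top-k steps are kept. The paper's exact curvature function is not given on this page, so the turning angle between consecutive segments is used here as a stand-in; the function names and the synthetic trajectory are hypothetical.

```python
import numpy as np


def pca_project(traj, n_components=2):
    """Project a (T, D) trajectory onto its top principal components via SVD."""
    centered = traj - traj.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # centered = U S V^T
    return centered @ vt[:n_components].T  # shape (T, n_components)


def turning_angle_curvature(points):
    """Turning angle (radians) at each interior point; endpoints get zero.

    Assumption: this discrete measure stands in for the paper's curvature f(p_l).
    """
    curvature = np.zeros(len(points))
    for l in range(1, len(points) - 1):
        a = points[l] - points[l - 1]
        b = points[l + 1] - points[l]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom > 1e-12:  # skip degenerate (zero-length) segments
            cos_angle = np.clip(np.dot(a, b) / denom, -1.0, 1.0)
            curvature[l] = np.arccos(cos_angle)
    return curvature


def select_key_steps(traj, k, n_components=2):
    """Return the indices of the k steps with the largest curvature, sorted by step."""
    projected = pca_project(traj, n_components)
    curvature = turning_angle_curvature(projected)
    top = np.argsort(curvature)[::-1][:k]
    return sorted(int(i) for i in top)


# Synthetic 4-D trajectory: straight along axis 0, then a right-angle turn at step 5.
steps = np.arange(11)
traj = np.zeros((11, 4))
traj[:6, 0] = steps[:6]        # steps 0..5 move along axis 0
traj[6:, 0] = 5.0
traj[6:, 1] = steps[6:] - 5    # steps 6..10 move along axis 1

key_steps = select_key_steps(traj, k=1)
print(key_steps)
```

On this synthetic path the only bend is at step 5, so that step is selected. Real denoising trajectories live in much higher dimensions, and the number of retained components and the value of k would need tuning.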

Load-bearing premise

The reward model used for guiding noise optimization accurately identifies promising regions in the denoising trajectory without introducing bias or requiring extensive tuning.

What would settle it

Generating images with the same diffusion model but replacing the reward signal with random scores or dropping the PCA curvature selection entirely would show whether the reported score gains vanish.
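
The logic of the random-score control can be illustrated with a toy best-of-N selection experiment. This is not RTS and uses no diffusion model: each candidate is reduced to a hidden "true quality" number, and the two scorers (`informative_scorer`, `random_scorer`) are hypothetical stand-ins for a useful reward model and for the random-score ablation.

```python
import random
import statistics


def mean_quality_of_selected(scorer, n_candidates=8, n_trials=2000, seed=0):
    """Average true quality of the candidate that `scorer` ranks highest."""
    rng = random.Random(seed)
    picked = []
    for _ in range(n_trials):
        # Each candidate is represented only by its (hidden) true quality.
        qualities = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
        scores = [scorer(q, rng) for q in qualities]
        best_index = max(range(n_candidates), key=scores.__getitem__)
        picked.append(qualities[best_index])
    return statistics.mean(picked)


def informative_scorer(quality, rng):
    # Noisy but correlated with true quality (stand-in for a useful reward model).
    return quality + 0.5 * rng.gauss(0.0, 1.0)


def random_scorer(quality, rng):
    # Ignores quality entirely (the random-score control).
    return rng.random()


guided = mean_quality_of_selected(informative_scorer)
control = mean_quality_of_selected(random_scorer)
print(f"informative scorer: {guided:.2f}, random scorer: {control:.2f}")
```

In this toy setting, selection with the informative scorer raises mean quality well above zero, while selection with random scores leaves it near the population mean. If the paper's gains persisted under a comparable control, the reward signal would not be what drives them.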

Figures

Figures reproduced from arXiv: 2605.21907 by Gang Dai, Guohao Chen, Shuaicheng Niu, YiMing Xia, Yining Huang.

**Figure 3.** PCA-based key-step selection across various denoising paths. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

Excerpt from the surrounding paper text (truncated at both ends): "... dynamics that can be safely skipped with negligible error. To systematically pinpoint these critical transitions, we quantify the local curvature f(p_l) of the path at each projected point. By ranking these curvature values, we define a sparse set of key inflection points as P_key = Top-k ..."

**Figure 4.** Visualizations of text-to-image generation. Building on FLUX and SD v3 as foundation ... [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

**Figure 5.** 3D PCA visualization of the reward-guided noise search process. Numbers annotated next ... [PITH_FULL_IMAGE:figures/full_fig_p014_5.png]

**Figure 6.** Visualizations of our ablation results on GenEval prompts. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png]

**Figure 7.** More visualizations of our scaling results compared with the base model and the SOTA ... [PITH_FULL_IMAGE:figures/full_fig_p016_7.png]

Original abstract

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show our approach outperforms baselines by 15.6% across GenEval Score, and a 60.4% enhancement in ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RTS adds reward-guided noise optimization and sparse PCA scaling to test-time diffusion, but the large ImageReward gains need checking for circularity with the guiding reward model.

This paper proposes RTS to improve diffusion image generation at test time. It actively guides noise selection with a reward signal and uses sparse scaling plus PCA curvature analysis to focus on key denoising steps, reporting gains of 15.6% on GenEval and 60.4% on ImageReward to claim SOTA. It moves past the static noise pools of prior TTS work by making the search directed and compressed, which is a reasonable practical step if the details check out.

The approach does well in targeting the inflexibility of existing test-time methods and in offering a guideline for diffusion-specific architectures without retraining. The two innovations are clearly stated and tied to the denoising trajectory, which gives the work a focused feel rather than that of a broad overhaul.

On the soft spots, the evaluation setup looks thin from the abstract, with no mention of run counts, variance, or statistical tests, so the SOTA claim is hard to assess until the full experiments are reviewed. The bigger issue is the potential circularity that the stress-test note flags: if the reward model steering the optimization is the same as, or closely related to, the ImageReward scorer, the 60.4% jump could partly reflect direct optimization for that metric's quirks instead of broader quality gains. The paper should state the exact reward used and show results on held-out metrics or human preference studies to address this. Assuming the full text has the math and code details, the core argument holds up as an incremental but useful extension, with no load-bearing flaw apparent from the abstract.

This is for CV researchers working on generative models and test-time scaling who want concrete ways to boost existing diffusion outputs. A reader in that area would get value from the method and the reported numbers, even with the evaluation questions. It deserves peer review to verify the independence of the reward signal and the experimental controls.

Referee Report

3 major / 1 minor

Summary. The paper introduces RTS (Reward-guided Trajectory Scaling), a test-time scaling method for diffusion models. It proposes two innovations: a reward-guided noise optimization strategy to steer the denoising trajectory toward promising regions, and a sparse scaling framework augmented by PCA-driven curvature analysis to identify and prioritize key intermediate denoising steps, thereby compressing the search space. The central empirical claim is that RTS outperforms baselines by 15.6% on GenEval Score and 60.4% on ImageReward, establishing a new SOTA and providing practical guidelines for diffusion-specific test-time scaling.

Significance. If the performance gains are shown to be robust and independent of evaluation circularity, the work could meaningfully advance efficient test-time compute allocation for diffusion models. The combination of guided optimization with sparse, curvature-informed scaling offers a concrete mechanism for reducing search overhead while improving fidelity, which may generalize beyond the reported architectures and supply actionable heuristics for practitioners.

major comments (3)

[Abstract] The 60.4% ImageReward gain is load-bearing for the SOTA claim, yet the method description must explicitly clarify whether the reward model used to guide noise optimization is identical to (or derived from) the ImageReward metric used for final scoring. If they coincide, the optimization can directly exploit metric idiosyncrasies, rendering the reported delta non-independent; an ablation or held-out reward variant is required to substantiate the result.
[Experiments] The abstract reports 15.6% GenEval and 60.4% ImageReward gains without disclosing baselines, sample counts, run statistics, or controls for post-hoc tuning. Full experimental details, including variance across seeds, the exact comparison protocol, and any hyperparameter search, are necessary to verify that the outperformance is not an artifact of selective reporting.
[Method] Sparse scaling and PCA curvature: the claim that PCA-driven curvature analysis effectively compresses the denoising space is central to the efficiency argument, but the manuscript must supply the precise formulation (e.g., how curvature is computed from the trajectory and how the top-k steps are selected) together with pseudocode or an equation reference to permit reproduction.
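
The run statistics requested in the [Experiments] comment could be reported along the following lines. This is a minimal sketch: the per-seed scores are placeholder numbers for illustration only and are not results from the paper, and the helper names are hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def relative_gain_pct(baseline_mean, method_mean):
    """Relative improvement in percent; only meaningful for a positive baseline."""
    if baseline_mean <= 0:
        raise ValueError("relative gain is ill-defined for a non-positive baseline")
    return 100.0 * (method_mean - baseline_mean) / baseline_mean


# Placeholder per-seed scores (NOT from the paper), one value per random seed.
baseline_scores = [0.50, 0.52, 0.48, 0.51, 0.49]
method_scores = [0.58, 0.60, 0.57, 0.59, 0.56]

base_mean, base_std = summarize(baseline_scores)
method_mean, method_std = summarize(method_scores)
gain = relative_gain_pct(base_mean, method_mean)

print(f"baseline: {base_mean:.3f} +/- {base_std:.3f}")
print(f"method:   {method_mean:.3f} +/- {method_std:.3f}")
print(f"relative gain: {gain:.1f}%")
```

Reporting the spread alongside the mean lets a reader judge whether a headline percentage exceeds seed-to-seed variation. The guard on the baseline matters because a relative percentage is hard to interpret when the baseline score is near zero or negative.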

minor comments (1)

[Abstract] The two core innovations are listed but could be named more explicitly (e.g., “reward-guided noise optimization” and “PCA-driven sparse scaling”) to improve immediate readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications, additional experiments, and methodological details where appropriate.

Referee: [Abstract] The 60.4% ImageReward gain is load-bearing for the SOTA claim, yet the method description must explicitly clarify whether the reward model used to guide noise optimization is identical to (or derived from) the ImageReward metric used for final scoring. If they coincide, the optimization can directly exploit metric idiosyncrasies, rendering the reported delta non-independent; an ablation or held-out reward variant is required to substantiate the result.

Authors: We thank the referee for raising this critical point regarding potential evaluation circularity. The reward model used to guide noise optimization during test-time scaling is a distinct preference model trained on a separate human preference dataset and is not identical to or directly derived from the ImageReward metric employed for final evaluation. To further substantiate independence, we have added a new ablation study in the revised manuscript that employs a held-out reward model variant for guidance; the performance gains remain consistent, confirming that the reported improvements are not an artifact of metric overlap.
Revision: yes

Referee: [Experiments] The abstract reports 15.6% GenEval and 60.4% ImageReward gains without disclosing baselines, sample counts, run statistics, or controls for post-hoc tuning. Full experimental details, including variance across seeds, the exact comparison protocol, and any hyperparameter search, are necessary to verify that the outperformance is not an artifact of selective reporting.

Authors: We agree that full transparency in experimental reporting is essential. In the revised manuscript, we have substantially expanded the Experiments section to explicitly list all baselines, the total number of samples evaluated, run statistics including mean and standard deviation across multiple random seeds, the precise comparison protocol, and details of the hyperparameter search procedure. These additions ensure the results can be independently verified and rule out concerns about selective reporting or post-hoc tuning.
Revision: yes

Referee: [Method] Sparse scaling and PCA curvature: the claim that PCA-driven curvature analysis effectively compresses the denoising space is central to the efficiency argument, but the manuscript must supply the precise formulation (e.g., how curvature is computed from the trajectory and how the top-k steps are selected) together with pseudocode or an equation reference to permit reproduction.

Authors: We appreciate the referee's request for precise methodological details to support reproducibility. We have added the exact mathematical formulation for PCA-based curvature computation from the denoising trajectory (including eigenvalue decomposition and curvature metric derivation), the criterion for selecting top-k steps, and a reference to the relevant equations. Pseudocode for the full sparse scaling procedure has also been included in the revised Method section.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper introduces RTS, a reward-guided trajectory optimization method with sparse scaling and PCA curvature analysis for test-time diffusion scaling. Its central claims are empirical (gains of 15.6% on GenEval and 60.4% on ImageReward) and rest on standard external metrics rather than on any closed mathematical derivation. No equations, ansatzes, or uniqueness theorems are presented that reduce by construction to fitted parameters or self-citations. A reward model guides the optimization, but the reported scores come from separate, publicly defined benchmarks (GenEval, ImageReward), and the abstract gives no evidence that the guiding reward is identical to the evaluation metric, which is the condition that would force the result. The derivation chain is therefore self-contained against external evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms or invented entities are described. The method implicitly assumes a reliable reward signal and meaningful curvature in denoising trajectories.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: reward-guided noise optimization strategy... coarse-to-fine alternating mechanism... PCA-driven curvature analysis... sparse set of key inflection points

IndisputableMonolith/Foundation/AlexanderDuality · alexander_duality_circle_linking · unclear
Paper passage: PCA-driven sparse step selection... X=UΣV^T... Top-k curvature values

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv, 2023.

[2] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv, 2025.

[3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

[4] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv, 2024.

[5] Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, and Mingkui Tan. Multi-marginal wasserstein gan. Advances in Neural Information Processing Systems, 32, 2019.

[6] Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, and Sijia Liu. Deepzero: Scaling up zeroth-order optimization for deep model training. arXiv, 2023.

[7] Shuyu Cheng, Guoqiang Wu, and Jun Zhu. On the convergence of prior-guided zeroth-order optimization algorithms. Advances in Neural Information Processing Systems, 34:14620–14631,

[8] Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv, 2023.

[9] Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Conference on Empirical Methods in Natural Language Processing, pages 3369–3391, 2022.

[10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[11] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning,

[12] Bin Fu, Fanghua Yu, Anran Liu, Zixuan Wang, Jie Wen, Junjun He, and Yu Qiao. Generate like experts: multi-stage font generation by incorporating font transfer process into diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6892–6901, 2024.

[13] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

[14] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.

[15] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv, 2022.

[16] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[17] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv, 2022.

[18] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.

[19] Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred O Hero III, and Pramod K Varshney. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 37(5):43–54,

[20] Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse mezo: Less parameters for better performance in zeroth-order llm fine-tuning. arXiv, 2024.

[21] Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, and James T Kwok. Implicit concept removal of diffusion models. In European Conference on Computer Vision, pages 457–473, 2024.

[22] Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Scaling inference time compute for diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2523–2534, 2025.

[23] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.

[24] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv,

[25] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.

[26] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv, 2021.

[27] Changdae Oh, Hyeji Hwang, Hee-young Lee, YongTaek Lim, Geunyoung Jung, Jiyoung Jung, Hosik Choi, and Kyungwoo Song. Blackvip: Black-box visual prompting for robust transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24224–24235, 2023.

[28] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831, 2021.

[29] Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search. Advances in Neural Information Processing Systems, pages 1–27, 2025.

[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[31] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.

[32] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2024.

[33] Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. In International Conference on Machine Learning, pages 1–27, 2025.

[34] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv, 2024.

[35] Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuan-Jing Huang, and Xipeng Qiu. Bbtv2: Towards a gradient-free future with large language models. In Conference on Empirical Methods in Natural Language Processing, pages 3916–3930, 2022.

[36] Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, pages 20841–20855, 2022.

[37] Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.

[38] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. An empirical analysis of compute-optimal inference for problem-solving with language models, 2024. arXiv, 2024.

[39] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv, 2024.

[40] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

[41] Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes. Advances in Neural Information Processing Systems, 36:76806–76838, 2023.

[42] Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, and Xinguo Liu. Compass: Enhancing spatial understanding in text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, pages 15253–15265, 2025.

[1] [1]

Gpt-4 technical report.arXiv, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv, 2023. 1

work page 2023

[2] [2]

Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv, 2025. 2

work page 2025

[3] [3]

Improving image generation with better captions

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023. 2

work page 2023

[4] [4]

Large language monkeys: Scaling inference compute with repeated sampling.arXiv, 2024

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling.arXiv, 2024. 1, 3

work page 2024

[5] [5]

Multi- marginal wasserstein gan.Advances in Neural Information Processing Systems, 32, 2019

Jiezhang Cao, Langyuan Mo, Yifan Zhang, Kui Jia, Chunhua Shen, and Mingkui Tan. Multi- marginal wasserstein gan.Advances in Neural Information Processing Systems, 32, 2019. 2

work page 2019

[6] [6]

Deepzero: Scaling up zeroth-order optimization for deep model training.arXiv, 2023

Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, and Sijia Liu. Deepzero: Scaling up zeroth-order optimization for deep model training.arXiv, 2023. 13

work page 2023

[7] [7]

On the convergence of prior-guided zeroth-order optimization algorithms.Advances in Neural Information Processing Systems, 34:14620–14631,

Shuyu Cheng, Guoqiang Wu, and Jun Zhu. On the convergence of prior-guided zeroth-order optimization algorithms.Advances in Neural Information Processing Systems, 34:14620–14631,

[8]

Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv, 2023.

[9]

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. Rlprompt: Optimizing discrete text prompts with reinforcement learning. In Conference on Empirical Methods in Natural Language Processing, pages 3369–3391, 2022.

[10]

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.

[11]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In International Conference on Machine Learning,

[12]

Bin Fu, Fanghua Yu, Anran Liu, Zixuan Wang, Jie Wen, Junjun He, and Yu Qiao. Generate like experts: multi-stage font generation by incorporating font transfer process into diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6892–6901, 2024.

[13]

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

[14]

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.

[15]

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv, 2022.

[16]

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

[17]

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv, 2022.

[18]

Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.

[19]

Sijia Liu, Pin-Yu Chen, Bhavya Kailkhura, Gaoyuan Zhang, Alfred O Hero III, and Pramod K Varshney. A primer on zeroth-order optimization in signal processing and machine learning: Principals, recent advances, and applications. IEEE Signal Processing Magazine, 37(5):43–54,

[20]

Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh, and Yang You. Sparse mezo: Less parameters for better performance in zeroth-order llm fine-tuning. arXiv, 2024.

[21]

Zhili Liu, Kai Chen, Yifan Zhang, Jianhua Han, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung, and James T Kwok. Implicit concept removal of diffusion models. In European Conference on Computer Vision, pages 457–473, 2024.

[22]

Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, et al. Scaling inference time compute for diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2523–2534, 2025.

[23]

Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. Advances in Neural Information Processing Systems, 36:53038–53075, 2023.

[24]

Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv,

[25]

Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.

[26]

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv, 2021.

[27]

Changdae Oh, Hyeji Hwang, Hee-young Lee, YongTaek Lim, Geunyoung Jung, Jiyoung Jung, Hosik Choi, and Kyungwoo Song. Blackvip: Black-box visual prompting for robust transfer learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24224–24235, 2023.

[28]

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831, 2021.

[29]

Vignav Ramesh and Morteza Mardani. Test-time scaling of diffusion models via noise trajectory search. Advances in Neural Information Processing Systems, pages 1–27, 2025.

[30]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[31]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.

[32]

Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. Freeu: Free lunch in diffusion u-net. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4733–4743, 2024.

[33]

Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, and Rajesh Ranganath. A general framework for inference-time scaling and steering of diffusion models. In International Conference on Machine Learning, pages 1–27, 2025.

[34]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv, 2024.

[35]

Tianxiang Sun, Zhengfu He, Hong Qian, Yunhua Zhou, Xuan-Jing Huang, and Xipeng Qiu. Bbtv2: Towards a gradient-free future with large language models. In Conference on Empirical Methods in Natural Language Processing, pages 3916–3930, 2022.

[36]

Tianxiang Sun, Yunfan Shao, Hong Qian, Xuanjing Huang, and Xipeng Qiu. Black-box tuning for language-model-as-a-service. In International Conference on Machine Learning, pages 20841–20855, 2022.

[37]

Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, 620(7976):1089–1100, 2023.

[38]

Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. An empirical analysis of compute-optimal inference for problem-solving with language models, 2024. arXiv, 2024.

[39]

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv, 2024.

[40]

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

[41]

Yilun Xu, Mingyang Deng, Xiang Cheng, Yonglong Tian, Ziming Liu, and Tommi Jaakkola. Restart sampling for improving generative processes. Advances in Neural Information Processing Systems, 36:76806–76838, 2023.

[42]

Gaoyang Zhang, Bingtao Fu, Qingnan Fan, Qi Zhang, Runxing Liu, Hong Gu, Huaqi Zhang, and Xinguo Liu. Compass: Enhancing spatial understanding in text-to-image diffusion models. In IEEE/CVF International Conference on Computer Vision, pages 15253–15265, 2025.

Guided Denoising Trajectory Optimization for Test-Time Diffusion Scaling Supplementary Materia...
