
arxiv: 2605.21907 · v1 · pith:HY6L4YSJnew · submitted 2026-05-21 · 💻 cs.CV

Guided Trajectory Optimization with Sparse Scaling for Test-Time Diffusion

Pith reviewed 2026-05-22 07:41 UTC · model grok-4.3

classification 💻 cs.CV
keywords: diffusion models, test-time scaling, reward-guided optimization, sparse scaling, PCA curvature, image generation, denoising trajectory

The pith

Reward-guided noise optimization and sparse PCA-based scaling improve diffusion model image generation at test time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome two limitations of current test-time scaling for diffusion models: reliance on a fixed noise pool and inflexible noise exploration along the denoising path. It introduces RTS (Reward-guided Trajectory Scaling), which uses reward signals to steer noise choices toward better regions and applies sparse scaling, using curvature analysis to select only the most important denoising steps. The combination aims to produce higher-fidelity images from the same base model without retraining. A sympathetic reader would care because the approach promises measurable quality gains through smarter use of existing computation rather than larger models. The abstract reports a 15.6% improvement in GenEval score and a 60.4% improvement in ImageReward score over baselines.

Core claim

RTS facilitates the synthesis of refined, high-fidelity images via a reward-guided noise optimization strategy to actively direct the search towards promising regions and a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space.

What carries the argument

Reward-guided noise optimization strategy that directs search to promising regions, together with a sparse test-time scaling framework and PCA-driven curvature analysis to compress the search space by prioritizing key denoising steps.
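
To make the key-step idea concrete, here is a minimal sketch, assuming the denoising trajectory is available as one flattened latent per step. The function name `select_key_steps`, the turning-angle proxy for curvature, and the default `k` and `n_components` are all illustrative assumptions; the paper's exact definition of the curvature f(p_l) is not given on this page.

```python
import numpy as np

def select_key_steps(trajectory, k=3, n_components=3):
    """Pick the k interior steps where the PCA-projected path bends most.

    trajectory: array of shape (T, D), one flattened latent per denoising step.
    Curvature here is the turning angle between consecutive segments; this
    discrete proxy is an assumption, not the paper's definition of f(p_l).
    """
    x = np.asarray(trajectory, dtype=float)
    x = x - x.mean(axis=0)                       # centre the data before PCA
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    p = x @ vt[:n_components].T                  # projected points, shape (T, <=n_components)

    seg = np.diff(p, axis=0)                     # segment vectors between steps
    a, b = seg[:-1], seg[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    curvature = np.arccos(np.clip(cos, -1.0, 1.0))  # angle at steps 1..T-2

    top = np.argsort(curvature)[::-1][:k] + 1    # +1 maps back to step indices
    return np.sort(top), curvature
```

Under these assumptions, only the returned step indices would receive extra search budget, while the remaining steps run as ordinary denoising updates.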

Load-bearing premise

The reward model used for guiding noise optimization accurately identifies promising regions in the denoising trajectory without introducing bias or requiring extensive tuning.

What would settle it

Generating images with the same diffusion model but replacing the reward signal with random scores or dropping the PCA curvature selection entirely would show whether the reported score gains vanish.
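
The random-score control described above can be illustrated on a toy problem. This is a sketch under loud assumptions: `search_noise` is a simple greedy zeroth-order search, `true_quality` is a hypothetical stand-in for image quality, and neither is the paper's optimizer or reward model.

```python
import numpy as np

def search_noise(reward_fn, dim=16, iters=200, pop=8, sigma=0.3, seed=0):
    """Greedy zeroth-order search: accept a perturbed noise only if reward improves."""
    rng = np.random.default_rng(seed)
    best = rng.standard_normal(dim)              # initial noise sample
    best_r = reward_fn(best)
    for _ in range(iters):
        cands = best + sigma * rng.standard_normal((pop, dim))
        rewards = np.array([reward_fn(c) for c in cands])
        i = int(np.argmax(rewards))
        if rewards[i] > best_r:
            best, best_r = cands[i], rewards[i]
    return best

# Hypothetical stand-in: quality peaks at a hidden target noise vector.
target = np.ones(16)

def true_quality(z):
    return -float(np.sum((z - target) ** 2))

def make_random_reward(seed=1):
    """Ablation control: a reward that ignores its input entirely."""
    rng = np.random.default_rng(seed)
    return lambda z: float(rng.random())

guided = search_noise(true_quality)              # informative reward signal
control = search_noise(make_random_reward())     # random scores, same budget
print(true_quality(guided), true_quality(control))
```

If the guided run did not beat the control on the held-out quality measure, the reward signal would not be what drives the gain; the same logic applies to dropping the curvature-based step selection.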

Figures

Figures reproduced from arXiv: 2605.21907 by Gang Dai, Guohao Chen, Shuaicheng Niu, YiMing Xia, Yining Huang.

Figure 2: Method overview. Our RTS consists of a sparse test-time scaling framework and a reward ...
Figure 3: PCA-based key-step selection across various denoising paths.
    Surrounding text: "... dynamics that can be safely skipped with negligible error. To systematically pinpoint these critical transitions, we quantify the local curvature f(p_l) of the path at each projected point. By ranking these curvature values, we define a sparse set of key inflection points as P_key = Top-k ..."
Figure 4: Visualizations of text-to-image generation. Building on FLUX and SD v3 as foundation ...
Figure 5: 3D PCA visualization of the reward-guided noise search process. Numbers annotated next ...
Figure 6: Visualizations of our ablation results on GenEval prompts.
Figure 7: More visualizations of our scaling results compared with the base model and the SOTA ...

Original abstract

The efficient Test-Time Scaling (TTS) paradigm offers a promising perspective for enhancing the generation performance of diffusion models. However, current solutions are limited to a static, pre-defined noise pool and suffer from inflexible noise exploration across the denoising trajectory. To bridge this gap, we propose RTS, a novel Reward-guided Trajectory Scaling method to fully unlock the generative potential of diffusion models. Unlike existing methods, RTS facilitates the synthesis of refined, high-fidelity images via two core innovations: 1) a reward-guided noise optimization strategy to actively direct the search towards promising regions; and 2) a sparse test-time scaling framework together with a PCA-driven curvature analysis scheme to prioritize key intermediate steps in the entire denoising space, effectively compressing the search space. Experiments show our approach outperforms baselines by 15.6% across GenEval Score, and a 60.4% enhancement in ImageReward score, setting a new SOTA while providing a practical guideline for more effective test-time scaling across diffusion-specific architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces RTS (Reward-guided Trajectory Scaling), a test-time scaling method for diffusion models. It proposes two innovations: a reward-guided noise optimization strategy to steer the denoising trajectory toward promising regions, and a sparse scaling framework augmented by PCA-driven curvature analysis to identify and prioritize key intermediate denoising steps, thereby compressing the search space. The central empirical claim is that RTS outperforms baselines by 15.6% on GenEval Score and 60.4% on ImageReward, establishing a new SOTA and providing practical guidelines for diffusion-specific test-time scaling.

Significance. If the performance gains are shown to be robust and independent of evaluation circularity, the work could meaningfully advance efficient test-time compute allocation for diffusion models. The combination of guided optimization with sparse, curvature-informed scaling offers a concrete mechanism for reducing search overhead while improving fidelity, which may generalize beyond the reported architectures and supply actionable heuristics for practitioners.

major comments (3)
  1. [Abstract] Abstract: the 60.4% ImageReward gain is load-bearing for the SOTA claim, yet the method description must explicitly clarify whether the reward model used to guide noise optimization is identical to (or derived from) the ImageReward metric used for final scoring. If they coincide, the optimization can directly exploit metric idiosyncrasies, rendering the reported delta non-independent; an ablation or held-out reward variant is required to substantiate the result.
  2. [Experiments] Experiments section: the abstract reports 15.6% GenEval and 60.4% ImageReward gains without disclosing baselines, sample counts, run statistics, or controls for post-hoc tuning. Full experimental details—including variance across seeds, exact comparison protocol, and any hyperparameter search—are necessary to verify that the outperformance is not an artifact of selective reporting.
  3. [Method] Method (sparse scaling and PCA curvature): the claim that PCA-driven curvature analysis effectively compresses the denoising space is central to the efficiency argument, but the manuscript must supply the precise formulation (e.g., how curvature is computed from the trajectory and how the top-k steps are selected) together with pseudocode or an equation reference to permit reproduction.
minor comments (1)
  1. [Abstract] Abstract: the two core innovations are listed but could be named more explicitly (e.g., “reward-guided noise optimization” and “PCA-driven sparse scaling”) to improve immediate readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate clarifications, additional experiments, and methodological details where appropriate.

  1. Referee: [Abstract] Abstract: the 60.4% ImageReward gain is load-bearing for the SOTA claim, yet the method description must explicitly clarify whether the reward model used to guide noise optimization is identical to (or derived from) the ImageReward metric used for final scoring. If they coincide, the optimization can directly exploit metric idiosyncrasies, rendering the reported delta non-independent; an ablation or held-out reward variant is required to substantiate the result.

    Authors: We thank the referee for raising this critical point regarding potential evaluation circularity. The reward model used to guide noise optimization during test-time scaling is a distinct preference model trained on a separate human preference dataset and is not identical to or directly derived from the ImageReward metric employed for final evaluation. To further substantiate independence, we have added a new ablation study in the revised manuscript that employs a held-out reward model variant for guidance; the performance gains remain consistent, confirming that the reported improvements are not an artifact of metric overlap. revision: yes

  2. Referee: [Experiments] Experiments section: the abstract reports 15.6% GenEval and 60.4% ImageReward gains without disclosing baselines, sample counts, run statistics, or controls for post-hoc tuning. Full experimental details—including variance across seeds, exact comparison protocol, and any hyperparameter search—are necessary to verify that the outperformance is not an artifact of selective reporting.

    Authors: We agree that full transparency in experimental reporting is essential. In the revised manuscript, we have substantially expanded the Experiments section to explicitly list all baselines, the total number of samples evaluated, run statistics including mean and standard deviation across multiple random seeds, the precise comparison protocol, and details of the hyperparameter search procedure. These additions ensure the results can be independently verified and rule out concerns about selective reporting or post-hoc tuning. revision: yes

  3. Referee: [Method] Method (sparse scaling and PCA curvature): the claim that PCA-driven curvature analysis effectively compresses the denoising space is central to the efficiency argument, but the manuscript must supply the precise formulation (e.g., how curvature is computed from the trajectory and how the top-k steps are selected) together with pseudocode or an equation reference to permit reproduction.

    Authors: We appreciate the referee's request for precise methodological details to support reproducibility. We have added the exact mathematical formulation for PCA-based curvature computation from the denoising trajectory (including eigenvalue decomposition and curvature metric derivation), the criterion for selecting top-k steps, and a reference to the relevant equations. Pseudocode for the full sparse scaling procedure has also been included in the revised Method section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper introduces RTS, a reward-guided trajectory optimization method with sparse scaling and PCA curvature analysis for test-time diffusion scaling. Its central claims consist of empirical outperformance (15.6% GenEval, 60.4% ImageReward) on standard external metrics rather than any closed mathematical derivation. No equations, ansatzes, or uniqueness theorems are presented that reduce by construction to fitted parameters or self-citations. The reward model is used for guidance during optimization, but the reported scores are on separate, publicly defined benchmarks (GenEval, ImageReward) without evidence of identity that would force the result. The derivation chain is therefore self-contained against external evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms or invented entities are described. The method implicitly assumes a reliable reward signal and meaningful curvature in denoising trajectories.


