Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models?
Pith reviewed 2026-05-20 19:44 UTC · model grok-4.3
The pith
AdaScope adaptively scopes RL fine-tuning to specific denoising stages in diffusion models, yielding higher quality at lower cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaScope is an RL-enhanced plug-in that perceives structural evolution and semantic consistency across the denoising trajectory to select the single optimal window for RL interventions and to terminate training at the onset of reward saturation. The method rests on the observation that early-stage RL suffers from delayed and mismatched rewards while late-stage RL intensifies overfitting. Theoretical analysis supports that restricting optimization to this adaptive scope produces stronger preference alignment than uniform every-step training.
What carries the argument
AdaScope, a plug-in that adaptively identifies the optimal RL intervention timing by monitoring structural evolution and semantic consistency during denoising and terminates once reward gains saturate.
If this is right
- RL updates become lower-variance and more efficient when applied only after image structures have stabilized.
- Early termination at reward saturation prevents overfitting to local details and reduces unnecessary computation.
- The dual benefit of higher generation quality and lower cost holds across multiple diffusion backbones and reward models.
- Theoretical grounds for the timing choice explain why full-trajectory optimization underperforms selective intervention.
Where Pith is reading between the lines
- The same stage-aware scoping principle could be tested on other long-horizon generative tasks such as video or 3D synthesis.
- Simpler heuristic detectors of structural change might replace the current perception module while preserving most gains.
- The approach suggests a general strategy for reducing RL cost in any sequential decision process that exhibits clear early instability and late saturation.
Load-bearing premise
That structural evolution and semantic consistency during denoising can be reliably perceived to identify the single optimal intervention window and saturation point for RL termination.
What would settle it
A direct comparison experiment in which uniform full-trajectory RL or randomly timed RL is run on the same diffusion backbone and measured for both final image quality metrics and total compute; if either matches or exceeds AdaScope on both axes the adaptive scoping claim is falsified.
Figures
read the original abstract
Despite strong image-generation performance, diffusion models' reconstruction objectives limit alignment with human preferences. RL enables such alignment through explicit rewards. However, most studies apply RL to the full denoising trajectory, making it computationally costly and weakening preference alignment, i.e., doing more but achieving less. We observe that the impact of RL fine-tuning varies significantly across denoising stages. In the early stage, image structures are unstable and distant from the final reward signal. Applying RL at this stage leads to delayed rewards and action-reward mismatching, resulting in high variance and inefficient updates. Conversely, in the later stage, reward gains saturate, and continued training tends to overfit local details, intensifying reward hacking. To tackle these challenges, we propose AdaScope, an RL-enhanced plug-in that improves generation quality while reducing computational cost. Specifically, AdaScope adaptively identifies the optimal intervention timing for RL by perceiving the structural evolution and semantic consistency during denoising, and dynamically terminates training once the denoising converges and reward gains saturate. As a result, it achieves a rare 'dual benefit': a reduction in computational costs alongside a significant performance improvement. We offer theoretical grounds for the design of AdaScope. Compared with state-of-the-art methods, AdaScope improves performance by 66% while cutting computational cost by 59%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaScope, a plug-in for RL fine-tuning of diffusion models that adaptively identifies the optimal denoising stage for RL intervention by perceiving structural evolution and semantic consistency, then terminates training once reward gains saturate. It claims this yields a dual benefit of 66% performance improvement and 59% computational cost reduction versus state-of-the-art methods, with theoretical justification for avoiding high-variance early-stage updates and late-stage reward hacking.
Significance. If the adaptive timing mechanism is shown to be robustly defined and validated, the result would offer a practical advance in efficient preference alignment for diffusion models, reducing unnecessary optimization while improving reward alignment. The reported dual benefit, if free of baseline mismatches or post-hoc selection, would be a notable empirical contribution.
major comments (2)
- [AdaScope description (Section 3)] The central mechanism—perceiving 'structural evolution and semantic consistency' to select a single optimal RL intervention window and saturation point—is described only at a high level in the abstract and method overview. No concrete metrics, thresholds, detection algorithm, or validation against reward alignment/variance are supplied; this directly underpins the claimed 66% gain and 59% cost cut, as misidentification would collapse performance to standard full-trajectory RL.
- [Experimental results (Section 4)] Table 1 and Figure 4 (performance and cost comparisons): the 66% and 59% figures are presented without reported standard deviations across seeds, explicit baseline implementation details, or confirmation that intervention windows were fixed before evaluation rather than tuned post-hoc on the test set.
minor comments (2)
- [Method] Notation for the denoising timestep t and reward saturation criterion is introduced without a clear equation or pseudocode block; adding a short algorithm box would aid reproducibility.
- [Abstract] The abstract claims 'theoretical grounds' but the main text does not explicitly link any derivation or inequality to the adaptive termination rule; a pointer to the relevant paragraph or appendix would clarify.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and experimental rigor that we will address in the revision. We respond to each major comment below.
read point-by-point responses
-
Referee: [AdaScope description (Section 3)] The central mechanism—perceiving 'structural evolution and semantic consistency' to select a single optimal RL intervention window and saturation point—is described only at a high level in the abstract and method overview. No concrete metrics, thresholds, detection algorithm, or validation against reward alignment/variance are supplied; this directly underpins the claimed 66% gain and 59% cost cut, as misidentification would collapse performance to standard full-trajectory RL.
Authors: We agree that the current description of the AdaScope mechanism in Section 3 focuses on the high-level design and theoretical motivation for intervening at specific denoising stages to avoid high-variance early updates and late-stage reward hacking. To improve reproducibility and directly support the performance claims, we will revise the manuscript to include concrete metrics for structural evolution (e.g., variance in low-level feature maps) and semantic consistency (e.g., embedding similarity thresholds), the precise detection algorithm with pseudocode, and additional validation experiments correlating these signals with reward alignment and reduced update variance. revision: yes
-
Referee: [Experimental results (Section 4)] Table 1 and Figure 4 (performance and cost comparisons): the 66% and 59% figures are presented without reported standard deviations across seeds, explicit baseline implementation details, or confirmation that intervention windows were fixed before evaluation rather than tuned post-hoc on the test set.
Authors: We acknowledge the need for greater statistical transparency and implementation details. The reported improvements are derived from multiple runs, but standard deviations were omitted from the presented tables and figures. In the revision we will add standard deviations or error bars to Table 1 and Figure 4. We will also expand the experimental section with explicit baseline implementation details. Regarding intervention windows, they are determined dynamically by the AdaScope mechanism during the denoising process rather than tuned post-hoc; we will add clarifying text and an ablation comparing adaptive versus fixed-window variants to confirm this was not test-set dependent. revision: yes
Circularity Check
No circularity: empirical gains and adaptive heuristic presented without derivation reducing to inputs by construction
full rationale
The abstract and description present AdaScope as a plug-in that adaptively selects RL intervention timing via perception of structural evolution and semantic consistency, then terminates on saturation. Performance improvements (66% gain, 59% cost reduction) are reported as empirical outcomes versus SOTA methods. No equations, fitted parameters renamed as predictions, or self-citation chains are exhibited that would make the central result equivalent to its own inputs. The mention of 'theoretical grounds' does not include any self-definitional or load-bearing reduction. This is the common honest finding of a self-contained empirical method.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Training Diffusion Models with Reinforcement Learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforce- ment learning.arXiv preprint arXiv:2305.13301, 2023. 1, 2, 4, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A sur- vey on generative diffusion models.IEEE transactions on knowledge and data engineering, 36(7):2814–2830, 2024. 1
work page 2024
-
[3]
arXiv preprint arXiv:2404.07771 , year=
Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided gen- eration, statistical rates and optimization.arXiv preprint arXiv:2404.07771, 2024. 2, 4
-
[4]
Dif- fusiondet: Diffusion model for object detection
Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Dif- fusiondet: Diffusion model for object detection. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 19830–19843, 2023. 1
work page 2023
-
[5]
Diffusion models in vision: A survey
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE transactions on pattern analysis and machine intelli- gence, 45(9):10850–10869, 2023. 1
work page 2023
-
[6]
J., Fisch, A., Heller, K., Pfohl, S., Ramachandran, D., Shaw, P., and Berant, J
Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ah- mad Beirami, Alex D’Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran, et al. Helping or herding? reward model ensembles mit- igate but do not eliminate reward hacking.arXiv preprint arXiv:2312.09244, 2023. 3
-
[7]
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: Reinforcement learning for fine-tuning text-to-image diffu- sion models.Advances in Neural Information Processing Systems, 36:79858–79885, 2023. 2, 4, 1
work page 2023
-
[8]
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Moham- mad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Re- inforcement learning for fine-tuning text-to-image diffusion models.Advances in Neural Information Processing Sys- tems, 36, 2024. 1, 3, 6
work page 2024
-
[9]
Giorgio Franceschelli and Mirco Musolesi. Reinforcement learning for generative ai: State of the art, opportunities and open research challenges.Journal of Artificial Intelligence Research, 79:417–446, 2024. 2, 4
work page 2024
-
[10]
Re- flective policy optimization.International Conference on Machine Learning, 2024
Yaozhong Gan, Renye Yan, Zhe Wu, and Junliang Xing. Re- flective policy optimization.International Conference on Machine Learning, 2024. 1
work page 2024
-
[11]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InInternational Conference on Machine Learning, pages 10835–10866. PMLR, 2023. 2
work page 2023
-
[12]
Vinicius G Goecks, Gregory M Gremillion, Vernon J Lawh- ern, John Valasek, and Nicholas R Waytowich. Integrating behavior cloning and reinforcement learning for improved performance in dense and sparse reward environments.arXiv preprint arXiv:1910.04281, 2019. 2
-
[13]
Dealing with sparse rewards in reinforcement learning.arXiv preprint arXiv:1910.09281, 2019
Joshua Hare. Dealing with sparse rewards in reinforcement learning.arXiv preprint arXiv:1910.09281, 2019. 2
-
[14]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium.Advances in Neural Information Processing Systems, 30, 2017. 3
work page 2017
-
[15]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1
work page 2020
-
[16]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els.arXiv preprint arXiv:2210.02303, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[17]
Video dif- fusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video dif- fusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 1
work page 2022
-
[18]
Tiechuan Hu, Wenbo Zhu, and Yuqi Yan. Reward hacking in reinforcement learning and rlhf: A multidisciplinary exami- nation of vulnerabilities, mitigation strategies, and alignment challenges. In2025 5th Intelligent Cybersecurity Conference (ICSC), pages 272–275. IEEE, 2025. 3
work page 2025
-
[19]
Dif- fusion reward: Learning rewards via conditional video dif- fusion
Tao Huang, Guangqi Jiang, Yanjie Ze, and Huazhe Xu. Dif- fusion reward: Learning rewards via conditional video dif- fusion. InEuropean Conference on Computer Vision, pages 478–495. Springer, 2024. 2
work page 2024
-
[20]
Yi Huang, Jiancheng Huang, Yifan Liu, Mingfu Yan, Jiaxi Lv, Jianzhuang Liu, Wei Xiong, He Zhang, Liangliang Cao, and Shifeng Chen. Diffusion model-based image editing: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 1
work page 2025
-
[21]
Measuring di- versity in co-creative image generation.arXiv preprint arXiv:2403.13826, 2024
Francisco Ibarrola and Kazjon Grace. Measuring di- versity in co-creative image generation.arXiv preprint arXiv:2403.13826, 2024. 3
-
[22]
Holodiffusion: Training a 3d diffusion model using 2d images
Animesh Karnewar, Andrea Vedaldi, David Novotny, and Niloy J Mitra. Holodiffusion: Training a 3d diffusion model using 2d images. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 18423–18433, 2023. 1
work page 2023
-
[23]
Sunwoo Kim, Minkyu Kim, and Dongmin Park. Test- time alignment of diffusion models without reward over- optimization.arXiv preprint arXiv:2501.05803, 2025. 2
-
[24]
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.Advances in neural infor- mation processing systems, 34:21696–21707, 2021. 1
work page 2021
-
[25]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Ma- tiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Ad- vances in Neural Information Processing Systems, 2023. 6
work page 2023
-
[26]
Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall met- ric for assessing generative models.Advances in Neural In- formation Processing Systems, 32, 2019. 3
work page 2019
-
[27]
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, and Kazuki Kozuka. Aligning diffusion mod- els by optimizing human utility.Advances in Neural Infor- mation Processing Systems, 37:24897–24925, 2024. 1
work page 2024
-
[28]
Step-aware preference optimization: Aligning preference with denoising performance at each step,
Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Ji Li, and Liang Zheng. Step-aware prefer- ence optimization: Aligning preference with denoising per- formance at each step.arXiv preprint arXiv:2406.04314,
-
[29]
Lixiong Liu, Bao Liu, Hua Huang, and Alan Conrad Bovik. No-reference image quality assessment based on spatial and spectral entropies.Signal Processing: Image communica- tion, 29(8):856–863, 2014. 3
work page 2014
-
[30]
Deepcache: Accelerating diffusion models for free
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024. 2
work page 2024
-
[31]
Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling
Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao. Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling. Advances in Neural Information Processing Systems, 37: 134387–134429, 2025. 2
work page 2025
-
[32]
Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spa- tial domain.IEEE Transactions on Image Processing, 21 (12):4695–4708, 2012. 3
work page 2012
-
[33]
Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Mak- ing a “completely blind” image quality analyzer.IEEE Sig- nal Processing Letters, 20(3):209–212, 2012. 3
work page 2012
-
[34]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021. 1
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[35]
Owen Oertell, Shikun Sun, Yiding Chen, Jin Peng Zhou, Zhiyong Wang, and Wen Sun. Efficient controllable dif- fusion via optimal classifier guidance.arXiv preprint arXiv:2505.21666, 2025. 1
-
[36]
Markov decision processes.Handbooks in Operations Research and Management Science, 2:331– 434, 1990
Martin L Puterman. Markov decision processes.Handbooks in Operations Research and Management Science, 2:331– 434, 1990. 3
work page 1990
-
[37]
Learn- ing transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational conference on machine learning. PmLR. 6, 3
-
[38]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gener- ation with clip latents.arXiv preprint arXiv:2204.06125, 1 (2):3, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
Learning by playing solving sparse reward tasks from scratch
Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. InInternational conference on machine learning, pages 4344–4353. PMLR,
-
[40]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 1
work page 2022
-
[41]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1
work page 2022
-
[42]
Peter J Rousseeuw. Silhouettes: a graphical aid to the in- terpretation and validation of cluster analysis.Journal of Computational and Applied Mathematics, 20:53–65, 1987. 8
work page 1987
-
[43]
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding.Advances in neural information processing systems, 35:36479–36494, 2022. 1
work page 2022
-
[44]
Improved techniques for training gans.Advances in neural information processing systems, 29, 2016
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016. 6, 3
work page 2016
-
[45]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- man, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in neural in- formation processing systems, 2022. 1, 6
work page 2022
-
[46]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 1
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[47]
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gam- ing.Advances in Neural Information Processing Systems, 35:9460–9471, 2022. 3
work page 2022
-
[48]
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gam- ing.Advances in Neural Information Processing Systems, 35:9460–9471, 2022. 2
work page 2022
-
[49]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. PMLR, 2015. 1
work page 2015
-
[50]
Inference-time alignment of diffusion models with direct noise optimization
Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Inference-time alignment of diffusion models with direct noise optimization. arXiv preprint arXiv:2405.18881, 2024. 2
-
[51]
Masatoshi Uehara, Yulai Zhao, Tommaso Biancalani, and Sergey Levine. Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review.arXiv preprint arXiv:2407.13734, 2024. 2, 4
-
[52]
Visualizing data using t-sne.Journal of machine learning research, 9 (11), 2008
Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of machine learning research, 9 (11), 2008. 8
work page 2008
-
[53]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caim- ing Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InComputer Vision and Pattern Recognition, pages 8228–8238, 2024. 1
work page 2024
-
[54]
Chao Wang, Jian Wang, Jingjing Wang, and Xudong Zhang. Deep-reinforcement-learning-based autonomous uav navi- gation with sparse rewards.IEEE Internet of Things Journal, 7(7):6180–6190, 2020. 2
work page 2020
-
[55]
Linye Wei, Zixiang Luo, Pingzhi Tang, and Meng Li. Team: Temporal-spatial consistency guided expert activation for moe diffusion language model acceleration.arXiv preprint arXiv:2602.08404, 2026. 1
work page internal anchor Pith review arXiv 2026
-
[56]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341,
work page internal anchor Pith review Pith/arXiv arXiv
-
[57]
Diffir: Efficient diffusion model for image restoration
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xing- long Wu, Yapeng Tian, Wenming Yang, and Luc Van Gool. Diffir: Efficient diffusion model for image restoration. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13095–13105, 2023. 1
work page 2023
-
[58]
Dymo: Training-free diffusion model alignment with dynamic multi-objective scheduling
Xin Xie and Dong Gong. Dymo: Training-free diffusion model alignment with dynamic multi-objective scheduling. InComputer Vision and Pattern Recognition Conference, pages 13220–13230, 2025. 2
work page 2025
-
[59]
A survey on video dif- fusion models.ACM Computing Surveys, 57(2):1–42, 2024
Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video dif- fusion models.ACM Computing Surveys, 57(2):1–42, 2024. 1
work page 2024
-
[60]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagere- ward: Learning and evaluating human preferences for text- to-image generation.Advances in Neural Information Pro- cessing Systems, 2023. 6
work page 2023
-
[61]
Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models
Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 20908–20918, 2023. 1
work page 2023
-
[62]
Versatile diffusion: Text, images and variations all in one diffusion model
Xingqian Xu, Zhangyang Wang, Gong Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. InProceedings of the IEEE/CVF international conference on computer vision, pages 7754–7765, 2023. 1
work page 2023
-
[63]
The exploration- exploitation dilemma revisited: An entropy perspective
Renye Yan, Yaozhong Gan, You Wu, Ling Liang, Jun- liang Xing, Yimao Cai, and Ru Huang. The exploration- exploitation dilemma revisited: An entropy perspective. arXiv preprint arXiv:2408.09974, 2024. 1
-
[64]
Entropy-adaptive diffusion policy optimiza- tion with dynamic step alignment
RenYe Yan, Jikang Cheng, Yaozhong Gan, Shikun Sun, You Wu, Yunfan Yang, Liang Ling, Jinlong Lin, Yeshuang Zhu, Jie Zhou, et al. Entropy-adaptive diffusion policy optimiza- tion with dynamic step alignment. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1924–1934, 2025. 1
work page 1924
-
[65]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. InComputer Vision and Pattern Recognition, pages 8941–8951, 2024. 1, 2, 6
work page 2024
-
[66]
Yinlong Yuan, Zhu Liang Yu, Zhenghui Gu, Xiaoyan Deng, and Yuanqing Li. A novel multi-step reinforcement learning method for solving reward hacking.Applied Intelligence, 49 (8):2874–2888, 2019. 3
work page 2019
-
[67]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shecht- man, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InComputer Vision and Pattern Recognition, pages 586–595, 2018. 6, 3
work page 2018
-
[68]
Ziyi Zhang, Sen Zhang, Yibing Zhan, Yong Luo, Yonggang Wen, and Dacheng Tao. Confronting reward overoptimiza- tion for diffusion models: A perspective of inductive and pri- macy biases. InInternational Conference on Machine Learn- ing, pages 60396–60413. PMLR, 2024. 2, 6, 1
work page 2024
-
[69]
Enmin Zhao, Renye Yan, Jinqiu Li, Kai Li, and Junliang Xing. Alphaholdem: High-performance artificial intelli- gence for heads-up no-limit poker via end-to-end reinforce- ment learning. InProceedings of the AAAI conference on artificial intelligence, pages 4689–4697, 2022. 1
work page 2022
-
[70]
3d shape generation and completion through point-voxel diffusion
Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 5826–5835, 2021. 1
work page 2021
-
[71]
Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, and Junliang Xing. Mixture of global and local experts with diffusion transformer for con- trollable face generation, 2025. 1 Do Less, Achieve More: Do We Need Every-Step Optimization for RL Fine-tuning of Diffusion Models? Supplementary Material
work page 2025
-
[72]
Supplementary Related Works 1.1. RL Fine-Tuning in Diffusion Models Existing diffusion models [4, 5, 24, 59, 62] primarily approximate the data distribution through denoising reconstruction loss. However, this training approach struggles to capture high-level metrics such as semantic consistency, aesthetic prefer- ences, and user subjective judgments [2, ...
-
[73]
•(1) Visualization Experiments.See Fig
Experiment List in Our Paper To help readers quickly grasp the extensive experiments conducted in this work, we summarize the full list of experiments below. •(1) Visualization Experiments.See Fig. 1. This experiment provides a solid justification for the motivation of this work. •(2) Reward Backfilling Validation.See Fig. 3. This experiment demonstrates ...
-
[74]
Supplementary Experiments 3.1. Why AdaScope Improves Both Quality and Efficiency ? Computational Savings:Our method reduces training computational costs by adaptively pruning uninformative early denoising samples and late-stage steps where returns have saturated. In the early stage of denoising, the image’s semantic structure has not yet formed, leading t...
-
[75]
Proof 4.1. Proof of Theorem 1. This is proved in the sec.5 ofReverse-time diffusion equation modelsby Anderson. 4.2. Proof of Theorem 2. We do the direct calculation: Table 4. Detailed prompts used for generated images in Fig. 13. Image Prompt Row 1, Col 1 A young girl standing on a rooftop, blowing dandelions that transform into glowing comets, shooting ...
-
[76]
Expandx t+τ in terms of(x 0, ϵt, ϵ′): xt+τ = r ¯αt+τ ¯αt √¯αt x0 + √ 1−¯αt ϵt + r 1− ¯αt+τ ¯αt ϵ′ = √¯αt+τ x0 + r ¯αt+τ ¯αt √ 1−¯αt ϵt + r 1− ¯αt+τ ¯αt ϵ′
-
[77]
Therefore, componentwise, Cov x(i) t , x(j) t+τ = √¯αt ¯αt+τ Σij + r ¯αt+τ ¯αt (1−¯αt)δ ij
Cross-covarianceCov(x t, xt+τ):Using independence andCov(x 0) = Σ,Cov(ϵ t) =I,Cov(ϵ ′) =I, Cov(xt, xt+τ) = Cov √¯αt x0 + √ 1−¯αt ϵt, √¯αt+τ x0 + r ¯αt+τ ¯αt √ 1−¯αt ϵt + r 1− ¯αt+τ ¯αt ϵ′ = √¯αt ¯αt+τ Cov(x0, x0) + √ 1−¯αt r ¯αt+τ ¯αt √ 1−¯αt Cov(ϵt, ϵt) = √¯αt ¯αt+τ Σ + r ¯αt+τ ¯αt (1−¯αt)I. Therefore, componentwise, Cov x(i) t , x(j) t+τ = √¯αt ¯αt+τ Σi...
-
[78]
Marginal variances at each time: Var x(i) t = Var √¯αt x(i) 0 + √ 1−¯αt ϵ(i) t = ¯αt Σii + (1−¯αt), Var x(j) t+τ = Var √¯αt+τ x(j) 0 + p 1−¯αt+τ ˜ϵ(j) = ¯αt+τ Σjj + (1−¯αt+τ), (where˜ϵis standard normal noise independent ofx 0.)
-
[79]
Correlation: Corr x(i) t , x(j) t+τ = Cov x(i) t , x(j) t+τ q Var(x(i) t ) Var(x(j) t+τ) = √¯αt+τ ¯αt Σij + q ¯αt+τ ¯αt (1−¯αt)δ ij q ¯αtΣii + (1−¯αt) ¯αt+τΣjj + (1−¯αt+τ) . Ours D3PO DPOK DDPO Ours D3PO DPOK DDPO Ours D3PO DDPO DPOK OursD3PO DPOKDDPO Ours D3PO DDPO DPOK Ours DPOK D3PO DDPO Figure 13.Diversity Evaluation:Our method demonstrates the highes...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.