pith. machine review for the scientific record.

arxiv: 2603.00918 · v3 · submitted 2026-03-01 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-image · diffusion models · reinforcement learning · self-supervised · post-training · reward design · image generation

The pith

Text-to-image models can improve their outputs by using their own accuracy at recovering injected noise as an internal reward signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SOLACE as a post-training method for text-to-image models that derives rewards from the model's own outputs rather than external sources. It adds noise to generated images and uses the accuracy of recovering that noise as a measure of self-confidence, turning low error into high reward for reinforcement learning. This internal signal leads to better performance on complex scenes, text in images, and overall alignment with prompts. The method avoids the need for human annotations or separate reward models and can be combined with them for further gains. If effective, it suggests models can bootstrap their own improvement using intrinsic signals.

Core claim

The central discovery is that a model's ability to reconstruct injected noise from its own generated images provides a reliable intrinsic confidence signal that can be used directly as a reward in reinforcement learning to enhance text-to-image generation quality. The claimed result is improved compositional accuracy, text rendering, and prompt alignment without any external supervision or preference data.

What carries the argument

The SOLACE framework, which computes self-confidence via the reconstruction error of noise added to the model's generated latent representations and converts that error into scalar rewards for RL.
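The paper's exact formulation lives in its method section; the following is a minimal editorial sketch of the mechanism as described in the abstract and Figure 2, written in one common rectified-flow convention. The names (`velocity_model`, `solace_confidence`) and the probe settings are assumptions, not the authors' code.

```python
import torch

def solace_confidence(velocity_model, z0, prompt_emb,
                      probe_timesteps=(0.3, 0.5, 0.7), n_probes=4):
    """Hedged sketch of a SOLACE-style self-confidence reward.

    z0: generated latents, shape (G, C, H, W); no decoding is needed.
    Returns one scalar per latent: the negative mean error of the model's
    attempt to recover noise it injected itself.
    """
    errors = []
    for t in probe_timesteps:          # t ∈ T ⊂ [0, 1], per Figure 2
        for _ in range(n_probes):      # K noise probes per timestep
            eps = torch.randn_like(z0)                    # injected noise probe
            z_t = (1.0 - t) * z0 + t * eps                # re-noise (rectified-flow path)
            v_pred = velocity_model(z_t, t, prompt_emb)   # model's velocity prediction
            v_true = eps - z0                             # target velocity in this convention
            errors.append(((v_pred - v_true) ** 2).flatten(1).mean(dim=1))
    recon_error = torch.stack(errors).mean(dim=0)         # shape (G,)
    return -recon_error                                   # low error -> high confidence
```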

If this is right

  • Improves compositional generation by reinforcing coherent structures.
  • Enhances text rendering accuracy in images.
  • Strengthens text-image alignment.
  • Reduces reward hacking when used with external rewards.
  • Enables training without preference datasets or annotators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might generalize to other modalities like text or audio generation using similar diffusion processes.
  • Iterative self-training could lead to progressively better models without human intervention.
  • It offers a way to mitigate biases in external reward models by incorporating model-intrinsic signals.

Load-bearing premise

Accurate noise reconstruction from the model's outputs corresponds to generations that humans would rate highly when reinforced in RL.

What would settle it

Experiments showing that models trained with SOLACE rewards receive lower human preference scores than those trained with standard methods or no RL.

Figures

Figures reproduced from arXiv: 2603.00918 by Minsu Cho, Seungwook Kim.

Figure 1
Figure 1. Qualitative examples of SOLACE on the Pick-a-Pic dataset [30]. Best viewed on electronics.
Figure 2
Figure 2. Overview of SOLACE. Given a text prompt c, we generate G different latents. Without decoding, we re-noise the latents using K noise probes across t ∈ T ⊂ [0, 1]. For each generated latent z_0^(i), we formulate the text-to-image generative model's self-confidence in the generated latent as its ability to denoise the re-noised latent. We leverage this self-confidence as an internal reward scalar value, which …
Figure 3
Figure 3. User study against baseline SD3.5-M [16] on PartiPrompts [56] and HPSv2 [75]. The user study shows that SOLACE post-training yields favorable visual realism/appeal and text-image alignment.
Figure 4
Figure 4. Effect of SOLACE post-training SD3.5-M after post-training on PickScore [30] using FlowGRPO [37]. SOLACE complements external rewards, showing the best compositional generation and visual appeal on GenEval [21]. Post-training on external rewards alone yields high visual appeal but sacrifices compositionality, as shown above (Column 3: generates a yellow motorcycle instead / generates an unwanted human). We also …
Figure 5
Figure 5. Qualitative results of SOLACE when applied to SD3.5 [16] on DrawBench [56], GenEval [21], and OCR [14]. Applying SOLACE shows consistent improvements over the baseline SD3.5.
Figure 6
Figure 6. Rationale of SOLACE. Distributions of the denoising-based self-confidence under three inference settings: 10 steps (no CFG), 10 steps (CFG), and 20 steps (CFG). The distribution shifts monotonically rightward (higher self-confidence) in the same order that visual quality improves, indicating that the ability to recover injected noise is predictive of sample quality even when the scorer is the same model. …
Figure 8
Figure 8. User study interface used to collect human preferences.
Figure 9
Figure 9. Additional qualitative results of SOLACE when applied to (1) FlowGRPO [37] post-trained SD3.5-M [16], (2) FLUX.1-Dev [3], and (3) SD3.5-L [16]. Best viewed on electronics.
read the original abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
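Read literally, the abstract's recipe corresponds to a reward of roughly the following form. This is a hedged editorial reconstruction in the rectified-flow convention of SD3.5 [16]; the paper's actual noise schedule, norm, and aggregation may differ.

```latex
% Hedged reconstruction of the SOLACE reward from the abstract's description;
% T is the set of probe timesteps and K the number of noise probes per timestep.
r\big(z_0^{(i)}, c\big) \;=\; -\,\frac{1}{|T|\,K}
\sum_{t \in T} \sum_{k=1}^{K}
\Big\lVert v_\theta\!\big((1-t)\,z_0^{(i)} + t\,\epsilon_k,\; t,\; c\big)
\;-\; \big(\epsilon_k - z_0^{(i)}\big) \Big\rVert_2^2,
\qquad \epsilon_k \sim \mathcal{N}(0, I).
```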

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework for text-to-image diffusion models. It re-noises the model's own generated outputs, measures the denoising network's reconstruction error on the injected noise as an intrinsic self-confidence signal, and uses this scalar as a reward for reinforcement learning to improve compositional generation, text rendering, and text-image alignment without external reward models or preference data. The abstract claims consistent gains from reinforcing high-confidence generations and complementary benefits when combined with external rewards.

Significance. If the self-referential reconstruction error reliably tracks human-aligned quality rather than model familiarity or training-manifold proximity, the method could reduce dependence on costly external reward models and mitigate reward hacking. However, the absence of any reported metrics, baselines, or ablation studies in the abstract leaves the central data-to-claim link unevaluated, limiting assessment of practical impact.

major comments (2)
  1. [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.
  2. [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.
minor comments (1)
  1. [Abstract] The integration statement ('Integrating SOLACE with external rewards yields complementary improvements') would benefit from a brief description of the combination mechanism, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, acknowledging where the current manuscript is incomplete and outlining targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.

    Authors: We agree that the abstract, in its current form, states the gains without accompanying numbers or references to experiments. The full manuscript contains quantitative results, baselines, and human evaluations in Section 4. We will revise the abstract to incorporate the key metrics (e.g., improvements on compositionality and alignment benchmarks) so that the claim is directly supported within the abstract itself. revision: yes

  2. Referee: [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.

    Authors: The comment is correct: the submitted manuscript provides only the definition and high-level motivation for the reconstruction-error signal without a derivation or explicit correlation study against external measures. We will add a short derivation based on the denoising objective and preliminary correlation experiments (including controls for training-manifold proximity) to the method section in the revision. revision: yes
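As an editorial illustration of the promised check (not the authors' protocol), the correlation study could be as simple as rank-correlating per-image self-confidence with an external metric; `confidences` and `external_scores` below are hypothetical inputs.

```python
from scipy.stats import spearmanr

def confidence_quality_correlation(confidences, external_scores):
    """Rank correlation between intrinsic self-confidence and an external metric.

    confidences: per-image SOLACE self-confidence values (hypothetical input).
    external_scores: e.g. CLIPScore or human ratings for the same images.
    A strongly positive rho would support the load-bearing premise; a rho near
    zero or negative would suggest the signal tracks training-distribution
    familiarity rather than quality.
    """
    rho, p_value = spearmanr(confidences, external_scores)
    return rho, p_value
```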

Circularity Check

0 steps flagged

No significant circularity; self-contained internal reward definition

full rationale

The paper explicitly constructs the SOLACE reward as low reconstruction error when the model re-noises and denoises its own outputs, then applies this scalar as an RL signal. This is a direct definitional choice of an intrinsic proxy rather than a derivation that reduces a claimed outcome (e.g., improved alignment) back to the same quantity by construction. No equations, fitted parameters, or self-citations are shown that force the central result to equal its inputs; the gains in composition and alignment are presented as empirical consequences of the RL updates, not tautological. The reward definition is therefore self-contained rather than circular with respect to external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one core domain assumption linking reconstruction error to confidence; no free parameters are introduced, and the only new entity is the internally defined self-confidence signal.

axioms (1)
  • domain assumption Low reconstruction error after re-noising the model's outputs indicates high self-confidence that corresponds to higher-quality generations.
    This assumption directly converts the technical signal into the scalar reward used for RL.
invented entities (1)
  • SOLACE self-confidence signal (no independent evidence)
    purpose: To supply intrinsic scalar rewards for reinforcement learning without external models or data.
    Newly defined internal metric introduced by the framework.
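Since post-training builds on FlowGRPO [37], a plausible (assumed, not quoted from the paper) way this scalar enters the RL update is group-relative advantage normalization over the G latents sampled per prompt, as in GRPO:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization of per-sample rewards within one prompt group.

    rewards: tensor of shape (G,) holding self-confidence rewards for the G
    latents generated from the same prompt. Each sample's advantage is its
    reward standardized against the group, so the policy update pushes
    probability mass toward above-average-confidence generations.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```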

pith-pipeline@v0.9.0 · 5430 in / 1301 out tokens · 61149 ms · 2026-05-15T18:36:58.268447+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [1] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4D-fy: Text-to-4D generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024.

  2. [2] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.

  3. [3] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv:2506, 2025.

  4. [4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

  5. [5] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

  6. [6] Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In The Second Tiny Papers Track at ICLR 2024, 2024.

  7. [7] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.

  8. [8] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.

  9. [9] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

  10. [10] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  11. [11] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

  12. [12] Pengyu Cheng, Yong Dai, Tianhao Hu, Han Xu, Zhisong Zhang, Lei Han, Nan Du, and Xiaolong Li. Self-playing adversarial language game enhances LLM reasoning. Advances in Neural Information Processing Systems, 37:126515–126543, 2024.

  13. [13] Kevin Clark, Paul Vicol, Kevin Swersky, and David J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.

  14. [14] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.

  15. [15] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.

  16. [16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  17. [17] Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with Wasserstein regularization. In The Thirteenth International Conference on Learning Representations, 2025.

  18. [18] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

  19. [19] Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Improving dynamic object interactions in text-to-video generation with AI feedback. arXiv preprint arXiv:2412.02617, 2024.

  20. [20] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.

  21. [21] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

  22. [22] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.

  23. [23] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.

  24. [24] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

  25. [25] Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025.

  26. [26] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025.

  27. [27] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  28. [28] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  29. [29] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. arXiv preprint arXiv:2501.07730, 2025.

  30. [30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.

  31. [31] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  32. [32] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

  33. [33] Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36:69981–70011, 2023.

  34. [34] Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025.

  35. [35] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  36. [37] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.

  37. [38] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.

  38. [39] Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025.

  39. [40] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  40. [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  41. [42] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024.

  42. [43] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.

  43. [44] Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10844–10853, 2024.

  44. [45] OpenAI. Hello GPT-4o, 2024.

  45. [46] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  46. [47] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

  47. [48] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  48. [49] Gabriel Poesia, David Broman, Nick Haber, and Noah Goodman. Learning formal mathematics from intrinsic motivation. Advances in Neural Information Processing Systems, 37:43032–43057, 2024.

  49. [50] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.

  50. [51] Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023.

  51. [52] Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024.

  52. [53] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  53. [54] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  54. [55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  55. [56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

  56. [57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  57. [58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  58. [59] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  59. [60] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.

  60. [61] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  61. [62] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.

  62. [63] Inkyu Shin, Chenglin Yang, and Liang-Chieh Chen. Deeply supervised flow-based generative models. arXiv preprint arXiv:2503.14494, 2025.

  63. [64] Joonghyuk Shin, Minguk Kang, and Jaesik Park. Fill-up: Balancing long-tailed data with generative models. arXiv preprint arXiv:2306.07200, 2023.

  64. [65] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4D dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.

  65. [66] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  66. [67] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025.

  67. [68] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  68. [69] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.

  69. [70] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  70. [71] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.

  71. [72] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.

  72. [73] Yinong Oliver Wang, Younjoon Chung, Chen Henry Wu, and Fernando De la Torre. Domain gap embeddings for generative dataset augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28684–28694, 2024.

  73. [74] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

  74. [75] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

  75. [76] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.

  76. [77] Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, and Zhiyong Wu. Genius: A generalizable and purely unsupervised self-training framework for advanced reasoning. arXiv preprint arXiv:2504.08672, 2025.

  77. [78] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

  78. [79] Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024.

  79. [80] Zuhao Yang, Fangneng Zhan, Kunhao Liu, Muyu Xu, and Shijian Lu. AI-generated images as data source: The dawn of synthetic era. arXiv preprint arXiv:2310.01830, 2023.

  80. [81] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.

Showing first 80 references.