pith. machine review for the scientific record.

arxiv: 2603.00918 · v3 · submitted 2026-03-01 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-image · diffusion models · reinforcement learning · self-supervised · post-training · reward design · image generation

The pith

Text-to-image models can improve their outputs by using their own accuracy at recovering injected noise as an internal reward signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SOLACE as a post-training method for text-to-image models that derives rewards from the model's own outputs rather than external sources. It adds noise to generated images and uses the accuracy of recovering that noise as a measure of self-confidence, turning low error into high reward for reinforcement learning. This internal signal leads to better performance on complex scenes, text in images, and overall alignment with prompts. The method avoids the need for human annotations or separate reward models and can be combined with them for further gains. If effective, it suggests models can bootstrap their own improvement using intrinsic signals.

Core claim

The central discovery is that a model's ability to reconstruct injected noise from its own generated images provides a reliable intrinsic confidence signal that can be used directly as a reward in reinforcement learning to enhance text-to-image generation quality. The claimed result is improved compositional accuracy, text rendering, and prompt alignment without any external supervision or preference data.

What carries the argument

The SOLACE framework, which computes self-confidence via the reconstruction error of noise added to the model's generated latent representations and converts that error into scalar rewards for RL.
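The paper's exact formulation lives in its method section; the following is a minimal editorial sketch of the mechanism as described in the abstract and Figure 2, written in one common rectified-flow convention. The names (`velocity_model`, `solace_confidence`) and the probe settings are assumptions, not the authors' code.

```python
import torch

def solace_confidence(velocity_model, z0, prompt_emb,
                      probe_timesteps=(0.3, 0.5, 0.7), n_probes=4):
    """Hedged sketch of a SOLACE-style self-confidence reward.

    z0: generated latents, shape (G, C, H, W); no decoding is needed.
    Returns one scalar per latent: the negative mean error of the model's
    attempt to recover noise it injected itself.
    """
    errors = []
    for t in probe_timesteps:          # t ∈ T ⊂ [0, 1], per Figure 2
        for _ in range(n_probes):      # K noise probes per timestep
            eps = torch.randn_like(z0)                    # injected noise probe
            z_t = (1.0 - t) * z0 + t * eps                # re-noise (rectified-flow path)
            v_pred = velocity_model(z_t, t, prompt_emb)   # model's velocity prediction
            v_true = eps - z0                             # target velocity in this convention
            errors.append(((v_pred - v_true) ** 2).flatten(1).mean(dim=1))
    recon_error = torch.stack(errors).mean(dim=0)         # shape (G,)
    return -recon_error                                   # low error -> high confidence
```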

If this is right

  • Improves compositional generation by reinforcing coherent structures.
  • Enhances text rendering accuracy in images.
  • Strengthens text-image alignment.
  • Reduces reward hacking when used with external rewards.
  • Enables training without preference datasets or annotators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might generalize to other modalities like text or audio generation using similar diffusion processes.
  • Iterative self-training could lead to progressively better models without human intervention.
  • It offers a way to mitigate biases in external reward models by incorporating model-intrinsic signals.

Load-bearing premise

Accurate noise reconstruction from the model's outputs corresponds to generations that humans would rate highly when reinforced in RL.

What would settle it

Experiments showing that models trained with SOLACE rewards receive lower human preference scores than those trained with standard methods or no RL.

Figures

Figures reproduced from arXiv: 2603.00918 by Minsu Cho, Seungwook Kim.

Figure 1
Figure 1. Qualitative examples of SOLACE on the Pick-a-Pic dataset [30]. Best viewed on electronics.
Figure 2
Figure 2. Overview of SOLACE. Given a text prompt c, we generate G different latents. Without decoding, we re-noise the latents using K noise probes across t ∈ T ⊂ [0, 1]. For each generated latent z_0^(i), we formulate the text-to-image generative model's self-confidence in the generated latent as its ability to denoise the re-noised latent. We leverage this self-confidence as an internal reward scalar value, which …
Figure 3
Figure 3. User study against baseline SD3.5-M [16] on PartiPrompts [56] and HPSv2 [75]. The user study shows that SOLACE post-training yields favorable visual realism/appeal and text-image alignment.
Figure 4
Figure 4. Effect of SOLACE post-training SD3.5-M after post-training on PickScore [30] using FlowGRPO [37]. SOLACE complements external rewards, showing the best compositional generation and visual appeal on GenEval [21]. Post-training on external rewards alone yields high visual appeal but sacrifices compositionality, as shown above (Column 3: generates a yellow motorcycle instead / generates an unwanted human). We also …
Figure 5
Figure 5. Qualitative results of SOLACE when applied to SD3.5 [16] on DrawBench [56], GenEval [21], and OCR [14]. Applying SOLACE shows consistent improvements over the baseline SD3.5.
Figure 6
Figure 6. Rationale of SOLACE. Distributions of the denoising-based self-confidence under three inference settings: 10 steps (no CFG), 10 steps (CFG), and 20 steps (CFG). The distribution shifts monotonically rightward (higher self-confidence) in the same order that visual quality improves, indicating that the ability to recover injected noise is predictive of sample quality even when the scorer is the same model. …
Figure 8
Figure 8. User study interface used to collect human preferences.
Figure 9
Figure 9. Additional qualitative results of SOLACE when applied to (1) FlowGRPO [37] post-trained SD3.5-M [16], (2) FLUX.1-Dev [3], and (3) SD3.5-L [16]. Best viewed on electronics.
read the original abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
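Read literally, the abstract's recipe corresponds to a reward of roughly the following form. This is a hedged editorial reconstruction in the rectified-flow convention of SD3.5 [16]; the paper's actual noise schedule, norm, and aggregation may differ.

```latex
% Hedged reconstruction of the SOLACE reward from the abstract's description;
% T is the set of probe timesteps and K the number of noise probes per timestep.
r\big(z_0^{(i)}, c\big) \;=\; -\,\frac{1}{|T|\,K}
\sum_{t \in T} \sum_{k=1}^{K}
\Big\lVert v_\theta\!\big((1-t)\,z_0^{(i)} + t\,\epsilon_k,\; t,\; c\big)
\;-\; \big(\epsilon_k - z_0^{(i)}\big) \Big\rVert_2^2,
\qquad \epsilon_k \sim \mathcal{N}(0, I).
```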

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework for text-to-image diffusion models. It re-noises the model's own generated outputs, measures the denoising network's reconstruction error on the injected noise as an intrinsic self-confidence signal, and uses this scalar as a reward for reinforcement learning to improve compositional generation, text rendering, and text-image alignment without external reward models or preference data. The abstract claims consistent gains from reinforcing high-confidence generations and complementary benefits when combined with external rewards.

Significance. If the self-referential reconstruction error reliably tracks human-aligned quality rather than model familiarity or training-manifold proximity, the method could reduce dependence on costly external reward models and mitigate reward hacking. However, the absence of any reported metrics, baselines, or ablation studies in the abstract leaves the central data-to-claim link unevaluated, limiting assessment of practical impact.

major comments (2)
  1. [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.
  2. [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.
minor comments (1)
  1. [Abstract] The integration statement ('Integrating SOLACE with external rewards yields complementary improvements') would benefit from a brief description of the combination mechanism, even at a high level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, acknowledging where the current manuscript is incomplete and outlining targeted revisions.

read point-by-point responses
  1. Referee: [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.

    Authors: We agree that the abstract, in its current form, states the gains without accompanying numbers or references to experiments. The full manuscript contains quantitative results, baselines, and human evaluations in Section 4. We will revise the abstract to incorporate the key metrics (e.g., improvements on compositionality and alignment benchmarks) so that the claim is directly supported within the abstract itself. revision: yes

  2. Referee: [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.

    Authors: The comment is correct: the submitted manuscript provides only the definition and high-level motivation for the reconstruction-error signal without a derivation or explicit correlation study against external measures. We will add a short derivation based on the denoising objective and preliminary correlation experiments (including controls for training-manifold proximity) to the method section in the revision. revision: yes
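As an editorial illustration of the promised check (not the authors' protocol), the correlation study could be as simple as rank-correlating per-image self-confidence with an external metric; `confidences` and `external_scores` below are hypothetical inputs.

```python
from scipy.stats import spearmanr

def confidence_quality_correlation(confidences, external_scores):
    """Rank correlation between intrinsic self-confidence and an external metric.

    confidences: per-image SOLACE self-confidence values (hypothetical input).
    external_scores: e.g. CLIPScore or human ratings for the same images.
    A strongly positive rho would support the load-bearing premise; a rho near
    zero or negative would suggest the signal tracks training-distribution
    familiarity rather than quality.
    """
    rho, p_value = spearmanr(confidences, external_scores)
    return rho, p_value
```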

Circularity Check

0 steps flagged

No significant circularity; self-contained internal reward definition

full rationale

The paper explicitly constructs the SOLACE reward as low reconstruction error when the model re-noises and denoises its own outputs, then applies this scalar as an RL signal. This is a direct definitional choice of an intrinsic proxy rather than a derivation that reduces a claimed outcome (e.g., improved alignment) back to the same quantity by construction. No equations, fitted parameters, or self-citations are shown that force the central result to equal its inputs; the gains in composition and alignment are presented as empirical consequences of the RL updates, not tautological. The reward definition is therefore self-contained rather than circular with respect to external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one core domain assumption linking reconstruction error to confidence; no free parameters are introduced, and the only new entity is the internally defined self-confidence signal.

axioms (1)
  • domain assumption Low reconstruction error after re-noising the model's outputs indicates high self-confidence that corresponds to higher-quality generations.
    This assumption directly converts the technical signal into the scalar reward used for RL.
invented entities (1)
  • SOLACE self-confidence signal (no independent evidence)
    purpose: To supply intrinsic scalar rewards for reinforcement learning without external models or data.
    Newly defined internal metric introduced by the framework.
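Since post-training builds on FlowGRPO [37], a plausible (assumed, not quoted from the paper) way this scalar enters the RL update is group-relative advantage normalization over the G latents sampled per prompt, as in GRPO:

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization of per-sample rewards within one prompt group.

    rewards: tensor of shape (G,) holding self-confidence rewards for the G
    latents generated from the same prompt. Each sample's advantage is its
    reward standardized against the group, so the policy update pushes
    probability mass toward above-average-confidence generations.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```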

pith-pipeline@v0.9.0 · 5430 in / 1301 out tokens · 61149 ms · 2026-05-15T18:36:58.268447+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [1] Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4D-fy: Text-to-4D generation using hybrid score distillation sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7996–8006, 2024.

  2. [2] Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A ViT backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.

  3. [3] Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv e-prints, arXiv:2506, 2025.

  4. [4] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science, https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.

  5. [5] Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.

  6. [6] Frederic Boesel and Robin Rombach. Improving image editing models with generative data refinement. In The Second Tiny Papers Track at ICLR 2024, 2024.

  7. [7] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.

  8. [8] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.

  9. [9] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.

  10. [10] Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. In European Conference on Computer Vision, pages 74–91. Springer, 2024.

  11. [11] Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335, 2024.

  12. [12] Pengyu Cheng, Yong Dai, Tianhao Hu, Han Xu, Zhisong Zhang, Lei Han, Nan Du, and Xiaolong Li. Self-playing adversarial language game enhances LLM reasoning. Advances in Neural Information Processing Systems, 37:126515–126543, 2024.

  13. [13] Kevin Clark, Paul Vicol, Kevin Swersky, and David J. Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.

  14. [14] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. PaddleOCR 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.

  15. [15] Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023.

  16. [16] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.

  17. [17] Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with Wasserstein regularization. In The Thirteenth International Conference on Learning Representations, 2025.

  18. [18] Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Reinforcement learning for fine-tuning text-to-image diffusion models. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

  19. [19] Hiroki Furuta, Heiga Zen, Dale Schuurmans, Aleksandra Faust, Yutaka Matsuo, Percy Liang, and Sherry Yang. Improving dynamic object interactions in text-to-video generation with AI feedback. arXiv preprint arXiv:2412.02617, 2024.

  20. [20] Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023.

  21. [21] Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

  22. [22] Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, et al. Seedream 2.0: A native Chinese-English bilingual image generation foundation model. arXiv preprint arXiv:2503.07703, 2025.

  23. [23] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate, 2022.

  24. [24] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.

  25. [25] Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla. A simple and effective reinforcement learning method for text-to-image diffusion fine-tuning. arXiv preprint arXiv:2503.00897, 2025.

  26. [26] Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. LTX-Video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2025.

  27. [27] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

  28. [28] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

  29. [29] Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, and Liang-Chieh Chen. Democratizing text-to-image masked generative models with compact text-aware one-dimensional tokens. arXiv preprint arXiv:2501.07730, 2025.

  30. [30] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.

  31. [31] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. HunyuanVideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  32. [32] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.

  33. [33] Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan, Hannah Teufel, Marco Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36:69981–70011, 2023.

  34. [34] Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025.

  35. [35] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  36. [37] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.

  37. [38] Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.

  38. [39] Runtao Liu, Haoyu Wu, Ziqiang Zheng, Chen Wei, Yingqing He, Renjie Pi, and Qifeng Chen. VideoDPO: Omni-preference alignment for video diffusion generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8009–8019, 2025.

  39. [40] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  40. [41] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  41. [42] Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-MAGVIT2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024.

  42. [43] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.

  43. [44] Zichen Miao, Jiang Wang, Ze Wang, Zhengyuan Yang, Lijuan Wang, Qiang Qiu, and Zicheng Liu. Training diffusion models towards diverse image generation with reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10844–10853, 2024.

  44. [45] OpenAI. Hello GPT-4o, 2024.

  45. [46] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

  46. [47] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

  47. [48] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  48. [49] Gabriel Poesia, David Broman, Nick Haber, and Noah Goodman. Learning formal mathematics from intrinsic motivation. Advances in Neural Information Processing Systems, 37:43032–43057, 2024.

  49. [50] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.

  50. [51] Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation, 2023.

  51. [52] Mihir Prabhudesai, Russell Mendonca, Zheyang Qin, Katerina Fragkiadaki, and Deepak Pathak. Video diffusion alignment via reward gradients. arXiv preprint arXiv:2407.08737, 2024.

  52. [53] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  53. [54] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.

  54. [55] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  55. [56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

  56. [57] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.

  57. [58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  58. [59] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  59. [60] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu Edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8871–8879, 2024.

  60. [61] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  61. [62] Yichun Shi, Peng Wang, and Weilin Huang. SeedEdit: Align image re-generation to image editing. arXiv preprint arXiv:2411.06686, 2024.

  62. [63] Inkyu Shin, Chenglin Yang, and Liang-Chieh Chen. Deeply supervised flow-based generative models. arXiv preprint arXiv:2503.14494, 2025.

  63. [64] Joonghyuk Shin, Minguk Kang, and Jaesik Park. Fill-up: Balancing long-tailed data with generative models. arXiv preprint arXiv:2306.07200, 2023.

  64. [65] Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, et al. Text-to-4D dynamic scene generation. arXiv preprint arXiv:2301.11280, 2023.

  65. [66] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  66. [67] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8406–8416, 2025.

  67. [68] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.

  68. [69] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.

  69. [70] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  70. [71] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. ModelScope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023.

  71. [72] Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.

  72. [73] Yinong Oliver Wang, Younjoon Chung, Chen Henry Wu, and Fernando De la Torre. Domain gap embeddings for generative dataset augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28684–28694, 2024.

  73. [74] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

  74. [75] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

  75. [76] Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu. OmniGen: Unified image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13294–13304, 2025.

  76. [77] Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, and Zhiyong Wu. Genius: A generalizable and purely unsupervised self-training framework for advanced reasoning. arXiv preprint arXiv:2504.08672, 2025.

  77. [78] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.

  78. [79] Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024.

  79. [80] Zuhao Yang, Fangneng Zhan, Kunhao Liu, Muyu Xu, and Shijian Lu. AI-generated images as data source: The dawn of synthetic era. arXiv preprint arXiv:2310.01830, 2023.

  80. [81] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.

Showing first 80 references.