Recognition: 2 theorem links · Lean Theorem
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Pith reviewed 2026-05-15 18:36 UTC · model grok-4.3
The pith
Text-to-image models can improve their outputs by using their own accuracy at recovering injected noise as an internal reward signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a model's ability to reconstruct injected noise from its own generated images provides a reliable intrinsic confidence signal, and that this signal can be used directly as a reinforcement-learning reward. Training on it improves compositional accuracy, text rendering, and prompt alignment without any external supervision or preference data.
What carries the argument
The SOLACE framework, which computes self-confidence from the reconstruction error on noise injected into the model's own generated latent representations and converts that error into scalar rewards for reinforcement learning.
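As one way to make this concrete, here is a minimal sketch of such a self-confidence reward, assuming a DDPM-style noise-prediction network; the names `eps_theta`, `z0`, `alphas_bar`, and the uniform timestep weighting are illustrative assumptions, not the authors' implementation.

```python
import torch

def solace_style_reward(eps_theta, z0, c, alphas_bar, timesteps, K=4, delta=1e-6):
    """Re-noise a generated latent z0 and score how well the model recovers the injected noise.

    Assumes eps_theta(z_t, t, c) predicts the noise added to z_t, and alphas_bar is a
    tensor of cumulative alpha-bar values (DDPM schedule). Both are assumptions here.
    """
    scores = []
    for t in timesteps:
        abar_t = alphas_bar[t]
        mse_accum = 0.0
        for _ in range(K):                           # K independent noise draws per timestep
            eps = torch.randn_like(z0)
            z_t = (abar_t ** 0.5) * z0 + ((1 - abar_t) ** 0.5) * eps
            eps_hat = eps_theta(z_t, t, c)           # model's estimate of the injected noise
            mse_accum = mse_accum + torch.mean((eps_hat - eps) ** 2)
        mse = mse_accum / K                          # reconstruction error at this timestep
        scores.append(-torch.log(mse + delta))       # S_t = -log(MSE_t + delta)
    return torch.stack(scores).mean()                # scalar reward; uniform weights assumed
```

The scalar returned here would play the role of the RL reward; the abstract does not specify the paper's actual weighting over timesteps, so the uniform average above is a guess.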
If this is right
- Improves compositional generation by reinforcing coherent structures.
- Enhances text rendering accuracy in images.
- Strengthens text-image alignment.
- Reduces reward hacking when used with external rewards.
- Enables training without preference datasets or annotators.
Where Pith is reading between the lines
- The approach might generalize to other modalities like text or audio generation using similar diffusion processes.
- Iterative self-training could lead to progressively better models without human intervention.
- It offers a way to mitigate biases in external reward models by incorporating model-intrinsic signals.
Load-bearing premise
Accurate noise reconstruction on the model's own outputs corresponds to generations that humans would rate highly, so reinforcing it with RL improves perceived quality.
What would settle it
Human preference evaluations comparing models trained with SOLACE rewards against models trained with standard methods or no RL; lower preference scores for the SOLACE-trained models would refute the claim.
Original abstract
Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework for text-to-image diffusion models. It re-noises the model's own generated outputs, measures the denoising network's reconstruction error on the injected noise as an intrinsic self-confidence signal, and uses this scalar as a reward for reinforcement learning to improve compositional generation, text rendering, and text-image alignment without external reward models or preference data. The abstract claims consistent gains from reinforcing high-confidence generations and complementary benefits when combined with external rewards.
Significance. If the self-referential reconstruction error reliably tracks human-aligned quality rather than model familiarity or training-manifold proximity, the method could reduce dependence on costly external reward models and mitigate reward hacking. However, the absence of any reported metrics, baselines, or ablation studies in the abstract leaves the central data-to-claim link unevaluated, limiting assessment of practical impact.
major comments (2)
- [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.
- [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.
minor comments (1)
- [Abstract] The integration statement ('Integrating SOLACE with external rewards yields complementary improvements') would benefit from a brief description of the combination mechanism, even at a high level.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, acknowledging where the current manuscript is incomplete and outlining targeted revisions.
point-by-point responses
-
Referee: [Abstract] The assertion that SOLACE 'delivers consistent gains in compositional generation, text rendering, and text-image alignment' is unsupported by any quantitative metrics, baselines, human evaluations, or experimental results, rendering the central empirical claim unevaluable.
Authors: We agree that the abstract, in its current form, states the gains without accompanying numbers or references to experiments. The full manuscript contains quantitative results, baselines, and human evaluations in Section 4. We will revise the abstract to incorporate the key metrics (e.g., improvements on compositionality and alignment benchmarks) so that the claim is directly supported within the abstract itself. revision: yes
-
Referee: [Method] Self-confidence signal definition: The claim that low reconstruction error when recovering injected noise from the model's own outputs constitutes a reliable reward for prompt fidelity rests on an untested assumption; no derivation or preliminary evidence shows this error correlates with external quality measures rather than with outputs already close to the training distribution.
Authors: The comment is correct: the submitted manuscript provides only the definition and high-level motivation for the reconstruction-error signal without a derivation or explicit correlation study against external measures. We will add a short derivation based on the denoising objective and preliminary correlation experiments (including controls for training-manifold proximity) to the method section in the revision. revision: yes
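To make the promised analysis concrete, a hypothetical sketch of such a correlation check follows; `intrinsic_score_fn` and `external_score_fn` are placeholders (not APIs from the paper), standing in for the SOLACE self-confidence score and an external reward such as a CLIP-based or human-preference model.

```python
from scipy.stats import spearmanr

def confidence_quality_correlation(samples, intrinsic_score_fn, external_score_fn):
    """Rank correlation between intrinsic self-confidence and an external quality score."""
    intrinsic = [float(intrinsic_score_fn(s)) for s in samples]
    external = [float(external_score_fn(s)) for s in samples]
    rho, p_value = spearmanr(intrinsic, external)  # Spearman rho and its two-sided p-value
    return rho, p_value
```

A strong positive rho on held-out prompts, with samples additionally stratified by proximity to the training distribution (the control the authors promise), would speak to the referee's concern; a weak or negative rho would not.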
Circularity Check
No significant circularity; self-contained internal reward definition
full rationale
The paper explicitly constructs the SOLACE reward as low reconstruction error when the model re-noises and denoises its own outputs, then applies this scalar as an RL signal. This is a direct definitional choice of an intrinsic proxy rather than a derivation that reduces a claimed outcome (e.g., improved alignment) back to the same quantity by construction. No equations, fitted parameters, or self-citations are shown that force the central result to equal its inputs; the gains in composition and alignment are presented as empirical consequences of the RL updates, not tautological. The reward definition is therefore self-contained and independent of the external benchmarks used for evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Low reconstruction error after re-noising the model's outputs indicates high self-confidence that corresponds to higher-quality generations.
invented entities (1)
- SOLACE self-confidence signal: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: $\mathrm{MSE}_{i,t} = \frac{1}{K}\sum_{m=1}^{K} \big\lVert \hat{\epsilon}_\theta\big(z_t^{(i,m)}, t, c\big) - \epsilon^{(m)} \big\rVert_2^2$, $\; S_{i,t} = -\log(\mathrm{MSE}_{i,t} + \delta)$, and $R_{\mathrm{SOLACE}}$ is a weighted sum of the $S_{i,t}$.
- IndisputableMonolith/Foundation/BranchSelection · branch_selection (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Linked passage: "SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...