Recognition: 2 theorem links
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
Pith reviewed 2026-05-12 03:28 UTC · model grok-4.3
The pith
Super-Linear Advantage Shaping reshapes RL policy updates in text-to-image models by weighting the Fisher-Rao metric to amplify high-advantage signals and suppress noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By revisiting the functional update from an information geometry perspective and extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This relaxes constraints along high-advantage directions to amplify informative updates while tightening those in low-advantage regions to suppress illusory gradients; batch-level normalization further stabilizes training under varying reward scales. The result is faster training dynamics, stronger out-of-domain performance, better robustness to model scaling, reduced reward hacking, and preserved semantic and compositional fidelity compared to the DanceGRPO baseline.
What carries the argument
Super-Linear Advantage Shaping (SLAS), which extends the Fisher-Rao information metric with advantage-dependent weighting to produce a non-linear reshaping of the local policy space.
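The shaping quoted later on this page, f(A) = sign(A)·|A|^{1+γ}, is short enough to state in code. The sketch below is illustrative only: the function name, the NumPy formulation, and the γ values are assumptions for this review, not the paper's implementation.

```python
import numpy as np

def super_linear_shape(adv, gamma=0.5):
    """Super-linear advantage shaping: f(A) = sign(A) * |A|**(1 + gamma).

    Compared with the identity (linear) shaping, magnitudes above 1 are
    amplified and magnitudes below 1 are suppressed, which is the claimed
    mechanism for separating genuine high-advantage signal from noise.
    """
    adv = np.asarray(adv, dtype=float)
    return np.sign(adv) * np.abs(adv) ** (1.0 + gamma)

# gamma = 1.0 squares each magnitude while preserving sign:
shaped = super_linear_shape([2.0, 0.25, -0.25, -2.0], gamma=1.0)
# -> [4.0, 0.0625, -0.0625, -4.0]
```

Note that the crossover sits at |A| = 1 for any γ > 0, so the behavior of this sketch depends on how advantages are normalized before shaping.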
If this is right
- Training reaches higher performance in fewer steps than DanceGRPO across multiple model backbones.
- Out-of-domain generalization improves on benchmarks such as GenEval and UniGenBench++.
- Performance remains stable when models are scaled up, unlike prior methods.
- Reward hacking is reduced while semantic and compositional quality in generated images is maintained.
Where Pith is reading between the lines
- The same geometric weighting idea could be tested in RL post-training for text or video generators to see if it reduces hacking there too.
- If the non-linear structure works as described, it suggests designing future advantage estimators around information geometry rather than purely linear normalizations.
- Batch-level normalization combined with this reshaping may offer a general recipe for stabilizing RL when reward scales vary across prompts.
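That last point can be made concrete with a sketch. The paper's exact normalization is not reproduced here; the split below (per-prompt centering, one batch-wide scale) is one plausible reading of "batch-level normalization," and the helper name is hypothetical.

```python
import numpy as np

def batch_level_advantages(rewards, eps=1e-8):
    """Group-relative advantages with a single batch-level scale.

    rewards: shape (num_prompts, group_size), one scalar reward per sample.
    Rewards are centered per prompt (the group-relative signal of GRPO),
    then divided by one standard deviation computed over the whole batch,
    so a prompt whose rewards barely vary is not inflated to the same
    unit scale as a prompt with a wide reward spread.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (centered.std() + eps)

adv = batch_level_advantages([[1.0, 0.0], [10.0, 0.0]])
# Per-prompt standardization would give both prompts advantages of the
# same magnitude; here the wide-spread prompt keeps a larger magnitude.
```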
Load-bearing premise
That weighting the Fisher-Rao metric by advantage will create a non-linear geometry that reliably boosts genuine high-advantage updates and damps noise without creating new instabilities or policy biases.
What would settle it
A head-to-head run on the same backbones and benchmarks in which SLAS fails to improve out-of-domain scores on GenEval or UniGenBench++, or still exhibits reward hacking at larger model scales, would undercut the claim.
read the original abstract
Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Super-Linear Advantage Shaping (SLAS) for post-training text-to-image diffusion models via reinforcement learning. Building on Group Relative Policy Optimization (GRPO), it revisits the functional update through information geometry by extending the Fisher-Rao metric with advantage-dependent weighting. This is claimed to induce a non-linear geometric structure that relaxes constraints along high-advantage directions while tightening low-advantage regions, thereby amplifying genuine signals, suppressing noise, and mitigating reward hacking. Batch-level normalization is added for stability. The method is asserted to outperform the DanceGRPO baseline on multiple backbones with faster convergence, stronger out-of-domain results on GenEval and UniGenBench++, and better robustness to model scale while preserving semantic and compositional quality.
Significance. If the central claims hold with explicit derivations and supporting metrics, SLAS would offer a geometrically principled alternative to standard advantage normalization in RL post-training of generative models. The information-geometry framing could help address reward hacking—a persistent issue in T2I and multimodal RLHF—by reshaping the local policy manifold in a non-linear fashion. This line of work has potential to improve training dynamics and generalization in large-scale diffusion post-training, provided the weighting does not introduce new instabilities.
major comments (3)
- [Abstract and §3] Abstract and §3: The abstract states that extending the Fisher-Rao metric with advantage-dependent weighting produces a non-linear geometric structure yielding super-linear advantage shaping, yet no explicit equation for the modified metric (e.g., the form of the weighting function g(A) or the resulting Riemannian inner product) or the derived policy update direction is supplied. Without this, it cannot be verified whether the construction remains non-linear or reduces to a linear function of advantage by design, directly bearing on the central claim of reward-hacking mitigation.
- [§4] §4 (Experiments): The manuscript claims that SLAS consistently surpasses DanceGRPO across backbones and benchmarks with faster dynamics, improved GenEval/UniGenBench++ scores, and reduced reward hacking, but supplies no quantitative tables, specific metric values, error bars, training curves, or ablation results on the weighting term and batch normalization. This absence prevents assessment of effect sizes and undermines the empirical support for the method's superiority and stability.
- [§3.2] §3.2 (Stability analysis): No derivation or bound is provided on the induced Riemannian curvature, gradient behavior under noisy or misspecified rewards, or conditions guaranteeing that the advantage-dependent weighting amplifies genuine signals without introducing new policy biases or training instabilities, which is load-bearing for the claim that the approach reliably mitigates reward hacking.
minor comments (2)
- [§2] The distinction between 'linear in the advantage' (after removing prompt-level std) and the proposed 'super-linear' shaping should be formalized with a short comparison of the resulting ascent directions.
- [Notation] Ensure all notation for the advantage function, Fisher-Rao metric, and weighting is defined consistently before first use; a small table summarizing symbols would improve readability.
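For reference, the comparison requested in the first minor comment could be as short as the following, using only the shaping form quoted in the Lean-theorem excerpts on this page (γ > 0 is the assumed shaping exponent; this is a sketch, not the paper's derivation):

```latex
% Group-relative advantage for sample i among G samples of one prompt:
\[
\Delta r_i \;=\; r_i \;-\; \frac{1}{G}\sum_{j=1}^{G} r_j .
\]
% Ascent direction after removing the prompt-level std (linear in the advantage):
\[
\delta\pi(y_i \mid x) \;\propto\; \pi_0(y_i \mid x)\,\Delta r_i .
\]
% Super-linear shaping (SLAS), with assumed exponent \gamma > 0:
\[
\delta\pi(y_i \mid x) \;\propto\; \pi_0(y_i \mid x)\,
\operatorname{sign}(\Delta r_i)\,\lvert \Delta r_i \rvert^{\,1+\gamma} .
\]
```

For |Δr_i| > 1 the super-linear direction dominates the linear one, and for |Δr_i| < 1 it is damped, matching the "relax high-advantage, tighten low-advantage" description.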
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas where the presentation of the mathematical foundations and empirical results can be strengthened. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, derivations, and quantitative details.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3: The abstract states that extending the Fisher-Rao metric with advantage-dependent weighting produces a non-linear geometric structure yielding super-linear advantage shaping, yet no explicit equation for the modified metric (e.g., the form of the weighting function g(A) or the resulting Riemannian inner product) or the derived policy update direction is supplied. Without this, it cannot be verified whether the construction remains non-linear or reduces to a linear function of advantage by design, directly bearing on the central claim of reward-hacking mitigation.
Authors: We agree that an explicit formulation is necessary to substantiate the central geometric claim. In the revised manuscript we will insert the precise definition of the advantage-weighted Fisher-Rao metric, the functional form of the weighting g(A) (a strictly super-linear function of advantage), the resulting Riemannian inner product, and the closed-form policy update direction obtained from the information-geometric projection. A step-by-step derivation will be added to §3 so that readers can verify the non-linearity and its relation to reward-hacking mitigation. revision: yes
-
Referee: [§4] §4 (Experiments): The manuscript claims that SLAS consistently surpasses DanceGRPO across backbones and benchmarks with faster dynamics, improved GenEval/UniGenBench++ scores, and reduced reward hacking, but supplies no quantitative tables, specific metric values, error bars, training curves, or ablation results on the weighting term and batch normalization. This absence prevents assessment of effect sizes and undermines the empirical support for the method's superiority and stability.
Authors: We acknowledge that the current experimental section lacks the granularity needed for independent assessment. The revised version will include comprehensive tables with exact metric values and standard deviations across multiple seeds, training curves comparing convergence speed, and dedicated ablation studies isolating the contribution of the advantage-dependent weighting and the batch-level normalization. These additions will quantify effect sizes and demonstrate stability. revision: yes
-
Referee: [§3.2] §3.2 (Stability analysis): No derivation or bound is provided on the induced Riemannian curvature, gradient behavior under noisy or misspecified rewards, or conditions guaranteeing that the advantage-dependent weighting amplifies genuine signals without introducing new policy biases or training instabilities, which is load-bearing for the claim that the approach reliably mitigates reward hacking.
Authors: We recognize that a formal stability analysis is required to support the reliability claim. In the revision we will expand §3.2 with (i) an explicit derivation of the Riemannian curvature induced by the weighting, (ii) a first-order analysis of gradient behavior under additive reward noise, and (iii) sufficient conditions under which the weighting amplifies informative directions while bounding policy bias. These additions will directly address the concern about new instabilities. revision: yes
Circularity Check
No significant circularity detected in SLAS derivation
full rationale
The paper derives Super-Linear Advantage Shaping by extending the standard Fisher-Rao metric with advantage-dependent weighting from an information-geometry perspective on the GRPO functional update. This construction is presented as a direct modification to reshape the policy space (relaxing high-advantage directions, tightening low-advantage ones) rather than a reparameterization or fit to target metrics. No equations reduce the claimed non-linear structure to the inputs by definition, no self-citations are load-bearing for the core premise, and the linear baseline obtained by removing the prompt-level std term is explicitly distinguished as insufficient. The overall chain remains self-contained against external information-geometry results and GRPO baselines.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Extending the Fisher-Rao metric with advantage-dependent weighting relaxes constraints along high-advantage directions while tightening low-advantage regions.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
By extending the Fisher–Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure... δπ(y|x)∝π0(y|x)·sign(A(x,y))|A(x,y)|^{1+γ} (Theorem 3)
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Super-Linear Advantage Shaping (SLAS) ... fAi = sign(Δr)·|Δr|^{1+γ}
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022
-
[2]
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
-
[3]
Flux
Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024
-
[4]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025
-
[5]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025
-
[6]
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025
-
[7]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025
-
[8]
Kaiwen Duan, Hongwei Yao, Yufei Chen, Ziyun Li, Tong Qiao, Zhan Qin, and Cong Wang. Badreward: Clean-label poisoning of reward models in text-to-image rlhf.arXiv preprint arXiv:2506.03234, 2025
-
[9]
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025
-
[10]
Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer.Information geometry, volume 64. Springer, 2017
-
[11]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022
-
[12]
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023
-
[13]
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023
-
[14]
UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation
Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, et al. Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025
-
[15]
Training diffusion models with reinforcement learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024
-
[16]
Dpok: reinforcement learning for fine-tuning text-to-image diffusion models
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: reinforcement learning for fine-tuning text-to-image diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 79858–79885, 2023
-
[17]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
-
[18]
Diffusion model alignment using direct preference optimization
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024
-
[19]
Using human feedback to fine-tune diffusion models without any reward model
Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024
-
[20]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
-
[21]
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, and Cho-Jui Hsieh. Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026
-
[22]
GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping, 2025
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping.arXiv preprint arXiv:2510.22319, 2025
-
[23]
GARDO: Reinforcing diffusion models without reward hacking
Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, and Ling Pan. Gardo: Reinforcing diffusion models without reward hacking.arXiv preprint arXiv:2512.24138, 2025
-
[24]
Michael Bereket and Jure Leskovec. Uncalibrated reasoning: Grpo induces overconfidence for stochastic outcomes.arXiv preprint arXiv:2508.11800, 2025
-
[25]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021
-
[26]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025
-
[27]
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023
-
[28]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022
-
[29]
High-Dimensional Probability
Roman Vershynin. High-dimensional probability. Cambridge University Press, 2025
-
[30]
Deep learning via hessian-free optimization
James Martens et al. Deep learning via hessian-free optimization. InIcml, volume 27, pages 735–742, 2010
-
[31]
Shampoo: Preconditioned stochastic tensor optimization
Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–1850. PMLR, 2018
-
[32]
Text-to-image diffusion models in generative ai: A survey.arXiv preprint arXiv:2303.07909, 2023
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion models in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023
-
[33]
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, et al. Visual generation in the new era: An evolution from atomic mapping to agentic world modeling.arXiv preprint arXiv:2604.28185, 2026
-
[34]
Generating images from captions with attention.arXiv preprint arXiv:1511.02793, 2015
Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention.arXiv preprint arXiv:1511.02793, 2015
-
[35]
Generative adversarial text to image synthesis
Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. InInternational conference on machine learning, pages 1060–1069. Pmlr, 2016
-
[36]
Attention-gan for object transfiguration in wild images
Xinyuan Chen, Chang Xu, Xiaokang Yang, and Dacheng Tao. Attention-gan for object transfiguration in wild images. InProceedings of the European conference on computer vision (ECCV), pages 164–180, 2018
-
[37]
Gan-control: Explicitly controllable gans
Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, and Gerard Medioni. Gan-control: Explicitly controllable gans. InProceedings of the IEEE/CVF international conference on computer vision, pages 14083–14093, 2021
-
[38]
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021
-
[39]
Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to- image generation via hierarchical transformers.Advances in Neural Information Processing Systems, 35:16890–16902, 2022
-
[40]
Vector quantized diffusion model for text-to-image synthesis
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10696–10706, 2022
-
[41]
Sdxl: Improving latent diffusion models for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, 2024
-
[42]
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024
-
[43]
Show-o2: Improved Native Unified Multimodal Models
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
-
[44]
YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, and Jingdong Wang. Cologen: Progressive learning of concept-localization duality for unified image generation.arXiv preprint arXiv:2602.22150, 2026
-
[45]
Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, et al. Query-kontext: An unified multimodal model for image generation and editing.arXiv preprint arXiv:2509.26641, 2025
-
[46]
Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025
-
[47]
Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, and Errui Ding. Explanatory instructions: Towards unified vision tasks understanding and zero-shot generalization.arXiv preprint arXiv:2412.18525, 2024
-
[48]
Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in architectural floor plan generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025
-
[49]
Bo Fang, YuXin Song, Haoyuan Sun, Qiangqiang Wu, Wenhao Wu, and Antoni B. Chan. Threading keyframe with narratives: MLLMs as strong long video comprehenders. InThe Fourteenth International Conference on Learning Representations, 2026. URL https:// openreview.net/forum?id=kyLS9EhPhY
-
[50]
Viss-r1: Self-supervised reinforcement video reasoning.arXiv preprint arXiv:2511.13054, 2025
Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B Chan. Viss-r1: Self-supervised reinforcement video reasoning.arXiv preprint arXiv:2511.13054, 2025
-
[51]
Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, et al. Refalign: Representation alignment for reference-to-video generation.arXiv preprint arXiv:2603.25743, 2026
-
[52]
Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, et al. Sama: Factorized semantic anchoring and motion alignment for instruction-guided video editing.arXiv preprint arXiv:2603.19228, 2026
-
[53]
Bo Xia, Haoyuan Sun, Bo Yuan, Zhiheng Li, Bin Liang, and Xueqian Wang. A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025
-
[54]
Haoyuan Sun, Bo Xia, Yifu Luo, Tiantian Zhang, and Xueqian Wang. Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models. Transactions on Machine Learning Research, 2026
-
[55]
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017
-
[56]
Fine-Tuning Language Models from Human Preferences
Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019
-
[57]
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback.arXiv preprint arXiv:2307.15217, 2023
-
[58]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023
-
[59]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023
-
[60]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025
-
[61]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
-
[62]
Orpo: Monolithic preference optimization without reference model
Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170–11189, 2024
-
[63]
Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198– 124235, 2024
[64]
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
[65]
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 37:116617–116637, 2024.
[66]
Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. In Findings of the Association for Computational Linguistics ACL 2024, pages 9954–9972, 2024.
[67]
Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. In Forty-first International Conference on Machine Learning, 2024.
[68]
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, et al. LiPO: Listwise preference optimization through learning-to-rank. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...
[69]
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[70]
Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. CPPO: Accelerating the training of group relative policy optimization-based reasoning models. arXiv preprint arXiv:2503.22342, 2025.
[71]
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
[72]
Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118, 2025.
[73]
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025.
[74]
Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. MiniMax-M1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585, 2025.
[75]
Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization. arXiv preprint arXiv:2507.20673, 2025.
[76]
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025.
[77]
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[78]
Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards. arXiv preprint arXiv:2309.17400, 2023.
[79]
Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. End-to-end diffusion latent optimization improves classifier guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7280–7290, 2023.
[80]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.