pith. machine review for the scientific record.

arxiv: 2605.10937 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords Super-Linear Advantage Shaping · Text-to-Image Models · Reinforcement Learning · Group Relative Policy Optimization · Reward Hacking · Information Geometry · Policy Optimization

The pith

Super-Linear Advantage Shaping reshapes RL policy updates in text-to-image models by weighting the Fisher-Rao metric to amplify high-advantage signals and suppress noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix reward hacking in reinforcement-learning post-training of text-to-image models, where standard GRPO methods let models exploit flaws in reward functions instead of achieving real gains. It argues that the usual prompt-level normalization miscalibrates updates, and that even dropping the standard-deviation term leaves an advantage that is linear in the reward gap and so cannot separate strong signals from noise. By extending the Fisher-Rao information metric with advantage-dependent weighting, the method creates a non-linear geometry that relaxes updates in high-advantage directions while tightening low-advantage ones, and adds batch-level normalization for stability. A sympathetic reader would care because reliable post-training without hacking could let image generators improve more consistently across scales and out-of-domain tasks while keeping semantic quality intact.
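
To make that contrast concrete, here is a minimal sketch (not from the paper) that computes the vanilla group-relative advantage and then applies a hypothetical super-linear shaping. The power-law form, the γ value, and the function names are illustrative assumptions; the excerpt does not give the paper's exact weighting Φγ.

    import numpy as np

    def linear_advantage(rewards):
        # Vanilla group-relative (linear) advantage: each reward minus the
        # group mean, i.e. the Delta-r quantity quoted in Figure 2's caption.
        r = np.asarray(rewards, dtype=float)
        return r - r.mean()

    def superlinear_shaping(advantages, gamma=1.5):
        # Hypothetical super-linear shaping: magnitudes are raised to a power
        # gamma > 1 so that large (informative) advantages grow relative to
        # small (noisy) ones. The paper's actual Phi_gamma(|A|) is not
        # reproduced here; this power law is only an assumption.
        a = np.asarray(advantages, dtype=float)
        return np.sign(a) * np.abs(a) ** gamma

    rewards = [0.90, 0.50, 0.48, 0.52]   # one prompt, a group of G = 4 samples
    lin = linear_advantage(rewards)       # [ 0.30, -0.10, -0.12, -0.08]
    sup = superlinear_shaping(lin)        # strong sample amplified, noise damped further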

Core claim

By revisiting the functional update from an information geometry perspective and extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This relaxes constraints along high-advantage directions to amplify informative updates while tightening those in low-advantage regions to suppress illusory gradients; batch-level normalization further stabilizes training under varying reward scales. The result is faster training dynamics, stronger out-of-domain performance, better robustness to model scaling, reduced reward hacking, and preserved semantic and compositional fidelity compared to the DanceGRPO baseline.
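
The excerpt does not spell out the modified metric (the referee report below flags this), but one plausible reading consistent with Figure 1's caption is to divide the Fisher-Rao metric by an increasing weight of the advantage magnitude, which loosens the trust region where |A| is large and tightens it where |A| is small. The LaTeX below is an illustrative reconstruction under that assumption, not the paper's stated form.

    % Illustrative advantage-weighted Fisher-Rao inner product (assumed form):
    \[
      \langle u, v \rangle_{\theta, A}
        \;=\; \frac{1}{\Phi_\gamma\!\big(|A(x,y)|\big)}\; u^{\top} F(\theta)\, v,
      \qquad
      F(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
        \!\left[\nabla_\theta \log \pi_\theta(y \mid x)\,
                \nabla_\theta \log \pi_\theta(y \mid x)^{\top}\right].
    \]
    % Preconditioning the vanilla policy-gradient term A * grad log pi by this
    % metric rescales the natural-gradient step by Phi_gamma(|A|):
    \[
      \delta\theta \;\propto\; \Phi_\gamma\!\big(|A(x,y)|\big)\, A(x,y)\;
        F(\theta)^{-1} \nabla_\theta \log \pi_\theta(y \mid x),
    \]
    % which is super-linear in |A| whenever Phi_gamma is increasing, matching
    % the described relax/tighten behavior along high/low-advantage directions.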

What carries the argument

Super-Linear Advantage Shaping (SLAS), which extends the Fisher-Rao information metric with advantage-dependent weighting to produce a non-linear reshaping of the local policy space.

If this is right

  • Training reaches higher performance in fewer steps than DanceGRPO across multiple model backbones.
  • Out-of-domain generalization improves on benchmarks such as GenEval and UniGenBench++.
  • Performance remains stable when models are scaled up, unlike prior methods.
  • Reward hacking is reduced while semantic and compositional quality in generated images is maintained.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same geometric weighting idea could be tested in RL post-training for text or video generators to see if it reduces hacking there too.
  • If the non-linear structure works as described, it suggests designing future advantage estimators around information geometry rather than purely linear normalizations.
  • Batch-level normalization combined with this reshaping may offer a general recipe for stabilizing RL when reward scales vary across prompts; one possible version of that recipe is sketched just after this list.
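
A minimal sketch of what such a recipe might look like, assuming per-prompt centering (the linear advantage) followed by a single batch-level rescaling; the helper name and the exact statistics are assumptions, since the excerpt does not give the paper's procedure.

    import numpy as np

    def batch_normalized_advantages(rewards_per_prompt, eps=1e-8):
        # Center each prompt's rewards by its own mean (the linear advantage
        # r_i - mean(r) from Figure 2), then rescale every advantage by one
        # batch-level standard deviation instead of a per-prompt std.
        centered = [np.asarray(r, dtype=float) - np.mean(r) for r in rewards_per_prompt]
        batch_scale = np.concatenate(centered).std() + eps
        return [c / batch_scale for c in centered]

    # One batch, two prompts whose reward spreads differ in scale:
    batch = [[0.90, 0.50, 0.48, 0.52], [3.1, 2.7, 2.9, 3.3]]
    advs = batch_normalized_advantages(batch)

The point of the shared scale is that no single prompt's noisy standard deviation can blow up or shrink its own group's updates.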

Load-bearing premise

That weighting the Fisher-Rao metric by advantage will create a non-linear geometry that reliably boosts genuine high-advantage updates and damps noise without creating new instabilities or policy biases.

What would settle it

A head-to-head run on the same backbones and benchmarks: the claim fails if SLAS does not improve out-of-domain scores on GenEval or UniGenBench++, or if it still exhibits reward hacking at larger model scales.

Figures

Figures reproduced from arXiv: 2605.10937 by Bo Fang, Haoyuan Sun, Jing Wang, Jun Yin, Miao Zhang, Pengyu Zeng, Shijian Lu, Tiantian Zhang, Xueqian Wang, Yifu Luo, Yu Lu, Yuxin Song.

Figure 1
Figure 1: Super-Linear Advantage Shaping. The γ-weighted variational metric Φγ(|A(x, y)|) reshapes the local geometry of the probability simplex, amplifying high-advantage directions while suppressing noisy, low-advantage ones. From Theorem 3, it can be concluded that introducing advantage-dependent weighting Φγ(|A(x, y)|) into the vanilla Fisher–Rao information metric is equivalent to an advantage-dependent rescaling in the tan… view at source ↗
Figure 2
Figure 2: Training Dynamics on FLUX.1 Dev. Left: reward mean of HPS-v2.1. Right: reward mean of CLIPScore. Blue lines represent SLAS; orange lines denote DanceGRPO. Here ∆r = r_i − mean(r_1, r_2, …, r_G) denotes the vanilla linear advantage; further discussion of γ, specifically its trust-region bounds, is provided in Appendix E. Moreover, simply removing the standard deviation term introduces the… view at source ↗
Figure 3
Figure 3: Qualitative comparison of generations from FLUX.1 Dev, DanceGRPO, and SLAS. view at source ↗
Figure 4
Figure 4: Combined with our theoretical conclusion, the normalized advantage assigns a weight of … view at source ↗
read the original abstract

Recently, post-training methods based on reinforcement learning, with a particular focus on Group Relative Policy Optimization (GRPO), have emerged as the robust paradigm for further advancement of text-to-image (T2I) models. However, these methods are often prone to reward hacking, wherein models exploit biases in imperfect reward functions rather than yielding genuine performance gains. In this work, we identify that normalization could lead to miscalibration and directly removing the prompt-level standard deviation term yields an optimal policy ascent direction that is linear in the advantage but still limits the separation of genuine signals from noise. To mitigate the above issues, we propose Super-Linear Advantage Shaping (SLAS) by revisiting the functional update from an information geometry perspective. By extending the Fisher-Rao information metric with advantage-dependent weighting, SLAS introduces a non-linear geometric structure that reshapes the local policy space. This design relaxes constraints along high-advantage directions to amplify informative updates, while tightening those in low-advantage regions to suppress illusory gradients. In addition, batch-level normalization is applied to stabilize training under varying reward scales. Extensive evaluations demonstrate that SLAS consistently surpasses the DanceGRPO baseline across multiple backbones and benchmarks. In particular, it yields faster training dynamics, improved out-of-domain performance on GenEval and UniGenBench++, and enhanced robustness to model scaling, while mitigating reward hacking and preserving semantic and compositional fidelity in generations.
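
For readers unfamiliar with GRPO's bookkeeping, the two advantage forms discussed above can be written out: the first is the standard group-normalized advantage used by GRPO-style methods, and the second is the linear advantage Δr quoted in Figure 2's caption after the prompt-level standard deviation is removed. The super-linear shaping itself is not reproduced here, since its exact form is not given in the excerpt.

    \[
      A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
                     {\operatorname{std}(r_1, \dots, r_G)},
      \qquad
      \Delta r_i \;=\; r_i - \operatorname{mean}(r_1, \dots, r_G).
    \]
    % The abstract argues the first is miscalibrated by the prompt-level std,
    % while the second gives a clean ascent direction but, being linear in the
    % reward gap, cannot widen the separation between genuine signal and noise.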

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Super-Linear Advantage Shaping (SLAS) for post-training text-to-image diffusion models via reinforcement learning. Building on Group Relative Policy Optimization (GRPO), it revisits the functional update through information geometry by extending the Fisher-Rao metric with advantage-dependent weighting. This is claimed to induce a non-linear geometric structure that relaxes constraints along high-advantage directions while tightening low-advantage regions, thereby amplifying genuine signals, suppressing noise, and mitigating reward hacking. Batch-level normalization is added for stability. The method is asserted to outperform the DanceGRPO baseline on multiple backbones with faster convergence, stronger out-of-domain results on GenEval and UniGenBench++, and better robustness to model scale while preserving semantic and compositional quality.

Significance. If the central claims hold with explicit derivations and supporting metrics, SLAS would offer a geometrically principled alternative to standard advantage normalization in RL post-training of generative models. The information-geometry framing could help address reward hacking—a persistent issue in T2I and multimodal RLHF—by reshaping the local policy manifold in a non-linear fashion. This line of work has potential to improve training dynamics and generalization in large-scale diffusion post-training, provided the weighting does not introduce new instabilities.

major comments (3)
  1. [Abstract and §3] The abstract states that extending the Fisher-Rao metric with advantage-dependent weighting produces a non-linear geometric structure yielding super-linear advantage shaping, yet no explicit equation for the modified metric (e.g., the form of the weighting function g(A) or the resulting Riemannian inner product) or the derived policy update direction is supplied. Without this, it cannot be verified whether the construction remains non-linear or reduces to a linear function of advantage by design, directly bearing on the central claim of reward-hacking mitigation.
  2. [§4, Experiments] The manuscript claims that SLAS consistently surpasses DanceGRPO across backbones and benchmarks with faster dynamics, improved GenEval/UniGenBench++ scores, and reduced reward hacking, but supplies no quantitative tables, specific metric values, error bars, training curves, or ablation results on the weighting term and batch normalization. This absence prevents assessment of effect sizes and undermines the empirical support for the method's superiority and stability.
  3. [§3.2, Stability analysis] No derivation or bound is provided on the induced Riemannian curvature, gradient behavior under noisy or misspecified rewards, or conditions guaranteeing that the advantage-dependent weighting amplifies genuine signals without introducing new policy biases or training instabilities, which is load-bearing for the claim that the approach reliably mitigates reward hacking.
minor comments (2)
  1. [§2] The distinction between 'linear in the advantage' (after removing prompt-level std) and the proposed 'super-linear' shaping should be formalized with a short comparison of the resulting ascent directions.
  2. [Notation] Ensure all notation for the advantage function, Fisher-Rao metric, and weighting is defined consistently before first use; a small table summarizing symbols would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important areas where the presentation of the mathematical foundations and empirical results can be strengthened. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, derivations, and quantitative details.

read point-by-point responses
  1. Referee: [Abstract and §3] The abstract states that extending the Fisher-Rao metric with advantage-dependent weighting produces a non-linear geometric structure yielding super-linear advantage shaping, yet no explicit equation for the modified metric (e.g., the form of the weighting function g(A) or the resulting Riemannian inner product) or the derived policy update direction is supplied. Without this, it cannot be verified whether the construction remains non-linear or reduces to a linear function of advantage by design, directly bearing on the central claim of reward-hacking mitigation.

    Authors: We agree that an explicit formulation is necessary to substantiate the central geometric claim. In the revised manuscript we will insert the precise definition of the advantage-weighted Fisher-Rao metric, the functional form of the weighting g(A) (a strictly super-linear function of advantage), the resulting Riemannian inner product, and the closed-form policy update direction obtained from the information-geometric projection. A step-by-step derivation will be added to §3 so that readers can verify the non-linearity and its relation to reward-hacking mitigation. revision: yes

  2. Referee: [§4, Experiments] The manuscript claims that SLAS consistently surpasses DanceGRPO across backbones and benchmarks with faster dynamics, improved GenEval/UniGenBench++ scores, and reduced reward hacking, but supplies no quantitative tables, specific metric values, error bars, training curves, or ablation results on the weighting term and batch normalization. This absence prevents assessment of effect sizes and undermines the empirical support for the method's superiority and stability.

    Authors: We acknowledge that the current experimental section lacks the granularity needed for independent assessment. The revised version will include comprehensive tables with exact metric values and standard deviations across multiple seeds, training curves comparing convergence speed, and dedicated ablation studies isolating the contribution of the advantage-dependent weighting and the batch-level normalization. These additions will quantify effect sizes and demonstrate stability. revision: yes

  3. Referee: [§3.2, Stability analysis] No derivation or bound is provided on the induced Riemannian curvature, gradient behavior under noisy or misspecified rewards, or conditions guaranteeing that the advantage-dependent weighting amplifies genuine signals without introducing new policy biases or training instabilities, which is load-bearing for the claim that the approach reliably mitigates reward hacking.

    Authors: We recognize that a formal stability analysis is required to support the reliability claim. In the revision we will expand §3.2 with (i) an explicit derivation of the Riemannian curvature induced by the weighting, (ii) a first-order analysis of gradient behavior under additive reward noise, and (iii) sufficient conditions under which the weighting amplifies informative directions while bounding policy bias. These additions will directly address the concern about new instabilities. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in SLAS derivation

full rationale

The paper derives Super-Linear Advantage Shaping by extending the standard Fisher-Rao metric with advantage-dependent weighting from an information-geometry perspective on the GRPO functional update. This construction is presented as a direct modification to reshape the policy space (relaxing high-advantage directions, tightening low-advantage ones) rather than a reparameterization or fit to target metrics. No equations reduce the claimed non-linear structure to the inputs by definition, no self-citations are load-bearing for the core premise, and the linear baseline obtained by removing the prompt-level std term is explicitly distinguished as insufficient. The overall chain remains self-contained against external information-geometry results and GRPO baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the geometric reinterpretation of the policy update and the assumption that advantage-dependent weighting produces the desired non-linear separation of signal from noise. No explicit free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • domain assumption: Extending the Fisher-Rao metric with advantage-dependent weighting relaxes constraints along high-advantage directions while tightening low-advantage regions.
    Invoked when describing how SLAS reshapes the local policy space.

pith-pipeline@v0.9.0 · 5589 in / 1392 out tokens · 86162 ms · 2026-05-12T03:28:46.829622+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · 27 internal anchors

  1. [1]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  2. [2]

    Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InForty-first international conference on machine learning, 2024

  3. [3]

    FLUX (https://github.com/black-forest-labs/flux)

    Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

  4. [4]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

  5. [5]

    DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  6. [6]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via online rl.arXiv preprint arXiv:2505.05470, 2025

  7. [7]

    DanceGRPO: Unleashing GRPO on Visual Generation

    Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. Dancegrpo: Unleashing grpo on visual generation. arXiv preprint arXiv:2505.07818, 2025

  8. [8]

    BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

    Kaiwen Duan, Hongwei Yao, Yufei Chen, Ziyun Li, Tong Qiao, Zhan Qin, and Cong Wang. Badreward: Clean-label poisoning of reward models in text-to-image rlhf.arXiv preprint arXiv:2506.03234, 2025

  9. [9]

    Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

    Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-grpo: Pairwise preference reward-based grpo for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025

  10. [10]

    Information Geometry

    Nihat Ay, Jürgen Jost, Hông Vân Lê, and Lorenz Schwachhöfer.Information geometry, volume 64. Springer, 2017

  11. [11]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  12. [12]

    Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis.arXiv preprint arXiv:2306.09341, 2023

  13. [13]

    GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  14. [14]

    UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

    Yibin Wang, Zhimin Li, Yuhang Zang, Jiazi Bu, Yujie Zhou, Yi Xin, Junjun He, Chunyu Wang, Qinglin Lu, Cheng Jin, et al. Unigenbench++: A unified semantic evaluation benchmark for text-to-image generation.arXiv preprint arXiv:2510.18701, 2025

  15. [15]

    Training diffusion models with reinforcement learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InThe Twelfth International Conference on Learning Representations, 2024

  16. [16]

    Dpok: reinforcement learning for fine-tuning text-to-image diffusion models

    Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. Dpok: reinforcement learning for fine-tuning text-to-image diffusion models. InProceedings of the 37th Interna- tional Conference on Neural Information Processing Systems, pages 79858–79885, 2023. 10

  17. [17]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  18. [18]

    Diffusion model alignment using direct preference optimization

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024

  19. [19]

    Using human feedback to fine-tune diffusion models without any reward model

    Kai Yang, Jian Tao, Jiafei Lyu, Chunjiang Ge, Jiaxin Chen, Weihan Shen, Xiaolong Zhu, and Xiu Li. Using human feedback to fine-tune diffusion models without any reward model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8941–8951, 2024

  20. [20]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  21. [21]

    Understanding Reward Hacking in Text-to-Image Reinforcement Learning

    Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, and Cho-Jui Hsieh. Understanding reward hacking in text-to-image reinforcement learning.arXiv preprint arXiv:2601.03468, 2026

  22. [22]

    GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

    Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. Grpo-guard: Mitigating implicit over-optimization in flow matching via regulated clipping.arXiv preprint arXiv:2510.22319, 2025

  23. [23]

    GARDO: Reinforcing diffusion models without reward hacking

    Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, and Ling Pan. Gardo: Reinforcing diffusion models without reward hacking.arXiv preprint arXiv:2512.24138, 2025

  24. [24]

    Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes

    Michael Bereket and Jure Leskovec. Uncalibrated reasoning: Grpo induces overconfidence for stochastic outcomes.arXiv preprint arXiv:2508.11800, 2025

  25. [25]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  26. [26]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  27. [27]

    Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation.Advances in neural information processing systems, 36:36652–36663, 2023

  28. [28]

    LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

  29. [29]

    High-Dimensional Probability

    Roman Vershynin.High-dimensional probability. Cambridge University Press, 2025

  30. [30]

    Deep learning via hessian-free optimization

    James Martens et al. Deep learning via hessian-free optimization. InIcml, volume 27, pages 735–742, 2010

  31. [31]

    Shampoo: Preconditioned stochastic tensor optimization

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. InInternational Conference on Machine Learning, pages 1842–1850. PMLR, 2018

  32. [32]

    Text-to-Image Diffusion Models in Generative AI: A Survey

    Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion models in generative ai: A survey.arXiv preprint arXiv:2303.07909, 2023. 11

  33. [33]

    Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, et al. Visual generation in the new era: An evolution from atomic mapping to agentic world modeling.arXiv preprint arXiv:2604.28185, 2026

  34. [34]

    Generating Images from Captions with Attention

    Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating images from captions with attention.arXiv preprint arXiv:1511.02793, 2015

  35. [35]

    Generative adversarial text to image synthesis

    Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. InInternational conference on machine learning, pages 1060–1069. Pmlr, 2016

  36. [36]

    Attention-gan for object transfiguration in wild images

    Xinyuan Chen, Chang Xu, Xiaokang Yang, and Dacheng Tao. Attention-gan for object transfiguration in wild images. InProceedings of the European conference on computer vision (ECCV), pages 164–180, 2018

  37. [37]

    Gan-control: Explicitly controllable gans

    Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, and Gerard Medioni. Gan-control: Explicitly controllable gans. InProceedings of the IEEE/CVF international conference on computer vision, pages 14083–14093, 2021

  38. [38]

    CogView: Mastering Text-to-Image Generation via Transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers.Advances in neural information processing systems, 34:19822–19835, 2021

  39. [39]

    CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

    Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. Cogview2: Faster and better text-to- image generation via hierarchical transformers.Advances in Neural Information Processing Systems, 35:16890–16902, 2022

  40. [40]

    Vector quantized diffusion model for text-to-image synthesis

    Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10696–10706, 2022

  41. [41]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. InThe Twelfth International Conference on Learning Representations, 2024

  42. [42]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  43. [43]

    Show-o2: Improved Native Unified Multimodal Models

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

  44. [44]

    Cologen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

    YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, and Jingdong Wang. Cologen: Progressive learning of concept-localization duality for unified image generation.arXiv preprint arXiv:2602.22150, 2026

  45. [45]

    Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

    Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, et al. Query-kontext: An unified multimodal model for image generation and editing.arXiv preprint arXiv:2509.26641, 2025

  46. [46]

    Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models

    Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models.arXiv preprint arXiv:2505.18536, 2025

  47. [47]

    Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-Shot Generalization

    Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, and Errui Ding. Explanatory instructions: Towards unified vision tasks understanding and zero-shot generalization.arXiv preprint arXiv:2412.18525, 2024

  48. [48]

    Floorplan-LLaMA: Aligning Architects' Feedback and Domain Knowledge in Architectural Floor Plan Generation

    Jun Yin, Pengyu Zeng, Haoyuan Sun, Yuqin Dai, Han Zheng, Miao Zhang, Yachao Zhang, and Shuai Lu. Floorplan-llama: Aligning architects’ feedback and domain knowledge in archi- tectural floor plan generation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6640–6662, 2025. 12

  49. [49]

    Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders

    Bo Fang, YuXin Song, Haoyuan Sun, Qiangqiang Wu, Wenhao Wu, and Antoni B. Chan. Threading keyframe with narratives: MLLMs as strong long video comprehenders. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=kyLS9EhPhY

  50. [50]

    Viss-R1: Self-Supervised Reinforcement Video Reasoning

    Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, and Antoni B Chan. Viss-r1: Self-supervised reinforcement video reasoning.arXiv preprint arXiv:2511.13054, 2025

  51. [51]

    Refalign: Representation Alignment for Reference-to-Video Generation

    Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, et al. Refalign: Representation alignment for reference-to-video generation.arXiv preprint arXiv:2603.25743, 2026

  52. [52]

    Sama: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

    Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, et al. Sama: Factorized semantic anchoring and motion alignment for instruction-guided video editing.arXiv preprint arXiv:2603.19228, 2026

  53. [53]

    A Delay-Robust Method for Enhanced Real-Time Reinforcement Learning

    Bo Xia, Haoyuan Sun, Bo Yuan, Zhiheng Li, Bin Liang, and Xueqian Wang. A delay-robust method for enhanced real-time reinforcement learning.Neural Networks, 181:106769, 2025

  54. [54]

    Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models

    Haoyuan Sun, Bo Xia, Yifu Luo, Tiantian Zhang, and Xueqian Wang. Calibration enhanced decision maker: Towards trustworthy sequential decision-making with large sequence models. Transactions on Machine Learning Research, 2026

  55. [55]

    Deep Reinforcement Learning from Human Preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  56. [56]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  57. [57]

    Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback.arXiv preprint arXiv:2307.15217, 2023

  58. [58]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  59. [59]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  60. [60]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  61. [61]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  62. [62]

    Orpo: Monolithic preference optimization without reference model

    Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170–11189, 2024

  63. [63]

    SimPO: Simple Preference Optimization with a Reference-Free Reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198– 124235, 2024

  64. [64]

    KTO: Model Alignment as Prospect Theoretic Optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization.arXiv preprint arXiv:2402.01306, 2024. 13

  65. [65]

    Iterative Reasoning Preference Optimization

    Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization.Advances in Neural Information Processing Systems, 37:116617–116637, 2024

  66. [66]

    Direct preference optimization with an offset

    Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. In Findings of the Association for Computational Linguistics ACL 2024, pages 9954–9972, 2024

  67. [67]

    Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. InForty-first International Conference on Machine Learning, 2024

  68. [68]

    Lipo: Listwise preference optimization through learning-to-rank

    Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, et al. Lipo: Listwise preference optimization through learning-to-rank. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies...

  69. [69]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  70. [70]

    CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

    Zhihang Lin, Mingbao Lin, Yuan Xie, and Rongrong Ji. Cppo: Accelerating the training of group relative policy optimization-based reasoning models.arXiv preprint arXiv:2503.22342, 2025

  71. [71]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  72. [72]

    VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

    Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, et al. Vapo: Efficient and reliable reinforcement learning for advanced reasoning tasks.arXiv preprint arXiv:2504.05118, 2025

  73. [73]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  74. [74]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention.arXiv preprint arXiv:2506.13585, 2025

  75. [75]

    Geometric-Mean Policy Optimization

    Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, et al. Geometric-mean policy optimization.arXiv preprint arXiv:2507.20673, 2025

  76. [76]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, et al. A survey of reinforcement learning for large reasoning models.arXiv preprint arXiv:2509.08827, 2025

  77. [77]

    Aligning Text-to-Image Models using Human Feedback

    Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback.arXiv preprint arXiv:2302.12192, 2023

  78. [78]

    Directly Fine-Tuning Diffusion Models on Differentiable Rewards

    Kevin Clark, Paul Vicol, Kevin Swersky, and David J Fleet. Directly fine-tuning diffusion models on differentiable rewards.arXiv preprint arXiv:2309.17400, 2023

  79. [79]

    End-to-end diffusion latent optimization improves classifier guidance

    Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. End-to-end diffusion latent optimization improves classifier guidance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7280–7290, 2023

  80. [80]

    Imagereward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023. 14

Showing first 80 references.