Recognition: 2 theorem links
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Pith reviewed 2026-05-14 21:57 UTC · model grok-4.3
The pith
TMPO replaces scalar reward maximization with matching K sampled trajectories to a reward-induced Boltzmann distribution, preserving coverage and boosting diversity in diffusion alignment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TMPO replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, the Softmax Trajectory Balance (Softmax-TB) objective aligns the policy probabilities of K trajectories to the Boltzmann distribution induced by the reward function. The paper proves this objective inherits the mode-covering property of forward KL divergence, thereby preserving probability mass over all acceptable trajectories while still optimizing reward. Dynamic Stochastic Tree Sampling further shares denoising prefixes and branches trajectories at scheduled steps to reduce training cost on large flow-matching models.
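As a rough illustration of what such an objective can look like in code, here is a minimal PyTorch sketch that matches the policy's normalized probabilities over K sampled trajectories to the reward-induced Boltzmann target through a forward-KL-style cross-entropy; the function name, the β temperature argument, and the exact loss form are assumptions, not the paper's equation.

```python
import torch
import torch.nn.functional as F

def softmax_tb_loss(logp_traj: torch.Tensor, rewards: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Sketch of a Softmax Trajectory Balance style loss over K sampled trajectories.

    logp_traj: (K,) policy log-probabilities log P_theta(tau_i) of the K trajectories
    rewards:   (K,) scalar rewards R_i for the same trajectories
    """
    # Reward-induced Boltzmann target over the K trajectories (treated as a constant).
    target = F.softmax(beta * rewards, dim=-1).detach()
    # Policy distribution restricted to the same K trajectories.
    log_policy = F.log_softmax(logp_traj, dim=-1)
    # Forward-KL / cross-entropy style matching: gradients push the policy to cover
    # every trajectory the target assigns mass to, not only the argmax.
    return -(target * log_policy).sum(dim=-1)
```

As β grows the target concentrates on the single highest-reward trajectory (mode-seeking), while a small β flattens it toward uniform, which is one concrete way to read the temperature-schedule point raised later in this review.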
What carries the argument
Softmax Trajectory Balance (Softmax-TB) objective, which matches the probabilities of K sampled trajectories to a reward-induced Boltzmann distribution and thereby carries the mode-covering property of forward KL divergence into the alignment process.
If this is right
- Generative diversity improves by 9.1% over state-of-the-art alignment methods.
- Downstream performance remains competitive on human preference, compositional generation, and text rendering tasks.
- An optimal empirical trade-off between reward and diversity is attained across the tested settings.
- Multi-trajectory training time on large flow-matching models is reduced by sharing denoising prefixes.
Where Pith is reading between the lines
- If the trajectory-matching property holds beyond diffusion, the same objective could be tested on autoregressive generators, where mode collapse is also observed.
- Dynamic tree sampling may extend to any long-horizon RL setting where multiple rollouts share early computation.
- The explicit use of a Boltzmann target suggests that future work could vary the temperature schedule to control the diversity-reward frontier more finely.
Load-bearing premise
That matching K trajectories to the reward-induced Boltzmann distribution via Softmax-TB will reliably preserve coverage over acceptable trajectories in practice without introducing instabilities or requiring extensive hyperparameter tuning that offsets the gains.
What would settle it
A head-to-head experiment on a standard alignment benchmark in which TMPO produces lower diversity scores or visibly collapses to fewer modes than the prior best method while keeping reward fixed.
Original abstract
Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most of them still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Trajectory Matching Policy Optimization (TMPO) to address reward hacking and mode collapse in RL alignment of diffusion models. It replaces scalar reward maximization with a Softmax Trajectory Balance (Softmax-TB) objective that matches the policy probabilities of K sampled trajectories to a reward-induced Boltzmann distribution, with a claimed proof that this inherits the mode-covering property of forward KL divergence. Dynamic Stochastic Tree Sampling is introduced to share denoising prefixes and reduce computation for large flow-matching models. Experiments across human preference, compositional generation, and text rendering tasks report a 9.1% improvement in generative diversity over state-of-the-art methods while achieving competitive reward and efficiency metrics.
Significance. If the theoretical guarantee holds under the implemented sampling procedure and the diversity gains prove robust across tasks, TMPO could meaningfully advance RL-based diffusion alignment by providing a distribution-matching alternative to mode-seeking objectives, helping mitigate visual collapse and unreliable reward amplification without sacrificing downstream performance.
major comments (2)
- [§3] The proof that the Softmax-TB objective inherits forward KL mode-covering assumes independent sampling of the K trajectories from the current policy, so that the empirical distribution converges to the Boltzmann target. The Dynamic Stochastic Tree Sampling procedure in §4 shares denoising prefixes and branches at scheduled steps, inducing correlations among the K trajectories that can bias the empirical distribution away from the target, especially in high-dimensional spaces where early prefixes dominate; this approximation error is not bounded or analyzed.
- [Experiments] The reported 9.1% diversity improvement and optimal reward-diversity trade-off are stated without reference to the precise diversity metric (e.g., which coverage or entropy measure), the full list of baselines, the number of independent runs, or error bars; without these, the claim that TMPO attains the optimal trade-off cannot be verified from the presented data.
minor comments (2)
- The definition of the Softmax-TB loss could be presented with an explicit equation number immediately after its introduction to aid readability.
- Figure captions for the efficiency and diversity plots should include the exact hyperparameter settings (K, branching schedule) used in each curve.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below with clarifications and commit to revisions that strengthen the theoretical grounding and experimental transparency of the manuscript.
Point-by-point responses
-
Referee: [§3] The proof that the Softmax-TB objective inherits forward KL mode-covering assumes independent sampling of the K trajectories from the current policy, so that the empirical distribution converges to the Boltzmann target. The Dynamic Stochastic Tree Sampling procedure in §4 shares denoising prefixes and branches at scheduled steps, inducing correlations among the K trajectories that can bias the empirical distribution away from the target, especially in high-dimensional spaces where early prefixes dominate; this approximation error is not bounded or analyzed.
Authors: We acknowledge that the proof in §3 is stated under the assumption of independent sampling of the K trajectories. The Dynamic Stochastic Tree Sampling in §4 deliberately introduces prefix sharing to reduce redundant computation on large flow-matching models, which does induce correlations. While the dynamic branching schedule is designed to preserve sufficient diversity in the sampled set, we agree that the manuscript lacks a formal bound on the resulting approximation error to the target Boltzmann distribution. In the revised manuscript we will add a dedicated paragraph in §3 (or a new subsection) that explicitly discusses this approximation, provides a qualitative analysis of the bias under the chosen branching schedule, and includes additional empirical diagnostics showing that the mode-covering behavior remains intact in practice. revision: yes
-
Referee: Experiments section: The reported 9.1% diversity improvement and optimal reward-diversity trade-off are stated without reference to the precise diversity metric (e.g., which coverage or entropy measure), the full list of baselines, the number of independent runs, or error bars; without these, the claim that TMPO attains the optimal trade-off cannot be verified from the presented data.
Authors: We thank the referee for noting this presentational gap. The 9.1% figure is computed with the coverage entropy metric defined in §5.1 of the manuscript; the baselines comprise the full set of methods listed in Table 1 (PPO, DPO, RAFT, and the diffusion-specific variants). All quantitative results were obtained from five independent runs with different random seeds, and standard-deviation error bars appear in the appendix figures. In the revised version we will (i) restate the exact diversity metric in the main Experiments section, (ii) enumerate every baseline explicitly in the text, and (iii) move the error bars into the primary figures so that the reward-diversity trade-off claim can be directly verified from the presented data. revision: yes
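The exact coverage entropy metric lives in §5.1 of the paper and is not reproduced here. Purely as an illustration of what a feature-space diversity score can look like, a mean pairwise cosine-distance measure over image embeddings (e.g. DINOv2 features, reference [41] below) can be sketched as follows; all names here are illustrative, not the paper's metric.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_diversity(features: torch.Tensor) -> torch.Tensor:
    """Illustrative diversity score (not the paper's §5.1 coverage entropy):
    mean pairwise cosine distance between embeddings of N generated images,
    e.g. DINOv2 features of shape (N, D). Higher means more diverse."""
    f = F.normalize(features, dim=-1)
    sim = f @ f.T                                                   # (N, N) cosine similarities
    mask = ~torch.eye(f.shape[0], dtype=torch.bool, device=f.device)  # drop self-similarity
    return (1.0 - sim[mask]).mean()
```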
Circularity Check
No significant circularity: Softmax-TB objective and mode-covering proof are independent derivations
Full rationale
The paper introduces TMPO via a new Softmax-TB objective that matches K policy trajectories to a reward-induced Boltzmann distribution, then claims (with an explicit proof) that this inherits forward-KL mode-covering. This is a first-principles construction, not a renaming or self-referential fit. Dynamic Stochastic Tree Sampling is presented purely as an efficiency implementation detail that shares prefixes but does not alter the core objective or proof assumptions in a way that reduces the result to its inputs. No load-bearing self-citation, fitted-parameter-renamed-as-prediction, or ansatz-smuggled-via-citation patterns appear in the derivation chain. The central claim therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- K (number of trajectories sampled and matched per update)
axioms (1)
- Domain assumption: the Softmax-TB objective inherits the mode-covering property of forward KL divergence.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J uniqueness) · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero · unclear
UNCLEAR: the relation between the paper passage and the cited Recognition theorem is ambiguous.
$$A_i = \log\frac{\exp(\beta R_i)}{\sum_{j=1}^{K}\exp(\beta R_j)} \;-\; \log\frac{P_\theta(\tau_i)}{\sum_{j=1}^{K} P_\theta(\tau_j)}$$
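Read as the per-trajectory advantage, this compares the reward-induced Boltzmann weight of trajectory i with its normalized policy probability. For instance, with K = 3, β = 1, rewards (0, 0, ln 2), and a policy that currently weights the three sampled trajectories uniformly, the targets are (1/4, 1/4, 1/2) and the advantages are (log 3/4, log 3/4, log 3/2), so only the under-weighted high-reward trajectory receives a positive advantage.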
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022
work page 2022
-
[2]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InICML, 2024
work page 2024
-
[3]
FLUX
Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024
work page 2024
-
[4]
Training diffusion models with reinforcement learning
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. InICLR, 2024
work page 2024
-
[5]
DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mo- hammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. InNeurIPS, 2023
work page 2023
-
[6]
Flow-GRPO: Training flow matching models via online RL
Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. InNeurIPS, 2025
work page 2025
-
[7]
DanceGRPO: Unleashing GRPO on Visual Generation
Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation.arXiv preprint arXiv:2505.07818, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021
work page 2021
-
[9]
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions.arXiv preprint arXiv:2303.08797, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
ImageReward: Learning and evaluating human preferences for text-to-image generation
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. InNeurIPS, 2023
work page 2023
-
[11]
Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis
Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human Preference Score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[12]
TextDiffuser: Diffusion models as text painters
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, and Furu Wei. TextDiffuser: Diffusion models as text painters. InNeurIPS, 2023
work page 2023
-
[13]
The effects of reward misspecification: Mapping and mitigating misaligned models
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. InICLR, 2022
work page 2022
-
[14]
Defining and characterizing reward gaming
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. InNeurIPS, 2022
work page 2022
-
[15]
GARDO: Reinforcing diffusion models without reward hacking
Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, and Ling Pan. GARDO: Reinforcing diffusion models without reward hacking.arXiv preprint arXiv:2512.24138, 2025
-
[16]
Understanding reward hacking in text-to-image reinforcement learning
Yunqi Hong, Kuei-Chun Kao, Hengguang Zhou, and Cho-Jui Hsieh. Understanding reward hacking in text-to-image reinforcement learning. arXiv preprint arXiv:2601.03468, 2026
-
[17]
KL-regularized reinforcement learning is designed to mode collapse
Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, and Rajesh Ranganath. KL-regularized reinforcement learning is designed to mode collapse. arXiv preprint arXiv:2510.20817, 2025
-
[18]
RewardDance: Reward scaling in visual generation
Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025
-
[19]
GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping
Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, et al. GRPO-Guard: Mitigating implicit over-optimization in flow matching via regulated clipping. arXiv preprint arXiv:2510.22319, 2025
-
[20]
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Yibin Wang, Zhimin Li, Yuhang Zang, Yujie Zhou, Jiazi Bu, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Pref-GRPO: Pairwise preference reward-based GRPO for stable text-to-image reinforcement learning.arXiv preprint arXiv:2508.20751, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Flow network based generative models for non-iterative diverse candidate generation
Emmanuel Bengio, Moksh Jain, Maksym Korablyov, Doina Precup, and Yoshua Bengio. Flow network based generative models for non-iterative diverse candidate generation. InNeurIPS, 2021
work page 2021
-
[22]
Trajectory balance: Improved credit assignment in GFlowNets
Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, and Yoshua Bengio. Trajectory balance: Improved credit assignment in GFlowNets. InNeurIPS, 2022
work page 2022
-
[23]
Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization
Dinghuai Zhang, Ricky T. Q. Chen, Chenghao Liu, Aaron Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. In ICLR, 2024
work page 2024
-
[24]
Efficient diversity-preserving diffusion alignment via gradient-informed GFlowNets
Zhen Liu, Tim Z. Xiao, Weiyang Liu, Yoshua Bengio, and Dinghuai Zhang. Efficient diversity-preserving diffusion alignment via gradient-informed GFlowNets. In ICLR, 2025
work page 2025
-
[25]
FlowRL: Matching reward distributions for LLM reasoning
Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, et al. FlowRL: Matching reward distributions for LLM reasoning. InICLR, 2026
work page 2026
-
[26]
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE
Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, and Liefeng Bo. MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE.arXiv preprint arXiv:2507.21802, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Dynamic-TreeRPO: Breaking the independent trajectory bottleneck with structured sampling
Xiaolong Fu, Lichen Ma, Zipeng Guo, Gaojing Zhou, Chongxiao Wang, ShiPing Dong, Shizhe Zhou, Ximan Liu, Jingling Fu, Tan Lit Sin, et al. Dynamic-TreeRPO: Breaking the independent trajectory bottleneck with structured sampling. arXiv preprint arXiv:2509.23352, 2025
-
[28]
DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022
-
[29]
Improving GFlowNets for text-to-image diffusion alignment
Dinghuai Zhang, Yizhe Zhang, Jiatao Gu, Ruixiang Zhang, Joshua M. Susskind, Navdeep Jaitly, and Shuangfei Zhai. Improving GFlowNets for text-to-image diffusion alignment. Transactions on Machine Learning Research, 2025
work page 2025
-
[30]
TreeGRPO: Tree-advantage GRPO for online RL post-training of diffusion models
Zheng Ding and Weirui Ye. TreeGRPO: Tree-advantage GRPO for online RL post-training of diffusion models. InICLR, 2026
work page 2026
-
[31]
Pick-a-Pic: An open dataset of user preferences for text-to-image generation
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. InNeurIPS, 2023
work page 2023
-
[32]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay N. Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020
work page 2020
-
[33]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InICLR, 2021
work page 2021
-
[34]
Flow matching for generative modeling
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023
work page 2023
-
[35]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023
work page 2023
-
[36]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[37]
Scaling laws for reward model overoptimization
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. InICML, 2023
work page 2023
-
[38]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022
work page 2022
-
[39]
Learning to summarize from human feedback
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. In NeurIPS, 2020
work page 2020
-
[40]
LoRA: Low-rank adaptation of large language models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022
work page 2022
-
[41]
DINOv2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024
work page 2024
-
[42]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023
work page 2023
-
[43]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[44]
GenEval: An object-focused framework for evaluating text-to-image alignment
Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS Datasets and Benchmarks Track, 2023
work page 2023
-
[45]
Auto-encoding variational Bayes
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014
work page 2014
-
[46]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021
work page 2021
-
[47]
Stable Diffusion 3.5 Medium
Stability AI. Stable Diffusion 3.5 Medium. https://huggingface.co/stabilityai/stable-diffusion-3.5-medium, 2024
work page 2024
-
[48]
Coefficients-preserving sampling for reinforcement learning with flow matching
Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025
-
[49]
Confronting reward model overoptimization with constrained RLHF
Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, and Stephen McAleer. Confronting reward model overoptimization with constrained RLHF. In ICLR, 2024
work page 2024
-
[50]
Reward model ensembles help mitigate overoptimization
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023
Training-loop notes (fragments from the paper's appendix)
- Tree rollout: on each GPU, the stochastic tree sampler generates K = 27 terminal trajectories for its prompt via T = 3 branch points with B = 3 children each, yielding 8 × 27 = 216 images per iteration. Between branch points, deterministic ODE integration advances the latent. When the first branch point falls at step 1, the B child trajectories are initialized fro...
- Reward evaluation: each terminal image x_0^(i) is decoded and scored by the reward model(s); rewards are z-score normalized in the joint setting.
- Advantage computation: the detached Softmax-TB advantage A_i^⊥ is computed per Eq. (7).
- IS ratio recomputation: the current policy π_θ recomputes per-step log-probabilities at each branch point, yielding the bias-corrected IS ratio ρ̂_i per Eq. (32).
- Policy update: the PPO-style clipped loss L_TMPO (Eq. (6)) plus a KL reference penalty is minimized via AdamW. The old policy π_old is the frozen policy from the latest rollout; its log-probabilities are stored before the gradient update and remain fixed throughout the IS updates. EMA-smoothed parameters (decay 0.9, update interval 8) are maintained sepa...
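To make the shared-prefix idea concrete, here is a minimal sketch of the tree rollout described above, assuming a hypothetical denoise_step(x, t) callable and treating the branch schedule and branching factor as free arguments; it illustrates the sharing pattern only, not the paper's implementation.

```python
def tree_rollout(denoise_step, x_init, num_steps, branch_steps, branching=3):
    """Sketch of prefix-sharing stochastic tree sampling: leaves share every
    denoising step computed before a branch point and split into `branching`
    children at each scheduled step, giving branching ** len(branch_steps)
    terminal trajectories while early steps are computed only once per prefix."""
    leaves = [x_init]
    for t in range(num_steps):
        if t in branch_steps:
            # Replicate each leaf before stepping; the subsequent (stochastic)
            # step is what differentiates the children.
            leaves = [x.clone() for x in leaves for _ in range(branching)]
        # Advance every current leaf by one denoising step; between branch
        # points this is deterministic ODE integration per the notes above.
        leaves = [denoise_step(x, t) for x in leaves]
    return leaves
```

With 3 branch points and 3 children per branch this yields the 27 terminal trajectories per prompt quoted in the notes above.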