When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Pith reviewed 2026-05-13 05:45 UTC · model grok-4.3
The pith
In flow-based RLHF for text-to-image models, policy entropy remains constant while perceptual diversity collapses, requiring perceptual entropy constraints to restore balance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that policy entropy in flow models does not change during RLHF because of the fixed noise schedule, even as the generated images lose perceptual diversity through mode-seeking optimization. Perceptual entropy is proposed as an alternative that tracks diversity in perceptual spaces while satisfying entropy axioms. This leads to two methods, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, which are shown to enhance both quality and diversity metrics in practice.
What carries the argument
Perceptual entropy, computed as the entropy of the policy distribution projected or evaluated in a perceptual embedding space.
If this is right
- Perceptual Entropy Constraint yields the highest quality-diversity score of 0.734 compared to the baseline of 0.366.
- Combining it with other constraints achieves an average diversity of 0.989 against the baseline's 0.047.
- The gains hold consistently for two different base models, both neural and rule-based reward models, and three perceptual spaces.
- Standard policy entropy regularization cannot prevent the model from converging to narrow high-reward regions in perceptual space.
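The mismatch in the last point can be seen in a one-dimensional toy model (an illustrative sketch, not the paper's actual algorithm): the per-step policy is Gaussian with a variance fixed by the pre-defined noise schedule, so its entropy is independent of the learned vector field, while a mode-seeking field still collapses the spread of the final samples. The names `rollout` and the specific fields are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, sigma, steps = 0.1, 0.2, 10   # fixed, pre-defined noise schedule

def rollout(v, n=5000):
    """Simulate x_{t+1} = x_t + v(x_t)*dt + sigma*sqrt(dt)*eps."""
    x = rng.standard_normal(n)                    # base noise ~ N(0, 1)
    for _ in range(steps):
        x = x + v(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    return x

# Per-step policy entropy is the entropy of N(mean, sigma^2 * dt):
# it depends only on the fixed schedule, never on the learned field.
step_entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2 * dt)

before = rollout(lambda x: np.zeros_like(x))      # pre-RLHF field
after = rollout(lambda x: -5.0 * (x - 2.0))       # mode-seeking field
print(f"per-step policy entropy (both fields): {step_entropy:.3f}")
print(f"sample std before: {before.std():.3f}, after: {after.std():.3f}")
```

The printed per-step entropy is identical for both fields, yet the sample standard deviation shrinks by more than an order of magnitude under the mode-seeking field, which is the structural failure the paper attributes to standard entropy regularization.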
Where Pith is reading between the lines
- Similar constant-entropy behavior may limit standard regularization in other flow or diffusion generative models during preference alignment.
- Extending perceptual spaces to include semantic or stylistic features could further tailor diversity preservation to specific applications.
Load-bearing premise
The chosen perceptual spaces will continue to reflect meaningful diversity that aligns with human preferences across different tasks and models without causing unintended reductions in other qualities.
What would settle it
Train the model with the proposed perceptual constraints and measure diversity with independent perceptual metrics or human studies; if diversity shows no corresponding increase, the constraints fail to achieve the intended preservation.
Original abstract
RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that standard policy entropy regularization fails to preserve diversity in flow-matching text-to-image RLHF because policy entropy remains constant (due to the fixed pre-defined noise schedule) while perceptual diversity collapses (due to mode-seeking policy gradients). It introduces perceptual entropy as a diversity measure in perceptual spaces that preserves standard entropy properties, and proposes two regularization strategies—Perceptual Entropy Constraint (PEC) and Perceptual Constraints on Generation Space—to improve the quality-diversity trade-off, with experiments across two base models, neural/rule-based rewards, and three perceptual spaces reporting gains such as 0.734 vs. baseline 0.366 overall score and 0.989 vs. 0.047 diversity average.
Significance. If the claimed mismatch between constant policy entropy and collapsing perceptual diversity is rigorously established and the perceptual entropy constraints generalize without side effects, the work would be significant for RLHF in flow-based generative models, offering a targeted fix for diversity collapse that standard entropy regularization cannot address. The multi-setting empirical evaluation and public project page provide some support for practical utility and reproducibility.
major comments (2)
- [Abstract / theoretical analysis] Abstract and theoretical explanation: The central claim that policy entropy remains constant due to the fixed noise schedule does not engage with the change-of-variables formula for differential entropy under a learned flow. In flow-matching models the output entropy equals base entropy plus the integral of the divergence of the time-dependent vector field along trajectories; RLHF updates to the vector field should therefore alter output entropy unless the divergence is provably invariant, which is not shown.
- [Method] Method section on perceptual entropy: The definition of perceptual entropy is stated to 'capture diversity in a perceptual space and maintain the property of standard entropy,' yet no explicit functional form, proof of non-negativity or concavity, or comparison to the standard differential entropy appears. Without these, it is unclear whether the quantity is independent of the reward model or reduces to a fitted heuristic, weakening the justification for the proposed constraints.
minor comments (2)
- [Experiments] Experiments: Reported aggregate scores (0.734 vs. 0.366; 0.989 vs. 0.047) are given without error bars, number of random seeds, or statistical tests, making it impossible to judge whether the quality-diversity improvements are reliable.
- [Abstract] Abstract: The two proposed strategies are named but not briefly characterized, which would improve immediate readability for readers scanning the contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below with clarifications on our theoretical analysis and definitions. Where the comments identify gaps in rigor or explicitness, we indicate that revisions will be made to the manuscript.
Point-by-point responses
Referee: [Abstract / theoretical analysis] Abstract and theoretical explanation: The central claim that policy entropy remains constant due to the fixed noise schedule does not engage with the change-of-variables formula for differential entropy under a learned flow. In flow-matching models the output entropy equals base entropy plus the integral of the divergence of the time-dependent vector field along trajectories; RLHF updates to the vector field should therefore alter output entropy unless the divergence is provably invariant, which is not shown.
Authors: We appreciate the referee's observation regarding the change-of-variables formula. In our analysis, 'policy entropy' specifically denotes the entropy of the fixed base noise distribution at the input to the flow, which is invariant because the noise schedule is pre-defined and unchanged by RLHF updates to the vector field. The perceptual diversity collapse, by contrast, arises from the mode-seeking behavior of the policy gradient in the output space. We acknowledge that a full accounting of output differential entropy must incorporate the divergence integral. In the revised manuscript we will expand the theoretical section to explicitly reference the change-of-variables formula, derive the relationship between input entropy and perceptual-space diversity under our RLHF objective, and discuss why the divergence term does not counteract the observed mode collapse in practice. This will strengthen the justification for why standard policy-entropy regularization is insufficient. revision: yes
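The change-of-variables identity invoked above can be stated explicitly; a minimal version, with v_θ the learned vector field, p_0 the base noise distribution, and p_1 the output distribution:

```latex
% Instantaneous change of variables along the flow x'(t) = v_\theta(x(t), t):
\frac{d}{dt}\log p_t(x(t)) = -\nabla \cdot v_\theta(x(t), t)
% Integrating from the base distribution p_0 to the output p_1:
H(p_1) = H(p_0) + \mathbb{E}_{x(\cdot)}\!\left[\int_0^1 \nabla \cdot v_\theta(x(t), t)\, dt\right]
```

This makes the referee's point concrete: the input entropy H(p_0) is fixed by the schedule, but the output entropy H(p_1) also carries the divergence integral, which RLHF updates to v_θ can change.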
Referee: [Method] Method section on perceptual entropy: The definition of perceptual entropy is stated to 'capture diversity in a perceptual space and maintain the property of standard entropy,' yet no explicit functional form, proof of non-negativity or concavity, or comparison to the standard differential entropy appears. Without these, it is unclear whether the quantity is independent of the reward model or reduces to a fitted heuristic, weakening the justification for the proposed constraints.
Authors: We thank the referee for noting the need for greater formality. Perceptual entropy is defined as the differential entropy of the push-forward distribution of generated samples under a fixed perceptual embedding map (e.g., CLIP or VGG features). Its explicit form is H_perp = -E_{x~p_gen}[log p_perp(f(x))], where f is the embedding function and p_perp is the density estimated in the perceptual space; because the embedding is fixed and independent of the reward model, the quantity inherits concavity (as a functional of the distribution) from standard differential entropy on the embedded manifold, and non-negativity holds for its discretized estimators. In the revised method section we will state this functional form explicitly, include a short derivation confirming the inherited properties, and add a comparison paragraph showing that it reduces to ordinary differential entropy when the perceptual map is the identity. These additions will clarify that the measure is not a fitted heuristic and directly supports the proposed PEC and generation-space constraints. revision: yes
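The stated functional form can be estimated numerically; a minimal sketch, assuming a fixed embedding map f (here a stand-in tanh, not the paper's CLIP/VGG encoders) and a leave-one-out Gaussian kernel density estimate for p_perp. The names `perceptual_entropy` and `bw` are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def perceptual_entropy(samples, f, bw=0.2):
    """Monte-Carlo estimate of H_perp = -E[log p_perp(f(x))], where
    p_perp is a leave-one-out Gaussian KDE in the embedding space."""
    z = np.asarray([f(x) for x in samples])           # (n, d) embeddings
    n, d = z.shape
    # pairwise squared distances between embedded samples
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    log_norm = -0.5 * d * np.log(2 * np.pi * bw**2)
    k = np.exp(-0.5 * sq / bw**2)
    np.fill_diagonal(k, 0.0)                          # leave-one-out
    log_p = np.log(k.sum(1) / (n - 1)) + log_norm
    return -log_p.mean()

f = lambda x: np.tanh(x)                              # stand-in perceptual map
diverse = rng.standard_normal((500, 2))
collapsed = 0.05 * rng.standard_normal((500, 2)) + 1.0
print(perceptual_entropy(diverse, f))                 # higher: spread embeddings
print(perceptual_entropy(collapsed, f))               # lower: collapsed embeddings
```

With the identity as the embedding map, the same estimator reduces to an ordinary differential-entropy estimate, matching the reduction the rebuttal promises to show.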
Circularity Check
No significant circularity; central claims rest on model structure and new perceptual measure rather than self-referential fits or citations
full rationale
The paper's derivation of constant policy entropy from the fixed noise schedule in flow-matching models is presented as a direct consequence of the generative process definition (base distribution plus learned transport), not a renaming or fit of the target diversity quantity. Perceptual entropy is introduced as an explicit new functional on chosen perceptual embeddings that is stated to preserve entropy-like properties; this is a definitional extension rather than a reduction of the output to the input by construction. No load-bearing self-citation chains, uniqueness theorems from the same authors, or fitted parameters relabeled as predictions appear in the abstract or described chain. Experiments across models and spaces supply independent empirical content. The derivation chain therefore stands on its own and is tested against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Policy entropy in flow models remains constant due to the fixed, pre-defined noise schedule
- domain assumption: Diversity collapse is driven by the mode-seeking nature of policy gradients
invented entities (1)
- perceptual entropy: no independent evidence