MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Yiming Cheng, Miles Yang, Zhao Zhong, Liefeng Bo


Pith reviewed 2026-05-13 13:25 UTC · model grok-4.3

classification cs.AI · cs.CV

keywords MixGRPO · GRPO · flow matching · human preference alignment · ODE-SDE mixing · sliding window · training efficiency · image generation


The pith

MixGRPO improves GRPO efficiency for flow matching image models by restricting SDE sampling and optimization to a sliding window while using ODE sampling outside it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MixGRPO to address the inefficiency of GRPO in aligning flow matching models with human preferences for image generation. Previous approaches like FlowGRPO and DanceGRPO require sampling and optimizing over all denoising steps dictated by the Markov Decision Process, which creates high overhead. MixGRPO mixes strategies by applying SDE sampling and GRPO-guided optimization only inside a sliding window of time-steps and deterministic ODE sampling elsewhere. This confines randomness and gradient updates to fewer steps, reduces optimization overhead, accelerates convergence, and supports higher-order solvers outside the window for even faster sampling in the MixGRPO-Flash variant.

Core claim

By integrating SDE sampling and GRPO-guided optimization within a sliding window and ODE sampling outside it, MixGRPO streamlines the MDP optimization in flow matching models. This confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead and allowing for more focused gradient updates to accelerate convergence. Time-steps beyond the sliding window support higher-order solvers for faster sampling, yielding the MixGRPO-Flash variant that further improves training efficiency while achieving comparable performance.

What carries the argument

The sliding window mechanism that applies SDE sampling and GRPO optimization only within selected denoising steps and ODE sampling outside the window to confine randomness and focus updates.
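The step assignment behind this mechanism can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `sampling_plan` and `optimized_steps`, the string labels, and the half-open window convention `[l, l + w)` are assumptions made here for clarity, with `T` the number of denoising steps, `l` the window start, and `w` the window size.

```python
def sampling_plan(T, l, w):
    """Label each of T denoising steps as 'sde' (inside the window) or 'ode'.

    Assumption: the window covers step indices l, l + 1, ..., l + w - 1.
    """
    if l < 0 or w < 0 or l + w > T:
        raise ValueError("window [l, l + w) must lie inside [0, T)")
    return ["sde" if l <= i < l + w else "ode" for i in range(T)]


def optimized_steps(plan):
    """Return the step indices that would receive GRPO updates (the SDE steps)."""
    return [i for i, mode in enumerate(plan) if mode == "sde"]


# Hypothetical settings: 10 steps, window of size 3 starting at step 2.
plan = sampling_plan(T=10, l=2, w=3)
print(plan)
print(optimized_steps(plan))

# Sliding the window by one step moves both the randomness and the updates.
print(optimized_steps(sampling_plan(T=10, l=3, w=3)))
```

Only the `w` steps labelled `sde` carry sampling noise and gradient updates in this sketch; the remaining `T - w` steps are deterministic, which is what leaves room for a faster solver outside the window.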

If this is right

Higher-order ODE solvers can be applied outside the window for faster sampling without affecting optimization quality.
Training time drops by nearly 50 percent compared to DanceGRPO while delivering stronger human preference alignment across multiple dimensions.
The MixGRPO-Flash variant achieves comparable results with 71 percent lower training time.
Focused gradient updates within the window accelerate convergence of the alignment process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The mixed ODE-SDE window strategy may transfer to other reinforcement-learning alignment methods that currently optimize over full denoising trajectories.
Dynamically resizing the window during training could further balance speed and final alignment quality.
The same restriction of stochastic steps might reduce memory or compute costs in video or 3D generative models that use many denoising iterations.
Testing the approach on non-image flow models would reveal whether the efficiency gain is specific to image denoising schedules.

Load-bearing premise

That restricting SDE sampling and GRPO optimization to a sliding window preserves full MDP optimization quality and does not introduce bias or slower convergence outside the window.

What would settle it

A controlled experiment that measures alignment scores and convergence speed when the sliding window is progressively shrunk versus kept at full width on identical base models and datasets.

Original abstract

Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO and DanceGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose MixGRPO, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for faster sampling. So we present a faster variant, termed MixGRPO-Flash, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities beyond the standard MDP formulation of GRPO and the known ODE/SDE properties of flow matching; all other details remain implicit.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.LedgerForcing conservation_from_balance echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the entire denoising process to be framed as a Markov Decision Process (MDP) in a stochastic environment, where GRPO is then applied to optimize the complete state-action sequence

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
cs.CV 2026-04 unverdicted novelty 8.0

OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL
cs.CV 2026-05 conditional novelty 7.0

CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight...
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
cs.CV 2026-05 unverdicted novelty 7.0

OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
cs.LG 2026-05 unverdicted novelty 7.0

TMPO uses Softmax Trajectory Balance to match policy probabilities over multiple trajectories to a Boltzmann reward distribution, improving diversity by 9.1% in diffusion alignment tasks.
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
cs.LG 2026-05 unverdicted novelty 7.0

TMPO replaces scalar reward maximization with trajectory-level matching to a Boltzmann distribution via Softmax-TB, improving generative diversity by 9.1% while keeping competitive reward performance.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 conditional novelty 7.0

Flow-OPD applies on-policy distillation to flow matching models via specialized teachers, cold-start initialization, and manifold anchor regularization, lifting GenEval from 63 to 92 and OCR from 59 to 94 on Stable Di...
Improved techniques for fine-tuning flow models via adjoint matching: a deterministic control pipeline
cs.AI 2026-05 unverdicted novelty 7.0

A new adjoint matching framework formulates flow model alignment as optimal control, enabling direct regression training and terminal-trajectory truncation for efficiency gains on models like SiT-XL and FLUX.
ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control
cs.LG 2026-04 unverdicted novelty 7.0

ParetoSlider conditions diffusion models on continuous preference weights to approximate the full Pareto front, providing dynamic control over multi-objective rewards at inference time.
Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
cs.CV 2026-04 unverdicted novelty 7.0

OTCA improves GRPO training for visual generation by estimating step importance in trajectories and adaptively weighting multiple reward objectives.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
eess.AS 2026-03 unverdicted novelty 7.0

YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
cs.CV 2026-05 unverdicted novelty 6.0

Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduce...
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow-matching text-to-image models, lifting GenEval from 63 to 92 and OCR accuracy from 59 to 94 while preserving fidelity.
Flow-OPD: On-Policy Distillation for Flow Matching Models
cs.CV 2026-05 unverdicted novelty 6.0

Flow-OPD applies on-policy distillation to flow matching models, achieving GenEval of 92 and OCR accuracy of 94 on Stable Diffusion 3.5 Medium while avoiding the seesaw effect of multi-reward optimization.
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data
cs.CV 2026-05 unverdicted novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world perfor...
POCA: Pareto-Optimal Curriculum Alignment for Visual Text Generation
cs.CV 2026-04 unverdicted novelty 6.0

POCA combines Pareto optimization with curriculum alignment to improve multi-reward reinforcement learning for visual text generation without relying on weighted sums.
V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think
cs.LG 2026-04 unverdicted novelty 6.0

V-GRPO makes ELBO surrogates stable and efficient for online RL alignment of denoising models, delivering SOTA text-to-image performance with 2-3x speedups over MixGRPO and DiffusionNFT.
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
cs.LG 2026-04 unverdicted novelty 6.0

Reward Score Matching unifies reward-based fine-tuning for flow and diffusion models by recasting alignment as score matching to a value-guided target.
Region-Constrained Group Relative Policy Optimization for Flow-Based Image Editing
cs.CV 2026-04 unverdicted novelty 6.0

RC-GRPO-Editing constrains GRPO exploration to editing regions via localized noise and attention rewards, improving instruction adherence and non-target preservation in flow-based image editing.
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation
cs.CV 2026-04 unverdicted novelty 6.0

MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
cs.LG 2026-04 unverdicted novelty 6.0

Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence with maintained or superior alignment performance on models like FLUX.1 and SD3.5.
CellFluxRL: Biologically-Constrained Virtual Cell Modeling via Reinforcement Learning
cs.LG 2026-03 unverdicted novelty 6.0

CellFluxRL post-trains the CellFlux generative model with reinforcement learning driven by biologically meaningful reward functions, yielding virtual cell images that better satisfy physical and biological constraints...
HunyuanVideo 1.5 Technical Report
cs.CV 2025-11 unverdicted novelty 6.0

HunyuanVideo 1.5 delivers state-of-the-art open-source text-to-video and image-to-video generation with an 8.3B parameter DiT model featuring SSTA attention, glyph-aware encoding, and progressive training.
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
cs.CV 2026-05 unverdicted novelty 5.0

Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
A Systematic Post-Train Framework for Video Generation
cs.CV 2026-04 unverdicted novelty 5.0

A post-training pipeline for video generation models combines SFT, RLHF with novel GRPO, prompt enhancement, and inference optimization to improve visual quality, temporal coherence, and instruction following.
Reward-Aware Trajectory Shaping for Few-step Visual Generation
cs.CV 2026-04 unverdicted novelty 5.0

RATS lets few-step visual generators surpass multi-step teachers by shaping trajectories with reward-based adaptive guidance instead of strict imitation.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 26 Pith papers · 16 internal anchors

[1]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.
[2]

Discount factor as a regularizer in reinforcement learning

Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In International Conference on Machine Learning, pages 269–278. PMLR, 2020.
[3]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
[4]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning.
[5]

Optimizing DDPM sampling with shortcut fine-tuning

Ying Fan and Kangwook Lee. Optimizing DDPM sampling with shortcut fine-tuning. arXiv preprint arXiv:2301.13362.
[6]

DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models

Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. Advances in Neural Information Processing Systems, 36:79858–79885, 2023.
[7]

Diffusion meets flow matching: Two sides of the same coin

Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024.
[8]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[9]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[10]

On the role of discount factor in offline reinforcement learning

Hao Hu, Yiqin Yang, Qianchuan Zhao, and Chongjie Zhang. On the role of discount factor in offline reinforcement learning. In International Conference on Machine Learning, pages 9072–9098. PMLR, 2022.
[11]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.
[12]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
[13]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:36652–36663, 2023.
[14]

Flux

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[15]

Aligning Text-to-Image Models using Human Feedback

Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
[16]

Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization

Zhanhao Liang, Yuhui Yuan, Shuyang Gu, Bohan Chen, Tiankai Hang, Mingxi Cheng, Ji Li, and Liang Zheng. Aesthetic post-training diffusion models from generic preferences with step-by-step preference optimization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13199–13208, 2025.
[17]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[18]

Flow-GRPO: Training Flow Matching Models via Online RL

Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025.
[19]

Improving Video Generation with Human Feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, et al. Improving video generation with human feedback. arXiv preprint arXiv:2501.13918, 2025.
[20]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
[21]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[22]

DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787.
[23]

DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022.
[24]

HPSv3: Towards wide-spectrum human preference score

Yuhang Ma, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score.
[25]

Reward hacking behavior can generalize across tasks

Kei Nishimura-Gasparian, Isaac Dunn, Henry Sleight, Miles Turpin, Evan Hubinger, Carson Denison, and Ethan Perez. Reward hacking behavior can generalize across tasks. In AI Alignment Forum, 2024.
[26]

Stochastic differential equations

Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations: an introduction with applications, pages 38–50. Springer, 2003.
[27]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[28]

Rethinking the discount factor in reinforcement learning: A decision theoretic approach

Silviu Pitis. Rethinking the discount factor in reinforcement learning: A decision theoretic approach. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7949–7956, 2019.
[29]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[30]

Fokker-Planck equation

Hannes Risken. Fokker-Planck equation. Springer, 1996.
[31]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[32]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
[33]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[34]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[35]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[36]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[37]

HunyuanVideo 1.5 technical report

Tencent Hunyuan Foundation Model Team. HunyuanVideo 1.5 technical report, 2025.
[38]

Delving into RL for image generation with CoT: A study on DPO vs. GRPO

Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, and Pheng-Ann Heng. Delving into RL for image generation with CoT: A study on DPO vs. GRPO. arXiv preprint arXiv:2505.17017, 2025.
[39]

Diffusion model alignment using direct preference optimization

Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2024.
[40]

Coefficients-preserving sampling for reinforcement learning with flow matching

Feng Wang and Zihao Yu. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952, 2025.
[41]

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, and Jiaqi Wang. Unified reward model for multimodal understanding and generation. arXiv preprint arXiv:2503.05236, 2025.
[42]

Reward hacking in reinforcement learning

Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, 2024.
[43]

RewardDance: Reward scaling in visual generation

Jie Wu, Yu Gao, Zilyu Ye, Ming Li, Liang Li, Hanzhong Guo, Jie Liu, Zeyue Xue, Xiaoxia Hou, Wei Liu, et al. RewardDance: Reward scaling in visual generation. arXiv preprint arXiv:2509.08826, 2025.
[44]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341.
[45]

ImageReward: Learning and evaluating human preferences for text-to-image generation

Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
[46]

DanceGRPO: Unleashing GRPO on Visual Generation

Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, et al. DanceGRPO: Unleashing GRPO on visual generation. arXiv preprint arXiv:2505.07818, 2025.
[47]

Discussion on flow-grpo issue 7

GitHub User yifan123. Discussion on flow-grpo issue 7. https://github.com/yifan123/flow_grpo/issues/#issuecomment-2870678379, 2025. Accessed: 2025-05-12.
[48]

One-step diffusion with distribution matching distillation

Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6613–6623, 2024.
[49]

Self-play fine-tuning of diffusion models for text-to-image generation

Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning of diffusion models for text-to-image generation. Advances in Neural Information Processing Systems, 37:73366–73398, 2024.
[50]

UniPC: A unified predictor-corrector framework for fast sampling of diffusion models

Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. Advances in Neural Information Processing Systems, 36:49842–49869, 2023.

[51]

DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics

Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. DPM-Solver-v3: Improved diffusion ODE solver with empirical model statistics. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[52]

DrivingGen: A comprehensive benchmark for generative video world models in autonomous driving

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, and Steven L Waslander. DrivingGen: A comprehensive benchmark for generative video world models in autonomous driving. arXiv preprint arXiv:2601.01528, 2026.

MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE Supplementary Material

[53]

Proof of Convergence for Mixed ODE-SDE Sampling

To prove that the mixed ODE-SDE sampling method in Eq. (7) has the same convergence as Eq. (2), which uses only ODE sampling, we follow [36] and approach the problem from the perspective of distribution evolution: the evolution of the distribution at each time step, e.g., ∂q_t(x)/∂t, must be the same. Let the interval for SD...
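For orientation, the standard textbook relation behind this kind of argument can be written out as follows. This is only a generic sketch in a forward-time convention, with an assumed velocity field v_t, noise scale σ_t, and marginal density q_t. These symbols and the exact drift are illustrative assumptions and need not coincide with the paper's Eq. (2) and Eq. (7), which are not reproduced here.

```latex
% Deterministic flow dx = v_t(x) dt: the density obeys the continuity equation.
\begin{align}
\frac{\partial q_t(x)}{\partial t}
  &= -\nabla \cdot \big( v_t(x)\, q_t(x) \big).
\end{align}
% Assumed SDE dx = [ v_t + (\sigma_t^2 / 2) \nabla \log q_t ] dt + \sigma_t dw:
% the Fokker-Planck equation, using q_t \nabla \log q_t = \nabla q_t, gives
\begin{align}
\frac{\partial q_t(x)}{\partial t}
  &= -\nabla \cdot \Big( \big( v_t + \tfrac{\sigma_t^2}{2} \nabla \log q_t \big) q_t \Big)
     + \tfrac{\sigma_t^2}{2} \Delta q_t \\
  &= -\nabla \cdot ( v_t\, q_t ) - \tfrac{\sigma_t^2}{2} \Delta q_t
     + \tfrac{\sigma_t^2}{2} \Delta q_t
   = -\nabla \cdot ( v_t\, q_t ).
\end{align}
```

Under these assumptions the score correction in the drift cancels the diffusion term, so both dynamics give the same ∂q_t(x)/∂t on any interval, whichever of the two is used there.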

[54]

DPM-Solver++ for Rectified Flow

For clarity and to avoid ambiguity between continuous time and discrete steps, we adopt the following notation in this section. We denote the discrete time steps by an index i ∈ {0, 1, ..., T−1}, where T is the total number of sampling steps. The continuous time corresponding to step i is denoted by t_i = i/T ∈ [0, 1). The DPM...
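The index-to-time mapping in this notation can be illustrated with a minimal Python sketch. The function name `time_grid` is hypothetical and introduced only for illustration; the only thing taken from the text is the rule t_i = i/T for i in {0, ..., T−1}.

```python
def time_grid(T):
    """Return the continuous times t_i = i / T for step indices i = 0..T-1.

    `time_grid` is an illustrative helper, not a function from the paper's code.
    """
    if T <= 0:
        raise ValueError("T must be a positive number of sampling steps")
    # Each discrete index i maps to a continuous time in [0, 1).
    return [i / T for i in range(T)]


grid = time_grid(4)  # times for step indices 0, 1, 2, 3
```

Keeping the integer index and the continuous time separate in code mirrors the stated purpose of the notation: loops run over i, while any time-dependent quantity is evaluated at t_i.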

[55]

MixGRPO-Flash Algorithm

MixGRPO-Flash (Algorithm 2) accelerates the ODE sampling after the sliding window, which does not contribute to the calculation of the policy ratio, by using DPM-Solver++ in Eq. (21). We introduce a compression rate r̃ such that the ODE sampling after the window requires only (T−l−w)r̃ time steps. And the total time-steps is T̃=l+w...
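The step bookkeeping described here can be sketched in Python. This is a rough illustration under stated assumptions, not the paper's implementation: the helper name `flash_segment_steps` is hypothetical, l is read as the number of steps before the window and w as the window width (inferred from the expression (T−l−w)), the compression rate is assumed to lie in (0, 1], and rounding up to an integer is an assumption because the text does not state a rounding rule.

```python
import math


def flash_segment_steps(T, l, w, r):
    """Count sampling steps before, inside, and after the sliding window.

    T: original number of sampling steps.
    l: steps before the window (assumed meaning of l).
    w: window width (assumed meaning of w).
    r: compression rate applied to the steps after the window (assumed 0 < r <= 1).
    """
    if l < 0 or w <= 0 or l + w > T:
        raise ValueError("window [l, l + w) must fit inside the T steps")
    if not 0 < r <= 1:
        raise ValueError("compression rate is assumed to lie in (0, 1]")
    after_full = T - l - w
    # (T - l - w) * r steps after the window; ceil is an assumed rounding rule.
    after_compressed = math.ceil(after_full * r)
    return {"before": l, "window": w, "after": after_compressed}


steps = flash_segment_steps(T=25, l=5, w=4, r=0.5)
```

With these illustrative values, the 16 post-window steps are compressed to 8, while the steps before and inside the window are left unchanged.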

[56]

Hybrid Inference for Solving Reward Hacking

As discussed in Section 5, reward hacking stems from the limited evaluation capabilities of the reward model. To address reward hacking and improve visualization, we employ the hybrid inference strategy from [47], which uses the post-trained model for low-SNR (signal-to-noise ratio) steps and the original mod...
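The model-switching idea can be sketched as a per-step schedule in Python. This is a hedged illustration only: the helper `hybrid_schedule` and the parameter `num_post_trained_steps` are hypothetical names, the sketch assumes that sampling starts from noise so the earliest steps are the low-SNR ones, and it assumes that the original model handles all remaining steps, which the truncated text above does not fully spell out.

```python
def hybrid_schedule(num_steps, num_post_trained_steps):
    """Return, for each sampling step, which model would be used.

    Assumption: the first `num_post_trained_steps` steps (lowest SNR, closest
    to pure noise) use the post-trained model; the rest use the original one.
    """
    if not 0 <= num_post_trained_steps <= num_steps:
        raise ValueError("num_post_trained_steps must be between 0 and num_steps")
    return [
        "post_trained" if i < num_post_trained_steps else "original"
        for i in range(num_steps)
    ]


schedule = hybrid_schedule(num_steps=5, num_post_trained_steps=2)
```

In a real sampler the strings would be replaced by the two networks, and the switch point would be a tunable setting rather than a fixed value.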

[57]

Cross-Dataset Experiments

To investigate the robustness and parameter sensitivity of the sliding window strategy in MixGRPO, we conducted a series of cross-dataset ablation studies. We established two reciprocal settings to evaluate both in-domain (ID) and out-of-domain (OOD) performance. In cross-dataset experiment 1, the model was trained on the HPD...

[58]

Coefficients-Preserving Sampling

In our MixGRPO framework, introducing stochasticity during the inference phase is crucial for effective exploration in reinforcement learning. While a common practice involves the use of Stochastic Differential Equations (SDEs), we adopt Coefficients-Preserving Sampling (CPS) [40] as a more refined alternative to maint...

[59]

More Visualized Results

[Figure 7 image grid. Columns: FLUX, DanceGRPO, MixGRPO. Prompts: "An image of an aircraft carrier made of cheese."; "16-year-old teenager wearing a white bear-ear hat with a smirk on their face."; "A lemon with a McDonald's hat."]

Figure 7. Comparison of the visualization results of FLUX, DanceGRPO, and MixGRPO...
