dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 03:46 UTC · model grok-4.3
The pith
By deriving full trajectory probabilities and modeling denoising as an MDP, dFlowGRPO enables rate-aware policy optimization for general discrete flow models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that discrete flow models admit an explicit full trajectory probability, and that casting the denoising steps as a Markov decision process lets the policy gradient incorporate the conditional transition rates together with the posterior model. The resulting dFlowGRPO therefore supports reinforcement learning across arbitrary probability paths and non-masked source distributions, as demonstrated by its application to the FUDOKI multimodal model on image generation and understanding benchmarks.
What carries the argument
The derivation of the full trajectory probability for discrete flow models, combined with the Markov decision process formulation of the denoising sequence. Together, these let the optimizer use both rate and posterior information.
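The trajectory-probability decomposition can be sketched concretely. The following is a minimal illustration (all names hypothetical, not the paper's code), assuming an Euler discretization of a continuous-time Markov chain: each step's transition probabilities come from a first-order expansion of the rate matrix, and the full trajectory log-probability is just the sum of per-step log transition probabilities.

```python
import numpy as np

def make_transition_fn(R):
    """One Euler step of a continuous-time Markov chain with rate
    matrix R (off-diagonal rates >= 0, rows summing to zero)."""
    def transition_fn(x, t, dt):
        p = np.eye(R.shape[0])[x] + dt * R[x]  # first-order expansion
        p = np.clip(p, 1e-12, None)            # guard tiny negatives
        return p / p.sum()
    return transition_fn

def trajectory_log_prob(transition_fn, traj, times):
    """Full trajectory log-probability: the log-probabilities of the
    individual denoising transitions simply add up."""
    logp = 0.0
    for k in range(len(traj) - 1):
        dt = times[k + 1] - times[k]
        p = transition_fn(traj[k], times[k], dt)
        logp += np.log(p[traj[k + 1]])
    return logp
```

In the paper's setting the transition function would additionally fold in the posterior model; this sketch only shows the rate-driven part and the additive log-probability structure.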
If this is right
- dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks.
- It reaches performance competitive with continuous flow-based models trained using FlowGRPO.
- The same training run also yields strong results on multimodal understanding tasks.
- The framework works for a broad family of probability paths and non-masked source distributions.
Where Pith is reading between the lines
- The MDP view of denoising could be reused to add reward signals in other discrete generative settings such as structured text or graph generation.
- Explicit use of transition rates may allow hybrid training pipelines that move parameters between discrete and continuous flow models.
- If the rate-aware updates remain stable, the approach offers a route to fine-tune multimodal models for both generation quality and reasoning accuracy within one framework.
Load-bearing premise
The full trajectory probability derivation and MDP formulation must hold accurately for arbitrary probability paths and non-masked source distributions without introducing biases or instabilities in the resulting policy updates.
What would settle it
An experiment in which dFlowGRPO, applied to a non-masked discrete flow model, produces training instability or inferior text-to-image performance compared with standard GRPO baselines would falsify the claim of broad applicability.
Original abstract
Discrete flow models (DFMs) are a class of flexible generative models for generating discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored reinforcement learning into dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.
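The abstract's claim that dLLMs are a special case of DFMs can be made concrete with rate matrices. The sketch below is a hedged illustration (the vocabulary size, rate constant, and both matrices are hypothetical, not from the paper): masking corresponds to a rate matrix whose only nonzero row belongs to the mask state, while a non-masked DFM permits transitions between arbitrary states.

```python
import numpy as np

V = 4        # hypothetical token vocabulary size; index V is the mask token
rate = 1.0   # illustrative unmasking rate

# Masked-source DFM (the dLLM special case): probability mass flows
# only out of the mask state, so every non-mask row of the rate
# matrix is zero.
R_masked = np.zeros((V + 1, V + 1))
R_masked[V, :V] = rate / V    # mask -> any token, uniformly
R_masked[V, V] = -rate        # rows of a rate matrix sum to zero

# A non-masked alternative: a uniform rate matrix in which every
# state can jump to every other state.
R_uniform = np.full((V, V), rate / (V - 1))
np.fill_diagonal(R_uniform, -rate)
```

A framework that only exploits the zero structure of `R_masked` would not transfer to `R_uniform`; the paper's generality claim is precisely that its derivation does not depend on that structure.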
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces dFlowGRPO, a unified RL framework for discrete flow models (DFMs) that extends beyond masked dLLMs. It derives the full trajectory probability for DFMs, formulates denoising as an MDP incorporating conditional transition rates and posteriors, and enables rate-aware policy optimization for a broad family of probability paths and non-masked sources. Applied to FUDOKI on text-to-image generation and multimodal understanding tasks, it reports outperformance over GRPO-type dLLM methods and competitiveness with continuous FlowGRPO models.
Significance. If the derivation is general and the MDP yields unbiased gradients without path-specific assumptions, this would meaningfully broaden RL fine-tuning for discrete generative models, moving beyond the masked-mixture restriction of dLLMs. The reported empirical gains on image generation and understanding tasks suggest practical value, especially if the method proves stable across non-masked sources.
major comments (2)
- [Methods (trajectory probability derivation and MDP formulation)] The central claim that the derived trajectory probability and MDP formulation support rate-aware policy optimization for arbitrary probability paths and non-masked source distributions (Abstract) is load-bearing. The skeptic concern is valid: discrete flows define transitions via path-dependent conditional rates, and if the derivation relies on identities that hold only under masking or specific mixtures, the resulting policy gradient would be biased for general DFMs. Please provide the explicit steps (e.g., in the Methods derivation) showing how the full trajectory probability remains unbiased without masking assumptions, or demonstrate via a counterexample that it does not.
- [Experiments] Empirical support is limited to FUDOKI (a masked-source model). The Abstract claims broad applicability to non-masked sources, yet no experiments or ablation on non-masked DFMs are reported. This leaves the generality of the rate-aware optimization untested and weakens the claim that dFlowGRPO outperforms GRPO-type methods across the stated family of models.
minor comments (2)
- [Experiments] Clarify the exact datasets, metrics, and baselines used for the text-to-image and understanding tasks to allow direct comparison with prior GRPO and FlowGRPO results.
- Ensure all notation for conditional rates, posteriors, and trajectory probabilities is defined before first use and is consistent between the derivation and the algorithm pseudocode.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our paper. We address the major comments below and have made revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Methods (trajectory probability derivation and MDP formulation)] The central claim that the derived trajectory probability and MDP formulation support rate-aware policy optimization for arbitrary probability paths and non-masked source distributions (Abstract) is load-bearing. The skeptic concern is valid: discrete flows define transitions via path-dependent conditional rates, and if the derivation relies on identities that hold only under masking or specific mixtures, the resulting policy gradient would be biased for general DFMs. Please provide the explicit steps (e.g., in the Methods derivation) showing how the full trajectory probability remains unbiased without masking assumptions, or demonstrate via a counterexample that it does not.
Authors: We appreciate the referee's careful scrutiny of the derivation. The trajectory probability for DFMs is derived from the general continuous-time Markov chain formulation, where the probability of a trajectory is the product of the instantaneous transition rates λ_t(x_{t+1} | x_t) integrated over the path, combined with the posterior probabilities from the model. This does not rely on masking-specific identities; the masking is a special case where the rate matrix has a particular structure (e.g., absorbing to mask token). The MDP is defined with states as the current discrete state, actions as the next state, and rewards incorporating the rate and posterior. The policy gradient is unbiased because it follows the standard REINFORCE or GRPO estimator applied to this general MDP. To make this explicit, we have added a detailed step-by-step derivation in the revised Methods section (Section 3.2), showing the expansion from the flow matching objective to the trajectory log-probability without any masking assumptions. We believe this addresses the concern and confirms generality. revision: yes
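The rebuttal's appeal to "the standard REINFORCE or GRPO estimator" can be sketched schematically (function names and the standardization form are hypothetical, not the paper's code): rewards within a group of trajectories sampled for the same prompt are normalized into advantages, and each trajectory's full log-probability is weighted by its advantage.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one
    group of trajectories sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def surrogate_loss(traj_log_probs, rewards):
    """REINFORCE-style surrogate: negative advantage-weighted sum of
    full trajectory log-probabilities. Differentiating this surrogate
    w.r.t. model parameters (in an autodiff framework) yields the
    policy-gradient estimate."""
    adv = grpo_advantages(rewards)
    return -float(np.sum(adv * np.asarray(traj_log_probs, dtype=float)))
```

The unbiasedness question raised by the referee lives inside `traj_log_probs`: the estimator itself is standard, so any bias for general DFMs would have to enter through the trajectory-probability derivation.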
- Referee: [Experiments] Empirical support is limited to FUDOKI (a masked-source model). The Abstract claims broad applicability to non-masked sources, yet no experiments or ablation on non-masked DFMs are reported. This leaves the generality of the rate-aware optimization untested and weakens the claim that dFlowGRPO outperforms GRPO-type methods across the stated family of models.
Authors: We agree that our experiments focus on FUDOKI, which uses a masked source distribution, as it is a state-of-the-art multimodal DFM. While this validates the method on a practical model, we acknowledge that direct empirical comparison on non-masked DFMs would provide stronger evidence for the broad applicability. Implementing and training non-masked variants requires significant additional resources and model development, which was beyond the scope of this work. In the revised manuscript, we have added a discussion in the Experiments and Conclusion sections clarifying the scope of the empirical results and emphasizing that the theoretical framework applies generally, with FUDOKI serving as a representative case. We also suggest directions for future work on non-masked sources. revision: partial
- Empirical evaluation on non-masked discrete flow models
Circularity Check
Derivation of trajectory probability and MDP formulation is self-contained
Full rationale
The paper states that it derives the full trajectory probability for DFMs and formulates denoising as a Markov decision process directly from the conditional transition rates and posterior model structure. No equations or steps in the provided abstract or description reduce this derivation to fitted parameters, self-definitions, or load-bearing self-citations by construction. The central claims about supporting a broad family of probability paths are presented as following from the derivation rather than presupposing the target results. Empirical evaluations on FUDOKI are downstream applications and do not feed back into the derivation. This is the standard case of a self-contained derivation without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: denoising in DFMs can be formulated as a Markov decision process using full trajectory probabilities derived from conditional transition rates and posterior models.
Lean theorems connected to this paper
- `IndisputableMonolith/Cost/FunctionalEquation.lean` · `washburn_uniqueness_aczel` (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model"
- `IndisputableMonolith/Foundation/ArithmeticFromLogic.lean` · `embed_injective` (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "the transition probability ratio ... is a product of dimension-wise expected posterior ratios, each reweighted by a rate-dependent term"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.