Mean Flow Policy Optimization
Pith reviewed 2026-05-10 11:11 UTC · model grok-4.3
The pith
Mean Flow Policy Optimization uses few-step flow models to represent RL policies, matching diffusion performance while cutting training and inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Representing policies as MeanFlow models and optimizing them via soft policy iteration under the maximum entropy RL framework produces policies whose performance on standard continuous-control benchmarks equals or exceeds that of diffusion-based methods while substantially lowering both training and inference cost.
What carries the argument
MeanFlow models, a class of few-step flow-based generative models serving as the policy class, combined with maximum-entropy soft policy iteration adapted for action-likelihood evaluation and soft improvement.
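As a rough illustration of what "few-step" means here, the sketch below follows the sampling rule from the original MeanFlow formulation: a model predicts an average velocity u(z, r, t), and a sample moves from time t to an earlier time r via z_r = z_t - (t - r) * u(z_t, r, t), with t = 1 as noise and t = 0 as data. This page does not state MFPO's exact parameterization, so the state conditioning, the function names, and the closed-form toy velocity field (which stands in for a trained network) are all assumptions made for illustration.

```python
import numpy as np

def sample_action(avg_velocity, state, noise, num_steps=1):
    """Few-step sampling: move from t=1 (noise) to t=0 (action).

    avg_velocity(z, r, t, state) is a hypothetical stand-in for a trained
    MeanFlow network that predicts the average velocity over [r, t].
    """
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise
    for t, r in zip(times[:-1], times[1:]):
        # MeanFlow update: z_r = z_t - (t - r) * u(z_t, r, t)
        z = z - (t - r) * avg_velocity(z, r, t, state)
    return z

def toy_avg_velocity(z, r, t, state):
    """Exact average velocity for a toy target: a point mass at tanh(state).

    On the linear path z_t = (1 - t) * x + t * eps with x fixed, the velocity
    eps - x is constant along the path and equals (z_t - x) / t.
    """
    target = np.tanh(state)
    return (z - target) / t

rng = np.random.default_rng(0)
state = np.array([[0.5, -1.0]])
noise = rng.normal(size=state.shape)
one_step = sample_action(toy_avg_velocity, state, noise, num_steps=1)
four_step = sample_action(toy_avg_velocity, state, noise, num_steps=4)
```

Because the toy velocity field is exact, one step and four steps land on the same action. With a learned network the two would generally differ, which is the trade-off between sampling cost and fidelity that the paper's efficiency claim rests on.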
If this is right
- Expressive policy classes in RL need not incur the full iterative cost of diffusion if few-step flow alternatives exist.
- The maximum-entropy framework can be applied to generative-model families other than diffusion without losing its theoretical guarantees.
- Reducing the number of sampling steps in the policy directly translates into faster online RL training loops.
- Once action likelihoods and soft improvement are tractable, any few-step generative model becomes a candidate for entropy-regularized policy optimization.
Where Pith is reading between the lines
- The same few-step flow construction could be tried in offline RL or model-based settings where repeated policy evaluation is the dominant cost.
- If the efficiency advantage persists at scale, complex policies could be deployed on hardware with tighter latency budgets than current diffusion methods allow.
- Hybrid approaches that combine MeanFlow with existing acceleration tricks such as distillation or consistency models remain unexplored in the paper.
Load-bearing premise
The two MeanFlow-specific obstacles of action likelihood evaluation and soft policy improvement can be solved without introducing instabilities or bias that would undermine the maximum-entropy guarantees.
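To make concrete why action likelihood evaluation is load-bearing, the sketch below shows the standard entropy-regularized TD target from maximum-entropy RL, in which the log-probability of the next action enters directly. It is a generic formula, not MFPO's estimator; the function name, argument names, and default values for the temperature `alpha` and discount `gamma` are assumptions.

```python
import numpy as np

def soft_td_target(reward, done, next_q, next_log_prob, alpha=0.2, gamma=0.99):
    """Entropy-regularized TD target used in soft policy evaluation.

    next_log_prob is log pi(a' | s') for the sampled next action. For a
    MeanFlow policy this value has to come from some likelihood estimator,
    and any bias in it propagates into the target.
    """
    soft_value = next_q - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value

targets = soft_td_target(
    reward=np.array([1.0, 1.0]),
    done=np.array([0.0, 1.0]),
    next_q=np.array([2.0, 2.0]),
    next_log_prob=np.array([-1.0, -1.0]),
)
```

The first transition bootstraps from the soft value of the next state, while the second is terminal and keeps only the reward.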
What would settle it
The central claim would be falsified by runs on MuJoCo, DeepMind Control Suite, or HumanoidBench in which MFPO either clearly underperforms the diffusion baselines or shows no substantial reduction in training and inference wall-clock time.
Original abstract
Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo, DeepMind Control Suite and HumanoidBench benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/dongxiaoyi-xyz/MFPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mean Flow Policy Optimization (MFPO), which represents RL policies via MeanFlow (few-step flow-based) generative models and optimizes them under the maximum-entropy objective using soft policy iteration. It claims to resolve two MeanFlow-specific challenges (action likelihood evaluation and soft policy improvement), thereby achieving performance on MuJoCo, DeepMind Control Suite, and HumanoidBench benchmarks that is comparable to or better than diffusion-based baselines while substantially lowering training and inference time. Code is released.
Significance. If the MeanFlow-specific implementations of likelihood evaluation and policy improvement are shown to be unbiased and to preserve the fixed-point guarantees of soft policy iteration, the approach would provide a practical efficiency improvement over diffusion policies without sacrificing the theoretical benefits of maximum-entropy RL. The public code release is a clear strength for reproducibility.
Major comments (2)
- [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
- [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
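For reference, the quantities both comments refer to can be written in the usual soft actor-critic notation. This is the textbook form of soft policy iteration, not the paper's MeanFlow-specific derivation, and the symbols (temperature \alpha, policy class \Pi, normalizer Z) are assumed rather than taken from the manuscript.

```latex
% Maximum-entropy objective with temperature \alpha
J(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t}\big(r(s_t,a_t)
         + \alpha\,\mathcal{H}(\pi(\cdot\mid s_t))\big)\Big]

% Soft policy evaluation: requires \log\pi(a'\mid s'), hence a likelihood estimator
(\mathcal{T}^{\pi}Q)(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'}\,
  \mathbb{E}_{a'\sim\pi}\big[Q(s',a') - \alpha\log\pi(a'\mid s')\big]

% Soft policy improvement: KL projection onto the Boltzmann distribution of Q
\pi_{\text{new}} = \arg\min_{\pi'\in\Pi}
  D_{\mathrm{KL}}\Big(\pi'(\cdot\mid s)\,\Big\|\,
  \frac{\exp\big(Q^{\pi_{\text{old}}}(s,\cdot)/\alpha\big)}{Z^{\pi_{\text{old}}}(s)}\Big)
```

The first comment concerns the \log\pi term in the evaluation step; the second concerns whether the KL projection is still computed faithfully when \pi' is a few-step flow model.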
Minor comments (1)
- [Experiments] The number of flow steps and the precise form of the velocity field used in the MeanFlow policy should be stated explicitly in the experimental setup for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and valuable comments on the theoretical underpinnings of Mean Flow Policy Optimization. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.
- Referee: [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
  Authors: We agree that an explicit derivation of the action log-likelihood estimator is required to rigorously establish unbiasedness of the entropy term. Section 3.2 describes the Monte Carlo estimation procedure based on the MeanFlow probability path, but the full mathematical steps were not expanded for brevity. In the revised manuscript we will add a dedicated appendix containing the complete derivation, showing that the estimator is unbiased for the few-step MeanFlow parameterization and therefore preserves the fixed-point properties of soft policy iteration under the maximum-entropy objective.
  Revision: yes
- Referee: [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
  Authors: We acknowledge that a formal fixed-point analysis of the soft policy improvement operator under the approximate MeanFlow parameterization is absent from the current manuscript. Section 3.3 outlines the practical adaptation that uses few-step sampling and an approximated KL divergence, but does not supply a contraction-mapping argument. In the revision we will include a new subsection providing a theoretical discussion: we will show that, under the assumption that the trained MeanFlow model converges to the target distribution (as enforced by the training loss), the bias in the KL penalty vanishes asymptotically and the operator retains the essential contraction property of standard soft policy iteration. Empirical support from the MuJoCo and DeepMind Control Suite results will be referenced to illustrate that any residual approximation error does not prevent convergence to high-performing policies.
  Revision: yes
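The contraction property mentioned in this exchange can be checked numerically in the exact tabular case. The sketch below applies one soft policy-evaluation backup for a fixed policy on a small random MDP and compares the sup-norm distance between two Q-tables before and after. It covers only the exact setting with known log-probabilities and says nothing about the approximate MeanFlow case under dispute; the MDP, its sizes, and all names are hypothetical.

```python
import numpy as np

def soft_backup(q, reward, trans, policy, alpha=0.2, gamma=0.99):
    """One soft policy-evaluation backup for a fixed tabular policy.

    Shapes: q and reward are (S, A), trans is (S, A, S), policy is (S, A).
    """
    # Soft state value: V(s) = E_{a ~ pi}[Q(s, a) - alpha * log pi(a | s)]
    soft_v = np.sum(policy * (q - alpha * np.log(policy)), axis=1)
    # Expected next-state soft value under the transition kernel
    return reward + gamma * trans @ soft_v

rng = np.random.default_rng(0)
num_states, num_actions, gamma = 5, 3, 0.99
reward = rng.normal(size=(num_states, num_actions))
trans = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
policy = rng.dirichlet(np.ones(num_actions), size=num_states)

q1 = rng.normal(size=(num_states, num_actions))
q2 = rng.normal(size=(num_states, num_actions))
gap_before = np.max(np.abs(q1 - q2))
gap_after = np.max(np.abs(
    soft_backup(q1, reward, trans, policy, gamma=gamma)
    - soft_backup(q2, reward, trans, policy, gamma=gamma)
))
```

Here `gap_after` is at most `gamma * gap_before`, because the entropy term cancels in the difference when the same exact log-probabilities are used for both tables. The referee's concern is what happens when those log-probabilities are replaced by an estimator, which this check does not exercise.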
Circularity Check
No significant circularity; derivation builds on external frameworks
The paper adapts the standard soft policy iteration algorithm from maximum-entropy RL to MeanFlow policies and states that it solves the two MeanFlow-specific challenges of likelihood evaluation and policy improvement. No equations or claims are presented that reduce the performance claims, the soft Q-function fixed point, or the reported benchmark results to quantities defined only by the authors' own fitted constants, self-referential definitions, or a chain of their prior unverified results. The experimental comparisons to diffusion baselines on MuJoCo and DeepMind Control Suite therefore constitute independent evidence rather than a tautology.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
- Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching
  Presents adjoint matching for scalable max-ent RL training of diffusion policies, enabling simulation-free optimization.