Mean Flow Policy Optimization

Jian Cheng; Xiaoyi Dong; Xi Sheryl Zhang

arxiv: 2604.14698 · v2 · pith:GCKRMTXFnew · submitted 2026-04-16 · 💻 cs.LG


Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

Pith reviewed 2026-05-10 11:11 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · policy optimization · flow-based models · maximum entropy RL · diffusion policies · MuJoCo · continuous control · generative policies


The pith

Mean Flow Policy Optimization uses few-step flow models to represent RL policies, matching diffusion performance while cutting training and inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replaces diffusion models with MeanFlow models as policy representations in online reinforcement learning. Diffusion policies are expressive but slow because of their iterative generation steps during both training and inference. MeanFlow models require only a few steps, and the method optimizes them under the maximum entropy framework by adapting soft policy iteration to handle action likelihood evaluation and policy improvement. On MuJoCo, DeepMind Control Suite, and HumanoidBench tasks, the abstract reports performance at or above diffusion baselines together with considerable reductions in training and inference time.

Core claim

Representing policies as MeanFlow models and optimizing them via soft policy iteration under the maximum entropy RL framework produces policies whose performance on standard continuous-control benchmarks equals or exceeds that of diffusion-based methods while substantially lowering both training and inference cost.

What carries the argument

MeanFlow models, a class of few-step flow-based generative models serving as the policy class, combined with maximum-entropy soft policy iteration adapted for action-likelihood evaluation and soft improvement.
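To illustrate what "few-step" means here, the following is a minimal sketch of the MeanFlow sampling rule z_r = z_t - (t - r) * u(z_t, r, t), where u is the average velocity between times r and t. It is not the paper's policy network: `toy_average_velocity` is a hypothetical closed-form stand-in for a learned model, chosen so that every noise sample flows to a single fixed action, and the names `meanflow_sample` and `target_action` are assumptions made for this example.

```python
import numpy as np

def toy_average_velocity(z_t, r, t, target):
    """Hypothetical closed-form average velocity (not a learned network).

    For the straight-line path z_t = (1 - t) * target + t * noise, the velocity
    is the constant (noise - target), which equals (z_t - target) / t for t > 0.
    Because the velocity is constant along the path, the average does not
    depend on r; the argument is kept only to mirror the u(z_t, r, t) signature.
    """
    return (z_t - target) / t

def meanflow_sample(noise, target, num_steps=1):
    """Move a noise sample at t=1 to an action at t=0 in `num_steps` jumps."""
    times = np.linspace(1.0, 0.0, num_steps + 1)
    z = noise
    for t, r in zip(times[:-1], times[1:]):
        u = toy_average_velocity(z, r, t, target)
        z = z - (t - r) * u  # z_r = z_t - (t - r) * u(z_t, r, t)
    return z

rng = np.random.default_rng(0)
target_action = np.array([0.3, -0.7])
noise = rng.standard_normal(2)

one_step = meanflow_sample(noise, target_action, num_steps=1)
four_step = meanflow_sample(noise, target_action, num_steps=4)
```

In this toy case one jump and four jumps reach the same action, because the average velocity is exact. With a learned, approximate average velocity the two would generally differ, which is where the step count becomes a trade-off between cost and accuracy.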

If this is right

Expressive policy classes in RL need not incur the full iterative cost of diffusion if few-step flow alternatives exist.
The maximum-entropy framework can be applied to generative-model families other than diffusion without losing its theoretical guarantees.
Reducing the number of sampling steps in the policy directly translates into faster online RL training loops.
Once action likelihoods and soft improvement are tractable, any few-step generative model becomes a candidate for entropy-regularized policy optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same few-step flow construction could be tried in offline RL or model-based settings where repeated policy evaluation is the dominant cost.
If the efficiency advantage persists at scale, complex policies could be deployed on hardware with tighter latency budgets than current diffusion methods allow.
Hybrid approaches that combine MeanFlow with existing acceleration tricks such as distillation or consistency models remain unexplored in the paper.

Load-bearing premise

The two MeanFlow-specific obstacles of action likelihood evaluation and soft policy improvement can be solved without introducing instabilities or bias that would undermine the maximum-entropy guarantees.

What would settle it

A set of runs on MuJoCo or DeepMind Control Suite in which MFPO either underperforms the diffusion baselines by a clear margin or shows no substantial reduction in training and inference wall-clock time would falsify the central claim.

Original abstract

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo, DeepMind Control Suite and HumanoidBench benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/dongxiaoyi-xyz/MFPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MFPO is a practical port of MeanFlow to max-entropy RL policies that delivers measurable speedups on standard benchmarks, but its theoretical grounding depends on whether the likelihood and improvement fixes are unbiased.


The core contribution is adapting an existing few-step flow model (MeanFlow) to serve as a policy in online RL. They keep the maximum-entropy objective and soft policy iteration, then supply the two missing pieces: a way to evaluate action likelihood under the MeanFlow representation and a compatible soft improvement operator. That combination is new relative to the diffusion-RL papers they cite, and it directly targets the training and sampling cost that has limited those methods so far. On MuJoCo and DeepMind Control Suite they report performance at or above the diffusion baselines with lower wall-clock time, and they release code, which is the right move for a methods paper like this. Those are the concrete positives.

The soft spot is the one the stress-test note flags. Soft policy iteration only converges to the correct soft Q-function if the log-probability term is unbiased and the improvement step does not inject systematic error. MeanFlow is a few-step approximation, so any Monte-Carlo estimator or velocity-field approximation they use for the likelihood could create bias in the entropy regularizer. The abstract claims the challenges are solved, but the strength of the result rests on whether those solutions are exact or merely practical. If the derivations hold up under inspection, the efficiency claim is credible; if they rely on heuristics, the performance numbers may be harder to interpret.

This paper is aimed at researchers already working on generative policies for continuous control who need faster sampling. It is solid enough to merit peer review because the benchmarks are standard, the code is public, and the engineering problem it attacks is real, even if the theoretical details will need careful checking in revision.
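The bias concern in the letter can be made concrete with the standard soft Bellman target from maximum-entropy RL, r + gamma * (Q(s', a') - alpha * log pi(a'|s')). The sketch below is generic and is not taken from the paper's code: the function name `soft_td_target`, the temperature `alpha = 0.2`, the discount `gamma = 0.99`, and the numeric inputs are all assumptions chosen for illustration. It shows that a constant error in the log-probability estimate shifts the target by -gamma * alpha * error.

```python
import math

def soft_td_target(reward, next_q, next_log_prob, alpha=0.2, gamma=0.99, done=False):
    """Soft Bellman target: r + gamma * (Q(s', a') - alpha * log pi(a'|s')).

    alpha and gamma are illustrative defaults, not values from the paper.
    """
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * (next_q - alpha * next_log_prob)

true_log_prob = -1.5
log_prob_error = 0.5  # hypothetical systematic error in the likelihood estimate

exact = soft_td_target(reward=1.0, next_q=10.0, next_log_prob=true_log_prob)
biased = soft_td_target(reward=1.0, next_q=10.0,
                        next_log_prob=true_log_prob + log_prob_error)

# The target moves by -gamma * alpha * error, so a systematic likelihood
# error becomes a systematic error in the critic's regression target.
shift = biased - exact
```

Because the shift scales with alpha, the practical impact of a biased likelihood depends on how strongly entropy is weighted, which is one reason the referee's request for the explicit estimator matters.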

Referee Report

2 major / 1 minor

Summary. The paper proposes Mean Flow Policy Optimization (MFPO), which represents RL policies via MeanFlow (few-step flow-based) generative models and optimizes them under the maximum-entropy objective using soft policy iteration. It claims to resolve two MeanFlow-specific challenges—action likelihood evaluation and soft policy improvement—thereby achieving performance on MuJoCo and DeepMind Control Suite benchmarks that is comparable to or better than diffusion-based baselines while substantially lowering training and inference time. Code is released.

Significance. If the MeanFlow-specific implementations of likelihood evaluation and policy improvement are shown to be unbiased and to preserve the fixed-point guarantees of soft policy iteration, the approach would provide a practical efficiency improvement over diffusion policies without sacrificing the theoretical benefits of maximum-entropy RL. The public code release is a clear strength for reproducibility.

major comments (2)

[Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.
[Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.
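For context on the first major comment: when a one-step generator is an invertible map from noise to action, its exact log-likelihood follows from the change-of-variables formula, log p(a) = log p(z) - log |det da/dz|. The sketch below checks that identity on a toy affine map where the answer is known in closed form. It is not the paper's estimator, which this page does not reproduce; the matrix `A`, offset `b`, and all function names are hypothetical.

```python
import numpy as np

def standard_normal_logpdf(z):
    """Log-density of N(0, I) at a single vector z."""
    d = z.shape[-1]
    return -0.5 * (z @ z) - 0.5 * d * np.log(2.0 * np.pi)

def affine_flow_log_prob(action, A, b):
    """log p(a) for a = A z + b with z ~ N(0, I), via change of variables."""
    z = np.linalg.solve(A, action - b)       # invert the map to recover the noise
    _, log_abs_det = np.linalg.slogdet(A)    # log |det da/dz|
    return standard_normal_logpdf(z) - log_abs_det

def gaussian_logpdf(x, mean, cov):
    """Closed-form Gaussian log-density, used only as a reference."""
    d = x.shape[-1]
    diff = x - mean
    _, log_det = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (maha + log_det + d * np.log(2.0 * np.pi))

A = np.array([[1.5, 0.2], [-0.3, 0.8]])  # hypothetical invertible one-step map
b = np.array([0.1, -0.4])
action = np.array([0.6, 0.2])

lp_flow = affine_flow_log_prob(action, A, b)
lp_ref = gaussian_logpdf(action, b, A @ A.T)  # a = A z + b implies a ~ N(b, A A^T)
```

A learned few-step map is nonlinear, so the Jacobian term is no longer a constant and typically has to be computed or estimated per sample. Whether that estimate is exact or approximate is the question the referee raises.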

minor comments (1)

[Experiments] The number of flow steps and the precise form of the velocity field used in the MeanFlow policy should be stated explicitly in the experimental setup for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable comments on the theoretical underpinnings of Mean Flow Policy Optimization. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.


Referee: [Method section (action likelihood evaluation)] The manuscript does not supply the explicit estimator or derivation for the action log-likelihood under the MeanFlow policy (referenced in the abstract and the method section). Without this, it is impossible to confirm that the entropy term remains unbiased, which is load-bearing for the claim that soft policy iteration converges to the true soft-optimal policy.

Authors: We agree that an explicit derivation of the action log-likelihood estimator is required to rigorously establish unbiasedness of the entropy term. Section 3.2 describes the Monte Carlo estimation procedure based on the MeanFlow probability path, but the full mathematical steps were not expanded for brevity. In the revised manuscript we will add a dedicated appendix containing the complete derivation, showing that the estimator is unbiased for the few-step MeanFlow parameterization and therefore preserves the fixed-point properties of soft policy iteration under the maximum-entropy objective.
revision: yes

Referee: [Method section (soft policy improvement)] No analysis or fixed-point argument is given for the soft policy improvement operator when applied to the few-step MeanFlow parameterization (abstract and method section). Any approximation in the probability path or velocity field could introduce bias into the KL penalty, undermining the theoretical justification for the reported benchmark gains.

Authors: We acknowledge that a formal fixed-point analysis of the soft policy improvement operator under the approximate MeanFlow parameterization is absent from the current manuscript. Section 3.3 outlines the practical adaptation that uses few-step sampling and an approximated KL divergence, but does not supply a contraction-mapping argument. In the revision we will include a new subsection providing a theoretical discussion: we will show that, under the assumption that the trained MeanFlow model converges to the target distribution (as enforced by the training loss), the bias in the KL penalty vanishes asymptotically and the operator retains the essential contraction property of standard soft policy iteration. Empirical support from the MuJoCo and DeepMind Control Suite results will be referenced to illustrate that any residual approximation error does not prevent convergence to high-performing policies.
revision: yes
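As background for the exchange above, the soft policy improvement step in standard maximum-entropy RL projects the policy onto the Boltzmann distribution proportional to exp(Q / alpha). The sketch below works this out for a single state with three discrete actions, where the projection is exact and can be checked numerically. It is a textbook toy, not the MeanFlow operator discussed in the rebuttal; the Q-values, `alpha = 0.5`, and the function names are assumptions made for this example.

```python
import numpy as np

def soft_improvement(q_values, alpha):
    """Return pi_new = softmax(Q / alpha) and the soft value alpha * logsumexp(Q / alpha)."""
    scaled = q_values / alpha
    m = scaled.max()                               # subtract the max for numerical stability
    log_z = m + np.log(np.exp(scaled - m).sum())
    policy = np.exp(scaled - log_z)
    return policy, alpha * log_z

def soft_objective(policy, q_values, alpha):
    """E_pi[Q] + alpha * H(pi): the per-state quantity soft improvement maximizes."""
    entropy = -(policy * np.log(policy)).sum()
    return policy @ q_values + alpha * entropy

q = np.array([1.0, 2.0, 0.5])   # hypothetical soft Q-values for one state
alpha = 0.5                     # hypothetical temperature

pi_new, soft_v = soft_improvement(q, alpha)
uniform = np.full(3, 1.0 / 3.0)

best = soft_objective(pi_new, q, alpha)      # attains the soft value
baseline = soft_objective(uniform, q, alpha) # any other policy scores lower
```

The gap between `best` and the objective of any other policy equals alpha times the KL divergence from that policy to the Boltzmann target, so an approximation error in the KL term translates directly into a suboptimal improvement step.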

Circularity Check

0 steps flagged

No significant circularity; derivation builds on external frameworks


The paper adapts the standard soft policy iteration algorithm from maximum-entropy RL to MeanFlow policies and states that it solves the two MeanFlow-specific challenges of likelihood evaluation and policy improvement. No equations or claims are presented that reduce the performance claims, the soft Q-function fixed point, or the reported benchmark results to quantities defined only by the authors' own fitted constants, self-referential definitions, or a chain of their prior unverified results. The experimental comparisons to diffusion baselines on MuJoCo and DeepMind Control Suite therefore constitute independent evidence rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that MeanFlow can be made compatible with soft policy iteration; the method introduces no new entities, though standard RL hyperparameters remain.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scalable Maximum Entropy Reinforcement Learning for Diffusion Policies via Adjoint Matching
cs.LG 2026-06 unverdicted novelty 6.0

Presents adjoint matching for scalable max-ent RL training of diffusion policies, enabling simulation-free optimization.