Score-Based One-step MeanFlow Policy Optimization
Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3
The pith
SOM enables one-step MeanFlow policies in online reinforcement learning by deriving the target velocity field from the Q-function using score estimation and the probability flow ODE.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SOM is an actor-critic algorithm that constructs the target velocity field for MeanFlow directly from the Q-function via score estimation and a probability flow ODE. This resolves the need for samples from the target action distribution, enabling single-step generation of actions that concentrate on high-value modes in fully online RL settings.
What carries the argument
The construction of the MeanFlow target velocity field from the Q-function using score estimation and the probability flow ODE, which allows single-step policy generation without target samples.
Load-bearing premise
The target velocity field for MeanFlow can be accurately constructed from the Q-function via score estimation and a probability flow ODE without any samples from the target action distribution.
What would settle it
If the single-step SOM policy fails to match or exceed the performance of multi-step diffusion policies on locomotion tasks while maintaining the claimed speedups, or if the constructed velocity field does not align with high-Q regions, the central claim would be falsified.
Figures
read the original abstract
Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm for online RL. It resolves the requirement for target-distribution samples in MeanFlow by constructing the target velocity field directly from the learned Q-function via score estimation combined with the probability flow ODE. This enables single-step policy generation. The abstract claims that SOM achieves state-of-the-art performance on locomotion tasks while substantially reducing both training and inference time relative to prior diffusion- and flow-matching-based policies.
Significance. If the velocity-field construction is unbiased and the single-step policy concentrates on high-value modes, the result would be significant: it would make expressive flow-based policies practical for online RL by eliminating multi-step denoising at inference time and avoiding the need for target samples during training. The approach combines value-based guidance with flow matching in a way that could generalize beyond the reported locomotion tasks, provided the non-stationary online setting does not amplify approximation errors.
major comments (2)
- [Abstract / central derivation] The central technical claim (abstract and method derivation) asserts that the target velocity field v_t(x) for MeanFlow can be obtained exactly from the Q-function via score estimation of ∇_x log p_t(x) and the probability flow ODE, without ever sampling from the target action distribution. This identity is load-bearing for the entire method; the provided text supplies no explicit derivation, regularity conditions, or proof that the Q-induced score equals the true score of the optimal policy, and the non-stationary online loop can feed any bias back into the actor update.
- [Abstract] The abstract states that SOM 'achieves state-of-the-art performance on locomotion tasks' with a single generation step, yet supplies no experimental details, baselines, metrics, or verification. Without these, the performance claim cannot be evaluated and the soundness of the velocity-field construction remains untested.
minor comments (1)
- [Abstract] The abstract is unusually dense with technical claims but contains no equation numbers or section references that would allow a reader to locate the score-estimation construction.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, providing clarifications on the technical claims and experimental presentation while indicating planned revisions.
read point-by-point responses
-
Referee: [Abstract / central derivation] The central technical claim (abstract and method derivation) asserts that the target velocity field v_t(x) for MeanFlow can be obtained exactly from the Q-function via score estimation of ∇_x log p_t(x) and the probability flow ODE, without ever sampling from the target action distribution. This identity is load-bearing for the entire method; the provided text supplies no explicit derivation, regularity conditions, or proof that the Q-induced score equals the true score of the optimal policy, and the non-stationary online loop can feed any bias back into the actor update.
Authors: We agree that the manuscript would benefit from an explicit step-by-step derivation of the velocity-field identity. The construction follows from substituting the score estimate ∇_x log p_t(x) ≈ ∇_x Q(x) (derived via the probability flow ODE under the optimal policy) into the MeanFlow target velocity, yielding v_t(x) without target samples. We will add a dedicated subsection in the revised Section 3 with the full derivation, including regularity conditions such as Lipschitz continuity of the Q-function and sufficient smoothness of the flow. On the non-stationary concern, the critic is updated with a slower target network to mitigate feedback of approximation errors; we will expand the discussion of this stabilization mechanism and include additional analysis of bias propagation. revision: yes
-
Referee: [Abstract] The abstract states that SOM 'achieves state-of-the-art performance on locomotion tasks' with a single generation step, yet supplies no experimental details, baselines, metrics, or verification. Without these, the performance claim cannot be evaluated and the soundness of the velocity-field construction remains untested.
Authors: The abstract is intentionally concise. Full experimental details—including baselines (Diffusion Policy, Flow Matching variants, SAC), metrics (normalized return, inference time), verification across MuJoCo locomotion environments with multiple random seeds, and ablation studies confirming the velocity-field construction—are reported in Section 4 and Appendix B. The single-step generation and training-time reductions are directly measured against these baselines. We will add a sentence in the abstract directing readers to the experimental section for completeness. revision: partial
Circularity Check
No circularity: velocity field derived from independent Q-function
full rationale
The provided abstract and description state that the target velocity field is constructed directly from the Q-function via score estimation and probability flow ODE, without requiring samples from the target distribution. No equations, self-citations, or fitted-parameter redefinitions are exhibited in the given text that would reduce any claimed prediction to its inputs by construction. The central step uses an externally learned critic to define the actor's target, which is a standard actor-critic separation and remains falsifiable via RL benchmark performance. This is the most common honest non-finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Is Conditional Generative Modeling all you need for Decision-Making?
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Iterated denoising energy matching for sampling from boltzmann densities
Tara Akhound-Sadegh, Jarrid Rector-Brooks, Avishek Joey Bose, Sarthak Mittal, Pablo Lemos, Cheng-Hao Liu, Marcin Sendera, Siamak Ravanbakhsh, Gauthier Gidel, Yoshua Bengio, et al. Iterated denoising energy matching for sampling from boltzmann densities. InProceedings of the 41st International Conference on Machine Learning, pages 760–786, 2024
work page 2024
-
[3]
Score regularized policy optimization through diffusion behavior
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[4]
Diffusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025
work page 2025
-
[5]
Diffusion-based reinforcement learning via q-weighted variational policy optimization
Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024
work page 2024
-
[6]
One step diffusion via shortcut models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[7]
Mean flows for one-step generative modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026
work page 2026
-
[8]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018
work page 2018
-
[9]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 6840–6851, 2020
work page 2020
-
[10]
Planning with Diffusion for Flexible Behavior Synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Donghyeon Ki, JunHyeok Oh, Seong-Woong Shim, and Byung-Jun Lee. Prior-guided diffusion planning for offline reinforcement learning.arXiv preprint arXiv:2505.10881, 2025
-
[12]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[13]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[14]
Simplifying, stabilizing and scaling continuous-time consistency models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[15]
Efficient online reinforcement learning for diffusion policy
Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. InInternational Conference on Machine Learning, pages 41837–41853. PMLR, 2025
work page 2025
-
[16]
Learning a diffusion model policy from rewards via q-score matching
Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. InForty-first International Conference on Machine Learning, 2024
work page 2024
-
[17]
Diffusion Policy Policy Optimization
Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024. 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022
work page 2022
-
[19]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[20]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[22]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, pages 32211–32252, 2023
work page 2023
-
[23]
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[24]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021
work page 2021
-
[25]
Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998
work page 1998
-
[26]
Mujoco: A physics engine for model-based control
Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012
work page 2012
-
[27]
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Diffusion actor-critic with entropy regulator
Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang WU, Jingliang Duan, and Shengbo Eben Li. Diffusion actor-critic with entropy regulator. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[29]
Diffusion policies as an expressive policy class for offline reinforcement learning
Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023
work page 2023
-
[30]
Policy representation via diffusion probability model for reinforcement learning
Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023
-
[31]
One-step diffusion with distribution matching distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Free- man, and Taesung Park. One-step diffusion with distribution matching distillation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024
work page 2024
-
[32]
Mean flow policy with instantaneous velocity constraint for one-step action generation
Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yuxin Chen, Yiheng Li, Hongyang Li, Masayoshi Tomizuka, and Shengbo Eben Li. Mean flow policy with instantaneous velocity constraint for one-step action generation. InThe Fourteenth International Conference on Learning Representations, 2026. 11 A Algorithm Pseudocode Algorithm 1Score-Based One-step Me...
work page 2026
-
[33]
To create interleaved high- and low-reward modes, we assign alternating mixture weights wi = 2 for even i and wi = 1 for odd i. The final reward is given by the normalized mixture density, producing a smooth multimodal reward landscape with values in[0,1]. F.2 Two-Moons SOM SDAC (10-step) DACER (10-step) MFP Figure 13:Two-Moons Results.Arrow plots from a ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.