pith. sign in

arxiv: 2605.23365 · v1 · pith:3YMGADYDnew · submitted 2026-05-22 · 💻 cs.LG · cs.AI

Score-Based One-step MeanFlow Policy Optimization

Pith reviewed 2026-05-25 05:20 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learningMeanFlowscore estimationflow matchingactor-criticpolicy optimizationonline RLlocomotion tasks
0
0 comments X

The pith

SOM enables one-step MeanFlow policies in online reinforcement learning by deriving the target velocity field from the Q-function using score estimation and the probability flow ODE.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Score-Based One-step MeanFlow Policy Optimization (SOM) as an actor-critic algorithm for reinforcement learning. It addresses the issue that MeanFlow requires samples from the target distribution, which are unavailable in online RL, by constructing the target velocity field directly from the Q-function. This allows policies to map noise to actions in a single network evaluation. The approach concentrates probability mass on high-value modes and achieves state-of-the-art results on locomotion tasks while cutting training and inference time compared to diffusion and flow-matching policies.

Core claim

SOM is an actor-critic algorithm that constructs the target velocity field for MeanFlow directly from the Q-function via score estimation and a probability flow ODE. This resolves the need for samples from the target action distribution, enabling single-step generation of actions that concentrate on high-value modes in fully online RL settings.

What carries the argument

The construction of the MeanFlow target velocity field from the Q-function using score estimation and the probability flow ODE, which allows single-step policy generation without target samples.

Load-bearing premise

The target velocity field for MeanFlow can be accurately constructed from the Q-function via score estimation and a probability flow ODE without any samples from the target action distribution.

What would settle it

If the single-step SOM policy fails to match or exceed the performance of multi-step diffusion policies on locomotion tasks while maintaining the claimed speedups, or if the constructed velocity field does not align with high-Q regions, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.23365 by Byung-Jun Lee, Donghyeon Ki, Hee-Jun Ahn, Kyungyoon Kim.

Figure 1
Figure 1. Figure 1: Unlike existing diffusion and flow matching algorithms, SOM does not require multi-step [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Training curves on locomotion benchmarks Curves denote the mean across five random seeds, with shaded regions representing the 95% confidence interval. The bottom-right panel reports a per-environment min–max normalized return averaged across the five locomotion benchmarks [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Q-value mean and std over N=1000 action samples at 50 states. Details in Appendix D. 40 30 20 10 0 10 20 30 40 20 0 20 40 Hopper-v4(SOM) 43.5 45.8 48.1 48.1 50.4 50.4 50.4 52.7 52.7 55.0 55.0 40 30 20 10 0 10 20 30 40 20 0 20 40 Hopper-v4(MFP) 43.5 45.8 48.1 48.1 50.4 50.4 50.4 52.7 52.7 55.0 55.0 40 20 0 20 40 40 30 20 10 0 10 20 30 40 Walker2d-v4(SOM) 43.4 44.7 46.0 47.2 48.5 49.8 40 20 0 20 40 40 30 20 … view at source ↗
Figure 4
Figure 4. Figure 4: Action samples from SOM (white circles) and MFP (gray triangles) at a fixed state, projected [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Generative trajectories. Arrow plots from a 7 × 7 grid of xT (black) to x0 (red) on the eight-Gaussian reward. SDAC and DACER with their full 10-step rollouts. SOM (Ours) and MFP, both 1-step by design. Details and additional results in Appendix F. is further supported by the t-SNE visualization in [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Robustness under two reward perturbations. Each side is a 6×5 grid: rows alternate Clean / Noisy training Q for SOM, SDAC, and MFP (top to bottom); columns show the Q landscape and the action distribution at sampler times t = 0.75, 0.5, 0.25, 0.0. (Left) random Gaussian noise on Q (σ= 0.20). (Right) random Gaussian noise on Q (σ= 0.30). Details and additional results in Appendix H. modes. This confirms tha… view at source ↗
Figure 7
Figure 7. Figure 7: VE-SDE Results. The forward SDE describes the process in which clean data is gradually perturbed into noise over time. Based on how the variance evolves over time, the forward SDE can be categorized into three types according to the forms of the drift term f(t) and the diffusion term g(t): the Variance Exploding (VE) SDE, the Variance Preserving (VP) SDE, and the Sub-Variance Preserving (sub-VP) SDE. In th… view at source ↗
Figure 8
Figure 8. Figure 8: Ablation results for the rescaling coefficient w. As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Ablation results for the number of Monte Carlo samples in the iDEM estimator. As shown in [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: HalfCheetah-v4 40 30 20 10 0 10 20 30 40 20 0 20 40 SOM 43.5 45.8 48.1 48.1 50.4 50.4 50.4 52.7 52.7 55.0 55.0 40 20 0 20 40 40 30 20 10 0 10 20 30 SOM 43.5 45.8 48.1 50.4 52.7 55.0 55.0 40 30 20 10 0 10 20 30 40 40 20 0 20 40 SOM 43.6 45.9 48.1 50.4 50.4 52.7 52.7 55.0 55.0 40 30 20 10 0 10 20 30 40 20 0 20 40 MFP 43.5 45.8 48.1 48.1 50.4 50.4 50.4 52.7 52.7 55.0 55.0 40 20 0 20 40 40 30 20 10 0 10 20 30… view at source ↗
Figure 11
Figure 11. Figure 11: Hopper-v4 15 [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Walker2d-v4 F 2D Bandit Tasks: Bandit, Two-Moons, and Checkerboard F.1 Eight Mode Bandit Reward function The reward is defined as a Gaussian-mixture density with K = 8 isotropic components N (µi , σ2 I2) with σ = 0.3. The component centers µi = √ 2  cos 2πi 8 ,sin 2πi 8  , i = 0, . . . , 7, are uniformly distributed on a circle of radius √ 2. To create interleaved high- and low-reward modes, we assign a… view at source ↗
Figure 13
Figure 13. Figure 13: Two-Moons Results. Arrow plots from a 7 × 7 grid of a1 (black) to a0 (red) on the two-moon reward. SDAC and DACER with their full 10-step rollouts. SOM (Ours) and MFP, both 1-step by design. As shown in [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 14
Figure 14. Figure 14 [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Score field comparison on the eight-Gaussian reward. Left: estimated score (max magnitude ≈ 10). Right: ground-truth score (max magnitude ≈ 24). Brighter regions indicate larger gradient magnitude; the eight mode centers are marked with ×. Since the reward landscape is defined as a Gaussian mixture, the corresponding diffusion-perturbed density admits a closed-form expression, making the ground-truth scor… view at source ↗
Figure 16
Figure 16. Figure 16: Gaussian Bump. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Random Gaussian Noise (σ = 2.0). I Implementation Details Note that we conducted all experiments using four NVIDIA RTX 4090 GPUs and AMD Ryzen Threadripper PRO 5995WX CPUs. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Random Gaussian Noise (σ = 3.0). 21 [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
read the original abstract

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm for online RL. It resolves the requirement for target-distribution samples in MeanFlow by constructing the target velocity field directly from the learned Q-function via score estimation combined with the probability flow ODE. This enables single-step policy generation. The abstract claims that SOM achieves state-of-the-art performance on locomotion tasks while substantially reducing both training and inference time relative to prior diffusion- and flow-matching-based policies.

Significance. If the velocity-field construction is unbiased and the single-step policy concentrates on high-value modes, the result would be significant: it would make expressive flow-based policies practical for online RL by eliminating multi-step denoising at inference time and avoiding the need for target samples during training. The approach combines value-based guidance with flow matching in a way that could generalize beyond the reported locomotion tasks, provided the non-stationary online setting does not amplify approximation errors.

major comments (2)
  1. [Abstract / central derivation] The central technical claim (abstract and method derivation) asserts that the target velocity field v_t(x) for MeanFlow can be obtained exactly from the Q-function via score estimation of ∇_x log p_t(x) and the probability flow ODE, without ever sampling from the target action distribution. This identity is load-bearing for the entire method; the provided text supplies no explicit derivation, regularity conditions, or proof that the Q-induced score equals the true score of the optimal policy, and the non-stationary online loop can feed any bias back into the actor update.
  2. [Abstract] The abstract states that SOM 'achieves state-of-the-art performance on locomotion tasks' with a single generation step, yet supplies no experimental details, baselines, metrics, or verification. Without these, the performance claim cannot be evaluated and the soundness of the velocity-field construction remains untested.
minor comments (1)
  1. [Abstract] The abstract is unusually dense with technical claims but contains no equation numbers or section references that would allow a reader to locate the score-estimation construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, providing clarifications on the technical claims and experimental presentation while indicating planned revisions.

read point-by-point responses
  1. Referee: [Abstract / central derivation] The central technical claim (abstract and method derivation) asserts that the target velocity field v_t(x) for MeanFlow can be obtained exactly from the Q-function via score estimation of ∇_x log p_t(x) and the probability flow ODE, without ever sampling from the target action distribution. This identity is load-bearing for the entire method; the provided text supplies no explicit derivation, regularity conditions, or proof that the Q-induced score equals the true score of the optimal policy, and the non-stationary online loop can feed any bias back into the actor update.

    Authors: We agree that the manuscript would benefit from an explicit step-by-step derivation of the velocity-field identity. The construction follows from substituting the score estimate ∇_x log p_t(x) ≈ ∇_x Q(x) (derived via the probability flow ODE under the optimal policy) into the MeanFlow target velocity, yielding v_t(x) without target samples. We will add a dedicated subsection in the revised Section 3 with the full derivation, including regularity conditions such as Lipschitz continuity of the Q-function and sufficient smoothness of the flow. On the non-stationary concern, the critic is updated with a slower target network to mitigate feedback of approximation errors; we will expand the discussion of this stabilization mechanism and include additional analysis of bias propagation. revision: yes

  2. Referee: [Abstract] The abstract states that SOM 'achieves state-of-the-art performance on locomotion tasks' with a single generation step, yet supplies no experimental details, baselines, metrics, or verification. Without these, the performance claim cannot be evaluated and the soundness of the velocity-field construction remains untested.

    Authors: The abstract is intentionally concise. Full experimental details—including baselines (Diffusion Policy, Flow Matching variants, SAC), metrics (normalized return, inference time), verification across MuJoCo locomotion environments with multiple random seeds, and ablation studies confirming the velocity-field construction—are reported in Section 4 and Appendix B. The single-step generation and training-time reductions are directly measured against these baselines. We will add a sentence in the abstract directing readers to the experimental section for completeness. revision: partial

Circularity Check

0 steps flagged

No circularity: velocity field derived from independent Q-function

full rationale

The provided abstract and description state that the target velocity field is constructed directly from the Q-function via score estimation and probability flow ODE, without requiring samples from the target distribution. No equations, self-citations, or fitted-parameter redefinitions are exhibited in the given text that would reduce any claimed prediction to its inputs by construction. The central step uses an externally learned critic to define the actor's target, which is a standard actor-critic separation and remains falsifiable via RL benchmark performance. This is the most common honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5685 in / 1123 out tokens · 35723 ms · 2026-05-25T05:20:03.743106+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 7 internal anchors

  1. [1]

    Is Conditional Generative Modeling all you need for Decision-Making?

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657, 2022

  2. [2]

    Iterated denoising energy matching for sampling from boltzmann densities

    Tara Akhound-Sadegh, Jarrid Rector-Brooks, Avishek Joey Bose, Sarthak Mittal, Pablo Lemos, Cheng-Hao Liu, Marcin Sendera, Siamak Ravanbakhsh, Gauthier Gidel, Yoshua Bengio, et al. Iterated denoising energy matching for sampling from boltzmann densities. InProceedings of the 41st International Conference on Machine Learning, pages 760–786, 2024

  3. [3]

    Score regularized policy optimization through diffusion behavior

    Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, and Jun Zhu. Score regularized policy optimization through diffusion behavior. InThe Twelfth International Conference on Learning Representations, 2024

  4. [4]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  5. [5]

    Diffusion-based reinforcement learning via q-weighted variational policy optimization

    Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization. Advances in Neural Information Processing Systems, 37:53945–53968, 2024

  6. [6]

    One step diffusion via shortcut models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. InThe Thirteenth International Conference on Learning Representations, 2025

  7. [7]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  8. [8]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018

  9. [9]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 6840–6851, 2020

  10. [10]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991, 2022

  11. [11]

    Prior-guided diffusion planning for offline reinforcement learning.arXiv preprint arXiv:2505.10881, 2025

    Donghyeon Ki, JunHyeok Oh, Seong-Woong Shim, and Byung-Jun Lee. Prior-guided diffusion planning for offline reinforcement learning.arXiv preprint arXiv:2505.10881, 2025

  12. [12]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023

  13. [13]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and qiang liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2023

  14. [14]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. InThe Thirteenth International Conference on Learning Representations, 2025

  15. [15]

    Efficient online reinforcement learning for diffusion policy

    Haitong Ma, Tianyi Chen, Kai Wang, Na Li, and Bo Dai. Efficient online reinforcement learning for diffusion policy. InInternational Conference on Machine Learning, pages 41837–41853. PMLR, 2025

  16. [16]

    Learning a diffusion model policy from rewards via q-score matching

    Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via q-score matching. InForty-first International Conference on Machine Learning, 2024

  17. [17]

    Diffusion Policy Policy Optimization

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024. 10

  18. [18]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022

  19. [19]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  20. [20]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  21. [21]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  22. [22]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In Proceedings of the 40th International Conference on Machine Learning, pages 32211–32252, 2023

  23. [23]

    Generative modeling by estimating gradients of the data distribution.Advances in Neural Information Processing Systems, 32, 2019

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.Advances in Neural Information Processing Systems, 32, 2019

  24. [24]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  25. [25]

    MIT press Cambridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction. MIT press Cambridge, 1998

  26. [26]

    Mujoco: A physics engine for model-based control

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

  27. [27]

    Gymnasium: A Standard Interface for Reinforcement Learning Environments

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments.arXiv preprint arXiv:2407.17032, 2024

  28. [28]

    Diffusion actor-critic with entropy regulator

    Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang WU, Jingliang Duan, and Shengbo Eben Li. Diffusion actor-critic with entropy regulator. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  29. [29]

    Diffusion policies as an expressive policy class for offline reinforcement learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. InThe Eleventh International Conference on Learning Representations, 2023

  30. [30]

    Policy representation via diffusion probability model for reinforcement learning

    Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning.arXiv preprint arXiv:2305.13122, 2023

  31. [31]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Free- man, and Taesung Park. One-step diffusion with distribution matching distillation. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  32. [32]

    Mean flow policy with instantaneous velocity constraint for one-step action generation

    Guojian Zhan, Letian Tao, Pengcheng Wang, Yixiao Wang, Yuxin Chen, Yiheng Li, Hongyang Li, Masayoshi Tomizuka, and Shengbo Eben Li. Mean flow policy with instantaneous velocity constraint for one-step action generation. InThe Fourteenth International Conference on Learning Representations, 2026. 11 A Algorithm Pseudocode Algorithm 1Score-Based One-step Me...

  33. [33]

    The final reward is given by the normalized mixture density, producing a smooth multimodal reward landscape with values in[0,1]

    To create interleaved high- and low-reward modes, we assign alternating mixture weights wi = 2 for even i and wi = 1 for odd i. The final reward is given by the normalized mixture density, producing a smooth multimodal reward landscape with values in[0,1]. F.2 Two-Moons SOM SDAC (10-step) DACER (10-step) MFP Figure 13:Two-Moons Results.Arrow plots from a ...