pith. machine review for the scientific record.

arxiv: 2604.10962 · v1 · submitted 2026-04-13 · 💻 cs.RO

Recognition: unknown

ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3

classification 💻 cs.RO
keywords flow matching · reinforcement learning · score function · robotic control · distributional control · policy fine-tuning · stochastic differential equations · imitation learning

The pith

ScoRe-Flow achieves complete distributional control in flow matching by using a closed-form score to modulate drift during RL fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow matching policies generate actions efficiently for robots yet remain capped by the quality of available demonstrations. Reinforcement learning fine-tuning can push past those limits, but prior approaches rely on noise injection that can slow training when demonstrations already supply useful priors. ScoRe-Flow instead modulates the drift term of the stochastic process with the score function, the gradient of the log-density, which admits an exact closed-form expression from the original velocity field and needs no extra network. Pairing this drift modulation with separate learned variance prediction gives independent control over the mean and variance of each transition. On standard robotic benchmarks the method reaches higher success rates and converges faster than existing flow-based reinforcement learning techniques.

Core claim

The score function can be obtained in closed form directly from the velocity field of a flow matching model. Inserting this score into the drift of the equivalent stochastic differential equation, while predicting variance separately, produces a policy whose stochastic transitions have independently controllable mean and variance. This complete distributional control improves exploration and training stability when fine-tuning flow matching policies with reinforcement learning on robotic tasks.
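To pin down what "closed form" means here, the following is a hedged reconstruction under common flow-matching conventions (linear interpolation path, standard Gaussian base at t = 0), not necessarily the authors' exact parameterization:

```latex
% Probability-flow ODE of the trained policy: dx = v_\theta(x,t)\,dt,
% with marginals p_t. Score-based generative modeling (Song et al., 2021)
% gives a family of SDEs that share those marginals:
\[
  \mathrm{d}x = \Big[\, v_\theta(x,t) + \tfrac{\sigma(t)^2}{2}\,
      \nabla_x \log p_t(x) \Big]\,\mathrm{d}t + \sigma(t)\,\mathrm{d}W_t .
\]
% Under the rectified-flow path x_t = (1-t)\,x_0 + t\,x_1 with
% x_0 \sim \mathcal{N}(0, I), the marginal score follows from the
% velocity field with no auxiliary network:
\[
  \nabla_x \log p_t(x) = \frac{t\, v_\theta(x,t) - x}{1 - t},
  \qquad t \in [0, 1).
\]
```

The drift sets the mean of each denoising transition and the noise scale sets its spread; predicting the latter with a separate head is what the paper calls decoupled mean-variance control.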

What carries the argument

Closed-form score function derived from the velocity field that modulates the drift term, combined with separate variance prediction for decoupled mean-variance control.
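As a concrete illustration of how such a sampler could be wired up, here is a minimal PyTorch sketch under the rectified-flow convention above. `velocity_net`, `variance_head`, and the fixed modulation weight `alpha` are hypothetical stand-ins, not the paper's implementation (Figure 9 suggests the paper learns a scheduler αψ(t) instead of a constant).

```python
import torch

def closed_form_score(v, x, t, eps=1e-4):
    # Marginal score recovered from the velocity field, assuming the
    # rectified-flow path x_t = (1 - t) * x0 + t * x1, x0 ~ N(0, I).
    # The convention is assumed here, not taken from the paper.
    return (t * v - x) / (1.0 - t + eps)

@torch.no_grad()
def sample_action(velocity_net, variance_head, obs, act_dim, steps=4, alpha=1.0):
    # Euler-Maruyama rollout of the score-modulated SDE
    #   dx = [v + alpha * (sigma^2 / 2) * score] dt + sigma dW,
    # with sigma produced by a separate learned head (decoupled
    # mean/variance control). Module names are illustrative stand-ins.
    batch = obs.shape[0]
    x = torch.randn(batch, act_dim)                  # x_0 ~ N(0, I)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((batch, 1), k * dt)
        v = velocity_net(x, t, obs)                  # drift from the FM backbone
        sigma = variance_head(x, t, obs)             # learned std, shape (batch, 1)
        score = closed_form_score(v, x, t)
        drift = v + alpha * 0.5 * sigma ** 2 * score
        x = x + drift * dt + sigma * dt ** 0.5 * torch.randn_like(x)
    return x
```

Because each Euler step is a Gaussian transition with known mean and variance, per-step log-probabilities, and hence PPO-style likelihood ratios, stay tractable, which is the property the RL fine-tuning relies on.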

If this is right

  • ScoRe-Flow reaches 2.4 times faster convergence than prior flow-based methods on D4RL locomotion tasks.
  • The approach yields up to 5.4 percent higher success rates on Robomimic and Franka Kitchen manipulation benchmarks.
  • No auxiliary network is required to obtain the score, preserving the efficiency of the original flow matching backbone.
  • Likelihoods remain tractable while exploration is steered toward high-density regions of the state-action space.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same closed-form score construction could be applied to fine-tune other continuous generative models used in robotics beyond flow matching.
  • Decoupled mean-variance control may reduce the amount of demonstration data needed before reinforcement learning begins.
  • The method supplies a concrete route to add calibrated uncertainty to flow-based planners without retraining the entire model.
  • Direct experiments on multi-task or sim-to-real robotic settings would test whether the stability gains hold when task distributions shift.

Load-bearing premise

Modulating the drift term via the closed-form score function steers exploration toward high-probability regions and improves training stability without introducing new instabilities or bias.
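The standard "no bias" argument is a one-line Fokker-Planck check, sketched here under the same hedged conventions as above, with σ depending only on t:

```latex
% The ODE marginals obey the continuity equation
%   \partial_t p_t = -\nabla \cdot (v_\theta\, p_t).
% For the score-modulated SDE, the Fokker-Planck equation is
\[
  \partial_t p_t
  = -\nabla \cdot\!\Big( \big[ v_\theta + \tfrac{\sigma^2}{2}\nabla \log p_t \big]\, p_t \Big)
    + \tfrac{\sigma^2}{2}\, \Delta p_t
  = -\nabla \cdot ( v_\theta\, p_t ),
\]
% since p_t \nabla \log p_t = \nabla p_t: the added score drift exactly
% cancels the diffusion term, so the marginals match the original ODE's.
```

Note the cancellation requires the noise scale to be state-independent at each t; a learned, state-dependent variance head would in general perturb the marginals, which is exactly the gap the referee report below asks the authors to close.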

What would settle it

If replacing the score-modulated drift with pure noise injection on the same D4RL locomotion tasks produced equal or faster convergence and no drop in final performance, the claimed advantage of drift modulation would be falsified.
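In the sampler sketch above, this test is a one-flag toggle: alpha = 0.0 reduces the drift to the plain velocity field, leaving noise injection as the only stochasticity. A minimal seed-matched harness (train_and_eval is a hypothetical driver, not anything from the paper's code) might look like:

```python
import numpy as np

def compare_drift_variants(train_and_eval, env_names, seeds=range(5)):
    # train_and_eval(env, seed, alpha) -> steps until a success-rate
    # threshold is reached (hypothetical driver around the RL loop;
    # alpha=0.0 disables score modulation, i.e., pure noise injection).
    results = {}
    for env in env_names:
        scored = [train_and_eval(env, s, alpha=1.0) for s in seeds]
        noise  = [train_and_eval(env, s, alpha=0.0) for s in seeds]
        # Ratio > 1.0 supports the paper's claim; <= 1.0 would falsify it.
        results[env] = np.mean(noise) / np.mean(scored)
    return results
```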

Figures

Figures reproduced from arXiv: 2604.10962 by Cheng Zhuo, Guohao Dai, Jinhao Li, Lukai Chen, Qi Sun, Xiaotian Qiu.

Figure 1
Figure 1. Comparison of flow policy sampling strategies. Top: Deterministic FM follows a fixed ODE trajectory with no exploration. Middle: Noise-only control (e.g., ReinFlow) injects learnable noise for exploration, but only perturbs the position without modulating dynamics. Bottom: ScoRe-Flow combines score-based drift modulation with learned variance prediction, achieving decoupled mean-variance control. view at source ↗
Figure 2
Figure 2. Learning curves on D4RL locomotion tasks. Dashed lines indicate the behavior cloning level. [panels (a) PickPlaceCan, (b) Square, (c) Transport; axes: success rate vs. samples; methods: ScoRe-Flow (ours), Score-SDE (ours), ReinFlow-S, ReinFlow-R, DPPO, Gaussian…] view at source ↗
Figure 3
Figure 3. Learning curves on Robomimic visual manipulation tasks. view at source ↗
Figure 4
Figure 4. Learning curves on Franka Kitchen multi-task benchmark. view at source ↗
Figure 5
Figure 5. Wall-clock time comparison on D4RL locomotion tasks. The large speedup over DPPO (up to 94.2× on Hopper-v2, 21.9× average) is primarily a structural advantage of flow-based methods requiring fewer denoising steps. The algorithmic contribution is the 2.4× faster convergence over the flow-based baseline ReinFlow at matched K = 4 steps. view at source ↗
Figure 6
Figure 6. Training stability comparison on Kitchen-Complete-v0. Under the same Rectified Flow base policy, ScoRe-Flow achieves stable learning while ReinFlow exhibits high variance and unstable convergence behavior. view at source ↗
Figure 7
Figure 7. Sensitivity of Score-SDE to initial noise variance. Performance varies dramatically with different fixed σ values, demonstrating that score-only methods require careful task-specific tuning. view at source ↗
Figure 8
Figure 8. Ablation study on denoising steps (K). We evaluate the success rate on the Robomimic Square task across varying inference steps (K ∈ {1, 2, 4}). The results demonstrate that the success rate reaches its maximum at K = 2, outperforming both the 4-step and 1-step settings, thus identifying K = 2 as the optimal inference configuration. view at source ↗
Figure 9
Figure 9. Learned αψ(t) scheduler behavior. Left: αψ as a function of denoising time t at different training stages. The scheduler applies stronger correction early in denoising and reduces near t = 1. Right: Overall αψ magnitude decreases over training as vθ improves. view at source ↗
original abstract

Flow Matching (FM) policies have emerged as an efficient backbone for robotic control, offering fast and expressive action generation that underpins recent large-scale embodied AI systems. However, FM policies trained via imitation learning inherit the limitations of demonstration data; surpassing suboptimal behaviors requires reinforcement learning (RL) fine-tuning. Recent methods convert deterministic flows into stochastic differential equations (SDEs) with learnable noise injection, enabling exploration and tractable likelihoods, but such noise-only control can compromise training efficiency when demonstrations already provide strong priors. We observe that modulating the drift via the score function, i.e., the gradient of log-density, steers exploration toward high-probability regions, improving stability. The score admits a closed-form expression from the velocity field, requiring no auxiliary networks. Based on this, we propose ScoRe-Flow, a score-based RL fine-tuning method that combines drift modulation with learned variance prediction to achieve decoupled control over the mean and variance of stochastic transitions. Experiments demonstrate that ScoRe-Flow achieves 2.4x faster convergence than flow-based SOTA on D4RL locomotion tasks and up to 5.4% higher success rates on Robomimic and Franka Kitchen manipulation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ScoRe-Flow, a score-based RL fine-tuning method for flow matching policies in robotic control. It converts deterministic FM policies to SDEs, modulates the drift term using a closed-form score (gradient of log-density) derived directly from the velocity field without auxiliary networks, and pairs this with learned variance prediction to achieve decoupled mean/variance control over stochastic transitions. The approach is claimed to improve exploration stability and yield 2.4x faster convergence on D4RL locomotion tasks plus up to 5.4% higher success rates on Robomimic and Franka Kitchen manipulation tasks compared to flow-based SOTA.

Significance. If the central technical claims hold—specifically that drift modulation via the closed-form score preserves tractable unbiased likelihoods for RL gradients and reliably steers without new instabilities—the work offers a practical advance in fine-tuning expressive flow-based policies for robotics. The absence of extra networks for scoring and the decoupled control are notable strengths that could benefit large-scale embodied AI systems, provided the quantitative gains are robustly supported.

major comments (2)
  1. [Method / Theoretical Analysis] The derivation that modulating the SDE drift with the closed-form score from the velocity field preserves the original probability-flow density (or yields an equivalent Fokker-Planck equation) and maintains unbiased likelihoods for policy gradients must be provided explicitly. The central claim of decoupled distributional control and improved stability rests on this; without it, the RL objective may be biased as noted in the stress-test concern.
  2. [Experiments] Table or section reporting the D4RL and manipulation results: the 2.4x convergence and 5.4% success improvements require accompanying details on baselines, number of seeds, statistical tests, and ablations (e.g., drift modulation vs. variance-only) to substantiate attribution to the score-based component rather than other factors.
minor comments (2)
  1. Clarify notation for the modulated SDE (e.g., how the score term is inserted into the drift) and ensure all equations are numbered for cross-reference.
  2. [Abstract] The abstract could more precisely state the key assumption that the closed-form score relation extends to the modulated process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of ScoRe-Flow for fine-tuning flow-based policies in robotics. We address each major comment below with clarifications and commit to revisions that strengthen the theoretical and empirical support without altering the core contributions.

point-by-point responses
  1. Referee: [Method / Theoretical Analysis] The derivation that modulating the SDE drift with the closed-form score from the velocity field preserves the original probability-flow density (or yields an equivalent Fokker-Planck equation) and maintains unbiased likelihoods for policy gradients must be provided explicitly. The central claim of decoupled distributional control and improved stability rests on this; without it, the RL objective may be biased as noted in the stress-test concern.

    Authors: We agree that an explicit derivation is essential to substantiate the claims. The manuscript already states that the score is obtained in closed form from the velocity field (which defines the probability-flow ODE), but we will expand the Methods section with a new subsection and add a dedicated appendix containing the full step-by-step derivation. This will show that the modulated drift term produces an SDE whose Fokker-Planck equation is equivalent to the original flow-matching marginals, thereby preserving the density and ensuring the likelihoods entering the RL objective remain unbiased. We will also include a brief discussion of why this construction avoids the bias concerns raised in the stress-test. revision: yes

  2. Referee: [Experiments] Table or section reporting the D4RL and manipulation results: the 2.4x convergence and 5.4% success improvements require accompanying details on baselines, number of seeds, statistical tests, and ablations (e.g., drift modulation vs. variance-only) to substantiate attribution to the score-based component rather than other factors.

    Authors: We concur that more granular experimental reporting is required. The current results section presents the headline metrics, but we will revise it to include: (i) a complete table listing all baselines with citations, (ii) performance averaged over five independent random seeds together with standard deviations, (iii) statistical significance tests (paired t-tests or Wilcoxon rank-sum) comparing ScoRe-Flow against the strongest baselines, and (iv) an explicit ablation study isolating drift modulation from variance-only control. These additions will appear in updated tables, figures, and an expanded experimental analysis subsection. revision: yes
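As a concrete instance of point (iii), the promised tests reduce to a few lines once per-seed results are paired; the numbers below are placeholders, not reported results:

```python
import numpy as np
from scipy import stats

# Hypothetical final success rates over five paired seeds.
score_flow = np.array([0.91, 0.88, 0.93, 0.90, 0.89])
baseline   = np.array([0.85, 0.86, 0.88, 0.84, 0.87])

# Paired t-test and Wilcoxon signed-rank test (the paired analogue of
# the rank-sum test the rebuttal mentions).
t_stat, t_p = stats.ttest_rel(score_flow, baseline)
w_stat, w_p = stats.wilcoxon(score_flow, baseline)
print(f"paired t: p={t_p:.3f}; wilcoxon signed-rank: p={w_p:.3f}")
```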

Circularity Check

0 steps flagged

No circularity detected; derivation relies on standard mathematical identities in flow matching

full rationale

The paper's key step is the claim that the score function admits a closed-form expression directly from the velocity field of the base flow-matching model, with no auxiliary networks or fitted parameters. This is a standard property of probability-flow ODEs and conditional flow matching (not a self-definition or fitted input renamed as prediction). The subsequent drift modulation plus learned variance is then used for RL fine-tuning with tractable likelihoods asserted from the construction. No load-bearing self-citation chains, uniqueness theorems imported from the same authors, or ansatz smuggling appear in the abstract or described method. The reported gains are empirical on external benchmarks (D4RL, Robomimic, Franka Kitchen) rather than tautological. The derivation chain therefore rests on established flow-matching and RL machinery rather than on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard assumptions of flow matching and stochastic differential equations plus the novel but unproven claim that score-based drift modulation improves stability; no new entities are postulated and the score derivation is presented as closed-form.

free parameters (1)
  • learned variance predictor
    A separate network or head is trained to predict variance for stochastic transitions; its parameters are fitted during RL.
axioms (2)
  • domain assumption The score function admits a closed-form expression from the velocity field of the flow matching model
    Invoked to justify that no auxiliary networks are needed for score computation.
  • domain assumption Converting deterministic flows into SDEs with learnable noise enables tractable likelihoods and exploration
    Stated as the basis for recent flow-based RL methods that ScoRe-Flow builds upon.

pith-pipeline@v0.9.0 · 5526 in / 1535 out tokens · 76925 ms · 2026-05-10T15:58:01.826056+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 26 canonical work pages · 14 internal anchors

  1. [1]

    Albergo, M. S. and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

  2. [2]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    Amin, A., Aniceto, R., Balakrishna, A., Black, K., Conley, K., Connors, G., Darpinian, J., Dhabalia, K., DiCarlo, J., Driess, D., et al. π0.6: A VLA that learns from experience. arXiv preprint arXiv:2511.14759.

  3. [3]

    Training Diffusion Models with Reinforcement Learning

    Black, K., Janner, M., Du, Y., Kostrikov, I., and Levine, S. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301.

  4. [4]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.

  5. [5]

    $\pi$RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

    Braun, M., Jaquier, N., Rozo, L., and Asfour, T. Riemannian flow matching policy for robot motion learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5144–5151. Chen, K., Liu, Z., Zhang, T., Guo, Z., Xu, S., Lin, H., Zang, H., Zhang, Q., Yu, Z., Fan, G., et al. πRL: Online RL fine-tuning for flow-based vision-language-action models. arXiv preprint arXiv:2510.25889, 2025.

  6. [6]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.

  7. [7]

    Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning

    Gupta, A., Kumar, V., Lynch, C., Levine, S., and Hausman, K. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. arXiv preprint arXiv:1910.11956.

  8. [8]

    IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

    Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573.

  9. [9]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.

  10. [10]

    Planning with Diffusion for Flexible Behavior Synthesis

    Janner, M., Du, Y., Tenenbaum, J. B., and Levine, S. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991.

  11. [11]

    Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  12. [12]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

  13. [13]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

  14. [14]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Mandlekar, A., Xu, D., Wong, J., Nasiriany, S., Wang, C., Kulkarni, R., Fei-Fei, L., Savarese, S., Zhu, Y., and Martín-Martín, R. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298.

  15. [15]

    Flow Q-Learning

    Park, S., Li, Q., and Levine, S. Flow Q-learning. arXiv preprint arXiv:2502.02538.

  16. [16]

    Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations

    Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., and Levine, S. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087.

  17. [17]

    Diffusion policy policy optimization

    Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., Burchfiel, B., Dai, H., and Simchowitz, M. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588.

  18. [18]

    Test-Time Scaling of Diffusions with Flow Maps

    Sabour, A., Albergo, M. S., Domingo-Enrich, C., Boffi, N. M., Fidler, S., Kreis, K., and Vanden-Eijnden, E. Test-time scaling of diffusions with flow maps. arXiv preprint arXiv:2511.22688.

  19. [19]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

  20. [20]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  21. [21]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.

  22. [22]

    Coefficients-preserving sampling for reinforcement learning with flow matching

    Wang, F. and Yu, Z. Coefficients-preserving sampling for reinforcement learning with flow matching. arXiv preprint arXiv:2509.05952.

  23. [23]

    Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models

    Yu, B., Liu, J., and Cui, J. Smart-GRPO: Smartly sampling noise for efficient RL of flow-matching models. arXiv preprint arXiv:2510.02654.

  24. [24]

    Affordance-based robot manipulation with flow matching

    Zhang, F. and Gienger, M. Affordance-based robot manipulation with flow matching. arXiv preprint arXiv:2409.01083.

  25. [25]

    Energy-Weighted Flow Matching for Offline Reinforcement Learning

    Zhang, S., Zhang, W., and Gu, Q. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975, 2025a. Zhang, T., Yu, C., Su, S., and Wang, Y. ReinFlow: Fine-tuning flow matching policy with online reinforcement learning. arXiv preprint arXiv:2505.22094, 2025b. Zhong, S., Ding, S., Diao, H., Wang, X., Teh, K. C., and Pe...

  26. [26]

    FlowRL: Matching Reward Distributions for LLM Reasoning

    Zhu, X., Cheng, D., Zhang, D., Li, H., Zhang, K., Jiang, C., Sun, Y., Hua, E., Zuo, Y., Lv, X., et al. FlowRL: Matching reward distributions for LLM reasoning. arXiv preprint arXiv:2509.15207.
