pith. machine review for the scientific record.

arxiv: 2605.13435 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: unknown

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · flow-based policies · offline RL · policy optimization · generative models · value propagation

The pith

Q-Flow stabilizes training of expressive flow-based policies in reinforcement learning by propagating terminal values backward along deterministic flow paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Flow-based models promise high expressivity as RL policies but suffer instability when gradients must flow through numerical ODE solvers during optimization. Q-Flow resolves this trade-off by using the deterministic flow to carry terminal trajectory values directly to intermediate latent states, supplying reliable intermediate-value gradients for policy updates. This avoids solver unrolling while preserving full representational capacity. The method is demonstrated in offline RL on the OGBench suite, where it improves over prior baselines and supports seamless online adaptation within the same framework.
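A minimal sketch of that mechanism, written in a PyTorch style. The names (velocity_field, V_omega, Q_target), the Euler integrator, and the loss forms are editorial guesses at the machinery the pith describes, not the paper's implementation.

```python
import torch
import torch.nn as nn

state_dim, action_dim, hidden = 4, 2, 64

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU(),
                         nn.Linear(hidden, out_dim))

velocity_field = mlp(state_dim + action_dim + 1, action_dim)  # v_theta(x_tau, tau | s)
V_omega = mlp(state_dim + action_dim + 1, 1)                  # intermediate value V(s, x_tau, tau)
Q_target = mlp(state_dim + action_dim, 1)                     # stand-in for a frozen target critic

def integrate_flow(s, x_tau, tau, steps=10):
    """Euler-integrate the learned flow from time tau to 1 to obtain the terminal action."""
    x, t = x_tau, tau
    dt = (1.0 - tau) / steps
    for _ in range(steps):
        x = x + dt * velocity_field(torch.cat([s, x, t], dim=-1))
        t = t + dt
    return x

s = torch.randn(32, state_dim)       # batch of states
x_tau = torch.randn(32, action_dim)  # intermediate latent actions
tau = torch.rand(32, 1)              # flow times in [0, 1)

# 1) Value propagation: the terminal value Q(s, x_1) is assigned to the intermediate latent
#    x_tau that deterministically flows into x_1; no gradient passes through the solver.
with torch.no_grad():
    x_1 = integrate_flow(s, x_tau, tau)
    propagated_value = Q_target(torch.cat([s, x_1], dim=-1))
value_loss = ((V_omega(torch.cat([s, x_tau, tau], dim=-1)) - propagated_value) ** 2).mean()

# 2) Policy improvement: differentiate the intermediate value at a latent produced by a single
#    local flow step, rather than backpropagating through the full unrolled solver.
dt = 0.1
x_next = x_tau + dt * velocity_field(torch.cat([s, x_tau, tau], dim=-1))
policy_loss = -V_omega(torch.cat([s, x_next, tau + dt], dim=-1)).mean()
```

In this sketch the solver appears only inside torch.no_grad(), so the expressive multi-step flow is kept while the policy gradient path stays one step deep.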

Core claim

Q-Flow leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, bridging the gap between stability and expressivity.

What carries the argument

Value propagation along the policy-induced flow that supplies intermediate gradients for direct policy updates.

If this is right

  • Policy expressivity no longer needs to be sacrificed for training stability.
  • The same framework supports both offline pretraining and online adaptation without separate mechanisms.
  • Value gradients become available at every latent state along a trajectory without additional computational cost from solver differentiation.
  • Flow-based policies can be optimized end-to-end using standard RL objectives once terminal values are propagated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same propagation idea could apply to other continuous normalizing flow or diffusion policies that rely on ODE integration.
  • Removing the need to differentiate through solvers may lower memory usage during long-horizon rollouts; a sketch of why appears after this list.
  • If the flow is learned jointly with the value function, the method might produce more consistent latent trajectories than separate matching losses.
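
The second bullet can be made concrete with a toy Euler unroll; v_theta below is a placeholder linear layer, not the paper's policy network. With keep_graph=True every solver step's activations are retained by autograd, so memory grows with the number of steps; detaching at each step keeps only the final step in the graph, which is the regime value propagation works in.

```python
import torch
import torch.nn as nn

v_theta = nn.Linear(3, 2)  # placeholder velocity field v(x, tau) for 2-D actions

def euler(x, steps=100, keep_graph=True):
    """Integrate x from tau=0 to tau=1 with a fixed-step Euler scheme."""
    dt = 1.0 / steps
    t = torch.zeros(x.shape[0], 1)
    for _ in range(steps):
        if not keep_graph:
            x = x.detach()  # drop history: only the current step stays in the autograd graph
        x = x + dt * v_theta(torch.cat([x, t], dim=-1))
        t = t + dt
    return x

x0 = torch.randn(8, 2)
bptt_action = euler(x0, keep_graph=True)    # graph (and memory) grows linearly with `steps`
local_action = euler(x0, keep_graph=False)  # constant-size graph, as with propagated values
```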

Load-bearing premise

Propagating terminal trajectory values along the flow yields unbiased and stable gradients for policy optimization.
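
In the notation of the figure captions (V^π_ω for the intermediate value, Ψ^π for the policy-induced flow map, Q_φ̄ for the target critic), the premise can be written roughly as follows; this is an editorial reconstruction and the paper's exact objective may differ.

```latex
% Editorial reconstruction from the figure-caption notation; not the paper's exact losses.
V^{\pi}_{\omega}(s, x_\tau, \tau) \;\approx\; Q_{\bar{\phi}}\!\bigl(s,\, \Psi^{\pi}_{1,\tau}(x_\tau, s)\bigr),
\qquad
\Psi^{\pi}_{1,\tau}(x_\tau, s) \;=\; x_\tau + \int_{\tau}^{1} v_\theta(x_t, t \mid s)\, dt,
\qquad
\bigl|\, V^{\pi}_{\omega}\bigl(s, \Psi^{\pi}_{1,\tau}(x_\tau, s), 1\bigr) - V^{\pi}_{\omega}(s, x_\tau, \tau) \,\bigr| \;\approx\; 0 .
```

If the regression is exact, the gradient of V^π_ω at an intermediate latent coincides with the gradient of the terminal value composed with the flow map, which is what the premise needs for stable updates without solver unrolling.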

What would settle it

Training runs on OGBench tasks that show persistent instability or lower returns when using Q-Flow compared with restricted flow policies that avoid solver backpropagation.

Figures

Figures reproduced from arXiv: 2605.13435 by Byeongguk Jeon, JaeHyeok Doo, Kimin Lee, Minjoon Seo, Seonghyeon Ye.

Figure 1. Visualization of 2D datasets, Swiss roll (left) and Two spirals (right). The color indicates the reward of each sample, where the reward increases from dark blue to light green.
Figure 2. Comparison of flow-based offline RL methods that utilize gradient-based policy optimization in 2D examples. Results are shown for the Swiss roll (left two columns) and two spirals (right two columns) environments. Strong BC refers to strong BC regularization, and Weak BC refers to weak BC regularization.
Figure 3. 2D experiment results with Q-Flow. Q-Flow preserves full expressivity while enabling stable policy optimization.
Figure 4. Sample (top) and gradient field (bottom) evolution over the V^π_ω value landscape in the 2D Swiss roll.
Figure 5. Flow-consistency of intermediate value. We measure the absolute difference between the terminal value and the intermediate value along the policy-induced flow in the 2D Swiss roll environment.
Figure 6. Offline-to-online RL results on the default task in 5 OGBench tasks. Q-Flow consistently outperforms flow-based baselines, demonstrating superior adaptability and stable improvement during online fine-tuning. Results are averaged over 8 seeds, with shaded area indicating 95% bootstrap confidence interval. (Panels: AM-Giant, HM-Medium, Antsoccer, Cube-Double, Puzzle-4x4, Overall; y-axis: Success Rate (%).)
Figure 7. Component ablation study on default tasks of 5 OGBench environments. For both studies, we include FBRAC as the default baseline.
Figure 9. Training cost comparison. We report training step time (ms/step) of flow-based methods in Puzzle-4x4 with different numbers of flow steps.
Figure 10. Full 2D toy experiment results. Qualitative comparison of generated samples. Q-Flow consistently captures the multi-modal structure of the target distributions, whereas baselines suffer from mode collapse or divergence. (Panels: Swiss Roll, Two Spirals, 8 Gaussians, Moons.)
Figure 11. Intermediate value landscapes. Visualization of the intermediate value function V^π_ω(s, x_τ, τ) of Q-Flow with λ = 1 across flow time τ in each 2D distribution, evolving from left (τ = 0) to right (τ = 1).
Figure 12. Policy gradient norm over offline RL training across different BC/guidance coefficients (α/λ). BPTT leads to severe optimization instability as BC regularization strength weakens.
Figure 13. Full training curves of Q-Flow in OGBench under the standard setting.
Figure 14. Full training curves of Q-Flow in OGBench under the advanced setting. (Panels include Umaze-default, Umaze-diverse, Medium-play, Medium-diverse, and Large-play; x-axis: Steps (M), y-axis: Score.)
Figure 16. Ablation studies on the number of flow steps for the policy network and the flow time embedding type for the intermediate value network.
Figure 17. Absolute value difference across flow timesteps along policy-generated trajectories in each OGBench environment. The plotted quantity is |V^π_ω(s, Ψ^π_{1,τ}(x_τ, s), 1) − V^π_ω(s, x_τ, τ)|.
Original abstract

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Q-Flow, a framework for reinforcement learning with flow-based policies. It claims that the deterministic nature of flow dynamics allows explicit propagation of terminal trajectory values to intermediate latent states along the policy-induced flow, enabling stable policy optimization via intermediate value gradients without unrolling the numerical solver. This is positioned as resolving the stability-expressivity trade-off in flow policies. The method is evaluated in the offline setting on the OGBench suite, where it reports consistent outperformance of state-of-the-art baselines by an average of 10.6 percentage points, and is shown to support stable online adaptation within the same framework.

Significance. If the central claim holds—that flow-based value propagation yields unbiased gradients for policy optimization without solver unrolling or bias from the flow-matching approximation—it would meaningfully advance expressive flow policies in RL by removing the need to restrict capacity for stability. The reported gains on the challenging OGBench benchmark provide initial evidence of practical utility in offline RL.

major comments (2)
  1. [Abstract and §5 (Experiments)] The reported average 10.6-percentage-point gains on OGBench are presented without details on experimental controls, variance across seeds, number of runs, or potential confounding factors such as hyperparameter tuning differences; this leaves the empirical support for the central stability claim difficult to verify.
  2. [§3.2 (Value Propagation)] The derivation assumes that integrating the learned flow exactly recovers trajectories whose terminal states match the policy-induced distribution, allowing direct assignment of terminal values to intermediate latents. Flow matching, however, only regresses a velocity field toward a conditional expectation, so local approximation errors integrate into path deviations that bias the propagated values used for gradients, and the method provides no adjoint correction because it avoids unrolling.
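
For reference, the standard conditional flow matching objective this comment refers to, written for a linear interpolation path; the paper's L_CFM regularizer may differ in its conditioning or path choice.

```latex
% Standard conditional flow matching with a linear interpolation path (editorial sketch).
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{\tau \sim \mathcal{U}(0,1),\; x_0 \sim p_0,\; (s,\, a = x_1) \sim \mathcal{D}}
    \bigl\| v_\theta(x_\tau, \tau \mid s) - (x_1 - x_0) \bigr\|^{2},
\qquad
x_\tau = (1 - \tau)\, x_0 + \tau\, x_1 .
```

The learned field matches the conditional expectation of x_1 − x_0 only up to training error, and any residual error is integrated along the path, which is the source of the bias this comment raises.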
minor comments (2)
  1. [§3] Notation for the flow velocity field and the intermediate value function could be clarified with an explicit equation linking the propagated value to the policy objective.
  2. [§5] Figure captions in the experimental section should include error bars or standard deviations to support the reported performance margins.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Abstract and §5 (Experiments)] The reported average 10.6-percentage-point gains on OGBench are presented without details on experimental controls, variance across seeds, number of runs, or potential confounding factors such as hyperparameter tuning differences; this leaves the empirical support for the central stability claim difficult to verify.

    Authors: We agree that additional experimental details are necessary for verifying the reported gains and the stability claim. In the revised manuscript, we will expand §5 to include: results averaged over 5 independent random seeds with standard deviations, the total number of evaluation runs per method, and a description of the hyperparameter tuning protocol (including ranges searched and selection criteria). We will also discuss potential confounding factors such as implementation differences with baselines. These additions will make the empirical support more transparent and reproducible. revision: yes

  2. Referee: [§3.2 (Value Propagation)] The derivation assumes that integrating the learned flow exactly recovers trajectories whose terminal states match the policy-induced distribution, allowing direct assignment of terminal values to intermediate latents. Flow matching, however, only regresses a velocity field toward a conditional expectation, so local approximation errors integrate into path deviations that bias the propagated values used for gradients, and the method provides no adjoint correction because it avoids unrolling.

    Authors: We appreciate this observation on the approximation inherent to flow matching. The propagation step integrates the learned velocity field exactly to obtain trajectories under the flow model itself; terminal values are then assigned consistently with these model-defined paths rather than with an idealized exact distribution. This ensures the resulting gradients are unbiased with respect to the parameterized flow policy, which is the object being optimized. The avoidance of solver unrolling eliminates compounding numerical instabilities that arise in backpropagation through ODE solvers. We will add a clarifying paragraph in §3.2 acknowledging the approximation error, explaining the consistency argument above, and noting that empirical stability on OGBench supports the practical utility of the approach. A full bias analysis relative to the true data distribution would be a valuable future direction but is outside the current scope. revision: partial
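
The consistency argument in this response can be stated as a transport condition: if the intermediate value is defined by carrying the terminal value backward along the model's own flow, it is constant along model-defined paths. The equation below is an editorial formalization, not text from the paper.

```latex
% Editorial formalization of "values assigned consistently with model-defined paths".
\frac{d}{d\tau} V^{\pi}_{\omega}(s, x_\tau, \tau)
  \;=\; \partial_\tau V^{\pi}_{\omega} + v_\theta(x_\tau, \tau \mid s) \cdot \nabla_{x} V^{\pi}_{\omega}
  \;=\; 0
\quad \text{along} \quad \frac{d x_\tau}{d\tau} = v_\theta(x_\tau, \tau \mid s).
```

Gradients taken from such a value function are consistent with the parameterized flow policy itself; the referee's remaining concern, which the response partially concedes, is that they may still be biased relative to an idealized flow that exactly matches the data velocity field.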

Circularity Check

0 steps flagged

No significant circularity in Q-Flow derivation

full rationale

The paper grounds its central claim in the deterministic nature of flow dynamics, which permits explicit propagation of terminal trajectory values to intermediate latent states along the policy-induced flow. This enables stable policy optimization via intermediate value gradients without solver unrolling. No load-bearing steps in the provided text reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The formulation follows directly from standard properties of deterministic ODEs in flow matching, which are external mathematical facts rather than paper-internal assumptions. Empirical results on OGBench supply independent validation. The derivation rests on external mathematics and is checked against external benchmarks rather than on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the deterministic property of flow dynamics being sufficient for accurate value propagation; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: Flow dynamics are deterministic
    Invoked to enable explicit propagation of terminal value to intermediate states without unrolling solvers.

pith-pipeline@v0.9.0 · 5486 in / 1161 out tokens · 44857 ms · 2026-05-14T19:22:59.970072+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 11 canonical work pages · 8 internal anchors

  1. Frans, K., Park, S., Abbeel, P., and Levine, S. Diffusion guidance is a controllable policy improvement operator. arXiv preprint arXiv:2505.23458, 2025.
  2. Fu, J., Kumar, A., Nachum, O., Tucker, G., and Levine, S. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  3. Hansen-Estruch, P., Kostrikov, I., Janner, M., Kuba, J. G., and Levine, S. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023.
  4. He, L., Shen, L., and Wang, X. AlignIQL: Policy alignment in implicit Q-learning through constrained optimization. arXiv preprint arXiv:2405.18187, 2024.
  5. Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
  6. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  7. Li, Q., Zhou, Z., and Levine, S. Reinforcement learning with action chunking. In Advances in Neural Information Processing Systems, volume 38, pp. 55518–55553, 2025; Li, Y., Shao, X., Zhang, J., Wang, H., Brunswic, L. M., Zhou, K., Dong, J., Guo, K., Li, X., Chen, Z., Wang, J., and Hao, J. Generative models in decision making: A survey. arXiv preprint.
  8. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  9. Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
  10. Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019; Park, S., Frans, K., Eysenbach, B., and Levine, S. OGBench: Benchmarking offline goal-conditioned RL. In International Conference on Learning Representations, 2025; Park, S., Li, Q., and Levine, S. Flow Q-learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267, pp. 48104–48127, 2025.
  11. Wu, Y., Tucker, G., and Nachum, O. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
