Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

Dohyeong Kim; Eshan Balachandar; Keshav Pingali; Sungyoung Lee; Zelal Su Mustafaoglu

arxiv: 2605.01663 · v2 · pith:X5R3JTR4new · submitted 2026-05-03 · 💻 cs.LG · cs.RO

Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3

keywords: offline reinforcement learning · flow policies · distributional critics · behavior regularization · Q-learning · robotic manipulation · locomotion · computational efficiency

The pith

FAN achieves state-of-the-art offline RL performance using only a single flow-policy iteration and one Gaussian noise sample.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Flow-Anchored Noise-conditioned Q-Learning (FAN) to address the computational cost of expressive methods in offline reinforcement learning. It replaces iterative sampling from flow policies and multiple quantile samples for distributional critics with a single flow iteration and one Gaussian noise draw, anchored by behavior regularization. Theoretical analysis establishes convergence and improved performance bounds under these reductions. Experiments on robotic manipulation and locomotion tasks show that FAN matches or exceeds prior methods while cutting both training and inference time. This matters for making high-capacity offline RL practical in settings where repeated sampling is too slow.
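To make the cost argument concrete, the sketch below contrasts iterative flow sampling with a single-step draw. It is a minimal illustration, not the authors' implementation: the toy `velocity` function, the weight matrix `W`, and the use of a plain Euler step are all assumptions made here, and FAN's actual single flow iteration may differ.

```python
import numpy as np

def velocity(state, x, t, W):
    """Toy velocity field v(s, x, t); a hypothetical stand-in for a learned flow network."""
    return np.tanh(W @ np.concatenate([state, x, [t]]))

def sample_iterative(state, noise, W, num_steps):
    """Euler-integrate the flow ODE from t=0 to t=1, costing num_steps network calls."""
    x, dt = noise.copy(), 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity(state, x, k * dt, W)
    return x

def sample_single_step(state, noise, W):
    """One Euler step over the whole interval, costing a single network call."""
    return sample_iterative(state, noise, W, num_steps=1)

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2
W = rng.normal(size=(action_dim, state_dim + action_dim + 1))
state = rng.normal(size=state_dim)
noise = rng.normal(size=action_dim)   # one Gaussian noise draw

a_iter = sample_iterative(state, noise, W, num_steps=10)  # 10 network calls
a_one = sample_single_step(state, noise, W)               # 1 network call
```

The two actions generally differ; the gap between them is the truncation error that the review's later sections ask the behavior regularization to control.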

Core claim

FAN employs a behavior regularization technique that utilizes only a single flow policy iteration and requires only a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes.

What carries the argument

Flow-anchored noise-conditioned Q-learning, which anchors a single-iteration flow policy to the behavior distribution via regularization and estimates the distributional critic from one Gaussian noise sample.
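A rough sketch of how these pieces could fit together is given below. Everything in it is an assumption for illustration: the linear stand-in networks `W_pi` and `w_z`, the squared-distance form of the anchor, the weight `alpha`, and the placeholder `anchor_action`. The source only states that the policy is anchored by behavior regularization and that the critic is conditioned on the same noise used for policy sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2

# Hypothetical linear "networks"; real models would be MLPs.
W_pi = rng.normal(size=(action_dim, state_dim + action_dim)) * 0.1
w_z = rng.normal(size=state_dim + 2 * action_dim) * 0.1

def policy(state, noise):
    """One-step policy: maps (state, noise) directly to an action."""
    return np.tanh(W_pi @ np.concatenate([state, noise]))

def critic(state, action, noise):
    """Noise-conditioned return estimate Z(s, a, z): one scalar per noise draw."""
    return float(w_z @ np.concatenate([state, action, noise]))

def actor_loss(state, noise, anchor_action, alpha):
    """Value maximization plus an assumed squared-distance anchor to a behavior action."""
    action = policy(state, noise)
    value = critic(state, action, noise)   # critic sees the same noise as the policy
    anchor = np.sum((action - anchor_action) ** 2)
    return -value + alpha * anchor

state = rng.normal(size=state_dim)
noise = rng.normal(size=action_dim)        # the single Gaussian sample
anchor_action = np.zeros(action_dim)       # placeholder for a behavior-flow action
loss = actor_loss(state, noise, anchor_action, alpha=1.0)
```

Setting `alpha=0.0` removes the anchor and leaves pure value maximization, which is the regime offline RL methods typically try to avoid.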

Load-bearing premise

A single flow-policy iteration plus one Gaussian noise sample for the distributional critic preserves both the expressivity of full iterative flows and the accuracy of multi-sample critics without introducing bias that the behavior-regularization term cannot correct.

What would settle it

Observing that full iterative flow policies or multi-sample distributional critics achieve higher task success rates or tighter performance bounds than FAN on the same robotic manipulation or locomotion benchmarks would falsify the sufficiency claim.
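One ingredient of such a comparison is the variance gap between single-sample and multi-sample value estimates. The sketch below uses a made-up `return_given_noise` function (an assumption, not anything from the paper) to show that a one-sample estimate is unbiased for the mean but noisier than a 16-sample average.

```python
import numpy as np

def return_given_noise(noise):
    """Hypothetical noise-conditioned return Z(s, a, z) at a fixed (s, a)."""
    return 1.0 + 0.5 * noise

def value_estimate(rng, num_samples):
    """Monte Carlo estimate of E_z[Z(s, a, z)] from num_samples Gaussian draws."""
    z = rng.normal(size=num_samples)
    return return_given_noise(z).mean()

rng = np.random.default_rng(0)
trials = 20_000
single = np.array([value_estimate(rng, 1) for _ in range(trials)])
multi = np.array([value_estimate(rng, 16) for _ in range(trials)])

print(f"1 sample  : mean={single.mean():.3f}, std={single.std():.3f}")
print(f"16 samples: mean={multi.mean():.3f}, std={multi.std():.3f}")
```

Both estimators center on the same mean; only the spread differs. Whether that extra spread hurts task performance in practice is exactly what a multi-sample ablation on the benchmarks would have to show.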

Figures

Figures reproduced from arXiv: 2605.01663 by Dohyeong Kim, Eshan Balachandar, Keshav Pingali, Sungyoung Lee, Zelal Su Mustafaoglu.

**Figure 1.** Training Runtime per Batch vs. Average Success Rates on five OGBench puzzle-4x4-singleplay-v0 tasks. FAN performs the best with the highest computational efficiency.

**Figure 2.** Overview of FAN. (Left) Behavior regularization utilizes only a single flow policy iteration and is applied to both actor and critic updates. (Middle) The distributional critic is conditioned on the same noise used for policy sampling. (Right) The critic update incorporates an upper expectile regression to capture maximum possible distributional returns.

**Figure 3.** The Number of FLOPs and the Wall-clock Compute Time per function call for cube-double-play.

**Figure 4.** Ablation Studies on Flow Anchoring and T π n . (Up) NBRAC vs. NFQL vs. FAN to verify the effect of Flow Anchoring. (Down) FAQL vs. FAN to verify the effect of T π n . The black line (FAN) performs the best on average, compared to all other combinations.

**Figure 5.** Ablation Study on Value Maximization in Policy Training. The black line (maximizing both Zψ and Qϕ) empirically achieves the best average performance compared to maximizing either component individually.

**Figure 6.** Ablation Study on Increased Number of Noise Samples for Value Training. (Left) Performance curves with varying numbers of noise samples. (Right) Runtime comparison with varying numbers of noise samples.

**Figure 7.** Ablation Study on Sensitivity to κ. The black line (κ = 0.9) empirically achieves the best average performance.
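The Figure 2 caption mentions an upper expectile regression in the critic update. For readers unfamiliar with the term, the sketch below shows the standard asymmetric squared loss that defines an expectile. The parameter name `tau`, the helper `fit_expectile`, and the synthetic `returns` are assumptions for illustration; this page does not state how the paper parameterizes its expectile or whether κ in Figure 7 plays that role.

```python
import numpy as np

def expectile_loss(target, prediction, tau):
    """Asymmetric squared error: weight tau where target > prediction, else 1 - tau."""
    diff = target - prediction
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)

def fit_expectile(samples, tau, steps=2000, lr=0.1):
    """Gradient descent on one scalar to find the tau-expectile of samples."""
    m = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - m
        weight = np.where(diff > 0, tau, 1.0 - tau)
        grad = -2.0 * np.mean(weight * diff)   # derivative of the loss w.r.t. m
        m -= lr * grad
    return m

rng = np.random.default_rng(0)
returns = rng.normal(size=5000)                # synthetic return samples
upper = fit_expectile(returns, tau=0.9)        # leans toward high returns
middle = fit_expectile(returns, tau=0.5)       # coincides with the sample mean
```

With `tau` above 0.5, under-predictions are penalized more heavily than over-predictions, so the fitted value sits above the mean. That is the sense in which an upper expectile targets the optimistic end of a return distribution.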

Original abstract

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Flow-Anchored Noise-conditioned Q-Learning (FAN) for offline RL. It replaces the iterative sampling of flow policies and multi-sample (e.g., quantile) computation of distributional critics with a single flow-policy iteration and a single Gaussian noise sample, using behavior regularization to maintain performance. The central claims are that a theoretical analysis of convergence and performance bounds shows these simplifications improve both efficiency and task performance, and that experiments on robotic manipulation and locomotion tasks establish state-of-the-art results with substantially lower training and inference runtimes. Code is released at https://github.com/brianlsy98/FAN.

Significance. If the theoretical bounds are shown to hold without circularity and the empirical gains prove robust to standard offline RL evaluation protocols, FAN would offer a practical route to expressive offline RL at reduced cost. The explicit release of code supports reproducibility, which is a positive contribution to the field.

major comments (2)

[Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that single-iteration flow plus single-sample critic 'lead to superior task performance' via behavior regularization is load-bearing for the central contribution. The provided sketch does not demonstrate that the regularization term dominates the truncation error of one flow iteration and the variance of one Gaussian sample; an explicit error decomposition comparing to the full iterative flow and multi-sample critic is required to substantiate superiority rather than merely bounded degradation.
[Abstract] Abstract: the performance bounds are asserted to follow from the simplifications, yet the weakest assumption (single flow iteration + single noise sample suffices without introducing uncorrected bias) risks circularity if the bounds are derived under exactly those same single-iteration/single-sample assumptions. An independent verification against external baselines or a multi-sample ablation is needed.
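The decomposition requested in the major comments might take a shape like the one below. This is a hypothetical template, not a result from the paper: the symbols $\varepsilon_{\mathrm{reg}}$, $\varepsilon_{\mathrm{trunc}}$, $\varepsilon_{\mathrm{var}}$, the regularization weight $\alpha$, the flow iteration count $K$, and the noise sample count $N$ are all introduced here for illustration.

```latex
\begin{align*}
J(\pi^{*}) - J(\pi_{\mathrm{FAN}})
  &\le \underbrace{\varepsilon_{\mathrm{reg}}(\alpha)}_{\text{behavior-regularization bias}}
   + \underbrace{\varepsilon_{\mathrm{trunc}}(K{=}1)}_{\text{one flow iteration}}
   + \underbrace{\varepsilon_{\mathrm{var}}(N{=}1)}_{\text{one noise sample}}, \\
J(\pi^{*}) - J(\pi_{\mathrm{full}})
  &\le \varepsilon_{\mathrm{reg}}(\alpha)
   + \varepsilon_{\mathrm{trunc}}(K)
   + \varepsilon_{\mathrm{var}}(N).
\end{align*}
```

Comparing two upper bounds does not by itself order the two policies. A superiority claim would additionally need a lower bound on the gap of the full method, or a term-by-term argument that the single-iteration, single-sample terms are no larger than their multi-step counterparts.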

minor comments (2)

The manuscript should clarify the precise form of the behavior-regularization term and how it is applied during the single-iteration update.
Experimental details on the number of runs, error bars, and exact baselines (including whether they also use single-sample approximations) would strengthen the SOTA claim.
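As a pointer to what such reporting could look like, the sketch below aggregates per-seed success rates into a mean with a normal-approximation 95% interval. The helper `summarize` and the numbers in `seeds` are invented purely for illustration and are not results from the paper.

```python
import numpy as np

def summarize(success_rates):
    """Mean and normal-approximation 95% interval half-width across seeds."""
    rates = np.asarray(success_rates, dtype=float)
    mean = rates.mean()
    sem = rates.std(ddof=1) / np.sqrt(len(rates))   # standard error of the mean
    return mean, 1.96 * sem

# Hypothetical per-seed success rates, made up for illustration only.
seeds = [0.80, 0.85, 0.78, 0.90, 0.82, 0.84, 0.79, 0.88]
mean, half_width = summarize(seeds)
print(f"{mean:.3f} +/- {half_width:.3f} (n={len(seeds)})")
```

With only a handful of seeds, a Student-t or bootstrap interval would usually be a safer choice than the 1.96 multiplier used here.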

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We provide point-by-point responses to the major comments below. We believe our theoretical analysis supports the claims, but we will revise to address the concerns about explicit decompositions and clarifications on assumptions.

Referee: [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that single-iteration flow plus single-sample critic 'lead to superior task performance' via behavior regularization is load-bearing for the central contribution. The provided sketch does not demonstrate that the regularization term dominates the truncation error of one flow iteration and the variance of one Gaussian sample; an explicit error decomposition comparing to the full iterative flow and multi-sample critic is required to substantiate superiority rather than merely bounded degradation.

Authors: We appreciate this observation. Our theoretical analysis in Section 4 derives performance bounds that incorporate the behavior regularization term, which is designed to mitigate the effects of the single-iteration approximation in the flow policy and the single-sample estimation in the critic. The analysis shows that the regularization ensures the overall error remains bounded and, importantly, the method achieves better empirical performance by avoiding the computational overhead that can lead to overfitting in more complex setups. To make this more rigorous and address the request for an explicit decomposition, we will add a new subsection in the revised theoretical analysis that directly compares the error terms of the simplified FAN approach to those of the full iterative flow with multi-sample critic, demonstrating how the regularization term provides the advantage for superior performance.

Revision: yes

Referee: [Abstract] Abstract: the performance bounds are asserted to follow from the simplifications, yet the weakest assumption (single flow iteration + single noise sample suffices without introducing uncorrected bias) risks circularity if the bounds are derived under exactly those same single-iteration/single-sample assumptions. An independent verification against external baselines or a multi-sample ablation is needed.

Authors: We would like to clarify that there is no circularity in our derivation. The general convergence theorem for the noise-conditioned Q-learning with behavior regularization is established first under standard assumptions for offline RL, without relying on the single-iteration or single-sample simplifications. Subsequently, we analyze the additional approximation errors introduced by using only one flow iteration and one Gaussian noise sample, providing bounds on these errors that are controlled by the regularization strength. This structure ensures the bounds are not derived under the same assumptions. For independent verification, our experiments already include extensive comparisons against state-of-the-art baselines on robotic tasks, and we have conducted ablations on the number of samples used in the critic. We will expand the experimental section to include a dedicated multi-sample ablation study to further confirm that increasing the number of samples does not yield significant gains, supporting the sufficiency of the single-sample approach.

Revision: partial

Circularity Check

0 steps flagged

No circularity: the theoretical bounds analyze the proposed simplifications and do not reduce to their inputs by construction.

The abstract and available description present FAN as using a single flow iteration and single Gaussian noise sample plus behavior regularization. The claimed theoretical analysis derives convergence and performance bounds for exactly this construction, showing efficiency gains and competitive or superior task performance. No equations are quoted that equate a 'prediction' to a fitted parameter, no self-citation chain is invoked as the sole justification for uniqueness or ansatz, and no renaming of known results occurs. The derivation therefore remains self-contained; external experiments on manipulation and locomotion tasks supply independent validation rather than tautological confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard RL convergence assumptions plus the new regularization technique.
