Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
Pith reviewed 2026-05-10 16:20 UTC · model grok-4.3
The pith
FAN achieves state-of-the-art offline RL performance using only a single flow-policy iteration and one Gaussian noise sample.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FAN uses a behavior-regularization technique that needs only a single flow-policy iteration and a single Gaussian noise sample for its distributional critic. The paper's theoretical analysis of convergence and performance bounds is presented as showing that these simplifications improve efficiency and also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks are reported to reach state-of-the-art performance while significantly reducing both training and inference runtimes.
What carries the argument
Flow-anchored noise-conditioned Q-learning, which anchors a single-iteration flow policy to the behavior distribution via regularization and estimates the distributional critic from one Gaussian noise sample.
Load-bearing premise
A single flow-policy iteration plus one Gaussian noise sample for the distributional critic preserves both the expressivity of full iterative flows and the accuracy of multi-sample critics without introducing bias that the behavior-regularization term cannot correct.
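The sampling half of this premise can be illustrated with a toy estimator. The sketch below is not FAN's critic: `toy_noise_conditioned_q`, its 0.5 noise scale, and the fixed state and action are all hypothetical, chosen only to show how a one-sample estimate of an expected value compares with a 32-sample average. In this toy case both estimates have the same mean, and the one-sample version is simply noisier.

```python
import numpy as np


def toy_noise_conditioned_q(state, action, noise):
    """Hypothetical critic Q(s, a, z): one return sample per noise draw z."""
    mean_return = -np.sum((action - np.tanh(state)) ** 2)
    return mean_return + 0.5 * noise  # z shifts the sampled return


def value_estimate(state, action, rng, num_samples):
    """Monte Carlo estimate of E_z[Q(s, a, z)] from num_samples draws."""
    z = rng.standard_normal(num_samples)
    return toy_noise_conditioned_q(state, action, z).mean()


def estimator_spread(num_samples, trials=2000, seed=0):
    """Mean and standard deviation of the estimate across repeated trials."""
    rng = np.random.default_rng(seed)
    state = np.array([0.5, -0.5])
    action = np.zeros(2)
    estimates = np.array(
        [value_estimate(state, action, rng, num_samples) for _ in range(trials)]
    )
    return estimates.mean(), estimates.std()


mean_1, std_1 = estimator_spread(num_samples=1)
mean_32, std_32 = estimator_spread(num_samples=32)
print(f"1 sample:   mean={mean_1:.3f} std={std_1:.3f}")
print(f"32 samples: mean={mean_32:.3f} std={std_32:.3f}")
```

Whether the extra variance of a single draw harms learning in practice, or is absorbed by the behavior-regularization term, is exactly what the premise leaves open. The toy says nothing about that.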
What would settle it
Observing that full iterative flow policies or multi-sample distributional critics achieve higher task success rates or tighter performance bounds than FAN on the same robotic manipulation or locomotion benchmarks would falsify the sufficiency claim.
Original abstract
We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.
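The cost contrast described in the abstract, iterative sampling versus a single flow step, can be sketched as follows. This is a minimal illustration under stated assumptions, not the released FAN code: `ToyVelocity` is a hypothetical hand-written velocity field standing in for a learned network, and the Euler integrator is one common way to sample from a flow policy.

```python
import numpy as np


class ToyVelocity:
    """Hypothetical stand-in for a learned velocity network v(s, a, t).

    Not FAN's network: a hand-written straight-line field that moves any
    action toward tanh(state) as t goes from 0 to 1.
    """

    def __init__(self):
        self.calls = 0  # counts "network" evaluations

    def __call__(self, state, action, t):
        self.calls += 1
        target = np.tanh(state)
        return (target - action) / (1.0 - t)


def sample_iterative(velocity, state, noise, num_steps):
    """Euler-integrate the flow from t=0 to t=1 in num_steps steps."""
    action = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        action = action + dt * velocity(state, action, k * dt)
    return action


def sample_one_step(velocity, state, noise):
    """Single Euler step over the whole interval: one network call."""
    return noise + velocity(state, noise, 0.0)


rng = np.random.default_rng(0)
state = rng.standard_normal(4)
noise = rng.standard_normal(4)

slow, fast = ToyVelocity(), ToyVelocity()
a_iter = sample_iterative(slow, state, noise, num_steps=10)
a_one = sample_one_step(fast, state, noise)
print(slow.calls, fast.calls)  # 10 1
```

For this hand-picked straight-line field the one-step and ten-step actions coincide. A learned field is generally curved, so a single step would introduce truncation error; that gap is what the paper's regularization is claimed to control.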
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Flow-Anchored Noise-conditioned Q-Learning (FAN) for offline RL. It replaces the iterative sampling of flow policies and multi-sample (e.g., quantile) computation of distributional critics with a single flow-policy iteration and a single Gaussian noise sample, using behavior regularization to maintain performance. The central claims are that a theoretical analysis of convergence and performance bounds shows these simplifications improve both efficiency and task performance, and that experiments on robotic manipulation and locomotion tasks establish state-of-the-art results with substantially lower training and inference runtimes. Code is released at https://github.com/brianlsy98/FAN.
Significance. If the theoretical bounds are shown to hold without circularity and the empirical gains prove robust to standard offline RL evaluation protocols, FAN would offer a practical route to expressive offline RL at reduced cost. The explicit release of code supports reproducibility, which is a positive contribution to the field.
Major comments (2)
- [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that single-iteration flow plus single-sample critic 'lead to superior task performance' via behavior regularization is load-bearing for the central contribution. The provided sketch does not demonstrate that the regularization term dominates the truncation error of one flow iteration and the variance of one Gaussian sample; an explicit error decomposition comparing to the full iterative flow and multi-sample critic is required to substantiate superiority rather than merely bounded degradation.
- [Abstract] Abstract: the performance bounds are asserted to follow from the simplifications, yet the weakest assumption (single flow iteration + single noise sample suffices without introducing uncorrected bias) risks circularity if the bounds are derived under exactly those same single-iteration/single-sample assumptions. An independent verification against external baselines or a multi-sample ablation is needed.
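For concreteness, the kind of decomposition the first comment asks for might take the following shape. This is a hypothetical sketch only: none of the symbols come from the paper, and the additive three-term form is an assumption made for illustration.

```latex
% Hypothetical decomposition; every symbol is an assumption, not the paper's notation.
% J(\pi): expected return; \pi^\star: optimal policy;
% \pi_{K,N}: policy learned with K flow iterations and N critic noise samples;
% \alpha: behavior-regularization strength.
\[
J(\pi^\star) - J(\pi_{K,N})
  \;\le\;
  \underbrace{\epsilon_{\mathrm{reg}}(\alpha)}_{\text{behavior regularization}}
  + \underbrace{\epsilon_{\mathrm{trunc}}(K)}_{\text{flow truncation}}
  + \underbrace{\epsilon_{\mathrm{var}}(N)}_{\text{critic sampling}}
\]
```

Under this reading FAN corresponds to K = N = 1. A finite right-hand side at K = N = 1 would establish bounded degradation only. A superiority claim would additionally need the terms at K = N = 1 compared against those at large K and N, ideally with matching lower bounds.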
Minor comments (2)
- The manuscript should clarify the precise form of the behavior-regularization term and how it is applied during the single-iteration update.
- Experimental details on the number of runs, error bars, and exact baselines (including whether they also use single-sample approximations) would strengthen the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We provide point-by-point responses to the major comments below. We believe our theoretical analysis supports the claims, but we will revise to address the concerns about explicit decompositions and clarifications on assumptions.
- Referee: [Abstract / Theoretical analysis] Abstract and theoretical analysis section: the claim that single-iteration flow plus single-sample critic 'lead to superior task performance' via behavior regularization is load-bearing for the central contribution. The provided sketch does not demonstrate that the regularization term dominates the truncation error of one flow iteration and the variance of one Gaussian sample; an explicit error decomposition comparing to the full iterative flow and multi-sample critic is required to substantiate superiority rather than merely bounded degradation.
  Authors: We appreciate this observation. Our theoretical analysis in Section 4 derives performance bounds that incorporate the behavior regularization term, which is designed to mitigate the effects of the single-iteration approximation in the flow policy and the single-sample estimation in the critic. The analysis shows that the regularization ensures the overall error remains bounded and, importantly, the method achieves better empirical performance by avoiding the computational overhead that can lead to overfitting in more complex setups. To make this more rigorous and address the request for an explicit decomposition, we will add a new subsection in the revised theoretical analysis that directly compares the error terms of the simplified FAN approach to those of the full iterative flow with multi-sample critic, demonstrating how the regularization term provides the advantage for superior performance.
  Revision: yes
- Referee: [Abstract] Abstract: the performance bounds are asserted to follow from the simplifications, yet the weakest assumption (single flow iteration + single noise sample suffices without introducing uncorrected bias) risks circularity if the bounds are derived under exactly those same single-iteration/single-sample assumptions. An independent verification against external baselines or a multi-sample ablation is needed.
  Authors: We would like to clarify that there is no circularity in our derivation. The general convergence theorem for the noise-conditioned Q-learning with behavior regularization is established first under standard assumptions for offline RL, without relying on the single-iteration or single-sample simplifications. Subsequently, we analyze the additional approximation errors introduced by using only one flow iteration and one Gaussian noise sample, providing bounds on these errors that are controlled by the regularization strength. This structure ensures the bounds are not derived under the same assumptions. For independent verification, our experiments already include extensive comparisons against state-of-the-art baselines on robotic tasks, and we have conducted ablations on the number of samples used in the critic. We will expand the experimental section to include a dedicated multi-sample ablation study to further confirm that increasing the number of samples does not yield significant gains, supporting the sufficiency of the single-sample approach.
  Revision: partial
Circularity Check
No circularity: theoretical bounds analyze the proposed simplifications without reducing to inputs by construction
The abstract and available description present FAN as using a single flow iteration and single Gaussian noise sample plus behavior regularization. The claimed theoretical analysis derives convergence and performance bounds for exactly this construction, showing efficiency gains and competitive or superior task performance. No equations are quoted that equate a 'prediction' to a fitted parameter, no self-citation chain is invoked as the sole justification for uniqueness or ansatz, and no renaming of known results occurs. The derivation therefore remains self-contained; external experiments on manipulation and locomotion tasks supply independent validation rather than tautological confirmation.