Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Andy Peng; Charles Xu; Kevin Frans; Qiyang Li; Sergey Levine; Tobias Springenberg; Zhiyuan Zhou

arxiv: 2606.11087 · v1 · pith:BB3KHRPUnew · submitted 2026-06-09 · 💻 cs.LG · cs.AI

Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

Pith reviewed 2026-06-27 14:03 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords test-time policy improvement · flow policies · Q-guidance · offline reinforcement learning · continuous control · behavioral cloning · value gradient

The pith

Value gradients guide pre-trained flow policies to higher-value actions at test time without further training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a flow policy trained only with behavioral cloning can be turned into an effective reinforcement learning policy by using a separately trained value function to steer its sampling process during inference. This test-time guidance replaces the need for specialized training objectives or backpropagation through the generative process, which often destabilize learning with expressive policies. If the approach holds, it decouples stable supervised pre-training from policy improvement and yields competitive results on offline continuous control tasks at far lower cost than joint actor-critic training.

Core claim

QGF pre-trains two components: a flow policy, via behavioral cloning, and a value critic. At test time it adds the value gradient to the flow sampling dynamics so that generated actions have higher expected value. No policy parameters are updated after pre-training. The method outperforms earlier test-time baselines on single-task and goal-conditioned offline RL benchmarks with high-dimensional actions, and it remains competitive with state-of-the-art training-time algorithms while using less compute and scaling better with model size.
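The mechanism can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the paper's implementation: the hand-written `velocity` field stands in for a learned behavioral-cloning flow, the quadratic critic behind `q_grad` stands in for a learned value function, and the Euler discretization, the one-step estimate `a1_hat`, and the names `guidance_weight` and `num_steps` are all hypothetical choices. The Jacobian factor that appears in the paper's equation (8) is omitted here for brevity.

```python
import random

MU, SIGMA = 0.0, 0.5   # hypothetical mean and std of the behavior data
A_STAR = 1.0           # hypothetical action preferred by the toy critic


def velocity(a, t):
    """Stand-in reference velocity field carrying N(0, 1) noise to N(MU, SIGMA^2)."""
    noise = (a - t * MU) / (1.0 - t + t * SIGMA)  # recover the starting noise
    return MU + (SIGMA - 1.0) * noise


def q_grad(a):
    """Gradient of the toy critic Q(a) = -0.5 * (a - A_STAR) ** 2."""
    return A_STAR - a


def sample_action(noise, guidance_weight, num_steps=20):
    """Euler-integrate the flow from t=0 (noise) to t=1 (action) with critic guidance."""
    a, dt = noise, 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(a, t)
        a1_hat = a + (1.0 - t) * v  # one-step estimate of the clean action (assumed form)
        a += dt * (v + guidance_weight * q_grad(a1_hat))
    return a


def mean_q(actions):
    """Average toy-critic value over a batch of actions."""
    return sum(-0.5 * (a - A_STAR) ** 2 for a in actions) / len(actions)


rng = random.Random(0)
noises = [rng.gauss(0.0, 1.0) for _ in range(1000)]
unguided = [sample_action(z, guidance_weight=0.0) for z in noises]
guided = [sample_action(z, guidance_weight=2.0) for z in noises]
```

With `guidance_weight=0.0` the sampler reproduces the reference policy; with a positive weight each sample is pulled toward the action the toy critic prefers, while the policy itself is never updated.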

What carries the argument

Value-gradient guidance of the flow sampling process at inference time, which steers a fixed reference flow policy toward higher-value actions without retraining.

If this is right

Expressive policies can receive policy improvement entirely after supervised training ends.
Avoiding backpropagation through denoising removes a major source of training instability.
The same pre-trained flow and critic pair works for both single-task and goal-conditioned settings.
Computational cost drops because training involves no joint actor-critic updates: the critic is trained separately and the policy only by behavioral cloning.
Larger models become practical because scaling is no longer limited by joint training instabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of generative modeling from value-based steering may extend to other families of expressive policies beyond flows.
Test-time guidance could serve as a lightweight way to adapt a single pre-trained policy to new reward functions without retraining.
If value estimates remain useful far from the training distribution, the method may reduce reliance on conservative offline RL objectives.

Load-bearing premise

The value function supplies a gradient that reliably improves sampled actions when added to the flow dynamics, without introducing new instabilities or requiring the critic to be accurate over the entire action distribution.

What would settle it

An experiment in which adding the value gradient to the flow sampler produces lower average returns than the unguided reference policy or causes the sampling process to diverge on a standard offline RL benchmark.

Figures

Figures reproduced from arXiv: 2606.11087 by Andy Peng, Charles Xu, Kevin Frans, Qiyang Li, Sergey Levine, Tobias Springenberg, Zhiyuan Zhou.

**Figure 1.** We propose QGF (Q-Guided Flow), an RL algorithm that guides the denoising steps of a policy trained via flow matching with the critic gradient at test time. Our critic gradient estimator avoids both taking the gradient at noisy actions not seen during training and performing expensive, high-variance backpropagation through time, and performs better than other test-time RL methods while being competitive with the best t…

**Figure 2.** Illustrative example of a 1D denoising process mapping Gaussian noise to a tri-modal distribution, with $Q$ defined as the negative L2 distance to the optimal action $a^*$. We compare the base BC flow and three critic-gradient guidance methods (BPTT, OOD, QGF) across three guidance weights. While BPTT and QGF converge to $a^*$, guidance with the OOD gradient $\nabla_{a_t} Q(s, a_t)$ does not result in the optimal solution. Furth…

**Figure 3.** Sensitivity of different gradient estimators to noise in the action space: for each gradient estimator $G$, the plot shows the cosine similarity between $G(s, a_t)$ and $G(s, a_t + \epsilon)$. Our proposed gradient estimator has the least variance and least sensitivity to noise. Averaged over 20 tasks and 4 seeds. Illustrative example of suboptimal $Q$-gradient guidance. We show a simple illustrative example in which usi…
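The sensitivity measure in this caption, the cosine similarity between an estimator's output at an action and at a noise-perturbed copy of it, can be sketched as follows. The two toy estimators, the Gaussian perturbation, and the names `noise_sensitivity` and `noise_std` are assumptions made for illustration; the state argument is dropped and the paper's actual estimators and noise model are not reproduced here.

```python
import math
import random

A_STAR = [1.0, -1.0]  # hypothetical optimum of a toy quadratic critic


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def noise_sensitivity(estimator, action, noise_std, num_trials=100, seed=0):
    """Average cosine similarity between G(a) and G(a + eps), eps ~ N(0, noise_std^2)."""
    rng = random.Random(seed)
    base = estimator(action)
    total = 0.0
    for _ in range(num_trials):
        perturbed = [x + rng.gauss(0.0, noise_std) for x in action]
        total += cosine_similarity(base, estimator(perturbed))
    return total / num_trials


def smooth_estimator(a):
    """Gradient of the toy critic -0.5 * ||a - A_STAR||^2."""
    return [A_STAR[i] - a[i] for i in range(2)]


def rough_estimator(a):
    """Same gradient plus a high-frequency term, mimicking a noise-sensitive estimator."""
    return [A_STAR[i] - a[i] + 2.0 * math.cos(25.0 * a[i]) for i in range(2)]


smooth_score = noise_sensitivity(smooth_estimator, [0.0, 0.0], noise_std=0.1)
rough_score = noise_sensitivity(rough_estimator, [0.0, 0.0], noise_std=0.1)
```

A score near 1 means the estimator points in almost the same direction after the perturbation; lower scores indicate higher sensitivity to action noise.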

**Figure 4.** …, using this first-order approximation is not just a convenience. Perhaps surprisingly, it is actually more effective than running the full denoising process because it is less constrained to the exact dataset distribution. From this, we can approximate the ground truth gradient as $\nabla_{a_t} Q(s, a_1) \approx \nabla_{a_t} Q(s, \hat{a}_1) = \left( \frac{\partial \hat{a}_1}{\partial a_t} \right)^{\top} \nabla_{\hat{a}_1} Q(s, \hat{a}_1)$ (8), which is a product between the gradient of $Q$ and the Jacobian…
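The chain rule in equation (8) can be checked numerically on a small example. The sketch below assumes a hypothetical linear velocity field `W a + B`, a quadratic toy critic, and the one-step estimate $\hat{a}_1 = a_t + (1 - t)\,v(a_t)$; none of these come from the paper, and the state $s$ is dropped for brevity. It compares the Jacobian-transpose product with a finite-difference gradient of $Q(\hat{a}_1(a_t))$ with respect to $a_t$.

```python
W = [[-0.8, 0.3], [0.1, -0.5]]  # hypothetical linear velocity field v(a) = W a + B
B = [0.2, -0.1]
A_STAR = [1.0, -1.0]            # hypothetical optimum of the toy critic


def velocity(a):
    return [sum(W[i][j] * a[j] for j in range(2)) + B[i] for i in range(2)]


def one_step_estimate(a_t, t):
    """Assumed first-order estimate of the clean action: a_t + (1 - t) * v(a_t)."""
    v = velocity(a_t)
    return [a_t[i] + (1.0 - t) * v[i] for i in range(2)]


def q_value(a):
    return -0.5 * sum((a[i] - A_STAR[i]) ** 2 for i in range(2))


def q_grad(a):
    return [A_STAR[i] - a[i] for i in range(2)]


def jacobian(t):
    """d a1_hat / d a_t = I + (1 - t) * W for the linear field above."""
    return [[(1.0 if i == j else 0.0) + (1.0 - t) * W[i][j] for j in range(2)]
            for i in range(2)]


def chain_rule_grad(a_t, t):
    """Right-hand side of equation (8): Jacobian transpose times the critic gradient."""
    g = q_grad(one_step_estimate(a_t, t))
    jac = jacobian(t)
    return [sum(jac[i][j] * g[i] for i in range(2)) for j in range(2)]


def finite_difference_grad(a_t, t, eps=1e-6):
    """Central-difference gradient of Q(a1_hat(a_t)) with respect to a_t."""
    grads = []
    for j in range(2):
        plus, minus = list(a_t), list(a_t)
        plus[j] += eps
        minus[j] -= eps
        diff = q_value(one_step_estimate(plus, t)) - q_value(one_step_estimate(minus, t))
        grads.append(diff / (2.0 * eps))
    return grads


analytic = chain_rule_grad([0.3, -0.4], t=0.25)
numeric = finite_difference_grad([0.3, -0.4], t=0.25)
```

For a learned flow the Jacobian would normally come from automatic differentiation rather than a closed form; the closed form is used here only because the toy field is linear.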

**Figure 5.** Offline RL performance at 500k training steps (20 tasks, 10 seeds): QGF beats all previous test-time methods and is competitive with the best training-time method. See task breakdown in …

**Figure 7.** Offline RL performance at 500k training steps (20 tasks, 10 seeds) evaluated with best-of-N sampling. QGF outperforms BFN (N=4), and QGF+BFN matches BFN with a much higher test-time compute budget. See task breakdown in …

**Figure 8.** Offline goal-conditioned RL performance at 1M training steps (25 tasks, 10 seeds): While QGF underperforms the best baseline on the simplest task, it is consistently the best-performing method on the harder tasks, showing that the gradient estimator scales well to long-horizon tasks. See task breakdown in …

**Figure 10.** QGF can work with different types of critics and performs better when critics are better (20 tasks, 4 seeds). See task breakdown in …

**Figure 11.** Offline RL performance at 500k training steps per environment (10 seeds each).

**Figure 12.** Offline RL with Best-of-N sampling at 500k training steps per environment (10 seeds each).

**Figure 13.** Offline goal-conditioned RL at 1M training steps (10 seeds each). See the domain-specific hyperparameters in …

**Figure 14.** We find that the BPTT gradient can be unstable for higher guidance weights and certain target distributions. Here we offer another instance of the 1-D illustrative denoising example described in Section 4. In this example, we find that the BPTT gradient can be extremely unstable, as shown in …

**Figure 15.** Offline RL performance at 500k training steps (20 tasks, 10 seeds) of different QGF variants. Methods that apply the Jacobian are shaded. See task-specific breakdown in …

**Figure 16.** Full results for …

**Figure 17.** The OOD gradient leads to a denoised action with a higher $Q$ value than all other methods by exploiting the critic on OOD actions. (Left): the OOD gradient leads to actions that are farthest away from the BFN oracle actions. (Right): the OOD gradient leads to actions that have the largest nearest-neighbor (NN) distance to dataset actions. Aggregated over 20 tasks, 256 observations, 4 seeds. E. Sensitivity of …
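The nearest-neighbor distance mentioned in this caption is one simple way to quantify how far generated actions drift from the dataset. A minimal sketch follows, assuming Euclidean distance and tiny made-up action sets; the function names and all numbers are hypothetical and are not taken from the paper.

```python
import math


def nearest_neighbor_distance(action, dataset_actions):
    """Euclidean distance from one action to its closest dataset action."""
    return min(math.dist(action, d) for d in dataset_actions)


def mean_nn_distance(actions, dataset_actions):
    """Average nearest-neighbor distance over a batch of generated actions."""
    total = sum(nearest_neighbor_distance(a, dataset_actions) for a in actions)
    return total / len(actions)


# Made-up data, used only to exercise the functions.
dataset_actions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
in_support = [(0.1, 0.0), (0.9, 0.1)]
off_support = [(3.0, 3.0), (-2.0, 0.0)]

in_support_score = mean_nn_distance(in_support, dataset_actions)
off_support_score = mean_nn_distance(off_support, dataset_actions)
```

A larger average distance suggests the sampler is producing actions the critic saw little or no data for, which is where its value estimates are least trustworthy.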

**Figure 18.** Full results for …

**Figure 19.** Full results for …

**Figure 20.** QGF performance against guidance weight: increasing the guidance weight can drastically improve policy performance, but an overly large weight can also hurt performance by pushing actions off manifold. Best-of-N (BFN) performs rejection sampling on the base behavior cloning policy $\hat{\pi}$ trained with the flow matching objective. Specifically, the output action $a$ is selected as: $a \leftarrow \arg\max_{a_1, \dots, a_N \sim \hat{\pi}} Q(s, a_t)$. (1…
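The Best-of-N baseline described in this caption amounts to drawing several candidate actions from the base policy and keeping the one the critic scores highest. A minimal sketch follows; `toy_policy` and `toy_critic` are hypothetical stand-ins for the behavioral-cloning flow policy and the learned critic, and the function signature is an assumption.

```python
import random


def best_of_n(sample_action, q_value, state, num_samples, rng):
    """Draw num_samples candidate actions and return the one the critic scores highest."""
    candidates = [sample_action(state, rng) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_value(state, action))


def toy_policy(state, rng):
    """Stand-in base policy: a unit-variance Gaussian centered on the state."""
    return rng.gauss(state, 1.0)


def toy_critic(state, action):
    """Stand-in critic that prefers actions near state + 1."""
    return -(action - (state + 1.0)) ** 2


action_n1 = best_of_n(toy_policy, toy_critic, state=0.0, num_samples=1,
                      rng=random.Random(0))
action_n16 = best_of_n(toy_policy, toy_critic, state=0.0, num_samples=16,
                       rng=random.Random(0))
```

Because selection only ever picks among samples from the base policy, test-time cost grows linearly with `num_samples`, which is the compute budget the Best-of-N comparisons vary.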

Original abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

QGF keeps flow policy training as plain BC and does all improvement at test time via critic gradient guidance on the sampling process, which looks competitive on the benchmarks but leaves the method exposed to critic errors outside the data.

The main thing to know is that this paper keeps the flow policy training as standard behavioral cloning and moves all the RL-style improvement to test time by steering the flow sampling with gradients from a pre-trained critic. No changes to the supervised objective, no backprop through the denoising steps during training.

What the work actually does is show that this test-time guidance can beat other test-time RL baselines on single-task and goal-conditioned offline RL tasks with high-dimensional actions, while staying competitive with some training-time methods and running cheaper. The scaling behavior with model size is presented as better because the instability of actor-critic training is avoided entirely.

The soft spot is the one the stress-test flags: the critic is trained on limited offline data, so its gradient can be unreliable on actions the flow reaches during guided sampling. Because there is no joint optimization or corrective signal during training, any over- or under-estimation passes directly into the generated actions. The paper would be tighter if it included checks on how much the guidance actually improves value versus how often it follows critic noise.

This is for people working on expressive policies in robotics and continuous control who want to avoid the usual RL training headaches. Readers who need a practical way to get some policy improvement on top of imitation learning without rewriting the training loop will find the approach and the benchmark numbers useful. It deserves a serious referee because the separation of training and test-time improvement is clean and the empirical claims are stated clearly enough to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper proposes QGF (Q-Guided Flow), an RL method that pre-trains a flow-based policy via standard behavioral cloning and a separate value critic, then performs all policy improvement at test time by using the critic's value gradient to steer the reference flow sampling process toward higher-value actions. The central empirical claim is that QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, remains competitive with state-of-the-art training-time algorithms, and exhibits better scaling with model size by avoiding actor-critic training instabilities.

Significance. If the empirical results hold under rigorous evaluation, the work is significant for providing a stable, scalable way to incorporate expressive flow policies into RL without backpropagating through the sampling process or using specialized training objectives. By cleanly separating supervised pre-training from test-time guidance, it offers a practical alternative that could reduce computational cost and instability in high-dimensional control tasks.

major comments (2)

[Section 3 (Method) and Section 4 (Experiments)] The central claim depends on the pre-trained critic's gradient reliably steering the flow toward higher-value actions at test time. However, in offline RL the critic is trained only on the behavior data distribution; any overestimation on OOD actions reachable by the guided flow would directly propagate into the generated policy without correction. The manuscript should include a dedicated analysis or ablation (e.g., in the experiments section) measuring critic error on actions produced by guided versus unguided sampling and showing that guidance remains beneficial even under realistic critic inaccuracies.
[Section 4 (Experiments)] The abstract and introduction assert empirical superiority and favorable scaling, yet the provided description contains no quantitative details on experimental setup, number of seeds, statistical significance, or failure cases. Without these, the load-bearing claim that QGF is "competitive with state-of-the-art training-time algorithms while being much cheaper" cannot be assessed. The experiments section must report full protocol, baselines, and variance.

minor comments (2)

[Section 3.2] Notation for the flow sampling process and the exact form of the gradient guidance step should be made fully explicit with equations, including any step-size or normalization hyperparameters.
[Section 3] The paper should clarify whether the value function is frozen during test-time guidance or allowed any light adaptation, as this affects reproducibility.
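As an illustration of what an explicit guidance step could look like, one plausible form is sketched below. It assumes Euler integration of the reference flow; the step size $\Delta t$, guidance weight $\lambda$, and velocity field $v_\theta$ are assumed names rather than the paper's notation, and the estimator $G$ is written in the form suggested by equation (8) in the figure excerpts. The paper's actual update, including any normalization, may differ.

```latex
% Illustrative only: \Delta t, \lambda, and v_\theta are assumed symbols.
a_{t+\Delta t} = a_t + \Delta t \,\bigl[\, v_\theta(s, a_t, t) + \lambda \, G(s, a_t) \,\bigr],
\qquad
G(s, a_t) = \left( \frac{\partial \hat{a}_1}{\partial a_t} \right)^{\top}
            \nabla_{\hat{a}_1} Q(s, \hat{a}_1)
```

Setting $\lambda = 0$ recovers sampling from the unguided reference policy.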

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional analysis on critic behavior and fuller experimental reporting will strengthen the paper, and we will incorporate these changes in the revision.

Referee: [Section 3 (Method) and Section 4 (Experiments)] The central claim depends on the pre-trained critic's gradient reliably steering the flow toward higher-value actions at test time. However, in offline RL the critic is trained only on the behavior data distribution; any overestimation on OOD actions reachable by the guided flow would directly propagate into the generated policy without correction. The manuscript should include a dedicated analysis or ablation (e.g., in the experiments section) measuring critic error on actions produced by guided versus unguided sampling and showing that guidance remains beneficial even under realistic critic inaccuracies.

Authors: We agree this is a valid concern for offline RL methods relying on test-time guidance. The current experiments emphasize end-to-end performance, but we will add a dedicated ablation subsection (new Figure or Table in Section 4) that measures critic prediction error and value estimates on actions from guided vs. unguided flow sampling. This will include quantitative comparison of critic inaccuracies and confirmation that guidance yields net benefit under realistic overestimation.

revision: yes

Referee: [Section 4 (Experiments)] The abstract and introduction assert empirical superiority and favorable scaling, yet the provided description contains no quantitative details on experimental setup, number of seeds, statistical significance, or failure cases. Without these, the load-bearing claim that QGF is "competitive with state-of-the-art training-time algorithms while being much cheaper" cannot be assessed. The experiments section must report full protocol, baselines, and variance.

Authors: We acknowledge that the initial submission omitted some protocol details in the main text. In the revised manuscript we will expand Section 4 with a full experimental protocol subsection: number of random seeds (5-10 per task), statistical significance testing (e.g., Welch t-tests with p-values), complete baseline implementations and hyperparameters, per-task variance (mean ± std), and discussion of any observed failure modes or edge cases. This will make the competitiveness and cost claims fully verifiable.

revision: yes
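The reporting promised in this response (per-task mean ± std and a Welch t-test between two methods) can be sketched with the standard library. The score lists below are made-up numbers used only to exercise the functions; they are not results from the paper, and the function names are hypothetical. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value needs a Student-t CDF, which a library such as SciPy provides (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
import statistics


def mean_and_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def welch_t_test(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    # Squared standard errors, using the sample variance (n - 1 denominator).
    se2_x = statistics.variance(xs) / len(xs)
    se2_y = statistics.variance(ys) / len(ys)
    t_stat = (mean_x - mean_y) / math.sqrt(se2_x + se2_y)
    dof = (se2_x + se2_y) ** 2 / (
        se2_x ** 2 / (len(xs) - 1) + se2_y ** 2 / (len(ys) - 1)
    )
    return t_stat, dof


# Made-up per-seed success rates for two hypothetical methods.
method_a = [0.62, 0.58, 0.65, 0.60, 0.61]
method_b = [0.55, 0.57, 0.52, 0.59, 0.54]

summary_a = mean_and_std(method_a)
summary_b = mean_and_std(method_b)
t_stat, dof = welch_t_test(method_a, method_b)
```

Welch's variant is the natural choice here because it does not assume the two methods have equal variance across seeds.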

Circularity Check

0 steps flagged

No circularity: test-time guidance is independent of pre-training fit

The paper separates pre-training (standard BC on flow policy plus independent critic) from test-time gradient guidance. No equations or self-citations reduce the claimed performance to a quantity defined by the method itself; the guidance step uses the critic gradient as an external steering signal rather than re-deriving or fitting the same objective. This matches the default non-circular case for a method whose central claim rests on empirical separation of training and inference phases.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard assumptions of offline RL and imitation learning; the abstract description introduces no free parameters or invented entities and relies on a single domain assumption.

axioms (1)

domain assumption: A pre-trained value function provides a useful gradient signal for improving actions sampled from a flow policy.
Central to the test-time guidance step; invoked implicitly when stating that value gradients guide the reference policy.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Guided Action Flow: Q-Guided Inference for Flow-Matching Vision-Language-Action Policies
cs.RO · 2026-07 · unverdicted · novelty 4.0

Guided Action Flow applies a rollout-trained critic to steer frozen flow-matching VLA policies at inference time via action gradients, reporting success rate gains on LIBERO manipulation tasks.

Reference graph

Works this paper leans on

70 extracted references · 40 canonical work pages · cited by 1 Pith paper · 14 internal anchors

[1]

Maximum a posteriori policy optimisation

Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Ried- miller. Maximum a posteriori policy optimisation. InInternational Conference on Learning Representations, 2018

2018
[2]

Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble.Advances in neural information processing systems, 34:7436–7447, 2021

2021
[3]

Offline reinforcement learning via high-fidelity generative behavior modeling.arXiv preprint arXiv:2209.14548, 2022

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling.arXiv preprint arXiv:2209.14548, 2022

work page arXiv 2022
[4]

One-step flow policy mirror descent.arXiv preprint arXiv:2507.23675, 2025

Tianyi Chen, Haitong Ma, Na Li, Kai Wang, and Bo Dai. One-step flow policy mirror descent.arXiv preprint arXiv:2507.23675, 2025

work page arXiv 2025
[5]

Diffusion policies creating a trust region for offline reinforcement learning.Advances in Neural Information Processing Systems, 37:50098–50125, 2024

Tianyu Chen, Zhendong Wang, and Mingyuan Zhou. Diffusion policies creating a trust region for offline reinforcement learning.Advances in Neural Information Processing Systems, 37:50098–50125, 2024

2024
[6]

Feudal reinforcement learning.Advances in neural information processing systems, 5, 1992

Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning.Advances in neural information processing systems, 5, 1992

1992
[7]

Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

2021
[8]

Diffusion-based reinforcement learning via q-weighted variational policy optimization.Advances in Neural Information Processing Systems, 37:53945–53968, 2024

Shutong Ding, Ke Hu, Zhenhao Zhang, Kan Ren, Weinan Zhang, Jingyi Yu, Jingya Wang, and Ye Shi. Diffusion-based reinforcement learning via q-weighted variational policy optimization.Advances in Neural Information Processing Systems, 37:53945–53968, 2024

2024
[9]

and Jin, C

Zihan Ding and Chi Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023

work page arXiv 2023
[10]

Carles Domingo-Enrich, Michal Drozdzal, Brian Karrer, and Ricky T. Q. Chen. Adjoint matching: Fine-tuning flow and diffusion generative models with memoryless stochastic optimal control, 2025

2025
[11]

Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, and Wen Sun. Scaling offline rl via efficient and expressive shortcut models.arXiv preprint arXiv:2505.22866, 2025

work page arXiv 2025
[12]

Online reward-weighted fine-tuning of flow matching with wasserstein regularization

Jiajun Fan, Shuaike Shen, Chaoran Cheng, Yuxin Chen, Chumeng Liang, and Ge Liu. Online reward-weighted fine-tuning of flow matching with wasserstein regularization. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[13]

Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning.arXiv preprint arXiv:2405.20555, 2024

Linjiajie Fang, Ruoxue Liu, Jing Zhang, Wenjia Wang, and Bing-Yi Jing. Diffusion actor-critic: Formulating constrained policy iteration as diffusion noise regression for offline reinforcement learning.arXiv preprint arXiv:2405.20555, 2024

work page arXiv 2024
[14]

Frans, S

Kevin Frans, Seohong Park, Pieter Abbeel, and Sergey Levine. Diffusion guidance is a controllable policy improvement operator.arXiv preprint arXiv:2505.23458, 2025

work page arXiv 2025
[15]

A minimalist approach to offline reinforcement learning.Advances in neural information processing systems, 34:20132–20145, 2021

Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning.Advances in neural information processing systems, 34:20132–20145, 2021

2021
[16]

Addressing function approximation error in actor-critic methods

Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. InInternational conference on machine learning, pages 1587–1596. PMLR, 2018

2018
[17]

Extreme q-learning: Maxent rl without entropy.arXiv preprint arXiv:2301.02328, 2023

Divyansh Garg, Joey Hejna, Matthieu Geist, and Stefano Ermon. Extreme q-learning: Maxent rl without entropy.arXiv preprint arXiv:2301.02328, 2023. 11 Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

work page arXiv 2023
[18]

Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. PMLR, 2018

2018
[19]

IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. IDQL: Implicit Q-learning as an actor-critic method with diffusion policies.arXiv preprint arXiv:2304.10573, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Diffcps: Diffusion model based constrained policy search for offline reinforcement learning.arXiv preprint arXiv:2310.05333, 2023

Longxiang He, Li Shen, Linrui Zhang, Junbo Tan, and Xueqian Wang. Diffcps: Diffusion model based constrained policy search for offline reinforcement learning.arXiv preprint arXiv:2310.05333, 2023

work page arXiv 2023
[21]

Deep reinforcement learning that matters

Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. InProceedings of the AAAI conference on artificial intelligence, volume 32, 2018

2018
[22]

Adversarial examples are not bugs, they are features.Advances in neural information processing systems, 32, 2019

Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features.Advances in neural information processing systems, 32, 2019

2019
[23]

Decoupled neural interfaces using synthetic gradients

Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, David Silver, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. InInternational conference on machine learning, pages 1627–1635. PMLR, 2017

2017
[24]

Q-guided flow q-learning

Yejun Jang, Hong Chul Nam, Jeong Min Park, Gimin Bae, and Hyun Kwon. Q-guided flow q-learning. In CoRL 2025 Workshop RemembeRL

2025
[25]

Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning.Advances in Neural Information Processing Systems, 36:67195–67212, 2023

2023
[26]

Enhancing diffusion-based image synthesis with robust classifier guidance.arXiv preprint arXiv:2208.08664, 2022

Bahjat Kawar, Roy Ganz, and Michael Elad. Enhancing diffusion-based image synthesis with robust classifier guidance.arXiv preprint arXiv:2208.08664, 2022

work page arXiv 2022
[27]

Understanding diffusion objectives as the elbo with simple data augmenta- tion

Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmenta- tion. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems, volume 36, pages 65484–65516, 2023

2023
[28]

Flow-based single-step completion for efficient and expressive policy learning.arXiv preprint arXiv:2506.21427, 2025

Prajwal Koirala and Cody Fleming. Flow-based single-step completion for efficient and expressive policy learning.arXiv preprint arXiv:2506.21427, 2025

work page arXiv 2025
[29]

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[30]

Reward-conditioned policies.arXiv preprint arXiv:1912.13465, 2019

Aviral Kumar, Xue Bin Peng, and Sergey Levine. Reward-conditioned policies.arXiv preprint arXiv:1912.13465, 2019

work page arXiv 1912
[31]

Conservative Q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 33:1179–1191, 2020

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning.Advances in Neural Information Processing Systems, 33:1179–1191, 2020

2020
[32]

Direct feedback alignment scales to modern deep learning tasks and architectures.Advances in neural information processing systems, 33:9346– 9360, 2020

Julien Launay, Iacopo Poli, François Boniface, and Florent Krzakala. Direct feedback alignment scales to modern deep learning tasks and architectures.Advances in neural information processing systems, 33:9346– 9360, 2020

2020
[33]

Q-learning with Adjoint Matching

Qiyang Li and Sergey Levine. Q-learning with adjoint matching.arXiv preprint arXiv:2601.14234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

Decoupled q-chunking.arXiv preprint arXiv:2512.10926, 2025

Qiyang Li, Seohong Park, and Sergey Levine. Decoupled q-chunking.arXiv preprint arXiv:2512.10926, 2025

work page arXiv 2025
[35]

Reinforcement Learning with Action Chunking

Qiyang Li, Zhiyuan Zhou, and Sergey Levine. Reinforcement learning with action chunking.arXiv preprint arXiv:2507.07969, 2025. 12 Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Learning multimodal behaviors from scratch with diffusion policy gradient.Advances in Neural Information Processing Systems, 37:38456–38479, 2024

Zechu Li, Rickmer Krohn, Tao Chen, Anurag Ajay, Pulkit Agrawal, and Georgia Chalvatzaki. Learning multimodal behaviors from scratch with diffusion policy gradient.Advances in Neural Information Processing Systems, 37:38456–38479, 2024

2024
[37]

Random feedback weights support learning in deep neural networks

Timothy P Lillicrap, Daniel Cownden, Douglas B Tweed, and Colin J Akerman. Random feedback weights support learning in deep neural networks.arXiv preprint arXiv:1411.0247, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[38]

Continuous control with deep reinforcement learning

Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[39] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[40] Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, and Xiao Ma. Flow-based policy for online reinforcement learning. arXiv preprint arXiv:2506.12811, 2025.
[41] Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic RL: Offline RL and online RL fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024.
[42] David McAllister, Miika Aittala, Tero Karras, Janne Hellsten, Angjoo Kanazawa, Timo Aila, and Samuli Laine. Finite difference flow optimization for RL post-training of text-to-image models. arXiv preprint arXiv:2603.12893, 2026.
[43] David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients. arXiv preprint arXiv:2507.21053, 2025.
[44] Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020.
[45] Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance. arXiv preprint arXiv:2410.13816, 2024.
[46] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.
[47] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. arXiv preprint arXiv:2410.20092, 2024.
[48] Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline RL? Advances in Neural Information Processing Systems, 37:79029–79056, 2024.
[49] Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, and Sergey Levine. Horizon reduction makes RL scalable. arXiv preprint arXiv:2506.04168, 2025.
[50] Seohong Park, Qiyang Li, and Sergey Levine. Flow Q-learning. arXiv preprint arXiv:2502.02538, 2025.
[51] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
[52] Samuel Pfrommer, Yixiao Huang, and Somayeh Sojoudi. Reinforcement learning for flow-matching policies. arXiv preprint arXiv:2507.15073, 2025.
[53] Michael Psenka, Alejandro Escontrela, Pieter Abbeel, and Yi Ma. Learning a diffusion model policy from rewards via Q-score matching. arXiv preprint arXiv:2312.11752, 2023.
[54] Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024.
[55] Shibani Santurkar, Andrew Ilyas, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Image synthesis with a single (robust) classifier. Advances in Neural Information Processing Systems, 32, 2019.
[56] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
[57] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
[58] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
[59] Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
[60] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
[61] John Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems, 9, 1996.
[62] Yinuo Wang, Likun Wang, Yuxuan Jiang, Wenjun Zou, Tong Liu, Xujie Song, Wenxuan Wang, Liming Xiao, Jiang Wu, Jingliang Duan, et al. Diffusion actor-critic with entropy regulator. Advances in Neural Information Processing Systems, 37:54183–54204, 2024.
[63] Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.
[64] Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.
[65] Haoran Xu, Kaiwen Hu, Somayeh Sojoudi, and Amy Zhang. Reinforcement learning via value gradient flow. In The Fourteenth International Conference on Learning Representations.
[66] Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, and Xianyuan Zhan. Offline RL with no OOD actions: In-sample learning via implicit value regularization. arXiv preprint arXiv:2303.15810, 2023.
[67] Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, and Zhi-Ming Ma. Advantage weighted matching: Aligning RL with pretraining in diffusion models. arXiv preprint arXiv:2509.25050, 2025.
[68] Long Yang, Zhixiong Huang, Fenghao Lei, Yucun Zhong, Yiming Yang, Cong Fang, Shiting Wen, Binbin Zhou, and Zhouchen Lin. Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122, 2023.
[69] Ruoqi Zhang, Ziwei Luo, Jens Sjölund, Thomas B Schön, and Per Mattsson. Entropy-regularized diffusion policy with Q-ensembles for offline reinforcement learning. Advances in Neural Information Processing Systems, 37:98871–98897, 2024.
[70] Shiyuan Zhang, Weitong Zhang, and Quanquan Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975, 2025.

A. Result Details

All performance result plots in the main paper and in the appendix report 95% confidence interval as the error b...
