arxiv: 2506.15799 · v2 · pith:SOUJZI63new · submitted 2025-06-18 · 💻 cs.RO · cs.LG

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine

Pith reviewed 2026-05-17 21:51 UTC · model grok-4.3


keywords diffusion policies · latent space reinforcement learning · robotic control · behavioral cloning · policy adaptation · sample efficiency · real-world robotics · autonomous improvement


The pith

Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that diffusion policies learned from demonstrations can be improved autonomously, without new demonstrations, by applying reinforcement learning directly to the latent noise inputs that initialize the denoising process. This matters because gathering extra human demonstrations for refinement is costly and slow, while full reinforcement learning usually demands too many real-world trials to be practical. If the method works, existing diffusion policies could adapt quickly in open environments using only black-box access to the policy and no parameter changes. The approach also sidesteps the difficulties that arise when fine-tuning the diffusion model itself during online operation.

Core claim

We introduce diffusion steering via reinforcement learning (DSRL) that adapts a behavioral cloning policy by running reinforcement learning over the latent noise space of a diffusion model. DSRL requires only black-box access to the policy, avoids any modification of its weights, and produces effective policy improvement with high sample efficiency on both simulated benchmarks and real-world robotic tasks, including adaptation of pretrained generalist policies.
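The claim can be written compactly as follows. This is a sketch under stated assumptions, not the paper's own formulation: only $w$ and $\pi^{W}$ appear in passages quoted on this page, while $\pi_{\mathrm{dp}}$ (the frozen diffusion policy viewed as a map from state and initial noise to action), the reward $r$, and the discount $\gamma$ are notation introduced here for illustration. It also assumes the denoising procedure is deterministic once the initial noise is fixed.

```latex
\begin{align*}
\text{base policy:}\quad & w \sim \mathcal{N}(0, I), & a &= \pi_{\mathrm{dp}}(s, w) \\
\text{steered policy:}\quad & w \sim \pi^{W}(\cdot \mid s), & a &= \pi_{\mathrm{dp}}(s, w) \\
\text{objective:}\quad & \max_{\pi^{W}} \; \mathbb{E}\Big[ \sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \Big], & a_t &= \pi_{\mathrm{dp}}(s_t, w_t),\; w_t \sim \pi^{W}(\cdot \mid s_t)
\end{align*}
```

Under this reading, only $\pi^{W}$ is trained. The weights inside $\pi_{\mathrm{dp}}$ never appear in the optimization, which is what the black-box claim amounts to.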

What carries the argument

Diffusion steering via reinforcement learning (DSRL), the process of running RL to choose better noise inputs that steer the diffusion model's denoising trajectory toward improved actions.
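The steering idea can be illustrated with a toy example. The sketch below is not the paper's algorithm: `base_policy` is a hypothetical stand-in for a frozen diffusion policy, the reward is invented, the latent-noise policy is assumed to be deterministic and linear in the observation, and a plain hill-climbing random search replaces whatever RL optimizer the authors use. It only shows that a frozen noise-to-action map can be steered by searching over its noise input.

```python
import math
import random

def base_policy(obs, noise):
    """Hypothetical frozen policy: a fixed map from (observation, noise) to an
    action in [-1, 1]. It is only ever called, never modified."""
    return math.tanh(0.5 * obs + noise)

def task_reward(action):
    """Invented task reward: the best action is 0.8 for every observation."""
    return -(action - 0.8) ** 2

# Fixed set of observations used to score a latent-noise policy.
OBS_GRID = [i / 10.0 for i in range(-10, 11)]

def mean_reward(theta):
    """Average reward when the noise is chosen as w = mu + k * obs."""
    mu, k = theta
    total = 0.0
    for obs in OBS_GRID:
        noise = mu + k * obs               # latent-noise policy picks w
        action = base_policy(obs, noise)   # frozen policy decodes w into an action
        total += task_reward(action)
    return total / len(OBS_GRID)

def steer(iterations=300, step=0.1, seed=0):
    """Hill-climbing random search over (mu, k), using only reward feedback."""
    rng = random.Random(seed)
    theta = (0.0, 0.0)
    best = mean_reward(theta)
    for _ in range(iterations):
        candidate = (theta[0] + rng.gauss(0.0, step),
                     theta[1] + rng.gauss(0.0, step))
        score = mean_reward(candidate)
        if score > best:                   # keep a candidate only if it improves
            theta, best = candidate, score
    return theta, best

initial_score = mean_reward((0.0, 0.0))
steered_theta, steered_score = steer()
```

Comparing `steered_score` with `initial_score` shows how much reward the search recovers while `base_policy` stays untouched. A real diffusion policy has a far higher-dimensional noise input, so this toy says nothing about sample efficiency at scale.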

If this is right

DSRL achieves high sample efficiency compared with standard reinforcement learning for policy improvement.
Real-world robotic tasks become amenable to autonomous online adaptation without collecting new human demonstrations.
Only black-box access to the base policy is needed, simplifying integration with existing systems.
Pretrained generalist diffusion policies can be adapted effectively to specific tasks.
Challenges of directly fine-tuning diffusion model weights are avoided entirely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same latent-space optimization idea could extend to other generative policy architectures used in robotics.
Continuous deployment might allow robots to keep improving their behavior over time as the environment changes.
The success on real hardware suggests the latent space naturally encodes action variations that reinforcement learning can exploit efficiently.

Load-bearing premise

Optimizing actions via RL in the diffusion model's latent noise space produces meaningful policy improvements without access to model gradients or internal weights and remains stable across real-world robotic tasks.

What would settle it

Apply DSRL to a physical robot on a new manipulation task and measure whether success rate rises substantially after fewer than 200 real-world trials compared with a baseline that requires additional demonstrations.
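With so few trials, a rise in success rate is only convincing if the uncertainty is reported with it. The sketch below shows one common way to do that, a Wilson score interval on a binomial success rate. It is an illustration of how such a test could be scored, not a procedure taken from the paper, and `wilson_interval` and `clearly_improved` are hypothetical helper names.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def clearly_improved(before, after):
    """True if the 'after' interval lies entirely above the 'before' interval.

    Each argument is a (successes, trials) pair from evaluation rollouts.
    """
    _, before_hi = wilson_interval(*before)
    after_lo, _ = wilson_interval(*after)
    return after_lo > before_hi
```

For example, `clearly_improved((4, 20), (18, 20))` compares a policy that succeeded in 4 of 20 evaluation rollouts before adaptation with one that succeeded in 18 of 20 afterwards. These counts are made up for illustration. Non-overlapping intervals are a conservative criterion, so a failure to separate does not by itself show there was no improvement.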

Original abstract

Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Diffusion Steering via Reinforcement Learning (DSRL), which adapts pretrained diffusion-based behavioral cloning policies by running RL directly in the model's latent noise space. The central claims are that this yields high sample efficiency for autonomous policy improvement, requires only black-box access to the base policy (no gradients or weight changes), and works effectively on simulated benchmarks, real-world robotic tasks, and adaptation of generalist policies.

Significance. If the sample-efficiency results hold under rigorous baselines, the work would be significant for robotics: it offers a practical route to online improvement of high-performing BC policies without the usual costs of additional demonstrations or full diffusion-model finetuning. The black-box framing and avoidance of internal access are genuine strengths that could broaden RL applicability in contact-rich domains.

major comments (3)

[§4 and §5.2] §4 (Method) and §5.2 (Simulated Experiments): the claim that DSRL is 'highly sample efficient' rests on treating the initial noise vector as the RL action; however, no analysis is given of the effective dimensionality or sensitivity of the reward surface after the multi-step denoising process. Without this, it is unclear whether standard black-box RL (PPO/ES) can reliably find improvements with the reported interaction counts.
[Table 2 and §5.3] Table 2 and §5.3 (Real-world Tasks): the reported success rates and sample counts for contact-rich manipulation are presented without direct comparison to action-space RL baselines or to gradient-based diffusion finetuning; this comparison is load-bearing for the assertion that latent-space RL avoids the usual sample inefficiency of RL.
[§5.4] §5.4 (Generalist Policy Adaptation): the experiments show improvement over the pretrained policy, but the paper does not report whether the RL optimizer remains stable when the noise dimension is large (typical for modern diffusion policies) or whether reward shaping was required; both details are necessary to evaluate the 'black-box only' practicality claim.

minor comments (3)

[Abstract] Abstract: the acronym DSRL is used before it is defined; expand on first use.
[Figure 3] Figure 3: axis labels on the sample-efficiency curves are too small; increase font size for readability.
[Related Work] Related Work: citation to prior latent-space RL methods (e.g., in continuous control) is brief; a short paragraph contrasting DSRL with those approaches would clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment below with point-by-point responses, providing clarifications based on our work and indicating revisions where they will strengthen the presentation without misrepresenting our results.

read point-by-point responses

Referee: [§4 and §5.2] §4 (Method) and §5.2 (Simulated Experiments): the claim that DSRL is 'highly sample efficient' rests on treating the initial noise vector as the RL action; however, no analysis is given of the effective dimensionality or sensitivity of the reward surface after the multi-step denoising process. Without this, it is unclear whether standard black-box RL (PPO/ES) can reliably find improvements with the reported interaction counts.

Authors: We agree that a dedicated analysis of effective dimensionality and reward-surface sensitivity after denoising would help explain the mechanism behind the observed efficiency. Our empirical results across simulated tasks demonstrate that standard black-box optimizers (PPO and ES) reliably produce large performance gains within the reported interaction budgets, which is consistent with the denoising process inducing a smoother optimization landscape of lower effective dimensionality. In the revised manuscript we will add a short discussion subsection in §4 on latent-space properties, together with a brief sensitivity study using additional rollouts to quantify this effect.
revision: partial

Referee: [Table 2 and §5.3] Table 2 and §5.3 (Real-world Tasks): the reported success rates and sample counts for contact-rich manipulation are presented without direct comparison to action-space RL baselines or to gradient-based diffusion finetuning; this comparison is load-bearing for the assertion that latent-space RL avoids the usual sample inefficiency of RL.

Authors: We acknowledge that explicit side-by-side comparisons would make the sample-efficiency argument more compelling. Our real-world experiments prioritize settings where large-scale data collection is costly; the reported interaction counts (a few hundred steps) already yield high success rates on contact-rich tasks. We will expand the discussion in §5.3 and add a new paragraph that places our sample budgets in context with published action-space RL results on comparable manipulation benchmarks, while also clarifying why gradient-based diffusion finetuning was outside the black-box scope of the study.
revision: partial

Referee: [§5.4] §5.4 (Generalist Policy Adaptation): the experiments show improvement over the pretrained policy, but the paper does not report whether the RL optimizer remains stable when the noise dimension is large (typical for modern diffusion policies) or whether reward shaping was required; both details are necessary to evaluate the 'black-box only' practicality claim.

Authors: We thank the referee for this observation. In the §5.4 experiments we applied unmodified PPO to the latent noise vectors of the generalist policies without any reward shaping or auxiliary terms; the optimizer remained stable across the tested (higher-dimensional) noise spaces and produced consistent policy improvements. We will revise §5.4 to explicitly state the optimizer configuration, confirm the absence of reward shaping, and report stability observations for the larger noise dimensions.
revision: yes

Circularity Check

0 steps flagged

No circularity: method combines standard RL and diffusion components independently

full rationale

The paper defines DSRL as running RL directly over the latent noise space of a pretrained diffusion policy, treating the diffusion model strictly as a black-box forward map. This construction uses existing RL algorithms (e.g., PPO or ES) on the noise inputs without re-deriving or fitting any quantities from the target policy's outputs. No equations or claims reduce a prediction to a fitted parameter by construction, and no load-bearing step relies on a self-citation whose content is itself unverified or tautological. The sample-efficiency and real-world improvement assertions are presented as empirical outcomes rather than algebraic identities, making the derivation self-contained against external benchmarks.
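The "black-box forward map" framing can be made concrete with a small wrapper. The sketch below is an illustration, not code from the paper: `ToyReachEnv`, `LatentNoiseEnv`, and `frozen_policy` are hypothetical names, and the environment interface is a simplified reset/step pair rather than any particular library's API. The wrapper exposes the noise as the agent's action, so an off-the-shelf RL algorithm could in principle be trained against it without seeing the policy's internals.

```python
class ToyReachEnv:
    """Hypothetical 1-D task: move a point from 0.0 toward a goal at 1.0."""

    def reset(self):
        self.pos = 0.0
        self.t = 0
        return self.pos

    def step(self, action):
        self.pos += max(-0.2, min(0.2, action))  # clip the motion per step
        self.t += 1
        reward = -abs(1.0 - self.pos)            # closer to the goal is better
        done = self.t >= 10
        return self.pos, reward, done


class LatentNoiseEnv:
    """Wraps an environment and a frozen policy so the agent acts in noise space."""

    def __init__(self, env, policy):
        self.env = env
        self.policy = policy   # only called, never differentiated or updated
        self.obs = None

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, noise):
        action = self.policy(self.obs, noise)    # black-box forward pass
        self.obs, reward, done = self.env.step(action)
        return self.obs, reward, done


def frozen_policy(obs, noise):
    """Toy stand-in for a pretrained policy: the noise sets the step size."""
    return 0.1 * noise


def rollout(env, choose_noise):
    """Run one episode, picking the noise from the current observation."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(choose_noise(obs))
        total += reward
    return total
```

Calling `rollout(LatentNoiseEnv(ToyReachEnv(), frozen_policy), choose_noise)` with different `choose_noise` functions shows that the return depends on the noise choice alone. That dependence is the quantity a latent-space RL agent would optimize.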

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the latent noise space of a diffusion policy is a suitable and stable search space for black-box RL, plus standard RL assumptions about reward signals and exploration.

free parameters (1)

RL algorithm hyperparameters
Learning rate, exploration noise scale, and episode length are typical free parameters in any RL method and would need to be chosen or tuned for DSRL.

axioms (1)

domain assumption: Black-box access to the diffusion policy is sufficient to evaluate and improve behavior via latent-space optimization.
Invoked when the paper states that DSRL requires only black-box access and avoids weight modification.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "adapting the BC policy by running RL over its latent-noise space... modify the initial distribution of w with an RL-trained latent-noise space policy πW"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear
Paper passage: "diffusion steering via reinforcement learning (DSRL)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
cs.LG 2026-05 unverdicted novelty 7.0

Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
cs.RO 2026-03 conditional novelty 7.0

Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
PlayWorld: Learning Robot World Models from Autonomous Play
cs.RO 2026-03 unverdicted novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
Action-to-Action Flow Matching
cs.RO 2026-02 unverdicted novelty 7.0

A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
cs.LG 2026-05 unverdicted novelty 6.0

Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
cs.RO 2026-05 unverdicted novelty 6.0

Diff-CAST replaces GAN discriminators with diffusion-based priors and adds symmetric command conditioning plus constrained RL to enable versatile, drift-free, and hardware-safe quadruped locomotion.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
cs.AI 2026-05 unverdicted novelty 6.0

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
cs.RO 2026-05 unverdicted novelty 6.0

Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
cs.LG 2026-05 unverdicted novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
cs.RO 2026-05 unverdicted novelty 6.0

Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
cs.RO 2026-03 conditional novelty 6.0

ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
What Does Flow Matching Bring To TD Learning?
cs.LG 2026-03 conditional novelty 6.0

Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
cs.RO 2026-02 unverdicted novelty 6.0

Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...
MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy
cs.RO 2025-11 unverdicted novelty 6.0

MATT-Diff uses a diffusion model with vision transformer and attention to generate multimodal actions for active multi-target tracking from expert planner demonstrations.
Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
cs.RO 2026-05 unverdicted novelty 5.0

Diff-CAST replaces GAN discriminators with diffusion priors and adds symmetric tracking plus constrained RL to enable diverse, drift-free, hardware-compliant quadruped locomotion from heterogeneous datasets.
Towards Robotic Dexterous Hand Intelligence: A Survey
cs.RO 2026-05 unverdicted novelty 4.0

A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 17 Pith papers · 30 internal anchors

[1]

Stepputtis, J

S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor. Language- conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33:13139–13150, 2020

work page 2020
[2]

N. M. Shafiullah, Z. Cui, A. A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022

work page 2022
[3]

J. Gu, S. Kirmani, P. Wohlhart, Y . Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023

work page arXiv 2023
[4]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

T. Z. Zhao, J. Tompson, D. Driess, P. Florence, K. Ghasemipour, C. Finn, and A. Wahid. Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024

work page arXiv 2024
[6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

work page 2023
[8]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[10]

Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

K. Team, A. Du, B. Gao, B. Xing, C. Jiang, C. Chen, C. Li, C. Xiao, C. Du, C. Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Sohl-Dickstein, E

J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learn- ing using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. pmlr, 2015. 12

work page 2015
[14]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

work page 2020
[15]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[16]

Ankile, A

L. Ankile, A. Simeonov, I. Shenfeld, and P. Agrawal. Juicer: Data-efficient imitation learning for robotic assembly. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5096–5103. IEEE, 2024

work page 2024
[17]

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Sridhar, D

A. Sridhar, D. Shah, C. Glossop, and S. Levine. Nomad: Goal masked diffusion policies for navigation and exploration. In 2024 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 63–70. IEEE, 2024

work page 2024
[19]

The Ingredi- ents for Robotic Diffusion Transformers,

S. Dasari, O. Mees, S. Zhao, M. K. Srirama, and S. Levine. The ingredients for robotic diffu- sion transformers. arXiv preprint arXiv:2410.10088, 2024

work page arXiv 2024
[20]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burch- fiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

M. S. Mark, T. Gao, G. G. Sampaio, M. K. Srirama, A. Sharma, C. Finn, and A. Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024

work page arXiv 2024
[23]

S. Park, Q. Li, and S. Levine. Flow q-learning. arXiv preprint arXiv:2502.02538, 2025

work page arXiv 2025
[24]

B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and autonomous systems, 57(5):469–483, 2009

work page 2009
[25]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

work page 2011
[26]

End to End Learning for Self-Driving Cars

M. Bojarski. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[27]

Zhang, Z

T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imita- tion learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (ICRA), pages 5628–5635. IEEE, 2018

work page 2018
[28]

Rahmatizadeh, P

R. Rahmatizadeh, P. Abolghasemi, L. B ¨ol¨oni, and S. Levine. Vision-based multi-task manip- ulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3758–3765. IEEE, 2018

work page 2018
[30]

RT-1: Robotics Transformer for Real-World Control at Scale

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022. 13

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

S. Nair, E. Mitchell, K. Chen, S. Savarese, C. Finn, et al. Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Conference on Robot Learning, pages 1303–1315. PMLR, 2022

work page 2022
[32]

Z. J. Cui, Y . Wang, N. M. M. Shafiullah, and L. Pinto. From play to policy: Conditional behavior generation from uncurated robot data. arXiv preprint arXiv:2210.10047, 2022

work page arXiv 2022
[33]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero- shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Planning with Diffusion for Flexible Behavior Synthesis

M. Janner, Y . Du, J. B. Tenenbaum, and S. Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

A. Ajay, Y . Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Guided flows for generative modeling and decision making.arXiv preprint arXiv:2311.13443, 2023

Q. Zheng, M. Le, N. Shaul, Y . Lipman, A. Grover, and R. T. Chen. Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443, 2023

work page arXiv 2023
[37]

W. Li, X. Wang, B. Jin, and H. Zha. Hierarchical diffusion for offline decision making. In International Conference on Machine Learning, pages 20035–20064. PMLR, 2023

work page 2023
[38]

Liang, Y

Z. Liang, Y . Mu, M. Ding, F. Ni, M. Tomizuka, and P. Luo. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023

work page arXiv 2023
[39]

H. T. Suh, G. Chou, H. Dai, L. Yang, A. Gupta, and R. Tedrake. Fighting uncertainty with gradients: Offline reinforcement learning via diffusion score matching. In Conference on Robot Learning, pages 2878–2904. PMLR, 2023

work page 2023
[40]

C. Chen, F. Deng, K. Kawaguchi, C. Gulcehre, and S. Ahn. Simple hierarchical planning with diffusion. arXiv preprint arXiv:2401.02644, 2024

work page arXiv 2024
[41]

L. Wang, J. Zhao, Y . Du, E. H. Adelson, and R. Tedrake. Poco: Policy composition from and for heterogeneous robot learning. arXiv preprint arXiv:2402.02511, 2024

work page arXiv 2024
[42]

C. Lu, H. Chen, J. Chen, H. Su, C. Li, and J. Zhu. Contrastive energy prediction for exact energy-guided diffusion sampling in offline reinforcement learning. In International Confer- ence on Machine Learning, pages 22825–22855. PMLR, 2023

work page 2023
[43]

B. Kang, X. Ma, C. Du, T. Pang, and S. Yan. Efficient diffusion policies for offline rein- forcement learning. Advances in Neural Information Processing Systems , 36:67195–67212, 2023

work page 2023
[44]

Zhang, W

S. Zhang, W. Zhang, and Q. Gu. Energy-weighted flow matching for offline reinforcement learning. arXiv preprint arXiv:2503.04975, 2025

work page arXiv 2025
[45]

S. Ding, K. Hu, Z. Zhang, K. Ren, W. Zhang, J. Yu, J. Wang, and Y . Shi. Diffusion- based reinforcement learning via q-weighted variational policy optimization. arXiv preprint arXiv:2405.16173, 2024

work page arXiv 2024
[46]

Z. Wang, J. J. Hunt, and M. Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[47]

L. He, L. Shen, L. Zhang, J. Tan, and X. Wang. Diffcps: Diffusion model based constrained policy search for offline reinforcement learning. arXiv preprint arXiv:2310.05333, 2023

[48]

Z. Ding and C. Jin. Consistency models as a rich and efficient policy class for reinforcement learning. arXiv preprint arXiv:2309.16984, 2023

[49]

S. E. Ada, E. Oztop, and E. Ugur. Diffusion policies for out-of-distribution generalization in offline reinforcement learning. IEEE Robotics and Automation Letters, 9(4):3116–3123, 2024

[50]

R. Zhang, Z. Luo, J. Sjölund, T. Schön, and P. Mattsson. Entropy-regularized diffusion policy with q-ensembles for offline reinforcement learning. Advances in Neural Information Processing Systems, 37:98871–98897, 2024

[51]

H. Chen, C. Lu, C. Ying, H. Su, and J. Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548, 2022

[52]

P. Hansen-Estruch, I. Kostrikov, M. Janner, J. G. Kuba, and S. Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573, 2023

[53]

L. He, L. Shen, J. Tan, and X. Wang. Aligniql: Policy alignment in implicit q-learning through constrained optimization. arXiv preprint arXiv:2405.18187, 2024

[54]

M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance. arXiv preprint arXiv:2410.13816, 2024

[55]

S. Venkatraman, S. Khaitan, R. T. Akella, J. Dolan, J. Schneider, and G. Berseth. Reasoning with latent diffusion in offline reinforcement learning. arXiv preprint arXiv:2309.06599, 2023

[56]

H. Chen, K. Zheng, H. Su, and J. Zhu. Aligning diffusion behaviors with q-functions for efficient continuous control. arXiv preprint arXiv:2407.09024, 2024

[57]

T. Chen, Z. Wang, and M. Zhou. Diffusion policies creating a trust region for offline reinforcement learning. arXiv preprint arXiv:2405.19690, 2024

[58]

H. Chen, C. Lu, Z. Wang, H. Su, and J. Zhu. Score regularized policy optimization through diffusion behavior. arXiv preprint arXiv:2310.07297, 2023

[59]

L. Mao, H. Xu, X. Zhan, W. Zhang, and A. Zhang. Diffusion-dice: In-sample diffusion guidance for offline reinforcement learning. arXiv preprint arXiv:2407.20109, 2024

[60]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

[61]

M. Psenka, A. Escontrela, P. Abbeel, and Y. Ma. Learning a diffusion model policy from rewards via q-score matching. arXiv preprint arXiv:2312.11752, 2023

[62]

L. Yang, Z. Huang, F. Lei, Y. Zhong, Y. Yang, C. Fang, S. Wen, B. Zhou, and Z. Lin. Policy representation via diffusion probability model for reinforcement learning. arXiv preprint arXiv:2305.13122, 2023

[63]

S. Li, R. Krohn, T. Chen, A. Ajay, P. Agrawal, and G. Chalvatzaki. Learning multimodal behaviors from scratch with diffusion policy gradient. Advances in Neural Information Processing Systems, 37:38456–38479, 2024

[64]

L. Ankile, A. Simeonov, I. Shenfeld, M. Torne, and P. Agrawal. From imitation to refinement – residual rl for precise assembly. arXiv preprint arXiv:2407.16677, 2024

[65]

X. Yuan, T. Mu, S. Tao, Y. Fang, M. Zhang, and H. Su. Policy decorator: Model-agnostic online refinement for large policy model. arXiv preprint arXiv:2412.13630, 2024

[66]

L. Eyring, S. Karthik, K. Roth, A. Dosovitskiy, and Z. Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems, 37:125487–125519, 2024

[67]

J. Mao, X. Wang, and K. Aizawa. The lottery ticket hypothesis in denoising: Towards semantic-driven initialization. In European Conference on Computer Vision, pages 93–109. Springer, 2024

[68]

D. Samuel, R. Ben-Ari, S. Raviv, N. Darshan, and G. Chechik. Generating images of rare concepts using pre-trained diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024

[69]

D. Ahn, J. Kang, S. Lee, J. Min, M. Kim, W. Jang, H. Cho, S. Paul, S. Kim, E. Cha, et al. A noise is worth diffusion guidance. arXiv preprint arXiv:2412.03895, 2024

[70]

A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhinehart, and S. Levine. Parrot: Data-driven behavioral priors for reinforcement learning. arXiv preprint arXiv:2011.10024, 2020

[71]

L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016

[72]

M. Dalal, D. Pathak, and R. R. Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021

[73]

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[74]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[75]

M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022

[76]

Y. Lipman, M. Havasi, P. Holderrieth, N. Shaul, M. Le, B. Karrer, R. T. Chen, D. Lopez-Paz, H. Ben-Hamu, and I. Gat. Flow matching guide and code. arXiv preprint arXiv:2412.06264, 2024

[77]

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, pages 387–395. PMLR, 2014

[78]

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

[79]

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

[80]

A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020

[81]

I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021

