Steering Your Diffusion Policy with Latent Space Reinforcement Learning
Pith reviewed 2026-05-17 21:51 UTC · model grok-4.3
The pith
Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce diffusion steering via reinforcement learning (DSRL) that adapts a behavioral cloning policy by running reinforcement learning over the latent noise space of a diffusion model. DSRL requires only black-box access to the policy, avoids any modification of its weights, and produces effective policy improvement with high sample efficiency on both simulated benchmarks and real-world robotic tasks, including adaptation of pretrained generalist policies.
What carries the argument
Diffusion steering via reinforcement learning (DSRL), the process of running RL to choose better noise inputs that steer the diffusion model's denoising trajectory toward improved actions.
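The idea of steering a frozen policy through its noise input can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: `frozen_policy`, `reward`, and `steer_noise` are hypothetical names, the "denoiser" is a hand-written stand-in, and the search is a plain cross-entropy method for one fixed state, whereas DSRL trains a state-conditioned noise policy with RL.

```python
import numpy as np

def frozen_policy(state, noise):
    """Hypothetical stand-in for a pretrained diffusion policy with a
    deterministic sampler: maps (state, noise) -> action. Treated as a black box."""
    a = noise.copy()
    for _ in range(5):  # a few toy "denoising" steps
        a = 0.5 * a + 0.5 * np.tanh(state + a)
    return a

def reward(state, action):
    """Hypothetical task reward: the action should land at state + 0.3."""
    return -float(np.sum((action - (state + 0.3)) ** 2))

def steer_noise(state, iters=30, pop=64, elite=8, seed=0):
    """Cross-entropy search over the noise input only; the policy is never modified."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros_like(state), np.ones_like(state)
    for _ in range(iters):
        ws = mean + std * rng.standard_normal((pop, state.size))
        scores = np.array([reward(state, frozen_policy(state, w)) for w in ws])
        best = ws[np.argsort(scores)[-elite:]]  # keep the highest-reward noise samples
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mean

state = np.array([0.2, -0.1])
w_star = steer_noise(state)
steered = reward(state, frozen_policy(state, w_star))
rng = np.random.default_rng(1)
baseline = float(np.mean([reward(state, frozen_policy(state, w))
                          for w in rng.standard_normal((256, state.size))]))
print("steered reward:", steered, "mean reward under N(0, I) noise:", baseline)
```

The only interface used is the forward call `frozen_policy(state, noise)`, which is what "black-box access" amounts to in this toy setting.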
If this is right
- DSRL achieves high sample efficiency compared with standard reinforcement learning for policy improvement.
- Real-world robotic tasks become amenable to autonomous online adaptation without collecting new human demonstrations.
- Only black-box access to the base policy is needed, simplifying integration with existing systems.
- Pretrained generalist diffusion policies can be adapted effectively to specific tasks.
- Challenges of directly fine-tuning diffusion model weights are avoided entirely.
Where Pith is reading between the lines
- The same latent-space optimization idea could extend to other generative policy architectures used in robotics.
- Continuous deployment might allow robots to keep improving their behavior over time as the environment changes.
- The success on real hardware suggests the latent space naturally encodes action variations that reinforcement learning can exploit efficiently.
Load-bearing premise
Optimizing actions via RL in the diffusion model's latent noise space produces meaningful policy improvements without access to model gradients or internal weights and remains stable across real-world robotic tasks.
What would settle it
Apply DSRL to a physical robot on a new manipulation task and measure whether success rate rises substantially after fewer than 200 real-world trials compared with a baseline that requires additional demonstrations.
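Whether a success rate "rises substantially" depends on how many evaluation trials back each number. One hedged way to make the comparison concrete is a Wilson score interval on each success rate; the helper names and the trial counts below are hypothetical and for illustration only, and requiring non-overlapping intervals is a deliberately conservative criterion.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def clearly_improved(before, after):
    """True when the 'after' interval lies entirely above the 'before' interval."""
    return wilson_interval(*after)[0] > wilson_interval(*before)[1]

# Hypothetical (successes, trials) counts before and after adaptation.
print(clearly_improved(before=(4, 20), after=(18, 20)))
```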
Original abstract
Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large number of samples it typically requires. In this work we take steps towards enabling fast autonomous adaptation of BC-trained policies via efficient real-world RL. Focusing in particular on diffusion policies -- a state-of-the-art BC methodology -- we propose diffusion steering via reinforcement learning (DSRL): adapting the BC policy by running RL over its latent-noise space. We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement. Furthermore, DSRL avoids many of the challenges associated with finetuning diffusion policies, obviating the need to modify the weights of the base policy at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks, and for adapting pretrained generalist policies, illustrating its sample efficiency and effective performance at real-world policy improvement.
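In symbols, the steering idea can be sketched as follows. The notation is an assumption made for this illustration: $w$ is the initial noise, $\pi^{W}$ is the RL-trained latent-noise policy, $s$ is the observation, $a$ is the action, and $\pi_{\mathrm{dp}}$ is the frozen diffusion policy, treated here as a deterministic map of $(s, w)$, which presumes a deterministic sampler.

```latex
\begin{align*}
\text{base policy:} \quad & w \sim \mathcal{N}(0, I), & a &= \pi_{\mathrm{dp}}(s, w) \\
\text{DSRL:} \quad & w \sim \pi^{W}(\cdot \mid s), & a &= \pi_{\mathrm{dp}}(s, w)
\end{align*}
```

Only the distribution of $w$ changes between the two lines; the map $\pi_{\mathrm{dp}}$ and its weights are identical in both.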
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Diffusion Steering via Reinforcement Learning (DSRL), which adapts pretrained diffusion-based behavioral cloning policies by running RL directly in the model's latent noise space. The central claims are that this yields high sample efficiency for autonomous policy improvement, requires only black-box access to the base policy (no gradients or weight changes), and works effectively on simulated benchmarks, real-world robotic tasks, and adaptation of generalist policies.
Significance. If the sample-efficiency results hold under rigorous baselines, the work would be significant for robotics: it offers a practical route to online improvement of high-performing BC policies without the usual costs of additional demonstrations or full diffusion-model finetuning. The black-box framing and avoidance of internal access are genuine strengths that could broaden RL applicability in contact-rich domains.
major comments (3)
- [§4 and §5.2] §4 (Method) and §5.2 (Simulated Experiments): the claim that DSRL is 'highly sample efficient' rests on treating the initial noise vector as the RL action; however, no analysis is given of the effective dimensionality or sensitivity of the reward surface after the multi-step denoising process. Without this, it is unclear whether standard black-box RL (PPO/ES) can reliably find improvements with the reported interaction counts.
- [Table 2 and §5.3] Table 2 and §5.3 (Real-world Tasks): the reported success rates and sample counts for contact-rich manipulation are presented without direct comparison to action-space RL baselines or to gradient-based diffusion finetuning; this comparison is load-bearing for the assertion that latent-space RL avoids the usual sample inefficiency of RL.
- [§5.4] §5.4 (Generalist Policy Adaptation): the experiments show improvement over the pretrained policy, but the paper does not report whether the RL optimizer remains stable when the noise dimension is large (typical for modern diffusion policies) or whether reward shaping was required; both details are necessary to evaluate the 'black-box only' practicality claim.
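The analysis requested in the first major comment could take roughly the following form. This is a sketch under stated assumptions, not something the paper reports: `toy_sampler` is a hypothetical stand-in for a frozen policy at a fixed observation, the Jacobian is estimated by finite differences (black-box calls only), and "effective dimension" is taken to be the participation ratio of the squared singular values, which is one of several possible definitions.

```python
import numpy as np

def finite_difference_jacobian(f, w, eps=1e-4):
    """Central-difference estimate of d f(w) / d w using forward calls only."""
    base = f(w)
    J = np.zeros((base.size, w.size))
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        J[:, i] = (f(w + dw) - f(w - dw)) / (2 * eps)
    return J

def effective_dimension(J):
    """Participation ratio of squared singular values: between 1 and min(J.shape)."""
    s2 = np.linalg.svd(J, compute_uv=False) ** 2
    return float(s2.sum() ** 2 / np.sum(s2 ** 2))

# Hypothetical toy sampler: 16-D noise, 4-D action, rank-2 mixing by construction.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 16))

def toy_sampler(w):
    return np.tanh(mix @ w)

J = finite_difference_jacobian(toy_sampler, np.zeros(16))
print("nominal noise dimension:", J.shape[1])
print("effective dimension:", effective_dimension(J))
```

A value well below the nominal noise dimension would indicate that only a few noise directions move the action, which is the kind of evidence the comment asks for.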
minor comments (3)
- [Abstract] Abstract: the acronym DSRL is used before it is defined; expand on first use.
- [Figure 3] Figure 3: axis labels on the sample-efficiency curves are too small; increase font size for readability.
- [Related Work] Related Work: citation to prior latent-space RL methods (e.g., in continuous control) is brief; a short paragraph contrasting DSRL with those approaches would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment below with point-by-point responses, providing clarifications based on our work and indicating revisions where they will strengthen the presentation without misrepresenting our results.
- Referee: [§4 and §5.2] §4 (Method) and §5.2 (Simulated Experiments): the claim that DSRL is 'highly sample efficient' rests on treating the initial noise vector as the RL action; however, no analysis is given of the effective dimensionality or sensitivity of the reward surface after the multi-step denoising process. Without this, it is unclear whether standard black-box RL (PPO/ES) can reliably find improvements with the reported interaction counts.
  Authors: We agree that a dedicated analysis of effective dimensionality and reward-surface sensitivity after denoising would help explain the mechanism behind the observed efficiency. Our empirical results across simulated tasks demonstrate that standard black-box optimizers (PPO and ES) reliably produce large performance gains within the reported interaction budgets, which is consistent with the denoising process inducing a smoother optimization landscape of lower effective dimension. In the revised manuscript we will add a short discussion subsection in §4 on latent-space properties, together with a brief sensitivity study using additional rollouts to quantify this effect.
  revision: partial
- Referee: [Table 2 and §5.3] Table 2 and §5.3 (Real-world Tasks): the reported success rates and sample counts for contact-rich manipulation are presented without direct comparison to action-space RL baselines or to gradient-based diffusion finetuning; this comparison is load-bearing for the assertion that latent-space RL avoids the usual sample inefficiency of RL.
  Authors: We acknowledge that explicit side-by-side comparisons would make the sample-efficiency argument more compelling. Our real-world experiments prioritize settings where large-scale data collection is costly; the reported interaction counts (a few hundred steps) already yield high success rates on contact-rich tasks. We will expand the discussion in §5.3 and add a new paragraph that places our sample budgets in context with published action-space RL results on comparable manipulation benchmarks, while also clarifying why gradient-based diffusion finetuning was outside the black-box scope of the study.
  revision: partial
- Referee: [§5.4] §5.4 (Generalist Policy Adaptation): the experiments show improvement over the pretrained policy, but the paper does not report whether the RL optimizer remains stable when the noise dimension is large (typical for modern diffusion policies) or whether reward shaping was required; both details are necessary to evaluate the 'black-box only' practicality claim.
  Authors: We thank the referee for this observation. In the §5.4 experiments we applied unmodified PPO to the latent noise vectors of the generalist policies without any reward shaping or auxiliary terms; the optimizer remained stable across the tested (higher-dimensional) noise spaces and produced consistent policy improvements. We will revise §5.4 to explicitly state the optimizer configuration, confirm the absence of reward shaping, and report stability observations for the larger noise dimensions.
  revision: yes
Circularity Check
No circularity: method combines standard RL and diffusion components independently
The paper defines DSRL as running RL directly over the latent noise space of a pretrained diffusion policy, treating the diffusion model strictly as a black-box forward map. This construction uses existing RL algorithms (e.g., PPO or ES) on the noise inputs without re-deriving or fitting any quantities from the target policy's outputs. No equations or claims reduce a prediction to a fitted parameter by construction, and no load-bearing step relies on a self-citation whose content is itself unverified or tautological. The sample-efficiency and real-world improvement assertions are presented as empirical outcomes rather than algebraic identities, making the derivation self-contained against external benchmarks.
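The "black-box forward map" framing can be made concrete with an environment wrapper in which the RL agent's action is the noise handed to the frozen policy. This is a minimal sketch under assumptions: `ToyReachEnv`, `base_policy`, and `NoiseActionEnv` are hypothetical names, the environment is a 1-D toy, and the interface is a simplified reset/step pair rather than any particular RL library's API.

```python
import numpy as np

class ToyReachEnv:
    """Hypothetical 1-D environment: move a point toward the origin."""
    def reset(self):
        self.x = 1.0
        return np.array([self.x])

    def step(self, action):
        self.x += float(np.clip(action[0], -0.5, 0.5))
        reward = -abs(self.x)
        done = abs(self.x) < 0.05
        return np.array([self.x]), reward, done

def base_policy(obs, noise):
    """Hypothetical frozen policy: a deterministic map from (obs, noise) to action."""
    return -0.2 * obs + 0.1 * noise

class NoiseActionEnv:
    """Exposes the noise input as the action space; the base policy stays frozen."""
    def __init__(self, env, policy):
        self.env, self.policy = env, policy

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, noise):
        action = self.policy(self.obs, noise)  # forward call only, no gradients
        self.obs, reward, done = self.env.step(action)
        return self.obs, reward, done

wrapped = NoiseActionEnv(ToyReachEnv(), base_policy)
obs = wrapped.reset()
obs, reward, done = wrapped.step(np.array([0.0]))  # zero noise: base behaviour
print(obs, reward, done)
```

Any off-the-shelf RL algorithm that can act in a continuous action space could, in principle, be pointed at the wrapped environment without knowing that a diffusion policy sits inside it.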
Axiom & Free-Parameter Ledger
free parameters (1)
- RL algorithm hyperparameters
axioms (1)
- domain assumption: Black-box access to the diffusion policy is sufficient to evaluate and improve behavior via latent-space optimization.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "adapting the BC policy by running RL over its latent-noise space... modify the initial distribution of w with an RL-trained latent-noise space policy πW"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat_induction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "diffusion steering via reinforcement learning (DSRL)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
- What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions
  Introduces the task of counterfactual time series forecasting with textual conditions plus a text-attribution mechanism that improves accuracy by distinguishing mutable from immutable factors.
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
  Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
- PlayWorld: Learning Robot World Models from Autonomous Play
  PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
- Action-to-Action Flow Matching
  A2A flow matching starts action generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.
- WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
  Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.
- Unified Noise Steering for Efficient Human-Guided VLA Adaptation
  UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
- Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
  Diff-CAST replaces GAN discriminators with diffusion-based priors and adds symmetric command conditioning plus constrained RL to enable versatile, drift-free, and hardware-safe quadruped locomotion.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
- When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning
  Q2RL extracts Q-functions from BC policies via minimal interactions and applies Q-gating to enable stable offline-to-online RL, outperforming baselines on manipulation benchmarks and achieving up to 100% success on-robot.
- OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
  OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
- Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
  Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
- Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
  DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
- ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
  ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
- What Does Flow Matching Bring To TD Learning?
  Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
- Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
  Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...
- MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy
  MATT-Diff uses a diffusion model with vision transformer and attention to generate multimodal actions for active multi-target tracking from expert planner demonstrations.
- Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion
  Diff-CAST replaces GAN discriminators with diffusion priors and adds symmetric tracking plus constrained RL to enable diverse, drift-free, hardware-compliant quadruped locomotion from heterogeneous datasets.
- Towards Robotic Dexterous Hand Intelligence: A Survey
  A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.