pith. machine review for the scientific record.

arxiv: 2409.00588 · v3 · submitted 2024-09-01 · 💻 cs.RO · cs.LG

Recognition: 3 theorem links


Diffusion Policy Policy Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:44 UTC · model grok-4.3

classification 💻 cs.RO cs.LG
keywords: diffusion policy · policy gradient · reinforcement learning · robot learning · fine-tuning · continuous control · sim-to-real

The pith

DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DPPO, a framework that applies policy-gradient updates from reinforcement learning to fine-tune diffusion-based policies such as Diffusion Policy. It reports that this combination delivers the highest overall success rates and training efficiency across standard continuous-control and robot-learning benchmarks, outperforming both other RL algorithms tailored to diffusion policies and policy-gradient tuning of non-diffusion parameterizations. The authors trace the gains to synergies that produce structured, on-manifold exploration, reduced training variance, and policies that remain robust when transferred to real hardware. These results matter because diffusion models have become a dominant way to represent complex robot behaviors, yet earlier work had doubted that standard RL fine-tuning would work well with them. The framework is demonstrated on pixel-based simulated tasks and on zero-shot deployment for long-horizon, multi-stage manipulation on physical robots.
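To fix ideas about the object being fine-tuned, the sketch below shows how a diffusion policy of this kind typically produces an action: a noise vector is iteratively denoised, conditioned on the observation. This is a generic DDPM-style illustration with assumed names (eps_net, N_STEPS, the linear beta schedule), not the paper's implementation.

```python
import numpy as np

N_STEPS = 10                                      # number of denoising steps (illustrative)
betas = np.linspace(1e-4, 0.1, N_STEPS)           # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample_action(eps_net, obs, action_dim, rng):
    """Draw one action by reversing the diffusion chain, conditioned on obs.

    eps_net(a_t, t, obs) is an assumed callable that predicts the noise
    added at step t; it stands in for the trained denoising network.
    """
    a = rng.standard_normal(action_dim)           # start from pure noise
    for t in reversed(range(N_STEPS)):
        eps = eps_net(a, t, obs)                  # predicted noise at this step
        # DDPM posterior mean for the previous (less noisy) action
        mean = (a - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a = mean + np.sqrt(betas[t]) * rng.standard_normal(action_dim)
        else:
            a = mean                              # final step is taken deterministically
    return a
```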

Core claim

DPPO is an algorithmic framework that fine-tunes diffusion-based policies using policy-gradient reinforcement learning. The method achieves the strongest overall performance and sample efficiency on common benchmarks compared with other RL algorithms for diffusion policies and compared with policy-gradient tuning of alternative policy parameterizations. It exploits synergies between the diffusion parameterization and policy-gradient updates that yield structured on-manifold exploration, stable training dynamics, and high policy robustness, as verified on simulated robotic tasks with pixel observations and via zero-shot transfer to physical robot hardware in long-horizon manipulation.

What carries the argument

The diffusion policy parameterization combined with policy-gradient updates, which produces structured on-manifold exploration and stable fine-tuning trajectories.
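As an editorial sketch of how that combination can be made concrete (not the paper's stated algorithm): each reverse-diffusion step draws from a Gaussian whose mean comes from the denoising network, so every step has a tractable log-density, and the whole denoising chain can be optimized with a clipped PPO-style surrogate. The helper names and clip value below are illustrative assumptions.

```python
import numpy as np

def gaussian_logp(x, mean, var):
    """Log-density of a diagonal Gaussian; each denoising step is one such draw."""
    return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)

def ppo_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """Clipped PPO objective applied per denoising step (illustrative form)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - clip, 1.0 + clip) * advantage)

# Sketch of one update: for each stored environment step, keep the recorded
# denoising chain (intermediate actions, means, variances) and an advantage
# estimate, recompute logp_new with the current network, and average the
# surrogate over both the environment horizon and the denoising steps before
# taking a gradient step on the denoising-network parameters.
```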

If this is right

  • Fine-tuning becomes practical for diffusion policies that were previously considered difficult to optimize with standard RL.
  • Policies exhibit lower variance across random seeds, reducing the number of trials needed to reach reliable performance (a minimal seed-evaluation sketch follows this list).
  • On-manifold exploration improves sample efficiency in high-dimensional continuous action spaces typical of robotics.
  • Zero-shot sim-to-real transfer succeeds for multi-stage manipulation tasks without additional real-world fine-tuning.
  • Pixel-observation tasks become trainable end-to-end with the same diffusion-plus-gradient recipe.
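The seed-variance sketch referenced above is deliberately generic: train the same recipe under several seeds, record final success rates, and report their mean and spread. The harness function train_and_evaluate and the seed list are placeholders, not anything defined in the paper.

```python
import numpy as np

def seed_variance(train_and_evaluate, seeds=(0, 1, 2, 3, 4)):
    """Run one training recipe per seed and summarize final success rates.

    train_and_evaluate(seed) -> float in [0, 1] is assumed to be supplied
    by whatever benchmark harness is in use.
    """
    scores = np.array([train_and_evaluate(seed) for seed in seeds])
    return {"mean": scores.mean(), "std": scores.std(ddof=1), "per_seed": scores}
```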

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diffusion-plus-gradient recipe may transfer to other generative policy classes such as flow-matching or score-based models.
  • Combining DPPO with offline datasets could further boost sample efficiency by blending imitation and on-policy signals.
  • The on-manifold property may reduce the sim-to-real gap in contact-rich tasks where action distributions must stay physically plausible.

Load-bearing premise

The observed performance and efficiency gains arise specifically from synergies between the diffusion parameterization and policy-gradient updates rather than from benchmark-specific hyperparameter choices or implementation details.

What would settle it

A controlled re-implementation in which a non-diffusion policy receives identical hyperparameter tuning, network capacity, and training budget yet matches or exceeds DPPO's benchmark scores would indicate the gains are not unique to the diffusion structure.
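A minimal sketch of that controlled protocol, under the assumption that both policy classes share one hyperparameter grid, one seed list, and one environment-step budget; every name here (make_diffusion_policy, make_gaussian_mlp_policy, run_rl_finetuning) is a placeholder for whatever implementations a replicator supplies.

```python
import itertools

SEEDS = [0, 1, 2]
GRID = {"lr": [1e-4, 3e-4], "clip": [0.1, 0.2], "batch_size": [256, 512]}
ENV_STEP_BUDGET = 2_000_000        # identical training budget for both policy classes

def sweep(make_policy, run_rl_finetuning):
    """Exhaustive sweep over the shared grid; returns the best mean score."""
    best = -float("inf")
    keys = list(GRID)
    for values in itertools.product(*(GRID[k] for k in keys)):
        hp = dict(zip(keys, values))
        scores = [run_rl_finetuning(make_policy(), hp, seed, ENV_STEP_BUDGET)
                  for seed in SEEDS]
        best = max(best, sum(scores) / len(scores))
    return best

# best_diffusion = sweep(make_diffusion_policy, run_rl_finetuning)
# best_gaussian  = sweep(make_gaussian_mlp_policy, run_rl_finetuning)
# If best_gaussian matches or exceeds best_diffusion under this matched
# protocol, the gains are not unique to the diffusion structure.
```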

read the original abstract

We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Diffusion Policy Policy Optimization (DPPO), a framework for fine-tuning diffusion-based policies (such as Diffusion Policy) in continuous control and robot learning using policy gradient (PG) methods. It reports that DPPO attains the strongest overall performance and efficiency on common benchmarks relative to other RL algorithms for diffusion policies and to PG fine-tuning of alternative policy parameterizations. The authors attribute these gains to unique synergies between the diffusion parameterization and PG updates that enable structured on-manifold exploration, stable training, and policy robustness. Results are shown on simulated robotic tasks with pixel observations and via zero-shot sim-to-real transfer on a long-horizon multi-stage manipulation task.

Significance. If the empirical superiority holds after equalizing hyperparameter search effort across all compared methods, the work would be significant for robot learning: it would demonstrate that PG fine-tuning is both viable and advantageous for diffusion policies, contrary to prior conjectures, and supply a practical recipe with hardware validation that could influence how diffusion policies are deployed in real-world manipulation.

major comments (2)
  1. [Experiments] Experiments section: The headline claim that performance edges arise from 'unique synergies' between diffusion parameterization and PG updates is load-bearing, yet the manuscript provides no explicit statement of the hyperparameter search budget, number of trials, or optimizer settings allocated to each baseline (other RL methods on diffusion policies and PG on alternative parameterizations). Without this information, the reported gains cannot be confidently attributed to the claimed mechanism rather than unequal tuning or implementation quality.
  2. [Method] Method section (DPPO algorithm description): The precise modifications to the standard PG update—particularly how the denoising network is treated during the policy gradient step, whether noise schedules are frozen or annealed, and how the diffusion loss is combined with the RL objective—are not stated with sufficient equation-level detail to allow independent reproduction of the reported stability benefits.
minor comments (2)
  1. [Abstract] The abstract refers to 'common benchmarks' without naming the specific environments or tasks; an explicit list (e.g., the MuJoCo or Meta-World suites used) would improve immediate readability.
  2. [Figures] Learning-curve figures would be clearer if they included shaded regions for standard deviation across random seeds rather than only mean curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that the manuscript requires additional details on hyperparameter tuning procedures and algorithmic specifications to support the claims and enable reproduction. We have revised the manuscript accordingly, as described in the point-by-point responses below.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: The headline claim that performance edges arise from 'unique synergies' between diffusion parameterization and PG updates is load-bearing, yet the manuscript provides no explicit statement of the hyperparameter search budget, number of trials, or optimizer settings allocated to each baseline (other RL methods on diffusion policies and PG on alternative parameterizations). Without this information, the reported gains cannot be confidently attributed to the claimed mechanism rather than unequal tuning or implementation quality.

    Authors: We agree that explicit documentation of the hyperparameter search effort is required to substantiate the performance claims. In the revised manuscript, we have added a new subsection to the Experiments section along with an appendix that specifies the search budget, number of trials, optimizer settings, learning rates, batch sizes, and random seeds used for every baseline, including other RL methods applied to diffusion policies and PG fine-tuning of alternative parameterizations. These additions confirm that comparable tuning resources were allocated across methods. revision: yes

  2. Referee: [Method] Method section (DPPO algorithm description): The precise modifications to the standard PG update—particularly how the denoising network is treated during the policy gradient step, whether noise schedules are frozen or annealed, and how the diffusion loss is combined with the RL objective—are not stated with sufficient equation-level detail to allow independent reproduction of the reported stability benefits.

    Authors: We thank the referee for highlighting the need for greater precision. The revised Method section now includes explicit equations detailing the policy gradient update applied to the denoising network parameters, the fixed noise schedule used during fine-tuning, and the combined objective that integrates the diffusion denoising loss with the RL term. Updated pseudocode has also been added to facilitate independent reproduction of the reported stability properties. revision: yes
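As an editorial illustration of what such an equation-level statement could look like (one hedged possibility, not the paper's actual formulation): if the K reverse-diffusion steps at each environment step are treated as extra stochastic decisions with Gaussian transition densities, a score-function gradient of the fine-tuning objective could take the form

```latex
% Illustrative form only; the paper's own objective and notation may differ.
\nabla_\theta J(\theta) \;\approx\;
\mathbb{E}\!\left[
  \sum_{t=0}^{T-1} \hat{A}_t
  \sum_{k=1}^{K} \nabla_\theta \log
  \mathcal{N}\!\left(a_t^{\,k-1};\; \mu_\theta\!\left(a_t^{\,k}, k, s_t\right),\; \sigma_k^2 I\right)
\right]
```

where a_t^K is the initial noise sample at environment step t, a_t^0 is the executed action, s_t the observation, and \hat{A}_t an advantage estimate; a clipped PPO-style surrogate would replace the plain score-function term in practice. Again, this is only a plausible shape, not the equations the authors report adding.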

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on external benchmarks

full rationale

The paper presents DPPO as an algorithmic framework for RL fine-tuning of diffusion policies and supports its claims of superior performance and efficiency exclusively through experimental comparisons on standard benchmarks against other RL methods and policy parameterizations. No mathematical derivation, equation, or ansatz is offered whose validity reduces to a fitted parameter, self-citation chain, or input by construction. The central assertions are externally falsifiable via replication of the reported tasks, random seeds, and hyperparameter protocols; any self-citations to prior diffusion-policy work serve only as background and are not invoked to prove uniqueness or forbid alternatives. The result is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the method appears to rest on standard RL assumptions and the diffusion policy parameterization already present in prior work.

pith-pipeline@v0.9.0 · 5516 in / 1019 out tokens · 19846 ms · 2026-05-16T08:44:13.356339+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL).

  • Foundation.PhiForcing phi_equation unclear

    Relation between the paper passage and the cited Recognition theorem.

    Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness.

  • Foundation.DiscretenessForcing continuous_no_isolated_zero_defect unclear

    Relation between the paper passage and the cited Recognition theorem.

    PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly

    cs.RO 2026-05 unverdicted novelty 7.0

    BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...

  2. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.

  3. ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching

    cs.RO 2026-04 unverdicted novelty 7.0

    ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...

  4. You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

    cs.RO 2026-03 conditional novelty 7.0

    Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.

  5. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  6. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 6.0

    DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).

  7. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

  8. Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

    cs.AI 2026-05 unverdicted novelty 6.0

    LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.

  9. ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.

  10. OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

    cs.LG 2026-05 unverdicted novelty 6.0

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  11. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  12. ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

    cs.RO 2026-03 conditional novelty 6.0

    ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.

  13. What Does Flow Matching Bring To TD Learning?

    cs.LG 2026-03 conditional novelty 6.0

    Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.

  14. Space Syntax-guided Post-training for Residential Floor Plan Generation

    cs.LG 2026-02 unverdicted novelty 6.0

    SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.

  15. RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection

    cs.CV 2026-02 unverdicted novelty 6.0

    RL-RIG uses a generate-reflect-edit loop with reinforcement learning to improve spatial accuracy in image generation, reporting up to 11% gains over prior open-source models on scene-graph metrics.

  16. How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?

    cs.LG 2026-02 unverdicted novelty 6.0

    ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.

  17. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  18. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  19. Driving Intents Amplify Planning-Oriented Reinforcement Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.

  20. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  21. Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems

    eess.SY 2026-04 unverdicted novelty 2.0

    A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

114 extracted references · 114 canonical work pages · cited by 18 Pith papers · 24 internal anchors
