{"paper":{"title":"Diffusion Policy Policy Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Allen Z. Ren, Anirudha Majumdar, Anthony Simeonov, Benjamin Burchfiel, Hongkai Dai, Justin Lidard, Lars L. Ankile, Max Simchowitz, Pulkit Agrawal","submitted_at":"2024-09-01T02:47:50Z","abstract_excerpt":"We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks comp"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed performance gains arise from unique synergies between the diffusion parameterization and policy-gradient updates rather than from unstated hyperparameter tuning or benchmark-specific implementation details.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DPPO fine-tunes diffusion policies via policy gradients and 
outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0e8a56129b09f2d9914d9be6f3265e57c9c1fb2af7f7b46486606240b114825a"},"source":{"id":"2409.00588","kind":"arxiv","version":3},"verdict":{"id":"4c35ceb7-a59b-4c43-896a-e49b8b58955d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:44:13.356339Z","strongest_claim":"DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations.","one_line_summary":"DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed performance gains arise from unique synergies between the diffusion parameterization and policy-gradient updates rather than from unstated hyperparameter tuning or benchmark-specific implementation details.","pith_extraction_headline":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks."},"references":{"count":114,"sample":[{"doi":"","year":2018,"title":"J. Achiam. Spinning Up in Deep Reinforcement Learning. 
2018","work_id":"0b567b9e-aa3b-432f-a0ce-bce4fd59fd26","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"A. Ajay, Y. Du, A. Gupta, J. B. Tenenbaum, T. S. Jaakkola, and P. Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Re","work_id":"643a8287-f9b1-4398-9d8e-70d8a00b57b6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"M. Alakuijala, G. Dulac-Arnold, J. Mairal, J. Ponce, and C. Schmid. Residual reinforcement learning from demonstrations. arXiv preprint arXiv:2106.08050, 2021","work_id":"eb611dc4-806c-481d-9c04-7a393d30c779","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of","work_id":"f4632aff-93d3-480a-8cf9-ef09dcd7251d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"L. Ankile, A. Simeonov, I. Shenfeld, and P. Agrawal. Juicer: Data-efficient imitation learning for robotic assembly. arXiv, 2024","work_id":"37f59bbe-8c03-49c3-b7af-ec358536c0f6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":114,"snapshot_sha256":"1d740cd6d4d3538da11e8dd180d91a1fab44c78d0ebfc72ff068d7b7af839b64","internal_anchors":24},"formal_canon":{"evidence_count":3,"snapshot_sha256":"16a7138b27ac4cb0c7f484c9f657bcb58ffdf1f1f1aa815a90791da98220ba61"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}