{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:QM2I3SS6PUQRPIEMTEONHQMQON","short_pith_number":"pith:QM2I3SS6","schema_version":"1.0","canonical_sha256":"83348dca5e7d2117a08c991cd3c19073458a4f7b0e1e2d3d3f0abf5fdd9e0872","source":{"kind":"arxiv","id":"2409.00588","version":3},"attestation_state":"computed","paper":{"title":"Diffusion Policy Policy Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Allen Z. Ren, Anirudha Majumdar, Anthony Simeonov, Benjamin Burchfiel, Hongkai Dai, Justin Lidard, Lars L. Ankile, Max Simchowitz, Pulkit Agrawal","submitted_at":"2024-09-01T02:47:50Z","abstract_excerpt":"We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks comp"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2409.00588","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2024-09-01T02:47:50Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"d1ea95de8d3a4b7a2518acbc7b245a34effaa8d966a90740b5a52f4d94d4825f","abstract_canon_sha256":"9d3d0304a1b74a8d746cb2da96dde3219bf7b97e22907935b53ce1aa1ef2b15c"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.459640Z","signature_b64":"QbkovPjcBJx21pOXSXBMWoDQiXY/KMr3XKFWi2JuwYJcrIdFpZwBDmIS7Z6Y7qF4ePRINCsYioX3TM+d7KauAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"83348dca5e7d2117a08c991cd3c19073458a4f7b0e1e2d3d3f0abf5fdd9e0872","last_reissued_at":"2026-05-17T23:38:48.459195Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.459195Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Diffusion Policy Policy Optimization","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Allen Z. Ren, Anirudha Majumdar, Anthony Simeonov, Benjamin Burchfiel, Hongkai Dai, Justin Lidard, Lars L. Ankile, Max Simchowitz, Pulkit Agrawal","submitted_at":"2024-09-01T02:47:50Z","abstract_excerpt":"We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks comp"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed performance gains arise from unique synergies between the diffusion parameterization and policy-gradient updates rather than from unstated hyperparameter tuning or benchmark-specific implementation details.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0e8a56129b09f2d9914d9be6f3265e57c9c1fb2af7f7b46486606240b114825a"},"source":{"id":"2409.00588","kind":"arxiv","version":3},"verdict":{"id":"4c35ceb7-a59b-4c43-896a-e49b8b58955d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:44:13.356339Z","strongest_claim":"DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations.","one_line_summary":"DPPO fine-tunes diffusion policies via policy gradients and outperforms prior RL approaches for diffusion policies and PG-tuned alternatives on robot benchmarks while enabling stable training and hardware deployment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed performance gains arise from unique synergies between the diffusion parameterization and policy-gradient updates rather than from unstated hyperparameter tuning or benchmark-specific implementation details.","pith_extraction_headline":"DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks."},"references":{"count":114,"sample":[{"doi":"","year":2018,"title":"J. Achiam. Spinning Up in Deep Reinforcement Learning. 2018","work_id":"0b567b9e-aa3b-432f-a0ce-bce4fd59fd26","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"A. Ajay, Y . Du, A. Gupta, J. B. Tenenbaum, T. S. Jaakkola, and P. Agrawal. Is conditional generative modeling all you need for decision making? In The Eleventh International Conference on Learning Re","work_id":"643a8287-f9b1-4398-9d8e-70d8a00b57b6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"M. Alakuijala, G. Dulac-Arnold, J. Mairal, J. Ponce, and C. Schmid. Residual reinforcement learning from demonstrations. arXiv preprint arXiv:2106.08050, 2021","work_id":"eb611dc4-806c-481d-9c04-7a393d30c779","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of","work_id":"f4632aff-93d3-480a-8cf9-ef09dcd7251d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"L. Ankile, A. Simeonov, I. Shenfeld, and P. Agrawal. Juicer: Data-efficient imitation learning for robotic assembly. arXiv, 2024","work_id":"37f59bbe-8c03-49c3-b7af-ec358536c0f6","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":114,"snapshot_sha256":"1d740cd6d4d3538da11e8dd180d91a1fab44c78d0ebfc72ff068d7b7af839b64","internal_anchors":24},"formal_canon":{"evidence_count":3,"snapshot_sha256":"16a7138b27ac4cb0c7f484c9f657bcb58ffdf1f1f1aa815a90791da98220ba61"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.00588","created_at":"2026-05-17T23:38:48.459265+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.00588v3","created_at":"2026-05-17T23:38:48.459265+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.00588","created_at":"2026-05-17T23:38:48.459265+00:00"},{"alias_kind":"pith_short_12","alias_value":"QM2I3SS6PUQR","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"QM2I3SS6PUQRPIEM","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"QM2I3SS6","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2507.07969","citing_title":"Reinforcement Learning with Action Chunking","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2507.07986","citing_title":"EXPO: Stable Reinforcement Learning with Expressive Policies","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2507.12768","citing_title":"AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2506.15799","citing_title":"Steering Your Diffusion Policy with Latent Space Reinforcement Learning","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2512.02535","citing_title":"AID: Agent Intent from Diffusion for Multi-Agent Informative Path Planning","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02924","citing_title":"How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2602.19974","citing_title":"RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2602.22507","citing_title":"Space Syntax-guided Post-training for Residential Floor Plan Generation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04333","citing_title":"What Does Flow Matching Bring To TD Learning?","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15757","citing_title":"You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15956","citing_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2506.08052","citing_title":"ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14696","citing_title":"EponaV2: Driving World Model with Comprehensive Future Reasoning","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12625","citing_title":"Driving Intents Amplify Planning-Oriented Reinforcement Learning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12625","citing_title":"Driving Intents Amplify Planning-Oriented Reinforcement Learning","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2502.05855","citing_title":"DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control","ref_index":60,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03066","citing_title":"Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems","ref_index":124,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04647","citing_title":"ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05812","citing_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05812","citing_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04647","citing_title":"ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10962","citing_title":"ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07605","citing_title":"BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17887","citing_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18839","citing_title":"One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models","ref_index":209,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON","json":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON.json","graph_json":"https://pith.science/api/pith-number/QM2I3SS6PUQRPIEMTEONHQMQON/graph.json","events_json":"https://pith.science/api/pith-number/QM2I3SS6PUQRPIEMTEONHQMQON/events.json","paper":"https://pith.science/paper/QM2I3SS6"},"agent_actions":{"view_html":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON","download_json":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON.json","view_paper":"https://pith.science/paper/QM2I3SS6","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.00588&json=true","fetch_graph":"https://pith.science/api/pith-number/QM2I3SS6PUQRPIEMTEONHQMQON/graph.json","fetch_events":"https://pith.science/api/pith-number/QM2I3SS6PUQRPIEMTEONHQMQON/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON/action/timestamp_anchor","attest_storage":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON/action/storage_attestation","attest_author":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON/action/author_attestation","sign_citation":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON/action/citation_signature","submit_replication":"https://pith.science/pith/QM2I3SS6PUQRPIEMTEONHQMQON/action/replication_record"}},"created_at":"2026-05-17T23:38:48.459265+00:00","updated_at":"2026-05-17T23:38:48.459265+00:00"}