{"paper":{"title":"Steering Your Diffusion Policy with Latent Space Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhishek Gupta, Andrew Wagenmaker, Anusha Nagabandi, Mitsuhiko Nakamoto, Seohong Park, Sergey Levine, Waleed Yagoub, Yunchu Zhang","submitted_at":"2025-06-18T18:35:57Z","abstract_excerpt":"Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large numb"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That optimizing actions via RL in the diffusion model's latent noise space will produce meaningful policy improvements without access to model gradients or internal weights, and that this optimization remains stable across real-world robotic tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DSRL steers pretrained diffusion policies for robotics by applying RL to 
their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a8626b513ab1b4211d7a1346e5015606294dc0c7040d43b8ba5938c48808798e"},"source":{"id":"2506.15799","kind":"arxiv","version":2},"verdict":{"id":"6d83199d-f644-4f80-98eb-cf9c5c9d730a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T21:51:12.070261Z","strongest_claim":"We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement.","one_line_summary":"DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That optimizing actions via RL in the diffusion model's latent noise space will produce meaningful policy improvements without access to model gradients or internal weights, and that this optimization remains stable across real-world robotic tasks.","pith_extraction_headline":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights."},"references":{"count":98,"sample":[{"doi":"","year":2020,"title":"S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor. Language-conditioned imitation learning for robot manipulation tasks. 
Advances in Neural Information Processing Systems, 33","work_id":"d1ad8cfe-2902-4f49-92c9-0e9eeb7964ba","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"N. M. Shafiullah, Z. Cui, A. A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022","work_id":"34f91594-c861-4a18-9af0-3827f8f98d8d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"G., Rao, K., Yu, W., Fu, C., Gopalakrishnan, K., Xu, Z., et al","work_id":"cfaa2d68-1521-4065-b84f-c32f5a2e061d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","ref_index":4,"cited_arxiv_id":"2405.12213","is_internal_anchor":true},{"doi":"","year":2024,"title":"Aloha unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126","work_id":"db13f99b-32b7-4745-adc2-d4cfb5799638","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":98,"snapshot_sha256":"3badc55f1c879c29c896626ff7577f88c88d151c06356fac0e08d50bb3025858","internal_anchors":30},"formal_canon":{"evidence_count":2,"snapshot_sha256":"5e1926ff1d8b414943e2a091e5097b020a55c80d5b2302a37ffa9ddfbce71923"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}