{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:SOUJZI63CLDKUJMYDPLWER3S2K","short_pith_number":"pith:SOUJZI63","schema_version":"1.0","canonical_sha256":"93a89ca3db12c6aa25981bd7624772d299dd180391dde7212029dd62572bc52f","source":{"kind":"arxiv","id":"2506.15799","version":2},"attestation_state":"computed","paper":{"title":"Steering Your Diffusion Policy with Latent Space Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhishek Gupta, Andrew Wagenmaker, Anusha Nagabandi, Mitsuhiko Nakamoto, Seohong Park, Sergey Levine, Waleed Yagoub, Yunchu Zhang","submitted_at":"2025-06-18T18:35:57Z","abstract_excerpt":"Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large numb"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2506.15799","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2025-06-18T18:35:57Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"123eb9f732b785ef2fd3accc2cdd00a4ee02ede57d79f226477b9bc4cf77dffd","abstract_canon_sha256":"663649781b6ec4cce2ed62242e79588a55edf61a4d2087d98728585e10f0b98a"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:12.935958Z","signature_b64":"5Bbh3XEan61i009PEGMhaq7z10ZXn+iOOpiYc/AsinqvRCucMnV8tYtdBDaa8gsA2zBJieveyTGiEfDhFPl+Bw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"93a89ca3db12c6aa25981bd7624772d299dd180391dde7212029dd62572bc52f","last_reissued_at":"2026-05-17T23:38:12.935251Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:12.935251Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Steering Your Diffusion Policy with Latent Space Reinforcement Learning","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.","cross_cats":["cs.LG"],"primary_cat":"cs.RO","authors_text":"Abhishek Gupta, Andrew Wagenmaker, Anusha Nagabandi, Mitsuhiko Nakamoto, Seohong Park, Sergey Levine, Waleed Yagoub, Yunchu Zhang","submitted_at":"2025-06-18T18:35:57Z","abstract_excerpt":"Robotic control policies learned from human demonstrations have achieved impressive results in many real-world applications. However, in scenarios where initial performance is not satisfactory, as is often the case in novel open-world settings, such behavioral cloning (BC)-learned policies typically require collecting additional human demonstrations to further improve their behavior -- an expensive and time-consuming process. In contrast, reinforcement learning (RL) holds the promise of enabling autonomous online policy improvement, but often falls short of achieving this due to the large numb"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That optimizing actions via RL in the diffusion model's latent noise space will produce meaningful policy improvements without access to model gradients or internal weights, and that this optimization remains stable across real-world robotic tasks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a8626b513ab1b4211d7a1346e5015606294dc0c7040d43b8ba5938c48808798e"},"source":{"id":"2506.15799","kind":"arxiv","version":2},"verdict":{"id":"6d83199d-f644-4f80-98eb-cf9c5c9d730a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T21:51:12.070261Z","strongest_claim":"We show that DSRL is highly sample efficient, requires only black-box access to the BC policy, and enables effective real-world autonomous policy improvement.","one_line_summary":"DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That optimizing actions via RL in the diffusion model's latent noise space will produce meaningful policy improvements without access to model gradients or internal weights, and that this optimization remains stable across real-world robotic tasks.","pith_extraction_headline":"Optimizing in a diffusion policy's latent noise space enables sample-efficient autonomous robotic adaptation without altering model weights."},"references":{"count":98,"sample":[{"doi":"","year":2020,"title":"S. Stepputtis, J. Campbell, M. Phielipp, S. Lee, C. Baral, and H. Ben Amor. Language- conditioned imitation learning for robot manipulation tasks. Advances in Neural Information Processing Systems, 33","work_id":"d1ad8cfe-2902-4f49-92c9-0e9eeb7964ba","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"N. M. Shafiullah, Z. Cui, A. A. Altanzaya, and L. Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022","work_id":"34f91594-c861-4a18-9af0-3827f8f98d8d","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"G., Rao, K., Yu, W., Fu, C., Gopalakrishnan, K., Xu, Z., et al","work_id":"cfaa2d68-1521-4065-b84f-c32f5a2e061d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","ref_index":4,"cited_arxiv_id":"2405.12213","is_internal_anchor":true},{"doi":"","year":2024,"title":"Aloha unleashed: A simple recipe for robot dexterity.arXiv preprint arXiv:2410.13126","work_id":"db13f99b-32b7-4745-adc2-d4cfb5799638","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":98,"snapshot_sha256":"3badc55f1c879c29c896626ff7577f88c88d151c06356fac0e08d50bb3025858","internal_anchors":30},"formal_canon":{"evidence_count":2,"snapshot_sha256":"5e1926ff1d8b414943e2a091e5097b020a55c80d5b2302a37ffa9ddfbce71923"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2506.15799","created_at":"2026-05-17T23:38:12.935403+00:00"},{"alias_kind":"arxiv_version","alias_value":"2506.15799v2","created_at":"2026-05-17T23:38:12.935403+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2506.15799","created_at":"2026-05-17T23:38:12.935403+00:00"},{"alias_kind":"pith_short_12","alias_value":"SOUJZI63CLDK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"SOUJZI63CLDKUJMY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"SOUJZI63","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":19,"internal_anchor_count":19,"sample":[{"citing_arxiv_id":"2511.11931","citing_title":"MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07322","citing_title":"Action-to-Action Flow Matching","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13193","citing_title":"Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2603.04333","citing_title":"What Does Flow Matching Bring To TD Learning?","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09030","citing_title":"PlayWorld: Learning Robot World Models from Autonomous Play","ref_index":90,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15757","citing_title":"You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2603.15956","citing_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13959","citing_title":"WarmPrior: Straightening Flow-Matching Policies with Temporal Priors","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13925","citing_title":"Towards Robotic Dexterous Hand Intelligence: A Survey","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14422","citing_title":"What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08804","citing_title":"Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08804","citing_title":"Constraint-Aware Diffusion Priors for High-Fidelity and Versatile Quadruped Locomotion","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10821","citing_title":"Unified Noise Steering for Efficient Human-Guided VLA Adaptation","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05812","citing_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23121","citing_title":"Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05812","citing_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05172","citing_title":"When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00416","citing_title":"Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03065","citing_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","ref_index":181,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K","json":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K.json","graph_json":"https://pith.science/api/pith-number/SOUJZI63CLDKUJMYDPLWER3S2K/graph.json","events_json":"https://pith.science/api/pith-number/SOUJZI63CLDKUJMYDPLWER3S2K/events.json","paper":"https://pith.science/paper/SOUJZI63"},"agent_actions":{"view_html":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K","download_json":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K.json","view_paper":"https://pith.science/paper/SOUJZI63","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2506.15799&json=true","fetch_graph":"https://pith.science/api/pith-number/SOUJZI63CLDKUJMYDPLWER3S2K/graph.json","fetch_events":"https://pith.science/api/pith-number/SOUJZI63CLDKUJMYDPLWER3S2K/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K/action/timestamp_anchor","attest_storage":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K/action/storage_attestation","attest_author":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K/action/author_attestation","sign_citation":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K/action/citation_signature","submit_replication":"https://pith.science/pith/SOUJZI63CLDKUJMYDPLWER3S2K/action/replication_record"}},"created_at":"2026-05-17T23:38:12.935403+00:00","updated_at":"2026-05-17T23:38:12.935403+00:00"}