{"paper":{"title":"Scaling Robot Learning with Semantically Imagined Experience","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Robot policies trained on data augmented by text-to-image inpainting solve unseen tasks with new objects and resist novel distractors.","cross_cats":["cs.AI","cs.CL","cs.CV","cs.LG"],"primary_cat":"cs.RO","authors_text":"Anthony Brohan, Austin Stone, Brian Ichter, Clayton Tan, Dee M, Fei Xia, Jaspiar Singh, Jodilyn Peralta, Jonathan Tompson, Karol Hausman, Su Wang, Ted Xiao, Tianhe Yu","submitted_at":"2023-02-22T18:47:51Z","abstract_excerpt":"Recent advances in robot learning have shown promise in enabling robots to perform a variety of manipulation tasks and generalize to novel scenarios. One of the key contributing factors to this progress is the scale of robot data used to train the models. To obtain large-scale datasets, prior approaches have relied on either demonstrations requiring high human involvement or engineering-heavy autonomous data collection schemes, both of which are challenging to scale. To mitigate this issue, we propose an alternative route and leverage text-to-image foundation models widely used in computer vis"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The inpainted images generated by the text-to-image diffusion model are sufficiently realistic and physically plausible that policies trained on them transfer successfully to real-world robot execution without introducing harmful artifacts or distribution shifts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Robot policies trained on data augmented by text-to-image inpainting solve unseen tasks with new objects and resist novel distractors.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f8221ef6bd7733798d4ff7504725b3d1ddba8f89cdb562c3128162a553a876f7"},"source":{"id":"2302.11550","kind":"arxiv","version":1},"verdict":{"id":"620568ac-7b7f-4ffd-b872-914be9cbeb90","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T18:52:17.658791Z","strongest_claim":"manipulation policies trained on data augmented this way are able to solve completely unseen tasks with new objects and can behave more robustly w.r.t. novel distractors.","one_line_summary":"Augmenting robot datasets via diffusion-based semantic inpainting enables manipulation policies to solve unseen tasks with new objects and improves robustness to novel distractors.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The inpainted images generated by the text-to-image diffusion model are sufficiently realistic and physically plausible that policies trained on them transfer successfully to real-world robot execution without introducing harmful artifacts or distribution shifts.","pith_extraction_headline":"Robot policies trained on data augmented by text-to-image inpainting solve unseen tasks with new objects and resist novel distractors."},"references":{"count":78,"sample":[{"doi":"","year":2022,"title":"VIMA : General robot manipulation with multimodal prompts","work_id":"7b5f6cce-bbaa-40ed-8b09-7330832dd736","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","ref_index":2,"cited_arxiv_id":"2212.06817","is_internal_anchor":true},{"doi":"","year":2022,"title":"M. Shridhar, L. Manuelli, and D. Fox. Cliport: What and where pathways for robotic manipulation. In Conference on Robot Learning, 2022","work_id":"30fc4fa5-1578-4727-8698-e4e6d6d06872","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Perceiver-Actor: A multi-task transformer for robotic manipulation","work_id":"b20db57f-09e7-4916-b4b5-9e7b95f2bd97","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","work_id":"0c6a768b-70b8-4242-bb0e-459f1008c9fc","ref_index":5,"cited_arxiv_id":"2204.06125","is_internal_anchor":true}],"resolved_work":78,"snapshot_sha256":"90377220c7b4b8ee0ad5f2180fc6b4908eacd4a2c1df69aff970ff3e751df9cc","internal_anchors":21},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b39de740a6e5c54404793b66f82288d44b63b0d6218f7f686d68e0079cadac7e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}