{"paper":{"title":"Video Generators are Robot Policies","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Carl Vondrick, Junbang Liang, Paarth Shah, Pavel Tokmakov, Rares Ambrus, Ruoshi Liu, Sruthi Sudhakar","submitted_at":"2025-08-01T17:23:49Z","abstract_excerpt":"Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation that can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that the video generator produces videos whose implied actions are both feasible and optimal for the robot, without introducing dynamics that do not match the physical system","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior 
cloning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c8167f37032029487d31e32dbc2a7f3f97e5345635d09bf9c8b64b944f14f53a"},"source":{"id":"2508.00795","kind":"arxiv","version":1},"verdict":{"id":"833cdadc-8905-4dc4-81dd-c9cbccea6062","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T21:40:58.620100Z","strongest_claim":"learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency","one_line_summary":"Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that the video generator produces videos whose implied actions are both feasible and optimal for the robot, without introducing dynamics that do not match the physical system","pith_extraction_headline":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them."},"references":{"count":63,"sample":[{"doi":"","year":1995,"title":"M. Bain and C. Sammut. A framework for behavioural cloning. In Machine intelligence 15, pages 103–129, 1995","work_id":"0d1cbebc-440b-42b4-b9b3-db36b6cf40be","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. 
In RSS, 2023","work_id":"15569f87-3ba9-410a-b163-979639add640","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In RSS, 2022","work_id":"150cd0c4-b39a-4fe7-9119-ad8e42133820","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. In RSS, 2024","work_id":"8b84caca-9bd4-4a9d-8ed5-0ca81a86f547","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. RSS, 2025","work_id":"a26bc36c-e311-4408-812d-bb59153fcbe0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"b36c0947e9a8c9133a217e4eab11789d405cdf155fcf08939490dd7799d135ef","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2c7317f3b91e5c496f90a5d0adf37f5b3197e23f104d56f0ef7aef2b31cd8300"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}