{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:BQDFLROG4STK3HB6IUQSMANU74","short_pith_number":"pith:BQDFLROG","schema_version":"1.0","canonical_sha256":"0c0655c5c6e4a6ad9c3e45212601b4ff3af1553ebbe9880dfd774649f1429ae8","source":{"kind":"arxiv","id":"2508.00795","version":1},"attestation_state":"computed","paper":{"title":"Video Generators are Robot Policies","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Carl Vondrick, Junbang Liang, Paarth Shah, Pavel Tokmakov, Rares Ambrus, Ruoshi Liu, Sruthi Sudhakar","submitted_at":"2025-08-01T17:23:49Z","abstract_excerpt":"Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation that can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2508.00795","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2025-08-01T17:23:49Z","cross_cats_sorted":[],"title_canon_sha256":"a444d10e3f74deb76778d265a9478199e5e5cd0b7f6e5b90f7db6678eca95ff5","abstract_canon_sha256":"5e72f95ab4a2684d67183c98a995b8cb49ff1d51ed5898e203b6071544df66aa"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.057425Z","signature_b64":"0817qYkn3aS7+kZ0jfdFXetT60zoHBIj+RqJ1kwMqGCZoDOEredNBLB8NobSgEzrMHpH4T1GCbIOiMWRZfqFBg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0c0655c5c6e4a6ad9c3e45212601b4ff3af1553ebbe9880dfd774649f1429ae8","last_reissued_at":"2026-05-17T23:38:50.056926Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.056926Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Video Generators are Robot Policies","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Carl Vondrick, Junbang Liang, Paarth Shah, Pavel Tokmakov, Rares Ambrus, Ruoshi Liu, Sruthi Sudhakar","submitted_at":"2025-08-01T17:23:49Z","abstract_excerpt":"Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation that can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that the video generator produces videos whose implied actions are both feasible and optimal for the robot, without introducing dynamics that do not match the physical system","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c8167f37032029487d31e32dbc2a7f3f97e5345635d09bf9c8b64b944f14f53a"},"source":{"id":"2508.00795","kind":"arxiv","version":1},"verdict":{"id":"833cdadc-8905-4dc4-81dd-c9cbccea6062","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T21:40:58.620100Z","strongest_claim":"learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency","one_line_summary":"Training models to generate videos of robot actions produces policies that generalize better to new objects and tasks while using far less demonstration data than standard behavior cloning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that the video generator produces videos whose implied actions are both feasible and optimal for the robot, without introducing dynamics that do not match the physical system","pith_extraction_headline":"Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them."},"references":{"count":63,"sample":[{"doi":"","year":1995,"title":"M. Bain and C. Sammut. A framework for behavioural cloning. In Machine intelligence 15 , pages 103–129, 1995","work_id":"0d1cbebc-440b-42b4-b9b3-db36b6cf40be","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023","work_id":"15569f87-3ba9-410a-b163-979639add640","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In RSS, 2022","work_id":"150cd0c4-b39a-4fe7-9119-ad8e42133820","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. In RSS, 2024","work_id":"8b84caca-9bd4-4a9d-8ed5-0ca81a86f547","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. RSS, 2025","work_id":"a26bc36c-e311-4408-812d-bb59153fcbe0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":63,"snapshot_sha256":"b36c0947e9a8c9133a217e4eab11789d405cdf155fcf08939490dd7799d135ef","internal_anchors":7},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2c7317f3b91e5c496f90a5d0adf37f5b3197e23f104d56f0ef7aef2b31cd8300"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2508.00795","created_at":"2026-05-17T23:38:50.057014+00:00"},{"alias_kind":"arxiv_version","alias_value":"2508.00795v1","created_at":"2026-05-17T23:38:50.057014+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2508.00795","created_at":"2026-05-17T23:38:50.057014+00:00"},{"alias_kind":"pith_short_12","alias_value":"BQDFLROG4STK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"BQDFLROG4STK3HB6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"BQDFLROG","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2603.00110","citing_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09030","citing_title":"PlayWorld: Learning Robot World Models from Autonomous Play","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2512.15692","citing_title":"mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16666","citing_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12090","citing_title":"World Action Models: The Next Frontier in Embodied AI","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12167","citing_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06222","citing_title":"When to Trust Imagination: Adaptive Action Execution for World Action Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06222","citing_title":"When to Trust Imagination: Adaptive Action Execution for World Action Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06192","citing_title":"EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15922","citing_title":"World Action Models are Zero-shot Policies","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00080","citing_title":"World Model for Robot Learning: A Comprehensive Survey","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00078","citing_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12908","citing_title":"Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \\rightarrow G$): Vision-Geometry Backbones over Language and Video Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11135","citing_title":"AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08168","citing_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07514","citing_title":"Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06168","citing_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04502","citing_title":"Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13645","citing_title":"A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15483","citing_title":"${\\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities","ref_index":101,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74","json":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74.json","graph_json":"https://pith.science/api/pith-number/BQDFLROG4STK3HB6IUQSMANU74/graph.json","events_json":"https://pith.science/api/pith-number/BQDFLROG4STK3HB6IUQSMANU74/events.json","paper":"https://pith.science/paper/BQDFLROG"},"agent_actions":{"view_html":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74","download_json":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74.json","view_paper":"https://pith.science/paper/BQDFLROG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2508.00795&json=true","fetch_graph":"https://pith.science/api/pith-number/BQDFLROG4STK3HB6IUQSMANU74/graph.json","fetch_events":"https://pith.science/api/pith-number/BQDFLROG4STK3HB6IUQSMANU74/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74/action/timestamp_anchor","attest_storage":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74/action/storage_attestation","attest_author":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74/action/author_attestation","sign_citation":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74/action/citation_signature","submit_replication":"https://pith.science/pith/BQDFLROG4STK3HB6IUQSMANU74/action/replication_record"}},"created_at":"2026-05-17T23:38:50.057014+00:00","updated_at":"2026-05-17T23:38:50.057014+00:00"}