{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:B3MFR4N7YWZSXPIWNIYG62VWMM","short_pith_number":"pith:B3MFR4N7","schema_version":"1.0","canonical_sha256":"0ed858f1bfc5b32bbd166a306f6ab6632284e5e44e96a97f8ee3ab25c759d4b4","source":{"kind":"arxiv","id":"2409.16283","version":1},"attestation_state":"computed","paper":{"title":"Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Generating human videos from web data lets a single robot policy manipulate unseen objects and novel motions without fine-tuning.","cross_cats":["cs.CV","cs.LG","eess.IV"],"primary_cat":"cs.RO","authors_text":"Abhinav Gupta, Carl Doersch, Debidatta Dwibedi, Dhruv Shah, Dorsa Sadigh, Fei Xia, Homanga Bharadhwaj, Sean Kirmani, Shubham Tulsiani, Ted Xiao","submitted_at":"2024-09-24T17:57:33Z","abstract_excerpt":"How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2409.16283","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.RO","submitted_at":"2024-09-24T17:57:33Z","cross_cats_sorted":["cs.CV","cs.LG","eess.IV"],"title_canon_sha256":"3f9ca604e7808f1291056c97bf44e06cc35ffd74c485e0535211086a76b907c9","abstract_canon_sha256":"34fb22a907aed15928b0a7e293e229a831eafa8c063e11d912c7ca183fdceb80"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.572957Z","signature_b64":"yKTtTdgTvE3U3YnSpBOogbmgf8vZK1RiU5GGuC3y9Olz+9lSmnK/5ZiOD29Z/pxB/XqUJJE+GKvXdwD7fT36Dw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"0ed858f1bfc5b32bbd166a306f6ab6632284e5e44e96a97f8ee3ab25c759d4b4","last_reissued_at":"2026-05-17T23:38:52.572500Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.572500Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Generating human videos from web data lets a single robot policy manipulate unseen objects and novel motions without fine-tuning.","cross_cats":["cs.CV","cs.LG","eess.IV"],"primary_cat":"cs.RO","authors_text":"Abhinav Gupta, Carl Doersch, Debidatta Dwibedi, Dhruv Shah, Dorsa Sadigh, Fei Xia, Homanga Bharadhwaj, Sean Kirmani, Shubham Tulsiani, Ted Xiao","submitted_at":"2024-09-24T17:57:33Z","abstract_excerpt":"How can robot manipulation policies generalize to novel tasks involving unseen object types and new motions? In this paper, we provide a solution in terms of predicting motion information from web data through human video generation and conditioning a robot policy on the generated video. Instead of attempting to scale robot data collection which is expensive, we show how we can leverage video generation models trained on easily available web data, for enabling generalization. Our approach Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That videos generated by a pre-trained model from web data provide sufficiently accurate and transferable motion information for a robot policy to execute novel tasks without any fine-tuning of the video model or additional domain adaptation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Generating human videos from web data lets a single robot policy manipulate unseen objects and novel motions without fine-tuning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1bbe4e4338af2e5d31282bc788cb896d47082880c17f6866ee3925319f5771eb"},"source":{"id":"2409.16283","kind":"arxiv","version":1},"verdict":{"id":"1bf39bd6-76af-49ba-909c-68383b3c8f09","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:13:10.717237Z","strongest_claim":"Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.","one_line_summary":"Gen2Act enables generalizable robot manipulation for unseen objects and novel motions by using zero-shot human video generation from web data to condition a policy trained on an order of magnitude less robot interaction data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That videos generated by a pre-trained model from web data provide sufficiently accurate and transferable motion information for a robot policy to execute novel tasks without any fine-tuning of the video model or additional domain adaptation.","pith_extraction_headline":"Generating human videos from web data lets a single robot policy manipulate unseen objects and novel motions without fine-tuning."},"references":{"count":61,"sample":[{"doi":"","year":2022,"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","ref_index":1,"cited_arxiv_id":"2212.06817","is_internal_anchor":true},{"doi":"","year":2024,"title":"Roboagent: Generalization and efficiency in robot manipulation via semantic augmen- tations and action chunking,","work_id":"cf539f96-21ec-42f3-9343-020ac356a037","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset","work_id":"13253de2-3d89-415c-8c2f-3adb25d4c337","ref_index":4,"cited_arxiv_id":"2403.12945","is_internal_anchor":true},{"doi":"","year":2022,"title":"R3M: A Universal Visual Representation for Robot Manipulation","work_id":"1fb6c1b7-913d-4a89-bbad-842fdb5fca1d","ref_index":5,"cited_arxiv_id":"2203.12601","is_internal_anchor":true},{"doi":"","year":2023,"title":"Where are we in the search for an artificial vi- sual cortex for embodied intelligence?","work_id":"a30a45a3-bdd7-45bf-a4f4-66d48e56bd4a","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":61,"snapshot_sha256":"9e6d99e7c63ae2fdb0ab688cfa1d7b4abf4342d125363741532386a8d130fec2","internal_anchors":12},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2409.16283","created_at":"2026-05-17T23:38:52.572580+00:00"},{"alias_kind":"arxiv_version","alias_value":"2409.16283v1","created_at":"2026-05-17T23:38:52.572580+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2409.16283","created_at":"2026-05-17T23:38:52.572580+00:00"},{"alias_kind":"pith_short_12","alias_value":"B3MFR4N7YWZS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"B3MFR4N7YWZSXPIW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"B3MFR4N7","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2605.22882","citing_title":"GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23856","citing_title":"Point Tracking Improves World Action Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2507.01099","citing_title":"Geometry-aware 4D Video Generation for Robot Manipulation","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2507.00990","citing_title":"Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2507.12768","citing_title":"AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2505.03233","citing_title":"GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2512.15840","citing_title":"Large Video Planner Enables Generalizable Robot Control","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04447","citing_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2601.07060","citing_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2507.12898","citing_title":"Vidar: Embodied Video Diffusion Model for Generalist Manipulation","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2503.22020","citing_title":"CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2510.10125","citing_title":"Ctrl-World: A Controllable Generative World Model for Robot Manipulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2505.12705","citing_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World Models","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19092","citing_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14274","citing_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16666","citing_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03181","citing_title":"Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04974","citing_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12090","citing_title":"World Action Models: The Next Frontier in Embodied AI","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13030","citing_title":"Motus: A Unified Latent Action World Model","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14803","citing_title":"Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2411.19650","citing_title":"CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10942","citing_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10079","citing_title":"SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22615","citing_title":"GazeVLA: Learning Human Intention for Robotic Manipulation","ref_index":3,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM","json":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM.json","graph_json":"https://pith.science/api/pith-number/B3MFR4N7YWZSXPIWNIYG62VWMM/graph.json","events_json":"https://pith.science/api/pith-number/B3MFR4N7YWZSXPIWNIYG62VWMM/events.json","paper":"https://pith.science/paper/B3MFR4N7"},"agent_actions":{"view_html":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM","download_json":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM.json","view_paper":"https://pith.science/paper/B3MFR4N7","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2409.16283&json=true","fetch_graph":"https://pith.science/api/pith-number/B3MFR4N7YWZSXPIWNIYG62VWMM/graph.json","fetch_events":"https://pith.science/api/pith-number/B3MFR4N7YWZSXPIWNIYG62VWMM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM/action/storage_attestation","attest_author":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM/action/author_attestation","sign_citation":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM/action/citation_signature","submit_replication":"https://pith.science/pith/B3MFR4N7YWZSXPIWNIYG62VWMM/action/replication_record"}},"created_at":"2026-05-17T23:38:52.572580+00:00","updated_at":"2026-05-17T23:38:52.572580+00:00"}