{"paper":{"title":"Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single instruction-conditioned video diffusion model unifies policy learning, simulation, and evaluation for robotic manipulation.","cross_cats":["cs.CV"],"primary_cat":"cs.RO","authors_text":"Donglin Yang, Guanghui Ren, Jianlan Luo, Jingbin Cai, Liliang Chen, Maoqing Yao, Pengfei Zhou, Shengcong Chen, Shuicheng Yan, Si Liu, Siyuan Huang, Yue Hu, Yue Liao, Yuxin Jiang","submitted_at":"2025-08-07T17:59:44Z","abstract_excerpt":"We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy infer"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"GE integrates policy learning, evaluation, and simulation within a single video-generative framework, establishing a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the instruction-conditioned video diffusion model in GE-Base sufficiently captures real-world spatial, temporal, and semantic dynamics to support accurate action mapping in GE-Act and reliable rollouts in GE-Sim across diverse embodiments.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single instruction-conditioned video diffusion model unifies policy learning, simulation, and evaluation for robotic manipulation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"d31d1636b651700f6bf1f02c37ac00c072cd4e1ffb5b464d089e336ff31020e2"},"source":{"id":"2508.05635","kind":"arxiv","version":3},"verdict":{"id":"bb69e79f-dc32-49f2-8af2-395ef5156751","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T21:25:20.398637Z","strongest_claim":"GE integrates policy learning, evaluation, and simulation within a single video-generative framework, establishing a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence.","one_line_summary":"Genie Envisioner unifies robotic policy learning, simulation, and evaluation inside one instruction-conditioned video diffusion framework using GE-Base, GE-Act, and GE-Sim.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the instruction-conditioned video diffusion model in GE-Base sufficiently captures real-world spatial, temporal, and semantic dynamics to support accurate action mapping in GE-Act and reliable rollouts in GE-Sim across diverse embodiments.","pith_extraction_headline":"A single instruction-conditioned video diffusion model unifies policy learning, simulation, and evaluation for robotic manipulation."},"references":{"count":30,"sample":[{"doi":"","year":null,"title":"Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs","work_id":"83956045-536a-41ff-af02-b80e2a614eab","ref_index":1,"cited_arxiv_id":"2503.01743","is_internal_anchor":true},{"doi":"","year":null,"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","ref_index":2,"cited_arxiv_id":"2501.03575","is_internal_anchor":true},{"doi":"","year":null,"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","ref_index":3,"cited_arxiv_id":"2204.01691","is_internal_anchor":true},{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":4,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"GR00T N1: An Open Foundation Model for Generalist Humanoid Robots","work_id":"e2db69c7-ee8a-4cb7-a761-7b8de1dfcf97","ref_index":5,"cited_arxiv_id":"2503.14734","is_internal_anchor":true}],"resolved_work":30,"snapshot_sha256":"ce8e3f6dd0408b02de1dc50f3f6352fe12ada7f544123ab7fa4124bd82b284f4","internal_anchors":20},"formal_canon":{"evidence_count":3,"snapshot_sha256":"d20f84f0c4517f6d64e62745072350500c0426fae96871ae5042365b95690e42"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}