{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:XE4DZ72WMHQWO2RL4XB2E2ZYSN","short_pith_number":"pith:XE4DZ72W","schema_version":"1.0","canonical_sha256":"b9383cff5661e1676a2be5c3a26b38935c6a5cae241ad73df8473f74c3e799dc","source":{"kind":"arxiv","id":"2404.12377","version":1},"attestation_state":"computed","paper":{"title":"RoboDreamer: Learning Compositional World Models for Robot Imagination","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Chuang Gan, Dit-Yan Yeung, Jiaben Chen, Siyuan Zhou, Yandong Li, Yilun Du","submitted_at":"2024-04-18T17:58:03Z","abstract_excerpt":"Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this is"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.12377","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.RO","submitted_at":"2024-04-18T17:58:03Z","cross_cats_sorted":[],"title_canon_sha256":"857a47855d259cc296814b1c63587a86ee8d37905cfeb28ac34dabab669fa6c2","abstract_canon_sha256":"0aa12db3538b1b684f3c6617c97b68a99d7954b94c3f0e293508d2cddd1ee5ff"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.212577Z","signature_b64":"XT8L6YcDMlDMVRuxhrl0lwFVfYAvO93+oGD+2UKeliIJC+4Dch/TBJ3zUsF6tnqTMg4qmsqrVgxKh6CUIUKCAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"b9383cff5661e1676a2be5c3a26b38935c6a5cae241ad73df8473f74c3e799dc","last_reissued_at":"2026-05-17T23:38:50.211982Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.211982Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"RoboDreamer: Learning Compositional World Models for Robot Imagination","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks.","cross_cats":[],"primary_cat":"cs.RO","authors_text":"Chuang Gan, Dit-Yan Yeung, Jiaben Chen, Siyuan Zhou, Yandong Li, Yilun Du","submitted_at":"2024-04-18T17:58:03Z","abstract_excerpt":"Text-to-video models have demonstrated substantial potential in robotic decision-making, enabling the imagination of realistic plans of future actions as well as accurate environment simulation. However, one major issue in such models is generalization -- models are limited to synthesizing videos subject to language instructions similar to those seen at training time. This is heavily limiting in decision-making, where we seek a powerful world model to synthesize plans of unseen combinations of objects and actions in order to solve previously unseen tasks in new environments. To resolve this is"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That natural language instructions can be reliably parsed into lower-level primitives whose separate models compose into coherent, realistic videos without introducing artifacts or losing task-relevant details.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"98546809f5544792a6442e94ac62eae53b46ea6b6b1221ea558c9a5d60fcd642"},"source":{"id":"2404.12377","kind":"arxiv","version":1},"verdict":{"id":"cd704efb-c6fd-468c-992c-3d2d32701d49","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:43:49.453064Z","strongest_claim":"Our approach can successfully synthesize video plans on unseen goals in the RT-X, enables successful robot execution in simulation, and substantially outperforms monolithic baseline approaches to video generation.","one_line_summary":"RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That natural language instructions can be reliably parsed into lower-level primitives whose separate models compose into coherent, realistic videos without introducing artifacts or losing task-relevant details.","pith_extraction_headline":"RoboDreamer factorizes video generation using language primitives to create plans for unseen robot tasks."},"references":{"count":65,"sample":[{"doi":"","year":2021,"title":"Unsupervised learning of compositional energy concepts","work_id":"585f2b86-ddff-414d-ada6-8a36e14fef70","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"B., Dieleman, S., Fergus, R., Sohl-Dickstein, J., Doucet, A., and Grathwohl, W","work_id":"1bf90112-a203-409f-8900-11651a71919b","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"G., Tapaswi, M., Laptev, I., and Schmid, C","work_id":"646c92f1-d546-4fbb-8968-71d5ee6f9718","ref_index":12,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Diffusion-based generation, optimization, and planning in 3d scenes","work_id":"6bff28de-131c-49cc-86fa-2b61c4212c98","ref_index":15,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"R., and Davison, A","work_id":"ebc7a356-db9d-48bd-8ae8-be7274070335","ref_index":17,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":65,"snapshot_sha256":"d9f370a51d0cca10de936e6bd758f85f4a968cc6ab1eb5a137c4add3f2f0e6f7","internal_anchors":12},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c256a1870659fa9e3d01b953c60dd7a4b6a37d8529245a7b3f095707f4136afd"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.12377","created_at":"2026-05-17T23:38:50.212081+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.12377v1","created_at":"2026-05-17T23:38:50.212081+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.12377","created_at":"2026-05-17T23:38:50.212081+00:00"},{"alias_kind":"pith_short_12","alias_value":"XE4DZ72WMHQW","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"XE4DZ72WMHQWO2RL","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"XE4DZ72W","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2605.22882","citing_title":"GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2506.17697","citing_title":"Beyond Syntax: Action Semantics Learning for App Agents","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2512.01773","citing_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2512.05564","citing_title":"ProPhy: Progressive Physical Alignment for Dynamic World Simulation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2602.11075","citing_title":"RISE: Self-Improving Robot Policy with Compositional World Model","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2505.12705","citing_title":"DreamGen: Unlocking Generalization in Robot Learning through Video World Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2602.20309","citing_title":"QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19092","citing_title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2603.16666","citing_title":"Fast-WAM: Do World Action Models Need Test-time Future Imagination?","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12090","citing_title":"World Action Models: The Next Frontier in Embodied AI","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21998","citing_title":"Causal World Modeling for Robot Control","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28185","citing_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10942","citing_title":"HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24661","citing_title":"Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06481","citing_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15922","citing_title":"World Action Models are Zero-shot Policies","ref_index":92,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21241","citing_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11386","citing_title":"ComSim: Building Scalable Real-World Robot Data Generation via Compositional Simulation","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08168","citing_title":"ViVa: A Video-Generative Value Model for Robot Reinforcement Learning","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07794","citing_title":"NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24661","citing_title":"Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06168","citing_title":"Action Images: End-to-End Policy Learning via Multiview Video Generation","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11751","citing_title":"Grounded World Model for Semantically Generalizable Planning","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17887","citing_title":"StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15938","citing_title":"VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation","ref_index":34,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN","json":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN.json","graph_json":"https://pith.science/api/pith-number/XE4DZ72WMHQWO2RL4XB2E2ZYSN/graph.json","events_json":"https://pith.science/api/pith-number/XE4DZ72WMHQWO2RL4XB2E2ZYSN/events.json","paper":"https://pith.science/paper/XE4DZ72W"},"agent_actions":{"view_html":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN","download_json":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN.json","view_paper":"https://pith.science/paper/XE4DZ72W","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.12377&json=true","fetch_graph":"https://pith.science/api/pith-number/XE4DZ72WMHQWO2RL4XB2E2ZYSN/graph.json","fetch_events":"https://pith.science/api/pith-number/XE4DZ72WMHQWO2RL4XB2E2ZYSN/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN/action/timestamp_anchor","attest_storage":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN/action/storage_attestation","attest_author":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN/action/author_attestation","sign_citation":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN/action/citation_signature","submit_replication":"https://pith.science/pith/XE4DZ72WMHQWO2RL4XB2E2ZYSN/action/replication_record"}},"created_at":"2026-05-17T23:38:50.212081+00:00","updated_at":"2026-05-17T23:38:50.212081+00:00"}