{"paper":{"title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"DreamVLA forecasts compact dynamic, spatial and semantic world knowledge to drive a perception-prediction-action loop that raises robot manipulation success.","cross_cats":["cs.RO"],"primary_cat":"cs.CV","authors_text":"Fan Lu, He Wang, Hongsi Liu, Jiawei He, Jiazhao Zhang, Li Yi, Runpei Dong, Wenjun Zeng, Wenyao Zhang, Xin Jin, Xinqiang Yu, Yunnan Wang, Zekun Qi, Zhizheng Zhang","submitted_at":"2025-07-06T16:14:29Z","abstract_excerpt":"Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establ"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks through dynamic-region-guided world knowledge prediction integrated with spatial and semantic cues.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the block-wise structured attention successfully prevents interference among dynamic, spatial, and semantic representations and that the resulting forecasts provide compact yet sufficient information for action planning.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DreamVLA forecasts compact dynamic, spatial and semantic world knowledge to drive a perception-prediction-action loop that raises robot manipulation success.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"eb7a5f68a60a6779ec6eb6b428e8a6124077c64f380e06fba12ef7edec065d3f"},"source":{"id":"2507.04447","kind":"arxiv","version":3},"verdict":{"id":"f9ca7b68-8de3-4dd3-940f-3d1195c20c6b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T15:38:13.929583Z","strongest_claim":"DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks through dynamic-region-guided world knowledge prediction integrated with spatial and semantic cues.","one_line_summary":"DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the block-wise structured attention successfully prevents interference among dynamic, spatial, and semantic representations and that the resulting forecasts provide compact yet sufficient information for action planning.","pith_extraction_headline":"DreamVLA forecasts compact dynamic, spatial and semantic world knowledge to drive a perception-prediction-action loop that raises robot manipulation success."},"references":{"count":147,"sample":[{"doi":"","year":2024,"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","ref_index":1,"cited_arxiv_id":"2406.09246","is_internal_anchor":true},{"doi":"","year":2023,"title":"Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpa","work_id":"7b1ea7ce-4fbc-4b7c-b5f5-ffc3f9278e8f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Video language planning","work_id":"9736ca18-78ad-4d25-ac93-df8f2295a1f8","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Embodiedgpt: Vision-language pre-training via embodied chain of thought","work_id":"8bbf271e-d024-4137-886f-409e81f68453","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Robotic Control via Embodied Chain-of-Thought Reasoning","work_id":"bbe96698-686e-4b4a-9f7f-ed5054e61cca","ref_index":5,"cited_arxiv_id":"2407.08693","is_internal_anchor":true}],"resolved_work":147,"snapshot_sha256":"84fc2d82c7c003ca8e1eaed2eacf7069ab3ff374f73ae0893f124e0f85dbb2b7","internal_anchors":42},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2cab4fd21269c17c0ee73347cfce123e4b9fc75d7225d2047fe65e9e75441e87"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}