{"work":{"id":"c181dcf1-e774-4216-a30e-e55c3f3a766c","openalex_id":null,"doi":null,"arxiv_id":"2504.02792","raw_key":null,"title":"Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets","authors":null,"authors_text":"Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta","year":2025,"venue":"cs.RO","abstract":"Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. 
Videos and code are available at https://weirdlabuw.github.io/uwm/.","external_url":"https://arxiv.org/abs/2504.02792","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-16T12:42:47.095521+00:00","pith_arxiv_id":"2504.02792","created_at":"2026-05-10T08:12:26.021199+00:00","updated_at":"2026-05-16T12:42:47.095521+00:00","title_quality_ok":true,"display_title":"Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets","render_title":"Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets"},"hub":{"state":{"work_id":"c181dcf1-e774-4216-a30e-e55c3f3a766c","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":29,"external_cited_by_count":null,"distinct_field_count":3,"first_pith_cited_at":"2025-05-19T04:55:39+00:00","last_pith_cited_at":"2026-05-12T13:10:52+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-16T13:28:47.123119+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":6},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":6},{"context_polarity":"baseline","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}
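The abstract above describes UWM's core mechanism: one transformer denoises both video and action tokens, with an independent diffusion timestep per modality, so that the choice of timestep pair instantiates a policy, a forward dynamics model, an inverse dynamics model, or a video generator. The sketch below is a minimal PyTorch illustration of that independent-timestep idea only; UnifiedDenoiser, the single-vector "video" and "action" tokens, and all dimensions are hypothetical stand-ins, not the paper's architecture or released code.

# Minimal sketch of the independent-timestep idea from the UWM abstract.
# All names and shapes here are toy assumptions; the real model diffuses
# video frames/latents and action chunks in a much larger transformer.
import torch
import torch.nn as nn

T_MAX = 1000  # number of diffusion steps (assumed)

class UnifiedDenoiser(nn.Module):
    """One backbone, two modalities, one timestep embedding per modality."""
    def __init__(self, video_dim=64, action_dim=8, hidden=128):
        super().__init__()
        self.video_in = nn.Linear(video_dim, hidden)
        self.action_in = nn.Linear(action_dim, hidden)
        # Separate timestep embeddings encode each modality's noise level,
        # which is what allows the two timesteps to be set independently.
        self.t_video = nn.Embedding(T_MAX, hidden)
        self.t_action = nn.Embedding(T_MAX, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.video_out = nn.Linear(hidden, video_dim)
        self.action_out = nn.Linear(hidden, action_dim)

    def forward(self, video, action, t_v, t_a):
        v = self.video_in(video) + self.t_video(t_v)
        a = self.action_in(action) + self.t_action(t_a)
        h = self.backbone(torch.stack([v, a], dim=1))  # (B, 2, hidden)
        return self.video_out(h[:, 0]), self.action_out(h[:, 1])

model = UnifiedDenoiser()
B = 4
video = torch.randn(B, 64)   # (possibly noised) future-video features, toy
action = torch.randn(B, 8)   # (possibly noised) action chunk, toy
clean = torch.zeros(B, dtype=torch.long)               # t = 0: modality observed
noise = torch.full((B,), T_MAX - 1, dtype=torch.long)  # t = max: pure noise
step = torch.full((B,), 500, dtype=torch.long)         # mid-denoising timestep

# One denoising step in each of the four modes named in the abstract;
# the mode is selected purely by the per-modality timesteps:
_ = model(video, action, t_v=noise, t_a=step)   # policy: denoise actions, video marginalized
_ = model(video, action, t_v=step,  t_a=clean)  # forward dynamics: denoise video given actions
_ = model(video, action, t_v=clean, t_a=step)   # inverse dynamics: denoise actions given video
_ = model(video, action, t_v=step,  t_a=noise)  # video generation: actions marginalized

The same trick covers the abstract's action-free pretraining claim: on unannotated video, the action timestep can be held at maximum noise so the video-denoising loss is the only learning signal, which matches "independent control of modality-specific diffusion timesteps" as described above.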