pith. machine review for the scientific record.

arxiv: 2504.02792 · v3 · submitted 2025-04-03 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 1 theorem link · Lean Theorem

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 16:17 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords unified world models · robot learning · diffusion models · imitation learning · video pretraining · action prediction · transformer · world modeling

The pith

Unified World Models couple video diffusion and action diffusion inside one transformer so a single network can pretrain robot policies on mixed video-plus-action datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Unified World Models as a way to train robot policies on both action-labeled demonstrations and abundant unlabeled video. A shared transformer runs separate diffusion processes for video frames and for actions, each controlled by its own timestep. Selecting the right combination of timesteps at inference time turns the same weights into a policy, a forward dynamics model, an inverse dynamics model, or a video generator. Experiments show the resulting policies generalize better than those trained only by imitation learning and improve further when extra action-free video is added during pretraining.

Core claim

Unified World Models integrate an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics model, an inverse dynamics model, or a video generator.
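Spelled out, the claim corresponds to a joint denoising objective with modality-specific timesteps. A plausible form, reconstructed here from the abstract alone (the equal weighting and exact conditioning are assumptions of this sketch, not an equation quoted from the paper):

    \mathcal{L}(\theta) = \mathbb{E}_{(o,\,a,\,o') \sim \mathcal{D},\; t_a,\, t_v,\; \epsilon_a,\, \epsilon_v}
    \Big[ \big\| \epsilon_a - \hat{\epsilon}^{\,a}_{\theta}\big(a^{(t_a)},\, o'^{(t_v)},\, o,\; t_a,\, t_v\big) \big\|^2
        + \big\| \epsilon_v - \hat{\epsilon}^{\,v}_{\theta}\big(a^{(t_a)},\, o'^{(t_v)},\, o,\; t_a,\, t_v\big) \big\|^2 \Big]

Here o is the current observation, a the action chunk, o' the future observation, and a^{(t_a)}, o'^{(t_v)} their independently noised versions. Driving t_a or t_v to zero (clean) or to the maximum (pure noise) selects which conditional the shared network expresses, and on action-free clips t_a can plausibly be pinned at the maximum so the model marginalizes over actions.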

What carries the argument

A unified transformer with two independent diffusion timesteps, one for video frames and one for actions, lets the same weights switch among policy, forward model, inverse model, and video generation simply by choosing the timestep pair.
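As a concrete illustration of that switch, here is a minimal PyTorch sketch. The class name, token shapes, layer sizes, and exact timestep conventions are assumptions made for illustration, not the authors' implementation:

    import torch
    import torch.nn as nn

    T = 1000  # maximum diffusion timestep (illustrative)

    class UWMSketch(nn.Module):
        """One shared transformer; video and action tokens each carry their
        own diffusion-timestep embedding (the paper's independent timesteps)."""

        def __init__(self, d=256, n_obs=4, n_video=4, n_action=8):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=2)
            self.t_emb = nn.Embedding(T + 1, d)  # shared timestep-embedding table
            self.video_head = nn.Linear(d, d)    # predicts noise on video tokens
            self.action_head = nn.Linear(d, d)   # predicts noise on action tokens
            self.n_obs, self.n_video = n_obs, n_video

        def forward(self, obs, z_video, z_action, t_video, t_action):
            # Each modality is shifted by its own timestep embedding before the
            # shared backbone sees the concatenated token sequence.
            zv = z_video + self.t_emb(t_video)[:, None, :]
            za = z_action + self.t_emb(t_action)[:, None, :]
            h = self.backbone(torch.cat([obs, zv, za], dim=1))
            hv = h[:, self.n_obs:self.n_obs + self.n_video]
            ha = h[:, self.n_obs + self.n_video:]
            return self.video_head(hv), self.action_head(ha)

    # One reading of how timestep pairs select a capability at inference
    # (our interpretation of the scheme, not a table quoted from the paper):
    #   policy:            t_video = T (future frames pure noise), denoise actions
    #   inverse dynamics:  t_video = 0 (clean future frames given), denoise actions
    #   forward dynamics:  t_action = 0 (clean actions given), denoise video
    #   video generation:  t_action = T (actions pure noise), denoise video
    model = UWMSketch()
    obs = torch.randn(1, 4, 256)
    eps_v, eps_a = model(obs, torch.randn(1, 4, 256), torch.randn(1, 8, 256),
                         t_video=torch.tensor([T]), t_action=torch.tensor([500]))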

If this is right

  • Pretraining on large multitask robot datasets that contain both dynamics and action labels produces policies that transfer more robustly than standard imitation learning.
  • Independent timestep control lets the model absorb action-free video data during pretraining without requiring action labels, further boosting downstream policy performance.
  • The same weights can be used at inference time as a forward dynamics predictor, an inverse dynamics predictor, or a video generator simply by changing the diffusion timestep pair.
  • The approach unifies imitation learning and world modeling inside one training run rather than training separate models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The method could be extended to additional modalities such as language or tactile signals by adding further independent diffusion streams inside the same transformer.
  • Because video data is far cheaper to collect than action-labeled trajectories, the framework lowers the data cost of scaling robot foundation models.
  • If timestep separation works cleanly, similar diffusion unification might apply to other paired modalities where one stream is easier to observe than the other.
  • Real-world deployment would benefit from testing whether the learned forward model can be used for planning without retraining.

Load-bearing premise

Separate timestep control for each modality inside a shared transformer is enough to keep video and action modeling from interfering while still letting each capability be read out cleanly at test time.

What would settle it

Train UWM on a mixed dataset, then measure whether running the same weights in policy mode (future-frame timestep held at maximum noise while the actions are denoised) produces lower success rates than a model trained only on action data, while video-generation quality remains high.
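A minimal harness for that comparison, sketched in Python; the rollout callables, the number of rollouts, and the FVD budget are hypothetical placeholders supplied by the experimenter, not values from the paper:

    from typing import Callable

    def premise_survives(uwm_policy: Callable[[], bool],
                         action_only_policy: Callable[[], bool],
                         uwm_video_fvd: float,
                         n_rollouts: int = 50,
                         fvd_budget: float = 300.0) -> bool:
        """True if joint training pays no policy tax: UWM run in policy mode
        matches or beats the action-only baseline on rollout success while its
        video generation stays within an FVD budget. Each policy callable
        executes one randomized rollout and reports success."""
        uwm_sr = sum(uwm_policy() for _ in range(n_rollouts)) / n_rollouts
        bc_sr = sum(action_only_policy() for _ in range(n_rollouts)) / n_rollouts
        return uwm_sr >= bc_sr and uwm_video_fvd <= fvd_budget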

Original abstract

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Unified World Models (UWM), a unified transformer that couples an action diffusion process and a video diffusion process governed by independent modality-specific timesteps. By selecting appropriate timestep pairs at inference, the same model can be used as a policy, forward dynamics model, inverse dynamics model, or video generator. The authors report that pretraining on large-scale multitask robot datasets containing both action-labeled and action-free video data produces more generalizable policies than standard imitation learning in both simulation and real-world settings.

Significance. If the empirical gains prove robust, the work provides a practical unification of imitation learning and world modeling that directly addresses the scarcity of action annotations by leveraging abundant video data. The diffusion-timestep control mechanism offers a lightweight way to extract multiple capabilities from a single pretrained model, which could simplify scaling of robotic foundation models on heterogeneous datasets.

major comments (2)
  1. [§3] §3 (Method): The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.
  2. [§4] §4 (Experiments): The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.
minor comments (2)
  1. [§3] Notation for the two diffusion timesteps should be introduced once with explicit symbols (e.g., t_v and t_a) and used consistently thereafter to improve readability.
  2. The abstract would benefit from a single sentence summarizing the quantitative gains (e.g., success-rate deltas) rather than only qualitative statements.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.

    Authors: We agree that a direct ablation comparing joint training to modality-isolated training would strengthen the evidence regarding gradient interference. While the empirical success of UWM across all tasks (policy, dynamics, inverse dynamics, and video generation) indicates that the independent timestep mechanism largely prevents objective conflicts, we acknowledge the absence of this specific control. In the revised manuscript we add an ablation that trains separate modality-specific models and compares them to the joint UWM, along with a gradient-norm analysis during training to quantify any cross-modality interference (a sketch of one such analysis follows these responses). revision: yes

  2. Referee: [§4] §4 (Experiments): The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.

    Authors: We appreciate the request for greater experimental transparency. The revised manuscript now includes: (i) explicit descriptions of all pretraining and finetuning data splits with exact dataset sizes and task distributions, (ii) full hyperparameter tables and training procedures for every baseline, and (iii) an additional control experiment that removes video pretraining while keeping the architecture and action data identical, thereby isolating the contribution of the video component. revision: yes
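The gradient-norm analysis promised in response 1 can be made concrete. One way to quantify cross-modality interference on the shared weights, assuming the video and action denoising losses can be computed separately on the same batch (a sketch with illustrative names, not the authors' code):

    import torch

    def gradient_interference(model: torch.nn.Module,
                              loss_video: torch.Tensor,
                              loss_action: torch.Tensor) -> dict:
        """Cosine similarity and norm ratio between the two modalities'
        gradients on the shared parameters. A persistently negative cosine
        would indicate the two denoising objectives pull the shared weights
        in conflicting directions."""
        params = [p for p in model.parameters() if p.requires_grad]
        g_v = torch.autograd.grad(loss_video, params, retain_graph=True, allow_unused=True)
        g_a = torch.autograd.grad(loss_action, params, retain_graph=True, allow_unused=True)

        def flat(grads):
            # Parameters untouched by one loss count as zero gradient.
            return torch.cat([torch.zeros_like(p).flatten() if g is None else g.flatten()
                              for g, p in zip(grads, params)])

        v, a = flat(g_v), flat(g_a)
        return {"cosine": torch.nn.functional.cosine_similarity(v, a, dim=0).item(),
                "norm_ratio": (v.norm() / a.norm().clamp_min(1e-12)).item()}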

Circularity Check

0 steps flagged

No circularity: new unified diffusion architecture evaluated on external datasets

Full rationale

The paper introduces UWM as a novel transformer-based coupling of independent video and action diffusion processes, with claims about flexible representation of policies and dynamics arising directly from the architectural choice of modality-specific timesteps. No equations or derivations reduce by construction to fitted parameters defined by the target result, nor do any load-bearing steps rely on self-citations that themselves assume the outcome. The pretraining procedure and empirical evaluations on simulated and real-world robot datasets are presented as independent of the claimed capabilities, making the derivation self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the model appears to rest on standard diffusion and transformer assumptions already established in prior literature.

pith-pipeline@v0.9.0 · 5608 in / 1113 out tokens · 97349 ms · 2026-05-13T16:17:41.939600+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  2. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  3. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  4. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  5. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  6. Envisioning the Future, One Step at a Time

    cs.CV 2026-04 unverdicted novelty 7.0

    An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.

  7. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  8. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  9. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  10. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  11. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  12. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  13. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  14. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  15. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  16. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  17. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  18. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  19. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  20. Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

    cs.RO 2026-03 unverdicted novelty 6.0

    SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.

  21. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  22. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  23. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  24. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  25. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  26. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  27. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 24 Pith papers · 13 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bink...

  2. [2]

    Affordances from human videos as a versatile representation for robotics

    Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics, 2023. URL https://arxiv.org/abs/2304.08488

  3. [3]

    Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models

    Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations, 2022

  4. [4]

    One transformer fits all distributions in multi-modal diffusion at scale, 2023

    Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale, 2023. URL https://arxiv.org/abs/2303.06555

  5. [5]

    Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation, 2024. URL https://arxiv.org/abs/2405.01527

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vi...

  7. [7]

    Stable video diffusion: Scaling latent video diffusion models to large datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets

  8. [8]

    URL https://arxiv.org/abs/2311.15127

  9. [9]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , July 2017

  10. [10]

    Unimask: Unified inference in sequential decision problems, 2022

    Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Milani, Katja Hofmann, Matthew Hausknecht, Anca Dragan, and Sam Devlin. Unimask: Unified inference in sequential decision problems, 2022. URL https://arxiv.org/abs/2211.10869

  11. [11]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024

    Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  13. [13]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models, 2024. URL https://arxiv.org/abs/2310.08864

  14. [14]

    From play to policy: Conditional behavior generation from uncurated robot data

    Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In International Conference on Learning Representations, 2023

  15. [15]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=2dnO3LLiJ1

  16. [16]

    The ingredients for robotic diffusion transformers

    Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, and Sergey Levine. The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088, 2024

  17. [17]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition , pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  18. [18]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL ...

  19. [19]

    The "something something" video database for learning and evaluating visual common sense

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 58...

  20. [20]

    Prediction with action: Visual policy learning via joint denoising process

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  21. [21]

    Prediction with action: Visual policy learning via joint denoising process, 2024

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024. URL https://arxiv.org/abs/2411.18179

  22. [22]

    Deep residual learning for image recognition

    Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015. URL https://api.semanticscholar.org/CorpusID:206594692

  23. [23]

    Denoising diffusion probabilistic models, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

  24. [24]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458 , 2022

  25. [25]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations, 2024. URL https://arxiv.org/abs/2412.14803

  26. [26]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Le...

  27. [27]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 , 2024

  28. [28]

    Behavior generation with latent actions

    Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181

  29. [29]

    Flow matching for generative modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023

  30. [30]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=xzEtNSuDJk

  31. [31]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  32. [32]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021

  33. [33]

    Ssm meets video diffusion models: Efficient long-term video generation with structured state spaces

    Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, and Yutaka Matsuo. Ssm meets video diffusion models: Efficient long-term video generation with structured state spaces. arXiv preprint arXiv:2403.07711 , March 2024

  34. [34]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022

  35. [35]

    SDXL: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations,

  36. [36]

    URL https://openreview.net/forum?id=di52zR8xgf

  37. [37]

    Cosmos World Foundation Model Platform for Physical AI

    NVIDIA Research. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, January 2025

  38. [38]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  39. [39]

    Efficient reductions for imitation learning

    Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL ...

  40. [40]

    Behavior transformers: Cloning k modes with one stone

    Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone, 2022. URL https://arxiv.org/abs/2206.11251

  41. [41]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

  42. [42]

    Octo: An open-source generalist robot policy, 2024

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/24...

  43. [43]

    Mimicplay: Long-horizon imitation learning by watching human play

    Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URL https://arxiv.org/abs/2302.12422

  44. [44]

    Any-point trajectory modeling for policy learning

    Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning, 2024. URL https://arxiv.org/abs/2401.00025

  45. [45]

    Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024. URL https://arxiv.org/abs/2409.12514

  46. [46]

    Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023. URL https://arxiv.org/abs/2312.13139

  47. [47]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024

  48. [48]

    ivideogpt: Interactive videogpts are scalable world models

    Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. In Advances in Neural Information Processing Systems, 2024

  49. [49]

    Learning by watching: Physical imitation of manipulation skills from human videos, 2021

    Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos, 2021. URL https://arxiv.org/abs/2101.07241

  50. [50]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos,

  51. [51]

    URL https://arxiv.org/abs/2410.11758

  52. [52]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

  53. [53]

    Aloha unleashed: A simple recipe for robot dexterity

    Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity, 2024. URL https://arxiv.org/abs/2410.13126

  54. [54]

    Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model, 2024. URL https://arxiv.org/abs/2408.11039

  55. [55]

    Model Architecture: We base our implementation of UWM on the diffusion transformer architecture with AdaLN conditioning [33]. The inputs to the model are $(o, a_{t_a}, o'_{t_{o'}}, t_a, t_{o'})$, where $o := \{o^i_{0:h_o}\}_{i=1}^{n_c}$ is a sequence of observations from $n_c$ camera views, $a_{t_a} := a_{h_o:h_o+h_a}$ is a sequence of noisy actions, $o'_{t_{o'}} := \{o^i_{h_o+h_a:2h_o+h_a}\}_{i=1}^{n_c}$ is a seque...

  56. [56]

    Training and Inference Details

    Training and Inference Details: Given a transition tuple (o, a, o′) sampled from the dataset, we first apply random cropping and augmentations to the image observations. The cropping and augmentation parameters are kept temporally consistent across o and o′ but differ from camera view to camera view. We then sample action and observation diffusion ti...

  57. [57]

    Training Compute

    Training Compute: Training a UWM on the DROID dataset for 100K gradient steps with the hyperparameters shown in Table V takes 24 hours on 4 NVIDIA A100 GPUs using Pytorch DDP. [Fig. 11: setup of the robot experiments, showing scene camera 1, scene camera 2, eval camera, and wrist camera.] We adopt the DROID [25] setup which consists of two scene cameras and one wrist camer...

  58. [58]

    Diffusion Policies

    Diffusion Policies: We base our implementation of diffusion policies on the UWM model. We remove the image tokens, image diffusion timestep, and registers and keep everything else identical. This is equivalent to the Transformer version of the original diffusion policy [11] and similar to the architecture in [15]

  59. [59]

    PAD

    PAD: We base our implementation of PAD on the UWM model, replacing coupled action-image diffusion with joint diffusion, and condition the model by concatenating the clean current observations to the noisy future observation predictions along the channel dimension. The diffusion timestep is still passed into the transformer via AdaLN. While the original PA...

  60. [60]

    GR1

    GR1: We use a custom implementation of the GR1 model adapted to have the same input-output format as UWM. Instead of regressing consecutive actions and observations, we predict a sequence of actions and the following image observations. GR1 conditions on the current observations by passing the ViT encoded observation tokens through a Perceiver resampler...

  61. [61]

    Robot Setup

    Robot Setup: We conduct real-world experiments using a Franka Panda robot in the DROID [25] setup. As shown in Fig. 11, the robot's observation space consists of two scene cameras and a wrist camera (visualized in Fig. 13). We additionally mount an overhead camera to track the initializations during [Table VI: task-specific parameters (# demos, # finetuning st...)]

  62. [62]

    Tasks

    Tasks: We provide a detailed description of each real-world task shown in Fig. 5 and the task-specific settings in Table VI. • Stack-Bowls: the robot needs to pick up the red bowl on the counter and place it in the blue bowl. The positions of the bowls are randomized across the counter top. A rollout is successful if the red bowl is placed securely insid...

  63. [63]

    Evaluation Protocol

    Evaluation Protocol: To ensure fairness of real-robot evaluations, we use an overhead camera and a Python program to systematically track randomizations. As shown in Fig. 12, the program overlays the reference frame onto the current frame, so the user can adjust the objects to match the reference frame. All tasks except Rice-Cooker are evaluated on 50 r...

  64. [64]

    Failure Modes

    Failure Modes: We provide a description of some common failure modes in the real-world experiments. Although we utilized three cameras to maximize coverage (Fig. 13), certain angles resulted in objects being visible to only one camera. These limited viewpoints made some initializations more challenging for the robot to complete the tasks successfully. A...

  65. [65]

    Simulated Environments

    Simulated Environments: LIBERO [29] is a simulated robotic benchmark designed to evaluate lifelong learning algorithms. It involves controlling a 7-DoF Franka Panda. [Fig. 13: visualization of the robot's perspective in in-distribution, standard out-of-distribut...; panels: Lighting 1/2, Background 1/2, Clutter 1/2, In-Distribution, Standard OOD]

  66. [66]

    Book-Caddy: the robot needs to pick up the book from the table top and place it in the back of a caddy

  67. [67]

    Soup-Cheese: the robot needs to place the alphabet soup and the cheese in the basket in sequence

  68. [68]

    Bowl-Drawer: the robot needs to pick up the bowl, place it in the bottom drawer, and close the drawer

  69. [69]

    Moka-Moka: the robot needs to pick up the two Moka cups from the table and place them on the electric stove

  70. [70]

    Mug-Mug: the robot needs to place the left mug in the left plate and place the right mug in the right plate.

    TABLE VII: Ablation of design choices (success rates)
                            Book-Caddy     Soup-Cheese
    UWM w/ 8 registers      0.88 ± 0.04    0.90 ± 0.02
    UWM w/ 4 registers      0.83 ± 0.05    0.86 ± 0.03
    UWM w/o registers       0.81 ± 0.07    0.85 ± 0.03
    Cross attention UWM     0.78 ± 0.05    0.86 ± 0.04

    TABLE VIII: Ablat...

  71. [71]

    Ablations of Design Choices

    Ablations of Design Choices: To understand the effect of UWM's design choices, we conduct ablation studies on two simulated tasks from the LIBERO environment. Specifically, we want to (1) understand the effect of registers on task performance, and (2) compare the use of AdaLN for observation conditioning with cross attention [17]. For each model, we tra...

  72. [72]

    Ablation of Learning Objectives

    Ablation of Learning Objectives: To evaluate whether the performance gain of UWM is a result of dynamics prediction or pure reconstruction, we pretrain a UWM to reconstruct the current observations instead of the future observations. This incentivizes the model to learn about image features, but not about temporal dynamics. Table VIII shows that while ...

  73. [73]

    Learning from Internet videos: We evaluate whether UWM can leverage knowledge from Internet videos by including a mixture of Kinetics-400 [8] and Something-Something v2. [Fig. 14: visualization of the Internet video dataset (Kinetics-400 and Something-Something v2).] We curate the dataset by combining human activity videos from Kinetics-400...