arXiv: 2504.02792 · v3 · submitted 2025-04-03 · 💻 cs.RO · cs.AI · cs.LG

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, Abhishek Gupta

Pith reviewed 2026-05-13 16:17 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG

keywords unified world models · robot learning · diffusion models · imitation learning · video pretraining · action prediction · transformer · world modeling

The pith

Unified World Models couple video diffusion and action diffusion inside one transformer so a single network can pretrain robot policies on mixed video-plus-action datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Unified World Models as a way to train robot policies on both action-labeled demonstrations and abundant unlabeled video. A shared transformer runs separate diffusion processes for video frames and for actions, each controlled by its own timestep. Selecting the right combination of timesteps at inference time turns the same weights into a policy, a forward dynamics model, an inverse dynamics model, or a video generator. Experiments show the resulting policies generalize better than those trained only by imitation learning and improve further when extra action-free video is added during pretraining.

Core claim

Unified World Models integrate an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.

What carries the argument

Unified transformer with two independent diffusion timesteps—one for video frames and one for actions—allowing the same weights to switch among policy, forward model, inverse model, and video generation simply by choosing the timestep pair.
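The timestep-pair idea can be made concrete with a small lookup table. The sketch below is illustrative only: the names `MODES` and `timestep_schedule`, the value of `T`, and the convention that timestep 0 means a clean (conditioning) input while timestep T means pure noise (a marginalized input) are assumptions, not details confirmed by the summary above.

```python
# Hypothetical sketch: names, T, and the 0 = clean / T = pure-noise convention
# are assumptions made for illustration, not taken from the paper's code.
T = 1000  # assumed number of diffusion steps

# For each mode: which modality is sampled by iterative denoising, and the
# timestep at which the other modality is held fixed.
MODES = {
    "policy":           {"sample": "action", "fixed_t": T},  # future video marginalized
    "video_generator":  {"sample": "video",  "fixed_t": T},  # actions marginalized
    "forward_dynamics": {"sample": "video",  "fixed_t": 0},  # clean actions given
    "inverse_dynamics": {"sample": "action", "fixed_t": 0},  # clean future video given
}

def timestep_schedule(mode, num_steps=5):
    """Return (t_action, t_video) pairs for a coarse reverse-diffusion pass."""
    spec = MODES[mode]
    pairs = []
    for i in range(num_steps + 1):
        # Evenly spaced timesteps from T down to 0 for the sampled modality.
        t = T * (num_steps - i) // num_steps
        if spec["sample"] == "action":
            pairs.append((t, spec["fixed_t"]))
        else:
            pairs.append((spec["fixed_t"], t))
    return pairs
```

Under these assumptions, `timestep_schedule("policy")` walks the action timestep from T down to 0 while the video timestep stays at T, and `timestep_schedule("forward_dynamics")` does the reverse with the action timestep pinned at 0. The network weights are the same in every mode; only the pair of timesteps fed to the model changes.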

If this is right

Pretraining on large multitask robot datasets that contain both dynamics and action labels produces policies that transfer more robustly than standard imitation learning.
Independent timestep control lets the model absorb action-free video data during pretraining without requiring action labels, further boosting downstream policy performance.
The same weights can be used at inference time as a forward dynamics predictor, an inverse dynamics predictor, or a video generator simply by changing the diffusion timestep pair.
The approach unifies imitation learning and world modeling inside one training run rather than training separate models.
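The second point above, absorbing action-free video, can be sketched as a batch-construction step. This is a minimal NumPy illustration under stated assumptions: a linear-beta DDPM noise schedule, an action timestep pinned at T for unlabeled rows, pure noise in their action slot, and a mask that drops their action loss. The function names and the schedule constants are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction for a linear-beta DDPM schedule, indexed 0..T."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([np.ones(1), np.cumprod(1.0 - betas)])

def q_sample(x0, t, alpha_bar, rng):
    """Noise each batch row of x0 to its own timestep t (DDPM forward step)."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bar[t].reshape((-1,) + (1,) * (x0.ndim - 1))
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

def make_training_batch(video, action, has_action, T=1000, seed=0):
    """video: (B, ...) future frames; action: (B, D); has_action: (B,) bool."""
    rng = np.random.default_rng(seed)
    alpha_bar = alpha_bar_schedule(T)
    batch = video.shape[0]
    # Independent timesteps per modality, drawn from 1..T.
    t_v = rng.integers(1, T + 1, size=batch)
    t_a = rng.integers(1, T + 1, size=batch)
    # Action-free rows: pin the action timestep at T (assumed convention).
    t_a = np.where(has_action, t_a, T)
    noisy_video, eps_v = q_sample(video, t_v, alpha_bar, rng)
    noisy_action, eps_a = q_sample(action, t_a, alpha_bar, rng)
    # Action-free rows carry no label, so feed pure noise in the action slot.
    noisy_action = np.where(has_action[:, None], noisy_action, eps_a)
    return {
        "noisy_video": noisy_video, "noisy_action": noisy_action,
        "t_v": t_v, "t_a": t_a, "eps_v": eps_v, "eps_a": eps_a,
        # Multiply the per-row action loss by this mask so unlabeled rows
        # contribute only through the video-denoising term.
        "action_loss_mask": has_action.astype(np.float64),
    }
```

For unlabeled rows the `action` array can hold any placeholder values, since they are overwritten with noise before reaching the model. Whether the released implementation masks the action loss in exactly this way is not stated in the summary, so treat the mask as one reasonable design choice.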

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to additional modalities such as language or tactile signals by adding further independent diffusion streams inside the same transformer.
Because video data is far cheaper to collect than action-labeled trajectories, the framework lowers the data cost of scaling robot foundation models.
If timestep separation works cleanly, similar diffusion unification might apply to other paired modalities where one stream is easier to observe than the other.
Real-world deployment would benefit from testing whether the learned forward model can be used for planning without retraining.

Load-bearing premise

Separate timestep control for each modality inside a shared transformer is enough to keep video and action modeling from interfering while still letting each capability be read out cleanly at test time.

What would settle it

Train UWM on a mixed dataset, then measure whether policy mode (actions denoised while the future-video timestep is held at full noise) produces lower success rates than a model trained only on action data while video generation quality remains high.
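A success-rate comparison of this kind needs some check that the gap is larger than rollout noise. The sketch below uses a pooled two-proportion z-test, which is a swapped-in textbook statistic and not something the paper or this review prescribes; the function name and argument names are hypothetical.

```python
import math

def two_proportion_z(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        # Both policies always succeed or always fail: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

Here policy A would be the jointly trained UWM in policy mode and policy B the action-only baseline, each evaluated over its own set of rollouts. The normal approximation is rough for small trial counts, so an exact test may be preferable when only a few dozen rollouts are available.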

Original abstract

Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

UWM's shared transformer with independent video/action timesteps is a clean unification but risks cross-modality interference that the abstract does not fully address.

The main point is that this paper trains a single transformer on both video and action data by running separate diffusion processes for each modality, with independent timesteps that let the same weights switch between policy, forward dynamics, inverse dynamics, or video generation at inference time. That setup directly tackles the shortage of labeled robot actions by pulling in unlabeled video as extra pretraining signal, and the abstract reports gains over plain imitation learning in both simulation and real-robot tests. They also release code, which helps.

The approach is technically straightforward and builds on existing diffusion work without adding heavy new machinery. What it does well is show a practical way to mix heterogeneous datasets without forcing everything into one task format. The results suggest the model generalizes better when video pretraining is included, which matches the scaling intuition in robotics.

On the soft spots, the shared weights create a plausible interference problem: gradients from video denoising at one timestep can still shift the parameters used for action denoising at another, especially on mixed data. The abstract does not detail ablations or masking that would confirm the timesteps fully isolate the signals, so the central claim rests on experiments whose robustness is hard to judge from the summary alone. Minor issues include the usual need for more baseline comparisons and data-split details.

This paper is for groups working on robot foundation models and world-model pretraining. Readers who care about data efficiency in imitation learning will get value from the unification idea even if they end up tweaking the architecture. It deserves serious peer review because the core mechanism is coherent and the problem is real, though the interference concern needs direct evidence in the full version.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Unified World Models (UWM), a unified transformer that couples an action diffusion process and a video diffusion process governed by independent modality-specific timesteps. By selecting appropriate timestep pairs at inference, the same model can be used as a policy, forward dynamics model, inverse dynamics model, or video generator. The authors report that pretraining on large-scale multitask robot datasets containing both action-labeled and action-free video data produces more generalizable policies than standard imitation learning in both simulation and real-world settings.

Significance. If the empirical gains prove robust, the work provides a practical unification of imitation learning and world modeling that directly addresses the scarcity of action annotations by leveraging abundant video data. The diffusion-timestep control mechanism offers a lightweight way to extract multiple capabilities from a single pretrained model, which could simplify scaling of robotic foundation models on heterogeneous datasets.

major comments (2)

[§3] §3 (Method): The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.
[§4] §4 (Experiments): The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.

minor comments (2)

[§3] Notation for the two diffusion timesteps should be introduced once with explicit symbols (e.g., t_v and t_a) and used consistently thereafter to improve readability.
The abstract would benefit from a single sentence summarizing the quantitative gains (e.g., success-rate deltas) rather than only qualitative statements.
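The notation suggestion in the first minor comment could look as follows. This is one possible way to write the joint objective, assuming a standard noise-prediction (DDPM-style) loss; the symbols o, o', a, t_v, t_a and the two noise-prediction heads are introduced here for illustration and may not match the paper's own notation.

```latex
% Assumed notation (not taken from the paper): o = current observation,
% o' = future video frames, a = action chunk, t_v, t_a \in \{0, \dots, T\}.
\begin{aligned}
o'_{t_v} &= \sqrt{\bar\alpha_{t_v}}\, o' + \sqrt{1 - \bar\alpha_{t_v}}\, \epsilon_v,
  \qquad \epsilon_v \sim \mathcal{N}(0, I) \\
a_{t_a} &= \sqrt{\bar\alpha_{t_a}}\, a + \sqrt{1 - \bar\alpha_{t_a}}\, \epsilon_a,
  \qquad \epsilon_a \sim \mathcal{N}(0, I) \\
\mathcal{L}(\theta) &= \mathbb{E}\Big[
  \big\| \epsilon_a - \epsilon^{a}_{\theta}(o, a_{t_a}, o'_{t_v}, t_a, t_v) \big\|^2
  + \big\| \epsilon_v - \epsilon^{v}_{\theta}(o, a_{t_a}, o'_{t_v}, t_a, t_v) \big\|^2
\Big]
\end{aligned}
```

The expectation is over data triples (o, a, o'), the two noise draws, and the two timesteps, with t_v and t_a sampled independently of each other. Both heads receive both timesteps, which is what lets a single network be queried at any timestep pair.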

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [§3] §3 (Method): The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.

Authors: We agree that a direct ablation comparing joint training to modality-isolated training would strengthen the evidence regarding gradient interference. While the empirical success of UWM across all tasks (policy, dynamics, inverse dynamics, and video generation) indicates that the independent timestep mechanism largely prevents objective conflicts, we acknowledge the absence of this specific control. In the revised manuscript we add an ablation that trains separate modality-specific models and compares them to the joint UWM, along with gradient-norm analysis during training to quantify any cross-modality interference.

revision: yes

Referee: [§4] §4 (Experiments): The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.

Authors: We appreciate the request for greater experimental transparency. The revised manuscript now includes: (i) explicit descriptions of all pretraining and finetuning data splits with exact dataset sizes and task distributions, (ii) full hyperparameter tables and training procedures for every baseline, and (iii) an additional control experiment that removes video pretraining while keeping the architecture and action data identical, thereby isolating the contribution of the video component.

revision: yes
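The interference question raised in the first exchange can be probed with a simple diagnostic on the shared parameters. The sketch below compares the gradient of a video loss with the gradient of an action loss by norm and by cosine similarity, a common multi-task diagnostic that is named here as a stand-in and is not the analysis the rebuttal describes. The toy model (one shared linear layer feeding two linear heads with squared-error losses) and all function names are hypothetical.

```python
import numpy as np

def gradient_cosine(grads_a, grads_b, eps=1e-12):
    """Cosine similarity between two lists of per-parameter gradient arrays."""
    a = np.concatenate([g.ravel() for g in grads_a])
    b = np.concatenate([g.ravel() for g in grads_b])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def shared_layer_grads(W, x, target_video, target_action, H_v, H_a):
    """Toy model: h = W x, video head H_v h, action head H_a h.

    Returns the gradients of 0.5 * ||prediction - target||^2 for each head
    with respect to the shared weight matrix W.
    """
    h = W @ x
    residual_v = H_v @ h - target_video
    residual_a = H_a @ h - target_action
    grad_v = np.outer(H_v.T @ residual_v, x)  # d(video loss) / dW
    grad_a = np.outer(H_a.T @ residual_a, x)  # d(action loss) / dW
    return grad_v, grad_a

def interference_report(grad_v, grad_a):
    """Summarize how the two objectives pull on the shared weights."""
    return {
        "video_grad_norm": float(np.linalg.norm(grad_v)),
        "action_grad_norm": float(np.linalg.norm(grad_a)),
        "cosine": gradient_cosine([grad_v], [grad_a]),
    }
```

A cosine near -1 would mean the two objectives push the shared weights in opposing directions, while values near 0 or above suggest little direct conflict on that batch. In a real model the two gradients would come from separate backward passes over the shared transformer blocks, logged throughout training.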

Circularity Check

0 steps flagged

No circularity: new unified diffusion architecture evaluated on external datasets

The paper introduces UWM as a novel transformer-based coupling of independent video and action diffusion processes, with claims about flexible representation of policies and dynamics arising directly from the architectural choice of modality-specific timesteps. No equations or derivations reduce by construction to fitted parameters defined by the target result, nor do any load-bearing steps rely on self-citations that themselves assume the outcome. The pretraining procedure and empirical evaluations on simulated and real-world robot datasets are presented as independent of the claimed capabilities, making the derivation self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the model appears to rest on standard diffusion and transformer assumptions already established in prior literature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
cs.CV 2026-05 unverdicted novelty 7.0

EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 7.0

MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO 2026-04 unverdicted novelty 7.0

VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
Envisioning the Future, One Step at a Time
cs.CV 2026-04 unverdicted novelty 7.0

An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO 2026-05 unverdicted novelty 6.0

A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
MolmoAct2: Action Reasoning Models for Real-world Deployment
cs.RO 2026-05 unverdicted novelty 6.0

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0

X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
Human Cognition in Machines: A Unified Perspective of World Models
cs.RO 2026-04 unverdicted novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
cs.RO 2026-04 unverdicted novelty 6.0

Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
cs.RO 2026-04 unverdicted novelty 6.0

AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
cs.CV 2026-04 unverdicted novelty 6.0

CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO 2026-04 conditional novelty 6.0

MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
cs.RO 2026-03 unverdicted novelty 6.0

SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
cs.AI 2026-01 conditional novelty 6.0

Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
cs.RO 2026-05 unverdicted novelty 5.0

Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
Motus: A Unified Latent Action World Model
cs.CV 2025-12 unverdicted novelty 5.0

Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 25 Pith papers · 13 internal anchors

[1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millicah, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhi- tao Gong, Sina Samangooei, Marianne Monteiro, Ja- cob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bink...

work page 2022
[2]

://arxiv.org/abs/2304.08488

Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics, 2023. URL https://arxiv.org/abs/2304.08488

work page arXiv 2023
[3]

Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models

Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Interna- tional Conference on Learning Representations , 2022

work page 2022
[4]

One transformer fits all distributions in multi-modal diffusion at scale, 2023

Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale, 2023. URL https://arxiv.org/abs/2303. 06555

work page 2023
[5]

Track2act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation,

Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation, 2024. URL https://arxiv.org/abs/ 2405.01527

work page arXiv 2024
[6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vi...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Stable video diffusion: Scaling latent video diffusion models to large datasets,

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets,

work page
[8]

URL https://arxiv.org/abs/2311.15127

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , July 2017

work page 2017
[10]

Unimask: Unified inference in sequential decision problems, 2022

Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Mi- lani, Katja Hofmann, Matthew Hausknecht, Anca Dra- gan, and Sam Devlin. Unimask: Unified inference in sequential decision problems, 2022. URL https://arxiv. org/abs/2211.10869

work page arXiv 2022
[11]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.arXiv preprint arXiv:2407.01392,

Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffu- sion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392

work page arXiv 2024
[12]

Dif- fusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[13]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models, 2024. URL https://arxiv.org/abs/2310.08864

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

From play to policy: Condi- tional behavior generation from uncurated robot data

Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Condi- tional behavior generation from uncurated robot data. In International Conference on Learning Representations , 2023

work page 2023
[15]

Vision transformers need registers

Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Rep- resentations, 2024. URL https://openreview.net/forum? id=2dnO3LLiJ1

work page 2024
[16]

Dasari, O

Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Ku- mar Srirama, and Sergey Levine. The ingredients for robotic diffusion transformers. arXiv preprint arXiv:2410.10088, 2024

work page arXiv 2024
[17]

In: 2009 IEEE Conference on Computer Vision and Pattern Recognition

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition , pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[18]

An image is worth 16x16 words: Transformers for im- age recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for im- age recognition at scale. In International Confer- ence on Learning Representations , 2021. URL ...

work page 2021
[19]

The” something something” video database for learning and evaluating visual common sense

Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international con- ference on computer vision , pages 58...

work page 2017
[20]

Prediction with action: Visual policy learning via joint denoising process

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Confer- ence on Neural Information Processing Systems , 2024

work page 2024
[21]

Prediction with action: Visual policy learning via joint denoising process, 2024

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024. URL https://arxiv.org/abs/2411. 18179

work page 2024
[22]

Zhang, Shaoqing Ren, and Jian Sun

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR) , pages 770–778, 2015. URL https://api. semanticscholar.org/CorpusID:206594692

work page 2016
[23]

Denoising diffusion probabilistic models, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020

work page 2020
[24]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations, 2024. URL https://arxiv.org/abs/2412.14803

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Le...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Behavior generation with latent actions

Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181

work page arXiv 2024
[29]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023

work page 2023
[30]

LIBERO: Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=xzEtNSuDJk

work page 2023
[31]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

work page 2019
[32]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Ssm meets video diffusion models: Efficient long-term video generation with structured state spaces

Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, and Yutaka Matsuo. SSM meets video diffusion models: Efficient long-term video generation with structured state spaces. arXiv preprint arXiv:2403.07711, March 2024

work page arXiv 2024
[34]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=di52zR8xgf

work page
[37]

Cosmos World Foundation Model Platform for Physical AI

NVIDIA Research. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, January 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

High-resolution image synthesis with latent diffusion models, 2021

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

work page 2021
[39]

Efficient reductions for imitation learning

Stephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL ...

work page 2010
[40]

Behavior transformers: Cloning k modes with one stone

Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ariuntuya Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone, 2022. URL https://arxiv.org/abs/2206.11251

work page arXiv 2022
[41]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=St1giarCHLP

work page 2021
[42]

Octo: An open-source generalist robot policy, 2024

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. URL https://arxiv.org/abs/24...

work page 2024
[43]

Mimicplay: Long-horizon imitation learning by watching human play

Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anandkumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URL https://arxiv.org/abs/2302.12422

work page arXiv 2023
[44]

Any-point trajectory modeling for policy learning

Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning, 2024. URL https://arxiv.org/abs/2401.00025

work page arXiv 2024
[45]

Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024

Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024. URL https://arxiv.org/abs/2409.12514

work page arXiv 2024
[46]

Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023. URL https://arxiv.org/abs/2312.13139

work page internal anchor Pith review arXiv 2023
[47]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[48]

ivideogpt: Interactive videogpts are scalable world models

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. In Advances in Neural Information Processing Systems, 2024

work page 2024
[49]

Learning by watching: Physical imitation of manipulation skills from human videos, 2021

Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos, 2021. URL https://arxiv.org/abs/2101.07241

work page arXiv 2021
[50]

Latent action pretraining from videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos, 2024. URL https://arxiv.org/abs/2410.11758

work page Pith review arXiv
[52]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Aloha unleashed: A simple recipe for robot dexterity

Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity, 2024. URL https://arxiv.org/abs/2410.13126

work page 2024
[54]

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model, 2024. URL https://arxiv.org/abs/2408.11039.

work page internal anchor Pith review Pith/arXiv arXiv 2024

APPENDIX A. Additional Implementation Details

Model Architecture: We base our implementation of UWM on the diffusion transformer architecture with AdaLN conditioning [33]. The inputs to the model are $(o, a_{t_a}, o'_{t_{o'}}, t_a, t_{o'})$, where $o := \{o^i_{0:h_o}\}_{i=1}^{n_c}$ is a sequence of observations from $n_c$ camera views, $a_{t_a} := a_{h_o:h_o+h_a}$ is a sequence of noisy actions, $o'_{t_{o'}} := \{o^i_{h_o+h_a:2h_o+h_a}\}_{i=1}^{n_c}$ is a seque...
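The two timestep inputs $t_a$ and $t_{o'}$ are what let one network act as a policy, a forward dynamics model, an inverse dynamics model, or a video generator. The sketch below shows one plausible way to pick the pair for each mode. The function name `timestep_pair`, the value of `T`, and the exact pairing per mode are illustrative assumptions; this excerpt only states that the modes are selected by controlling the two timesteps.

```python
# Hypothetical helper: the names, the value of T, and the pairing per mode
# are assumptions for illustration, not taken from the paper's code.
T = 1000  # assumed number of diffusion steps; not given in this excerpt


def timestep_pair(mode, t):
    """Return (t_a, t_o) for one sampling step t of the chosen mode.

    t_a is the action timestep, t_o the future-observation timestep.
    0 means "clean, used as conditioning", T means "pure noise, carries
    no information", and t means "currently being denoised".
    """
    if not 0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    pairs = {
        "policy": (t, T),            # denoise actions, future frames fully noised
        "video_prediction": (T, t),  # denoise future frames, actions fully noised
        "forward_dynamics": (0, t),  # denoise future frames given clean actions
        "inverse_dynamics": (t, 0),  # denoise actions given clean future frames
    }
    if mode not in pairs:
        raise ValueError(f"unknown mode: {mode}")
    return pairs[mode]
```

Under this reading, switching modes changes only the inputs passed to the network, never the weights.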

Training and Inference Details: Given a transition tuple (o, a, o′) sampled from the dataset, we first apply random cropping and augmentations to the image observations. The cropping and augmentation parameters are kept temporally consistent across o and o′ but differ from camera view to camera view. We then sample action and observation diffusion ti...
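The independent sampling of action and observation timesteps can be sketched as follows. This is a minimal sketch under stated assumptions: a standard DDPM forward process with a linear beta schedule, uniform timestep sampling, and flat Python lists standing in for action and image tensors. The excerpt does not specify any of these, and the names `alpha_bar`, `add_noise`, and `make_training_example` are hypothetical.

```python
import math
import random

NUM_STEPS = 1000  # assumed; the excerpt does not give the number of steps


def alpha_bar(t, num_steps=NUM_STEPS, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s) for s < t, assumed linear schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_min + (beta_max - beta_min) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, rng):
    """DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    abar = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e
          for x, e in zip(x0, eps)]
    return xt, eps


def make_training_example(actions, next_obs, rng):
    """Noise actions and future observations with independent timesteps."""
    t_a = rng.randint(1, NUM_STEPS)  # action timestep
    t_o = rng.randint(1, NUM_STEPS)  # observation timestep, drawn separately
    noisy_a, eps_a = add_noise(actions, t_a, rng)
    noisy_o, eps_o = add_noise(next_obs, t_o, rng)
    return {
        "t_a": t_a, "t_o": t_o,
        "noisy_actions": noisy_a, "noisy_next_obs": noisy_o,
        "eps_a": eps_a, "eps_o": eps_o,  # regression targets for the denoiser
    }
```

Because the two timesteps are drawn separately, a single batch covers every combination of noise levels across the two modalities.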

Training Compute: Training a UWM on the DROID dataset for 100K gradient steps with the hyperparameters shown in Table V takes 24 hours on 4 NVIDIA A100 GPUs using PyTorch DDP.

Fig. 11 (panels: Scene camera 1, Scene camera 2, Eval camera, Wrist camera). Setup of the robot experiments. We adopt the DROID [25] setup which consists of two scene cameras and one wrist camer...
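For budgeting purposes, the training-compute figures quoted above (100K gradient steps, 24 hours, 4 GPUs) translate into GPU-hours and average throughput as in the short calculation below. The variable names are illustrative, and the throughput is a wall-clock average that ignores any startup or checkpointing overhead.

```python
# Figures taken from the Training Compute paragraph.
gradient_steps = 100_000
wall_clock_hours = 24
num_gpus = 4

# Total accelerator time consumed by one pretraining run.
gpu_hours = wall_clock_hours * num_gpus

# Average optimizer steps per wall-clock second across the run.
steps_per_second = gradient_steps / (wall_clock_hours * 3600)

print(f"{gpu_hours} GPU-hours, {steps_per_second:.2f} steps/s")
```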

Diffusion Policies: We base our implementation of diffusion policies on the UWM model. We remove the image tokens, image diffusion timestep, and registers and keep everything else identical. This is equivalent to the Transformer version of the original diffusion policy [11] and similar to the architecture in [15]

PAD: We base our implementation of PAD on the UWM model, replacing coupled action-image diffusion with joint diffusion, and condition the model by concatenating the clean current observations to the noisy future observation predictions along the channel dimension. The diffusion timestep is still passed into the transformer via AdaLN. While the original PA...

GR1: We use a custom implementation of the GR1 model adapted to have the same input-output format as UWM. Instead of regressing consecutive actions and observations, we predict a sequence of actions and the following image observations. GR1 conditions on the current observations by passing the ViT encoded observation tokens through a Perceiver resampler...

Robot Setup: We conduct real-world experiments using a Franka Panda robot in the DROID [25] setup. As shown in Fig. 11, the robot's observation space consists of two scene cameras and a wrist camera (visualized in Fig. 13). We additionally mount an overhead camera to track the initializations during

TABLE VI: TASK-SPECIFIC PARAMETERS. # demos # finetuning st...

Tasks: We provide a detailed description of each real-world task shown in Fig. 5 and the task-specific settings in Table VI.
• Stack-Bowls: the robot needs to pick up the red bowl on the counter and place it in the blue bowl. The positions of the bowls are randomized across the counter top. A rollout is successful if the red bowl is placed securely insid...

Evaluation Protocol: To ensure fairness of real-robot evaluations, we use an overhead camera and a Python program to systematically track randomizations. As shown in Fig. 12, the program overlays the reference frame onto the current frame, so the user can adjust the objects to match the reference frame. All tasks except Rice-Cooker are evaluated on 50 r...
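The reference-frame overlay described above can be approximated by a simple alpha blend. The sketch below is an assumption about how such an overlay might work, not the authors' program: the function name `overlay`, the blending weight `alpha`, and the representation of a frame as rows of (r, g, b) tuples are all hypothetical choices made to keep the example dependency-free.

```python
def overlay(reference, current, alpha=0.5):
    """Blend two equally sized frames given as rows of (r, g, b) tuples.

    alpha = 1.0 shows only the reference frame, alpha = 0.0 only the
    current frame. Intermediate values show both, so misplaced objects
    appear as ghosted double images.
    """
    same_height = len(reference) == len(current)
    if not same_height or any(len(r) != len(c)
                              for r, c in zip(reference, current)):
        raise ValueError("frames must have the same height and width")
    blended = []
    for ref_row, cur_row in zip(reference, current):
        row = []
        for ref_px, cur_px in zip(ref_row, cur_row):
            # Per-channel weighted average, rounded back to integer intensity.
            row.append(tuple(round(alpha * r + (1.0 - alpha) * c)
                             for r, c in zip(ref_px, cur_px)))
        blended.append(row)
    return blended
```

An operator would then move objects until the ghosted copies coincide, at which point the scene matches the stored initialization.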

Failure Modes: We provide a description of some common failure modes in the real-world experiments. Although we utilized three cameras to maximize coverage (Fig. 13), certain angles resulted in objects being visible to only one camera. These limited viewpoints made some initializations more challenging for the robot to complete the tasks successfully. A...

Simulated Environments: LIBERO [29] is a simulated robotic benchmark designed to evaluate lifelong learning algorithms. It involves controlling a 7-DoF Franka Panda

Fig. 13 (panels: In-Distribution; Standard OOD; Lighting 1, Lighting 2; Background 1, Background 2; Clutter 1, Clutter 2). Visualization of the robot's perspective in in-distribution, standard out-of-distribut...

• Book-Caddy: the robot needs to pick up the book from the table top and place it in the back of a caddy.
• Soup-Cheese: the robot needs to place the alphabet soup and the cheese in the basket in sequence.
• Bowl-Drawer: the robot needs to pick up the bowl, place it in the bottom drawer, and close the drawer.
• Moka-Moka: the robot needs to pick up the two Moka cups from the table and place them on the electric stove.
• Mug-Mug: the robot needs to place the left mug in the left plate and place the right mug in the right plate.

TABLE VII: ABLATION OF DESIGN CHOICES

Model                  Book-Caddy     Soup-Cheese
UWM w/ 8 registers     0.88 ± 0.04    0.90 ± 0.02
UWM w/ 4 registers     0.83 ± 0.05    0.86 ± 0.03
UWM w/o registers      0.81 ± 0.07    0.85 ± 0.03
Cross attention UWM    0.78 ± 0.05    0.86 ± 0.04

TABLE VIII ABLAT...

Ablations of Design Choices: To understand the effect of UWM's design choices, we conduct ablation studies on two simulated tasks from the LIBERO environment. Specifically, we want to (1) understand the effect of registers on task performance, and (2) compare the use of AdaLN for observation conditioning with cross attention [17]. For each model, we tra...

Ablation of Learning Objectives: To evaluate whether the performance gain of UWM is a result of dynamics prediction or pure reconstruction, we pretrain a UWM to reconstruct the current observations instead of the future observations. This incentivizes the model to learn about image features, but not about temporal dynamics. Table VIII shows that while ...

Learning from Internet videos: We evaluate whether UWM can leverage knowledge from Internet videos by including a mixture of Kinetics-400 [8] and Something-Something-

Fig. 14 (panel title: Internet Video Dataset (Kinetics 400 and Something-Something v2)). Visualization of Internet video dataset. We curate the dataset by combining human activity videos from Kinetics-400...