Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Pith reviewed 2026-05-13 16:17 UTC · model grok-4.3
The pith
Unified World Models couple video diffusion and action diffusion inside one transformer so a single network can pretrain robot policies on mixed video-plus-action datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unified World Models integrate an action diffusion process and a video diffusion process within a single transformer, with an independent diffusion timestep governing each modality. By controlling each timestep, UWM can flexibly act as a policy, a forward dynamics model, an inverse dynamics model, or a video generator.
What carries the argument
A unified transformer with two independent diffusion timesteps, one for video frames and one for actions. The same weights switch among policy, forward dynamics model, inverse dynamics model, and video generator simply by choosing the timestep pair.
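The timestep-pair switch can be sketched as a small lookup table. This is a minimal illustration, not the authors' code: the names `Role`, `MODES`, and `timestep_schedule`, the value `t_max=1000`, and the linear sweep are all hypothetical. It assumes the usual diffusion convention that a stream held at timestep 0 is clean conditioning, while a stream held at the maximum timestep is pure noise and is therefore effectively marginalized out.

```python
from enum import Enum


class Role(Enum):
    CLEAN = "held at t = 0: stream is given as clean conditioning"
    NOISE = "held at t = t_max: stream is pure noise (marginalized out)"
    SAMPLE = "swept from t_max toward 0: stream is generated by denoising"


# Assumed mapping from capability to (action role, video role).
MODES = {
    "policy": (Role.SAMPLE, Role.NOISE),
    "forward_dynamics": (Role.CLEAN, Role.SAMPLE),
    "inverse_dynamics": (Role.SAMPLE, Role.CLEAN),
    "video_generation": (Role.NOISE, Role.SAMPLE),
}


def timestep_schedule(mode, num_steps=4, t_max=1000):
    """Return the (t_action, t_video) pairs visited while sampling in `mode`."""
    action_role, video_role = MODES[mode]
    # Hypothetical linear sweep of the denoised stream: t_max, ..., > 0.
    sweep = [t_max - (t_max * i) // num_steps for i in range(num_steps)]

    def resolve(role, t):
        if role is Role.CLEAN:
            return 0
        if role is Role.NOISE:
            return t_max
        return t

    return [(resolve(action_role, t), resolve(video_role, t)) for t in sweep]


if __name__ == "__main__":
    for mode in MODES:
        print(mode, timestep_schedule(mode))
```

Under these assumptions, policy mode sweeps only the action timestep while the video timestep stays at `t_max`, and forward dynamics does the reverse with the action timestep pinned at 0.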
If this is right
- Pretraining on large multitask robot datasets that contain both dynamics and action labels produces policies that transfer more robustly than standard imitation learning.
- Independent timestep control lets the model absorb action-free video data during pretraining without requiring action labels, further boosting downstream policy performance.
- The same weights can be used at inference time as a forward dynamics predictor, an inverse dynamics predictor, or a video generator simply by changing the diffusion timestep pair.
- The approach unifies imitation learning and world modeling inside one training run rather than training separate models.
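One way to picture how independent timesteps let action-free video enter pretraining is the sampling step below. It is a hedged sketch rather than the paper's training loop: `sample_timesteps` and `combined_loss` are hypothetical helpers, and both the choice to pin the action timestep at `t_max` for unlabeled clips and the choice to zero out their action loss are assumptions made for illustration.

```python
import random


def sample_timesteps(has_action, t_max=1000, rng=None):
    """Draw an independent (t_action, t_video) pair per sample, plus an action-loss weight."""
    rng = rng or random.Random()
    batch = []
    for labeled in has_action:
        t_video = rng.randint(1, t_max)  # the video timestep is always sampled
        if labeled:
            t_action = rng.randint(1, t_max)  # sampled independently of t_video
            weight = 1.0
        else:
            t_action = t_max  # assumption: no label, so the action slot is pure noise
            weight = 0.0      # assumption: the action term is excluded from the loss
        batch.append((t_action, t_video, weight))
    return batch


def combined_loss(action_errors, video_errors, batch):
    """Average per-sample loss: video error plus the weighted action error."""
    total = 0.0
    for a_err, v_err, (_, _, weight) in zip(action_errors, video_errors, batch):
        total += v_err + weight * a_err
    return total / len(batch)


if __name__ == "__main__":
    # One action-labeled trajectory and one action-free video clip.
    batch = sample_timesteps([True, False], rng=random.Random(0))
    print(batch)
    print(combined_loss([2.0, 5.0], [1.0, 1.0], batch))
```

In this sketch the unlabeled clip still contributes a video denoising term, which is how cheap video could shape the shared weights without any action annotation.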
Where Pith is reading between the lines
- The method could be extended to additional modalities such as language or tactile signals by adding further independent diffusion streams inside the same transformer.
- Because video data is far cheaper to collect than action-labeled trajectories, the framework lowers the data cost of scaling robot foundation models.
- If timestep separation works cleanly, similar diffusion unification might apply to other paired modalities where one stream is easier to observe than the other.
- Real-world deployment would benefit from testing whether the learned forward model can be used for planning without retraining.
Load-bearing premise
Separate timestep control for each modality inside a shared transformer is enough to keep video and action modeling from interfering while still letting each capability be read out cleanly at test time.
What would settle it
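Train UWM on a mixed video-plus-action dataset, then run it in policy mode (actions are denoised while the video timestep is held at maximum noise). If its success rate falls below that of a model trained only on action data while video generation quality remains high, the shared weights are trading policy quality for video modeling and the premise fails.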
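A success-rate comparison of this kind needs some check that the gap is larger than rollout noise. The sketch below uses a standard pooled two-proportion z-test; the function name `two_proportion_z` and the rollout counts in the usage lines are hypothetical and are not results from the paper.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    variance = pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b)
    if variance == 0.0:
        raise ValueError("success rates are all 0 or all 1; the test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: jointly trained model vs. action-only baseline.
    z, p = two_proportion_z(success_a=80, n_a=100, success_b=60, n_b=100)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

With only a few dozen real-robot rollouts per task, an exact test or a confidence interval on each rate may be more appropriate than this normal approximation.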
Original abstract
Imitation learning has emerged as a promising approach towards building generalist robots. However, scaling imitation learning for large robot foundation models remains challenging due to its reliance on high-quality expert demonstrations. Meanwhile, large amounts of video data depicting a wide range of environments and diverse behaviors are readily available. This data provides a rich source of information about real-world dynamics and agent-environment interactions. Leveraging this data directly for imitation learning, however, has proven difficult due to the lack of action annotation. In this work, we present Unified World Models (UWM), a framework that allows for leveraging both video and action data for policy learning. Specifically, a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator. Through simulated and real-world experiments, we show that: (1) UWM enables effective pretraining on large-scale multitask robot datasets with both dynamics and action predictions, resulting in more generalizable and robust policies than imitation learning, (2) UWM naturally facilitates learning from action-free video data through independent control of modality-specific diffusion timesteps, further improving the performance of finetuned policies. Our results suggest that UWM offers a promising step toward harnessing large, heterogeneous datasets for scalable robot learning, and provides a simple unification between the often disparate paradigms of imitation learning and world modeling. Videos and code are available at https://weirdlabuw.github.io/uwm/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Unified World Models (UWM), a unified transformer that couples an action diffusion process and a video diffusion process governed by independent modality-specific timesteps. By selecting appropriate timestep pairs at inference, the same model can be used as a policy, forward dynamics model, inverse dynamics model, or video generator. The authors report that pretraining on large-scale multitask robot datasets containing both action-labeled and action-free video data produces more generalizable policies than standard imitation learning in both simulation and real-world settings.
Significance. If the empirical gains prove robust, the work provides a practical unification of imitation learning and world modeling that directly addresses the scarcity of action annotations by leveraging abundant video data. The diffusion-timestep control mechanism offers a lightweight way to extract multiple capabilities from a single pretrained model, which could simplify scaling of robotic foundation models on heterogeneous datasets.
Major comments (2)
- [§3, Method] The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.
- [§4, Experiments] The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.
Minor comments (2)
- [§3] Notation for the two diffusion timesteps should be introduced once with explicit symbols (e.g., t_v and t_a) and used consistently thereafter to improve readability.
- The abstract would benefit from a single sentence summarizing the quantitative gains (e.g., success-rate deltas) rather than only qualitative statements.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
-
Referee: [§3, Method] The central claim that independent control of (t_video, t_action) inside a shared transformer cleanly yields uncontaminated policies, dynamics, or video generation rests on the untested assumption that cross-modality gradient interference is negligible. No ablation compares joint training against modality-isolated training, nor is there analysis of how the shared weights handle conflicting denoising objectives on heterogeneous data.
Authors: We agree that a direct ablation comparing joint training to modality-isolated training would strengthen the evidence regarding gradient interference. While the empirical success of UWM across all tasks (policy, dynamics, inverse dynamics, and video generation) indicates that the independent timestep mechanism largely prevents objective conflicts, we acknowledge the absence of this specific control. In the revised manuscript we add an ablation that trains separate modality-specific models and compares them to the joint UWM, along with gradient-norm analysis during training to quantify any cross-modality interference.
Revision: yes
-
Referee: [§4, Experiments] The reported policy improvements lack sufficient detail on data splits, exact baseline implementations, and controls that isolate the contribution of video pretraining. Without these, it is impossible to determine whether gains arise from the unified architecture or from other experimental choices.
Authors: We appreciate the request for greater experimental transparency. The revised manuscript now includes: (i) explicit descriptions of all pretraining and finetuning data splits with exact dataset sizes and task distributions, (ii) full hyperparameter tables and training procedures for every baseline, and (iii) an additional control experiment that removes video pretraining while keeping the architecture and action data identical, thereby isolating the contribution of the video component.
Revision: yes
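The gradient analysis promised in the first response could take roughly the following form: compute the gradient of each modality's loss with respect to the shared weights, then compare norms and direction. This is a toy sketch with a linear model and squared error standing in for the transformer and the two denoising losses; `grad_squared_error`, `norm`, and `cosine` are hypothetical helpers, and the procedure is an assumption about what such an analysis might look like, not the authors' protocol.

```python
import math


def grad_squared_error(w, x, y):
    """Gradient of (w . x - y)^2 with respect to the shared weights w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [2.0 * residual * xi for xi in x]


def norm(g):
    """Euclidean norm of a gradient vector."""
    return math.sqrt(sum(v * v for v in g))


def cosine(g1, g2):
    """Cosine similarity; values near -1 indicate directly conflicting updates."""
    denom = norm(g1) * norm(g2)
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(g1, g2)) / denom


if __name__ == "__main__":
    shared_w = [1.0, 0.0]
    # Stand-ins for the action-denoising and video-denoising objectives.
    g_action = grad_squared_error(shared_w, x=[1.0, 0.0], y=0.0)
    g_video = grad_squared_error(shared_w, x=[1.0, 0.0], y=2.0)
    print("norms:", norm(g_action), norm(g_video))
    print("cosine:", cosine(g_action, g_video))
```

Tracking the cosine over training, rather than the norms alone, would show whether the two objectives pull the shared weights in opposing directions or merely at different scales.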
Circularity Check
No circularity: new unified diffusion architecture evaluated on external datasets
Full rationale
The paper introduces UWM as a novel transformer-based coupling of independent video and action diffusion processes, with claims about flexible representation of policies and dynamics arising directly from the architectural choice of modality-specific timesteps. No equations or derivations reduce by construction to fitted parameters defined by the target result, nor do any load-bearing steps rely on self-citations that themselves assume the outcome. The pretraining procedure and empirical evaluations on simulated and real-world robot datasets are presented as independent of the claimed capabilities, making the derivation self-contained without circular reduction.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced [unclear]
Relation between the paper passage and the cited Recognition theorem: unclear.
a UWM integrates an action diffusion process and a video diffusion process within a unified transformer architecture, where independent diffusion timesteps govern each modality. By controlling each diffusion timestep, UWM can flexibly represent a policy, a forward dynamics, an inverse dynamics, and a video generator.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 28 Pith papers
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.
-
Envisioning the Future, One Step at a Time
An autoregressive diffusion model on sparse point trajectories predicts multi-modal future scene dynamics from single images with orders-of-magnitude faster sampling than dense video simulators while matching accuracy.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
-
MolmoAct2: Action Reasoning Models for Real-world Deployment
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.
-
Grounded World Model for Semantically Generalizable Planning
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
-
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.
-
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
-
Motus: A Unified Latent Action World Model
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, An- toine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millicah, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhi- tao Gong, Sina Samangooei, Marianne Monteiro, Ja- cob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bink...
work page 2022
-
[2]
Shikhar Bahl, Russell Mendonca, Lili Chen, Unnat Jain, and Deepak Pathak. Affordances from human videos as a versatile representation for robotics, 2023. URL https://arxiv.org/abs/2304.08488
-
[3]
Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models
Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Interna- tional Conference on Learning Representations , 2022
work page 2022
-
[4]
One transformer fits all distributions in multi-modal diffusion at scale, 2023
Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale, 2023. URL https://arxiv.org/abs/2303. 06555
work page 2023
-
[5]
Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation, 2024. URL https://arxiv.org/abs/ 2405.01527
-
[6]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A vi...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Stable video diffusion: Scaling latent video diffusion models to large datasets,
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets,
-
[8]
URL https://arxiv.org/abs/2311.15127
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Quo vadis, action recognition? a new model and the kinetics dataset
Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , July 2017
work page 2017
-
[10]
Unimask: Unified inference in sequential decision problems, 2022
Micah Carroll, Orr Paradise, Jessy Lin, Raluca Georgescu, Mingfei Sun, David Bignell, Stephanie Mi- lani, Katja Hofmann, Matthew Hausknecht, Anca Dra- gan, and Sam Devlin. Unimask: Unified inference in sequential decision problems, 2022. URL https://arxiv. org/abs/2211.10869
-
[11]
Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffu- sion forcing: Next-token prediction meets full-sequence diffusion, 2024. URL https://arxiv.org/abs/2407.01392
-
[12]
Dif- fusion policy: Visuomotor policy learning via action diffusion
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Dif- fusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023
work page 2023
-
[13]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and rt-x models, 2024. URL https://arxiv.org/abs/2310.08864
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
From play to policy: Condi- tional behavior generation from uncurated robot data
Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Condi- tional behavior generation from uncurated robot data. In International Conference on Learning Representations , 2023
work page 2023
-
[15]
Vision transformers need registers
Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In The Twelfth International Conference on Learning Rep- resentations, 2024. URL https://openreview.net/forum? id=2dnO3LLiJ1
work page 2024
- [16]
-
[17]
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition , pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
-
[18]
An image is worth 16x16 words: Transformers for im- age recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for im- age recognition at scale. In International Confer- ence on Learning Representations , 2021. URL ...
work page 2021
-
[19]
The” something something” video database for learning and evaluating visual common sense
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michal- ski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The” something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE international con- ference on computer vision , pages 58...
work page 2017
-
[20]
Prediction with action: Visual policy learning via joint denoising process
Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Confer- ence on Neural Information Processing Systems , 2024
work page 2024
-
[21]
Prediction with action: Visual policy learning via joint denoising process, 2024
Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024. URL https://arxiv.org/abs/2411. 18179
work page 2024
-
[22]
Zhang, Shaoqing Ren, and Jian Sun
Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR) , pages 770–778, 2015. URL https://api. semanticscholar.org/CorpusID:206594692
work page 2016
-
[23]
Denoising diffusion probabilistic models, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020
work page 2020
-
[24]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458 , 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual represen- tations, 2024. URL https://arxiv.org/abs/2412.14803
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abra- ham Le...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 , 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Behavior generation with latent actions.arXiv preprint arXiv:2403.03181, 2024
Seungjae Lee, Yibin Wang, Haritheja Etukuru, H. Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions, 2024. URL https://arxiv.org/abs/2403.03181
-
[29]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023
work page 2023
-
[30]
LIBERO: Benchmarking knowledge transfer for lifelong robot learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, qiang liu, Yuke Zhu, and Peter Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Thirty- seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track , 2023. URL https://openreview.net/forum?id=xzEtNSuDJk
work page 2023
-
[31]
Decoupled weight de- cay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learn- ing Representations, 2019. URL https://openreview.net/ forum?id=Bkg6RiCqY7
work page 2019
-
[32]
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation. In arXiv preprint arXiv:2108.03298, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[33]
Ssm meets video diffusion models: Efficient long-term video generation with structured state spaces
Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, and Yutaka Matsuo. Ssm meets video diffusion models: Efficient long-term video generation with structured state spaces. arXiv preprint arXiv:2403.07711 , March 2024
-
[34]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable dif- fusion models with transformers. arXiv preprint arXiv:2212.09748, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
SDXL: Improving latent diffusion mod- els for high-resolution image synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion mod- els for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations ,
-
[36]
URL https://openreview.net/forum?id=di52zR8xgf
-
[37]
Cosmos World Foundation Model Platform for Physical AI
NVIDIA Research. Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575, January 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
High-resolution image synthesis with latent diffusion models, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models, 2021
work page 2021
-
[39]
Efficient reduc- tions for imitation learning
Stephane Ross and Drew Bagnell. Efficient reduc- tions for imitation learning. In Yee Whye Teh and Mike Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 661–668, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL ...
work page 2010
-
[40]
Behavior transformers: Cloning k modes with one stone
Nur Muhammad Mahi Shafiullah, Zichen Jeff Cui, Ar- iuntuya Altanzaya, and Lerrel Pinto. Behavior trans- formers: Cloning k modes with one stone, 2022. URL https://arxiv.org/abs/2206.11251
-
[41]
De- noising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising diffusion implicit models. In International Conference on Learning Representations , 2021. URL https://openreview.net/forum?id=St1giarCHLP
work page 2021
-
[42]
Octo: An open-source gener- alist robot policy, 2024
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag San- keti, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source gener- alist robot policy, 2024. URL https://arxiv.org/abs/24...
work page 2024
-
[43]
Mimicplay: Long- horizon imitation learning by watching human play,
Chen Wang, Linxi Fan, Jiankai Sun, Ruohan Zhang, Li Fei-Fei, Danfei Xu, Yuke Zhu, and Anima Anand- kumar. Mimicplay: Long-horizon imitation learning by watching human play, 2023. URL https://arxiv.org/abs/ 2302.12422
-
[44]
Any-point trajectory modeling for policy learning
Chuan Wen, Xingyu Lin, John So, Kai Chen, Qi Dou, Yang Gao, and Pieter Abbeel. Any-point trajectory modeling for policy learning, 2024. URL https://arxiv.org/abs/2401.00025
-
[45]
Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024
Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation, 2024. URL https://arxiv.org/abs/2409.12514
-
[46]
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation, 2023. URL https://arxiv.org/abs/2312.13139
-
[47]
Unleashing large-scale video generative pre-training for visual robot manipulation
Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024
-
[48]
ivideogpt: Interactive videogpts are scalable world models
Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, and Mingsheng Long. ivideogpt: Interactive videogpts are scalable world models. In Advances in Neural Information Processing Systems, 2024
-
[49]
Learning by watching: Physical imitation of manipulation skills from human videos, 2021
Haoyu Xiong, Quanzhou Li, Yun-Chun Chen, Homanga Bharadhwaj, Samarth Sinha, and Animesh Garg. Learning by watching: Physical imitation of manipulation skills from human videos, 2021. URL https://arxiv.org/abs/2101.07241
-
[50]
Latent action pretraining from videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. URL https://arxiv.org/abs/2410.11758
-
[52]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware, 2023. URL https://arxiv.org/abs/2304.13705
-
[53]
Tony Z. Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. Aloha unleashed: A simple recipe for robot dexterity, 2024. URL https://arxiv.org/abs/2410.13126
-
[54]
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model, 2024. URL https://arxiv.org/abs/2408.11039
APPENDIX A. Additional Implementation Details
-
[55]
Model Architecture: We base our implementation of UWM on the diffusion transformer architecture with AdaLN conditioning [33]. The inputs to the model are $(o, a_{t_a}, o'_{t_{o'}}, t_a, t_{o'})$, where $o := \{o^i_{0:h_o}\}_{i=1}^{n_c}$ is a sequence of observations from $n_c$ camera views, $a_{t_a} := a_{h_o:h_o+h_a}$ is a sequence of noisy actions, $o'_{t_{o'}} := \{o^i_{h_o+h_a:2h_o+h_a}\}_{i=1}^{n_c}$ is a seque...
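The index ranges in the input definition can be made concrete with a small sketch. It assumes a single camera view, frames and actions that share one time index, and Python-style half-open slices; the helper name `split_transition` and the window sizes `h_o=2`, `h_a=16` are hypothetical and not taken from the paper.

```python
def split_transition(frames, actions, h_o, h_a):
    """Slice one training window into (o, a, o_next).

    Assumption: frames[t] and actions[t] refer to the same time step t,
    and the ranges in the text are read as half-open intervals.
    """
    if len(frames) < 2 * h_o + h_a or len(actions) < h_o + h_a:
        raise ValueError("trajectory window is too short")
    o = frames[0:h_o]                          # o_{0:h_o}
    a = actions[h_o:h_o + h_a]                 # a_{h_o:h_o+h_a}
    o_next = frames[h_o + h_a:2 * h_o + h_a]   # o_{h_o+h_a:2h_o+h_a}
    return o, a, o_next


# Toy window: strings stand in for image frames and action vectors.
frames = [f"frame{t}" for t in range(20)]
actions = [f"act{t}" for t in range(20)]
o, a, o_next = split_transition(frames, actions, h_o=2, h_a=16)
```

With several cameras, the same slicing would presumably be applied per view, which matches the outer index $i = 1, \dots, n_c$ in the definition.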
-
[56]
Training and Inference Details: Given a transition tuple (o, a, o′) sampled from the dataset, we first apply random cropping and augmentations to the image observations. The cropping and augmentation parameters are kept temporally consistent across o and o′ but differ from camera view to camera view. We then sample action and observation diffusion ti...
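A minimal sketch of how two independent diffusion timesteps might be drawn during training and pinned at inference, assuming a discrete horizon and a cosine-style noise schedule. The constant `NUM_STEPS`, the schedule, and all function names are illustrative assumptions; the mode-to-timestep table is one reading of the core claim that a single network acts as a policy, a forward dynamics model, an inverse dynamics model, or a video generator depending on the timestep pair.

```python
import math
import random

NUM_STEPS = 1000  # hypothetical diffusion horizon T


def alpha_bar(t, num_steps=NUM_STEPS):
    """Fraction of signal kept at step t (cosine-style schedule, illustrative)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def add_noise(x, t, rng, num_steps=NUM_STEPS):
    """Forward-diffuse a flat list of floats to step t; returns (noisy, noise)."""
    keep = math.sqrt(alpha_bar(t, num_steps))
    mix = math.sqrt(1.0 - alpha_bar(t, num_steps))
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noisy = [keep * xi + mix * ni for xi, ni in zip(x, noise)]
    return noisy, noise


def sample_training_timesteps(rng, has_actions=True, num_steps=NUM_STEPS):
    """Draw action and next-observation timesteps independently.

    Assumption: action-free video pins the action timestep at full noise.
    """
    t_obs = rng.randint(1, num_steps)
    t_act = rng.randint(1, num_steps) if has_actions else num_steps
    return t_act, t_obs


def inference_timesteps(mode, t, num_steps=NUM_STEPS):
    """Map an inference mode to (t_act, t_obs); t is the running denoising step."""
    table = {
        "policy": (t, num_steps),            # denoise actions, future obs fully noised
        "video_prediction": (num_steps, t),  # denoise future obs, actions fully noised
        "forward_dynamics": (0, t),          # denoise future obs given clean actions
        "inverse_dynamics": (t, 0),          # denoise actions given clean future obs
    }
    return table[mode]


rng = random.Random(0)
t_act, t_obs = sample_training_timesteps(rng)
noisy_actions, action_noise = add_noise([0.1, -0.2, 0.3], t_act, rng)
```

Pinning a modality at step 0 conditions on it, while pinning it at the final step effectively marginalizes it out, which is why no separate network is needed for each mode.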
-
[57]
Training Compute: Training a UWM on the DROID dataset for 100K gradient steps with the hyperparameters shown in Table V takes 24 hours on 4 NVIDIA A100 GPUs using PyTorch DDP.
Fig. 11 (panels: Scene camera 1, Scene camera 2, Eval camera, Wrist camera). Setup of the robot experiments. We adopt the DROID [25] setup which consists of two scene cameras and one wrist camer...
-
[58]
Diffusion Policies: We base our implementation of diffusion policies on the UWM model. We remove the image tokens, image diffusion timestep, and registers and keep everything else identical. This is equivalent to the Transformer version of the original diffusion policy [11] and similar to the architecture in [15]
-
[59]
PAD: We base our implementation of PAD on the UWM model, replacing coupled action-image diffusion with joint diffusion, and condition the model by concatenating the clean current observations to the noisy future observation predictions along the channel dimension. The diffusion timestep is still passed into the transformer via AdaLN. While the original PA...
-
[60]
GR1: We use a custom implementation of the GR1 model adapted to have the same input-output format as UWM. Instead of regressing consecutive actions and observations, we predict a sequence of actions and the following image observations. GR1 conditions on the current observations by passing the ViT-encoded observation tokens through a Perceiver resampler...
-
[61]
Robot Setup: We conduct real-world experiments using a Franka Panda robot in the DROID [25] setup. As shown in Fig. 11, the robot’s observation space consists of two scene cameras and a wrist camera (visualized in Fig. 13). We additionally mount an overhead camera to track the initializations during
TABLE VI: TASK-SPECIFIC PARAMETERS # demos # finetuning st...
-
[62]
Tasks: We provide a detailed description of each real-world task shown in Fig. 5 and the task-specific settings in Table VI.
• Stack-Bowls: the robot needs to pick up the red bowl on the counter and place it in the blue bowl. The positions of the bowls are randomized across the countertop. A rollout is successful if the red bowl is placed securely insid...
-
[63]
Evaluation Protocol: To ensure fairness of real-robot evaluations, we use an overhead camera and a Python program to systematically track randomizations. As shown in Fig. 12, the program overlays the reference frame onto the current frame, so the user can adjust the objects to match the reference frame. All tasks except Rice-Cooker are evaluated on 50 r...
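The overlay step can be illustrated with a small alpha-blending sketch. This is not the authors' program: the functions `overlay` and `mean_abs_diff`, the grayscale nested-list frame format, and the 50/50 blend weight are all assumptions chosen to keep the example dependency-free.

```python
def overlay(reference, current, alpha=0.5):
    """Alpha-blend two equally sized grayscale frames (nested lists of 0-255 ints)."""
    if len(reference) != len(current) or any(
        len(r) != len(c) for r, c in zip(reference, current)
    ):
        raise ValueError("frames must have the same shape")
    return [
        [round(alpha * r + (1.0 - alpha) * c) for r, c in zip(ref_row, cur_row)]
        for ref_row, cur_row in zip(reference, current)
    ]


def mean_abs_diff(reference, current):
    """Average per-pixel difference; 0 means the scene matches the reference."""
    diffs = [
        abs(r - c)
        for ref_row, cur_row in zip(reference, current)
        for r, c in zip(ref_row, cur_row)
    ]
    return sum(diffs) / len(diffs)


# Toy 2x2 frames: the reference initialization and the live overhead view.
reference = [[0, 100], [200, 255]]
current = [[0, 120], [180, 255]]
blended = overlay(reference, current)
mismatch = mean_abs_diff(reference, current)
```

In a setup like this, the operator would move objects until the blended image shows no ghosting, or equivalently until the mismatch score drops near zero.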
-
[64]
Failure Modes: We provide a description of some common failure modes in the real-world experiments. Although we utilized three cameras to maximize coverage (Fig. 13), certain angles resulted in objects being visible to only one camera. These limited viewpoints made some initializations more challenging for the robot to complete the tasks successfully. A...
-
[65]
Simulated Environments: LIBERO [29] is a simulated robotic benchmark designed to evaluate lifelong learning algorithms. It involves controlling a 7-DoF Franka Panda
Fig. 13 (panels: In-Distribution, Standard OOD, Lighting 1, Lighting 2, Background 1, Background 2, Clutter 1, Clutter 2). Visualization of the robot’s perspective in in-distribution, standard out-of-distribut...
-
[66]
Book-Caddy: the robot needs to pick up the book from the table top and place it in the back of a caddy
-
[67]
Soup-Cheese: the robot needs to place the alphabet soup and the cheese in the basket in sequence
-
[68]
Bowl-Drawer: the robot needs to pick up the bowl, place it in the bottom drawer, and close the drawer
-
[69]
Moka-Moka: the robot needs to pick up the two Moka cups from the table and place them on the electric stove
-
[70]
Mug-Mug: the robot needs to place the left mug in the left plate and place the right mug in the right plate.
TABLE VII: ABLATION OF DESIGN CHOICES
Model                 | Book-Caddy  | Soup-Cheese
UWM w/ 8 registers    | 0.88 ± 0.04 | 0.90 ± 0.02
UWM w/ 4 registers    | 0.83 ± 0.05 | 0.86 ± 0.03
UWM w/o registers     | 0.81 ± 0.07 | 0.85 ± 0.03
Cross attention UWM   | 0.78 ± 0.05 | 0.86 ± 0.04
TABLE VIII ABLAT...
-
[71]
Ablations of Design Choices: To understand the effect of UWM’s design choices, we conduct ablation studies on two simulated tasks from the LIBERO environment. Specifically, we want to (1) understand the effect of registers on task performance, and (2) compare the use of AdaLN for observation conditioning with cross attention [17]. For each model, we tra...
-
[72]
Ablation of Learning Objectives: To evaluate whether the performance gain of UWM is a result of dynamics prediction or pure reconstruction, we pretrain a UWM to reconstruct the current observations instead of the future observations. This incentivizes the model to learn about image features, but not about temporal dynamics. Table VIII shows that while ...
-
[73]
Learning from Internet videos: We evaluate whether UWM can leverage knowledge from Internet videos by including a mixture of Kinetics-400 [8] and Something-Something-
Fig. 14 (panel title: Internet Video Dataset, Kinetics-400 and Something-Something v2). Visualization of Internet video dataset. We curate the dataset by combining human activity videos from Kinetics-400...