pith. machine review for the scientific record.

arxiv: 2503.00200 · v3 · submitted 2025-02-28 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links · Lean Theorem

Unified Video Action Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 17:44 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords action · video · inference · learning · model · prediction · unified · dynamics
0 comments

The pith

UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate, fast policy learning, forward/inverse dynamics modeling, and video generation with no performance loss relative to task-specific methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robots need to watch video of their surroundings and decide what physical actions to take next. Most current systems handle seeing and acting with separate AI models. UVA instead trains one model on both at once by creating a shared hidden code that captures what the scene looks like and what the robot should do. Two small diffusion modules sit on top of this shared code: one turns the code into future video frames, the other turns it into action commands. Because the action head is separate and lightweight, the robot can output motor commands in real time without waiting for a full video to be generated. Training uses random masking so the same model can fill in missing actions, missing video, or both. This lets the single network switch between learning a policy, predicting what will happen next, figuring out which action caused an observed change, and generating video. The abstract reports that this unified setup matches the accuracy of specialized models while being faster for action output.
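
To make the decoupling concrete, below is a minimal PyTorch-style sketch of the architecture as described above. The module names, dimensions, and toy shapes (JointLatentEncoder, LightweightDiffusionHead, a 256-dimensional latent) are illustrative assumptions, not the authors' implementation: one encoder fuses video and (possibly masked) action inputs into a joint latent, and two small denoising heads decode it separately, so action inference never has to run the video head.

```python
# Hypothetical sketch of UVA-style decoupled decoding (assumed names and sizes,
# not the paper's code): a shared encoder produces a joint video-action latent;
# two lightweight heads decode it independently.
import torch
import torch.nn as nn

class JointLatentEncoder(nn.Module):
    """Fuses observed video features and (possibly masked) action tokens into one latent."""
    def __init__(self, video_dim=512, action_dim=7, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=latent_dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, video_feats, action_tokens):
        tokens = torch.cat(
            [self.video_proj(video_feats), self.action_proj(action_tokens)], dim=1
        )
        return self.fuse(tokens)  # joint video-action latent sequence

class LightweightDiffusionHead(nn.Module):
    """Stand-in for a small denoising head conditioned on the joint latent."""
    def __init__(self, latent_dim=256, out_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + out_dim + 1, 256), nn.GELU(), nn.Linear(256, out_dim)
        )

    def forward(self, latent, noisy_target, t):
        cond = latent.mean(dim=1)                    # pool latent tokens into one conditioning vector
        t = t.expand(noisy_target.shape[0], 1)       # broadcast the diffusion timestep
        return self.net(torch.cat([cond, noisy_target, t], dim=-1))  # predicted noise

# At inference, only the action head is evaluated, so motor commands do not wait
# on video generation; the video head runs only when frames are actually requested.
encoder = JointLatentEncoder()
action_head = LightweightDiffusionHead(out_dim=7)    # decodes action chunks
video_head = LightweightDiffusionHead(out_dim=512)   # decodes future frame features

video_feats = torch.randn(2, 8, 512)    # a batch of 8 observed frame features
masked_actions = torch.zeros(2, 4, 7)   # actions masked out: policy-learning mode
latent = encoder(video_feats, masked_actions)
noise_pred = action_head(latent, torch.randn(2, 7), torch.tensor([[0.5]]))
print(noise_pred.shape)  # torch.Size([2, 7])
```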

Core claim

UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications.

Load-bearing premise

That the joint video-action latent representation captures the necessary relationship between visual sequences and action sequences with negligible information loss or task interference, allowing decoupled decoding to retain full accuracy.

read the original abstract

A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.
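
As a rough illustration of the masked-input training idea in the abstract, the sketch below shows how selectively masking action or video inputs could select among the four tasks. The task names, mask layout, and frame/action counts are expository assumptions rather than the paper's code; during training the masks would be drawn randomly so one network learns all modes.

```python
# Minimal illustration (assumed details, not the authors' implementation) of how
# selectively masking actions or video frames lets one network cover the tasks the
# abstract lists. A value of 1 keeps an input; 0 replaces it with a learned placeholder.
import torch

def build_masks(task: str, num_frames: int, num_actions: int):
    keep_video = torch.ones(num_frames)
    keep_actions = torch.ones(num_actions)
    if task == "policy":               # past frames visible, all actions masked
        keep_video[num_frames // 2:] = 0
        keep_actions.zero_()
    elif task == "forward_dynamics":   # past frames + actions visible, future frames masked
        keep_video[num_frames // 2:] = 0
    elif task == "inverse_dynamics":   # full video visible, actions masked and recovered
        keep_actions.zero_()
    elif task == "video_generation":   # first frame visible, everything else masked
        keep_video[1:] = 0
        keep_actions.zero_()
    return keep_video, keep_actions

for task in ["policy", "forward_dynamics", "inverse_dynamics", "video_generation"]:
    v, a = build_masks(task, num_frames=8, num_actions=4)
    print(task, v.tolist(), a.tolist())
```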

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Unified Video Action (UVA) model, which learns a joint video-action latent representation and decouples decoding via two lightweight diffusion heads. Masked-input training enables a single model to perform policy learning, forward/inverse dynamics, and video prediction, with the central claim that this unified approach matches the accuracy and speed of task-specific methods without compromise.

Significance. If the joint latent representation and decoupled heads truly incur negligible task interference, UVA could serve as a versatile foundation for robotics, reducing the proliferation of separate models while preserving inference efficiency for action prediction.

major comments (2)
  1. [Abstract] The claim that 'extensive experiments' show UVA matches task-specific performance 'without compromising' accuracy is unsupported by any quantitative metrics, baselines, or ablations. No comparison to an action-only baseline (identical backbone and action head, video loss removed) is provided, leaving the no-interference assumption unverified.
  2. [Abstract] The joint video-action latent representation is asserted to 'bridge the visual and action domains' with 'negligible information loss,' yet no analysis of loss balancing, latent dimensionality effects, or cross-task feature interference is given; this is load-bearing for the multi-task claim.
minor comments (1)
  1. [Abstract] The abstract states results are 'best viewed on https://unified-video-action-model.github.io/' but does not summarize key quantitative findings (e.g., success rates, MSE, inference FPS) inline; this reduces standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional quantitative details and analyses as outlined.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'extensive experiments' show UVA matches task-specific performance 'without compromising' accuracy is unsupported by any quantitative metrics, baselines, or ablations. No comparison to an action-only baseline (identical backbone and action head, video loss removed) is provided, leaving the no-interference assumption unverified.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics. The manuscript body (Sections 4 and 5) reports extensive comparisons showing UVA achieves performance parity with task-specific methods on policy success rates, dynamics prediction error, and video generation quality, while maintaining fast inference. To directly address the no-interference claim, we will add an explicit action-only baseline ablation (identical backbone and action head with video loss removed) to the experiments section. We will also revise the abstract to reference these supporting metrics and the new baseline result. revision: yes

  2. Referee: [Abstract] The joint video-action latent representation is asserted to 'bridge the visual and action domains' with 'negligible information loss,' yet no analysis of loss balancing, latent dimensionality effects, or cross-task feature interference is given; this is load-bearing for the multi-task claim.

    Authors: The current manuscript supports the bridging claim primarily through end-task performance parity, but we acknowledge the value of explicit supporting analyses. We will add a dedicated subsection with ablations on loss weight balancing between video and action objectives, sweeps over latent dimensionality, and quantitative measures of cross-task interference (such as latent feature correlations and task-removal ablations). These additions will be included in the revised manuscript to more rigorously substantiate the joint representation. revision: yes
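
A hedged sketch of how the ablations promised in the two responses might be configured: the action-only baseline is the same backbone and action head trained with the video loss weight zeroed, and the interference analysis becomes a sweep over loss weights and latent sizes. The field names and values here (UVATrainConfig, the weights, the masking probabilities) are illustrative assumptions, not settings taken from the paper.

```python
# Illustrative ablation configuration (assumed fields and values, not the paper's).
from dataclasses import dataclass, replace
from itertools import product

@dataclass(frozen=True)
class UVATrainConfig:
    latent_dim: int = 256
    video_loss_weight: float = 1.0
    action_loss_weight: float = 1.0
    action_mask_prob: float = 0.5   # chance of masking actions in a training sample
    video_mask_prob: float = 0.5    # chance of masking future frames

full_model = UVATrainConfig()

# Action-only baseline: identical backbone and action head, video loss removed.
action_only_baseline = replace(full_model, video_loss_weight=0.0)

# Loss-balance and latent-dimensionality sweep for the promised interference analysis.
sweep = [
    replace(full_model, video_loss_weight=w, latent_dim=d)
    for w, d in product([0.1, 0.5, 1.0, 2.0], [128, 256, 512])
]
print(action_only_baseline)
print(len(sweep), "sweep configurations")
```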

Circularity Check

0 steps flagged

No circularity; claims rest on end-to-end training and held-out experiments

full rationale

The paper presents UVA as a joint video-action model trained end-to-end on robotics data, with performance claims supported by references to extensive experiments on policy learning, dynamics, and video prediction tasks. No equations, derivations, or ansatzes are described that reduce any 'prediction' or result to quantities defined only by the model's own fitted parameters or self-citations. The joint latent representation and decoupled diffusion heads are architectural choices learned from data, not self-definitional constructs, and the 'without compromising performance' assertion is framed as an empirical outcome rather than a tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a single learned latent can faithfully encode both visual dynamics and action effects, plus the empirical claim that separate lightweight diffusion heads incur no accuracy penalty.

free parameters (2)
  • latent dimension and diffusion head widths
    Architectural sizes chosen during model design and fitted during training to balance capacity and speed.
  • masking ratios and schedules
    Probabilities for masking actions versus video frames are selected to enable multi-task training.
axioms (1)
  • domain assumption: Diffusion models can jointly model the conditional distributions of video frames given actions and actions given video frames when conditioned on a shared latent.
    Invoked by the choice of two diffusion heads operating on the same latent code.
invented entities (1)
  • joint video-action latent representation (no independent evidence)
    purpose: To serve as a common bridge between visual observations and action sequences.
    New postulated shared embedding space whose independent existence outside the trained network is not demonstrated.

pith-pipeline@v0.9.0 · 5540 in / 1428 out tokens · 47256 ms · 2026-05-13T17:44:28.917154+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  2. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  5. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  6. OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

    cs.CV 2026-05 unverdicted novelty 6.0

    OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

  7. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  8. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  9. Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

    cs.RO 2026-05 unverdicted novelty 6.0

    A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

  10. UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling

    cs.RO 2026-04 unverdicted novelty 6.0

    UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

  11. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  12. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  13. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  14. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  15. DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

    cs.RO 2026-03 unverdicted novelty 6.0

    DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.

  16. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  17. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  18. Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation

    cs.RO 2026-03 unverdicted novelty 6.0

    SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.

  19. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  20. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    cs.AI 2026-01 conditional novelty 6.0

    Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

  21. Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    cs.CV 2025-06 unverdicted novelty 6.0

    Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

  22. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  23. CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

  24. StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement

    cs.RO 2026-04 unverdicted novelty 5.0

    StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

  25. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  26. Motus: A Unified Latent Action World Model

    cs.CV 2025-12 unverdicted novelty 5.0

    Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

  27. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 26 Pith papers · 14 internal anchors

  1. [1]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020

  2. [2]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control

  3. [3]
  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127, 2023

  5. [5]

    Video Generation Models as World Simulators, 2024

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video Generation Models as World Simulators, 2024

  6. [6]

    Genie: Generative interactive environments

    Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning , 2024

  7. [7]

    Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam

    Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021

  8. [8]

    Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

    Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  9. [9]

    MaskGIT: Masked Generative Image Transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked Generative Image Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

  10. [10]

    Gamegen-x: Interactive open-world game video generation

    Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769 , 2024

  11. [11]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

  12. [12]

    Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

    Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots. arXiv preprint arXiv:2402.10329, 2024

  13. [13]

    Action-free reasoning for policy generalization

    Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy generalization. arXiv preprint arXiv:2502.03729, 2025

  14. [14]

    Autoregressive Video Generation Without Vector Quantization

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive Video Generation Without Vector Quantization. arXiv preprint arXiv:2412.14169, 2024

  15. [15]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

  16. [16]

    Taming Transformers for High-Resolution Image Synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021

  17. [17]

    Implicit Behavioral Cloning

    Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit Behavioral Cloning. Conference on Robot Learning (CoRL), November 2021

  18. [18]

    ViD-GPT: Introducing GPT-Style Autoregressive Generation in Video Diffusion Models

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. ViD-GPT: Introducing GPT-Style Autoregressive Generation in Video Diffusion Models. arXiv preprint arXiv:2406.10981 , 2024

  19. [19]

    Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

    Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv preprint arXiv:2311.10709 , 2023

  20. [20]

    Prediction with action: Visual policy learning via joint denoising process

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  21. [21]

    Masked Autoencoders Are Scalable Vision Learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

  22. [22]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems , 33:6840–6851, 2020

  23. [23]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv preprint arXiv:2210.02303, 2022

  24. [24]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video Diffusion Models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

  25. [25]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. arXiv preprint arXiv:2412.14803, 2024

  26. [26]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246, 2024

  27. [27]

    Learning to act from actionless videos through dense correspondences

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023

  28. [28]

    Autoregressive Image Generation Without Vector Quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive Image Generation Without Vector Quantization. arXiv preprint arXiv:2406.11838 , 2024

  29. [29]

    Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-World Visuomotor Policy Learning via Video Generation. CoRL, 2024

  30. [30]

    Data scaling laws in imitation learning for robotic manipulation

    Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in imitation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024

  31. [31]

    Libero: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking Knowledge Transfer for Lifelong Robot Learning. Advances in Neural Information Processing Systems, 36, 2024

  32. [32]

    Masked Autoencoding for Scalable and Generalizable Decision Making

    Fangchen Liu, Hao Liu, Aditya Grover, and Pieter Abbeel. Masked Autoencoding for Scalable and Generalizable Decision Making. Advances in Neural Information Processing Systems, 35:12608–12618, 2022

  33. [33]

    What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. arXiv preprint arXiv:2108.03298, 2021

  34. [34]

    Learning Transferable Visual Models from Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763, 2021

  35. [35]

    Robot Learning with Sensorimotor Pre-Training

    Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot Learning with Sensorimotor Pre-Training. In Conference on Robot Learning, pages 683–693, 2023

  36. [36]

    High-Resolution Image Synthesis with Latent Diffusion Models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, 2021

  37. [37]

    Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, Andrew Pierson, Ankush Gupta, Austin Waters, Aäron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt, M...

  38. [38]

    Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015

  39. [39]

    Improved techniques for training consistency models

    Yang Song and Prafulla Dhariwal. Improved Techniques for Training Consistency Models. arXiv preprint arXiv:2310.14189, 2023

  40. [40]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling Through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456, 2020

  41. [41]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models. arXiv preprint arXiv:2303.01469, 2023

  42. [42]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023

  43. [43]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards Accurate Generative Models of Video: A New Metric & Challenges. arXiv preprint arXiv:1812.01717, 2018

  44. [44]

    Diffusion models are real-time game engines

    Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837 , 2024

  45. [45]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017

  46. [46]

    Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

    Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. In International Conference on Learning Representations, 2022

  47. [47]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling Autoregressive Video Models. arXiv preprint arXiv:1906.02634, 2019

  48. [48]

    ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models

    Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024

  49. [49]

    Masked Trajectory Models for Prediction, Representation, and Control

    Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, and Aravind Rajeswaran. Masked Trajectory Models for Prediction, Representation, and Control. In International Conference on Machine Learning, pages 37607–37623, 2023

  50. [50]

    Flow as the cross-domain manipulation interface

    Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, and Shuran Song. Flow as the cross-domain manipulation interface. CoRL, 2024

  51. [51]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video Generation Using VQ-VAE and Transformers. arXiv preprint arXiv:2104.10157, 2021

  52. [52]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023
