arXiv: 2503.00200 · v3 · submitted 2025-02-28 · cs.RO · cs.CV

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song

Pith reviewed 2026-05-13 17:44 UTC · model grok-4.3

keywords: action · video · inference · learning · model · prediction · unified · dynamics

The pith

UVA learns a joint video-action latent representation and decodes it with two decoupled diffusion heads, so a single model can handle fast, accurate policy learning, forward and inverse dynamics, and video generation, reportedly without losing performance relative to task-specific methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robots need to watch video of their surroundings and decide what physical actions to take next. Most current systems handle seeing and acting with separate AI models. UVA instead trains one model on both at once by creating a shared hidden code that captures what the scene looks like and what the robot should do. Two small diffusion modules sit on top of this shared code: one turns the code into future video frames, the other turns it into action commands. Because the action head is separate and lightweight, the robot can output motor commands in real time without waiting for a full video to be generated. Training uses random masking so the same model can fill in missing actions, missing video, or both. This lets the single network switch between learning a policy, predicting what will happen next, figuring out which action caused an observed change, and generating video. The abstract reports that this unified setup matches the accuracy of specialized models while producing actions faster than methods that must generate video first.
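
The task switching described above can be pictured as a small lookup from task to which modalities are supplied and which are filled in. What follows is a minimal sketch under stated assumptions: the task names, the `MaskSpec` fields, and the exact input/output split per task are illustrative guesses based on this summary, not the authors' implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MaskSpec:
    """Which modalities are given as input and which are predicted (hypothetical)."""
    actions_given: bool       # action sequence supplied as conditioning
    future_video_given: bool  # future frames supplied as conditioning
    predict_actions: bool     # the action head is run
    predict_video: bool       # the video head is run


# Assumed task table; past video frames are taken to be always visible.
TASKS = {
    "policy":           MaskSpec(False, False, True,  False),
    "video_generation": MaskSpec(False, False, False, True),
    "forward_dynamics": MaskSpec(True,  False, False, True),
    "inverse_dynamics": MaskSpec(False, True,  True,  False),
}


def heads_to_run(task: str) -> list[str]:
    """Return the decoding heads a task needs, so unused heads can be skipped."""
    spec = TASKS[task]
    heads = []
    if spec.predict_actions:
        heads.append("action")
    if spec.predict_video:
        heads.append("video")
    return heads
```

Under these assumptions, `heads_to_run("policy")` names only the action head, which is the case where video generation would be skipped at inference.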

Core claim

UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications.

Load-bearing premise

That the joint video-action latent representation captures the necessary relationship between visual sequences and action sequences with negligible information loss or task interference, allowing decoupled decoding to retain full accuracy.

Original abstract

A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video generation and action prediction remains challenging, and current video generation-based methods struggle to match the performance of direct policy learning in action accuracy and inference speed. To bridge this gap, we introduce the Unified Video Action model (UVA), which jointly optimizes video and action predictions to achieve both high accuracy and efficient action inference. The key lies in learning a joint video-action latent representation and decoupling video-action decoding. The joint latent representation bridges the visual and action domains, effectively modeling the relationship between video and action sequences. Meanwhile, the decoupled decoding, powered by two lightweight diffusion heads, enables high-speed action inference by bypassing video generation during inference. Such a unified framework further enables versatile functionality through masked input training. By selectively masking actions or videos, a single model can tackle diverse tasks beyond policy learning, such as forward and inverse dynamics modeling and video generation. Via an extensive set of experiments, we demonstrate that UVA can serve as a general-purpose solution for a wide range of robotics tasks, such as policy learning, forward/inverse dynamics and video observation prediction, without compromising performance compared to methods tailored for specific applications. Results are best viewed on https://unified-video-action-model.github.io/.
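
The decoupled decoding described in the abstract can be illustrated with a toy model. This is only a sketch: the class name `ToyUnifiedModel`, all dimensions, and the linear maps standing in for the two diffusion heads are hypothetical, and no diffusion process is actually run. It shows the control flow only, in which action inference reads the shared latent and never calls the video head.

```python
import math
import random


def rand_matrix(rng, rows, cols):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(vec, mat):
    """Row vector times matrix (matrix stored as a list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


class ToyUnifiedModel:
    """Hypothetical toy: one shared latent, two independent decoding heads."""

    def __init__(self, obs_dim=8, latent_dim=4, action_dim=2, frame_dim=8, seed=0):
        rng = random.Random(seed)
        self.w_enc = rand_matrix(rng, obs_dim, latent_dim)
        self.w_act = rand_matrix(rng, latent_dim, action_dim)
        self.w_vid = rand_matrix(rng, latent_dim, frame_dim)
        self.video_head_calls = 0

    def encode(self, obs):
        # Shared latent consumed by both heads.
        return [math.tanh(x) for x in matvec(obs, self.w_enc)]

    def decode_action(self, z):
        # Linear placeholder for the lightweight action diffusion head.
        return matvec(z, self.w_act)

    def decode_video(self, z):
        # Linear placeholder for the video diffusion head; calls are counted.
        self.video_head_calls += 1
        return matvec(z, self.w_vid)

    def act(self, obs):
        # Policy inference: encode once and run only the action head.
        return self.decode_action(self.encode(obs))


model = ToyUnifiedModel()
action = model.act([1.0] * 8)
```

In this toy, `act` never touches `decode_video`, so the call counter stays at zero after policy inference; the video head runs only when a frame is explicitly requested.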

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UVA shares one latent between video and action and decodes it with separate diffusion heads for multi-task robotics, but the no-compromise claim needs better backing.

The main thing here is a model that learns one latent space for both video frames and robot actions, then uses two separate lightweight diffusion heads to decode them. Selective masking during training turns the same weights into a general tool for policy learning, forward and inverse dynamics, and video prediction.

This setup does a few things right. The decoupled heads let action inference run quickly without having to generate full videos at test time, which fixes the speed problem that video-based methods usually have. The masking approach is simple and effective for getting multiple capabilities out of one training run. If the results hold, it could cut down on the number of separate models a robotics system needs to keep around.

The weak part is the performance claim. The paper says extensive experiments show no compromise compared to task-specific methods, but without numbers or a clear action-only baseline in the abstract, it's difficult to verify that the joint latent doesn't create interference. The stress-test point about unmeasured task interference is worth checking in the full paper; if they didn't run an ablation keeping the action head and backbone the same but dropping the video loss, that would be a gap.

This paper is for robotics researchers who want unified models that support several prediction tasks from shared weights. Someone looking for ways to make diffusion policies faster or more versatile would get something out of the architecture. I would send it to peer review. The core design is sensible and the experiments, if they check out, would make it worth citing in work on multi-task robot learning.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Unified Video Action (UVA) model, which learns a joint video-action latent representation and decouples decoding via two lightweight diffusion heads. Masked-input training enables a single model to perform policy learning, forward/inverse dynamics, and video prediction, with the central claim that this unified approach matches the accuracy and speed of task-specific methods without compromise.

Significance. If the joint latent representation and decoupled heads truly incur negligible task interference, UVA could serve as a versatile foundation for robotics, reducing the proliferation of separate models while preserving inference efficiency for action prediction.

major comments (2)

[Abstract] The claim that 'extensive experiments' show UVA matches task-specific performance 'without compromising' accuracy is unsupported by any quantitative metrics, baselines, or ablations. No comparison to an action-only baseline (identical backbone and action head, video loss removed) is provided, leaving the no-interference assumption unverified.
[Abstract] The joint video-action latent representation is asserted to 'bridge the visual and action domains', which presumes negligible information loss, yet no analysis of loss balancing, latent dimensionality effects, or cross-task feature interference is given; this is load-bearing for the multi-task claim.

minor comments (1)

[Abstract] The abstract states results are 'best viewed on https://unified-video-action-model.github.io/' but does not summarize key quantitative findings (e.g., success rates, MSE, inference FPS) inline; this reduces standalone readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional quantitative details and analyses as outlined.

Referee: [Abstract] The claim that 'extensive experiments' show UVA matches task-specific performance 'without compromising' accuracy is unsupported by any quantitative metrics, baselines, or ablations. No comparison to an action-only baseline (identical backbone and action head, video loss removed) is provided, leaving the no-interference assumption unverified.

Authors: We agree that the abstract would be strengthened by including key quantitative metrics. The manuscript body (Sections 4 and 5) reports extensive comparisons showing UVA achieves performance parity with task-specific methods on policy success rates, dynamics prediction error, and video generation quality, while maintaining fast inference. To directly address the no-interference claim, we will add an explicit action-only baseline ablation (identical backbone and action head with video loss removed) to the experiments section. We will also revise the abstract to reference these supporting metrics and the new baseline result. revision: yes

Referee: [Abstract] The joint video-action latent representation is asserted to 'bridge the visual and action domains', which presumes negligible information loss, yet no analysis of loss balancing, latent dimensionality effects, or cross-task feature interference is given; this is load-bearing for the multi-task claim.

Authors: The current manuscript supports the bridging claim primarily through end-task performance parity, but we acknowledge the value of explicit supporting analyses. We will add a dedicated subsection with ablations on loss weight balancing between video and action objectives, sweeps over latent dimensionality, and quantitative measures of cross-task interference (such as latent feature correlations and task-removal ablations). These additions will be included in the revised manuscript to more rigorously substantiate the joint representation. revision: yes
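
The ablations promised in both responses can be stated against a single weighted objective. The form below is an assumption made for illustration, since the abstract does not give the training loss; the weight lambda and the two loss terms are hypothetical symbols.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{action}}(\theta) \;+\; \lambda \, \mathcal{L}_{\text{video}}(\theta), \qquad \lambda \ge 0
```

Under this reading, the action-only baseline is the case lambda = 0 with the backbone and action head unchanged, and the loss-balancing ablation is a sweep over lambda > 0. Task interference would show up as action accuracy at lambda > 0 falling below the lambda = 0 baseline.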

Circularity Check

0 steps flagged

No circularity; claims rest on end-to-end training and held-out experiments

The paper presents UVA as a joint video-action model trained end-to-end on robotics data, with performance claims supported by references to extensive experiments on policy learning, dynamics, and video prediction tasks. No equations, derivations, or ansatzes are described that reduce any 'prediction' or result to quantities defined only by the model's own fitted parameters or self-citations. The joint latent representation and decoupled diffusion heads are architectural choices learned from data, not self-definitional constructs, and the 'without compromising performance' assertion is framed as an empirical outcome rather than a tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that a single learned latent can faithfully encode both visual dynamics and action effects, plus the empirical claim that separate lightweight diffusion heads incur no accuracy penalty.

free parameters (2)

latent dimension and diffusion head widths
Architectural sizes chosen during model design (hyperparameters rather than learned weights) to balance capacity and speed.
masking ratios and schedules
Probabilities for masking actions versus video frames are selected to enable multi-task training.
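
To make the second free parameter concrete, here is a minimal sketch of per-sample mask sampling during training. The function name, the independence of the two draws, and the probability values are all assumptions for illustration; the page does not state the ratios or schedule actually used.

```python
import random


def sample_mask(rng: random.Random, p_mask_actions: float, p_mask_video: float) -> dict:
    """Independently decide, per training sample, which modalities to hide.

    The two probabilities are hypothetical free parameters, not values
    taken from the paper.
    """
    return {
        "mask_actions": rng.random() < p_mask_actions,
        "mask_video": rng.random() < p_mask_video,
    }


rng = random.Random(0)
masks = [sample_mask(rng, p_mask_actions=0.5, p_mask_video=0.5) for _ in range(1000)]
frac_actions_masked = sum(m["mask_actions"] for m in masks) / len(masks)
```

Setting a probability to 0.0 or 1.0 recovers a fixed single-task regime for that modality, which is one way the task-removal ablations mentioned in the rebuttal could be expressed.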

axioms (1)

domain assumption: Diffusion models can jointly model the conditional distributions of video frames given actions and actions given video frames when conditioned on a shared latent.
Invoked by the choice of two diffusion heads operating on the same latent code.

invented entities (1)

joint video-action latent representation (no independent evidence)
purpose: To serve as a common bridge between visual observations and action sequences.
New postulated shared embedding space whose independent existence outside the trained network is not demonstrated.

pith-pipeline@v0.9.0 · 5540 in / 1428 out tokens · 47256 ms · 2026-05-13T17:44:28.917154+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: joint video-action latent representation and decoupling video-action decoding... two lightweight diffusion heads

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: masked input training... versatile functionality

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO · 2026-05 · unverdicted · novelty 7.0
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
cs.CV · 2026-05 · unverdicted · novelty 7.0
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO · 2026-04 · unverdicted · novelty 7.0
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
cs.RO · 2026-04 · unverdicted · novelty 7.0
VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV · 2026-04 · unverdicted · novelty 7.0
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
cs.CV · 2026-05 · unverdicted · novelty 6.0
OmniHumanoid factorizes transferable motion learning from embodiment-specific adaptation to enable scalable cross-embodiment video generation without paired data for new humanoids.

When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO · 2026-05 · unverdicted · novelty 6.0
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

When to Trust Imagination: Adaptive Action Execution for World Action Models
cs.RO · 2026-05 · unverdicted · novelty 6.0
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
cs.RO · 2026-05 · unverdicted · novelty 6.0
A dual-contrastive disentanglement method factorizes videos into independent task and embodiment latents, then uses a parameter-efficient adapter on a frozen video diffusion model to synthesize robot executions from s...

UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO · 2026-04 · unverdicted · novelty 6.0
UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.

Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
cs.RO · 2026-04 · unverdicted · novelty 6.0
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
cs.CV · 2026-04 · unverdicted · novelty 6.0
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
cs.RO · 2026-04 · unverdicted · novelty 6.0
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
cs.RO · 2026-04 · conditional · novelty 6.0
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
cs.RO · 2026-03 · unverdicted · novelty 6.0
DIAL decouples intent from action in end-to-end VLAs using a latent visual foresight bottleneck and two-stage training, reaching SOTA on RoboCasa with 10x fewer demonstrations and zero-shot real-world transfer.

Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
eess.IV · 2026-03 · unverdicted · novelty 6.0
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV · 2026-03 · unverdicted · novelty 6.0
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

Simulation Distillation: Pretraining World Models in Simulation for Rapid Real-World Adaptation
cs.RO · 2026-03 · unverdicted · novelty 6.0
SimDist pretrains world models in simulation and adapts them to real-world robots by updating only the latent dynamics model, enabling rapid improvement on contact-rich tasks where prior methods fail.

World Action Models are Zero-shot Policies
cs.RO · 2026-02 · unverdicted · novelty 6.0
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
cs.AI · 2026-01 · conditional · novelty 6.0
Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
cs.CV · 2025-06 · unverdicted · novelty 6.0
Self Forcing trains autoregressive video diffusion models by performing autoregressive rollout with KV caching during training to close the exposure bias gap, using a holistic video-level loss and few-step diffusion f...

Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models
cs.RO · 2026-05 · unverdicted · novelty 5.0
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
cs.RO · 2026-05 · unverdicted · novelty 5.0
CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
cs.RO · 2026-04 · unverdicted · novelty 5.0
StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
cs.RO · 2026-04 · accept · novelty 5.0
A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

Motus: A Unified Latent Action World Model
cs.CV · 2025-12 · unverdicted · novelty 5.0
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.

World Action Models: The Next Frontier in Embodied AI
cs.RO · 2026-05 · unverdicted · novelty 4.0
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 26 Pith papers · 14 internal anchors

[1]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy Alexey. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[2]

π0: A vision- language-action flow model for general robot control,

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision- language-action flow model for general robot control,

work page
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

URL https://arxiv. org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable Video Diffusion: Scaling Latent Video Dif- fusion Models to Large Datasets. arXiv preprint arXiv:2311.15127, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Video Generation Models as World Simulators, 2024

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video Generation Models as World Simulators, 2024

work page 2024
[6]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning , 2024

work page 2024
[7]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam

Carlos Campos, Richard Elvira, Juan J G ´omez Rodr´ıguez, Jos ´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics, 37(6):1874–1890, 2021

work page 2021
[8]

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Joao Carreira and Andrew Zisserman. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

work page 2017
[9]

MaskGIT: Masked Generative Image Transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked Generative Image Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

work page 2022
[10]

Gamegen-x: Interactive open-world game video generation.arXiv preprint arXiv:2411.00769, 2024

Haoxuan Che, Xuanhua He, Quande Liu, Cheng Jin, and Hao Chen. Gamegen-x: Interactive open-world game video generation. arXiv preprint arXiv:2411.00769 , 2024

work page arXiv 2024
[11]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023

work page 2023
[12]

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Cheng Chi, Zhenjia Xu, Chuer Pan, Eric Cousineau, Benjamin Burchfiel, Siyuan Feng, Russ Tedrake, and Shuran Song. Universal Manipulation Interface: In-the- Wild Robot Teaching Without In-the-Wild Robots. arXiv preprint arXiv:2402.10329, 2024

work page internal anchor Pith review arXiv 2024
[13]

Clark, S

Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, and Suneel Belkhale. Action-free reasoning for policy gen- eralization. arXiv preprint arXiv:2502.03729 , 2025

work page arXiv 2025
[14]

Autoregressive Video Gen- eration Without Vector Quantization

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive Video Gen- eration Without Vector Quantization. arXiv preprint arXiv:2412.14169, 2024

work page arXiv 2024
[15]

Learning universal policies via text-guided video generation

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[16]

Tam- ing Transformers for High-Resolution Image Synthe- sis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Tam- ing Transformers for High-Resolution Image Synthe- sis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873– 12883, 2021

work page 2021
[17]

Implicit Behavioral Cloning

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. Implicit Behavioral Cloning. Conference on Robot Learning (CoRL), November 2021

work page 2021
[18]

ViD-GPT: Introducing GPT-Style Autoregressive Generation in Video Diffusion Models

Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, and Jun Xiao. ViD-GPT: Introducing GPT-Style Autoregressive Generation in Video Diffusion Models. arXiv preprint arXiv:2406.10981 , 2024

work page arXiv 2024
[19]

Emu video: Factorizing text-to-video gen- eration by explicit image conditioning,

Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning. arXiv preprint arXiv:2311.10709 , 2023

work page arXiv 2023
[20]

Prediction with action: Visual policy learning via joint denoising process

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process. In The Thirty-eighth Annual Confer- ence on Neural Information Processing Systems , 2024

work page 2024
[21]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ´ar, and Ross Girshick. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

work page 2022
[22]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems , 33:6840–6851, 2020

work page 2020
[23]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video Diffusion Models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022

work page 2022
[25]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Repre- sentations. arXiv preprint arXiv:2412.14803 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Open- VLA: An Open-Source Vision-Language-Action Model. arXiv preprint arXiv:2406.09246 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Tenenbaum

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023

work page arXiv 2023
[28]

Autoregressive Image Generation Without Vector Quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive Image Generation Without Vector Quantization. arXiv preprint arXiv:2406.11838 , 2024

work page arXiv 2024
[29]

Dreamitate: Real-World Visuomotor Pol- icy Learning via Video Generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sud- hakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-World Visuomotor Pol- icy Learning via Video Generation. CoRL, 2024

work page 2024
[30]

Data scaling laws in im- itation learning for robotic manipulation.arXiv preprint arXiv:2410.18647, 2024

Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, and Yang Gao. Data scaling laws in im- itation learning for robotic manipulation. arXiv preprint arXiv:2410.18647, 2024

work page arXiv 2024
[31]

Libero: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking Knowledge Transfer for Lifelong Robot Learning. Ad- vances in Neural Information Processing Systems , 36, 2024

work page 2024
[32]

Masked Autoencoding for Scalable and General- izable Decision Making

Fangchen Liu, Hao Liu, Aditya Grover, and Pieter Abbeel. Masked Autoencoding for Scalable and General- izable Decision Making. Advances in Neural Information Processing Systems, 35:12608–12618, 2022

[33]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation. arXiv preprint arXiv:2108.03298, 2021

[34]

Learning Transferable Visual Models from Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning, pages 8748–8763, 2021

[35]

Robot Learning with Sensorimotor Pre-Training

Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, and Jitendra Malik. Robot Learning with Sensorimotor Pre-Training. In Conference on Robot Learning, pages 683–693, 2023

[36]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, 2021

[37]

Abhishek Sharma, Adams Yu, Ali Razavi, Andeep Toor, Andrew Pierson, Ankush Gupta, Austin Waters, Aäron van den Oord, Daniel Tanis, Dumitru Erhan, Eric Lau, Eleni Shaw, Gabe Barth-Maron, Greg Shaw, Han Zhang, Henna Nandwani, Hernan Moraldo, Hyunjik Kim, Irina Blok, Jakob Bauer, Jeff Donahue, Junyoung Chung, Kory Mathewson, Kurtis David, Lasse Espeholt, M...

[38]

Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015

[39]

Improved techniques for training consistency models

Yang Song and Prafulla Dhariwal. Improved Techniques for Training Consistency Models. arXiv preprint arXiv:2310.14189, 2023

[40]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling Through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456, 2020

[41]

Consistency Models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency Models. arXiv preprint arXiv:2303.01469, 2023

[42]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

[43]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards Accurate Generative Models of Video: A New Metric & Challenges. arXiv preprint arXiv:1812.01717, 2018

[44]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

[45]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017

[46]

Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions. In International Conference on Learning Representations, 2022

[47]

Scaling autoregressive video models

Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling Autoregressive Video Models. arXiv preprint arXiv:1906.02634, 2019

[48]

ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, et al. ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7395–7405, 2024

[49]

Masked Trajectory Models for Prediction, Representation, and Control

Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, and Aravind Rajeswaran. Masked Trajectory Models for Prediction, Representation, and Control. In International Conference on Machine Learning, pages 37607–37623, 2023

[50]

Flow as the cross-domain manipulation interface

Mengda Xu, Zhenjia Xu, Yinghao Xu, Cheng Chi, Gordon Wetzstein, Manuela Veloso, and Shuran Song. Flow as the cross-domain manipulation interface. CoRL, 2024

[51]

VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video Generation Using VQ-VAE and Transformers. arXiv preprint arXiv:2104.10157, 2021

[52]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

X. SUPPLEMENTARY MATERIALS

In this section, we first introduce the autoregressive video generation process in §X-A and then show more details of the simulation benchmarks (§X-B) and real-w...

[53]

application-dependent

Application-independent: randomly mask the inputs, regardless of task semantics. The results are reported in Table IX. Policy learning and video generation are evaluated by success rate and FVD, respectively. Forward dynamics is evaluated by FVD on videos generated conditioned on actions. Inverse dynamics is evaluated by L2 error. Overall, in the “application-depend...
