pith. machine review for the scientific record.

arxiv: 2508.00795 · v1 · submitted 2025-08-01 · 💻 cs.RO

Video Generators are Robot Policies

Pith reviewed 2026-05-15 21:40 UTC · model grok-4.3

classification 💻 cs.RO
keywords video generation · robot policies · visuomotor control · behavior cloning · generalization · sample efficiency · dexterous manipulation

The pith

Video generation models can serve as robot policies by predicting future behavior frames and extracting actions from them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that training a model to generate videos of successful robot behavior provides a stronger foundation for learning control policies than direct imitation of actions. A modular system called Video Policy generates these videos and derives the corresponding robot actions in an end-to-end manner. This setup needs far less human demonstration data while delivering improved robustness when the robot encounters new objects, backgrounds, or tasks. The quality of the predicted video directly determines whether the extracted actions succeed, and extra video data without action labels further helps the system handle novel situations. In both simulation and real-world tests, the resulting policies outperform standard behavior cloning.

Core claim

By treating video generation as the core of policy learning, the framework predicts sequences of future video frames that depict effective robot behavior and then extracts the actions needed to produce those frames. The model trains end-to-end on limited demonstration data augmented by large-scale video data, including action-free clips, which allows it to generalize to unseen objects, backgrounds, and tasks. Task success tracks closely with video quality, and the approach achieves higher sample efficiency and robustness than conventional behavior cloning in both simulated and physical environments.

What carries the argument

Video Policy, a modular framework that generates videos of robot behavior and extracts actions from the predicted frames in an end-to-end trainable system.
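As a rough illustration of this generate-then-extract structure, the sketch below maps per-frame video latents to actions with a small MLP. The 3-layer MLP and the 7-DoF output follow the decoder described in the simulated rebuttal on this page, and the 25-frame horizon follows the appendix fragment on this page. The latent size, hidden width, random weights, and the names `ActionDecoder`, `fake_video_latents`, and `extract_actions` are assumptions made for illustration, not the paper's implementation.

```python
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu):
    """Apply one dense layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


class ActionDecoder:
    """Hypothetical 3-layer MLP: one video latent -> one 7-DoF action."""

    def __init__(self, latent_dim, hidden_dim=32, action_dim=7, seed=0):
        rng = random.Random(seed)
        self.layers = [
            make_layer(latent_dim, hidden_dim, rng),
            make_layer(hidden_dim, hidden_dim, rng),
            make_layer(hidden_dim, action_dim, rng),
        ]

    def __call__(self, latent):
        h = dense(latent, self.layers[0], relu=True)
        h = dense(h, self.layers[1], relu=True)
        return dense(h, self.layers[2], relu=False)


def fake_video_latents(num_frames, latent_dim, seed=1):
    """Stand-in for a video generator's predicted latents (random; an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(num_frames)]


def extract_actions(latents, decoder):
    """Decode one action per predicted frame latent."""
    return [decoder(z) for z in latents]


latents = fake_video_latents(num_frames=25, latent_dim=16)
actions = extract_actions(latents, ActionDecoder(latent_dim=16))
print(len(actions), len(actions[0]))  # prints: 25 7
```

In a trained system the latents would come from the video model and the decoder weights would be learned jointly with it; here both are placeholders so the shape of the data flow is visible.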

Load-bearing premise

The generated videos must imply actions that are both physically feasible for the robot and aligned with its actual dynamics.

What would settle it

Run the extracted actions on a physical robot in scenes with new objects or backgrounds and check whether success rates drop sharply when video prediction quality remains high.
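One way to make this test concrete is sketched below: keep only the episodes whose predicted video scored above a quality threshold and look at their success rate. The function name `success_rate_at_high_quality`, the quality scores, the threshold, and the episode data are all hypothetical and illustrative; nothing here is a measured result from the paper.

```python
def success_rate_at_high_quality(episodes, quality_threshold):
    """Success rate over episodes whose video quality meets the threshold.

    episodes: list of (video_quality, succeeded) pairs.
    Returns None when no episode meets the threshold.
    """
    high = [ok for quality, ok in episodes if quality >= quality_threshold]
    if not high:
        return None
    return sum(high) / len(high)


# Illustrative, made-up episodes: (video quality score, task success).
episodes = [(0.9, True), (0.85, True), (0.4, False), (0.95, False), (0.3, False)]
rate = success_rate_at_high_quality(episodes, quality_threshold=0.8)
print(rate)
```

A low rate in the high-quality bucket would count against the load-bearing premise, since it would mean good-looking videos still imply actions the robot cannot execute.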

Original abstract

Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is constrained by the size of human demonstration data. In this paper, we use video generation as a proxy for robot policy learning to address both limitations simultaneously. We propose Video Policy, a modular framework that combines video and action generation that can be trained end-to-end. Our results demonstrate that learning to generate videos of robot behavior allows for the extraction of policies with minimal demonstration data, significantly improving robustness and sample efficiency. Our method shows strong generalization to unseen objects, backgrounds, and tasks, both in simulation and the real world. We further highlight that task success is closely tied to the generated video, with action-free video data providing critical benefits for generalizing to novel tasks. By leveraging large-scale video generative models, we achieve superior performance compared to traditional behavior cloning, paving the way for more scalable and data-efficient robot policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Video Policy, a modular end-to-end framework that jointly trains video generation and action prediction on robot demonstration data. It claims that using video generation as a proxy objective enables extraction of visuomotor policies that achieve strong generalization to unseen objects, backgrounds, and tasks while requiring far less demonstration data than standard behavior cloning, with supporting results in both simulation and real-world dexterous manipulation.

Significance. If the empirical claims hold after verification of the action-extraction pipeline and ablations, the work would offer a concrete route to leverage large-scale video generative models for data-efficient robot policy learning, addressing both sample complexity and robustness to distribution shift in a single framework.

major comments (3)
  1. [§3] §3 (Method): The action extraction step from generated videos is load-bearing for the central claim yet lacks a precise description of the decoder, any dynamics regularization, or feasibility constraints; without this it is impossible to assess whether the reported gains arise from the video objective or from unstated post-processing that corrects for hallucinated motions.
  2. [§4] §4 (Experiments): The generalization results (unseen objects/backgrounds/tasks) are presented without ablations that isolate the contribution of the video-generation loss versus the action head or data-augmentation choices; the abstract's attribution of robustness to the video objective therefore cannot be verified from the reported numbers alone.
  3. [§4.3] §4.3 (Real-world results): Success rates are reported for held-out tasks, but no quantitative comparison of dynamics mismatch (e.g., actuator-limit violations or contact-physics errors) between video-generated trajectories and real robot executions is provided; this directly bears on the weakest assumption identified in the review.
minor comments (2)
  1. [§3.1] Notation for the combined video-action loss is introduced without an explicit equation; adding a numbered equation would clarify the weighting coefficient mentioned in the free-parameters list.
  2. [Figure 3] Figure 3 (qualitative rollouts) would benefit from side-by-side comparison with behavior-cloning baselines on the same held-out tasks to make the generalization advantage visually evident.
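The ablation requested in major comment 2 can be thought of as a 2x2 grid over two switches: whether the video-generation loss is used and whether data augmentation is used. The sketch below only enumerates such a grid. The config keys and the function name `ablation_grid` are assumptions for illustration, and the mapping of each cell to the paper's actual training setup is not specified on this page.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four on/off combinations of video loss and augmentation."""
    configs = []
    for video_loss, augmentation in product([True, False], repeat=2):
        name = (
            f"video_loss={'on' if video_loss else 'off'},"
            f"aug={'on' if augmentation else 'off'}"
        )
        configs.append(
            {"name": name, "video_loss": video_loss, "augmentation": augmentation}
        )
    return configs


for cfg in ablation_grid():
    print(cfg["name"])
```

Comparing success rates across the four cells on the same held-out objects, backgrounds, and tasks would show how much of the generalization gain each factor accounts for.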

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped us clarify key aspects of the method and strengthen the experimental validation. We address each major comment below and have revised the manuscript accordingly.

  1. Referee: [§3] §3 (Method): The action extraction step from generated videos is load-bearing for the central claim yet lacks a precise description of the decoder, any dynamics regularization, or feasibility constraints; without this it is impossible to assess whether the reported gains arise from the video objective or from unstated post-processing that corrects for hallucinated motions.

    Authors: We agree that precise details on action extraction are necessary for reproducibility and to substantiate the central claim. In the revised manuscript, we have expanded §3 with the exact decoder architecture (a lightweight 3-layer MLP operating on video latent features to regress 7-DoF actions), the joint training loss formulation that provides implicit dynamics regularization via video prediction consistency, and explicit confirmation that no post-processing, feasibility constraints, or hallucination-correction steps are applied; the policy outputs actions directly from the model. New ablations in the revision further show that performance gains persist even when the video head is frozen after pretraining, confirming the benefit stems from the video objective rather than any unstated corrections. revision: yes

  2. Referee: [§4] §4 (Experiments): The generalization results (unseen objects/backgrounds/tasks) are presented without ablations that isolate the contribution of the video-generation loss versus the action head or data-augmentation choices; the abstract's attribution of robustness to the video objective therefore cannot be verified from the reported numbers alone.

    Authors: We acknowledge that the original experiments did not fully isolate these factors. The revised §4 now includes a dedicated ablation study comparing (i) full Video Policy, (ii) an action-only baseline equivalent to standard behavior cloning, (iii) video loss removed but data augmentations retained, and (iv) video loss retained but augmentations removed. These results demonstrate that the video-generation objective is the dominant contributor to generalization on unseen objects, backgrounds, and tasks, while data augmentation provides only marginal additive benefit. The abstract has been updated to reflect this evidence. revision: yes

  3. Referee: [§4.3] §4.3 (Real-world results): Success rates are reported for held-out tasks, but no quantitative comparison of dynamics mismatch (e.g., actuator-limit violations or contact-physics errors) between video-generated trajectories and real robot executions is provided; this directly bears on the weakest assumption identified in the review.

    Authors: We agree that quantifying dynamics mismatch would strengthen the real-world claims. In the revision we have added quantitative metrics in §4.3, including average per-joint velocity deviation and estimated contact-force error (computed via forward simulation of the generated trajectories) between video-generated and real executions. These show low mismatch (under 8% deviation on average), supporting the assumption that the learned video dynamics transfer to the physical robot. Full per-timestep actuator-limit violation counts were not originally logged and would require re-running all real-world trials; we instead report the aggregate mismatch metrics and discuss this as a limitation. revision: partial
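The rebuttal does not define how the per-joint velocity deviation is computed. One plausible reading, sketched below, is the mean relative difference between the joint velocities implied by the generated trajectory and those measured on the robot, expressed as a percentage. The function name `mean_velocity_deviation_pct`, the normalization by the executed velocity, the `eps` guard, and the sample numbers are assumptions, not the authors' metric or data.

```python
def mean_velocity_deviation_pct(predicted, executed, eps=1e-8):
    """Mean relative per-joint velocity deviation, in percent.

    predicted, executed: lists of timesteps, each a list of joint velocities.
    Each deviation is |predicted - executed| / (|executed| + eps).
    """
    total, count = 0.0, 0
    for pred_step, exec_step in zip(predicted, executed):
        for p, e in zip(pred_step, exec_step):
            total += abs(p - e) / (abs(e) + eps)
            count += 1
    return 100.0 * total / count if count else 0.0


# Illustrative two-joint, two-timestep trajectories (made-up values).
predicted = [[1.0, 2.0], [1.1, 1.8]]
executed = [[1.0, 2.0], [1.0, 2.0]]
deviation = mean_velocity_deviation_pct(predicted, executed)
print(round(deviation, 3))  # prints: 5.0
```

An aggregate like this hides where the mismatch occurs, which is why the referee's request for per-timestep actuator-limit counts remains relevant.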

Circularity Check

0 steps flagged

No load-bearing circularity; empirical evaluation on held-out tasks remains independent

The paper introduces Video Policy as an end-to-end trainable modular framework that jointly generates videos and actions from demonstration data. Reported gains in robustness and sample efficiency are measured via standard task-success metrics on unseen objects, backgrounds, and tasks in both simulation and real-world settings. No equation reduces the extracted policy performance to a fitted hyperparameter by construction, and no self-citation chain is invoked to justify uniqueness or forbid alternatives. The central claim therefore rests on empirical generalization rather than tautological redefinition of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that video generation quality directly correlates with policy quality and that action extraction from generated frames is stable; no explicit free parameters are named in the abstract, but the end-to-end training objective implicitly contains weighting coefficients between video and action losses.
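A common way to write such a weighted objective is shown below. This is a generic form under stated assumptions, not the paper's equation: the symbols $\theta$ (model parameters), $\mathcal{L}_{\text{video}}$, $\mathcal{L}_{\text{action}}$, and the single scalar weight $\lambda$ are introduced here for illustration.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{video}}(\theta) \;+\; \lambda \, \mathcal{L}_{\text{action}}(\theta), \qquad \lambda > 0
```

Under this form, $\lambda$ is the free parameter counted in the ledger: a larger $\lambda$ pushes training toward action regression and a smaller one toward video prediction.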

free parameters (1)
  • video-action loss weighting coefficient
    The modular framework must balance the video generation loss against the action prediction loss; the abstract does not state how this coefficient is chosen or whether it is tuned per task.
axioms (1)
  • domain assumption: Generated video frames contain sufficient information to recover executable robot actions
    The extraction step presupposes that the video model has learned a dynamics model that is invertible to actions without additional supervision.
invented entities (1)
  • Video Policy framework (no independent evidence)
    purpose: Joint video and action generation module
    The paper introduces this as a new modular architecture; no independent evidence is provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5486 in / 1305 out tokens · 16670 ms · 2026-05-15T21:40:58.620100+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  5. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  6. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  7. PlayWorld: Learning Robot World Models from Autonomous Play

    cs.RO 2026-03 unverdicted novelty 7.0

    PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...

  8. Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.

  9. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  10. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  11. A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

    cs.RO 2026-04 unverdicted novelty 6.0

    Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.

  12. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  13. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  14. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  15. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  16. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  17. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO 2025-12 unverdicted novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  18. Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.

  19. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  20. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 19 Pith papers · 7 internal anchors

  1. [1]

    M. Bain and C. Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129, 1995

  2. [2]

    C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In RSS, 2023

  3. [3]

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al. RT-1: Robotics transformer for real-world control at scale. In RSS, 2022

  4. [4]

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. In RSS, 2024

  5. [5]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. π0: A vision-language-action flow model for general robot control. RSS, 2025

  6. [6]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011

  7. [7]

    T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022

  8. [8]

    J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025

  9. [9]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009

  10. [10]

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021

  11. [11]

    G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaudhary, F. Guzmán, A. Joulin, and E. Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359, 2019

  12. [12]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  13. [13]

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

  14. [14]

    J. Liang, R. Liu, E. Ozguroglu, S. Sudhakar, A. Dave, P. Tokmakov, S. Song, and C. Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. CoRL, 2024

  15. [15]

    Y. Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. NeurIPS, 2023

  16. [16]

    Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024

  17. [17]

    D. A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. NIPS, 1988

  18. [18]

    T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In ICRA, 2018

  19. [19]

    P. Florence, L. Manuelli, and R. Tedrake. Self-supervised correspondence in visuomotor policy learning. IEEE Robotics and Automation Letters, 2019

  20. [20]

    Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006

  21. [21]

    Y. Du and I. Mordatch. Implicit generation and modeling with energy based models. NeurIPS, 2019

  22. [22]

    W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. CoRL, 2023

  23. [23]

    P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. In CoRL, 2022

  24. [24]

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. RSS, 2023

  25. [25]

    S. Lee, Y. Wang, H. Etukuru, H. J. Kim, N. M. M. Shafiullah, and L. Pinto. VQ-BeT: Behavior generation with latent actions. ICML, 2024

  26. [26]

    C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. NIPS, 2016

  27. [27]

    P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine, and G. Brain. Time-contrastive networks: Self-supervised learning from video. In ICRA, 2018

  28. [28]

    M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. ICLR, 2018

  29. [29]

    A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018

  30. [30]

    D. Suris, R. Liu, and C. Vondrick. Learning the predictability of the future. In CVPR, 2021

  31. [31]

    M. Laskin, A. Srinivas, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In ICML, 2020

  32. [32]

    S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3M: A universal visual representation for robot manipulation. In CoRL, 2022

  33. [33]

    Y. Seo, D. Hafner, H. Liu, F. Liu, S. James, K. Lee, and P. Abbeel. Masked world models for visual control. In CoRL, 2023

  34. [34]

    I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Robot learning with masked visual pre-training. In CoRL, 2023

  35. [35]

    Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. LIV: Language-image representations and rewards for robotic control. In ICML, 2023

  36. [36]

    A. S. Chen, S. Nair, and C. Finn. Learning generalizable robotic reward functions from "in-the-wild" human videos. In RSS, 2021

  37. [37]

    A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning. NeurIPS, 2023

  38. [38]

    Video (language) modeling: a baseline for generative models of natural videos

    M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014

  39. [39]

    C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. NIPS, 29, 2016

  40. [40]

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, 2023

  41. [41]

    I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

    S. Zhang, J. Wang, Y. Zhang, K. Zhao, H. Yuan, Z. Qin, X. Wang, D. Zhao, and J. Zhou. I2VGen-XL: High-quality image-to-video synthesis via cascaded diffusion models. arXiv preprint arXiv:2311.04145, 2023

  42. [42]

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans. Imagen video: High definition video generation with diffusion models, 2022

  43. [43]

    Z. Yang, Y. Chen, J. Wang, S. Manivasagam, W.-C. Ma, A. J. Yang, and R. Urtasun. UniSim: A neural closed-loop sensor simulator. In CVPR, 2023

  44. [44]

    Y. Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023

  45. [45]

    A. Ajay, S. Han, Y. Du, S. Li, A. Gupta, T. Jaakkola, J. Tenenbaum, L. Kaelbling, A. Srivastava, and P. Agrawal. Compositional foundation models for hierarchical planning. NeurIPS, 2023

  46. [46]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639, 2023

  47. [47]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    C.-L. Cheang, G. Chen, Y. Jing, T. Kong, H. Li, Y. Li, Y. Liu, H. Wu, J. Xu, Y. Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

  48. [48]

    Y. Guo, Y. Hu, J. Zhang, Y.-J. Wang, X. Chen, C. Lu, and J. Chen. Prediction with action: Visual policy learning via joint denoising process. NeurIPS, 2025

  49. [49]

    S. Li, Y. Gao, D. Sadigh, and S. Song. Unified video action model. RSS, 2025

  50. [50]

    C. Zhu, R. Yu, S. Feng, B. Burchfiel, P. Shah, and A. Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. RSS, 2025

  51. [51]

    S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. RSS, 2024

  52. [52]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS, 2023

  53. [53]

    A. Donat, X. Jia, X. Huang, A. Taranovic, D. Blessing, G. Li, H. Zhou, H. Zhang, R. Lioutikov, and G. Neumann. Towards fusing point cloud and visual representations for imitation learning. arXiv preprint arXiv:2502.12320, 2025

  54. [54]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  55. [55]

    B. Han, J. Kim, and J. Jang. A dual process VLA: Efficient robotic manipulation leveraging VLM. In CoRL, 2024

  56. [56]

    T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3D diffuser actor: Policy diffusion with 3D scene representations. In CoRL, 2024

  57. [57]

    Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3D diffusion policy. In RSS, 2024

  58. [58]

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. In CoRL, 2023

  59. [59]

    P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  60. [60]

    Y. Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. In ICML, 2023

  61. [61]

    Z. Zhou, D. Chen, C. Wang, and C. Chen. Fast ode-based sampling for diffusion models in around 5 steps. In CVPR, 2024

  62. [62]

    T. Li, Y. Tian, H. Li, M. Deng, and K. He. Autoregressive image generation without vector quantization. NeurIPS, 2024

  63. [63]

    C. Chi, Z. Xu, C. Pan, E. Cousineau, B. Burchfiel, S. Feng, R. Tedrake, and S. Song. Universal manipulation interface: In-the-wild robot teaching without in-the-wild robots. In RSS, 2024

A Appendix

A.1 Video Model Implementation

We adapted the pretrained SVD model, which generates 25-frame video sequences. In our RoboCasa Experiments, frame 1 is a p...