pith. machine review for the scientific record.

arxiv: 2512.15692 · v2 · submitted 2025-12-17 · 💻 cs.RO · cs.AI · cs.CV · cs.LG

Recognition: 2 Lean theorem links

mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:35 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.LG
keywords: video pretraining · robot manipulation · flow matching · vision-language-action models · inverse dynamics · sample efficiency · robotic control · action decoder

The pith

Pretrained video models plus a flow-matching decoder let robots learn manipulation with far less data than vision-language-action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that vision-language-action models start from static web images and text, forcing them to learn physical dynamics and timing entirely from scarce robot trajectories. It proposes instead to start with a large internet video model that already encodes semantics together with visual dynamics, then attach a lightweight flow-matching action decoder that converts video latents into low-level robot commands. This split isolates the hard control problem and leaves semantics and dynamics to the video pretraining. The authors report that the resulting Video-Action Model reaches state-of-the-art results on both simulated and real manipulation tasks while using ten times fewer samples and converging twice as fast.

Core claim

A Video-Action Model that conditions a flow-matching inverse-dynamics decoder on latent representations from a pretrained internet-scale video model can produce low-level robot actions directly from video-space plans, delivering state-of-the-art performance on simulated and real-world manipulation tasks with tenfold better sample efficiency and twofold faster convergence than conventional vision-language-action architectures.
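
A rough formalization of what such a decoder trains on (not printed in the material excerpted here, so read it as the standard linear-path flow-matching construction rather than the paper's exact recipe): with $z$ the frozen video latent, $a_1$ a ground-truth action chunk from the robot dataset $\mathcal{D}$, and $a_0$ Gaussian noise,

    \[
    \mathcal{L}_{\mathrm{FM}}(\theta)
      = \mathbb{E}_{t \sim \mathcal{U}[0,1],\ a_1 \sim \mathcal{D},\ a_0 \sim \mathcal{N}(0, I)}
        \bigl\| v_\theta(a_t,\, t \mid z) - (a_1 - a_0) \bigr\|^2,
    \qquad a_t = (1 - t)\, a_0 + t\, a_1 .
    \]

At inference the decoder integrates the learned velocity field $v_\theta$ from noise to an action chunk, which is what makes it an inverse dynamics model over the video model's plan rather than a pixels-to-actions policy. The paper's exact parameterization and conditioning may differ.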

What carries the argument

The Video-Action Model (VAM), which pairs a pretrained video model's latent representations with a flow-matching action decoder that functions as an inverse dynamics model to generate robot actions.
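
A minimal sketch of that inference path, under stated assumptions: every name and dimension below (ActionFlowDecoder, sample_actions, a 1024-dimensional latent, a 16-step chunk of 7-DoF commands) is hypothetical and stands in for whatever the paper actually uses; the only idea encoded is a flow-matching network conditioned on frozen video latents and integrated from noise to an action chunk.

    import torch
    import torch.nn as nn

    class ActionFlowDecoder(nn.Module):
        """Hypothetical stand-in for the flow-matching action decoder: predicts a
        velocity over an action chunk, conditioned on a frozen video latent."""

        def __init__(self, latent_dim=1024, action_dim=7, horizon=16, hidden=512):
            super().__init__()
            self.horizon, self.action_dim = horizon, action_dim
            self.net = nn.Sequential(
                nn.Linear(latent_dim + horizon * action_dim + 1, hidden),
                nn.GELU(),
                nn.Linear(hidden, hidden),
                nn.GELU(),
                nn.Linear(hidden, horizon * action_dim),
            )

        def forward(self, noisy_actions, t, video_latent):
            # noisy_actions: (B, horizon, action_dim); t: (B,); video_latent: (B, latent_dim)
            x = torch.cat([noisy_actions.flatten(1), video_latent, t[:, None]], dim=-1)
            return self.net(x).view(-1, self.horizon, self.action_dim)

    @torch.no_grad()
    def sample_actions(decoder, video_latent, steps=10):
        """Euler integration of the learned velocity field from Gaussian noise to
        an action chunk; the paper's solver and step count may differ."""
        batch = video_latent.shape[0]
        actions = torch.randn(batch, decoder.horizon, decoder.action_dim)
        for i in range(steps):
            t = torch.full((batch,), i / steps)
            actions = actions + decoder(actions, t, video_latent) / steps
        return actions

    # Stand-in latent; in the paper's setup this would come from the pretrained
    # internet-scale video model applied to the current observation window.
    latent = torch.randn(2, 1024)
    print(sample_actions(ActionFlowDecoder(), latent).shape)  # torch.Size([2, 16, 7])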

Load-bearing premise

That a pretrained internet video model already captures enough physical causality and temporal dynamics for the remaining job to reduce cleanly to low-level control through the flow-matching decoder.

What would settle it

An ablation that swaps the video pretraining backbone for a static vision-language model and measures whether sample efficiency and convergence speed fall back to levels seen in standard VLAs.
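
A hedged sketch of how that control could be organized; run_ablation, train_policy, and evaluate are placeholders for a replication's training loop and benchmark rollouts, not functions from the paper.

    # Hypothetical ablation harness: hold the flow-matching decoder, robot data,
    # and training budget fixed; swap only the pretrained encoder; compare how
    # task success scales with the number of demonstrations.
    def run_ablation(encoders, demo_budgets, train_policy, evaluate, seeds=(0, 1, 2)):
        results = {}
        for name, encoder in encoders.items():
            for n_demos in demo_budgets:
                scores = [evaluate(train_policy(encoder, n_demos=n_demos, seed=s))
                          for s in seeds]
                results[(name, n_demos)] = sum(scores) / len(scores)
        return results

    # e.g. encoders = {"video_pretrained": video_backbone, "static_vlm": vlm_backbone}
    #      demo_budgets = (50, 100, 500, 1000)
    # If the video-pretrained curve stops dominating at small budgets, the 10x
    # sample-efficiency claim cannot be credited to video pretraining.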

read the original abstract

Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces mimic-video, a Video-Action Model (VAM) that pairs a pretrained internet-scale video model with a flow-matching action decoder acting as an Inverse Dynamics Model. Conditioned on video latent representations, the decoder generates low-level robot actions. The central claim is that this yields state-of-the-art performance on simulated and real-world robotic manipulation tasks, with 10x better sample efficiency and 2x faster convergence than traditional Vision-Language-Action (VLA) models, by supplying physical dynamics absent from static vision-language pretraining.

Significance. If the attribution of gains to video-pretrained dynamics holds after proper controls, the work would offer a concrete route to lower data requirements for generalizable robot policies. The flow-matching decoder formulation is a technically coherent choice for mapping video latents to actions.

major comments (2)
  1. [Abstract] The headline quantitative claims (SOTA performance, 10x sample-efficiency gain, 2x convergence speedup) are stated without task definitions, baseline architectures, statistical reporting, or any ablation evidence, so the central performance assertions cannot be evaluated from the manuscript as presented.
  2. [Evaluation] No controlled ablation isolates the video encoder's contribution (e.g., freezing the flow-matching decoder and swapping the video backbone for an equivalently sized image or VLM encoder on the same robot data). Without this, the efficiency gains cannot be attributed to video-derived physical causality rather than decoder architecture, training schedule, or capacity differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We address each major comment below and have revised the manuscript to improve the presentation of results and strengthen the supporting evidence.

read point-by-point responses
  1. Referee: [Abstract] The headline quantitative claims (SOTA performance, 10x sample-efficiency gain, 2x convergence speedup) are stated without task definitions, baseline architectures, statistical reporting, or any ablation evidence, so the central performance assertions cannot be evaluated from the manuscript as presented.

    Authors: We agree that the abstract, being a concise summary, omits supporting details. The full Evaluation section defines the tasks (simulated and real-world robotic manipulation benchmarks including pick-and-place and drawer opening), specifies the VLA baseline architectures, reports statistical results (means and standard deviations over multiple seeds), and includes ablation studies. To make the claims more evaluable at a glance, we have revised the abstract to briefly reference the primary tasks, baselines, and the nature of the reported metrics. revision: yes

  2. Referee: [Evaluation] No controlled ablation isolates the video encoder's contribution (e.g., freezing the flow-matching decoder and swapping the video backbone for an equivalently sized image or VLM encoder on the same robot data). Without this, the efficiency gains cannot be attributed to video-derived physical causality rather than decoder architecture, training schedule, or capacity differences.

    Authors: This is a fair critique on causal attribution. While the manuscript compares against full VLA baselines and provides ablations on the flow-matching decoder and training schedule, it does not include the precise controlled swap of the video backbone (with the decoder frozen) against an equivalently sized image or VLM encoder trained on identical robot data. We will add this ablation experiment in the revised manuscript to more directly isolate the contribution of the video-pretrained representations. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external pretraining and empirical results

full rationale

The paper introduces mimic-video by pairing an external pretrained internet-scale video model with a flow-matching decoder acting as an IDM. No equations, derivations, or fitted parameters are presented that reduce the reported 10x sample-efficiency or 2x convergence gains to quantities defined inside the paper itself. The central premise attributes dynamics capture to the video pretraining step, which is described as external rather than derived or self-cited in a load-bearing way. Performance numbers are framed as evaluation outcomes on simulated and real tasks, not as predictions forced by internal construction. This satisfies the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that video pretraining supplies physical dynamics that VLAs lack, leaving only control to the decoder; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Pretrained video models capture semantics and visual dynamics sufficiently to isolate low-level control
    Stated in the abstract as the reason video pretraining reduces data burden compared with static VLAs.

pith-pipeline@v0.9.0 · 5538 in / 1259 out tokens · 39910 ms · 2026-05-15T10:35:14.719280+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  2. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  3. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  4. Mask World Model: Predicting What Matters for Robust Robot Policy Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...

  5. HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.

  6. Latent Geometry Beyond Search: Amortizing Planning in World Models

    cs.RO 2026-05 unverdicted novelty 6.0

    In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.

  7. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  8. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  9. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  10. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  11. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  12. FASTER: Value-Guided Sampling for Fast RL

    cs.LG 2026-04 unverdicted novelty 6.0

    FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

  13. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  14. Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

    cs.RO 2026-04 unverdicted novelty 6.0

    GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.

  15. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  16. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  17. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  18. Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

    cs.RO 2026-03 unverdicted novelty 5.0

    Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...

  19. Causal World Modeling for Robot Control

    cs.CV 2026-01 unverdicted novelty 5.0

    LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.

  20. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  21. World Model for Robot Learning: A Comprehensive Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 19 Pith papers · 30 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...

  2. [2]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Keysers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias B...

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. pi0: A V...

  4. [4]

    Robocat: A self-improving generalist agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 1(8), 2023

    Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauzá, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023

  5. [5]

    Emerging Properties in Self-Supervised Vision Transformers. arXiv:2104.14294 [cs], May 2021

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. arXiv:2104.14294 [cs], May 2021. URL http://arxiv.org/abs/2104.14294. arXiv: 2104.14294

  6. [6]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations,

  7. [7]

    URL https://arxiv.org/abs/1806.07366

  8. [8]

    Training strategies for efficient embodied reasoning

    William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. In Conference on Robot Learning, 2025

  9. [9]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, March 2024. URL http://arxiv.org/abs/2303.04137. arXiv:2303.04137 [cs]

  10. [10]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh G...

  11. [11]

    The ingredients for robotic diffusion transformers

    Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, and Sergey Levine. The ingredients for robotic diffusion transformers. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025

  12. [12]

    Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation, August 2024

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation, August 2024. URL http://arxiv.org/abs/2408.11812. arXiv:2408.11812 [cs]

  13. [13]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. URL http://arxiv.org/abs/2010.11929. arXiv:2010.11929 [cs]

  14. [14]

    Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. arXiv preprint arXiv:2505.23705, 2025

    Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better. arXiv preprint arXiv:2505.23705, 2025

  15. [15]

    Learning universal policies via text-guided video generation

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation, 2023. URL https://arxiv.org/abs/2302.00111

  16. [16]

    Video Language Planning

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video Language Planning, October 2023. URL http://arxiv.org/abs/2310.10625. arXiv:2310.10625 [cs]

  17. [17]

    Deep Visual Foresight for Planning Robot Motion

    Chelsea Finn and Sergey Levine. Deep Visual Foresight for Planning Robot Motion, March 2017. URL http://arxiv.org/abs/1610.00696. arXiv:1610.00696 [cs]

  18. [18]

    Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems, 29, 2016

    Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems, 29, 2016

  19. [19]

    Learning Visual Predictive Models of Physics for Playing Billiards

    Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning Visual Predictive Models of Physics for Playing Billiards, January 2016. URL http://arxiv.org/abs/1511.07404. arXiv:1511.07404 [cs]

  20. [20]

    Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,

    Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, and Xinggang Wang. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,

  21. [21]

    URL https://arxiv.org/abs/2502.13144

  22. [22]

    Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125, 2025

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2025. URL https://arxiv.org/abs/2510.10125

  23. [23]

    Ghil-glue: Hierarchical control with filtered subgoal images

    Kyle Beltran Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and Benjamin Burchfiel. Ghil-glue: Hierarchical control with filtered subgoal images. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025

  24. [24]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

  25. [25]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685

  26. [26]

    Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025

    Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision-language-action reasoning via reinforced visual latent planning, 2025. URL https://arxiv.org/abs/2507.16815

  27. [27]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  28. [28]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, September 2024. URL http://arxiv....

  29. [29]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645

  30. [30]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation, 2024. URL https://arxiv.org/abs/2405.05941

  31. [31]

    Dreamitate: Real-world visuomotor policy learning via video generation. arXiv preprint arXiv:2406.16862, 2024

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-World Visuomotor Policy Learning via Video Generation, June 2024. URL http://arxiv.org/abs/2406.16862. arXiv:2406.16862 [cs]

  32. [32]

    Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video Generators are Robot Policies, August 2025. URL http://arxiv.org/abs/2508.00795. arXiv:2508.00795 [cs]

  33. [33]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling, February 2023. URL http://arxiv.org/abs/2210.02747. arXiv:2210.02747 [cs]

  34. [34]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  36. [36]

    Learning latent plans from play

    Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. In Conference on Robot Learning, pages 1113–1132. PMLR, 2020

  37. [37]

    What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters (RA-L), 7(4):11205–11212, 2022

    Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters (RA-L), 7(4):11205–11212, 2022

  38. [38]

    mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity

    Elvis Nava, Victoriano Montesinos, Erik Bauer, Benedek Forrai, Jonas Pai, Stefan Weirich, Stephan-Daniel Gravert, Philipp Wand, Stephan Polinski, Benjamin F. Grewe, and Robert K. Katzschmann. mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity, June 2025. URL http://arxiv.org/abs/2506.11916. arXiv:2506.11916 [cs]

  39. [39]

    Cosmos world foundation model platform for physical ai,

    NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, ...

  40. [40]

    URL https://arxiv.org/abs/2501.03575

  41. [41]

    NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, J...

  42. [42]

    Action-Conditional Video Prediction using Deep Networks in Atari Games

    Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games, December 2015. URL http://arxiv.org/abs/1507.08750. arXiv:1507.08750 [cs]

  43. [43]

    Video generation models as world simulators, March 2024

    OpenAI. Video generation models as world simulators, March 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/

  44. [44]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748

  45. [45]

    Fast: Efficient action tokenization for vision-language-action models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. In Proceedings of Robotics: Science and Systems, Los Angeles, USA, 2025

  46. [46]

    Strengthening Generative Robot Policies through Predictive World Modeling, May 2025

    Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening Generative Robot Policies through Predictive World Modeling, May 2025. URL http://arxiv.org/abs/2502.00622. arXiv:2502.00622 [cs]

  47. [47]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683

  48. [48]

    A Generalist Agent. Transactions on Machine Learning Research, August 2022

    Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A Generalist Agent. Transactions on...

  49. [49]

    URL https://openreview.net/forum?id=1ikK0kHjvj

  50. [50]

    Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies, 2025

    Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies, 2025. URL https://arxiv.org/abs/2509.04996

  51. [51]

    A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

    Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URL https://arxiv.org/abs/1011.0686

  52. [52]

    Evaluating gemini robotics policies in a veo world simulator, 2025

    Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini robotics policies in a veo world simula...

  53. [53]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An Open-Source Generalist Robot Policy, May 2024. URL http://arxiv.org/abs/240...

  54. [54]

    Bridgedata v2: A dataset for robot learning at scale, 2024

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2024. URL https://arxiv.org/abs/2308.12952

  55. [55]

    Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images

    Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images, November 2015. URL http://arxiv.org/abs/1506.07365. arXiv:1506.07365 [cs]

  56. [56]

    Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  57. [57]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners, September 2025. URL http://arxiv.org/abs/2509.20328. arXiv:2509.20328 [cs]

  58. [58]

    Latent Action Pretraining from Videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent Action Pretraining from Videos, May 2025. URL http://arxiv.org/abs/2410.11758. arXiv:2410.11758 [cs]

  59. [59]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic Control via Embodied Chain-of-Thought Reasoning, March 2025. URL http://arxiv.org/abs/2407.08693. arXiv:2407.08693 [cs]

  60. [60]

    CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, March 2025

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, March 2025. URL http://arxiv.org/abs/2503.22020. arXiv:2503.22020 [cs]

  61. [61]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/abs/2304.13705. arXiv:2304.13705 [cs]

  62. [62]

    Flare: Robot learning with implicit world modeling, 2025

    Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. Flare: Robot learning with implicit world modeling, 2025. URL https://arxiv.org/abs/2...

  63. [63]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets, May 2025. URL http://arxiv.org/abs/2504.02792. arXiv:2504.02792 [cs]

  64. [64]

    noise as augmentation

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...