Recognition: 2 theorem links
· Lean Theoremmimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Pith reviewed 2026-05-15 10:35 UTC · model grok-4.3
The pith
Pretrained video models plus a flow-matching decoder let robots learn manipulation with far less data than vision-language-action models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A Video-Action Model that conditions a flow-matching inverse-dynamics decoder on latent representations from a pretrained internet-scale video model can produce low-level robot actions directly from video-space plans, delivering state-of-the-art performance on simulated and real-world manipulation tasks with tenfold better sample efficiency and twofold faster convergence than conventional vision-language-action architectures.
What carries the argument
The Video-Action Model (VAM), which pairs a pretrained video model's latent representations with a flow-matching action decoder that functions as an inverse dynamics model to generate robot actions.
Load-bearing premise
That a pretrained internet video model already captures enough physical causality and temporal dynamics for the remaining job to reduce cleanly to low-level control through the flow-matching decoder.
What would settle it
An ablation that swaps the video pretraining backbone for a static vision-language model and measures whether sample efficiency and convergence speed fall back to levels seen in standard VLAs.
read the original abstract
Prevailing Vision-Language-Action Models (VLAs) for robotic manipulation are built upon vision-language backbones pretrained on large-scale, but disconnected static web data. As a result, despite improved semantic generalization, the policy must implicitly infer complex physical dynamics and temporal dependencies solely from robot trajectories. This reliance creates an unsustainable data burden, necessitating continuous, large-scale expert data collection to compensate for the lack of innate physical understanding. We contend that while vision-language pretraining effectively captures semantic priors, it remains blind to physical causality. A more effective paradigm leverages video to jointly capture semantics and visual dynamics during pretraining, thereby isolating the remaining task of low-level control. To this end, we introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations. The decoder serves as an Inverse Dynamics Model (IDM), generating low-level robot actions from the latent representation of video-space action plans. Our extensive evaluation shows that our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces mimic-video, a Video-Action Model (VAM) that pairs a pretrained internet-scale video model with a flow-matching action decoder acting as an Inverse Dynamics Model. Conditioned on video latent representations, the decoder generates low-level robot actions. The central claim is that this yields state-of-the-art performance on simulated and real-world robotic manipulation tasks, with 10x better sample efficiency and 2x faster convergence than traditional Vision-Language-Action (VLA) models, by supplying physical dynamics absent from static vision-language pretraining.
Significance. If the attribution of gains to video-pretrained dynamics holds after proper controls, the work would offer a concrete route to lower data requirements for generalizable robot policies. The flow-matching decoder formulation is a technically coherent choice for mapping video latents to actions.
major comments (2)
- [Abstract] Abstract: the headline quantitative claims (SOTA performance, 10x sample-efficiency gain, 2x convergence speedup) are stated without task definitions, baseline architectures, statistical reporting, or any ablation evidence, so the central performance assertions cannot be evaluated from the manuscript as presented.
- [Evaluation] Evaluation section: no controlled ablation isolates the video encoder's contribution (e.g., freezing the flow-matching decoder and swapping the video backbone for an equivalently sized image or VLM encoder on the same robot data). Without this, the efficiency gains cannot be attributed to video-derived physical causality rather than decoder architecture, training schedule, or capacity differences.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on the manuscript. We address each major comment below and have revised the manuscript to improve the presentation of results and strengthen the supporting evidence.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline quantitative claims (SOTA performance, 10x sample-efficiency gain, 2x convergence speedup) are stated without task definitions, baseline architectures, statistical reporting, or any ablation evidence, so the central performance assertions cannot be evaluated from the manuscript as presented.
Authors: We agree that the abstract, being a concise summary, omits supporting details. The full Evaluation section defines the tasks (simulated and real-world robotic manipulation benchmarks including pick-and-place and drawer opening), specifies the VLA baseline architectures, reports statistical results (means and standard deviations over multiple seeds), and includes ablation studies. To make the claims more evaluable at a glance, we have revised the abstract to briefly reference the primary tasks, baselines, and the nature of the reported metrics. revision: yes
-
Referee: [Evaluation] Evaluation section: no controlled ablation isolates the video encoder's contribution (e.g., freezing the flow-matching decoder and swapping the video backbone for an equivalently sized image or VLM encoder on the same robot data). Without this, the efficiency gains cannot be attributed to video-derived physical causality rather than decoder architecture, training schedule, or capacity differences.
Authors: This is a fair critique on causal attribution. While the manuscript compares against full VLA baselines and provides ablations on the flow-matching decoder and training schedule, it does not include the precise controlled swap of the video backbone (with the decoder frozen) against an equivalently sized image or VLM encoder trained on identical robot data. We will add this ablation experiment in the revised manuscript to more directly isolate the contribution of the video-pretrained representations. revision: yes
Circularity Check
No circularity: claims rest on external pretraining and empirical results
full rationale
The paper introduces mimic-video by pairing an external pretrained internet-scale video model with a flow-matching decoder acting as an IDM. No equations, derivations, or fitted parameters are presented that reduce the reported 10x sample-efficiency or 2x convergence gains to quantities defined inside the paper itself. The central premise attributes dynamics capture to the video pretraining step, which is described as external rather than derived or self-cited in a load-bearing way. Performance numbers are framed as evaluation outcomes on simulated and real tasks, not as predictions forced by internal construction. This satisfies the default expectation of a non-circular paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Pretrained video models capture semantics and visual dynamics sufficiently to isolate low-level control
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergencehierarchy_emergence_forces_phi unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization...
-
HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
-
Latent Geometry Beyond Search: Amortizing Planning in World Models
In regularized latent spaces of world models, planning can be amortized into a goal-conditioned inverse dynamics model that matches CEM performance at 100-130x lower per-decision cost.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
GazeVLA: Learning Human Intention for Robotic Manipulation
GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
World Action Models are Zero-shot Policies
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
-
Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance
Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...
-
Causal World Modeling for Robot Control
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
World Model for Robot Learning: A Comprehensive Survey
A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Reference graph
Works this paper leans on
-
[1]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba, Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Fran- cois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, X...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
PaliGemma: A versatile 3B VLM for transfer
Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, Thomas Unterthiner, Daniel Key- sers, Skanda Koppula, Fangyu Liu, Adam Grycner, Alexey Gritsenko, Neil Houlsby, Manoj Kumar, Keran Rong, Julian Eisenschlos, Rishabh Kabra, Matthias B...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mo- hith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. pi0: A V...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Konstantinos Bousmalis, Giulia Vezzani, Dushyant Rao, Coline Devin, Alex X Lee, Maria Bauzá, Todor Davchev, Yuxiang Zhou, Agrim Gupta, Akhil Raju, et al. Robocat: A self-improving generalist agent for robotic manipulation. arXiv preprint arXiv:2306.11706, 2023
-
[5]
Emerging Properties in Self-Supervised Vision Transformers.arXiv:2104.14294 [cs], May 2021
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers.arXiv:2104.14294 [cs], May 2021. URL http://arxiv.org/abs/2104.14294. arXiv: 2104.14294
-
[6]
Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations,
-
[7]
URL https://arxiv.org/abs/1806.07366
work page internal anchor Pith review arXiv
-
[8]
Training strategies for efficient embodied reasoning
William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. In Conference on Robot Learning, 2025
work page 2025
-
[9]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, March 2024. URL http://arxiv.org/abs/ 2303.04137. arXiv:2303.04137 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Open X.-Embodiment Collaboration, Abby O’Neill, Ab- dul Rehman, Abhinav Gupta, Abhiram Maddukuri, Ab- hishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, An- drew Wang, Andrey Kolobov, Anikait Singh, Animesh G...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
The ingredients for robotic diffusion transformers
Sudeep Dasari, Oier Mees, Sebastian Zhao, Mohan Kumar Srirama, and Sergey Levine. The ingredients for robotic diffusion transformers. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025
work page 2025
-
[12]
Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation, August 2024. URL http://arxiv.org/abs/2408. 11812. arXiv:2408.11812 [cs]
-
[13]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, June 2021. URL http://arxiv.org/ abs/2010.11929. arXiv:2010.11929 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[14]
Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.arXiv preprint arXiv:2505.23705, 2025
-
[15]
Tenenbaum, Dale Schuurmans, and Pieter Abbeel
Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation, 2023. URL https://arxiv.org/abs/2302. 00111
work page 2023
-
[16]
Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson
Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video Language Planning, October 2023. URL http://arxiv.org/abs/2310.10625. arXiv:2310.10625 [cs]
-
[17]
Deep Visual Foresight for Planning Robot Motion
Chelsea Finn and Sergey Levine. Deep Visual Foresight for Planning Robot Motion, March 2017. URL http: //arxiv.org/abs/1610.00696. arXiv:1610.00696 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsu- pervised learning for physical interaction through video prediction.Advances in neural information processing systems, 29, 2016
work page 2016
-
[19]
Learning Visual Predictive Models of Physics for Playing Billiards
Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning Visual Predictive Models of Physics for Playing Billiards, January 2016. URL http://arxiv.org/abs/1511.07404. arXiv:1511.07404 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[20]
Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,
Hao Gao, Shaoyu Chen, Bo Jiang, Bencheng Liao, Yiang Shi, Xiaoyang Guo, Yuechuan Pu, Haoran Yin, Xiangyu Li, Xinbang Zhang, Ying Zhang, Wenyu Liu, Qian Zhang, and Xinggang Wang. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning,
- [21]
-
[22]
Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation, 2025. URL https://arxiv. org/abs/2510.10125
-
[23]
Ghil-glue: Hierarchical control with filtered subgoal images
Kyle Beltran Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and Benjamin Burchfiel. Ghil-glue: Hierarchical control with filtered subgoal images. InProceedings of the IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 2025
work page 2025
-
[24]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. InAdvances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html
work page 2020
-
[25]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[26]
Thinkact: Vision- language-action reasoning via reinforced visual latent planning, 2025
Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu- Chiang Frank Wang, and Fu-En Yang. Thinkact: Vision- language-action reasoning via reinforced visual latent planning, 2025. URL https://arxiv.org/abs/2507.16815
-
[27]
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[28]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model, September 2024. URL http://arxiv....
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine- tuning vision-language-action models: Optimizing speed and success, 2025. URL https://arxiv.org/abs/2502.19645
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation, 2024. URL https://arxiv.org/abs/2405.05941
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-World Visuomotor Policy Learning via Video Generation, June 2024. URL http://arxiv.org/abs/2406.16862. arXiv:2406.16862 [cs]
-
[32]
Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025
Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl V ondrick. Video Generators are Robot Policies, August 2025. URL http://arxiv.org/abs/2508.00795. arXiv:2508.00795 [cs]
-
[33]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling, February 2023. URL http://arxiv. org/abs/2210.02747. arXiv:2210.02747 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023. URL https://arxiv.org/abs/2306.03310
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/ 1711.05101
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[36]
Learning latent plans from play
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. Learning latent plans from play. InConference on robot learning, pages 1113–1132. Pmlr, 2020
work page 2020
-
[37]
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data.IEEE Robotics and Automation Letters (RA-L), 7(4):11205–11212, 2022
work page 2022
-
[38]
Elvis Nava, Victoriano Montesinos, Erik Bauer, Benedek Forrai, Jonas Pai, Stefan Weirich, Stephan-Daniel Gravert, Philipp Wand, Stephan Polinski, Benjamin F. Grewe, and Robert K. Katzschmann. mimic-one: a Scalable Model Recipe for General Purpose Robot Dexterity, June 2025. URL http://arxiv.org/abs/2506.11916. arXiv:2506.11916 [cs]
-
[39]
Cosmos world foundation model platform for physical ai,
NVIDIA, :, Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, ...
-
[40]
URL https://arxiv.org/abs/2501.03575
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopad- hyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, J...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Action-Conditional Video Prediction using Deep Networks in Atari Games
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard Lewis, and Satinder Singh. Action-Conditional Video Prediction using Deep Networks in Atari Games, December 2015. URL http://arxiv.org/abs/1507.08750. arXiv:1507.08750 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[43]
Video generation models as world simu- lators, March 2024
OpenAI. Video generation models as world simu- lators, March 2024. URL https://openai.com/index/ video-generation-models-as-world-simulators/
work page 2024
-
[44]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/ abs/2212.09748
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Fast: Efficient action tokenization for vision-language-action models
Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. InProceedings of Robotics: Science and Systems, Los Angeles, USA, 2025
work page 2025
-
[46]
Strengthening Generative Robot Policies through Predictive World Modeling, May 2025
Han Qi, Haocheng Yin, Aris Zhu, Yilun Du, and Heng Yang. Strengthening Generative Robot Policies through Predictive World Modeling, May 2025. URL http://arxiv. org/abs/2502.00622. arXiv:2502.00622 [cs]
-
[47]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023. URL https://arxiv.org/abs/1910.10683
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[48]
A Generalist Agent.Transactions on Machine Learning Research, August 2022
Scott Reed, Konrad Zolna, Emilio Parisotto, Ser- gio Gómez Colmenarejo, Alexander Novikov, Gabriel Barth-maron, Mai Giménez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, and Nando de Freitas. A Generalist Agent.Transactions on...
work page 2022
-
[49]
URL https://openreview.net/forum?id=1ikK0kHjvj
-
[50]
Moritz Reuss, Hongyi Zhou, Marcel Rühle, Ömer Erdinç Ya ˘gmurlu, Fabian Otto, and Rudolf Lioutikov. Flower: Democratizing generalist robot policies with efficient vision-language-action flow policies, 2025. URL https://arxiv.org/abs/2509.04996
-
[51]
A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning, 2011. URL https://arxiv.org/abs/1011.0686
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[52]
Evaluating gemini robotics policies in a veo world simulator, 2025
Gemini Robotics Team, Coline Devin, Yilun Du, De- bidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini robotics policies in a veo world simula...
-
[53]
Octo: An Open-Source Generalist Robot Policy
Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An Open-Source Generalist Robot Policy, May 2024. URL http://arxiv.org/abs/240...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Bridgedata v2: A dataset for robot learning at scale, 2024
Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen- Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale, 2024. URL https: //arxiv.org/abs/2308.12952
-
[55]
Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images
Manuel Watter, Jost Tobias Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images, November 2015. URL http://arxiv.org/abs/ 1506.07365. arXiv:1506.07365 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[56]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[57]
Video models are zero-shot learners and reasoners
Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixi- ang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners, September 2025. URL http://arxiv.org/abs/2509.20328. arXiv:2509.20328 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent Action Pretraining from Videos, May 2025. URL http://arxiv.org/abs/2410.11758. arXiv:2410.11758 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Robotic Control via Embodied Chain-of-Thought Reasoning
Michał Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic Control via Embodied Chain-of-Thought Reasoning, March 2025. URL http://arxiv.org/abs/2407.08693. arXiv:2407.08693 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, March 2025
Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, and Tsung-Yi Lin. CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models, March 2025. URL http://arxiv.org/abs/2503.22020. arXiv:2503.22020 [cs]
-
[61]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware
Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware, April 2023. URL http://arxiv.org/ abs/2304.13705. arXiv:2304.13705 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[62]
Flare: Robot learning with implicit world modeling, 2025
Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, and Linxi Fan. Flare: Robot learning with implicit world modeling, 2025. URL https://arxiv.org/abs/2...
-
[63]
Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets, May 2025. URL http://arxiv.org/abs/2504.02792. arXiv:2504.02792 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R. Sanketi, Grecia Salazar, Michael S. Ryoo, Krista Reymann, Kanishka Rao, Karl Pertsch, Igor Mordatch, Henryk Michalewski...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.