pith. sign in

arxiv: 2605.27817 · v1 · pith:W7TJE3RCnew · submitted 2026-05-27 · 💻 cs.RO · cs.AI· cs.CV· cs.LG

Turning Video Models into Generalist Robot Policies

Pith reviewed 2026-06-29 12:32 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CVcs.LG
keywords video generative modelsinverse dynamics modelsrobot policieszero-shot transfercross-embodiment controlJacobiangeneralist policies
0
0 comments X

The pith

Decoupled video planning paired with Jacobian-based inverse dynamics models produces cross-embodiment robot policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether video generative models can drive robot control without joint training on action-labeled data for every new robot body. It keeps an action-free video planner fixed and trains only a separate inverse dynamics model that converts the planner's video predictions into robot joint commands using the robot's Jacobian. Experiments test this on simulated and real tasks, including zero-shot transfer to a Panda arm and a 16-DoF Allegro hand. A reader would care if the separation allows one video model to serve many embodiments while each IDM trains on cheap self-play data.

Core claim

The paper claims that an action-free video world model combined with an embodiment-specific inverse dynamics model based on the robot Jacobian forms a closed-loop video-to-action policy that achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation, and that the same video planner works across embodiments when paired with different IDMs.

What carries the argument

The embodiment-specific inverse dynamics model (IDM) based on the robot Jacobian, which converts predicted future video frames into executable joint actions without any retraining of the video planner.

If this is right

  • The video planner remains embodiment-agnostic and can be swapped for other video models without retraining the IDM.
  • The IDM trains independently on readily available self-play data and scales to high-dimensional action spaces such as 16-DoF hands.
  • The same video planner produces viable policies on both a Panda arm and an Allegro hand when paired with the matching IDM.
  • Closed-loop execution of the combined system yields strong performance on manipulation benchmarks without embodiment-specific finetuning of the planner.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the Jacobian IDM remains accurate under contact-rich dynamics, the method could extend to tasks involving tool use or deformable objects with minimal extra data.
  • Replacing the Jacobian IDM with a learned model trained on the same self-play data might further improve fidelity when analytic Jacobians are unavailable or noisy.
  • Applying the decoupling to mobile bases or legged robots would test whether the video planner's predictions transfer when the action space changes more dramatically.

Load-bearing premise

An embodiment-specific IDM based on the robot Jacobian can accurately translate predictions from an action-free video model into executable actions without any joint training of the video planner on that embodiment's data.

What would settle it

Demonstrating that the Jacobian-based IDM produces joint commands whose resulting video trajectories deviate substantially from the video model's predictions on a new embodiment, causing task success rates to fall below those of jointly trained baselines.

Figures

Figures reproduced from arXiv: 2605.27817 by Evan Kim, Max Simchowitz, Sizhe Lester Li, Tao Pang, Tong Zhao, Vincent Sitzmann, Xingjian Bai.

Figure 1
Figure 1. Figure 1: Controlling robots across embodiments, skills, and environments. We demonstrate that a 14B video generative model, paired with an inverse dynamics model carefully designed for action-following, can solve a wide variety of robotic tasks, ranging from zero-shot pick-and-place tasks on a real-world Panda arm to contact-rich re-orientation of a cube with a 16-DoF robotic hand. Abstract: Video generative models… view at source ↗
Figure 2
Figure 2. Figure 2: Translating video to actions. a, Given context frames, our video world model rolls out a short visual plan. b, The Jacobian IDM inverts each step of this path into a chunk of low-level actions. c, The chunk is executed and observations return to the video model for closed-loop execution. Executing on a visual chunk. Although the planner generates M future frames, the controller commits to executing only th… view at source ↗
Figure 3
Figure 3. Figure 3: Zero-Shot Evaluation on Panda Arm. a, After training on DROID [42], we deploy zero-shot on a Panda in an unseen scene with ad-hoc camera placements; in Press-Hidden-Button, the blue target is visible from only one of three views, with an orange distractor probing instruction grounding. b, Visualization of the predicted Jacobian: each column of Jθ is assigned a fixed RGB color and rendered, per pixel, weigh… view at source ↗
Figure 4
Figure 4. Figure 4: In-Hand Object Reorientation. a, Cube-reorientation demonstrations [46] with three language instructions (clockwise, counter-clockwise, random); we evaluate prompt-conditioned in￾hand reorientation. b, Predicted Jacobian visualized as in Fig. 3b (columns: selected action channels out of 16 total; rows: viewpoints). c, Generated frames and executions side-by-side with prompt; the model orchestrates dexterou… view at source ↗
Figure 5
Figure 5. Figure 5: J-IDM scales with action dimensionality. a, Rollouts of recovered actions for direct-IDM baselines and J-IDM as DoFs grow. b, At a fixed data budget, the J-IDM gap over direct-IDMs widens with DoF. c, At high DoF, J-IDM has a better data-accuracy trade-off across training sizes [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Real-world comparison with robot foundation model baselines. VERA remains competitive on basic tasks and substantially bet￾ter on challenging tasks. Instruction following on challenging tasks. We further design two sets of harder, reasoning-heavy prompts: location-based prompts (e.g., “push the but￾ton on top of the paper”) that do not directly reveal the target, and semantic-based prompts (e.g., “push the… view at source ↗
Figure 7
Figure 7. Figure 7: Additional results on the hidden button task. (Top) We visualize the dream rollout of our approach. (Bottom) We visualize the execution results of our model, finding that our model is capable of solving the task by finding the hidden button and pushing it. C Additional Results and Qualitative Analyses Fine-tuning is critical for robot planning. A central question is whether generic video pretraining transf… view at source ↗
Figure 8
Figure 8. Figure 8: Effect of action chunking on closed-loop performance. We ablate the number of low-level actions executed before replanning. Increasing the chunk length initially improves performance by producing smoother local execution and reducing excessive replanning. However, overly long chunks reduce feedback frequency and can hurt performance, especially when contact or visual tracking errors accumulate. We therefor… view at source ↗
read the original abstract

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VERA, a closed-loop video-to-action policy that decouples an action-free video world model (kept embodiment-agnostic) from an embodiment-specific inverse dynamics model (IDM) based on the robot Jacobian. It claims this separation enables easy interchange of video models, independent IDM training on self-play data, data-efficiency, and scalability to high-DoF actions, with strong performance on simulated and real-world benchmarks including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation, positioning the approach as a viable route to zero-shot, cross-embodiment robot control.

Significance. If the empirical results hold, the decoupling strategy offers a practical alternative to joint video-action finetuning by preserving the video planner's generality and allowing embodiment-specific components to be trained separately. The claimed data-efficiency of the Jacobian IDM and its extension to 16-DoF actions represent concrete engineering contributions that could lower barriers to generalist policies. Explicit credit is due for the emphasis on interchangeability and the provision of a project website for additional results, which aids reproducibility in an empirical robotics setting.

major comments (2)
  1. [Abstract and IDM section] Abstract and Methods (IDM design): the central claim that a Jacobian-based IDM can faithfully translate action-free video predictions into executable actions without joint training on the target embodiment is load-bearing, yet the manuscript provides no quantitative evidence (metrics, baselines, ablations, or error analysis) on how video frames are mapped to the exact state representation required by the IDM; this is especially critical for the 16-DoF Allegro hand where small prediction discrepancies can produce large kinematic errors.
  2. [Results (Allegro hand experiments)] Results on 16-DoF Allegro hand: the zero-shot dexterous re-orientation result relies on closed-loop execution not accumulating errors from distribution shift between self-play IDM data and video-model rollouts, but no analysis of state-estimation accuracy or uncertainty handling is supplied to support that the differential kinematics inversion remains stable.
minor comments (2)
  1. [Abstract] The abstract states 'strong performance' and 'achieves strong performance across simulated and real-world benchmarks' without any numbers; moving quantitative results, baselines, and ablations into the abstract or a prominent table would improve readability.
  2. [Methods] Notation for the Jacobian IDM and video-to-state conversion should be defined more explicitly with equations to allow readers to verify the claimed parameter-free or data-efficient properties.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and positive assessment of the decoupling strategy. We address each major comment below and will revise the manuscript accordingly where additional evidence or discussion is warranted.

read point-by-point responses
  1. Referee: [Abstract and IDM section] Abstract and Methods (IDM design): the central claim that a Jacobian-based IDM can faithfully translate action-free video predictions into executable actions without joint training on the target embodiment is load-bearing, yet the manuscript provides no quantitative evidence (metrics, baselines, ablations, or error analysis) on how video frames are mapped to the exact state representation required by the IDM; this is especially critical for the 16-DoF Allegro hand where small prediction discrepancies can produce large kinematic errors.

    Authors: We agree that explicit quantitative characterization of the video-to-state mapping is important for the central claim. The Methods section details the Jacobian-based differential kinematics inversion used to convert predicted video frames into actions, and the overall policy results (including the 16-DoF Allegro experiments) serve as indirect validation. However, we acknowledge the absence of dedicated IDM-specific ablations, error metrics, or baselines on state estimation accuracy. We will add these analyses (including per-DoF error statistics on the Allegro hand) to the revised manuscript. revision: yes

  2. Referee: [Results (Allegro hand experiments)] Results on 16-DoF Allegro hand: the zero-shot dexterous re-orientation result relies on closed-loop execution not accumulating errors from distribution shift between self-play IDM data and video-model rollouts, but no analysis of state-estimation accuracy or uncertainty handling is supplied to support that the differential kinematics inversion remains stable.

    Authors: The closed-loop results on the Allegro hand demonstrate that the policy completes the re-orientation task without catastrophic failure, which we attribute to the data-efficient self-play training of the IDM and the embodiment-specific Jacobian design. We do not currently provide separate state-estimation accuracy curves or uncertainty quantification. We will expand the Results and Discussion sections with available state-tracking diagnostics from the experiments and a brief analysis of closed-loop stability to address this point. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical decoupling of video planner and Jacobian IDM stands on independent components

full rationale

The paper's core claim rests on an empirical separation: an action-free video model is left unchanged while an embodiment-specific IDM (Jacobian-based) is trained separately on self-play data. No equations or results are shown to reduce by construction to fitted quantities defined by the target outcome; the video-to-action translation is presented as a testable engineering choice whose success is measured on external benchmarks (Panda, Allegro hand) rather than derived from the result itself. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text, and the derivation chain does not rename known patterns or smuggle assumptions via citation. The approach is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a Jacobian-derived IDM can faithfully invert video predictions into actions for any embodiment without video-model retraining; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption The robot embodiment Jacobian provides a sufficient mapping from video-predicted observations to executable actions.
    This premise underpins the IDM design and the claim that the video planner can remain action-free.

pith-pipeline@v0.9.1-grok · 5822 in / 1225 out tokens · 40441 ms · 2026-06-29T12:32:08.686934+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. World Action Models: A Survey

    cs.RO 2026-06 unverdicted novelty 3.0

    A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

Reference graph

Works this paper leans on

53 extracted references · 22 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    Zitkovich, T

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. InProceedings of the 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 2165–2183. PMLR,

  2. [2]

    URLhttps://proceedings.mlr.press/v229/zitkovich23a.html

  3. [3]

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, G. Lam, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Proceedings of The 8th Conference on R...

  4. [4]

    Ghosh, H

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024. URL https://octo-models. github.io/

  5. [5]

    Black, N

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π0: A vision-language- action flow model for general robot control. InProceedings of Robotics: ...

  6. [6]

    URL https://www.roboticsproceedings.org/ rss21/p010.html

    doi:10.15607/RSS.2025.XXI.010. URL https://www.roboticsproceedings.org/ rss21/p010.html

  7. [7]

    Black, N

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, b. ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...

  8. [8]

    J. Wang, M. Leonard, K. Daniilidis, D. Jayaraman, and E. Hu. Evaluating pi0 in the wild: Strengths, problems, and the future of generalist robot policies, 2025

  9. [9]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024. URL https: //arxiv.org/abs/2410.06158

  10. [10]

    Y . Hu, Y . Guo, P. Wang, X. Chen, Y .-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Proceedings of the 42nd International Conference on Machine Learning, 2025. URL https: //proceedings.mlr.press/v267/hu25g.html. Spotlight. 9

  11. [11]

    L. Li, Q. Zhang, Y . Luo, S. Yang, R. Wang, F. Han, M. Yu, Z. Gao, N. Xue, X. Zhu, Y . Shen, and Y . Xu. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026. URL https://arxiv.org/abs/2601.21998. Introduces LingBot-V A, a video–action co- prediction framework

  12. [12]

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. J. Fan, and J. Jang. World action mod...

  13. [13]

    URLhttps://arxiv.org/abs/2602.15922

  14. [14]

    S. Yang, Y . Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://openreview.net/forum?id=sFyTZEqmUY

  15. [15]

    G. Zhou, H. Pan, Y . LeCun, and L. Pinto. DINO-WM: World models on pre-trained visual features enable zero-shot planning. InProceedings of the 42nd International Conference on Machine Learning, 2025. URLhttps://proceedings.mlr.press/v267/zhou25t.html

  16. [16]

    S. Li, Y . Gao, D. Sadigh, and S. Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

  17. [17]

    Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuur- mans, and P. Abbeel. Learning universal policies via text-guided video gener- ation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/ 1d5b9233ad716a43be5c0d3023cb82d0-Abstract-Conference.html

  18. [18]

    P.-C. Ko, J. Mao, Y . Du, S.-H. Sun, and J. B. Tenenbaum. Learning to act from actionless videos through dense correspondences. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://openreview.net/forum?id=Mhb5fpA1T0

  19. [19]

    Causal video models are data-efficient robot policy learners

    Rhoda AI Research. Causal video models are data-efficient robot policy learners. Rhoda AI Research blog, 2026. URLhttps://www.rhoda.ai/research/direct-video-action

  20. [20]

    B. Chen, T. Zhang, H. Geng, K. Song, C. Zhang, P. Li, W. T. Freeman, J. Malik, P. Abbeel, R. Tedrake, V . Sitzmann, and Y . Du. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025. URLhttps://arxiv.org/abs/2512.15840

  21. [21]

    Allegro hand

    Wonik Robotics. Allegro hand. https://www.allegrohand.com/sub/product/p.php? idx=1, 2024. Accessed: 2026-05-07

  22. [22]

    Mandlekar, D

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. InProceedings of the 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 1678–1690. PMLR, 2022. URL https://p...

  23. [23]

    E. Jang, A. Irpan, M. Khansari, D. Kappler, F. Ebert, C. Lynch, S. Levine, and C. Finn. BC- Z: Zero-shot task generalization with robotic imitation learning. InProceedings of the 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 991–1002. PMLR, 2022. URL https://proceedings.mlr.press/v164/jang22a. html

  24. [24]

    Shridhar, L

    M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In K. Liu, D. Kulic, and J. Ichnowski, editors,Proceedings of The 6th Conference on Robot Learning, volume 205 ofProceedings of Machine Learning Research, pages 785–799. PMLR, 14–18 Dec 2023. URL https://proceedings.mlr.press/v205/shridhar23a. html. 10

  25. [25]

    Jiang, A

    Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan. VIMA: Robot manipulation with multimodal prompts. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learnin...

  26. [26]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InProceedings of Robotics: Science and Systems, 2023. doi:10.15607/ RSS.2023.XIX.016. URL https://www.roboticsproceedings.org/rss19/p016.html. Action Chunking with Transformers (ACT)

  27. [27]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10–11):1684–1704, 2025. doi:10.1177/02783649241273668. URL https: //journals.sagepub.com/doi/10.1177/02783649241273668

  28. [28]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Haus- man, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, K.-H. Lee, S. Levine, Y . Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsch, J....

  29. [29]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence. Palm-e: An embodied multimodal language model. InarXiv preprint arXiv:2303.03378, 2023

  30. [30]

    arXiv preprint arXiv:2306.11706 , year=

    K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza, T. Davchev, Y . Zhou, A. Gupta, A. Raju, A. Laurens, C. Fantacci, V . Dalibard, M. Zambelli, M. Martins, R. Pevce- viciute, M. Blokzijl, M. Denil, N. Batchelor, T. Lampe, E. Parisotto, K. ˙Zołna, S. Reed, S. G. Colmenarejo, J. Scholz, A. Abdolmaleki, O. Groth, J.-B. Regli, O. Sushkov, T. Rot...

  31. [31]

    O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, A. Tung, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Gupta, A. Wang, A. Kolobov, A. Singh, A. Garg, A. Kembhavi, A. Xie, A. Brohan, A. Raffin, A. Sharma, A. Yavary, A. Jain, A. Balakrishna, A. Wahid...

  32. [32]

    Brooks, B

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y . Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. OpenAI Technical Report, 2024. URL https://openai.com/index/ video-generation-models-as-world-simulators/

  33. [33]

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y . Zhang, W. Wang, Y . Cheng, B. Xu, X. Gu, Y . Dong, and J. Tang. CogVideoX: Text- to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025. URLhttps://openreview.net/forum?id=LQzN6TRFg9

  34. [34]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025. URLhttps://arxiv.org/abs/2503.20314

  35. [35]

    B. Chen, D. Martí Monsó, Y . Du, M. Simchowitz, R. Tedrake, and V . Sitz- mann. Diffusion forcing: Next-token prediction meets full-sequence diffu- sion. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/ 2aee1c4159e48407d68fe16ae8e6e49e-Abstract-Conference.html

  36. [36]

    K. Song, B. Chen, M. Simchowitz, Y . Du, R. Tedrake, and V . Sitzmann. History-guided video diffusion. InProceedings of the 42nd International Conference on Machine Learning, 2025. URLhttps://proceedings.mlr.press/v267/song25b.html

  37. [37]

    Bruce, M

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y . Aytar, S. M. E. Bechtle, F. Behbahani, S. C. Y . Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel. Genie: Generative interactive environments. In...

  38. [38]

    Y . Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

  39. [39]

    J. Pai, L. Achenbach, V . Montesinos, B. Forrai, O. Mees, and E. Nava. mimic-video: Video- action models for generalizable robot control beyond vlas.arXiv preprint 2512.15692, 2025

  40. [40]

    S. L. Li, A. Zhang, B. Chen, H. Matusik, C. Liu, D. Rus, and V . Sitzmann. Control- ling diverse robots by inferring jacobian fields with deep networks.Nature, 643:89–95,

  41. [41]

    URL https://www.nature.com/articles/ s41586-025-09170-0

    doi:10.1038/s41586-025-09170-0. URL https://www.nature.com/articles/ s41586-025-09170-0. 12

  42. [42]

    Teed and J

    Z. Teed and J. Deng. Raft: Recurrent all-pairs field transforms for optical flow, 2020. URL https://arxiv.org/abs/2003.12039

  43. [43]

    Doersch, Y

    C. Doersch, Y . Yang, M. Vecerik, D. Gokay, A. Gupta, Y . Aytar, J. Carreira, and A. Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement, 2023. URL https://arxiv.org/abs/2306.08637

  44. [44]

    A. W. Harley, Y . You, X. Sun, Y . Zheng, N. Raghuraman, Y . Gu, S. Liang, W.-H. Chu, A. Dave, P. Tokmakov, S. You, R. Ambrus, K. Fragkiadaki, and L. J. Guibas. AllTracker: Efficient dense point tracking at high resolution. InICCV, 2025

  45. [45]

    Zhang, F

    D. Zhang, F. Wang, M. Pollefeys, and H. Xu. Megaflow: Zero-shot large displacement optical flow.arXiv preprint arXiv:2603.25739, 2026

  46. [46]

    Khazatsky, K

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

  47. [47]

    T. T. Zhang, D. Pfrommer, C. Pan, N. Matni, and M. Simchowitz. Action chunking and exploratory data collection yield exponential improvements in behavior cloning for continuous control, 2025. URLhttps://arxiv.org/abs/2507.09061

  48. [48]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  49. [49]

    Mandlekar, S

    A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y . Narang, L. Fan, Y . Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In7th Annual Conference on Robot Learning, 2023

  50. [50]

    H. J. T. Suh, T. Pang, T. Zhao, and R. Tedrake. Dexterous contact-rich manipulation via the contact trust region.The International Journal of Robotics Research, 2026. doi: 10.1177/02783649251398875. URL https://journals.sagepub.com/doi/10.1177/ 02783649251398875

  51. [51]

    T. Chen, J. Xu, and P. Agrawal. A system for general in-hand object re-orientation. In Conference on robot learning, pages 297–307. PMLR, 2022

  52. [52]

    Ranftl, A

    R. Ranftl, A. Bochkovskiy, and V . Koltun. Vision transformers for dense prediction, 2021. URL https://arxiv.org/abs/2103.13413

  53. [53]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. Dinov2: Learning robust visual features without super...