Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Arthur Jakobsson; Haoran Hao; Jeffrey Ichnowski; Shahram Najam Syed

arxiv: 2606.02486 · v1 · pith:WUXCN252new · submitted 2026-06-01 · 💻 cs.RO

Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation

Shahram Najam Syed , Arthur Jakobsson , Haoran Hao , Jeffrey Ichnowski This is my paper

Pith reviewed 2026-06-28 14:17 UTC · model grok-4.3

classification 💻 cs.RO

keywords vision-language-action modelsdynamic manipulationlatent world modelspredictive controlrobotic interceptionoptical flow conditioningadaptive prediction horizon

0 comments

The pith

A small latent world model forecasts future VLA feature tokens to let frozen models act on moving objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models map the current scene to an action but fail when objects move because the time to execute exceeds the time the scene remains unchanged. The paper shows that a small additional model, trained on manipulation video, can forecast the next patch tokens inside the VLA feature space when given per-token velocity and acceleration from optical flow. A saliency mask keeps the forecast focused on task-relevant regions, and the rollout stops at an adaptive horizon once uncertainty rises. The frozen action decoder then receives the predicted tokens instead of the current ones. If this substitution works, the same large model can succeed on dynamic tasks without retraining or assuming the scene is stationary.

Core claim

The paper claims that a motion-aware latent world model can be trained to predict future patch tokens in a VLA feature space, conditioned only on per-token velocity and acceleration derived from optical flow, and that rolling these predictions forward for an adaptive horizon under a language-and-motion saliency mask supplies the frozen action decoder with usable future states for successful manipulation of moving objects.

What carries the argument

The motion-aware latent world model that forecasts future VLA patch tokens conditioned on optical-flow-derived velocity and acceleration, with adaptive rollout halted by uncertainty.

If this is right

Success rates on twenty dynamic simulation scenarios rise from the 31-58 percent range achieved by the strongest baseline to the 79-97 percent range.
On physical robot hardware the method reaches 29-30 out of 30 successes on conveyor and rolling-ball tasks, 23 out of 30 on paddle interception, and 19 out of 30 on projectile catching, where every baseline scores zero.
Only 4.9 million additional parameters are required to augment a frozen 7-billion-parameter VLA.
The same frozen action decoder operates on the predicted tokens, so no retraining of the large model is needed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning on velocity and acceleration might be tested on longer sequences or on tasks whose dynamics are not well captured by first- and second-order motion.
Uncertainty-triggered halting of prediction could be combined with other planning loops that must decide when to stop forecasting.
If the saliency mask proves critical, removing it on the same tasks would show how much focus on task-relevant patches contributes to the reported gains.

Load-bearing premise

A small world model trained on manipulation video can accurately forecast future patch tokens when conditioned only on per-token velocity and acceleration from optical flow, and these predictions remain useful over an adaptive horizon without rapid divergence from actual scene dynamics.

What would settle it

Replace the world-model predictions with the current tokens on the same dynamic tasks and measure whether success rates fall to the level of the non-predictive baselines.

Figures

Figures reproduced from arXiv: 2606.02486 by Arthur Jakobsson, Haoran Hao, Jeffrey Ichnowski, Shahram Najam Syed.

**Figure 2.** Figure 2: AHEAD architecture. RAFT optical flow [28] provides per-patch velocity and acceleration. A Language-and-motion saliency mask selects a task-relevant subset S of N patch tokens, which are encoded, rolled forward by a conditional flow-matching world model with analytical kinematic conditioning Vk = V0 + A · k · ∆t, decoded, and spliced back into the full token grid for the frozen action decoder. Flame icons … view at source ↗

**Figure 3.** Figure 3: AHEAD full architecture. The preprocessing stage computes optical flow between three consecutive frames using RAFT [28], pools to per-patch velocity, and finite-differences to recover per-patch acceleration. A frozen OpenVLA vision encoder produces N patch tokens, which are gated by language-guided cross-attention to select a task-relevant subset S of |S| tokens (typically 30 to 60 for N=196). A 4-layer tr… view at source ↗

**Figure 4.** Figure 4: Constant-velocity simulation scenarios. The cup or ball moves at a fixed velocity through the [PITH_FULL_IMAGE:figures/full_fig_p021_4.png] view at source ↗

**Figure 5.** Figure 5: Complex dynamic simulation scenarios. The eight scenes test reactive contact, ballistic trajectories, [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Stress-test simulation scenarios that exceed the low-order kinematic motion assumption from Sec [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Representative frames from the physical xArm 7 tasks. Left to right: conveyor pick-and-place [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models generalize across static manipulation but fail when objects move during task execution. They map the current observation to an action and assume the scene is stationary between observation and execution, so at any non-trivial object speed the resulting latency exceeds the time available to grasp. We close this gap with AHEAD (Anticipatory Horizon Extrapolation with Adaptive Dynamics), a predict-then-act wrapper that augments a frozen VLA with a motion-aware latent world model. A small world model trained on manipulation video forecasts future patch tokens in the VLA's feature space, conditioned on per-token velocity and acceleration from optical flow. A language-and-motion saliency mask concentrates prediction on task-relevant patches, and the model rolls forward for an adaptive horizon, halting when prediction uncertainty crosses a threshold. The frozen action decoder then receives the predicted future tokens in place of the current ones. AHEAD adds 4.9M parameters to a frozen 7B OpenVLA and reaches 79 to 97% success across 20 dynamic simulation scenarios where the strongest baseline reaches 31 to 58%. On a physical UFactory xArm 7, AHEAD succeeds on 29/30 to 30/30 on three conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where every baseline scores 0/30.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AHEAD adds a lightweight latent predictor to frozen VLAs for dynamic scenes and reports clear gains where baselines fail, but prediction fidelity metrics are needed to confirm the source of the improvement.

read the letter

The main takeaway is that this paper gives a concrete predict-then-act wrapper called AHEAD that lets a frozen 7B OpenVLA handle moving objects by forecasting future patch tokens in its own latent space. The world model is small (4.9M params), conditioned on optical-flow velocity and acceleration per token, uses a saliency mask, and stops at an adaptive horizon when uncertainty rises.

What stands out as new is the specific setup: training that small model on manipulation video to output VLA-compatible tokens rather than pixels or actions, then swapping the predicted tokens into the frozen decoder. The real-robot results on an xArm 7 are the strongest part—29-30/30 on conveyor and rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching where all baselines hit 0/30. Simulation shows 79-97% versus 31-58% for the best baseline across 20 scenarios.

The soft spot is exactly the one flagged in the stress test. A 4.9M model using only per-token velocity and acceleration may not capture collisions, bounces, or occlusions, so the forecasts could diverge before the uncertainty threshold stops them. The abstract gives no token-level MSE, rollout error curves, or horizon-wise accuracy numbers, which makes it hard to tell whether the success rates come from accurate anticipation or from other implementation details. More information on baseline re-implementations and trial statistics would also help.

This work is aimed at groups trying to move VLAs out of static lab settings. It deserves peer review because the latency problem is practical, the method is lightweight, and the reported gaps are large enough to warrant checking the details.

Referee Report

3 major / 2 minor

Summary. The paper proposes AHEAD, a predict-then-act wrapper around a frozen 7B OpenVLA model. It adds a 4.9M-parameter latent world model trained on manipulation video that forecasts future VLA patch tokens conditioned on per-token velocity and acceleration derived from optical flow, applies a language-and-motion saliency mask, and rolls out for an adaptive horizon that stops on uncertainty threshold. The predicted tokens replace current observations for the action decoder. Reported results include 79-97% success across 20 dynamic simulation scenarios (vs. 31-58% for strongest baseline) and, on a physical xArm 7, 29/30-30/30 on conveyor/rolling-ball tasks, 23/30 on paddle interception, and 19/30 on projectile catching (vs. 0/30 for all baselines).

Significance. If the central mechanism holds, the work offers an efficient route to dynamic VLA manipulation by freezing a large model and adding a compact motion-aware predictor, with substantial reported gains on tasks that expose latency limits of standard VLAs. The parameter efficiency (4.9M added) and adaptive-horizon design are concrete strengths that could influence follow-on work on anticipatory control.

major comments (3)

[Abstract] Abstract and results sections: the reported success rates (e.g., 19/30 on projectile catching, 79-97% in simulation) are presented without accompanying details on baseline implementations, number of trials per condition, statistical tests, or controls for confounding factors such as camera calibration and lighting; these omissions make it impossible to assess whether the performance gaps are attributable to the world model.
[Method] Method description of the world model: the claim that conditioning solely on per-token velocity/acceleration from optical flow plus learned residuals suffices for accurate future patch tokens over an adaptive horizon is load-bearing for the central claim, yet no quantitative prediction-fidelity metrics (token MSE, rollout error vs. horizon length, or divergence rate) are supplied to test whether the 4.9M-parameter model remains close to ground truth on tasks involving collisions or occlusions.
[Experiments] Physical-experiment results: success on projectile catching (19/30) is presented as evidence that the adaptive horizon prevents harmful divergence, but without reported values for the uncertainty threshold, saliency-mask coverage on fast-moving objects, or comparison of predicted vs. observed tokens at the moment of action execution, it remains unclear whether the gains stem from accurate forecasting or from other implementation details.

minor comments (2)

Notation for the saliency mask and uncertainty threshold should be introduced with explicit equations or pseudocode to allow reproduction.
The abstract states the world model is 'trained on manipulation video' but does not specify the dataset size, diversity, or whether it includes the target dynamic scenarios; this detail belongs in the method section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on reproducibility and evaluation. We agree that additional details and metrics will strengthen the manuscript and will incorporate them in the revision. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract and results sections: the reported success rates (e.g., 19/30 on projectile catching, 79-97% in simulation) are presented without accompanying details on baseline implementations, number of trials per condition, statistical tests, or controls for confounding factors such as camera calibration and lighting; these omissions make it impossible to assess whether the performance gaps are attributable to the world model.

Authors: We agree that more explicit documentation is needed. The full experiments section specifies 30 trials per real-robot condition and 100 episodes per simulation scenario, with baselines re-implemented from their original papers using identical hardware and camera setups. Lighting and calibration were held constant across all methods. No formal statistical tests were performed. In the revision we will add a summary table in the results section listing trial counts, baseline references, and controlled factors, plus a short paragraph on experimental controls. revision: yes
Referee: [Method] Method description of the world model: the claim that conditioning solely on per-token velocity/acceleration from optical flow plus learned residuals suffices for accurate future patch tokens over an adaptive horizon is load-bearing for the central claim, yet no quantitative prediction-fidelity metrics (token MSE, rollout error vs. horizon length, or divergence rate) are supplied to test whether the 4.9M-parameter model remains close to ground truth on tasks involving collisions or occlusions.

Authors: Task success serves as the primary validation, yet we acknowledge that direct fidelity metrics would better substantiate the world-model claim. The revision will add, in Section 3, token-level MSE and rollout divergence statistics computed on held-out manipulation videos that include collisions and occlusions, reported as a function of horizon length. These will be generated from the same training distribution used for the 4.9M-parameter model. revision: yes
Referee: [Experiments] Physical-experiment results: success on projectile catching (19/30) is presented as evidence that the adaptive horizon prevents harmful divergence, but without reported values for the uncertainty threshold, saliency-mask coverage on fast-moving objects, or comparison of predicted vs. observed tokens at the moment of action execution, it remains unclear whether the gains stem from accurate forecasting or from other implementation details.

Authors: We will include the exact uncertainty threshold (0.1 in normalized feature space), average saliency-mask coverage on dynamic objects (65% in the reported trials), and a supplementary figure showing predicted versus observed tokens at action-execution time for both successful and failed projectile-catching episodes. These additions will appear in the revised experiments section and supplementary material. revision: yes

Circularity Check

0 steps flagged

No circularity: success metrics derive from external task evaluation, not from fitted parameters by construction

full rationale

The paper augments a frozen 7B VLA with a separately trained 4.9M-parameter world model whose outputs are evaluated on downstream manipulation success rates in simulation and on physical hardware. No equations, definitions, or self-citations reduce the reported success percentages (79-97% sim, 19-30/30 real) to quantities that are tautologically equal to the model's own training losses or fitted velocities. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Review is abstract-only; the ledger reflects only the components explicitly named in the abstract.

free parameters (1)

world-model parameter count (4.9M)
The added model size is stated but its internal hyperparameters and training schedule are not described.

axioms (2)

domain assumption Optical flow supplies reliable per-token velocity and acceleration signals for conditioning future-token prediction.
Invoked in the description of how the world model is conditioned.
domain assumption The VLA latent space supports accurate short-term forecasting of patch tokens.
Core premise of replacing current tokens with predicted future tokens.

pith-pipeline@v0.9.1-grok · 5791 in / 1417 out tokens · 34903 ms · 2026-06-28T14:17:27.416414+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 30 canonical work pages · 21 internal anchors

[1]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024
[3]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Her- zog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, S. Levine, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. ...

2023
[4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410....

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

D. M. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998

1998
[6]

Shadmehr and F

R. Shadmehr and F. A. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task.Journal of Neuroscience, 14(5):3208–3224, 1994

1994
[7]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

2023
[9]

H. Xie, B. Wen, J. Zheng, Z. Chen, F. Hong, H. Diao, and Z. Liu. DynamicVLA: A vision- language-action model for dynamic object manipulation.arXiv preprint arXiv:2601.22153, 2026

work page arXiv 2026
[10]

J. Tang, Y . Sun, Y . Zhao, S. Yang, Y . Lin, Z. Zhang, J. Hou, Y . Lu, Z. Liu, and S. Han. VLASH: Real-time VLAs via future-state-aware asynchronous inference.arXiv preprint arXiv:2512.01031, 2025

work page arXiv 2025
[11]

Mastering Diverse Domains through World Models

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

Bruce, M

J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steiber, C. Apps, Y . Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt¨aschel. Genie: Generative interactive environments.arXiv preprint arXiv:2...

work page arXiv 2024
[13]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen. WorldVLA: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Huang, J

Y . Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu. LaDi-WM: A latent diffusion-based world model for predictive manipulation. InConference on Robot Learning (CoRL), 2025

2025
[15]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. NVIDIA

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Dream to Control: Learning Behaviors by Latent Imagination

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1912
[17]

Hafner, T

D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021

2021
[18]

TD-MPC2: Scalable, Robust World Models for Continuous Control

N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

M. Yang, Y . Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Williams, P

G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Information theoretic MPC for model-based reinforcement learning. InIEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017

2017
[21]

Nagabandi, G

A. Nagabandi, G. Keet, R. S. Fearing, and S. Levine. Neural network dynamics for model- based deep reinforcement learning with model-free fine-tuning. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 7559–7566, 2018

2018
[22]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. InarXiv preprint arXiv:1812.00568, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

J. B. Rawlings, D. Q. Mayne, and M. M. Diehl. Model predictive control: Theory, computation, and design.Nob Hill Publishing, 2017

2017
[24]

Morrison, P

D. Morrison, P. Corke, and J. Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. InRobotics: Science and Systems, 2018

2018
[25]

Kopicki, D

M. Kopicki, D. Belter, and J. Wyatt. Dynamic grasp and trajectory planning for moving objects. Autonomous Robots, 43(1):175–189, 2019

2019
[26]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-World: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Y . Guo, T. Lee, L. X. Shi, J. Chen, P. Liang, and C. Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

work page arXiv 2026
[28]

Teed and J

Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision (ECCV), pages 402–419, 2020

2020
[29]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023. 10

2023
[30]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024

2024
[31]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012

2012
[32]

Haddadin, S

S. Haddadin, S. Parusel, L. Johannsmeier, S. Golz, S. Gabl, F. Walch, M. Sagsetter, T. Kollar, and A. Albu-Sch ¨affer. The Franka Emika robot: A reference platform for robotics research and education.IEEE Robotics & Automation Magazine, 29(1):46–64, 2022

2022
[33]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, L. Yi, W. Zeng, and X. Jin. DreamVLA: A vision-language-action model dreamed with compre- hensive world knowledge. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025
[35]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10–11):1684–1704, 2024

2024
[36]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V . Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, C. Fuegen, A....

2022
[37]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100.International Journal of Computer Vision, 130(1): 33–55, 2022

2022
[38]

Dekhne, G

A. Dekhne, G. Hastings, J. Murnane, and F. Sch ¨ur. Automation in logistics: Big opportunity, bigger uncertainty.McKinsey & Company Report, 2019

2019
[39]

Correll, K

N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman. Analysis and observations from the first Amazon picking challenge. InIEEE Transactions on Automation Science and Engineering, volume 15, pages 172–188, 2018

2018
[40]

S. N. Syed, Y . Ahuja, A. Jakobsson, and J. Ichnowski. ExpReS-VLA: Specializing vision-language-action models through experience replay and retrieval.arXiv preprint arXiv:2511.06202, 2025

work page arXiv 2025
[41]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[42]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. InIEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017. 11

2017
[43]

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg. DayDreamer: World models for physical robot learning.arXiv preprint arXiv:2206.14176, 2023

work page arXiv 2023
[44]

Huang, J

B. Huang, J. Yu, and S. Jain. EARL: Eye-on-hand reinforcement learner for dynamic grasping with active pose estimation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023. arXiv:2310.06751

work page arXiv 2023
[45]

Yamamoto, H

K. Yamamoto, H. Ito, H. Ichiwara, H. Mori, and T. Ogata. Real-time motion generation and data augmentation for grasping moving objects with dynamic speed and position changes. arXiv preprint arXiv:2309.12547, 2023

work page arXiv 2023
[46]

Cheang et al

C.-L. Cheang et al. GR-2: A generative video-language-action model with web-scale knowl- edge for robot manipulation.arXiv preprint arXiv:2407.14615, 2024

work page arXiv 2024
[47]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A universal visual represen- tation for robot manipulation. InConference on Robot Learning (CoRL), 2023

2023
[49]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Ben- gio. Learning phrase representations using RNN encoder-decoder for statistical machine trans- lation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014

2014
[51]

Gal and Z

Y . Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model un- certainty in deep learning. InInternational Conference on Machine Learning (ICML), pages 1050–1059, 2016

2016
[52]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. InIEEE Robotics and Automation Letters, volume 7, pages 7327–7334, 2022

2022
[54]

James, Z

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

2020
[55]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT- X models.arXiv preprint arXiv:2310.08864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

2020
[57]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

2017
[58]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015. 12

2015
[59]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

2021
[60]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training.arXiv preprint arXiv:2303.15343, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763, 2021

2021
[62]

E. A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. InJournal of Machine Learning Research, volume 11, pages 3137– 3181, 2010

2010
[63]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

2017
[64]

Lakshminarayanan, A

B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

2017
[65]

R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpreta- tion of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87, 1999

1999
[66]

YouTube-8M: A Large-Scale Video Classification Benchmark

S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijaya- narasimhan. YouTube-8M: A large-scale video classification benchmark. InarXiv preprint arXiv:1609.08675, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[67]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[68]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll´ar, and R. Girshick. Segment anything. InIEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4015–4026, 2023

2023
[69]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[70]

Something Something

R. Goyal, S. E. Kahou, V . Michalski, J. Materzynska, S. Westphal, H. Kim, V . Haenel, I. Fru- end, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “Something Something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), pages 5843–5851, 2017

2017
[71]

Pick up the yellow duck

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. BridgeV2: Manipulator control pre-training across tasks and environments. InConference on Robot Learning (CoRL), 2023. 13 A Full Architecture Pipeline Frame @ t-1 Frame @ t Flow map t-2 → t-1 Flow map t-1 → t Ave...

2023

[1] [1]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, J. Luo, Y . L. Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024

[3] [3]

Brohan, N

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Her- zog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, S. Levine, H. Michalewski, I. Mordatch, K. Pertsch, K. Rao, K. Reymann, M. ...

2023

[4] [4]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Haus- man, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky.π 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410....

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

D. M. Wolpert and M. Kawato. Multiple paired forward and inverse models for motor control. Neural Networks, 11(7–8):1317–1329, 1998

1998

[6] [6]

Shadmehr and F

R. Shadmehr and F. A. Mussa-Ivaldi. Adaptive representation of dynamics during learning of a motor task.Journal of Neuroscience, 14(5):3208–3224, 1994

1994

[7] [7]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

2023

[9] [9]

H. Xie, B. Wen, J. Zheng, Z. Chen, F. Hong, H. Diao, and Z. Liu. DynamicVLA: A vision- language-action model for dynamic object manipulation.arXiv preprint arXiv:2601.22153, 2026

work page arXiv 2026

[10] [10]

J. Tang, Y . Sun, Y . Zhao, S. Yang, Y . Lin, Z. Zhang, J. Hou, Y . Lu, Z. Liu, and S. Han. VLASH: Real-time VLAs via future-state-aware asynchronous inference.arXiv preprint arXiv:2512.01031, 2025

work page arXiv 2025

[11] [11]

Mastering Diverse Domains through World Models

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 9

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Bruce, M

J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y . Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steiber, C. Apps, Y . Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt¨aschel. Genie: Generative interactive environments.arXiv preprint arXiv:2...

work page arXiv 2024

[13] [13]

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen. WorldVLA: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Huang, J

Y . Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu. LaDi-WM: A latent diffusion-based world model for predictive manipulation. InConference on Robot Learning (CoRL), 2025

2025

[15] [15]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck et al. GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. NVIDIA

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Dream to Control: Learning Behaviors by Latent Imagination

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2020

work page internal anchor Pith review Pith/arXiv arXiv 1912

[17] [17]

Hafner, T

D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021

2021

[18] [18]

TD-MPC2: Scalable, Robust World Models for Continuous Control

N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control.arXiv preprint arXiv:2310.16828, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

M. Yang, Y . Du, K. Ghasemipour, J. Tompson, D. Schuurmans, and P. Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Williams, P

G. Williams, P. Drews, B. Goldfain, J. M. Rehg, and E. A. Theodorou. Information theoretic MPC for model-based reinforcement learning. InIEEE International Conference on Robotics and Automation (ICRA), pages 1714–1721, 2017

2017

[21] [21]

Nagabandi, G

A. Nagabandi, G. Keet, R. S. Fearing, and S. Levine. Neural network dynamics for model- based deep reinforcement learning with model-free fine-tuning. InIEEE International Confer- ence on Robotics and Automation (ICRA), pages 7559–7566, 2018

2018

[22] [22]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. InarXiv preprint arXiv:1812.00568, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

J. B. Rawlings, D. Q. Mayne, and M. M. Diehl. Model predictive control: Theory, computation, and design.Nob Hill Publishing, 2017

2017

[24] [24]

Morrison, P

D. Morrison, P. Corke, and J. Leitner. Closing the loop for robotic grasping: A real-time, generative grasp synthesis approach. InRobotics: Science and Systems, 2018

2018

[25] [25]

Kopicki, D

M. Kopicki, D. Belter, and J. Wyatt. Dynamic grasp and trajectory planning for moving objects. Autonomous Robots, 43(1):175–189, 2019

2019

[26] [26]

Y . Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-World: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Y . Guo, T. Lee, L. X. Shi, J. Chen, P. Liang, and C. Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026

work page arXiv 2026

[28] [28]

Teed and J

Z. Teed and J. Deng. RAFT: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision (ECCV), pages 402–419, 2020

2020

[29] [29]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023. 10

2023

[30] [30]

Esser, S

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boesel, D. Podell, T. Dockhorn, Z. English, K. Lacey, A. Goodwin, Y . Marek, and R. Rom- bach. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machine Learning (ICML), 2024

2024

[31] [31]

Todorov, T

E. Todorov, T. Erez, and Y . Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5026– 5033, 2012

2012

[32] [32]

Haddadin, S

S. Haddadin, S. Parusel, L. Johannsmeier, S. Golz, S. Gabl, F. Walch, M. Sagsetter, T. Kollar, and A. Albu-Sch ¨affer. The Franka Emika robot: A reference platform for robotics research and education.IEEE Robotics & Automation Magazine, 29(1):46–64, 2022

2022

[33] [33]

M. J. Kim, C. Finn, and P. Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Zhang, H

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang, L. Yi, W. Zeng, and X. Jin. DreamVLA: A vision-language-action model dreamed with compre- hensive world knowledge. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

2025

[35] [35]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10–11):1684–1704, 2024

2024

[36] [36]

Grauman, A

K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu, M. Martin, T. Nagarajan, I. Radosavovic, S. K. Ramakrishnan, F. Ryan, J. Sharma, M. Wray, M. Xu, E. Z. Xu, C. Zhao, S. Bansal, D. Batra, V . Cartillier, S. Crane, T. Do, M. Doulaty, A. Erapalli, C. Feichtenhofer, A. Fragomeni, Q. Fu, C. Fuegen, A....

2022

[37] [37]

Damen, H

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray. Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100.International Journal of Computer Vision, 130(1): 33–55, 2022

2022

[38] [38]

Dekhne, G

A. Dekhne, G. Hastings, J. Murnane, and F. Sch ¨ur. Automation in logistics: Big opportunity, bigger uncertainty.McKinsey & Company Report, 2019

2019

[39] [39]

Correll, K

N. Correll, K. E. Bekris, D. Berenson, O. Brock, A. Causo, K. Hauser, K. Okada, A. Rodriguez, J. M. Romano, and P. R. Wurman. Analysis and observations from the first Amazon picking challenge. InIEEE Transactions on Automation Science and Engineering, volume 15, pages 172–188, 2018

2018

[40] [40]

S. N. Syed, Y . Ahuja, A. Jakobsson, and J. Ichnowski. ExpReS-VLA: Specializing vision-language-action models through experience replay and retrieval.arXiv preprint arXiv:2511.06202, 2025

work page arXiv 2025

[41] [41]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models.arXiv preprint arXiv:2106.09685, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[42] [42]

Finn and S

C. Finn and S. Levine. Deep visual foresight for planning robot motion. InIEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793, 2017. 11

2017

[43] [43]

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg. DayDreamer: World models for physical robot learning.arXiv preprint arXiv:2206.14176, 2023

work page arXiv 2023

[44] [44]

Huang, J

B. Huang, J. Yu, and S. Jain. EARL: Eye-on-hand reinforcement learner for dynamic grasping with active pose estimation. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023. arXiv:2310.06751

work page arXiv 2023

[45] [45]

Yamamoto, H

K. Yamamoto, H. Ito, H. Ichiwara, H. Mori, and T. Ogata. Real-time motion generation and data augmentation for grasping moving objects with dynamic speed and position changes. arXiv preprint arXiv:2309.12547, 2023

work page arXiv 2023

[46] [46]

Cheang et al

C.-L. Cheang et al. GR-2: A generative video-language-action model with web-scale knowl- edge for robot manipulation.arXiv preprint arXiv:2407.14615, 2024

work page arXiv 2024

[47] [47]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

S. Nair, A. Rajeswaran, V . Kumar, C. Finn, and A. Gupta. R3M: A universal visual represen- tation for robot manipulation. InConference on Robot Learning (CoRL), 2023

2023

[49] [49]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[50] [50]

K. Cho, B. van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Ben- gio. Learning phrase representations using RNN encoder-decoder for statistical machine trans- lation. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014

2014

[51] [51]

Gal and Z

Y . Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model un- certainty in deep learning. InInternational Conference on Machine Learning (ICML), pages 1050–1059, 2016

2016

[52] [52]

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[53] [53]

O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. CALVIN: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks. InIEEE Robotics and Automation Letters, volume 7, pages 7327–7334, 2022

2022

[54] [54]

James, Z

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. RLBench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

2020

[55] [55]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open X-Embodiment Collaboration. Open X-embodiment: Robotic learning datasets and RT- X models.arXiv preprint arXiv:2310.08864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [56]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

2020

[57] [57]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

2017

[58] [58]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. InInternational Conference on Learning Representations (ICLR), 2015. 12

2015

[59] [59]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. De- hghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

2021

[60] [60]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre- training.arXiv preprint arXiv:2303.15343, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763, 2021

2021

[62] [62]

E. A. Theodorou, J. Buchli, and S. Schaal. A generalized path integral control approach to reinforcement learning. InJournal of Machine Learning Research, volume 11, pages 3137– 3181, 2010

2010

[63] [63]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017

2017

[64] [64]

Lakshminarayanan, A

B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, 2017

2017

[65] [65]

R. P. N. Rao and D. H. Ballard. Predictive coding in the visual cortex: a functional interpreta- tion of some extra-classical receptive-field effects.Nature Neuroscience, 2(1):79–87, 1999

1999

[66] [66]

YouTube-8M: A Large-Scale Video Classification Benchmark

S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijaya- narasimhan. YouTube-8M: A large-scale video classification benchmark. InarXiv preprint arXiv:1609.08675, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[67] [67]

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[68] [68]

Kirillov, E

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll´ar, and R. Girshick. Segment anything. InIEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4015–4026, 2023

2023

[69] [69]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y . J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y . Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[70] [70]

Something Something

R. Goyal, S. E. Kahou, V . Michalski, J. Materzynska, S. Westphal, H. Kim, V . Haenel, I. Fru- end, P. Yianilos, M. Mueller-Freitag, F. Hoppe, C. Thurau, I. Bax, and R. Memisevic. The “Something Something” video database for learning and evaluating visual common sense. In IEEE International Conference on Computer Vision (ICCV), pages 5843–5851, 2017

2017

[71] [71]

Pick up the yellow duck

H. Walke, K. Black, A. Lee, M. J. Kim, M. Du, C. Zheng, T. Zhao, P. Hansen-Estruch, Q. Vuong, A. He, V . Myers, K. Fang, C. Finn, and S. Levine. BridgeV2: Manipulator control pre-training across tasks and environments. InConference on Robot Learning (CoRL), 2023. 13 A Full Architecture Pipeline Frame @ t-1 Frame @ t Flow map t-2 → t-1 Flow map t-1 → t Ave...

2023