pith. sign in

arxiv: 2606.11187 · v1 · pith:5FBXTHPZnew · submitted 2026-06-09 · 💻 cs.CV

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

Pith reviewed 2026-06-27 13:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-chunk predictionvideo generationworld action modelsautoregressive modelingcausal modelingtraining convergenceinference acceleration
0
0 comments X

The pith

Next Forcing uses multi-chunk prediction modules to speed up training and inference in autoregressive video world models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Next Forcing as a way to improve autoregressive video generation by adding auxiliary modules that predict several future video chunks at once. These modules create a causal chain that uses predictions of near-future chunks to help with farther ones, sending dense supervision signals back to the main model about temporal dynamics. This leads to quicker convergence during training and better final accuracy, especially when operating at high frame rates. The modules can also stay active at inference time to generate the next chunk alongside the current one, cutting the time needed in half. Such improvements support building more effective causal world models that can simulate real-world physics for tasks like robot control.

Core claim

Next Forcing augments the main autoregressive video model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons, forming a causal chain across prediction depths where intermediate features are fused to predict future dynamics and provide dense multi-scale temporal supervision back to the main model.

What carries the argument

Multi-chunk prediction (MCP) modules that denoise multiple future video chunks in a causal chain by fusing features from the main model to supply multi-scale temporal supervision.

If this is right

  • At 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps.
  • It demonstrates 2.3x faster convergence.
  • It establishes new state-of-the-art results on the RoboTwin benchmark with 94.1% on Clean and 93.5% on Random.
  • It achieves 2x inference acceleration by retaining the MCP modules.
  • It shows significant improvements on PhyWorld and over 50% FVD reduction on general video pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The causal chain structure could allow scaling to predict even more distant chunks for longer-term video forecasting.
  • Parallel prediction during inference may enable deployment in real-time robotic planning systems.
  • Improved physical adherence on PhyWorld suggests the method could be tested on other physics-based simulation benchmarks.

Load-bearing premise

The lightweight auxiliary MCP modules can be trained to extract and fuse useful intermediate features from the main model across multiple prediction depths without introducing training instability or degrading the quality of the primary single-chunk predictions.

What would settle it

Running the training at 50 fps and observing that the multi-chunk prediction modules do not accelerate convergence or improve accuracy compared to the baseline single-chunk approach would falsify the main benefit.

Figures

Figures reproduced from arXiv: 2606.11187 by Gangwei Xu, Jiaming Zhou, Qihang Zhang, Xing Zhu, Xin Yang, Yinghao Xu, Yujun Shen.

Figure 1
Figure 1. Figure 1: Task success rate (%) on RoboTwin across training steps. Next Forcing converges faster and reaches higher final accuracy than LingBot-VA at both 12 and 50 fps. The advantage is most pronounced at 50 fps: at 5k steps Next Forcing already outperforms LingBot-VA by 29.7 points on Random, and matches its 45k-step accuracy at only 20k steps, a 2.3× training speedup. chunk to the current one, with only small res… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Next Forcing. The main model denoises the current chunk, while chained MCP modules predict future chunks (next1 , next2 , . . .) using features from the main model, providing dense temporal supervision during training and enabling parallel chunk prediction at inference. • We provide systematic ablations on the design choices that enable multi-chunk prediction to transfer from discrete tokens to… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison on PhyWorld. We show 5 frames (start, 3 intermediate, end) from ground truth (top), Next Forcing (middle), and Baseline (bottom). Blue boxes highlight regions where the baseline deviates from the ground-truth physical trajectory, while Next Forcing generates more physically consistent dynamics. 5.2 Main Results 5.2.1 Comparison with State-of-the-Art [PITH_FULL_IMAGE:figures/full_fig… view at source ↗
Figure 4
Figure 4. Figure 4: FVD (↓) on general video pretraining across training steps. Test Set 1 contains human activity videos, while Test Set 2 focuses on camera-driven scene dynamics. Next Forcing consistently achieves substantially lower FVD than LingBot-VA on both test sets throughout training. 5.2.4 Generality on Video Pretraining To further validate that Next Forcing generalizes beyond robot-specific data, we conduct pretrai… view at source ↗
Figure 5
Figure 5. Figure 5: Attention mask for main model and MCP modules. Only video tokens are shown for clarity (action tokens omitted). Under teacher forcing, the sequence consists of noisy tokens (current chunk being denoised) and clean tokens (ground-truth context). Noisy tokens attend to all causally preceding clean tokens and to noisy tokens within the same chunk; clean tokens follow a standard causal pattern; clean tokens ca… view at source ↗
read the original abstract

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next$^1$, next$^2$, next$^3$ chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Next Forcing, a multi-chunk prediction (MCP) framework for autoregressive video generation in World Action Models. It augments the main model with lightweight auxiliary MCP modules that predict multiple future video chunks at different temporal horizons in a causal chain, using fused features from multiple layers of the main model to provide dense multi-scale temporal supervision. This is claimed to accelerate training convergence, improve accuracy at high frame rates (e.g., 93.1% relative improvement over LingBot-VA at 50 fps after 5k steps, 2.3x faster convergence), achieve new SOTA on RoboTwin (94.1/93.5% Clean/Random), improve on PhyWorld, reduce FVD by over 50% in general video pretraining, and enable 2x faster inference by retaining MCP modules for parallel chunk prediction.

Significance. If validated, this work could have substantial impact on the development of efficient causal world models for video. The extension of multi-token prediction to video chunks addresses a clear limitation in current autoregressive approaches regarding future dynamics supervision. The inference acceleration is a notable practical benefit. The empirical claims, if supported by rigorous experiments, would represent a meaningful advance in the field. However, the current presentation lacks the detailed experimental validation needed to fully gauge the significance.

major comments (2)
  1. [Abstract] Abstract: The central empirical claims, such as the 93.1% relative improvement and new SOTA results on RoboTwin, are presented without reference to specific experimental sections, tables, or figures detailing the setup, baselines, or variance, which is load-bearing for assessing the validity of the reported gains.
  2. [MCP framework] MCP framework: The assumption that the auxiliary MCP modules can be trained to extract and fuse useful intermediate features without introducing training instability or degrading primary predictions is central to the method but lacks supporting analysis or ablations in the manuscript.
minor comments (1)
  1. [Abstract] Clarify the exact meaning of 'next$^1$, next$^2$, next$^3$ chunks' with a brief definition or reference to a figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of Next Forcing's potential impact on causal world models. We address each major comment below and commit to revisions that strengthen the experimental presentation without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claims, such as the 93.1% relative improvement and new SOTA results on RoboTwin, are presented without reference to specific experimental sections, tables, or figures detailing the setup, baselines, or variance, which is load-bearing for assessing the validity of the reported gains.

    Authors: We agree that explicit cross-references would improve traceability. In the revised manuscript we will insert concise pointers (e.g., “see Section 4.2 and Table 2”) into the abstract for the 93.1 % relative improvement, RoboTwin SOTA numbers, and convergence claims. We will also add a short reproducibility paragraph in Section 4 noting that all reported numbers are from single runs with fixed seeds; multi-seed variance will be reported if additional compute is obtained before camera-ready. revision: yes

  2. Referee: [MCP framework] MCP framework: The assumption that the auxiliary MCP modules can be trained to extract and fuse useful intermediate features without introducing training instability or degrading primary predictions is central to the method but lacks supporting analysis or ablations in the manuscript.

    Authors: The manuscript currently demonstrates the net benefit through end-to-end metrics, but dedicated stability and fusion ablations are indeed absent. We will add a new subsection (4.4) containing: (i) training-loss curves with and without MCP modules, (ii) an ablation on feature-fusion depth and layer selection, and (iii) a direct comparison of primary-prediction quality (FVD, accuracy) when MCP modules are present versus removed. These additions will be included in the revised submission. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an MCP framework with auxiliary modules for multi-horizon video chunk prediction, drawing inspiration from multi-token prediction in LLMs. All reported gains (e.g., 93.1% relative improvement, 2.3x faster convergence, SOTA on RoboTwin, >50% FVD reduction) are presented as empirical outcomes on external benchmarks (RoboTwin, PhyWorld) rather than quantities defined by fitted parameters or self-citations. No equations appear that equate a claimed prediction to its own training objective by construction, and no load-bearing self-citations or uniqueness theorems are invoked. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations, training objectives, or architectural diagrams; therefore no free parameters, axioms, or invented entities beyond the high-level mention of MCP modules can be identified.

pith-pipeline@v0.9.1-grok · 5879 in / 1248 out tokens · 30896 ms · 2026-06-27T13:29:45.302831+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

72 extracted references · 31 linked inside Pith

  1. [1]

    Diffusion for world modeling: Visual details matter in atari

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  2. [2]

    Video PreTraining (VPT): Learning to act by watching unlabeled online videos

    Bowen Baker, Ilge Akkaya, Peter Zhokhov, Joost Huizinga, Jie Tang, Adrien Ecoffet, Brandon Houghton, Raul Sampedro, and Jeff Clune. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  3. [3]

    Being-h0.7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

    BeingBeyond Team. Being-h0.7: A latent world-action model from egocentric videos.arXiv preprint arXiv:2605.00078, 2026

  4. [4]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2015

  5. [5]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doer- sch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

  6. [6]

    Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  7. [7]

    π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. π0: A visi...

  8. [8]

    RT-2: Vision-language-action models transfer web knowledge to robotic control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choro- manski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnik...

  9. [9]

    Jake Bruce, Michael D. Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, Yusuf Aytar, Sarah Bechtle, Feryal Behbahani, Stephanie Chan, Nicolas Heess, Lucy Gonzalez, Simon Osindero, Sherjil Ozair, Scott Reed, Jingwei Zhang, Konrad Zolna, Jeff Clune, Nando de Freitas, Satinder...

  10. [10]

    Lee, Deming Chen, and Tri Dao

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning (ICML), 2024

  11. [11]

    Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025. 11

  12. [12]

    GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

  13. [13]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion

    Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  14. [14]

    Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

    Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu, Pengfei Wan, and Xiang Bai. Out of sight but not out of mind: Hybrid memory for dynamic video world models.arXiv preprint arXiv:2603.25716, 2026

  15. [15]

    Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, et al. Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.arXiv preprint arXiv:2506.18088, 2025

  16. [16]

    Moto: Latent motion token as the bridging language for robot manipulation

    Yi Chen, Yuying Ge, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for robot manipulation. InProceedings of the IEEE/CVF international conference on computer vision, 2025

  17. [17]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

  18. [18]

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

    DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024

  19. [19]

    Tenenbaum, Dale Schuurmans, and Pieter Abbeel

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InInternational Conference on Machin...

  21. [21]

    Wichmann

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  22. [22]

    Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

    Gemini Robotics Team. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  23. [23]

    Better & faster large language models via multi-token prediction

    Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, and Gabriel Synnaeve. Better & faster large language models via multi-token prediction. InInternational Conference on Machine Learning (ICML), 2024

  24. [24]

    Recurrent world models facilitate policy evolution

    David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems (NeurIPS), 2018

  25. [25]

    Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

  26. [26]

    GAIA-1: A generative world model for autonomous driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023

  27. [27]

    Video prediction policy: A generalist robot policy with predictive visual representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. InInternational Conference on Machine Learning (ICML), 2025. 12

  28. [28]

    Self forcing: Bridging the train-test gap in autoregressive video diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  29. [29]

    Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

  30. [30]

    Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

  31. [31]

    How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

    Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

  32. [32]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  33. [33]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. InConference on Robot Learning (C...

  34. [34]

    Tenenbaum

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. InInternational Conference on Learning Representations (ICLR), 2024

  35. [35]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. InInternational Conference on Machine Learning (ICML), 2023

  36. [36]

    Causal world modeling for robot control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026

  37. [37]

    Dreamitate: Real-world visuomotor policy learning via video generation.arXiv preprint arXiv:2406.16862, 2024

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation.arXiv preprint arXiv:2406.16862, 2024

  38. [38]

    Mixture-of- transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of- transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024

  39. [39]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations (ICLR), 2023

  40. [40]

    Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time.arXiv preprint arXiv:2509.25161, 2025

  41. [41]

    RDT-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. RDT-1b: a diffusion foundation model for bimanual manipulation. In International Conference on Learning Representations (ICLR), 2025

  42. [42]

    Being-h0: Vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597, 2025. 13

  43. [43]

    GR00T N1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

    NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, Joel Jang, Zhenyu Jiang, Jan Kautz, Kaushil Kundalia, Lawrence Lao, Zhiqi Li, Zongyu Lin, Kevin Lin, Guilin Liu, Edith Llontop, Loic Magne, Ajay Mandlekar, Avnish Narayan, Soroush Nasiriany, Scott Reed, You Liang ...

  44. [44]

    mimic-video: Video-action models for generalizable robot control beyond VLAs.arXiv preprint arXiv:2512.15692, 2025

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond VLAs.arXiv preprint arXiv:2512.15692, 2025

  45. [45]

    π0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

    Physical Intelligence. π0.7: a steerable generalist robotic foundation model with emergent capabilities.arXiv preprint arXiv:2604.15483, 2026

  46. [46]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

  47. [47]

    SpatialVLA: Exploring spatial representations for visual-language-action model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. SpatialVLA: Exploring spatial representations for visual-language-action model. InRobotics: Science and Systems (RSS), 2025

  48. [48]

    Sequence level training with recurrent neural networks

    Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. InInternational Conference on Learning Representa- tions (ICLR), 2016

  49. [49]

    Next block prediction: Video generation via semi-autoregressive modeling.arXiv preprint arXiv:2502.07737, 2025

    Shuhuai Ren, Shuming Ma, Xu Sun, and Furu Wei. Next block prediction: Video generation via semi-autoregressive modeling.arXiv preprint arXiv:2502.07737, 2025

  50. [50]

    GAIA-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

    Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, and Gianluca Corrado. GAIA-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523, 2025

  51. [51]

    Learning to act without actions

    Dominik Schmidt and Minqi Jiang. Learning to act without actions. InInternational Conference on Learning Representations (ICLR), 2024

  52. [52]

    Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963, 2025

    Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators.arXiv preprint arXiv:2512.06963, 2025

  53. [53]

    RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

  54. [54]

    Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

    Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation.arXiv preprint arXiv:2412.15109, 2024

  55. [55]

    Towards accurate generative models of video: A new metric & challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  56. [56]

    Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

    Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

  57. [57]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. 14

  58. [58]

    TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation.arXiv preprint arXiv:2409.12514, 2024

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation.arXiv preprint arXiv:2409.12514, 2024

  59. [59]

    Dual- stream diffusion for world-model augmented vision-language-action model.arXiv preprint arXiv:2510.27607, 2025

    John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual- stream diffusion for world-model augmented vision-language-action model.arXiv preprint arXiv:2510.27607, 2025

  60. [60]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations (ICLR), 2024

  61. [61]

    Magma: A foundation model for multimodal AI agents

    Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, and Jianfeng Gao. Magma: A foundation model for multimodal AI agents. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  62. [62]

    World action models are zero-shot policies.arXiv preprint arXiv:2602.15922, 2026

    Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

  63. [63]

    Latent action pretraining from videos

    Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, and Minjoon Seo. Latent action pretraining from videos. In International Conference on Learning Representations (ICLR), 2025

  64. [64]

    From slow bidirectional to fast autoregressive video diffusion models

    Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22963–22974, 2025

  65. [65]

    Fast-W AM: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-W AM: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666, 2026

  66. [66]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Tsung-Yi Lin, Gordon Wetzstein, Ming-Yu Liu, and Donglai Xiang. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  67. [67]

    Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems (RSS), 2023

  68. [68]

    3D-VLA: A 3D vision-language-action generative world model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. InInternational Conference on Machine Learning (ICML), 2024

  69. [69]

    X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

    Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, Ya-Qin Zhang, Jiangmiao Pang, Jingjing Liu, Tai Wang, and Xianyuan Zhan. X-VLA: Soft-prompted transformer as scalable cross-embodiment vision-language-action model.arXiv preprint arXiv:2510.10274, 2025

  70. [70]

    Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies.arXiv preprint arXiv:2412.10345, 2024

  71. [71]

    Exploring the limits of vision-language-action manipulation in cross-task generalization.Advances in Neural Information Processing Systems (NeurIPS), 38:139899– 139927, 2026

    Jiaming Zhou, Ke Ye, Jiayi Liu, Teli Ma, Zifan Wang, Ronghe Qiu, Kun-Yu Lin, Zhilin Zhao, and Junwei Liang. Exploring the limits of vision-language-action manipulation in cross-task generalization.Advances in Neural Information Processing Systems (NeurIPS), 38:139899– 139927, 2026. 15

  72. [72]

    cX# nX!nX

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. InRobotics: Science and Systems (RSS), 2025. 16 Supplementary Material This appendix provides additional details that complement the main paper. Appendix A describes...