arxiv: 2606.01955 · v1 · pith:COEFE7A3new · submitted 2026-06-01 · 💻 cs.RO · cs.CV

WALL-WM: Carving World Action Modeling at the Event Joints

Pith reviewed 2026-06-28 14:12 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords world action models · event-grounded pretraining · vision-language-action · robotic generalization · semantic events · pretraining infrastructure · variable-length execution

The pith

WALL-WM uses semantic action events as the atomic unit for pretraining to fix granularity mismatch in world action models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing world action models force language, vision, and actions into fixed-length chunks, creating a mismatch with how each modality operates. WALL-WM instead uses semantically coherent action events as the basic learning unit, supported by event captions and balanced sampling. This enables scalable pretraining and two inference modes: one for variable-length event execution and one for standard chunks. If successful, it leads to broad generalization across language, scenes, and tasks in real-world evaluations. The method includes infrastructure for large-scale training.

Core claim

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs initialize from foundation models and optimize fixed-length action chunks, which creates a granularity mismatch across language, vision, and action. WALL-WM organizes supervision and data around semantic events, with event-level captions and cluster-balanced sampling. It supports an event mode, which consumes next-event descriptions and produces variable-length execution chunks, and a unified mode, which uses a VLM with Staircase Decoding for fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based pretraining, it achieves state-of-the-art performance in large-scale real-world generalization evaluation.

What carries the argument

semantically coherent action events as the atomic unit of learning in event-grounded VLA pretraining

If this is right

  • Supports variable-length execution chunks conditioned on next-event descriptions
  • Enables conventional fixed-length chunk inference via unified mode with staircase decoding
  • Provides a practical scale-up recipe for general-purpose WAMs using large-scale pretraining infrastructure
  • Achieves state-of-the-art performance across language, scenes, and tasks in real-world generalization evaluation
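To make the contrast between the first two points concrete, here is a minimal Python sketch that segments one trajectory either into fixed-length chunks or at event boundaries. The function names, the toy trajectory length, and the boundary indices are hypothetical illustrations. The abstract does not say how event boundaries are found, so the sketch simply takes them as given.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) range of control steps


def fixed_chunks(num_steps: int, chunk_len: int) -> List[Span]:
    """Chunk-centric view: consecutive windows of chunk_len steps.

    The last window may be shorter if num_steps is not a multiple of chunk_len.
    """
    return [(s, min(s + chunk_len, num_steps)) for s in range(0, num_steps, chunk_len)]


def event_chunks(num_steps: int, boundaries: List[int]) -> List[Span]:
    """Event-centric view: cut the trajectory at (assumed, pre-annotated) boundaries.

    Boundaries outside (0, num_steps) and duplicates are ignored.
    """
    inner = sorted(b for b in set(boundaries) if 0 < b < num_steps)
    cuts = [0] + inner + [num_steps]
    return list(zip(cuts[:-1], cuts[1:]))


if __name__ == "__main__":
    # Toy 10-step trajectory with hypothetical event boundaries at steps 3 and 7.
    print(fixed_chunks(10, 4))       # windows of equal length, blind to events
    print(event_chunks(10, [3, 7]))  # variable-length spans, one per event
```

With these toy inputs the fixed windows are (0, 4), (4, 8), (8, 10), so the first window straddles the event boundary at step 3, while the event spans (0, 3), (3, 7), (7, 10) each cover exactly one event. That straddling is a simple picture of the granularity mismatch the paper describes.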

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The event organization could extend to other sequential prediction tasks where semantic boundaries align better with goals than fixed time windows.
  • Cluster-balanced sampling might help mitigate long-tail issues in behavior datasets beyond robotics.
  • The dual inference modes suggest potential for adaptive systems that choose execution granularity based on task demands.
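The long-tail point about cluster-balanced sampling can be illustrated with a small Python sketch. It assumes the simplest possible reading of the term: pick a cluster uniformly at random, then pick an event uniformly within that cluster. The abstract does not specify the paper's actual sampling scheme or how clusters are formed, so the function name, cluster labels, and toy data below are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def cluster_balanced_sample(
    items: Sequence[Tuple[Hashable, str]],
    k: int,
    rng: random.Random,
) -> List[str]:
    """Draw k events so that every cluster is equally likely, regardless of size.

    items holds (cluster_id, event) pairs; cluster assignments are assumed given.
    """
    by_cluster = defaultdict(list)
    for cluster_id, event in items:
        by_cluster[cluster_id].append(event)
    cluster_ids = list(by_cluster)  # insertion order, so results are reproducible
    samples = []
    for _ in range(k):
        chosen_cluster = rng.choice(cluster_ids)             # uniform over clusters
        samples.append(rng.choice(by_cluster[chosen_cluster]))  # uniform within it
    return samples


if __name__ == "__main__":
    # Toy long-tailed dataset: 99 "pick" events and a single "pour" event.
    data = [("pick", f"pick_{i}") for i in range(99)] + [("pour", "pour_0")]
    drawn = cluster_balanced_sample(data, 2000, random.Random(0))
    rare_share = sum(e.startswith("pour") for e in drawn) / len(drawn)
    print(f"share of rare cluster: {rare_share:.2f}")
```

Under uniform sampling over events the rare cluster would appear about 1% of the time; under this scheme it appears about half the time. The trade-off is that events in small clusters are repeated often, which is one reason a real pipeline might use a softer reweighting instead.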

Load-bearing premise

That semantically coherent action events can be reliably identified, captioned, and used as the atomic unit of supervision at scale without introducing selection bias or requiring post-hoc adjustments that affect the reported generalization gains.

What would settle it

Running the same large-scale real-world generalization evaluation but with supervision based on fixed-length segments instead of event captions, and observing no performance gains over chunk-centric baselines.

Original abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces WALL-WM, a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit. It pairs this with event-level captions, cluster-balanced sampling, two inference modes (event mode for variable-length chunks and unified mode with Staircase Decoding for fixed-length inference), and Muon-optimizer-based pretraining infrastructure. The central claim is that this resolves granularity mismatches and enables broad generalization across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

Significance. If the experimental claims hold, the event-grounded formulation could provide a scalable alternative to fixed-chunk VLA training by better aligning supervision with semantic structure, potentially improving generalization in robotics applications. The dual inference modes and data ecosystem represent a methodological contribution that, if validated with rigorous evidence, would be of interest to the robotics and multimodal learning communities.

major comments (1)
  1. [Abstract] The claim that WALL-WM 'achieves state-of-the-art performance in large-scale real-world generalization evaluation' is not supported by any metrics, baselines, dataset descriptions, ablation results, or evaluation protocol details. Without this evidence, the central generalization claim cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and the recommendation for major revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that WALL-WM 'achieves state-of-the-art performance in large-scale real-world generalization evaluation' is not supported by any metrics, baselines, dataset descriptions, ablation results, or evaluation protocol details. Without this evidence, the central generalization claim cannot be assessed.

    Authors: We agree that an abstract claim of this strength should be backed by visible evidence. The manuscript contains a dedicated Experiments section that reports the metrics from the large-scale real-world generalization evaluation, the baselines used, dataset descriptions, ablation results on event grounding and sampling, and the full evaluation protocol. To make this support explicit at the abstract level, we will revise the abstract to include the key quantitative results and a concise reference to the evaluation setup. These changes will appear in the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain absent from text

Full rationale

The supplied abstract and manuscript description contain no equations, parameter-fitting procedures, uniqueness theorems, or derivation steps of any kind. All claims are high-level architectural and empirical (event-grounded pretraining, Staircase Decoding, cluster-balanced sampling, generalization results) with no reduction of a 'prediction' to fitted inputs or self-citation chains. The central modeling shift is presented as a design choice rather than a derived necessity, leaving the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted or audited.

pith-pipeline@v0.9.1-grok · 5895 in / 1063 out tokens · 20487 ms · 2026-06-28T14:12:53.172496+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

    cs.CV 2026-06 unverdicted novelty 7.0

    X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

  2. MemoryWAM: Efficient World Action Modeling with Persistent Memory

    cs.RO 2026-06 unverdicted novelty 4.0

    MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.

  3. World Action Models: A Survey

    cs.RO 2026-06 unverdicted novelty 3.0

    A survey that clarifies boundaries and organizes World Action Models by generation requirements and predictive substrates, identifying a trend toward generating less of the future.

Reference graph

Works this paper leans on

96 extracted references · 49 linked inside Pith · cited by 3 Pith papers

  1. [1]

    Claude 3.7 sonnet.https://www.anthropic.com/news/claude-3-7-sonnet, 2025

    Anthropic. Claude 3.7 sonnet.https://www.anthropic.com/news/claude-3-7-sonnet, 2025

  2. [2]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

  3. [3]

    Latent reasoning vla: Latent thinking and prediction for vision-language-action models.arXiv preprint arXiv:2602.01166, 2026

    Shuanghao Bai, Jing Lyu, Wanqi Zhou, Zhe Li, Dakai Wang, Lei Xing, Xiaoguang Zhao, Pengwei Wang, Zhongyuan Wang, Cheng Chi, et al. Latent reasoning vla: Latent thinking and prediction for vision-language-action models.arXiv preprint arXiv:2602.01166, 2026

  4. [4]

    V-JEPA: Latent video prediction for visual representation learning, 2024.https://openreview.net/ forum?id=WFYbBOEOtv

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-JEPA: Latent video prediction for visual representation learning, 2024.https://openreview.net/ forum?id=WFYbBOEOtv

  5. [5]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283, 2024

  6. [6]

    Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation

    Homanga Bharadhwaj, Roozbeh Mottaghi, Abhinav Gupta, and Shubham Tulsiani. Track2act: Predicting point tracks from internet videos enables generalizable robot manipulation. InEuropean Conference on Computer Vision, pages 306–324. Springer, 2024

  7. [7]

    Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

    HongzheBi, HengkaiTan, ShenghaoXie, ZeyuanWang, ShuheHuang, HaitianLiu, RuowenZhao, YaoFeng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030, 2025

  8. [8]

    Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  9. [9]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.𝜋0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

  10. [10]

    Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025

  11. [11]

    Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  12. [12]

    Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502, 2025

  13. [13]

    Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  14. [14]

    Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

    Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, et al. Gr-3 technical report.arXiv preprint arXiv:2507.15493, 2025

  15. [15]

    Moto: Latent motion token as the bridging language for learning robot manipulation from videos

    Yi Chen, Yuying Ge, Weiliang Tang, Yizhuo Li, Yixiao Ge, Mingyu Ding, Ying Shan, and Xihui Liu. Moto: Latent motion token as the bridging language for learning robot manipulation from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19752–19763, 2025

  16. [16]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Brian Ichter, and Avinash Shah. Diffusion policy: Visuomotor policy learning via action diffusion. InRobotics: Science and Systems (RSS), 2023

  17. [17]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  18. [18]

    Rescalingegocentricvision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

    DimaDamen, HazelDoughty, GiovanniMariaFarinella, AntoninoFurnari, etal. Rescalingegocentricvision: Collection, pipeline and challenges for epic-kitchens-100.International Journal of Computer Vision, 130:33–55, 2022

  19. [19]

    Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

    Danny Driess, Jost Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, et al. Knowledge insulating vision-language-action models: Train fast, run fast, generalize better.Advances in Neural Information Processing Systems, 38:102867–102888, 2026

  20. [20]

    Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36: 9156–9172, 2023

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36: 9156–9172, 2023

  21. [21]

    Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, brian ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In The Twelfth International Conference on Learning Representations, 2024.https://openreview.net/forum?id= 9pKtcJcMP3

  22. [22]

    Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898, 2025. 41

  23. [23]

    Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

  24. [24]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Sharan Narang. Think before you speak: Training language models with pause tokens. InInternational Conference on Learning Representations, 2024

  25. [25]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18995–19012, 2022

  26. [26]

    Prediction with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37: 112386–112410, 2024

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37: 112386–112410, 2024

  27. [27]

    Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

  28. [28]

    Learning latent dynamics for planning from pixels

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

  29. [29]

    Mastering diverse domains through world models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  30. [30]

    Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

    Shibo Hao, Sainbayar Gu, Haotian Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Training large language models to reason in a continuous latent space.arXiv preprint arXiv:2412.06769, 2024

  31. [31]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

  32. [32]

    Video prediction policy: A generalist robot policy with predictive visual representations

    Yucheng Hu et al. Video prediction policy: A generalist robot policy with predictive visual representations. In Proceedings of the 42nd International Conference on Machine Learning, 2025

  33. [33]

    Ladi-wm: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

    Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, and Kai Xu. Ladi-wm: A latent diffusion-based world model for predictive manipulation.arXiv preprint arXiv:2505.11528, 2025

  34. [35]

    How much 3d do video foundation models encode?arXiv preprint arXiv:2512.19949, 2025

    Zixuan Huang, Xiang Li, Zhaoyang Lv, and James M Rehg. How much 3d do video foundation models encode?arXiv preprint arXiv:2512.19949, 2025

  35. [36]

    arXiv preprint arXiv:2504.16054, 2025

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.𝑝𝑖0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  36. [37]

    Dreamgen: Unlocking generalization in robot learning through video world models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705, 2025

  37. [38]

    Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573, 2025

    Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, and Lianhui Qin. Ladir: Latent diffusion enhances llms for text reasoning.arXiv preprint arXiv:2510.04573, 2025

  38. [39]

    Beyond mode elicitation: Diversity- preserving reinforcement learning via latent diffusion reasoner.arXiv preprint arXiv:2602.01705, 2026

    Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, and Lianhui Qin. Beyond mode elicitation: Diversity- preserving reinforcement learning via latent diffusion reasoner.arXiv preprint arXiv:2602.01705, 2026

  39. [40]

    Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

  40. [41]

    Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

  41. [42]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163, 2026

  42. [43]

    Tenenbaum

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. InProceedings of the International Conference on Learning Representations (ICLR), 2024

  43. [44]

    Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, et al. Causal world modeling for robot control.arXiv preprint arXiv:2601.21998, 2026

  44. [45]

    Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation.arXiv preprint arXiv:2411.19650, 2024

  45. [46]

    Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025. 42

  46. [47]

    Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635, 2025

  47. [48]

    Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

    Yaron Lipman, Ricky Chen, Heli Ben-Hamu, Max Nickel, and Manzil Zaheer Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  48. [49]

    Last0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

    Zhuoyang Liu, Jiaming Liu, Hao Chen, Jiale Yu, Ziyu Guo, Chengkai Hou, Chenyang Gu, Xiangju Mi, Renrui Zhang, Kun Wu, Zhengping Che, Jian Tang, Pheng-Ann Heng, and Shanghang Zhang. Last0: Latent spatio-temporal chain-of-thought for robotic vision-language-action model.arXiv preprint arXiv:2601.05248, 2026

  49. [50]

    Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

    Jinghui Lu, Jiayi Guan, Zhijian Huang, Jinlong Li, Guang Li, Lingdong Kong, Yingyan Li, Han Wang, Shaoqing Xu, Yuechen Luo, et al. Onevl: One-step latent reasoning and planning with vision-language explanation.arXiv preprint arXiv:2604.18486, 2026

  50. [51]

    Solving new tasks by adapting internet video knowledge.arXiv preprint arXiv:2504.15369, 2025

    Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge.arXiv preprint arXiv:2504.15369, 2025

  51. [52]

    Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

    Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, Hangjun Ye, Zhi-Xin Yang, and Fuxi Wen. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

  52. [53]

    Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

    Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. Lda-1b: Scaling latent dynamics action model via universal embodied data ingestion.arXiv preprint arXiv:2602.12215, 2026

  53. [54]

    Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. Leworldmodel: Stable end-to-end joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312, 2026

  54. [55]

    Openvid-1m: A large-scale high-quality dataset for text-to-video generation

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation. InInternational Conference on Learning Representations, 2025

  55. [56]

    Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

    Nikolay Nikolov, Giuliano Albanese, Sombit Dey, Aleksandar Yanev, Luc Van Gool, Jan-Nico Zaech, and Danda Pani Paudel. Spear-1: Scaling beyond robot demonstrations via 3d understanding.arXiv preprint arXiv:2511.17411, 2025

  56. [57]

    Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

    OpenAI. Introducing gpt-5.https://openai.com/index/introducing-gpt-5-for-developers/, 2025

  57. [58]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  58. [59]

    mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692, 2025

  59. [60]

    Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  60. [61]

    Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model.arXiv preprint arXiv:2501.15830, 2025

  61. [62]

    Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

  62. [63]

    Common objectsin3d: Large-scalelearningandevaluationofreal-life3dcategoryreconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objectsin3d: Large-scalelearningandevaluationofreal-life3dcategoryreconstruction. InProceedingsoftheIEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021

  63. [64]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022

  64. [65]

    Worldarena: Benchmarking embodied video generation models as world simulators. arXiv preprint arXiv:2602.08971, 2026

    Chenguo Shang et al. Worldarena: Benchmarking embodied video generation models as world simulators. arXiv preprint arXiv:2602.08971, 2026

  65. [66]

    Generic event boundary detection: A benchmark for event segmentation

    Mike Zheng Shou, Stan Weixian Lei, Weiyao Wang, Deepti Ghadiyaram, and Matt Feiszli. Generic event boundary detection: A benchmark for event segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8075–8084, 2021

  66. [67]

    Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  67. [68]

    Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025

    GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, et al. Gigabrain-0: A world model-powered vision-language-action model. arXiv preprint arXiv:2510.19430, 2025

  68. [69]

    Motubrain: An advanced world action model for robot control. arXiv preprint arXiv:2604.27792, 2026

    MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, et al. Motubrain: An advanced world action model for robot control. arXiv preprint arXiv:2604.27792, 2026

  69. [70]

    Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  70. [71]

    Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. https://qwen

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026. https://qwen.ai/blog?id=qwen3.5

  71. [72]

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  72. [73]

    Xrzero-g0: Pushing the frontier of dexterous robotic manipulation with interfaces, quality and ratios. arXiv preprint arXiv:2604.13001, 2026

    Junming Wang, Teng Pu, Wingmun Fung, Jindong Wang, Shanchang Wang, Yuan Deng, Shuyuan Wang, Ziwei Liu, Kunhao Pan, Ping Yang, et al. Xrzero-g0: Pushing the frontier of dexterous robotic manipulation with interfaces, quality and ratios. arXiv preprint arXiv:2604.13001, 2026

  73. [74]

    Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

    Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control. arXiv preprint arXiv:2502.05855, 2025

  74. [75]

    Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

    Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Zhibin Tang, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, et al. Tinyvla: Towards fast, data-efficient vision-language-action models for robotic manipulation. IEEE Robotics and Automation Letters, 2025

  75. [76]

    Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025

    John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025

  76. [77]

    A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic vla foundation model. arXiv preprint arXiv:2601.18692, 2026

  77. [78]

    Generation models know space: Unleashing implicit 3d priors for scene understanding. arXiv preprint arXiv:2603.19235, 2026

    Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, and Xiang Bai. Generation models know space: Unleashing implicit 3d priors for scene understanding. arXiv preprint arXiv:2603.19235, 2026

  78. [79]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5036–5045, 2022

  79. [80]

    Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

    An Yang, Baosong Yang, Beichen Zhang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. https://arxiv.org/abs/2412.15115

  80. [81]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pages 83048–83077, 2025
