
arxiv: 2606.20781 · v1 · pith:JQ77BRSBnew · submitted 2026-06-18 · 💻 cs.RO · cs.CV

World Action Models: A Survey

Pith reviewed 2026-06-26 17:09 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords World Action Models · embodied AI · predictive action · video generation models · vision-language-action · world models · robotics

The pith

World Action Models are predictive-action methods that trade future generation richness for lower compute and label costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys World Action Models as embodied systems that forecast futures to guide actions. It clarifies how they differ from video generators and language-based policies. Two complementary views organize the methods: one on what must be generated and one on the predictive components used. This leads to the observation that the field favors designs generating only what control requires. The framework helps readers track trade-offs in efficiency and performance across the literature.

Core claim

WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires.

What carries the argument

A two-view taxonomy. The first view classifies methods by what they are required to generate (rendered futures, latent futures, or video-generation-free action reasoning). The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime.
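As a reading aid, the two views can be sketched as a small record type. This is a minimal illustration, not the authors' schema: the three generation categories come from the abstract, while the class names, field names, and the `parse_generation` helper are hypothetical. The second-view axes stay free-form strings because this page does not list their allowed values.

```python
from dataclasses import dataclass
from enum import Enum


class GenerationRequirement(Enum):
    # The three categories named in the survey's first view.
    RENDERED_FUTURES = "rendered futures"
    LATENT_FUTURES = "latent futures"
    VIDEO_GENERATION_FREE = "video-generation-free action reasoning"


@dataclass(frozen=True)
class WamEntry:
    # Hypothetical field names; the survey names these axes, but their
    # allowed values are not given here, so they are plain strings.
    name: str
    generation: GenerationRequirement
    predictive_substrate: str
    backbone: str
    action_coupling: str
    deployment_regime: str


def parse_generation(label: str) -> GenerationRequirement:
    """Map a free-text label to a first-view category, or raise ValueError."""
    return GenerationRequirement(label.strip().lower())


example = WamEntry(
    name="placeholder-method",  # illustrative only, not a real paper
    generation=parse_generation("Latent futures"),
    predictive_substrate="unspecified",
    backbone="unspecified",
    action_coupling="unspecified",
    deployment_regime="unspecified",
)
```

A label outside the three categories raises `ValueError`, which makes an unplaceable method visible immediately rather than silently filed under a default.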

If this is right

  • Design choices in WAMs explicitly balance representational richness with costs in compute, memory, latency, and action labels.
  • The field is shifting to methods that generate less of the future while preserving control needs.
  • Properties such as interactability, causality, persistence, physical plausibility, and generalization can be discussed uniformly.
  • Data, evaluation, and open challenges receive a consistent treatment under the common account.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid models that combine language backbones with minimal generation components may address latency issues in real-world robotics.
  • Testing the taxonomy on emerging methods could highlight areas where current classifications need refinement.
  • Links to model-based reinforcement learning might be strengthened by focusing on the predictive substrate view.

Load-bearing premise

The proposed two-view taxonomy accurately captures all relevant distinctions and design patterns across the cited literature without significant omissions or misclassifications.

What would settle it

A published WAM that cannot be placed into any category of the generation requirements view or the predictive substrate decomposition, or results demonstrating that full video future generation consistently outperforms reduced-generation approaches on control benchmarks.

Original abstract

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on World Action Models (WAMs), defined as embodied predictive-action models that make forecasts of the future available to action. It first clarifies boundaries with broad world models, video generation models, action-grounded video world models, and Vision-Language-Action policies. It then organizes the literature via two complementary views—one on generation requirements (rendered futures, latent futures, video-generation-free action reasoning) and one decomposing methods by predictive substrate, backbone, action coupling, and deployment regime—followed by discussion of interactability, causality, persistence, physical plausibility, generalization, data, evaluation, and open challenges. The central synthesis is that WAMs trade representational richness against compute, memory, latency, and action-label cost, with the field moving toward methods that generate less of the future while preserving control requirements.

Significance. If the proposed taxonomy proves a useful and stable organizing lens, the survey would provide a timely common account for a rapidly expanding area at the intersection of robotics, video generation, and control. The explicit identification of design trade-offs and the accompanying survey homepage constitute concrete contributions that could help structure future work.

major comments (1)
  1. [Taxonomy and synthesis sections] The central claim that a consistent design pattern emerges across the cited works depends on the two-view taxonomy being applied without significant omissions or misclassifications. The manuscript would be strengthened by an explicit table (or appendix) that maps every cited paper to the generation-requirement and predictive-substrate categories; without it, the observed pattern cannot be independently verified.
minor comments (2)
  1. [Introduction / boundary clarification] The abstract states that the survey 'first clarifies these boundaries, then organizes existing works'; the corresponding sections would benefit from a short summary table contrasting WAMs with the four neighboring concepts to make the boundary clarifications immediately scannable.
  2. [Taxonomy decomposition] Several technical terms (e.g., 'predictive substrate', 'action coupling', 'deployment regime') are introduced in the second view; a one-paragraph glossary or footnote definitions on first use would improve readability for readers outside the immediate subfield.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the survey. We agree that an explicit mapping table would strengthen the verifiability of the taxonomy and the central synthesis. We will add this as an appendix in the revised manuscript.

Point-by-point responses
  1. Referee: [Taxonomy and synthesis sections] The central claim that a consistent design pattern emerges across the cited works depends on the two-view taxonomy being applied without significant omissions or misclassifications. The manuscript would be strengthened by an explicit table (or appendix) that maps every cited paper to the generation-requirement and predictive-substrate categories; without it, the observed pattern cannot be independently verified.

    Authors: We agree with this assessment. While the two-view taxonomy is applied consistently in the text, an explicit tabular mapping of all cited works would indeed allow independent verification of the classifications and the observed design pattern. In the revision we will add a new appendix containing a comprehensive table that assigns each referenced paper to its generation-requirement category (rendered futures, latent futures, or video-generation-free action reasoning) and its predictive-substrate category, together with the backbone, action-coupling, and deployment-regime attributes used in the second view. This addition directly addresses the concern without altering the manuscript's core claims.

    Revision: yes
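The verification the referee asks for amounts to a coverage audit over the proposed mapping table. The sketch below shows one way such a check could look, assuming the table is available as a plain dictionary. The function name `audit_mapping` and the keys `ref_a` to `ref_d` are placeholders, not classifications of any real paper.

```python
from collections import Counter

# Category labels taken from the survey's first view.
CATEGORIES = (
    "rendered futures",
    "latent futures",
    "video-generation-free action reasoning",
)


def audit_mapping(cited, mapping):
    """Return (unplaced, counts).

    unplaced: cited keys with no valid category in the mapping.
    counts:   per-category tally over the cited keys that are placed.
    """
    unplaced = [k for k in cited if mapping.get(k) not in CATEGORIES]
    counts = Counter(mapping[k] for k in cited if mapping.get(k) in CATEGORIES)
    return unplaced, counts


# Placeholder keys only; these do not describe real papers.
cited = ["ref_a", "ref_b", "ref_c", "ref_d"]
mapping = {
    "ref_a": "rendered futures",
    "ref_b": "latent futures",
    "ref_c": "latent futures",
}
unplaced, counts = audit_mapping(cited, mapping)
```

An empty `unplaced` list would indicate that every cited work has been assigned a category. The tally alone does not confirm that the assignments are correct, which still requires reading the papers.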

Circularity Check

0 steps flagged

No significant circularity: survey with no derivations

Full rationale

This is a literature survey paper whose contribution is a two-view taxonomy (generation requirements and predictive-substrate decomposition) applied to existing works. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. The observed design pattern is reported as an empirical synthesis across cited literature rather than an internally derived result. Self-citations, if present, are not load-bearing for any central claim. This matches the default expectation of no circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the central contribution rests on the authors' reading and categorization of existing literature rather than new axioms, parameters, or entities.

pith-pipeline@v0.9.1-grok · 5784 in / 1062 out tokens · 23671 ms · 2026-06-26T17:09:33.085712+00:00 · methodology


Reference graph

Works this paper leans on

215 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1]

    Egocentric-10k, 2025.https://huggingface.co/datasets/builddotai/Egocentric-10K

    Build AI. Egocentric-10k, 2025.https://huggingface.co/datasets/builddotai/Egocentric-10K

  2. [2]

    Feedback world model enables precise guidance of diffusion policy, 2026.https: //arxiv.org/abs/2605.15705

    Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, and Jianfei Yang. Feedback world model enables precise guidance of diffusion policy, 2026.https: //arxiv.org/abs/2605.15705

  3. [3]

    Self-supervised learning from images with a joint-embedding predictive architecture, 2023

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023. https://arxiv.org/abs/2301.08243

  4. [4]

    V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025.https://arxiv.org/abs/2506.09985

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xi...

  5. [5]

    RoboArena: Distributed real-world evaluation of generalist robot policies

    Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, Jonathan Tremblay, Kanav Arora, Kirsty Ellis, Luca Macesanu, Marcel Torne Villasevil, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Jie Wang, William Reger, Xingfang Yuan, Xuning Yang, Abhishek Gupta, Dinesh Jayar...

  6. [6]

    Mc-jepa: A joint-embedding predictive architecture for self- supervised learning of motion and content features, 2023.https://arxiv.org/abs/2307.12698

    Adrien Bardes, Jean Ponce, and Yann LeCun. Mc-jepa: A joint-embedding predictive architecture for self- supervised learning of motion and content features, 2023.https://arxiv.org/abs/2307.12698

  7. [7]

    Revisiting feature prediction for learning visual representations from video, 2024.https: //arxiv.org/abs/2404.08471

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024.https: //arxiv.org/abs/2404.08471

  8. [8]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023.https://arxiv.org/abs/2309.01918

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023.https://arxiv.org/abs/2309.01918

  9. [9]

    Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation

    Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. InCoRL Workshop on Cross-Embodiment, 2024. https://arxiv.org/abs/2409. 16283

  10. [10]

    Motus: A unified latent action world model, 2025.https://arxiv.org/abs/2512.13030

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model, 2025.https://arxiv.org/abs/2512.13030

  11. [11]

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

  12. [12]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.https://arxiv.org/abs/2307.15818

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

  13. [13]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Bohan Hou, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Rynnvla-002: A unified vision-language-action and world model.CoRR, abs/2511.17502, 2025. doi: 10.48550/ARXIV.2511.17502. https://doi.org/10.48550/arXiv. 2511.17502

  14. [14]

    Worldvla: Towards autoregressive action world model, 2025

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model, 2025. https://arxiv.org/abs/2506.21539

  15. [15]

    Indego: A dataset of industrial scenarios and collaborative work for egocentric assistants, 2025

    Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, and Jörg Krüger. Indego: A dataset of industrial scenarios and collaborative work for egocentric assistants, 2025. https://arxiv.org/abs/2511.19684

  16. [16]

    Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.https://arxiv.org/abs/2410.06158

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.https://arxiv.org/abs/2410.06158

  17. [17]

    Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025.https://arxiv.org/abs/2512.15840. 43

  18. [18]

    Transdreamer: Reinforcement learning with transformer world models, 2022.https://arxiv.org/abs/2202.09481

    Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. Transdreamer: Reinforcement learning with transformer world models, 2022.https://arxiv.org/abs/2202.09481

  19. [19]

    Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process, 2026

    Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process, 2026. https://arxiv.org/abs/2511.01718

  20. [20]

    Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation, 2025.https://arxiv.org/abs/2506.18088

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable d...

  21. [21]

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, An...

  22. [22]

    RoboMME: Benchmarking and understanding memory for robotic generalist policies

    Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, and Joyce Chai. RoboMME: Benchmarking and understanding memory for robotic generalist policies. In International Conference on Machine Learning, 2026.https://arxiv.org/abs/2603.04639

  23. [23]

    Scaling egocentric vision: The epic-kitchens dataset, 2018.https://arxiv.org/abs/1804.02748

    Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, 44 Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset, 2018.https://arxiv.org/abs/1804.02748

  24. [24]

    Robonet: Large-scale multi-robot learning, 2020.https://arxiv.org/ abs/1910.11215

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning, 2020.https://arxiv.org/ abs/1910.11215

  25. [25]

    Emerging properties in unified multimodal pretraining, 2025

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. https://arxiv.org/abs/2505.14683

  26. [26]

    Autoregressive video generation without vector quantization, 2024.https://arxiv.org/abs/ 2412.14169

    Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization, 2024.https://arxiv.org/abs/ 2412.14169

  27. [27]

    Dexworldmodel: Causal latent world modeling towards automated learning of embodied tasks, 2026.https://arxiv.org/abs/2604.16484

    Yueci Deng, Guiliang Liu, and Kui Jia. Dexworldmodel: Causal latent world modeling towards automated learning of embodied tasks, 2026.https://arxiv.org/abs/2604.16484

  28. [28]

    Dream2flow: Bridging video generation and open-world manipulation with 3d object flow.arXiv preprint arXiv:2512.24766, 2025

    Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow.arXiv preprint arXiv:2512.24766, 2025. https://arxiv.org/abs/2512.24766

  29. [29]

    Tenenbaum, Dale Schuurmans, and Pieter Abbeel

    Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems, 2023.https://arxiv.org/abs/2302.00111

  30. [30]

    Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. InInternational Conference on Learning Representations, 2024.https://arxiv.org/abs/2310.10625

  31. [31]

    Worldscore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InIEEE/CVF International Conference on Computer Vision, 2025. https://arxiv.org/ abs/2504.00983

  32. [32]

    Aim: Intent-aware unified world action modeling with spatial value maps, 2026.https://arxiv.org/abs/2604.11135

    Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, and Jiayu Chen. Aim: Intent-aware unified world action modeling with spatial value maps, 2026.https://arxiv.org/abs/2604.11135

  33. [33]

    Dreamavoid: Critical-phase test-time dreaming to avoid failures in vla policies, 2026.https://arxiv.org/abs/ 2605.11750

    Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, and Hengshuang Zhao. Dreamavoid: Critical-phase test-time dreaming to avoid failures in vla policies, 2026.https://arxiv.org/abs/ 2605.11750

  34. [34]

    LIBERO-Plus: In-depth robustness analysis of vision- language-action models.arXiv preprint arXiv:2510.13626, 2025.https://arxiv.org/abs/2510.13626

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. LIBERO-Plus: In-depth robustness analysis of vision- language-action models.arXiv preprint arXiv:2510.13626, 2025.https://arxiv.org/abs/2510.13626

  35. [35]

    A-jepa: Joint-embedding predictive architecture can listen, 2024.https://arxiv.org/abs/2311.15830

    Zhengcong Fei, Mingyuan Fan, and Junshi Huang. A-jepa: Joint-embedding predictive architecture can listen, 2024.https://arxiv.org/abs/2311.15830

  36. [36]

    Harmowam: Harmonizing generalizable and precise manipulation via adaptive world action models, 2026.https://arxiv.org/abs/2605.10942

    Qiuxuan Feng, Jiale Yu, Jiaming Liu, Yueru Jia, Zhuangzhe Wu, Hao Chen, Zezhong Qian, Shuo Gu, Peng Jia, Siwei Ma, and Shanghang Zhang. Harmowam: Harmonizing generalizable and precise manipulation via adaptive world action models, 2026.https://arxiv.org/abs/2605.10942

  37. [37]

    Vidar: Embodied video diffusion model for generalist manipulation, 2025.https://arxiv.org/abs/2507.12898

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation, 2025.https://arxiv.org/abs/2507.12898

  38. [38]

    Barry, Kris Kitani, and George Konidaris

    Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, and George Konidaris. Novaplan: Zero-shot long-horizon manipulation via closed-loop video language planning, 2026. https://arxiv.org/abs/2602.20119

  39. [39]

    Dreamsim: Learning new dimensions of human visual similarity using synthetic data

    Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. InAdvances in Neural Information Processing Systems, 2023

  40. [40]

    Adaworld: Learning adaptable world models with latent actions, 2025.https://arxiv.org/abs/2503.18938

    Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions, 2025.https://arxiv.org/abs/2503.18938

  41. [41]

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie 45 Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abb...

  42. [42]

    Vampo: Policy optimization for improving visual dynamics in video action models, 2026.https://arxiv.org/abs/2603.19370

    Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, Han Zhao, Shangke Lyu, Zhaoxin Fan, Haoang Li, Ran Cheng, Cheng Chi, Huibin Ge, Yaozhi Luo, and Donglin Wang. Vampo: Policy optimization for improving visual dynamics in video action models, 2026.https://arxiv.org/abs/2603.19370

  43. [43]

    RoboVerse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning, 2025.https://arxiv.org/abs/2504.18904

    Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

  44. [44]

    World models for learning dexterous hand-object interactions from human videos, 2025.https://arxiv.org/abs/2512.13644

    Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models for learning dexterous hand-object interactions from human videos, 2025.https://arxiv.org/abs/2512.13644

  45. [45]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Ku- mar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

  46. [46]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moha...

  47. [47]

    Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

    Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

  48. [48]

    Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026

    Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, and Yanwei Fu. Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026. https://arxiv.org/ abs/2602.10717

  49. [49]

    Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856

    Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, and Juho Kannala. Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856. 46

  50. [50]

    Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

    Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

  51. [51]

    Unified 4d world action modeling from video priors with asynchronous denoising, 2026

    Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, and Huaping Liu. Unified 4d world action modeling from video priors with asynchronous denoising, 2026. https://arxiv.org/abs/2604.26694

  52. [52]

    Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411

    Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411. 18179

  53. [53]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024.https://arxiv.org/abs/2307.04725

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024.https://arxiv.org/abs/2307.04725

  54. [54]

    AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2024

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2024

  55. [55]

    Learning latent dynamics for planning from pixels, 2019.https://arxiv.org/abs/1811.04551

    Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels, 2019.https://arxiv.org/abs/1811.04551

  56. [56]

    Mastering atari with discrete world models, 2022.https://arxiv.org/abs/2010.02193

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models, 2022.https://arxiv.org/abs/2010.02193

  57. [57]

    Mastering diverse domains through world models, 2023.https://arxiv.org/abs/2301.04104

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023.https://arxiv.org/abs/2301.04104

  58. [58]

    Training agents inside of scalable world models, 2025

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. https://arxiv.org/abs/2509.24527

  59. [59]

    Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation, 2025.https://arxiv.org/abs/2506.06677

    Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation, 2025.https://arxiv.org/abs/2506.06677

  60. [60]

    Yoon, Mouli Sivapurapu, and Jian Zhang

    Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026.https://arxiv.org/abs/2505.11709

  61. [61]

    World model for robot learning: A comprehensive survey, 2026. https://arxiv.org/abs/2605.00080

    Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang. World model for robot learning: A comprehensive survey, 2026. https://arxiv.org/abs/2605.00080

  62. [62]

    Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. https://arxiv.org/abs/2412.14803

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. https://arxiv.org/abs/2412.14803

  63. [63]

    BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation, 2026. https://arxiv.org/abs/2602.09849

    Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, and Jianyu Chen. BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation, 2026. https://arxiv.org/abs/2602.09849

  64. [64]

    Dreaming the unseen: World model-regularized diffusion policy for out-of-distribution robustness, 2026. https://arxiv.org/abs/2603.21017

    Ziou Hu, Xiangtong Yao, Yuan Meng, Zhenshan Bing, and Alois Knoll. Dreaming the unseen: World model-regularized diffusion policy for out-of-distribution robustness, 2026. https://arxiv.org/abs/2603.21017

  65. [65]

    ARDuP: Active region video diffusion for universal policies

    Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, and Abhinav Shrivastava. ARDuP: Active region video diffusion for universal policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 8465–8472, 2024. https://arxiv.org/abs/2406.13301

  66. [66]

    Enerverse: Envisioning embodied future space for robotics manipulation, 2025. https://arxiv.org/abs/2501.01895

    Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, and Guanghui Ren. Enerverse: Envisioning embodied future space for robotics manipulation, 2025. https://arxiv.org/abs/2501.01895

  67. [67]

    Noisegate: Learning per-latent timestep schedules as information gating in world action models, 2026. https://arxiv.org/abs/2605.07794

    Wen Huang, Haoran Sun, Yongjian Guo, Yunxuan Ma, Haoran Li, Jing Long, Zhouying Mo, Zhong Guan, Yucheng Guo, Shuai Di, and Junwu Xiong. Noisegate: Learning per-latent timestep schedules as information gating in world action models, 2026. https://arxiv.org/abs/2605.07794

  68. [68]

    Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. https://arxiv.org/abs/2601.03782

    Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. https://arxiv.org/abs/2601.03782

  69. [69]

    Navdreamer: Video models as zero-shot 3d navigators, 2026. https://arxiv.org/abs/2602.09765

    Xijie Huang, Weiqi Gai, Tianyue Wu, Congyu Wang, Zhiyang Liu, Xin Zhou, Yuze Wu, and Fei Gao. Navdreamer: Video models as zero-shot 3d navigators, 2026. https://arxiv.org/abs/2602.09765

  70. [70]

    3pointr: 3d point tracks for learning manipulation from unconstrained human videos, 2026. https://arxiv.org/abs/2603.08485

    Adam Hung, Bardienus Pieter Duisterhof, and Jeffrey Ichnowski. 3pointr: 3d point tracks for learning manipulation from unconstrained human videos, 2026. https://arxiv.org/abs/2603.08485

  71. [71]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  72. [72]

    Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

  73. [73]

    Dreamgen: Unlocking generalization in robot learning through video world models, 2025. https://arxiv.org/abs/2505.12705

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

  74. [74]

    Openego: A large-scale multimodal egocentric dataset for dexterous manipulation, 2025. https://arxiv.org/abs/2509.05513

    Ahad Jawaid and Yu Xiang. Openego: A large-scale multimodal egocentric dataset for dexterous manipulation, 2025. https://arxiv.org/abs/2509.05513

  75. [75]

    Ckt-wam: Parameter-efficient context knowledge transfer between world action models, 2026. https://arxiv.org/abs/2605.06247

    Yuhua Jiang, Yijun Guo, Hongbing Yang, Guojun Lei, Nuo Chen, Yinuo Zhang, Shaoqiang Yan, Bo Lin, Feifei Gao, and Biqing Qi. Ckt-wam: Parameter-efficient context knowledge transfer between world action models, 2026. https://arxiv.org/abs/2605.06247

  76. [76]

    CoTracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, 2024. https://arxiv.org/abs/2307.07635

  77. [77]

    Egomimic: Scaling imitation learning via egocentric video, 2024. https://arxiv.org/abs/2410.24221

    Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. https://arxiv.org/abs/2410.24221

  78. [78]

    Openvla: An open-source vision-language-action model, 2024. https://arxiv.org/abs/2406.09246

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. https://arxiv.org/abs/2406.09246

  79. [79]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. https://arxiv.org/abs/2601.16163

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. https://arxiv.org/abs/2601.16163

  80. [80]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. https://arxiv.org/abs/2304.02643
