World Action Models: A Survey

Qi Li; Qiuhong Shen; Shihua Zhang; Shizun Wang; Shuicheng Yan; Xinchao Wang; Yue Liao; Zhenxiong Tan

arxiv: 2606.20781 · v1 · pith:JQ77BRSBnew · submitted 2026-06-18 · 💻 cs.RO · cs.CV

Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan, Shizun Wang, Shuicheng Yan, Xinchao Wang

Pith reviewed 2026-06-26 17:09 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords: World Action Models, embodied AI, predictive action, video generation models, vision-language-action, world models, robotics

The pith

World Action Models are predictive-action methods that trade future generation richness for lower compute and label costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys World Action Models as embodied systems that forecast futures to guide actions. It clarifies how they differ from video generators and language-based policies. Two complementary views organize the methods: one on what must be generated and one on the predictive components used. This leads to the observation that the field favors designs generating only what control requires. The framework helps readers track trade-offs in efficiency and performance across the literature.

Core claim

WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires.

What carries the argument

The two-view taxonomy of generation requirements (rendered futures, latent futures, video-generation-free action reasoning) and predictive substrate decomposition (backbone, action coupling, deployment regime).
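The two views can be read as a small record type. What follows is a minimal Python sketch under stated assumptions: the names `GenerationRequirement`, `WAMProfile`, and `renders_pixels` are hypothetical, and the example entry is a placeholder rather than a classification of any real paper. This page does not list the values each second-view axis can take, so those fields are left as free-form strings.

```python
from dataclasses import dataclass
from enum import Enum


class GenerationRequirement(Enum):
    """First view: what a method is required to generate."""
    RENDERED_FUTURE = "rendered futures"
    LATENT_FUTURE = "latent futures"
    VIDEO_GENERATION_FREE = "video-generation-free action reasoning"


@dataclass(frozen=True)
class WAMProfile:
    """Second view: one record per method.

    The axis vocabularies are not given on this page, so the
    second-view fields are plain strings (an assumption of this sketch).
    """
    name: str
    generation: GenerationRequirement
    predictive_substrate: str
    backbone: str
    action_coupling: str
    deployment_regime: str


def renders_pixels(profile: WAMProfile) -> bool:
    """True only when the method must produce rendered future frames."""
    return profile.generation is GenerationRequirement.RENDERED_FUTURE


# Hypothetical placeholder entry, not a classification of a real paper.
example = WAMProfile(
    name="example-method",
    generation=GenerationRequirement.LATENT_FUTURE,
    predictive_substrate="latent dynamics",
    backbone="video diffusion",
    action_coupling="joint denoising",
    deployment_regime="closed-loop",
)
```

Keeping the first view as a closed enum and the second view as open strings mirrors what the page states: the three generation categories are named explicitly, while the substrate axes are only named, not enumerated.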

If this is right

Design choices in WAMs explicitly balance representational richness with costs in compute, memory, latency, and action labels.
The field is shifting to methods that generate less of the future while preserving control needs.
Properties such as interactability, causality, persistence, physical plausibility, and generalization can be discussed uniformly.
Data, evaluation, and open challenges receive a consistent treatment under the common account.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Hybrid models that combine language backbones with minimal generation components may address latency issues in real-world robotics.
Testing the taxonomy on emerging methods could highlight areas where current classifications need refinement.
Links to model-based reinforcement learning might be strengthened by focusing on the predictive substrate view.

Load-bearing premise

The proposed two-view taxonomy accurately captures all relevant distinctions and design patterns across the cited literature without significant omissions or misclassifications.

What would settle it

A published WAM that cannot be placed into any category of the generation requirements view or the predictive substrate decomposition, or results demonstrating that full video future generation consistently outperforms reduced-generation approaches on control benchmarks.
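The second condition can be made concrete as a simple check. The sketch below is illustrative only: the function name, the `margin` parameter, and the reading of "consistently" as "on every benchmark" are assumptions of this sketch, and the scores are made-up placeholders, not results from the paper or any benchmark.

```python
def full_video_consistently_better(results, margin=0.0):
    """Check the 'consistently outperforms' condition.

    `results` maps a benchmark name to a pair
    (full_video_score, reduced_generation_score), higher is better.
    Returns True only if full video generation wins on every
    benchmark by more than `margin`. An empty mapping returns False.
    """
    if not results:
        return False
    return all(full - reduced > margin for full, reduced in results.values())


# Made-up placeholder scores, for illustration only.
mixed = {"bench-1": (0.80, 0.70), "bench-2": (0.60, 0.65)}
all_wins = {"bench-1": (0.80, 0.70), "bench-2": (0.75, 0.65)}

print(full_video_consistently_better(mixed))     # one loss, so not consistent
print(full_video_consistently_better(all_wins))  # wins everywhere
```

A stricter version could require a statistical test per benchmark; the all-benchmarks rule is used here only because it is the simplest reading of the sentence above.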

Original abstract

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The survey's two-view taxonomy organizes WAM work and flags the shift toward lighter predictive models, but its real test is whether the classifications hold without major omissions.

This paper draws boundaries between world models, video generators, VLAs, and WAMs, then sorts existing methods by what they must generate and by their predictive substrate. That split lets it track how choices in backbone, action coupling, and deployment affect interactability and cost.

The taxonomy itself is the main addition. It surfaces a pattern across the cited papers: methods are reducing how much of the future they render or predict while keeping the parts control actually needs. The discussion of trade-offs in compute, memory, latency, and labels follows directly from that breakdown and matches what people see in practice.

Judging from the abstract, the coverage looks reasonable, but any fresh taxonomy can force awkward fits or skip distinctions that matter in specific papers. Without line-by-line checks on the references, it is hard to know whether the claimed design pattern covers the full range or just the most visible examples. The field moves quickly, so the survey will need updates soon.

People building or comparing embodied policies will get the most from it as a map of current options. Readers who want a single place to locate their own design choices will find it practical. It deserves referee time because a usable organizing frame can reduce duplicated effort even if later work refines the categories.

I would send it for review.

Referee Report

1 major / 2 minor

Summary. The manuscript is a survey on World Action Models (WAMs), defined as embodied predictive-action models that make forecasts of the future available to action. It first clarifies boundaries with broad world models, video generation models, action-grounded video world models, and Vision-Language-Action policies. It then organizes the literature via two complementary views—one on generation requirements (rendered futures, latent futures, video-generation-free action reasoning) and one decomposing methods by predictive substrate, backbone, action coupling, and deployment regime—followed by discussion of interactability, causality, persistence, physical plausibility, generalization, data, evaluation, and open challenges. The central synthesis is that WAMs trade representational richness against compute, memory, latency, and action-label cost, with the field moving toward methods that generate less of the future while preserving control requirements.

Significance. If the proposed taxonomy proves a useful and stable organizing lens, the survey would provide a timely common account for a rapidly expanding area at the intersection of robotics, video generation, and control. The explicit identification of design trade-offs and the accompanying survey homepage constitute concrete contributions that could help structure future work.

major comments (1)

[Taxonomy and synthesis sections] The central claim that a consistent design pattern emerges across the cited works depends on the two-view taxonomy being applied without significant omissions or misclassifications. The manuscript would be strengthened by an explicit table (or appendix) that maps every cited paper to the generation-requirement and predictive-substrate categories; without it, the observed pattern cannot be independently verified.

minor comments (2)

[Introduction / boundary clarification] The abstract states that the survey 'first clarifies these boundaries, then organizes existing works'; the corresponding sections would benefit from a short summary table contrasting WAMs with the four neighboring concepts to make the boundary clarifications immediately scannable.
[Taxonomy decomposition] Several technical terms (e.g., 'predictive substrate', 'action coupling', 'deployment regime') are introduced in the second view; a one-paragraph glossary or footnote definitions on first use would improve readability for readers outside the immediate subfield.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the survey. We agree that an explicit mapping table would strengthen the verifiability of the taxonomy and the central synthesis. We will add this as an appendix in the revised manuscript.

Referee: [Taxonomy and synthesis sections] The central claim that a consistent design pattern emerges across the cited works depends on the two-view taxonomy being applied without significant omissions or misclassifications. The manuscript would be strengthened by an explicit table (or appendix) that maps every cited paper to the generation-requirement and predictive-substrate categories; without it, the observed pattern cannot be independently verified.

Authors: We agree with this assessment. While the two-view taxonomy is applied consistently in the text, an explicit tabular mapping of all cited works would indeed allow independent verification of the classifications and the observed design pattern. In the revision we will add a new appendix containing a comprehensive table that assigns each referenced paper to its generation-requirement category (rendered futures, latent futures, or video-generation-free action reasoning) and its predictive-substrate category, together with the backbone, action-coupling, and deployment-regime attributes used in the second view. This addition directly addresses the concern without altering the manuscript's core claims.

Revision: yes
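The mapping table discussed in this exchange could be checked mechanically. Below is a minimal Python sketch, assuming a hypothetical row layout (`COLUMNS`) and helper names (`unplaced`, `tally`, `to_text_table`) that do not come from the paper. The rows are placeholders, not classifications of real papers, and "?" marks values this page does not supply.

```python
from collections import Counter

# Hypothetical column order for the proposed appendix table.
COLUMNS = ("paper", "generation", "substrate", "backbone", "coupling", "deployment")

# The three generation-requirement categories named in the survey.
GENERATION_CATEGORIES = (
    "rendered futures",
    "latent futures",
    "video-generation-free action reasoning",
)


def unplaced(rows):
    """Return papers whose generation category is outside the taxonomy."""
    return [r["paper"] for r in rows if r["generation"] not in GENERATION_CATEGORIES]


def tally(rows):
    """Count papers per generation-requirement category, zeros included."""
    counts = Counter(r["generation"] for r in rows)
    return {cat: counts.get(cat, 0) for cat in GENERATION_CATEGORIES}


def to_text_table(rows):
    """Render rows as tab-separated lines with a header row."""
    lines = ["\t".join(COLUMNS)]
    lines += ["\t".join(r[c] for c in COLUMNS) for r in rows]
    return "\n".join(lines)


def make_row(paper, generation):
    """Build a placeholder row; unknown attributes are marked '?'."""
    row = {c: "?" for c in COLUMNS}
    row.update(paper=paper, generation=generation)
    return row


# Placeholder rows only; these do NOT classify any real paper.
rows = [
    make_row("paper-A", "rendered futures"),
    make_row("paper-B", "latent futures"),
    make_row("paper-C", "unclassified"),
]

print(to_text_table(rows))
print(tally(rows))
print(unplaced(rows))
```

A non-empty `unplaced` list would flag exactly the omissions or misfits the referee is concerned about, and `tally` gives the per-category counts a reader would need to verify the claimed design pattern.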

Circularity Check

0 steps flagged

No significant circularity: survey with no derivations

This is a literature survey paper whose contribution is a two-view taxonomy (generation requirements and predictive-substrate decomposition) applied to existing works. No equations, fitted parameters, predictions, or derivation chains exist that could reduce to inputs by construction. The observed design pattern is reported as an empirical synthesis across cited literature rather than an internally derived result. Self-citations, if present, are not load-bearing for any central claim. This matches the default expectation of no circularity for non-derivational papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the central contribution rests on the authors' reading and categorization of existing literature rather than new axioms, parameters, or entities.

Reference graph

Works this paper leans on

215 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Egocentric-10k, 2025.https://huggingface.co/datasets/builddotai/Egocentric-10K

Build AI. Egocentric-10k, 2025.https://huggingface.co/datasets/builddotai/Egocentric-10K

2025
[2]

Feedback world model enables precise guidance of diffusion policy, 2026.https: //arxiv.org/abs/2605.15705

Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, and Jianfei Yang. Feedback world model enables precise guidance of diffusion policy, 2026.https: //arxiv.org/abs/2605.15705

Pith/arXiv arXiv 2026
[3]

Self-supervised learning from images with a joint-embedding predictive architecture, 2023

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture, 2023. https://arxiv.org/abs/2301.08243

arXiv 2023
[4]

V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025.https://arxiv.org/abs/2506.09985

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xi...

Pith/arXiv arXiv 2025
[5]

RoboArena: Distributed real-world evaluation of generalist robot policies

Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, Jonathan Tremblay, Kanav Arora, Kirsty Ellis, Luca Macesanu, Marcel Torne Villasevil, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Jie Wang, William Reger, Xingfang Yuan, Xuning Yang, Abhishek Gupta, Dinesh Jayar...

arXiv 2025
[6]

Mc-jepa: A joint-embedding predictive architecture for self- supervised learning of motion and content features, 2023.https://arxiv.org/abs/2307.12698

Adrien Bardes, Jean Ponce, and Yann LeCun. Mc-jepa: A joint-embedding predictive architecture for self- supervised learning of motion and content features, 2023.https://arxiv.org/abs/2307.12698

arXiv 2023
[7]

Revisiting feature prediction for learning visual representations from video, 2024.https: //arxiv.org/abs/2404.08471

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video, 2024.https: //arxiv.org/abs/2404.08471

Pith/arXiv arXiv 2024
[8]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023.https://arxiv.org/abs/2309.01918

Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking, 2023.https://arxiv.org/abs/2309.01918

arXiv 2023
[9]

Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. InCoRL Workshop on Cross-Embodiment, 2024. https://arxiv.org/abs/2409. 16283

2024
[10]

Motus: A unified latent action world model, 2025.https://arxiv.org/abs/2512.13030

Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, Hongyan Zhao, Hanyu Liu, Zhizhong Su, Lei Ma, Hang Su, and Jun Zhu. Motus: A unified latent action world model, 2025.https://arxiv.org/abs/2512.13030

Pith/arXiv arXiv 2025
[11]

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.π0: A visio...

Pith/arXiv arXiv 2024
[12]

Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.https://arxiv.org/abs/2307.15818

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov...

Pith/arXiv arXiv 2023
[13]

RynnVLA-002: A Unified Vision-Language-Action and World Model

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Bohan Hou, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Rynnvla-002: A unified vision-language-action and world model.CoRR, abs/2511.17502, 2025. doi: 10.48550/ARXIV.2511.17502. https://doi.org/10.48550/arXiv. 2511.17502

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.17502 2025
[14]

Worldvla: Towards autoregressive action world model, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model, 2025. https://arxiv.org/abs/2506.21539

Pith/arXiv arXiv 2025
[15]

Indego: A dataset of industrial scenarios and collaborative work for egocentric assistants, 2025

Vivek Chavan, Yasmina Imgrund, Tung Dao, Sanwantri Bai, Bosong Wang, Ze Lu, Oliver Heimann, and Jörg Krüger. Indego: A dataset of industrial scenarios and collaborative work for egocentric assistants, 2025. https://arxiv.org/abs/2511.19684

arXiv 2025
[16]

Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.https://arxiv.org/abs/2410.06158

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation, 2024.https://arxiv.org/abs/2410.06158

Pith/arXiv arXiv 2024
[17]

Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025.https://arxiv.org/abs/2512.15840. 43

Pith/arXiv arXiv 2025
[18]

Transdreamer: Reinforcement learning with transformer world models, 2022.https://arxiv.org/abs/2202.09481

Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. Transdreamer: Reinforcement learning with transformer world models, 2022.https://arxiv.org/abs/2202.09481

arXiv 2022
[19]

Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process, 2026

Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language-action model via joint discrete denoising diffusion process, 2026. https://arxiv.org/abs/2511.01718

arXiv 2026
[20]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation, 2025.https://arxiv.org/abs/2506.18088

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable d...

Pith/arXiv arXiv 2025
[21]

Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, An...

Pith/arXiv arXiv 2023
[22]

RoboMME: Benchmarking and understanding memory for robotic generalist policies

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, and Joyce Chai. RoboMME: Benchmarking and understanding memory for robotic generalist policies. In International Conference on Machine Learning, 2026.https://arxiv.org/abs/2603.04639

Pith/arXiv arXiv 2026
[23]

Scaling egocentric vision: The epic-kitchens dataset, 2018.https://arxiv.org/abs/1804.02748

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, 44 Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling egocentric vision: The epic-kitchens dataset, 2018.https://arxiv.org/abs/1804.02748

Pith/arXiv arXiv 2018
[24]

Robonet: Large-scale multi-robot learning, 2020.https://arxiv.org/ abs/1910.11215

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning, 2020.https://arxiv.org/ abs/1910.11215

Pith/arXiv arXiv 2020
[25]

Emerging properties in unified multimodal pretraining, 2025

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining, 2025. https://arxiv.org/abs/2505.14683

Pith/arXiv arXiv 2025
[26]

Autoregressive video generation without vector quantization, 2024.https://arxiv.org/abs/ 2412.14169

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization, 2024.https://arxiv.org/abs/ 2412.14169

Pith/arXiv arXiv 2024
[27]

Dexworldmodel: Causal latent world modeling towards automated learning of embodied tasks, 2026.https://arxiv.org/abs/2604.16484

Yueci Deng, Guiliang Liu, and Kui Jia. Dexworldmodel: Causal latent world modeling towards automated learning of embodied tasks, 2026.https://arxiv.org/abs/2604.16484

Pith/arXiv arXiv 2026
[28]

Dream2flow: Bridging video generation and open-world manipulation with 3d object flow.arXiv preprint arXiv:2512.24766, 2025

Karthik Dharmarajan, Wenlong Huang, Jiajun Wu, Li Fei-Fei, and Ruohan Zhang. Dream2flow: Bridging video generation and open-world manipulation with 3d object flow.arXiv preprint arXiv:2512.24766, 2025. https://arxiv.org/abs/2512.24766

arXiv 2025
[29]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. InAdvances in Neural Information Processing Systems, 2023.https://arxiv.org/abs/2302.00111

arXiv 2023
[30]

Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. InInternational Conference on Learning Representations, 2024.https://arxiv.org/abs/2310.10625

arXiv 2024
[31]

Worldscore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Li Fei-Fei, and Jiajun Wu. Worldscore: A unified evaluation benchmark for world generation. InIEEE/CVF International Conference on Computer Vision, 2025. https://arxiv.org/ abs/2504.00983

arXiv 2025
[32]

Aim: Intent-aware unified world action modeling with spatial value maps, 2026.https://arxiv.org/abs/2604.11135

Liaoyuan Fan, Zetian Xu, Chen Cao, Wenyao Zhang, Mingqi Yuan, and Jiayu Chen. Aim: Intent-aware unified world action modeling with spatial value maps, 2026.https://arxiv.org/abs/2604.11135

Pith/arXiv arXiv 2026
[33]

Dreamavoid: Critical-phase test-time dreaming to avoid failures in vla policies, 2026.https://arxiv.org/abs/ 2605.11750

Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, and Hengshuang Zhao. Dreamavoid: Critical-phase test-time dreaming to avoid failures in vla policies, 2026.https://arxiv.org/abs/ 2605.11750

Pith/arXiv arXiv 2026
[34]

LIBERO-Plus: In-depth robustness analysis of vision- language-action models.arXiv preprint arXiv:2510.13626, 2025.https://arxiv.org/abs/2510.13626

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. LIBERO-Plus: In-depth robustness analysis of vision- language-action models.arXiv preprint arXiv:2510.13626, 2025.https://arxiv.org/abs/2510.13626

Pith/arXiv arXiv 2025
[35]

A-jepa: Joint-embedding predictive architecture can listen, 2024.https://arxiv.org/abs/2311.15830

Zhengcong Fei, Mingyuan Fan, and Junshi Huang. A-jepa: Joint-embedding predictive architecture can listen, 2024.https://arxiv.org/abs/2311.15830

arXiv 2024
[36]

Harmowam: Harmonizing generalizable and precise manipulation via adaptive world action models, 2026.https://arxiv.org/abs/2605.10942

Qiuxuan Feng, Jiale Yu, Jiaming Liu, Yueru Jia, Zhuangzhe Wu, Hao Chen, Zezhong Qian, Shuo Gu, Peng Jia, Siwei Ma, and Shanghang Zhang. Harmowam: Harmonizing generalizable and precise manipulation via adaptive world action models, 2026.https://arxiv.org/abs/2605.10942

Pith/arXiv arXiv 2026
[37]

Vidar: Embodied video diffusion model for generalist manipulation, 2025.https://arxiv.org/abs/2507.12898

Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation, 2025.https://arxiv.org/abs/2507.12898

Pith/arXiv arXiv 2025
[38]

Barry, Kris Kitani, and George Konidaris

Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, and George Konidaris. Novaplan: Zero-shot long-horizon manipulation via closed-loop video language planning, 2026. https://arxiv.org/abs/2602.20119

arXiv 2026
[39]

Dreamsim: Learning new dimensions of human visual similarity using synthetic data

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. InAdvances in Neural Information Processing Systems, 2023

2023
[40]

Adaworld: Learning adaptable world models with latent actions, 2025.https://arxiv.org/abs/2503.18938

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions, 2025.https://arxiv.org/abs/2503.18938

arXiv 2025
[41]

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie 45 Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abb...

Pith/arXiv arXiv 2026
[42]

Vampo: Policy optimization for improving visual dynamics in video action models, 2026.https://arxiv.org/abs/2603.19370

Zirui Ge, Pengxiang Ding, Baohua Yin, Qishen Wang, Zhiyong Xie, Yemin Wang, Jinbo Wang, Hengtao Li, Runze Suo, Wenxuan Song, Han Zhao, Shangke Lyu, Zhaoxin Fan, Haoang Li, Ran Cheng, Cheng Chi, Huibin Ge, Yaozhi Luo, and Donglin Wang. Vampo: Policy optimization for improving visual dynamics in video action models, 2026.https://arxiv.org/abs/2603.19370

arXiv 2026
[43]

RoboVerse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning, 2025.https://arxiv.org/abs/2504.18904

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran Gon...

arXiv 2025
[44]

World models for learning dexterous hand-object interactions from human videos, 2025.https://arxiv.org/abs/2512.13644

Raktim Gautam Goswami, Amir Bar, David Fan, Tsung-Yen Yang, Gaoyue Zhou, Prashanth Krishnamurthy, Michael Rabbat, Farshad Khorrami, and Yann LeCun. World models for learning dexterous hand-object interactions from human videos, 2025.https://arxiv.org/abs/2512.13644

arXiv 2025
[45]

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Ku- mar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

arXiv 2022
[46]

Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Moha...

arXiv 2024
[47]

Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, Xiaodi Yuan, Pengwei Xie, Zhiao Huang, Rui Chen, and Hao Su. Maniskill2: A unified benchmark for generalizable manipulation skills, 2023.https://arxiv.org/abs/2302.04659

arXiv 2023
[48]

Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026

Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, and Yanwei Fu. Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026. https://arxiv.org/ abs/2602.10717

arXiv 2026
[49]

Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856

Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, and Juho Kannala. Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856. 46

Pith/arXiv arXiv 2026
[50]

Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

arXiv 2025
[51]

Unified 4d world action modeling from video priors with asynchronous denoising, 2026

Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, and Huaping Liu. Unified 4d world action modeling from video priors with asynchronous denoising, 2026. https://arxiv.org/abs/2604.26694

Pith/arXiv arXiv 2026
[52]

Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411. 18179

2024
[53]

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024. https://arxiv.org/abs/2307.04725

[54]

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2024

[55]

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels, 2019. https://arxiv.org/abs/1811.04551

[56]

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models, 2022. https://arxiv.org/abs/2010.02193

[57]

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023. https://arxiv.org/abs/2301.04104

[58]

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. https://arxiv.org/abs/2509.24527

[59]

Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation, 2025. https://arxiv.org/abs/2506.06677

[60]

Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. https://arxiv.org/abs/2505.11709

[61]

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang. World model for robot learning: A comprehensive survey, 2026. https://arxiv.org/abs/2605.00080

[62]

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. https://arxiv.org/abs/2412.14803

[63]

Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, and Jianyu Chen. BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation, 2026. https://arxiv.org/abs/2602.09849. arXiv:2602.09849

[64]

Ziou Hu, Xiangtong Yao, Yuan Meng, Zhenshan Bing, and Alois Knoll. Dreaming the unseen: World model-regularized diffusion policy for out-of-distribution robustness, 2026. https://arxiv.org/abs/2603.21017

[65]

Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, and Abhinav Shrivastava. ARDuP: Active region video diffusion for universal policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 8465–8472, 2024. https://arxiv.org/abs/2406.13301

[66]

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, and Guanghui Ren. Enerverse: Envisioning embodied future space for robotics manipulation, 2025. https://arxiv.org/abs/2501.01895

[67]

Wen Huang, Haoran Sun, Yongjian Guo, Yunxuan Ma, Haoran Li, Jing Long, Zhouying Mo, Zhong Guan, Yucheng Guo, Shuai Di, and Junwu Xiong. Noisegate: Learning per-latent timestep schedules as information gating in world action models, 2026. https://arxiv.org/abs/2605.07794

[68]

Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. https://arxiv.org/abs/2601.03782

[69]

Xijie Huang, Weiqi Gai, Tianyue Wu, Congyu Wang, Zhiyang Liu, Xin Zhou, Yuze Wu, and Fei Gao. Navdreamer: Video models as zero-shot 3d navigators, 2026. https://arxiv.org/abs/2602.09765

[70]

Adam Hung, Bardienus Pieter Duisterhof, and Jeffrey Ichnowski. 3pointr: 3d point tracks for learning manipulation from unconstrained human videos, 2026. https://arxiv.org/abs/2603.08485

[71]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

[72]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

[73]

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh... Dreamgen: Unlocking generalization in robot learning through video world models, 2025. https://arxiv.org/abs/2505.12705

[74]

Ahad Jawaid and Yu Xiang. Openego: A large-scale multimodal egocentric dataset for dexterous manipulation, 2025. https://arxiv.org/abs/2509.05513

[75]

Yuhua Jiang, Yijun Guo, Hongbing Yang, Guojun Lei, Nuo Chen, Yinuo Zhang, Shaoqiang Yan, Bo Lin, Feifei Gao, and Biqing Qi. Ckt-wam: Parameter-efficient context knowledge transfer between world action models, 2026. https://arxiv.org/abs/2605.06247

[76]

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, 2024. https://arxiv.org/abs/2307.07635

[77]

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. https://arxiv.org/abs/2410.24221

[78]

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. https://arxiv.org/abs/2406.09246

[79]

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. https://arxiv.org/abs/2601.16163

[80]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. https://arxiv.org/abs/2304.02643

[48] [48]

Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026

Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, and Yanwei Fu. Say, dream, and act: Learning video world models for instruction-driven robot manipulation.arXiv preprint arXiv:2602.10717, 2026. https://arxiv.org/ abs/2602.10717

arXiv 2026

[49] [49]

Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856

Jiarui Guan, Wenshuai Zhao, Yue Pei, Ziliang Chen, Arno Solin, and Juho Kannala. Point tracking improves world action models, 2026.https://arxiv.org/abs/2605.23856. 46

Pith/arXiv arXiv 2026

[50] [50]

Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

Jun Guo, Xiaojian Ma, Yikai Wang, Min Yang, Huaping Liu, and Qing Li. Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation, 2025.https://arxiv.org/abs/2505.10075

arXiv 2025

[51] [51]

Unified 4d world action modeling from video priors with asynchronous denoising, 2026

Jun Guo, Qiwei Li, Peiyan Li, Zilong Chen, Nan Sun, Yifei Su, Heyun Wang, Yuan Zhang, Xinghang Li, and Huaping Liu. Unified 4d world action modeling from video priors with asynchronous denoising, 2026. https://arxiv.org/abs/2604.26694

Pith/arXiv arXiv 2026

[52] [52]

Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Prediction with action: Visual policy learning via joint denoising process, 2024.https://arxiv.org/abs/2411. 18179

2024

[53] [53]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024.https://arxiv.org/abs/2307.04725

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024.https://arxiv.org/abs/2307.04725

Pith/arXiv arXiv 2024

[54] [54]

AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2024

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning, 2024

2024

[55] [55]

Learning latent dynamics for planning from pixels, 2019.https://arxiv.org/abs/1811.04551

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels, 2019.https://arxiv.org/abs/1811.04551

Pith/arXiv arXiv 2019

[56] [56]

Mastering atari with discrete world models, 2022.https://arxiv.org/abs/2010.02193

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models, 2022.https://arxiv.org/abs/2010.02193

Pith/arXiv arXiv 2022

[57] [57]

Mastering diverse domains through world models, 2023.https://arxiv.org/abs/2301.04104

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023.https://arxiv.org/abs/2301.04104

Pith/arXiv arXiv 2023

[58] [58]

Training agents inside of scalable world models, 2025

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. https://arxiv.org/abs/2509.24527

Pith/arXiv arXiv 2025

[59] [59]

Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation, 2025.https://arxiv.org/abs/2506.06677

Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation, 2025.https://arxiv.org/abs/2506.06677

arXiv 2025

[60] [60]

Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. https://arxiv.org/abs/2505.11709

Ryan Hoque, Peide Huang, David J. Yoon, Mouli Sivapurapu, and Jian Zhang. Egodex: Learning dexterous manipulation from large-scale egocentric video, 2026. https://arxiv.org/abs/2505.11709

Pith/arXiv arXiv 2026

[61] [61]

World model for robot learning: A comprehensive survey, 2026. https://arxiv.org/abs/2605.00080

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo, Sicong Leng, Haoran Geng, Yanjie Ze, Tatsuya Harada, Philip Torr, Oier Mees, Marc Pollefeys, Zhuang Liu, Jiajun Wu, Pieter Abbeel, Jitendra Malik, Yilun Du, and Jianfei Yang. World model for robot learning: A comprehensive survey, 2026. https://arxiv.org/abs/2605.00080

Pith/arXiv arXiv 2026

[62] [62]

Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. https://arxiv.org/abs/2412.14803

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. https://arxiv.org/abs/2412.14803

Pith/arXiv arXiv 2024

[63] [63]

BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation, 2026. https://arxiv.org/abs/2602.09849

Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, Wei Li, and Jianyu Chen. BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation, 2026. https://arxiv.org/abs/2602.09849. arXiv:2602.09849

arXiv 2026

[64] [64]

Dreaming the unseen: World model-regularized diffusion policy for out-of-distribution robustness, 2026. https://arxiv.org/abs/2603.21017

Ziou Hu, Xiangtong Yao, Yuan Meng, Zhenshan Bing, and Alois Knoll. Dreaming the unseen: World model-regularized diffusion policy for out-of-distribution robustness, 2026. https://arxiv.org/abs/2603.21017

arXiv 2026

[65] [65]

ARDuP: Active region video diffusion for universal policies

Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, and Abhinav Shrivastava. ARDuP: Active region video diffusion for universal policies. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 8465–8472, 2024. https://arxiv.org/abs/2406.13301

arXiv 2024

[66] [66]

Enerverse: Envisioning embodied future space for robotics manipulation, 2025. https://arxiv.org/abs/2501.01895

Siyuan Huang, Liliang Chen, Pengfei Zhou, Shengcong Chen, Zhengkai Jiang, Yue Hu, Yue Liao, Peng Gao, Hongsheng Li, Maoqing Yao, and Guanghui Ren. Enerverse: Envisioning embodied future space for robotics manipulation, 2025. https://arxiv.org/abs/2501.01895

arXiv 2025

[67] [67]

Noisegate: Learning per-latent timestep schedules as information gating in world action models, 2026. https://arxiv.org/abs/2605.07794

Wen Huang, Haoran Sun, Yongjian Guo, Yunxuan Ma, Haoran Li, Jing Long, Zhouying Mo, Zhong Guan, Yucheng Guo, Shuai Di, and Junwu Xiong. Noisegate: Learning per-latent timestep schedules as information gating in world action models, 2026. https://arxiv.org/abs/2605.07794

Pith/arXiv arXiv 2026

[68] [68]

Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. https://arxiv.org/abs/2601.03782

Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation, 2026. https://arxiv.org/abs/2601.03782

2026

[69] [69]

Navdreamer: Video models as zero-shot 3d navigators, 2026. https://arxiv.org/abs/2602.09765

Xijie Huang, Weiqi Gai, Tianyue Wu, Congyu Wang, Zhiyang Liu, Xin Zhou, Yuze Wu, and Fei Gao. Navdreamer: Video models as zero-shot 3d navigators, 2026. https://arxiv.org/abs/2602.09765

arXiv 2026

[70] [70]

3pointr: 3d point tracks for learning manipulation from unconstrained human videos, 2026. https://arxiv.org/abs/2603.08485

Adam Hung, Bardienus Pieter Duisterhof, and Jeffrey Ichnowski. 3pointr: 3d point tracks for learning manipulation from unconstrained human videos, 2026. https://arxiv.org/abs/2603.08485

Pith/arXiv arXiv 2026

[71] [71]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

Pith/arXiv arXiv 2025

[72] [72]

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black, George Bokinsky, Shihao Cao, Thomas Charbonnier, Vedant Choudhary, Foster Collins, Ken Conley, Grace Connors, James Darpinian, Karan Dhabalia, Maitrayee Dhaka, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine...

Pith/arXiv arXiv 2026

[73] [73]

Dreamgen: Unlocking generalization in robot learning through video world models, 2025. https://arxiv.org/abs/2505.12705

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

Pith/arXiv arXiv 2025

[74] [74]

Openego: A large-scale multimodal egocentric dataset for dexterous manipulation, 2025. https://arxiv.org/abs/2509.05513

Ahad Jawaid and Yu Xiang. Openego: A large-scale multimodal egocentric dataset for dexterous manipulation, 2025. https://arxiv.org/abs/2509.05513

arXiv 2025

[75] [75]

Ckt-wam: Parameter-efficient context knowledge transfer between world action models, 2026. https://arxiv.org/abs/2605.06247

Yuhua Jiang, Yijun Guo, Hongbing Yang, Guojun Lei, Nuo Chen, Yinuo Zhang, Shaoqiang Yan, Bo Lin, Feifei Gao, and Biqing Qi. Ckt-wam: Parameter-efficient context knowledge transfer between world action models, 2026. https://arxiv.org/abs/2605.06247

Pith/arXiv arXiv 2026

[76] [76]

CoTracker: It is better to track together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, 2024. https://arxiv.org/abs/2307.07635

arXiv 2024

[77] [77]

Egomimic: Scaling imitation learning via egocentric video, 2024. https://arxiv.org/abs/2410.24221

Simar Kareer, Dhruv Patel, Ryan Punamiya, Pranay Mathur, Shuo Cheng, Chen Wang, Judy Hoffman, and Danfei Xu. Egomimic: Scaling imitation learning via egocentric video, 2024. https://arxiv.org/abs/2410.24221

arXiv 2024

[78] [78]

Openvla: An open-source vision-language-action model, 2024. https://arxiv.org/abs/2406.09246

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model, 2024. https://arxiv.org/abs/2406.09246

Pith/arXiv arXiv 2024

[79] [79]

Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. https://arxiv.org/abs/2601.16163

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, and Jinwei Gu. Cosmos policy: Fine-tuning video models for visuomotor control and planning, 2026. https://arxiv.org/abs/2601.16163

Pith/arXiv arXiv 2026

[80] [80]

Segment anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023. https://arxiv.org/abs/2304.02643

Pith/arXiv arXiv 2023