World Action Models: The Next Frontier in Embodied AI
Pith reviewed 2026-05-13 04:56 UTC · model grok-4.3
The pith
World Action Models unify world dynamics prediction with action generation in embodied AI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formally define World Action Models as embodied foundation models that unify predictive state modeling with action generation by targeting a joint distribution over future states and actions rather than actions alone. They disambiguate this paradigm from prior concepts, trace its origins in VLA and world-model literature, and organize methods into Cascaded WAMs and Joint WAMs with further splits by generation modality, conditioning, and decoding strategy. The paper further synthesizes the supporting data ecosystem and emerging evaluation protocols centered on visual fidelity, physical commonsense, and action plausibility.
What carries the argument
World Action Models (WAMs) integrate predictive models of environment dynamics directly into the action-generation process, producing joint distributions over future states and actions rather than actions alone.
If this is right
- Architectural choices can be compared systematically by whether world prediction is cascaded before action generation or trained jointly with it.
- Training draws on a mix of robot teleoperation, human egocentric video, simulation, and internet-scale data to scale beyond narrow robot datasets.
- Evaluation now requires separate checks on predicted state accuracy, physical plausibility, and final action correctness.
- Open challenges center on computational cost of forward prediction during real-time control and on scaling the joint modeling objective.
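The cascaded-versus-joint distinction in the first bullet can be sketched as two contrasting forward passes. This is an illustrative sketch only; the module names (`world_model`, `action_decoder`, `joint_model`) are hypothetical and not from the paper:

```python
# Illustrative sketch of the survey's two WAM architecture families.
# All module names are hypothetical placeholders for learned components.

def cascaded_wam(obs, goal, world_model, action_decoder):
    """Cascaded WAM: predict future states first, then decode actions
    conditioned on those predicted states."""
    predicted_states = world_model(obs, goal)        # models p(s_{t+1:T} | o, g)
    actions = action_decoder(obs, predicted_states)  # models p(a | o, s-hat)
    return predicted_states, actions

def joint_wam(obs, goal, joint_model):
    """Joint WAM: a single model emits future states and actions together,
    targeting the joint distribution p(s_{t+1:T}, a_{t:T} | o, g)."""
    return joint_model(obs, goal)
```

The structural difference is exactly where the taxonomy draws its top-level split: whether state prediction is a separate stage queried before action decoding, or part of one shared generative pass.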
Where Pith is reading between the lines
- The explicit dynamics component may improve zero-shot transfer to new environments by letting the model simulate interventions it has never seen executed.
- WAM-style training objectives could be combined with classical planning loops to produce hybrid systems that search over predicted futures before committing to actions.
- If the taxonomy proves useful, future papers may adopt the cascaded-versus-joint distinction as a standard way to position new methods.
Load-bearing premise
Explicitly modeling how the world changes under an agent's interventions will produce meaningfully better embodied policies than learning direct reactive mappings from current observations to actions.
What would settle it
A controlled comparison on long-horizon robotic manipulation benchmarks in which agents built as World Action Models are measured against matched Vision-Language-Action baselines for task success rate and sample efficiency; absence of consistent gains would indicate that the added predictive component does not deliver the expected benefit.
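Such a controlled comparison could be tallied with a minimal harness like the following. This is a sketch under assumptions: the metric names and the "consistent gains" criterion (higher success rate with no worse sample efficiency) are illustrative choices, not from the paper:

```python
def compare_policies(results_wam, results_vla):
    """Compare matched WAM and VLA runs on the same task suite.

    Each input is a list of (task_id, success: bool, episodes_to_learn: int)
    tuples from one agent evaluated on identical tasks.
    """
    def summarize(rows):
        n = len(rows)
        success_rate = sum(1 for _, ok, _ in rows if ok) / n
        sample_cost = sum(ep for _, _, ep in rows) / n  # lower is better
        return success_rate, sample_cost

    wam_sr, wam_sc = summarize(results_wam)
    vla_sr, vla_sc = summarize(results_vla)
    # "Consistent gains": higher success AND no worse sample efficiency.
    consistent_gain = wam_sr > vla_sr and wam_sc <= vla_sc
    return {"wam": (wam_sr, wam_sc), "vla": (vla_sr, vla_sc),
            "consistent_gain": consistent_gain}
```

Absence of `consistent_gain` across benchmarks would be the negative result the premise check describes.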
read the original abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces World Action Models (WAMs) as an emerging paradigm of embodied foundation models that integrate predictive world models with vision-language-action (VLA) policies. It formally defines WAMs as targeting a joint distribution over future states and actions (rather than actions alone), disambiguates them from related concepts, and traces the historical integration of VLA and world-model research. It then organizes existing methods into a taxonomy of Cascaded versus Joint WAMs (with further subdivisions by generation modality, conditioning mechanism, and action decoding), analyzes the supporting data ecosystem (teleoperation, human demonstrations, simulation, egocentric video), synthesizes evaluation protocols focused on visual fidelity, physical commonsense, and action plausibility, and outlines open challenges.
Significance. If the taxonomy and disambiguation hold, the survey supplies the first systematic conceptual framework for a rapidly fragmenting intersection of world models and embodied policies. This organization of architectural paradigms, data sources, and evaluation axes could reduce duplication of effort and clarify trade-offs, thereby accelerating research on non-reactive, dynamics-aware action generation.
major comments (2)
- [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.
- [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.
minor comments (3)
- [Data ecosystem] The data-ecosystem section would be strengthened by a summary table listing scale, diversity, and annotation characteristics of the cited sources (teleoperation, egocentric video, etc.) to support the assertion that they collectively fuel WAM development.
- [Evaluation protocols] Evaluation protocols are synthesized around three axes, but the manuscript would benefit from an explicit table or figure mapping existing benchmarks to those axes and to the taxonomy branches, improving usability for readers.
- [Foundations and related work] A small number of citations appear to be missing for recent VLA baselines that already incorporate limited forward prediction; adding them would strengthen the claim of literature fragmentation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. The feedback identifies opportunities to strengthen the formal grounding and practical utility of the proposed framework. We respond to each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.
Authors: We agree that an explicit probabilistic formulation would reduce ambiguity. In the revised manuscript we will expand §2 with the following formalization: a WAM targets p(s_{t+1:T}, a_{t:T} | o_{1:t}, a_{1:t-1}, g) where s denotes future states, a actions, o observations and g the goal, contrasting it directly with standard VLAs that model only the conditional p(a_t | o_{1:t}, g). A side-by-side comparison table will be added, and borderline cases (e.g., models that predict states only for planning but decode actions separately) will be discussed with classification rules. These additions will be placed immediately after the current high-level definition. revision: yes
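Restating the rebuttal's formalization in display form (the chain-rule factorization in the last line is a standard identity added for clarity, not taken from the paper):

```latex
% WAM objective: joint distribution over future states and actions
p\!\left(s_{t+1:T},\, a_{t:T} \mid o_{1:t},\, a_{1:t-1},\, g\right)
% Standard VLA objective: conditional distribution over actions alone
p\!\left(a_t \mid o_{1:t},\, g\right)
% A Cascaded WAM factorizes the joint via the chain rule:
p\!\left(s_{t+1:T},\, a_{t:T} \mid \cdot\right)
  = p\!\left(s_{t+1:T} \mid \cdot\right)\,
    p\!\left(a_{t:T} \mid s_{t+1:T},\, \cdot\right)
```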
-
Referee: [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.
Authors: We acknowledge that the taxonomy would benefit from operational criteria. The revised §3 will include (i) a decision flowchart with explicit rules (e.g., “if state and action tokens are generated by a single autoregressive pass then classify as Joint; if a separate world-model module is queried before action decoding then Cascaded”), (ii) pseudocode for the classification procedure, and (iii) worked examples for three representative papers (one Cascaded, two Joint with different sub-branches) showing the exact assignment logic. These additions will also note how hybrid methods can be annotated with multiple labels when appropriate. revision: yes
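The decision rules quoted in (i) could be operationalized roughly as follows. A sketch only: the field names on `method` are hypothetical, and the multi-label handling for hybrids mirrors point (iii) of the planned revision:

```python
def classify_wam(method):
    """Classify a method per the rebuttal's proposed decision rules.

    `method` is a dict of hypothetical boolean fields describing the
    architecture; hybrid methods may receive multiple labels.
    """
    labels = []
    if method.get("single_pass_state_and_action"):
        # State and action tokens generated in one autoregressive pass.
        labels.append("Joint")
    if method.get("separate_world_model_queried_before_decoding"):
        # A distinct world-model module is queried before action decoding.
        labels.append("Cascaded")
    if not labels:
        labels.append("Unclassified")  # flag borderline cases for review
    return labels
```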
Circularity Check
No significant circularity: survey paper with no derivations or fitted quantities
full rationale
This is a survey paper whose contribution is a proposed taxonomy and definition for World Action Models (WAMs) drawn from existing literature. The abstract and structure contain no equations, theorems, parameter fits, predictions, or derivations that could reduce to inputs by construction. Claims about unifying predictive state modeling with action generation are definitional and organizational, supported by external citations rather than self-referential loops. No load-bearing self-citations, ansatzes, or renamings of known results appear in the provided content.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brohan et al., arXiv, 2023.
- [2] OpenVLA: An Open-Source Vision-Language-Action Model. Kim et al., arXiv:2406.09246, 2024.
- [4] π0: A Vision-Language-Action Flow Model for General Robot Control. Black et al., arXiv, 2024.
- [5] LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. Fei et al., arXiv:2510.13626, 2025.
- [6] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning. Wang et al., ACL 2025. doi:10.18653/v1/2025.acl-long.1044.
- [8] Learning Universal Policies via Text-Guided Video Generation. Du et al., NeurIPS 2023. arXiv:2302.00111.
- [9] Video Language Planning. Du et al., arXiv:2310.10625, 2023.
- [11] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation. Yang et al., arXiv:2506.22007, 2025.
- [12] Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation. Gu et al., arXiv:2602.10717, 2026.
- [13] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. Hu et al., arXiv:2412.14803, 2024.
- [14] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. Pai et al., arXiv:2512.15692, 2025.
- [15] Video Generators Are Robot Policies. Liang et al., arXiv:2508.00795, 2025.
- [16] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight. Yan et al., arXiv:2603.16195, 2026.
- [17] Latent Action Pretraining from Videos. Ye et al., ICLR 2025.
- [18] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. Kim et al., arXiv:2601.16163.
- [20] World Action Models Are Zero-Shot Policies. Ye et al., arXiv:2602.15922.
- [22] Causal World Modeling for Robot Control. Li et al., arXiv:2601.21998, 2026.
- [23] Motus: A Unified Latent Action World Model. Bi et al., arXiv:2512.13030, 2025.
- [24] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. Zhu et al., arXiv:2504.02792, 2025.
- [25] Prediction with Action: Visual Policy Learning via Joint Denoising Process. Guo et al., arXiv:2411.18179, 2024.
- [26] Learning Physics from Pretrained Video Models: Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation. Song et al., arXiv:2603.00110, 2026.
- [27] iVideoGPT: Interactive VideoGPTs Are Scalable World Models. Wu et al., arXiv:2405.15223, 2024.
- [28] FlowDreamer: An RGB-D World Model with Flow-Based Motion Representations for Robot Manipulation. Guo et al., arXiv:2505.10075, 2025.
- [29] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. Huang et al., arXiv:2501.01895, 2025.
- [30] Learning Latent Dynamics for Planning from Pixels. Hafner et al., arXiv:1811.04551, 2019.
- [31] TransDreamer: Reinforcement Learning with Transformer World Models. Chen et al., arXiv:2202.09481, 2024.
- [32] Revisiting Feature Prediction for Learning Visual Representations from Video. Bardes et al., arXiv:2404.08471, 2024.
- [33] MoCoGAN: Decomposing Motion and Content for Video Generation. Tulyakov et al., arXiv:1707.04993, 2017.
- [34] U-Net: Convolutional Networks for Biomedical Image Segmentation. Ronneberger et al., arXiv:1505.04597, 2015.
- [35] Latte: Latent Diffusion Transformer for Video Generation. Ma et al., arXiv:2401.03048, 2025.
- [36] Wan: Open and Advanced Large-Scale Video Generative Models. Team Wan et al., arXiv, 2025.
- [37] Sora 2. OpenAI, https://openai.com/sora, September 2025. Accessed 2026-04-08.
- [38] Structured World Models from Human Videos. Mendonca et al., arXiv:2308.10901, 2023.
- [39] Shenyuan Gao, William Liang, Kaiyuan Zheng, et al.
- [40] RoboDreamer: Learning Compositional World Models for Robot Imagination. Zhou et al., arXiv:2404.12377, 2024.
- [41] RoboScape: Physics-Informed Embodied World Model. Shang et al., arXiv:2506.23135, 2025.
- [42] Ctrl-World: A Controllable Generative World Model for Robot Manipulation. Guo et al., arXiv:2510.10125, 2026.
- [43] Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. Barcellona et al., arXiv:2412.14957, 2025.
- [44] Dream to Control: Learning Behaviors by Latent Imagination. Hafner et al., arXiv:1912.01603, 2020.
- [45] Mastering Atari with Discrete World Models. Hafner et al., arXiv:2010.02193, 2022.
- [46] Training Agents Inside of Scalable World Models. Hafner et al., arXiv:2509.24527, 2025.
- [47] RISE: Self-Improving Robot Policy with Compositional World Model. Yang et al., arXiv:2602.11075, 2026.
- [48] Mastering Diverse Domains through World Models. Hafner et al., arXiv:2301.04104, 2024.
- [49] DayDreamer: World Models for Physical Robot Learning. Wu et al., arXiv:2206.14176, 2022.
- [50] World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training. Xiao et al., arXiv:2509.24948, 2026.
- [51] RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL. Tang et al., arXiv:2512.03556, 2025.
- [52] WMPO: World Model-Based Policy Optimization for Vision-Language-Action Models. Zhu et al., arXiv:2511.09515, 2025.
- [53] WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL. Jiang et al., arXiv:2602.13977, 2026.
- [54] VLA-RFT: Vision-Language-Action Reinforcement Fine-Tuning with Verified Rewards in World Simulators. Li et al., arXiv:2510.00406, 2025.
- [55] Reinforcement World Model Learning for LLM-Based Agents. Yu et al., arXiv:2602.05842, 2026.
- [56] MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation. Lancaster et al., arXiv:2309.14236, 2024.
- [57] World-Gymnast: Training Robots with Reinforcement Learning in a World Model. Sharma et al., arXiv:2602.02454, 2026.
- [58] Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots. Li et al., arXiv:2504.16680, 2026.
- [59] World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. Jiang et al., arXiv:2509.19080, 2025.
- [60] Video Prediction Models as Rewards for Reinforcement Learning. Escontrela et al., arXiv:2305.14343, 2023.
- [61] Robot Learning from a Physical World Model. Mao et al.
- [63] Diffusion Reward: Learning Rewards via Conditional Video Diffusion. Huang et al., arXiv:2312.14134, 2024.
- [64] Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning. Wang et al., arXiv:2512.00961, 2025.
- [65] Evaluating Gemini Robotics Policies in a Veo World Simulator. Gemini Robotics Team, 2026.
- [66] Interactive World Simulator for Robot Policy Training and Evaluation. Wang et al., arXiv:2603.08546, 2026.
- [67] WorldEval: World Model as Real-World Robot Policies Evaluator. Li et al., arXiv:2505.19017, 2025.
- [68] WorldGym: World Model as an Environment for Policy Evaluation. Quevedo et al., arXiv:2506.00613, 2025.
- [69] dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model. Li et al., arXiv:2604.22152, 2026.
- [72] Tesseract: Learning 4D Embodied World Models. Zhen et al., arXiv:2504.20995, 2025.
- [73] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation. Wang et al., arXiv:2602.09878, 2026.
- [74] Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation. Bharadhwaj et al., CoRL Workshop on Cross-Embodiment, 2024. arXiv:2409.16283.
- [75] Flow as the Cross-Domain Manipulation Interface. Xu et al., CoRL 2024. arXiv:2407.15208.
- [76] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model. Zhi et al., arXiv:2506.06199, 2025.
- [77] NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos. Li et al., arXiv:2510.08568, 2025.
- [78] Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow. Dharmarajan et al., arXiv:2512.24766, 2025.
- [79] Dreamitate: Real-World Visuomotor Policy Learning via Video Generation. Liang et al., CoRL 2024. arXiv:2406.16862.
- [80] Geometry-Aware 4D Video Generation for Robot Manipulation. Liu et al., 2025.
- [81] Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations. Patel et al., arXiv:2507.00990, 2025.
discussion (0)