World Action Models: The Next Frontier in Embodied AI
Pith reviewed 2026-05-13 04:56 UTC · model grok-4.3
The pith
World Action Models unify world dynamics prediction with action generation in embodied AI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors formally define World Action Models as embodied foundation models that unify predictive state modeling with action generation by targeting a joint distribution over future states and actions rather than actions alone. They disambiguate this paradigm from prior concepts, trace its origins in VLA and world-model literature, and organize methods into Cascaded WAMs and Joint WAMs with further splits by generation modality, conditioning, and decoding strategy. The paper further synthesizes the supporting data ecosystem and emerging evaluation protocols centered on visual fidelity, physical commonsense, and action plausibility.
What carries the argument
World Action Models (WAMs) integrate predictive models of environment dynamics directly into the action-generation process, producing joint distributions over future states and actions rather than actions alone.
If this is right
- Architectural choices can be compared systematically by whether world prediction is cascaded before action generation or trained jointly with it.
- Training draws on a mix of robot teleoperation, human egocentric video, simulation, and internet-scale data to scale beyond narrow robot datasets.
- Evaluation now requires separate checks on predicted state accuracy, physical plausibility, and final action correctness.
- Open challenges center on computational cost of forward prediction during real-time control and on scaling the joint modeling objective.
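The cascaded-versus-joint distinction in the first bullet can be sketched as two contrasting forward passes. This is an illustrative sketch only; the module names (`world_model`, `action_decoder`, `joint_model`) are hypothetical and not from the paper:

```python
# Illustrative sketch of the survey's two WAM architecture families.
# All module names are hypothetical placeholders for learned components.

def cascaded_wam(obs, goal, world_model, action_decoder):
    """Cascaded WAM: predict future states first, then decode actions
    conditioned on those predicted states."""
    predicted_states = world_model(obs, goal)        # models p(s_{t+1:T} | o, g)
    actions = action_decoder(obs, predicted_states)  # models p(a | o, s-hat)
    return predicted_states, actions

def joint_wam(obs, goal, joint_model):
    """Joint WAM: a single model emits future states and actions together,
    targeting the joint distribution p(s_{t+1:T}, a_{t:T} | o, g)."""
    return joint_model(obs, goal)
```

The structural difference is exactly where the taxonomy draws its top-level split: whether state prediction is a separate stage queried before action decoding, or part of one shared generative pass.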
Where Pith is reading between the lines
- The explicit dynamics component may improve zero-shot transfer to new environments by letting the model simulate interventions it has never seen executed.
- WAM-style training objectives could be combined with classical planning loops to produce hybrid systems that search over predicted futures before committing to actions.
- If the taxonomy proves useful, future papers may adopt the cascaded-versus-joint distinction as a standard way to position new methods.
Load-bearing premise
Explicitly modeling how the world changes under an agent's interventions will produce meaningfully better embodied policies than learning direct reactive mappings from current observations to actions.
What would settle it
A controlled comparison on long-horizon robotic manipulation benchmarks in which agents built as World Action Models are measured against matched Vision-Language-Action baselines for task success rate and sample efficiency; absence of consistent gains would indicate that the added predictive component does not deliver the expected benefit.
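Such a controlled comparison could be tallied with a minimal harness like the following. This is a sketch under assumptions: the metric names and the "consistent gains" criterion (higher success rate with no worse sample efficiency) are illustrative choices, not from the paper:

```python
def compare_policies(results_wam, results_vla):
    """Compare matched WAM and VLA runs on the same task suite.

    Each input is a list of (task_id, success: bool, episodes_to_learn: int)
    tuples from one agent evaluated on identical tasks.
    """
    def summarize(rows):
        n = len(rows)
        success_rate = sum(1 for _, ok, _ in rows if ok) / n
        sample_cost = sum(ep for _, _, ep in rows) / n  # lower is better
        return success_rate, sample_cost

    wam_sr, wam_sc = summarize(results_wam)
    vla_sr, vla_sc = summarize(results_vla)
    # "Consistent gains": higher success AND no worse sample efficiency.
    consistent_gain = wam_sr > vla_sr and wam_sc <= vla_sc
    return {"wam": (wam_sr, wam_sc), "vla": (vla_sr, vla_sc),
            "consistent_gain": consistent_gain}
```

Absence of `consistent_gain` across benchmarks would be the negative result the premise check describes.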
read the original abstract
Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces World Action Models (WAMs) as an emerging paradigm of embodied foundation models that integrate predictive world models with vision-language-action (VLA) policies. It formally defines WAMs as targeting a joint distribution over future states and actions (rather than actions alone), disambiguates them from related concepts, and traces the historical integration of VLA and world-model research. It then organizes existing methods into a taxonomy of Cascaded versus Joint WAMs (with further subdivisions by generation modality, conditioning mechanism, and action decoding), analyzes the supporting data ecosystem (teleoperation, human demonstrations, simulation, egocentric video), synthesizes evaluation protocols focused on visual fidelity, physical commonsense, and action plausibility, and outlines open challenges.
Significance. If the taxonomy and disambiguation hold, the survey supplies the first systematic conceptual framework for a rapidly fragmenting intersection of world models and embodied policies. This organization of architectural paradigms, data sources, and evaluation axes could reduce duplication of effort and clarify trade-offs, thereby accelerating research on non-reactive, dynamics-aware action generation.
major comments (2)
- [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.
- [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.
minor comments (3)
- [Data ecosystem] The data-ecosystem section would be strengthened by a summary table listing scale, diversity, and annotation characteristics of the cited sources (teleoperation, egocentric video, etc.) to support the assertion that they collectively fuel WAM development.
- [Evaluation protocols] Evaluation protocols are synthesized around three axes, but the manuscript would benefit from an explicit table or figure mapping existing benchmarks to those axes and to the taxonomy branches, improving usability for readers.
- [Foundations and related work] A small number of citations appear to be missing for recent VLA baselines that already incorporate limited forward prediction; adding them would strengthen the claim of literature fragmentation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive recommendation for minor revision. The feedback identifies opportunities to strengthen the formal grounding and practical utility of the proposed framework. We respond to each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.
Authors: We agree that an explicit probabilistic formulation would reduce ambiguity. In the revised manuscript we will expand §2 with the following formalization: a WAM targets p(s_{t+1:T}, a_{t:T} | o_{1:t}, a_{1:t-1}, g) where s denotes future states, a actions, o observations and g the goal, contrasting it directly with standard VLAs that model only the conditional p(a_t | o_{1:t}, g). A side-by-side comparison table will be added, and borderline cases (e.g., models that predict states only for planning but decode actions separately) will be discussed with classification rules. These additions will be placed immediately after the current high-level definition. revision: yes
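Restating the rebuttal's formalization in display form (the chain-rule factorization in the last line is a standard identity added for clarity, not taken from the paper):

```latex
% WAM objective: joint distribution over future states and actions
p\!\left(s_{t+1:T},\, a_{t:T} \mid o_{1:t},\, a_{1:t-1},\, g\right)
% Standard VLA objective: conditional distribution over actions alone
p\!\left(a_t \mid o_{1:t},\, g\right)
% A Cascaded WAM factorizes the joint via the chain rule:
p\!\left(s_{t+1:T},\, a_{t:T} \mid \cdot\right)
  = p\!\left(s_{t+1:T} \mid \cdot\right)\,
    p\!\left(a_{t:T} \mid s_{t+1:T},\, \cdot\right)
```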
-
Referee: [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.
Authors: We acknowledge that the taxonomy would benefit from operational criteria. The revised §3 will include (i) a decision flowchart with explicit rules (e.g., “if state and action tokens are generated by a single autoregressive pass then classify as Joint; if a separate world-model module is queried before action decoding then Cascaded”), (ii) pseudocode for the classification procedure, and (iii) worked examples for three representative papers (one Cascaded, two Joint with different sub-branches) showing the exact assignment logic. These additions will also note how hybrid methods can be annotated with multiple labels when appropriate. revision: yes
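The decision rules quoted in (i) could be operationalized roughly as follows. A sketch only: the field names on `method` are hypothetical, and the multi-label handling for hybrids mirrors point (iii) of the planned revision:

```python
def classify_wam(method):
    """Classify a method per the rebuttal's proposed decision rules.

    `method` is a dict of hypothetical boolean fields describing the
    architecture; hybrid methods may receive multiple labels.
    """
    labels = []
    if method.get("single_pass_state_and_action"):
        # State and action tokens generated in one autoregressive pass.
        labels.append("Joint")
    if method.get("separate_world_model_queried_before_decoding"):
        # A distinct world-model module is queried before action decoding.
        labels.append("Cascaded")
    if not labels:
        labels.append("Unclassified")  # flag borderline cases for review
    return labels
```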
Circularity Check
No significant circularity: survey paper with no derivations or fitted quantities
full rationale
This is a survey paper whose contribution is a proposed taxonomy and definition for World Action Models (WAMs) drawn from existing literature. The abstract and structure contain no equations, theorems, parameter fits, predictions, or derivations that could reduce to inputs by construction. Claims about unifying predictive state modeling with action generation are definitional and organizational, supported by external citations rather than self-referential loops. No load-bearing self-citations, ansatzes, or renamings of known results appear in the provided content.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brohan et al., arXiv, 2023.
- [2] OpenVLA: An Open-Source Vision-Language-Action Model. Kim et al., arXiv:2406.09246, 2024.
- [4] π0: A Vision-Language-Action Flow Model for General Robot Control. Black et al., arXiv, 2024.
- [5] LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models. Fei et al., arXiv:2510.13626, 2025.
- [6] World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning. Wang et al., ACL 2025. doi:10.18653/v1/2025.acl-long.1044.
- [8] Learning Universal Policies via Text-Guided Video Generation. Du et al., NeurIPS 2023. arXiv:2302.00111.
- [9] Video Language Planning. Du et al., arXiv:2310.10625, 2023.
- [11] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation. Yang et al., arXiv:2506.22007, 2025.
- [12] Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation. Gu et al., arXiv:2602.10717, 2026.
- [13] Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations. Hu et al., arXiv:2412.14803, 2024.
- [14] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs. Pai et al., arXiv:2512.15692, 2025.
- [15] Video Generators Are Robot Policies. Liang et al., arXiv:2508.00795, 2025.
- [16] S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight. Yan et al., arXiv:2603.16195, 2026.
- [17] Latent Action Pretraining from Videos. Ye et al., ICLR 2025.
- [18] Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning. Kim et al., arXiv:2601.16163.
- [20] World Action Models Are Zero-Shot Policies. Ye et al., arXiv:2602.15922.
- [22] Causal World Modeling for Robot Control. Li et al., arXiv:2601.21998, 2026.
- [23] Motus: A Unified Latent Action World Model. Bi et al., arXiv:2512.13030, 2025.
- [24] Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets. Zhu et al., arXiv:2504.02792, 2025.
- [25] Prediction with Action: Visual Policy Learning via Joint Denoising Process. Guo et al., arXiv:2411.18179, 2024.
- [26] Learning Physics from Pretrained Video Models: Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation. Song et al., arXiv:2603.00110, 2026.
- [27] iVideoGPT: Interactive VideoGPTs Are Scalable World Models. Wu et al., arXiv:2405.15223, 2024.
- [28] FlowDreamer: An RGB-D World Model with Flow-Based Motion Representations for Robot Manipulation. Guo et al., arXiv:2505.10075, 2025.
- [29] EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. Huang et al., arXiv:2501.01895, 2025.
- [30] Learning Latent Dynamics for Planning from Pixels. Hafner et al., arXiv:1811.04551, 2019.
- [31] TransDreamer: Reinforcement Learning with Transformer World Models. Chen et al., arXiv:2202.09481, 2024.
- [32] Revisiting Feature Prediction for Learning Visual Representations from Video. Bardes et al., arXiv:2404.08471, 2024.
- [33] MoCoGAN: Decomposing Motion and Content for Video Generation. Tulyakov et al., arXiv:1707.04993, 2017.
- [34] U-Net: Convolutional Networks for Biomedical Image Segmentation. Ronneberger et al., arXiv:1505.04597, 2015.
- [35] Latte: Latent Diffusion Transformer for Video Generation. Ma et al., arXiv:2401.03048, 2025.
- [36] Wan: Open and Advanced Large-Scale Video Generative Models. Team Wan et al., arXiv, 2025.
- [37] Sora 2. OpenAI, https://openai.com/sora, September 2025. Accessed 2026-04-08.
- [38] Structured World Models from Human Videos. Mendonca et al., arXiv:2308.10901, 2023.
- [39] Shenyuan Gao, William Liang, Kaiyuan Zheng, et al.
- [40] RoboDreamer: Learning Compositional World Models for Robot Imagination. Zhou et al., arXiv:2404.12377, 2024.
- [41] RoboScape: Physics-Informed Embodied World Model. Shang et al., arXiv:2506.23135, 2025.
- [42] Ctrl-World: A Controllable Generative World Model for Robot Manipulation. Guo et al., arXiv:2510.10125, 2026.
- [43] Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination. Barcellona et al., arXiv:2412.14957, 2025.
- [44] Dream to Control: Learning Behaviors by Latent Imagination. Hafner et al., arXiv:1912.01603, 2020.
- [45] Mastering Atari with Discrete World Models. Hafner et al., arXiv:2010.02193, 2022.
- [46] Training Agents Inside of Scalable World Models. Hafner et al., arXiv:2509.24527, 2025.
- [47] RISE: Self-Improving Robot Policy with Compositional World Model. Yang et al., arXiv:2602.11075, 2026.
- [48] Mastering Diverse Domains through World Models. Hafner et al., arXiv:2301.04104, 2024.
- [49] DayDreamer: World Models for Physical Robot Learning. Wu et al., arXiv:2206.14176, 2022.
- [50] World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training. Xiao et al., arXiv:2509.24948, 2026.
- [51] RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL. Tang et al., arXiv:2512.03556, 2025.
- [52] WMPO: World Model-Based Policy Optimization for Vision-Language-Action Models. Zhu et al., arXiv:2511.09515, 2025.
- [53] WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL. Jiang et al., arXiv:2602.13977, 2026.
- [54] VLA-RFT: Vision-Language-Action Reinforcement Fine-Tuning with Verified Rewards in World Simulators. Li et al., arXiv:2510.00406, 2025.
- [55] Reinforcement World Model Learning for LLM-Based Agents. Yu et al., arXiv:2602.05842, 2026.
- [56] MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation. Lancaster et al., arXiv:2309.14236, 2024.
- [57] World-Gymnast: Training Robots with Reinforcement Learning in a World Model. Sharma et al., arXiv:2602.02454, 2026.
- [58] Uncertainty-Aware Robotic World Model Makes Offline Model-Based Reinforcement Learning Work on Real Robots. Li et al., arXiv:2504.16680, 2026.
- [59] World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation. Jiang et al., arXiv:2509.19080, 2025.
- [60] Video Prediction Models as Rewards for Reinforcement Learning. Escontrela et al., arXiv:2305.14343, 2023.
- [61] Robot Learning from a Physical World Model. Mao et al.
- [63] Diffusion Reward: Learning Rewards via Conditional Video Diffusion. Huang et al., arXiv:2312.14134, 2024.
- [64] Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning. Wang et al., arXiv:2512.00961, 2025.
- [65] Evaluating Gemini Robotics Policies in a Veo World Simulator. Gemini Robotics Team, 2026.
- [66] Interactive World Simulator for Robot Policy Training and Evaluation. Wang et al., arXiv:2603.08546, 2026.
- [67] WorldEval: World Model as Real-World Robot Policies Evaluator. Li et al., arXiv:2505.19017, 2025.
- [68] WorldGym: World Model as an Environment for Policy Evaluation. Quevedo et al., arXiv:2506.00613, 2025.
- [69] dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model. Li et al., arXiv:2604.22152, 2026.
- [72] Tesseract: Learning 4D Embodied World Models. Zhen et al., arXiv:2504.20995, 2025.
- [73] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation. Wang et al., arXiv:2602.09878, 2026.
- [74] Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation. Bharadhwaj et al., CoRL Workshop on Cross-Embodiment, 2024. arXiv:2409.16283.
- [75] Flow as the Cross-Domain Manipulation Interface. Xu et al., CoRL 2024. arXiv:2407.15208.
- [76] 3DFlowAction: Learning Cross-Embodiment Manipulation from 3D Flow World Model. Zhi et al., arXiv:2506.06199, 2025.
- [77] NovaFlow: Zero-Shot Manipulation via Actionable Flow from Generated Videos. Li et al., arXiv:2510.08568, 2025.
- [78] Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow. Dharmarajan et al., arXiv:2512.24766, 2025.
- [79] Dreamitate: Real-World Visuomotor Policy Learning via Video Generation. Liang et al., CoRL 2024. arXiv:2406.16862.
- [80] Geometry-Aware 4D Video Generation for Robot Manipulation. Liu et al., 2025.
- [81] Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations. Patel et al., arXiv:2507.00990, 2025.
discussion (0)