World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei, Mengshi Zhang, Canxi Liang, Jun Hu, Zhongxuan Li, Jinsong Wu, Ning Han, Zeqing Zhang, Jiaming Qi, Hongmin Wu, Shiyao Zhang, Pai Zheng, Jia Pan, David Navarro-Alarcon, Sichao Liu, Peng Zhou

arxiv: 2606.00113 · v1 · pith:7XYSTMDQnew · submitted 2026-05-27 · 💻 cs.RO

Pith reviewed 2026-06-29 12:22 UTC · model grok-4.3

classification 💻 cs.RO

keywords: world models, robotic manipulation, robot learning, dynamics prediction, action-conditioned prediction, predictive infrastructure, manipulation datasets, closed-loop evaluation

The pith

World models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how learned world models enable robots to anticipate how actions reshape objects, contacts, and scene geometry. It provides an operational definition of a world model as an action-conditioned predictive system, distinct from perception modules, inverse models, policies, rewards, and value functions. The work organizes the literature into five representation families and develops a functional taxonomy that separates integrated prediction-action models from explicit predictive planners. It characterizes infrastructure roles such as synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification, and maps these roles across pretraining, post-training, and inference. The survey also reviews 34 manipulation datasets and synthesizes evaluation protocols. It concludes that world models are evolving toward predictive infrastructure, and it identifies open challenges in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.

Core claim

The survey defines world models as action-conditioned predictive systems and classifies existing work into five representation families with a functional taxonomy. On that basis it establishes that these models are shifting from specialized task-specific dynamics predictors to general predictive infrastructure that supports synthetic experience, planning, verification, and adaptation in robotic manipulation. It also exposes persistent gaps in contact-rich modeling, hallucination mitigation, action alignment, and reliable closed-loop evaluation.

What carries the argument

The operational definition of a world model as an action-conditioned predictive system, combined with the five representation families and the functional taxonomy that distinguishes integrated prediction-action models from explicit predictive planners.
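One way to read the operational definition is through function signatures. The sketch below is a toy illustration under stated assumptions, not the survey's formalism: the 1-D `State` and `Action` types, the additive dynamics, and every `toy_*` name are hypothetical. It shows that only the world model maps a state and an action to a predicted future state, while the inverse model, policy, and reward take different inputs or return different outputs.

```python
from typing import Callable, List

# Hypothetical toy setting: the "scene" is a 1-D object position and the
# action is a 1-D push. None of these names come from the survey.
State = float
Action = float

# World model: action-conditioned prediction, (s_t, a_t) -> s_{t+1}.
WorldModel = Callable[[State, Action], State]
# Inverse model: recovers the action linking two states, (s_t, s_{t+1}) -> a_t.
InverseModel = Callable[[State, State], Action]
# Policy: selects an action from the current state, s_t -> a_t.
Policy = Callable[[State], Action]
# Reward: scores a state; it does not predict the next one.
Reward = Callable[[State], float]

GOAL = 1.0


def toy_world_model(state: State, action: Action) -> State:
    """Assumed additive dynamics: the push displaces the object."""
    return state + action


def toy_inverse_model(state: State, next_state: State) -> Action:
    """Action that explains the transition under the same toy dynamics."""
    return next_state - state


def toy_policy(state: State) -> Action:
    """Move toward GOAL, with the push clipped to [-0.2, 0.2]."""
    return max(-0.2, min(0.2, GOAL - state))


def toy_reward(state: State) -> float:
    """Negative distance to GOAL."""
    return -abs(GOAL - state)


def imagine(model: WorldModel, policy: Policy, state: State, horizon: int) -> List[State]:
    """Roll the policy forward inside the world model, with no real robot."""
    states = [state]
    for _ in range(horizon):
        state = model(state, policy(state))
        states.append(state)
    return states


trajectory = imagine(toy_world_model, toy_policy, 0.0, horizon=5)
```

In this toy setting the policy and reward never predict a future state on their own; `imagine` needs the world model to turn the policy's actions into predicted states.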

If this is right

World models support synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification in manipulation pipelines.
These roles apply across pretraining, post-training, and inference adaptation stages.
Evaluation protocols must jointly assess predictive fidelity, task performance, and simulator reliability.
Open challenges remain in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.
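Two of the roles listed above, candidate filtering and search-based evaluation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a method taken from the survey: the additive `toy_world_model`, the workspace limit of 1.5, the goal at 1.0, and the hand-written candidate plans are all hypothetical.

```python
from typing import Callable, List, Sequence

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def predict_states(model: WorldModel, state: State, actions: Sequence[Action]) -> List[State]:
    """Predicted state after each action in the plan."""
    states = []
    for action in actions:
        state = model(state, action)
        states.append(state)
    return states


def filter_candidates(model, state, candidates, is_safe):
    """Candidate filtering: keep plans whose predicted states are all safe."""
    return [
        plan for plan in candidates
        if all(is_safe(s) for s in predict_states(model, state, plan))
    ]


def select_plan(model, state, candidates, score):
    """Search-based evaluation: pick the plan with the best predicted final state."""
    return max(candidates, key=lambda plan: score(predict_states(model, state, plan)[-1]))


def toy_world_model(state: State, action: Action) -> State:
    """Assumed additive dynamics for illustration only."""
    return state + action


# Hypothetical candidate plans; [2.0, -1.0] ends at the goal but overshoots on the way.
candidates = [[0.5, 0.5], [2.0, -1.0], [0.25, 0.25], [-0.5, 0.5]]
safe = filter_candidates(toy_world_model, 0.0, candidates, is_safe=lambda s: abs(s) <= 1.5)
best = select_plan(toy_world_model, 0.0, safe, score=lambda s: -abs(1.0 - s))
```

The plan `[2.0, -1.0]` reaches the goal but passes through a predicted state outside the assumed workspace, so the filter removes it before scoring. A learned model would replace `toy_world_model`, and its prediction errors would then carry over into both steps.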

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The infrastructure framing implies world models could reduce dependence on physical data collection by enabling more reliable simulation-based training.
Persistent contact modeling gaps point to potential value in hybrid physics-informed and learned predictors for manipulation.
The taxonomy may serve as a guide for integrating predictive components into emerging vision-language-action architectures.

Load-bearing premise

The assumption that the operational definition cleanly separates world models from perception modules, inverse models, policies, rewards, and value functions, and that the five representation families plus functional taxonomy provide exhaustive coverage of the literature without significant omissions or overlaps.

What would settle it

Identification of a substantial body of manipulation literature containing predictive modules that cannot be classified into any of the five representation families or that blur the separation from policies and rewards would falsify the taxonomy's claimed exhaustiveness.

Figures

Figures reproduced from arXiv: 2606.00113 by Canxi Liang, David Navarro-Alarcon, Fangyuan Wang, Guorui Pei, Hongmin Wu, Jiaming Qi, Jia Pan, Jinsong Wu, Jun Hu, Mengshi Zhang, Ning Han, Pai Zheng, Peng Zhou, Shiyao Zhang, Sichao Liu, Zeqing Zhang, Zhongxuan Li, Ziyuan Wang.

**Figure 1.** A world model predicts task-relevant future evolution of the world, usually conditioned on observations and robot actions. We organize the literature …

**Figure 2.** Representation spectrum of world models. The five families are ordered by increasing structured inductive bias, from appearance-reconstructive image …

**Figure 3.** Functional taxonomy of direct prediction–action interfaces. (a–b) Integrated prediction–action models embed prediction inside the action-producing …

**Figure 4.** Five functional roles of infrastructure world models for robotic manipulation: synthetic experience generation, candidate-action filtering, search-based …

**Figure 5.** World models across the robot-learning lifecycle. During pretraining, predictive objectives learn reusable latent, video, or three-dimensional priors from …

**Figure 6.** Chronological overview of the 34 representative manipulation datasets in Table II, with the horizontal axis denoting first public release year and …

Original abstract

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey organizes world models for manipulation with a functional taxonomy and role mapping that could help coordinate the subfield, but its operational definition may not cleanly separate world models from policies in integrated VLA systems.

This survey stands out for organizing the scattered literature on world models in robotic manipulation around a clear set of questions: what is predicted, how it connects to action, and when it is used in the pipeline. The new functional taxonomy separating integrated prediction-action models from explicit planners, along with the mapping to infrastructure roles like synthetic experience and outcome verification, gives the field a way to think about these models beyond just task-specific predictors.

The paper does well in pulling together 34 manipulation datasets and outlining evaluation protocols that cover predictive fidelity, task performance, and simulator reliability. It also highlights practical open challenges in areas like contact modeling, hallucination control, action alignment, and benchmarking under closed-loop conditions. These elements make it a solid reference for anyone working in this area.

The soft spots are around the operational definition. It tries to cleanly separate world models from perception, inverse models, policies, rewards, and value functions, but in modern vision-language-action systems the predictive components are often tightly integrated with the policy. This could mean the taxonomy misses some overlaps or hybrids, like 3D-latent combinations or physics-informed video models. If the five representation families don't cover everything without gaps, the narrative about world models becoming predictive infrastructure rests on a partition that might not hold up across all the literature.

Overall, this is for researchers focused on robot learning and manipulation who need a map of existing approaches and where the gaps are. A reader interested in how predictive models fit into the broader learning pipeline will find it helpful.

It deserves a serious referee because the synthesis and taxonomy are substantial enough to warrant feedback on their completeness and accuracy, even if some boundaries need adjustment.

Referee Report

2 major / 1 minor

Summary. This survey organizes the literature on world models for robotic manipulation around three questions (what is predicted, how it connects to action, and when it is used in the learning pipeline). It proposes an operational definition of a world model as an action-conditioned predictive system that excludes perception modules, inverse models, policies, rewards, and value functions; classifies existing work into five representation families; introduces a functional taxonomy separating integrated prediction-action models from explicit predictive planners; maps infrastructure roles (synthetic experience, filtering, search, learned environments, verification) across pretraining/post-training/inference; reviews 34 manipulation datasets; and synthesizes evaluation protocols. The central claim is that world models are evolving from task-specific dynamics predictors into predictive infrastructure, while surfacing open challenges in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.

Significance. If the operational definition and five-family taxonomy prove exhaustive and non-overlapping, the paper would supply a much-needed organizing framework for a fragmented subfield, enabling clearer comparisons of design choices and highlighting concrete research gaps. The dataset review and role-mapping could directly inform benchmark construction and system architecture decisions in robot learning.

major comments (2)

[Abstract / §2] Abstract and §2 (operational definition): the claim that the definition cleanly separates world models from policies and value functions is load-bearing for the 'predictive infrastructure' narrative, yet the manuscript provides no concrete counter-examples or boundary cases from integrated VLA architectures (e.g., where the predictive head is jointly trained with the policy head). Without such disambiguation, the taxonomy risks artificial separation that does not reflect current practice.
[§3] §3 (five representation families and functional taxonomy): the exhaustiveness of the partition is asserted but not demonstrated via an explicit coverage table or omission analysis; hybrid 3D+latent or physics-informed video models that straddle multiple families are not addressed, which directly affects the completeness of the infrastructure-role mapping and the listed open challenges.

minor comments (1)

[Dataset review section] The abstract states that 34 datasets are reviewed, but the manuscript should include an explicit summary table (dataset name, size, manipulation tasks covered, world-model usage) to allow readers to assess coverage without reading every citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate planned revisions to improve clarity and completeness.

Referee: [Abstract / §2] Abstract and §2 (operational definition): the claim that the definition cleanly separates world models from policies and value functions is load-bearing for the 'predictive infrastructure' narrative, yet the manuscript provides no concrete counter-examples or boundary cases from integrated VLA architectures (e.g., where the predictive head is jointly trained with the policy head). Without such disambiguation, the taxonomy risks artificial separation that does not reflect current practice.

Authors: We agree that explicit boundary cases would strengthen the operational definition. In the revised manuscript we will expand §2 with concrete examples drawn from integrated VLA architectures (e.g., models in which a predictive head is jointly trained with a policy head), showing how the world-model component remains the action-conditioned predictive subsystem even under joint optimization. This addition will directly support the separation from policies and value functions while preserving the predictive-infrastructure framing. revision: yes
Referee: [§3] §3 (five representation families and functional taxonomy): the exhaustiveness of the partition is asserted but not demonstrated via an explicit coverage table or omission analysis; hybrid 3D+latent or physics-informed video models that straddle multiple families are not addressed, which directly affects the completeness of the infrastructure-role mapping and the listed open challenges.

Authors: We accept that an explicit demonstration of coverage is required. The revised §3 will include a coverage table mapping all surveyed works to the five families together with an omission analysis. We will also add a dedicated paragraph on hybrid models (3D+latent and physics-informed video predictors), revise the functional taxonomy and infrastructure-role mapping to accommodate them, and update the open-challenges section accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity: survey paper with external-citation foundation

This is a literature survey whose central claims rest on classification of external papers rather than any internal derivation, fitted parameter, or self-referential prediction. The operational definition and taxonomy are presented as organizing choices, not as results derived from prior equations or self-citations within the manuscript. No equations, uniqueness theorems, or ansatzes appear; all substantive statements are supported by citations to independent works. The paper is therefore self-contained against external benchmarks and receives the default non-finding for a review article.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The survey contributes an organizational framework rather than new entities or parameters; it rests on the domain assumption that literature can be partitioned by representation type and pipeline role.

axioms (1)

domain assumption: An operational definition of a world model as an action-conditioned predictive system cleanly distinguishes it from perception, inverse models, policies, rewards, and value functions.
Explicitly stated in the abstract as the basis for the entire taxonomy and role mapping.

Reference graph

Works this paper leans on

153 extracted references · 96 canonical work pages · 31 internal anchors

[1]

Internal world models and supervised learning,

M. I. Jordan and D. E. Rumelhart, “Internal world models and supervised learning,” inMachine Learning Proceedings 1991. Elsevier, 1991, pp. 70–74. [Online]. Available: https://doi.org/10.1016/ B978-1-55860-200-7.50018-0

1991
[2]

An internal model for sensorimotor integration,

D. M. Wolpert, Z. Ghahramani, and M. I. Jordan, “An internal model for sensorimotor integration,”Science, vol. 269, no. 5232, pp. 1880–1882,
[3]

Available: https://doi.org/10.1126/science.7569931

[Online]. Available: https://doi.org/10.1126/science.7569931

work page doi:10.1126/science.7569931
[4]

Dyna, an integrated architecture for learning, planning, and reacting,

R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,”ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991. [Online]. Available: https://doi.org/10.1145/122344.122377

work page doi:10.1145/122344.122377 1991
[5]

Learning to control a low-cost manipulator using data-efficient reinforcement learning,

M. Deisenroth, C. Rasmussen, and D. Fox, “Learning to control a low-cost manipulator using data-efficient reinforcement learning,” inRobotics: Science and Systems VII, 2011. [Online]. Available: https://doi.org/10.15607/RSS.2011.VII.008

work page doi:10.15607/rss.2011.vii.008 2011
[6]

Recurrent world models facilitate policy evolution,

D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/ 2018/file/2de5d16682c3c3...

2018
[7]

Learning latent dynamics for planning from pixels,

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” inInternational Conference on Machine Learning, 2019. [Online]. Available: https://proceedings.mlr.press/v97/hafner19a.html

2019
[8]

Dream to control: Learning behaviors by latent imagination,

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=S1lOTC4tDS

2020
[9]

Mastering diverse control tasks through world models,

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse control tasks through world models,”Nature, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-08744-2

work page doi:10.1038/s41586-025-08744-2 2025
[10]

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Y . Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang, “Predictive inverse dynamics models are scalable learners for robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15109

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

WorldVLA: Towards Autoregressive Action World Model

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen, “WorldVLA: Towards autoregressive action world model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21539

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,

Z. Jiang, K. Liu, Y . Qin, S. Tian, Y . Zheng, M. Zhou, C. Yu, H. Li, and D. Zhao, “World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.19080

work page arXiv 2025
[13]

Worldgym: World model as an environment for policy evaluation,

J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang, “Worldgym: World model as an environment for policy evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.00613

work page arXiv 2025
[14]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, L. Chen, S. Yan, M. Yao, and G. Ren, “Genie envisioner: A unified world foundation platform for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2508.05635

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Tesseract: Learning 4d embodied world models.arXiv preprint arXiv:2504.20995, 2025

H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan, “TesserAct: Learning 4d embodied world models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20995

work page arXiv 2025
[16]

Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,

W. Huang, Y .-W. Chao, A. Mousavian, M.-Y . Liu, D. Fox, K. Mo, and L. Fei-Fei, “Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,” 2026. [Online]. Available: https://arxiv.org/abs/2601.03782

work page arXiv 2026
[17]

Is Sora a world simulator? A comprehensive survey on general world models and beyond,

Z. Zhu, X. Wang, W. Zhao, C. Min, B. Li, N. Deng, M. Dou, Y . Wang, B. Shi, K. Wang, C. Zhang, Y . You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is Sora a world simulator? A comprehensive survey on general world models and beyond,” 2024

2024
[18]

A survey: Learning embodied intelligence from physical simulators and world models,

X. Long, Q. Zhao, K. Zhang, Z. Zhang, D. Wang, Y . Liu, Z. Shu, Y . Lu, S. Wang, X. Wei, W. Li, W. Yin, Y . Yao, J. Pan, Q. Shen, R. Yang, X. Cao, and Q. Dai, “A survey: Learning embodied intelligence from physical simulators and world models,” 2025. 22

2025
[19]

3D and 4D world modeling: A survey,

L. Kong, W. Yang, J. Mei, Y . Liu, A. Liang, D. Zhu, D. Lu, W. Yin, X. Hu, M. Jia, J. Deng, K. Zhang, Y . Wu, T. Yan, S. Gao, S. Wang, L. Li, L. Pan, Y . Liu, J. Zhu, W. T. Ooi, S. C. H. Hoi, and Z. Liu, “3D and 4D world modeling: A survey,” 2025

2025
[20]

A step toward world models: A survey on robotic manipulation,

P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,” 2025

2025
[21]

Towards generalist embodied ai: A survey on world models for vla agents,

W. Tan, L. Zhu, B. Wang, E. Xie, B. Ji, Z. Lin, W. Yang, J. Li, and H. T. Shen, “Towards generalist embodied ai: A survey on world models for vla agents,” 2026, techRxiv preprint. [Online]. Available: https: //www.techrxiv.org/doi/full/10.36227/techrxiv.176948355.54623875/v1

work page doi:10.36227/techrxiv.176948355.54623875/v1 2026
[22]

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

R. Shao, W. Li, L. Zhang, R. Zhang, Z. Liu, R. Chen, and L. Nie, “Large vlm-based vision-language-action models for robotic manipulation: A survey,” 2025. [Online]. Available: https://arxiv.org/abs/2508.13073

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

A Survey on Vision-Language-Action Models for Embodied AI

Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision- language-action models for embodied AI,”IEEE Transactions on Neural Networks and Learning Systems, 2026, early Access; arXiv:2405.14093

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Forward models: Supervised learning with a distal teacher,

M. I. Jordan and D. E. Rumelhart, “Forward models: Supervised learning with a distal teacher,”Cognitive Science, vol. 16, no. 3, pp. 307–354,
[25]

Available: https://doi.org/10.1207/s15516709cog1603_1

[Online]. Available: https://doi.org/10.1207/s15516709cog1603_1

work page doi:10.1207/s15516709cog1603_1
[26]

Chris Miall and Daniel M

R. C. Miall and D. M. Wolpert, “Forward models for physiological motor control,”Neural Networks, vol. 9, no. 8, pp. 1265–1279, 1996. [Online]. Available: https://doi.org/10.1016/S0893-6080(96)00035-4

work page doi:10.1016/s0893-6080(96)00035-4 1996
[27]

Multiple paired forward and inverse models for motor control,

D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,”Neural Networks, vol. 11, no. 7–8, pp. 1317–1329, 1998. [Online]. Available: https://doi.org/10.1016/ S0893-6080(98)00066-5

1998
[28]

Learning universal policies via text- guided video generation,

Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel, “Learning universal policies via text- guided video generation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[29]

Zero-shot robotic manipulation with pretrained image- editing diffusion models,

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image- editing diffusion models,” inInternational Conference on Learning Representations (ICLR), 2024

2024
[30]

Video language planning,

Y . Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson, “Video language planning,” inInternational Conference on Learning Representations (ICLR), 2024

2024
[31]

Unleashing large-scale video generative pre-training for visual robot manipulation,

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 10 641–10 662

2024
[32]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu, “GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06158

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Linet al., “Dreamgen: Unlocking generaliza- tion in robot learning through video world models,”arXiv preprint arXiv:2505.12705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Learning real-world action- video dynamics with heterogeneous masked autoregression,

L. Wang, K. Zhao, C. Liu, and X. Chen, “Learning real-world action- video dynamics with heterogeneous masked autoregression,”arXiv preprint arXiv:2502.04296, 2025

work page arXiv 2025
[35]

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finnet al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[36]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas, “V-jepa 2: Self-supe...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

T. Yuan, Z. Dong, Y . Liu, and H. Zhao, “Fast-wam: Do world action models need test-time future imagination?” 2026. [Online]. Available: https://arxiv.org/abs/2603.16666

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

Last-vla: Thinking in latent spatio- temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Y . Luo, F. Li, S. Xu, Y . Ji, Z. Zhang, B. Wang, Y . Shen, J. Cui, L. Chen, G. Chen, H. Ye, Z.-X. Yang, and F. Wen, “Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving,” 2026. [Online]. Available: https://arxiv.org/abs/2603.01928

work page arXiv 2026
[39]

Chain of world: World model thinking in latent motion,

F. Yang, D. Di, L. Tang, X. Zhang, L. Fan, H. Li, C. Wei, T. Su, and B. Ma, “Chain of world: World model thinking in latent motion,”
[40]

Available: https://arxiv.org/abs/2603.03195

[Online]. Available: https://arxiv.org/abs/2603.03195

work page arXiv
[41]

Atomvla: Scalable post-training for robotic manipulation via predictive latent world models.arXiv preprint arXiv:2603.08519, 2026

X. Sun, Z. Xu, C. Cao, Z. Liu, Y . Sun, J. Pang, R. Zhang, Z. Yang, K. Pang, D. He, M. Yuan, and J. Chen, “Atomvla: Scalable post-training for robotic manipulation via predictive latent world models,” 2026. [Online]. Available: https://arxiv.org/abs/2603.08519

work page arXiv 2026
[42]

Flip: Flow-centric generative planning as general-purpose manipulation world model,

C. Gao, H. Zhang, Z. Xu, C. Zhehao, and L. Shao, “Flip: Flow-centric generative planning as general-purpose manipulation world model,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 21 927–21 948

2025
[43]

FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,

Z. Zhong, H. Yan, J. Li, X. Liu, X. Gong, T. Zhang, W. Song, J. Chen, X. Zheng, H. Wang, and H. Li, “FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,”
[44]

Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025

[Online]. Available: https://arxiv.org/abs/2508.18269

work page arXiv
[45]

3d-vla: a 3d vision-language-action generative world model,

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan, “3d-vla: a 3d vision-language-action generative world model,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024
[46]

OG-VLA: Orthographic image generation for 3d- aware vision-language action model,

I. Singh, A. Goyal, S. Birchfield, D. Fox, A. Garg, and V . Blukis, “OG-VLA: Orthographic image generation for 3d- aware vision-language action model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.01196

work page arXiv 2025
[47]

3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,

V . Bhat, Y .-H. Lan, P. Krishnamurthy, R. Karri, and F. Khorrami, “3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05800

work page arXiv 2025
[48]

Wristworld: Generating wrist-views via 4d world models for robotic manipulation,

Z. Qian, X. Chi, Y. Li, S. Wang, Z. Qin, X. Ju, S. Han, and S. Zhang, “Wristworld: Generating wrist-views via 4d world models for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.07313

2025
[49]

Gwm: Towards scalable gaussian world models for robotic manipulation,

G. Lu, B. Jia, P. Li, Y. Chen, Z. Wang, Y. Tang, and S. Huang, “Gwm: Towards scalable gaussian world models for robotic manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9263–9274

2025
[50]

PIN-WM: Learning physics-informed world models for non-prehensile manipulation,

W. Li, H. Zhao, Z. Yu, Y. Du, Q. Zou, R. Hu, and K. Xu, “PIN-WM: Learning physics-informed world models for non-prehensile manipulation,” in Proceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, 2025

2025
[51]

CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y. Liu, D. Xiang, G. Wetzstein, and T.-Y. Lin, “CoT-VLA: Visual chain-of-thought reasoning for vision-language-action models,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1715–1726. [Online]. Available: https://do...

doi:10.1109/cvpr52734.2025.00166 2025
[52]

RynnVLA-002: A Unified Vision-Language-Action and World Model

J. Cen, S. Huang, Y. Yuan, K. Li, H. Yuan, C. Yu, Y. Jiang, J. Guo, X. Li, H. Luo, F. Wang, D. Zhao, and H. Chen, “RynnVLA-002: A unified vision-language-action and world model,” 2025. [Online]. Available: https://arxiv.org/abs/2511.17502

2025
[53]

Physical autoregressive model for robotic manipulation without action pretraining,

Z. Song, S. Qin, T. Chen, L. Lin, and G. Wang, “Physical autoregressive model for robotic manipulation without action pretraining,” 2025. [Online]. Available: https://arxiv.org/abs/2508.09822

2025
[54]

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

J. Won, K. Lee, H. Jang, D. Kim, and J. Shin, “Dual-stream diffusion for world-model augmented vision-language-action model,” 2025. [Online]. Available: https://arxiv.org/abs/2510.27607

2025
[55]

World Action Models are Zero-shot Policies

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y. Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y. Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y. Du, Y. Chebotar, S. Reed, J. Kautz, Y. Zhu, L. J. Fan, and J. Jang, “World action mo...

2026
[56]

Do World Action Models Generalize Better than VLAs? A Robustness Study

Z. Zhang, Z. Li, B. Rahmati, R. H. Yang, Y. Ma, A. Rasouli, S. Pakdamansavoji, Y. Wu, L. Zhang, T. Cao, F. Wen, X. Wang, X. Quan, and Y. Zhang, “Do world action models generalize better than VLAs? A robustness study,” 2026. [Online]. Available: https://arxiv.org/abs/2603.22078

2026
[57]

Flare: Robot learning with implicit world modeling,

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y. Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, A. Narayan, Y. L. Tan, G. Wang, Q. Wang, J. Xiang, Y. Xu, S. Ye, J. Kautz, F. Huang, Y. Zhu, and L. Fan, “Flare: Robot learning with implicit world modeling,” in Proceedings of The 9th Conference on Robot Learning, ser. Proceedings of Machine Learning Resear...

2025
[58]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge,

W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang et al., “Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge,” Advances in Neural Information Processing Systems, vol. 38, pp. 24 195–24 228, 2026

2026
[59]

VLA-JEPA: Enhancing vision-language-action model with latent world model,

J. Sun, W. Zhang, Z. Qi, S. Ren, Z. Liu, H. Zhu, G. Sun, X. Jin, and Z. Chen, “VLA-JEPA: Enhancing vision-language-action model with latent world model,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10098

2026
[60]

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Y. Chen, Y. Ge, H. Zhou, M. Ding, Y. Ge, and X. Liu, “DIAL: Decoupling intent and action via latent world modeling for end-to-end vla,” 2026. [Online]. Available: https://arxiv.org/abs/2603.29844

2026
[61]

UP-VLA: A unified understanding and prediction model for embodied agent,

J. Zhang, Y. Guo, Y. Hu, X. Chen, X. Zhu, and J. Chen, “UP-VLA: A unified understanding and prediction model for embodied agent,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=V7JPraxi5j

2025
[62]

Daydreamer: World models for physical robot learning,

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec 2023, pp. 2226–2240. [Online]. Available: https://proceedings.mlr.press/v205/wu23c.html

2023
[64]

Multi-view masked world models for visual robotic manipulation,

Y. Seo, J. Kim, S. James, K. Lee, J. Shin, and P. Abbeel, “Multi-view masked world models for visual robotic manipulation,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023,...

2023
[65]

Learning view-invariant world models for visual robotic manipulation,

J.-C. Pang, N. Tang, K. Li, Y. Tang, X.-Q. Cai, Z.-Y. Zhang, G. Niu, M. Sugiyama, and Y. Yu, “Learning view-invariant world models for visual robotic manipulation,” in International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 54 853–54 876. [Online]. Available: https://proceedings.iclr...

2025
[66]

Ladi-WM: A latent diffusion-based world model for predictive manipulation,

Y. Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu, “Ladi-WM: A latent diffusion-based world model for predictive manipulation,” in 9th Annual Conference on Robot Learning, 2025. [Online]. Available: https://openreview.net/forum?id=o2w2iiMyEU

2025
[67]

Lumos: Language-conditioned imitation learning with world models,

I. Nematollahi, B. DeMoss, A. L. Chandra, N. Hawes, W. Burgard, and I. Posner, “Lumos: Language-conditioned imitation learning with world models,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 8219–8225

2025
[68]

Reward-free world models for online imitation learning,

S. Li, Z. Huang, and H. Su, “Reward-free world models for online imitation learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=owEhpoKBKC

2025
[69]

Focus: object-centric world models for robotic manipulation,

S. Ferraro, P. Mazzaglia, T. Verbelen, and B. Dhoedt, “Focus: object-centric world models for robotic manipulation,” Frontiers in Neurorobotics, vol. 19, 2025. [Online]. Available: https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2025.1585386

2025
[70]

Leveraging separated world model for exploration in visually distracted environments,

K. Huang, S. Wan, M. Shao, H.-H. Sun, L. Gan, S. Feng, and D.-C. Zhan, “Leveraging separated world model for exploration in visually distracted environments,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 82 350–82 37...

2024
[71]

Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy,

P. Li, H. Wu, Y. Huang, C. Cheang, L. Wang, and T. Kong, “Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy,” IEEE Robotics and Automation Letters, vol. 10, no. 2, pp. 1912–1919, 2025

2025
[72]

Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

L. Heng, J. Xu, Y. Wang, X. Li, M. Cai, Y. Shen, J. Zhu, G. Ren, and H. Dong, “Imagine2act: Leveraging object-action motion consistency from imagined goals for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17125

2025
[73]

Mind: Learning a dual-system world model for real-time planning and implicit risk analysis,

X. Chi, K. Ge, J. Liu, S. Zhou, P. Jia, Z. He, Y. Liu, T. Li, L. Han, S. Han, S. Zhang, and Y. Guo, “Mind: Learning a dual-system world model for real-time planning and implicit risk analysis,” 2025. [Online]. Available: https://arxiv.org/abs/2506.18897

2025
[74]

Closed-loop visuomotor control with generative expectation for robotic manipulation,

Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, “Closed-loop visuomotor control with generative expectation for robotic manipulation,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=1ptdkwZbMG

2024
[75]

Eva: Aligning video world models with executable robot actions via inverse dynamics rewards,

R. Wang, Q. Liu, Y. Deng, G. Liu, Z. Liu, and K. Jia, “Eva: Aligning video world models with executable robot actions via inverse dynamics rewards,” 2026. [Online]. Available: https://arxiv.org/abs/2603.17808

2026
[76]

Video prediction policy: A generalist robot policy with predictive visual representations,

Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen, “Video prediction policy: A generalist robot policy with predictive visual representations,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=c0dhw1du33

2025
[77]

TD-MPC2: Scalable, robust world models for continuous control,

N. Hansen, H. Su, and X. Wang, “TD-MPC2: Scalable, robust world models for continuous control,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=Oxh5CstDJU

2024
[78]

Modem: Accelerating visual model-based reinforcement learning with demonstrations,

N. Hansen, Y. Lin, H. Su, X. Wang, V. Kumar, and A. Rajeswaran, “Modem: Accelerating visual model-based reinforcement learning with demonstrations,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=JdTnc9gjVfJ

2023
[79]

Modem-v2: Visuo-motor world models for real-world robot manipulation,

P. Lancaster, N. Hansen, A. Rajeswaran, and V. Kumar, “Modem-v2: Visuo-motor world models for real-world robot manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 7530–7537

2024
[80]

Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning,

A. L. Escoriza, N. Hansen, S. Tao, T. Mu, and H. Su, “Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=Bv7LUUYOiq

2025


[1] [1]

Internal world models and supervised learning,

M. I. Jordan and D. E. Rumelhart, “Internal world models and supervised learning,” inMachine Learning Proceedings 1991. Elsevier, 1991, pp. 70–74. [Online]. Available: https://doi.org/10.1016/ B978-1-55860-200-7.50018-0

1991

[2] [2]

An internal model for sensorimotor integration,

D. M. Wolpert, Z. Ghahramani, and M. I. Jordan, “An internal model for sensorimotor integration,”Science, vol. 269, no. 5232, pp. 1880–1882,

[3] [3]

Available: https://doi.org/10.1126/science.7569931

[Online]. Available: https://doi.org/10.1126/science.7569931

work page doi:10.1126/science.7569931

[4] [4]

Dyna, an integrated architecture for learning, planning, and reacting,

R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,”ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991. [Online]. Available: https://doi.org/10.1145/122344.122377

work page doi:10.1145/122344.122377 1991

[5] [5]

Learning to control a low-cost manipulator using data-efficient reinforcement learning,

M. Deisenroth, C. Rasmussen, and D. Fox, “Learning to control a low-cost manipulator using data-efficient reinforcement learning,” inRobotics: Science and Systems VII, 2011. [Online]. Available: https://doi.org/10.15607/RSS.2011.VII.008

work page doi:10.15607/rss.2011.vii.008 2011

[6] [6]

Recurrent world models facilitate policy evolution,

D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/ 2018/file/2de5d16682c3c3...

2018

[7] [7]

Learning latent dynamics for planning from pixels,

D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” inInternational Conference on Machine Learning, 2019. [Online]. Available: https://proceedings.mlr.press/v97/hafner19a.html

2019

[8] [8]

Dream to control: Learning behaviors by latent imagination,

D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=S1lOTC4tDS

2020

[9] [9]

Mastering diverse control tasks through world models,

D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse control tasks through world models,”Nature, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-08744-2

work page doi:10.1038/s41586-025-08744-2 2025

[10] [10]

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Y . Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang, “Predictive inverse dynamics models are scalable learners for robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15109

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

WorldVLA: Towards Autoregressive Action World Model

J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen, “WorldVLA: Towards autoregressive action world model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21539

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,

Z. Jiang, K. Liu, Y . Qin, S. Tian, Y . Zheng, M. Zhou, C. Yu, H. Li, and D. Zhao, “World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.19080

work page arXiv 2025

[13] [13]

Worldgym: World model as an environment for policy evaluation,

J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang, “Worldgym: World model as an environment for policy evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.00613

work page arXiv 2025

[14] [14]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, L. Chen, S. Yan, M. Yao, and G. Ren, “Genie envisioner: A unified world foundation platform for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2508.05635

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Tesseract: Learning 4d embodied world models.arXiv preprint arXiv:2504.20995, 2025

H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan, “TesserAct: Learning 4d embodied world models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20995

work page arXiv 2025

[16] [16]

Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,

W. Huang, Y .-W. Chao, A. Mousavian, M.-Y . Liu, D. Fox, K. Mo, and L. Fei-Fei, “Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,” 2026. [Online]. Available: https://arxiv.org/abs/2601.03782

work page arXiv 2026

[17] [17]

Is Sora a world simulator? A comprehensive survey on general world models and beyond,

Z. Zhu, X. Wang, W. Zhao, C. Min, B. Li, N. Deng, M. Dou, Y . Wang, B. Shi, K. Wang, C. Zhang, Y . You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is Sora a world simulator? A comprehensive survey on general world models and beyond,” 2024

2024

[18] [18]

A survey: Learning embodied intelligence from physical simulators and world models,

X. Long, Q. Zhao, K. Zhang, Z. Zhang, D. Wang, Y . Liu, Z. Shu, Y . Lu, S. Wang, X. Wei, W. Li, W. Yin, Y . Yao, J. Pan, Q. Shen, R. Yang, X. Cao, and Q. Dai, “A survey: Learning embodied intelligence from physical simulators and world models,” 2025. 22

2025

[19] [19]

3D and 4D world modeling: A survey,

L. Kong, W. Yang, J. Mei, Y . Liu, A. Liang, D. Zhu, D. Lu, W. Yin, X. Hu, M. Jia, J. Deng, K. Zhang, Y . Wu, T. Yan, S. Gao, S. Wang, L. Li, L. Pan, Y . Liu, J. Zhu, W. T. Ooi, S. C. H. Hoi, and Z. Liu, “3D and 4D world modeling: A survey,” 2025

2025

[20] [20]

A step toward world models: A survey on robotic manipulation,

P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,” 2025

2025

[21] [21]

Towards generalist embodied ai: A survey on world models for vla agents,

W. Tan, L. Zhu, B. Wang, E. Xie, B. Ji, Z. Lin, W. Yang, J. Li, and H. T. Shen, “Towards generalist embodied ai: A survey on world models for vla agents,” 2026, techRxiv preprint. [Online]. Available: https: //www.techrxiv.org/doi/full/10.36227/techrxiv.176948355.54623875/v1

work page doi:10.36227/techrxiv.176948355.54623875/v1 2026

[22] [22]

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

R. Shao, W. Li, L. Zhang, R. Zhang, Z. Liu, R. Chen, and L. Nie, “Large vlm-based vision-language-action models for robotic manipulation: A survey,” 2025. [Online]. Available: https://arxiv.org/abs/2508.13073

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

A Survey on Vision-Language-Action Models for Embodied AI

Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision- language-action models for embodied AI,”IEEE Transactions on Neural Networks and Learning Systems, 2026, early Access; arXiv:2405.14093

work page internal anchor Pith review Pith/arXiv arXiv 2026

[24] [24]

Forward models: Supervised learning with a distal teacher,

M. I. Jordan and D. E. Rumelhart, “Forward models: Supervised learning with a distal teacher,”Cognitive Science, vol. 16, no. 3, pp. 307–354,

[25] [25]

Available: https://doi.org/10.1207/s15516709cog1603_1

[Online]. Available: https://doi.org/10.1207/s15516709cog1603_1

work page doi:10.1207/s15516709cog1603_1

[26] [26]

Chris Miall and Daniel M

R. C. Miall and D. M. Wolpert, “Forward models for physiological motor control,”Neural Networks, vol. 9, no. 8, pp. 1265–1279, 1996. [Online]. Available: https://doi.org/10.1016/S0893-6080(96)00035-4

work page doi:10.1016/s0893-6080(96)00035-4 1996

[27] [27]

Multiple paired forward and inverse models for motor control,

D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,”Neural Networks, vol. 11, no. 7–8, pp. 1317–1329, 1998. [Online]. Available: https://doi.org/10.1016/ S0893-6080(98)00066-5

1998

[28] [28]

Learning universal policies via text- guided video generation,

Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel, “Learning universal policies via text- guided video generation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[29] [29]

Zero-shot robotic manipulation with pretrained image- editing diffusion models,

K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image- editing diffusion models,” inInternational Conference on Learning Representations (ICLR), 2024

2024

[30] [30]

Video language planning,

Y . Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson, “Video language planning,” inInternational Conference on Learning Representations (ICLR), 2024

2024

[31] [31]

Unleashing large-scale video generative pre-training for visual robot manipulation,

H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 10 641–10 662

2024

[32] [32]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu, “GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06158

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Linet al., “Dreamgen: Unlocking generaliza- tion in robot learning through video world models,”arXiv preprint arXiv:2505.12705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Learning real-world action- video dynamics with heterogeneous masked autoregression,

L. Wang, K. Zhao, C. Liu, and X. Chen, “Learning real-world action- video dynamics with heterogeneous masked autoregression,”arXiv preprint arXiv:2502.04296, 2025

work page arXiv 2025

[35] [35]

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finnet al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[36] [36]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas, “V-jepa 2: Self-supe...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

T. Yuan, Z. Dong, Y . Liu, and H. Zhao, “Fast-wam: Do world action models need test-time future imagination?” 2026. [Online]. Available: https://arxiv.org/abs/2603.16666

work page internal anchor Pith review Pith/arXiv arXiv 2026

[38] [38]

Last-vla: Thinking in latent spatio- temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

Y . Luo, F. Li, S. Xu, Y . Ji, Z. Zhang, B. Wang, Y . Shen, J. Cui, L. Chen, G. Chen, H. Ye, Z.-X. Yang, and F. Wen, “Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving,” 2026. [Online]. Available: https://arxiv.org/abs/2603.01928

work page arXiv 2026

[39] [39]

Chain of world: World model thinking in latent motion,

F. Yang, D. Di, L. Tang, X. Zhang, L. Fan, H. Li, C. Wei, T. Su, and B. Ma, “Chain of world: World model thinking in latent motion,”

[40] [40]

Available: https://arxiv.org/abs/2603.03195

[Online]. Available: https://arxiv.org/abs/2603.03195

work page arXiv

[41] [41]

Atomvla: Scalable post-training for robotic manipulation via predictive latent world models.arXiv preprint arXiv:2603.08519, 2026

X. Sun, Z. Xu, C. Cao, Z. Liu, Y . Sun, J. Pang, R. Zhang, Z. Yang, K. Pang, D. He, M. Yuan, and J. Chen, “Atomvla: Scalable post-training for robotic manipulation via predictive latent world models,” 2026. [Online]. Available: https://arxiv.org/abs/2603.08519

work page arXiv 2026

[42] [42]

Flip: Flow-centric generative planning as general-purpose manipulation world model,

C. Gao, H. Zhang, Z. Xu, C. Zhehao, and L. Shao, “Flip: Flow-centric generative planning as general-purpose manipulation world model,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 21 927–21 948

2025

[43] [43]

FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,

Z. Zhong, H. Yan, J. Li, X. Liu, X. Gong, T. Zhang, W. Song, J. Chen, X. Zheng, H. Wang, and H. Li, “FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,”

[44] [44]

Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025

[Online]. Available: https://arxiv.org/abs/2508.18269

work page arXiv

[45] [45]

3d-vla: a 3d vision-language-action generative world model,

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan, “3d-vla: a 3d vision-language-action generative world model,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

2024

[46] [46]

OG-VLA: Orthographic image generation for 3d- aware vision-language action model,

I. Singh, A. Goyal, S. Birchfield, D. Fox, A. Garg, and V . Blukis, “OG-VLA: Orthographic image generation for 3d- aware vision-language action model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.01196

work page arXiv 2025

[47] [47]

3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,

V . Bhat, Y .-H. Lan, P. Krishnamurthy, R. Karri, and F. Khorrami, “3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05800

work page arXiv 2025

[48] [48]

Wristworld: Generating wrist-views via 4d world models for robotic manipulation,

Z. Qian, X. Chi, Y . Li, S. Wang, Z. Qin, X. Ju, S. Han, and S. Zhang, “Wristworld: Generating wrist-views via 4d world models for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.07313

work page arXiv 2025

[49] [49]

Gwm: Towards scalable gaussian world models for robotic manipulation,

G. Lu, B. Jia, P. Li, Y . Chen, Z. Wang, Y . Tang, and S. Huang, “Gwm: Towards scalable gaussian world models for robotic manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9263–9274

2025

[50] [50]

PIN- WM: Learning physics-informed world models for non-prehensile manipulation,

W. Li, H. Zhao, Z. Yu, Y . Du, Q. Zou, R. Hu, and K. Xu, “PIN- WM: Learning physics-informed world models for non-prehensile manipulation,” inProceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, 2025

2025

[51] [51]

Showui: One vision-language- action model for GUI visual agent

Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin, “CoT-VLA: Visual chain-of-thought reasoning for vision- language-action models,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1715–1726. [Online]. Available: https://do...

work page doi:10.1109/cvpr52734.2025.00166 2025

[52] [52]

RynnVLA-002: A Unified Vision-Language-Action and World Model

J. Cen, S. Huang, Y . Yuan, K. Li, H. Yuan, C. Yu, Y . Jiang, J. Guo, X. Li, H. Luo, F. Wang, D. Zhao, and H. Chen, “RynnVLA-002: A unified vision-language-action and world model,” 2025. [Online]. Available: https://arxiv.org/abs/2511.17502

work page internal anchor Pith review Pith/arXiv arXiv 2025

[53] [53]

Physical autoregressive model for robotic manipulation without action pretraining,

Z. Song, S. Qin, T. Chen, L. Lin, and G. Wang, “Physical autoregressive model for robotic manipulation without action pretraining,” 2025. [Online]. Available: https://arxiv.org/abs/2508.09822

work page arXiv 2025

[54] [54]

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

J. Won, K. Lee, H. Jang, D. Kim, and J. Shin, “Dual-stream diffusion for world-model augmented vision-language-action model,” 2025. [Online]. Available: https://arxiv.org/abs/2510.27607

work page internal anchor Pith review Pith/arXiv arXiv 2025

[55] [55]

World Action Models are Zero-shot Policies

S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. J. Fan, and J. Jang, “World action mo...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[56] [56]

Do World Action Models Generalize Better than VLAs? A Robustness Study

Z. Zhang, Z. Li, B. Rahmati, R. H. Yang, Y . Ma, A. Rasouli, S. Pakdamansavoji, Y . Wu, L. Zhang, T. Cao, F. Wen, X. Wang, X. Quan, and Y . Zhang, “Do world action models generalize better than VLAs? a robustness study,” 2026. [Online]. Available: https://arxiv.org/abs/2603.22078

work page internal anchor Pith review Pith/arXiv arXiv 2026

[57] [57]

Flare: Robot learning with implicit world modeling,

R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, A. Narayan, Y . L. Tan, G. Wang, Q. Wang, J. Xiang, Y . Xu, S. Ye, J. Kautz, F. Huang, Y . Zhu, and L. Fan, “Flare: Robot learning with implicit world modeling,” in Proceedings of The 9th Conference on Robot Learning, ser. Proceedings of Machine Learning Resear...

2025

[58] [58]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge,

W. Zhang, H. Liu, Z. Qi, Y . Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhanget al., “Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge,”Advances in Neural Information Processing Systems, vol. 38, pp. 24 195–24 228, 2026

2026

[59] [59]

arXiv preprint arXiv:2602.10098 (2026)

J. Sun, W. Zhang, Z. Qi, S. Ren, Z. Liu, H. Zhu, G. Sun, X. Jin, and Z. Chen, “VLA-JEPA: Enhancing vision-language- 23 action model with latent world model,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10098

work page arXiv 2026

[60] [60]

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Y . Chen, Y . Ge, H. Zhou, M. Ding, Y . Ge, and X. Liu, “DIAL: Decoupling intent and action via latent world modeling for end-to-end vla,” 2026. [Online]. Available: https://arxiv.org/abs/2603.29844

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

UP-VLA: A unified understanding and prediction model for embodied agent,

J. Zhang, Y . Guo, Y . Hu, X. Chen, X. Zhu, and J. Chen, “UP-VLA: A unified understanding and prediction model for embodied agent,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=V7JPraxi5j

2025

[62]

Daydreamer: World models for physical robot learning

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec 2023, pp. 2226–2240. [Online]. Available: https://proceedings.mlr.press/v205/wu23c.html

2023

[64]

Multi-view masked world models for visual robotic manipulation

Y. Seo, J. Kim, S. James, K. Lee, J. Shin, and P. Abbeel, “Multi-view masked world models for visual robotic manipulation,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023,...

2023

[65]

Learning view-invariant world models for visual robotic manipulation

J.-C. Pang, N. Tang, K. Li, Y. Tang, X.-Q. Cai, Z.-Y. Zhang, G. Niu, M. Sugiyama, and Y. Yu, “Learning view-invariant world models for visual robotic manipulation,” in International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 54 853–54 876. [Online]. Available: https://proceedings.iclr...

2025

[66]

Ladi-WM: A latent diffusion-based world model for predictive manipulation

Y. Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu, “Ladi-WM: A latent diffusion-based world model for predictive manipulation,” in 9th Annual Conference on Robot Learning, 2025. [Online]. Available: https://openreview.net/forum?id=o2w2iiMyEU

2025

[67]

Lumos: Language-conditioned imitation learning with world models

I. Nematollahi, B. DeMoss, A. L. Chandra, N. Hawes, W. Burgard, and I. Posner, “Lumos: Language-conditioned imitation learning with world models,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 8219–8225

2025

[68]

Reward-free world models for online imitation learning

S. Li, Z. Huang, and H. Su, “Reward-free world models for online imitation learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=owEhpoKBKC

2025

[69]

Focus: object-centric world models for robotic manipulation

S. Ferraro, P. Mazzaglia, T. Verbelen, and B. Dhoedt, “Focus: object-centric world models for robotic manipulation,” Frontiers in Neurorobotics, vol. 19, 2025. [Online]. Available: https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2025.1585386

2025

[70]

Leveraging separated world model for exploration in visually distracted environments

K. Huang, S. Wan, M. Shao, H.-H. Sun, L. Gan, S. Feng, and D.-C. Zhan, “Leveraging separated world model for exploration in visually distracted environments,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 82 350–82 37...

2024

[71]

Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy

P. Li, H. Wu, Y. Huang, C. Cheang, L. Wang, and T. Kong, “Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy,” IEEE Robotics and Automation Letters, vol. 10, no. 2, pp. 1912–1919, 2025

2025

[72]

Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

L. Heng, J. Xu, Y. Wang, X. Li, M. Cai, Y. Shen, J. Zhu, G. Ren, and H. Dong, “Imagine2Act: Leveraging object-action motion consistency from imagined goals for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17125

2025

[73]

Mind: Learning a dual-system world model for real-time planning and implicit risk analysis

X. Chi, K. Ge, J. Liu, S. Zhou, P. Jia, Z. He, Y. Liu, T. Li, L. Han, S. Han, S. Zhang, and Y. Guo, “Mind: Learning a dual-system world model for real-time planning and implicit risk analysis,” 2025. [Online]. Available: https://arxiv.org/abs/2506.18897

2025

[74]

Closed-loop visuomotor control with generative expectation for robotic manipulation

Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, “Closed-loop visuomotor control with generative expectation for robotic manipulation,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=1ptdkwZbMG

2024

[75]

Eva: Aligning video world models with executable robot actions via inverse dynamics rewards

R. Wang, Q. Liu, Y. Deng, G. Liu, Z. Liu, and K. Jia, “Eva: Aligning video world models with executable robot actions via inverse dynamics rewards,” 2026. [Online]. Available: https://arxiv.org/abs/2603.17808

2026

[76]

Video prediction policy: A generalist robot policy with predictive visual representations

Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen, “Video prediction policy: A generalist robot policy with predictive visual representations,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=c0dhw1du33

2025

[77]

TD-MPC2: Scalable, robust world models for continuous control

N. Hansen, H. Su, and X. Wang, “TD-MPC2: Scalable, robust world models for continuous control,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=Oxh5CstDJU

2024

[78]

Modem: Accelerating visual model-based reinforcement learning with demonstrations

N. Hansen, Y. Lin, H. Su, X. Wang, V. Kumar, and A. Rajeswaran, “Modem: Accelerating visual model-based reinforcement learning with demonstrations,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=JdTnc9gjVfJ

2023

[79]

Modem-v2: Visuo-motor world models for real-world robot manipulation

P. Lancaster, N. Hansen, A. Rajeswaran, and V. Kumar, “Modem-v2: Visuo-motor world models for real-world robot manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 7530–7537

2024

[80]

Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning

A. L. Escoriza, N. Hansen, S. Tao, T. Mu, and H. Su, “Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=Bv7LUUYOiq

2025