arxiv: 2606.00113 · v1 · pith:7XYSTMDQnew · submitted 2026-05-27 · 💻 cs.RO

World Models for Robotic Manipulation: A Survey

Pith reviewed 2026-06-29 12:22 UTC · model grok-4.3

classification 💻 cs.RO
keywords world models · robotic manipulation · robot learning · dynamics prediction · action-conditioned prediction · predictive infrastructure · manipulation datasets · closed-loop evaluation

The pith

World models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how learned world models enable robots to anticipate how actions reshape objects, contacts, and scene geometry. It provides an operational definition of a world model as an action-conditioned predictive system, distinct from perception modules, inverse models, policies, rewards, and value functions. The work organizes the literature into five representation families and develops a functional taxonomy that separates integrated prediction-action models from explicit predictive planners. It characterizes infrastructure roles such as synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification, while mapping these across pretraining, post-training, and inference. The survey reviews 34 manipulation datasets and evaluation protocols, revealing the evolution toward infrastructure alongside open challenges in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.

Core claim

The survey defines world models as action-conditioned predictive systems and classifies existing work into five representation families with a functional taxonomy. On that basis it argues that these models are shifting from specialized, task-specific dynamics predictors to general predictive infrastructure that supports synthetic experience, planning, verification, and adaptation in robotic manipulation. It also exposes persistent gaps in contact-rich modeling, hallucination mitigation, action alignment, and reliable closed-loop evaluation.

What carries the argument

The operational definition of a world model as an action-conditioned predictive system, combined with the five representation families and the functional taxonomy that distinguishes integrated prediction-action models from explicit predictive planners.
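
The separation this definition draws can be sketched as Python type signatures. This is a minimal illustration, not the paper's formalism: the type aliases, the toy functions, and the displacement dynamics are all assumptions introduced here. Only the distinction itself follows the definition above: a world model maps an observation and an action to a predicted future, while the other components take different inputs or return different outputs.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical placeholder types; real systems would use images, point clouds, or latents.
Observation = Tuple[float, ...]
Action = Tuple[float, ...]

# World model: (current observation, action) -> predicted future observation.
WorldModel = Callable[[Observation, Action], Observation]
# Policy: observation -> action. It chooses what to do; it does not predict outcomes.
Policy = Callable[[Observation], Action]
# Inverse model: (observation, desired next observation) -> action linking them.
InverseModel = Callable[[Observation, Observation], Action]
# Reward and value functions: return scalar scores, not future observations.
RewardFn = Callable[[Observation, Action], float]
ValueFn = Callable[[Observation], float]


def toy_world_model(obs: Observation, action: Action) -> Observation:
    """Toy dynamics (assumption): the action is a displacement added to the state."""
    return tuple(o + a for o, a in zip(obs, action))


def toy_inverse_model(obs: Observation, next_obs: Observation) -> Action:
    """Inverse of the toy dynamics: recover the displacement between two states."""
    return tuple(n - o for o, n in zip(obs, next_obs))


def rollout(model: WorldModel, obs: Observation,
            actions: Sequence[Action]) -> List[Observation]:
    """Action-conditioned prediction: apply the model once per action."""
    trajectory = [obs]
    for action in actions:
        obs = model(obs, action)
        trajectory.append(obs)
    return trajectory


predicted = rollout(toy_world_model, (0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)])
```

In this toy, `rollout` is the only function that produces future observations, which is the property the definition uses to set a world model apart from the other signatures.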

If this is right

  • World models support synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification in manipulation pipelines.
  • These roles apply across pretraining, post-training, and inference adaptation stages.
  • Evaluation protocols must jointly assess predictive fidelity, task performance, and simulator reliability.
  • Open challenges remain in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.
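
Two of the roles listed above, candidate filtering and search-based evaluation, can be sketched with a one-dimensional toy. Everything in the block is a hypothetical illustration rather than a method from the survey: the additive `toy_model`, the workspace `limit`, and the distance-to-goal score are assumptions chosen to keep the example short.

```python
from typing import Callable, List, Sequence

State = float
Action = float
WorldModel = Callable[[State, Action], State]


def toy_model(state: State, action: Action) -> State:
    """Toy dynamics (assumption): a push moves the object by the action amount."""
    return state + action


def predict_final(model: WorldModel, state: State, plan: Sequence[Action]) -> State:
    """Roll the model forward over a whole action sequence."""
    for action in plan:
        state = model(state, action)
    return state


def filter_candidates(model: WorldModel, state: State,
                      plans: Sequence[Sequence[Action]],
                      limit: float) -> List[Sequence[Action]]:
    """Candidate filtering: drop plans whose predicted states leave the workspace."""
    kept = []
    for plan in plans:
        current, feasible = state, True
        for action in plan:
            current = model(current, action)
            if abs(current) > limit:
                feasible = False
                break
        if feasible:
            kept.append(plan)
    return kept


def select_plan(model: WorldModel, state: State,
                plans: Sequence[Sequence[Action]], goal: State) -> Sequence[Action]:
    """Search-based evaluation: pick the plan whose predicted outcome is nearest the goal."""
    return min(plans, key=lambda plan: abs(predict_final(model, state, plan) - goal))


candidates = [(0.5, 0.5), (2.0, -1.0), (0.25, 0.25)]
feasible_plans = filter_candidates(toy_model, 0.0, candidates, limit=1.5)
best_plan = select_plan(toy_model, 0.0, feasible_plans, goal=1.0)
```

The second candidate reaches the goal in prediction but passes through a state outside the workspace, so filtering removes it before scoring. The ordering of the two steps is a design choice in this sketch, not a claim about how surveyed systems compose them.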

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The infrastructure framing implies world models could reduce dependence on physical data collection by enabling more reliable simulation-based training.
  • Persistent contact modeling gaps point to potential value in hybrid physics-informed and learned predictors for manipulation.
  • The taxonomy may serve as a guide for integrating predictive components into emerging vision-language-action architectures.

Load-bearing premise

The assumption that the operational definition cleanly separates world models from perception modules, inverse models, policies, rewards, and value functions, and that the five representation families plus functional taxonomy provide exhaustive coverage of the literature without significant omissions or overlaps.

What would settle it

Identification of a substantial body of manipulation literature containing predictive modules that cannot be classified into any of the five representation families or that blur the separation from policies and rewards would falsify the taxonomy's claimed exhaustiveness.
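
A check of this kind could be run mechanically once each surveyed work is tagged with the families it belongs to. The sketch below assumes such a tagging exists. The family labels (`family_1` to `family_5`) and the work names are placeholders, since this page does not list the five families by name.

```python
from typing import Dict, List, Set, Tuple

# Placeholder labels (assumption): the five families are not named on this page.
FAMILIES: Set[str] = {"family_1", "family_2", "family_3", "family_4", "family_5"}


def audit_coverage(assignments: Dict[str, Set[str]],
                   families: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (unclassified works, works assigned to more than one family)."""
    unclassified = sorted(
        work for work, tags in assignments.items() if not (tags & families)
    )
    overlapping = sorted(
        work for work, tags in assignments.items() if len(tags & families) > 1
    )
    return unclassified, overlapping


# Hypothetical tagging of three works, for illustration only.
example = {
    "work_a": {"family_1"},
    "work_b": {"family_2", "family_4"},  # a hybrid that straddles two families
    "work_c": set(),                     # fits none of the families
}
unclassified, overlapping = audit_coverage(example, FAMILIES)
```

A long `unclassified` list would count against exhaustiveness, and a long `overlapping` list would count against a clean partition.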

Figures

Figures reproduced from arXiv: 2606.00113 by Canxi Liang, David Navarro-Alarcon, Fangyuan Wang, Guorui Pei, Hongmin Wu, Jiaming Qi, Jia Pan, Jinsong Wu, Jun Hu, Mengshi Zhang, Ning Han, Pai Zheng, Peng Zhou, Shiyao Zhang, Sichao Liu, Zeqing Zhang, Zhongxuan Li, Ziyuan Wang.

Figure 1: A world model predicts task-relevant future evolution of the world, usually conditioned on observations and robot actions. We organize the literature …
Figure 2: Representation spectrum of world models. The five families are ordered by increasing structured inductive bias, from appearance-reconstructive image …
Figure 3: Functional taxonomy of direct prediction–action interfaces. (a–b) Integrated prediction–action models embed prediction inside the action-producing …
Figure 4: Five functional roles of infrastructure world models for robotic manipulation: synthetic experience generation, candidate-action filtering, search-based …
Figure 5: World models across the robot-learning lifecycle. During pretraining, predictive objectives learn reusable latent, video, or three-dimensional priors from …
Figure 6: Chronological overview of the 34 representative manipulation datasets in Table II, with the horizontal axis denoting first public release year and …
Original abstract

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This survey organizes the literature on world models for robotic manipulation around three questions (what is predicted, how it connects to action, and when it is used in the learning pipeline). It proposes an operational definition of a world model as an action-conditioned predictive system that excludes perception modules, inverse models, policies, rewards, and value functions; classifies existing work into five representation families; introduces a functional taxonomy separating integrated prediction-action models from explicit predictive planners; maps infrastructure roles (synthetic experience, filtering, search, learned environments, verification) across pretraining/post-training/inference; reviews 34 manipulation datasets; and synthesizes evaluation protocols. The central claim is that world models are evolving from task-specific dynamics predictors into predictive infrastructure, while surfacing open challenges in contact modeling, hallucination control, action alignment, and closed-loop benchmarking.

Significance. If the operational definition and five-family taxonomy prove exhaustive and non-overlapping, the paper would supply a much-needed organizing framework for a fragmented subfield, enabling clearer comparisons of design choices and highlighting concrete research gaps. The dataset review and role-mapping could directly inform benchmark construction and system architecture decisions in robot learning.

major comments (2)
  1. [Abstract / §2] Abstract and §2 (operational definition): the claim that the definition cleanly separates world models from policies and value functions is load-bearing for the 'predictive infrastructure' narrative, yet the manuscript provides no concrete counter-examples or boundary cases from integrated VLA architectures (e.g., where the predictive head is jointly trained with the policy head). Without such disambiguation, the taxonomy risks artificial separation that does not reflect current practice.
  2. [§3] §3 (five representation families and functional taxonomy): the exhaustiveness of the partition is asserted but not demonstrated via an explicit coverage table or omission analysis; hybrid 3D+latent or physics-informed video models that straddle multiple families are not addressed, which directly affects the completeness of the infrastructure-role mapping and the listed open challenges.
minor comments (1)
  1. [Dataset review section] The abstract states that 34 datasets are reviewed, but the manuscript should include an explicit summary table (dataset name, size, manipulation tasks covered, world-model usage) to allow readers to assess coverage without reading every citation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below and indicate planned revisions to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Abstract / §2] Abstract and §2 (operational definition): the claim that the definition cleanly separates world models from policies and value functions is load-bearing for the 'predictive infrastructure' narrative, yet the manuscript provides no concrete counter-examples or boundary cases from integrated VLA architectures (e.g., where the predictive head is jointly trained with the policy head). Without such disambiguation, the taxonomy risks artificial separation that does not reflect current practice.

    Authors: We agree that explicit boundary cases would strengthen the operational definition. In the revised manuscript we will expand §2 with concrete examples drawn from integrated VLA architectures (e.g., models in which a predictive head is jointly trained with a policy head), showing how the world-model component remains the action-conditioned predictive subsystem even under joint optimization. This addition will directly support the separation from policies and value functions while preserving the predictive-infrastructure framing.
    Revision: yes

  2. Referee: [§3] §3 (five representation families and functional taxonomy): the exhaustiveness of the partition is asserted but not demonstrated via an explicit coverage table or omission analysis; hybrid 3D+latent or physics-informed video models that straddle multiple families are not addressed, which directly affects the completeness of the infrastructure-role mapping and the listed open challenges.

    Authors: We accept that an explicit demonstration of coverage is required. The revised §3 will include a coverage table mapping all surveyed works to the five families together with an omission analysis. We will also add a dedicated paragraph on hybrid models (3D+latent and physics-informed video predictors), revise the functional taxonomy and infrastructure-role mapping to accommodate them, and update the open-challenges section accordingly.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: survey paper with external-citation foundation

This is a literature survey whose central claims rest on classification of external papers rather than any internal derivation, fitted parameter, or self-referential prediction. The operational definition and taxonomy are presented as organizing choices, not as results derived from prior equations or self-citations within the manuscript. No equations, uniqueness theorems, or ansatzes appear; all substantive statements are supported by citations to independent works. The paper is therefore self-contained against external benchmarks and receives the default non-finding for a review article.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey contributes an organizational framework rather than new entities or parameters; it rests on the domain assumption that literature can be partitioned by representation type and pipeline role.

axioms (1)
  • Domain assumption: an operational definition of a world model as an action-conditioned predictive system cleanly distinguishes it from perception, inverse models, policies, rewards, and value functions.
    Explicitly stated in the abstract as the basis for the entire taxonomy and role mapping.

Reference graph

Works this paper leans on

153 extracted references · 96 canonical work pages · 31 internal anchors

  1. [1]

    Internal world models and supervised learning,

    M. I. Jordan and D. E. Rumelhart, “Internal world models and supervised learning,” inMachine Learning Proceedings 1991. Elsevier, 1991, pp. 70–74. [Online]. Available: https://doi.org/10.1016/ B978-1-55860-200-7.50018-0

  2. [2]

    An internal model for sensorimotor integration,

    D. M. Wolpert, Z. Ghahramani, and M. I. Jordan, “An internal model for sensorimotor integration,”Science, vol. 269, no. 5232, pp. 1880–1882,

  3. [3]

    Available: https://doi.org/10.1126/science.7569931

    [Online]. Available: https://doi.org/10.1126/science.7569931

  4. [4]

    Dyna, an integrated architecture for learning, planning, and reacting,

    R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,”ACM SIGART Bulletin, vol. 2, no. 4, pp. 160–163, 1991. [Online]. Available: https://doi.org/10.1145/122344.122377

  5. [5]

    Learning to control a low-cost manipulator using data-efficient reinforcement learning,

    M. Deisenroth, C. Rasmussen, and D. Fox, “Learning to control a low-cost manipulator using data-efficient reinforcement learning,” inRobotics: Science and Systems VII, 2011. [Online]. Available: https://doi.org/10.15607/RSS.2011.VII.008

  6. [6]

    Recurrent world models facilitate policy evolution,

    D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/ 2018/file/2de5d16682c3c3...

  7. [7]

    Learning latent dynamics for planning from pixels,

    D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” inInternational Conference on Machine Learning, 2019. [Online]. Available: https://proceedings.mlr.press/v97/hafner19a.html

  8. [8]

    Dream to control: Learning behaviors by latent imagination,

    D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=S1lOTC4tDS

  9. [9]

    Mastering diverse control tasks through world models,

    D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse control tasks through world models,”Nature, 2025. [Online]. Available: https://doi.org/10.1038/s41586-025-08744-2

  10. [10]

    Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

    Y . Tian, S. Yang, J. Zeng, P. Wang, D. Lin, H. Dong, and J. Pang, “Predictive inverse dynamics models are scalable learners for robotic manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15109

  11. [11]

    WorldVLA: Towards Autoregressive Action World Model

    J. Cen, C. Yu, H. Yuan, Y . Jiang, S. Huang, J. Guo, X. Li, Y . Song, H. Luo, F. Wang, D. Zhao, and H. Chen, “WorldVLA: Towards autoregressive action world model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.21539

  12. [12]

    World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,

    Z. Jiang, K. Liu, Y . Qin, S. Tian, Y . Zheng, M. Zhou, C. Yu, H. Li, and D. Zhao, “World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.19080

  13. [13]

    Worldgym: World model as an environment for policy evaluation,

    J. Quevedo, A. K. Sharma, Y . Sun, V . Suryavanshi, P. Liang, and S. Yang, “Worldgym: World model as an environment for policy evaluation,” 2025. [Online]. Available: https://arxiv.org/abs/2506.00613

  14. [14]

    Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

    Y . Liao, P. Zhou, S. Huang, D. Yang, S. Chen, Y . Jiang, Y . Hu, J. Cai, S. Liu, J. Luo, L. Chen, S. Yan, M. Yao, and G. Ren, “Genie envisioner: A unified world foundation platform for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2508.05635

  15. [15]

    Tesseract: Learning 4d embodied world models.arXiv preprint arXiv:2504.20995, 2025

    H. Zhen, Q. Sun, H. Zhang, J. Li, S. Zhou, Y . Du, and C. Gan, “TesserAct: Learning 4d embodied world models,” 2025. [Online]. Available: https://arxiv.org/abs/2504.20995

  16. [16]

    Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,

    W. Huang, Y .-W. Chao, A. Mousavian, M.-Y . Liu, D. Fox, K. Mo, and L. Fei-Fei, “Pointworld: Scaling 3d world models for in-the-wild robotic manipulation,” 2026. [Online]. Available: https://arxiv.org/abs/2601.03782

  17. [17]

    Is Sora a world simulator? A comprehensive survey on general world models and beyond,

    Z. Zhu, X. Wang, W. Zhao, C. Min, B. Li, N. Deng, M. Dou, Y . Wang, B. Shi, K. Wang, C. Zhang, Y . You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is Sora a world simulator? A comprehensive survey on general world models and beyond,” 2024

  18. [18]

    A survey: Learning embodied intelligence from physical simulators and world models,

    X. Long, Q. Zhao, K. Zhang, Z. Zhang, D. Wang, Y . Liu, Z. Shu, Y . Lu, S. Wang, X. Wei, W. Li, W. Yin, Y . Yao, J. Pan, Q. Shen, R. Yang, X. Cao, and Q. Dai, “A survey: Learning embodied intelligence from physical simulators and world models,” 2025. 22

  19. [19]

    3D and 4D world modeling: A survey,

    L. Kong, W. Yang, J. Mei, Y . Liu, A. Liang, D. Zhu, D. Lu, W. Yin, X. Hu, M. Jia, J. Deng, K. Zhang, Y . Wu, T. Yan, S. Gao, S. Wang, L. Li, L. Pan, Y . Liu, J. Zhu, W. T. Ooi, S. C. H. Hoi, and Z. Liu, “3D and 4D world modeling: A survey,” 2025

  20. [20]

    A step toward world models: A survey on robotic manipulation,

    P.-F. Zhang, Y . Cheng, X. Sun, S. Wang, F. Li, L. Zhu, and H. T. Shen, “A step toward world models: A survey on robotic manipulation,” 2025

  21. [21]

    Towards generalist embodied ai: A survey on world models for vla agents,

    W. Tan, L. Zhu, B. Wang, E. Xie, B. Ji, Z. Lin, W. Yang, J. Li, and H. T. Shen, “Towards generalist embodied ai: A survey on world models for vla agents,” 2026, techRxiv preprint. [Online]. Available: https: //www.techrxiv.org/doi/full/10.36227/techrxiv.176948355.54623875/v1

  22. [22]

    Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    R. Shao, W. Li, L. Zhang, R. Zhang, Z. Liu, R. Chen, and L. Nie, “Large vlm-based vision-language-action models for robotic manipulation: A survey,” 2025. [Online]. Available: https://arxiv.org/abs/2508.13073

  23. [23]

    A Survey on Vision-Language-Action Models for Embodied AI

    Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision- language-action models for embodied AI,”IEEE Transactions on Neural Networks and Learning Systems, 2026, early Access; arXiv:2405.14093

  24. [24]

    Forward models: Supervised learning with a distal teacher,

    M. I. Jordan and D. E. Rumelhart, “Forward models: Supervised learning with a distal teacher,”Cognitive Science, vol. 16, no. 3, pp. 307–354,

  25. [25]

    Available: https://doi.org/10.1207/s15516709cog1603_1

    [Online]. Available: https://doi.org/10.1207/s15516709cog1603_1

  26. [26]

    Chris Miall and Daniel M

    R. C. Miall and D. M. Wolpert, “Forward models for physiological motor control,”Neural Networks, vol. 9, no. 8, pp. 1265–1279, 1996. [Online]. Available: https://doi.org/10.1016/S0893-6080(96)00035-4

  27. [27]

    Multiple paired forward and inverse models for motor control,

    D. M. Wolpert and M. Kawato, “Multiple paired forward and inverse models for motor control,”Neural Networks, vol. 11, no. 7–8, pp. 1317–1329, 1998. [Online]. Available: https://doi.org/10.1016/ S0893-6080(98)00066-5

  28. [28]

    Learning universal policies via text- guided video generation,

    Y . Du, M. Yang, B. Dai, H. Dai, O. Nachum, J. B. Tenenbaum, D. Schuurmans, and P. Abbeel, “Learning universal policies via text- guided video generation,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

  29. [29]

    Zero-shot robotic manipulation with pretrained image- editing diffusion models,

    K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image- editing diffusion models,” inInternational Conference on Learning Representations (ICLR), 2024

  30. [30]

    Video language planning,

    Y . Du, M. Yang, P. Florence, F. Xia, A. Wahid, B. Ichter, P. Sermanet, T. Yu, P. Abbeel, J. B. Tenenbaum, L. Kaelbling, A. Zeng, and J. Tompson, “Video language planning,” inInternational Conference on Learning Representations (ICLR), 2024

  31. [31]

    Unleashing large-scale video generative pre-training for visual robot manipulation,

    H. Wu, Y . Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” inInternational Conference on Learning Representations, vol. 2024, 2024, pp. 10 641–10 662

  32. [32]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang, H. Zhang, and M. Zhu, “GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation,” 2024. [Online]. Available: https://arxiv.org/abs/2410.06158

  33. [33]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Linet al., “Dreamgen: Unlocking generaliza- tion in robot learning through video world models,”arXiv preprint arXiv:2505.12705, 2025

  34. [34]

    Learning real-world action- video dynamics with heterogeneous masked autoregression,

    L. Wang, K. Zhao, C. Liu, and X. Chen, “Learning real-world action- video dynamics with heterogeneous masked autoregression,”arXiv preprint arXiv:2502.04296, 2025

  35. [35]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finnet al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

  36. [36]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V . Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y . Li, X. Ma, S. Chandar, F. Meier, Y . LeCun, M. Rabbat, and N. Ballas, “V-jepa 2: Self-supe...

  37. [37]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    T. Yuan, Z. Dong, Y . Liu, and H. Zhao, “Fast-wam: Do world action models need test-time future imagination?” 2026. [Online]. Available: https://arxiv.org/abs/2603.16666

  38. [38]

    Last-vla: Thinking in latent spatio- temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928, 2026

    Y . Luo, F. Li, S. Xu, Y . Ji, Z. Zhang, B. Wang, Y . Shen, J. Cui, L. Chen, G. Chen, H. Ye, Z.-X. Yang, and F. Wen, “Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving,” 2026. [Online]. Available: https://arxiv.org/abs/2603.01928

  39. [39]

    Chain of world: World model thinking in latent motion,

    F. Yang, D. Di, L. Tang, X. Zhang, L. Fan, H. Li, C. Wei, T. Su, and B. Ma, “Chain of world: World model thinking in latent motion,”

  40. [40]

    Available: https://arxiv.org/abs/2603.03195

    [Online]. Available: https://arxiv.org/abs/2603.03195

  41. [41]

    Atomvla: Scalable post-training for robotic manipulation via predictive latent world models.arXiv preprint arXiv:2603.08519, 2026

    X. Sun, Z. Xu, C. Cao, Z. Liu, Y . Sun, J. Pang, R. Zhang, Z. Yang, K. Pang, D. He, M. Yuan, and J. Chen, “Atomvla: Scalable post-training for robotic manipulation via predictive latent world models,” 2026. [Online]. Available: https://arxiv.org/abs/2603.08519

  42. [42]

    Flip: Flow-centric generative planning as general-purpose manipulation world model,

    C. Gao, H. Zhang, Z. Xu, C. Zhehao, and L. Shao, “Flip: Flow-centric generative planning as general-purpose manipulation world model,” in International Conference on Learning Representations, vol. 2025, 2025, pp. 21 927–21 948

  43. [43]

    FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,

    Z. Zhong, H. Yan, J. Li, X. Liu, X. Gong, T. Zhang, W. Song, J. Chen, X. Zheng, H. Wang, and H. Li, “FlowVLA: Visual chain of thought-based motion reasoning for vision-language-action models,”

  44. [44]
  45. [45]

    3d-vla: a 3d vision-language-action generative world model,

    H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y . Du, Y . Hong, and C. Gan, “3d-vla: a 3d vision-language-action generative world model,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24. JMLR.org, 2024

  46. [46]

    OG-VLA: Orthographic image generation for 3d- aware vision-language action model,

    I. Singh, A. Goyal, S. Birchfield, D. Fox, A. Garg, and V . Blukis, “OG-VLA: Orthographic image generation for 3d- aware vision-language action model,” 2025. [Online]. Available: https://arxiv.org/abs/2506.01196

  47. [47]

    3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,

    V . Bhat, Y .-H. Lan, P. Krishnamurthy, R. Karri, and F. Khorrami, “3D-CA VLA: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,” 2025. [Online]. Available: https://arxiv.org/abs/2505.05800

  48. [48]

    Wristworld: Generating wrist-views via 4d world models for robotic manipulation,

    Z. Qian, X. Chi, Y . Li, S. Wang, Z. Qin, X. Ju, S. Han, and S. Zhang, “Wristworld: Generating wrist-views via 4d world models for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.07313

  49. [49]

    Gwm: Towards scalable gaussian world models for robotic manipulation,

    G. Lu, B. Jia, P. Li, Y . Chen, Z. Wang, Y . Tang, and S. Huang, “Gwm: Towards scalable gaussian world models for robotic manipulation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9263–9274

  50. [50]

    PIN- WM: Learning physics-informed world models for non-prehensile manipulation,

    W. Li, H. Zhao, Z. Yu, Y . Du, Q. Zou, R. Hu, and K. Xu, “PIN- WM: Learning physics-informed world models for non-prehensile manipulation,” inProceedings of Robotics: Science and Systems (RSS), Los Angeles, CA, USA, 2025

  51. [51]

    Showui: One vision-language- action model for GUI visual agent

    Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, M.-Y . Liu, D. Xiang, G. Wetzstein, and T.-Y . Lin, “CoT-VLA: Visual chain-of-thought reasoning for vision- language-action models,” in2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 1715–1726. [Online]. Available: https://do...

  52. [52]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    J. Cen, S. Huang, Y . Yuan, K. Li, H. Yuan, C. Yu, Y . Jiang, J. Guo, X. Li, H. Luo, F. Wang, D. Zhao, and H. Chen, “RynnVLA-002: A unified vision-language-action and world model,” 2025. [Online]. Available: https://arxiv.org/abs/2511.17502

  53. [53]

    Physical autoregressive model for robotic manipulation without action pretraining,

    Z. Song, S. Qin, T. Chen, L. Lin, and G. Wang, “Physical autoregressive model for robotic manipulation without action pretraining,” 2025. [Online]. Available: https://arxiv.org/abs/2508.09822

  54. [54]

    Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

    J. Won, K. Lee, H. Jang, D. Kim, and J. Shin, “Dual-stream diffusion for world-model augmented vision-language-action model,” 2025. [Online]. Available: https://arxiv.org/abs/2510.27607

  55. [55]

    World Action Models are Zero-shot Policies

    S. Ye, Y . Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y . L. Tan, C. Zhu, J. Xiang, A. Malik, K. Lee, W. Liang, N. Ranawaka, J. Gu, Y . Xu, G. Wang, F. Hu, A. Narayan, J. Bjorck, J. Wang, G. Kim, D. Niu, R. Zheng, Y . Xie, J. Wu, Q. Wang, R. Julian, D. Xu, Y . Du, Y . Chebotar, S. Reed, J. Kautz, Y . Zhu, L. J. Fan, and J. Jang, “World action mo...

  56. [56]

    Do World Action Models Generalize Better than VLAs? A Robustness Study

    Z. Zhang, Z. Li, B. Rahmati, R. H. Yang, Y . Ma, A. Rasouli, S. Pakdamansavoji, Y . Wu, L. Zhang, T. Cao, F. Wen, X. Wang, X. Quan, and Y . Zhang, “Do world action models generalize better than VLAs? a robustness study,” 2026. [Online]. Available: https://arxiv.org/abs/2603.22078

  57. [57]

    Flare: Robot learning with implicit world modeling,

    R. Zheng, J. Wang, S. Reed, J. Bjorck, Y . Fang, F. Hu, J. Jang, K. Kundalia, Z. Lin, L. Magne, A. Narayan, Y . L. Tan, G. Wang, Q. Wang, J. Xiang, Y . Xu, S. Ye, J. Kautz, F. Huang, Y . Zhu, and L. Fan, “Flare: Robot learning with implicit world modeling,” in Proceedings of The 9th Conference on Robot Learning, ser. Proceedings of Machine Learning Resear...

  58. [58]

    Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge

    W. Zhang, H. Liu, Z. Qi, Y. Wang, X. Yu, J. Zhang, R. Dong, J. He, H. Wang, Z. Zhang et al., “Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge,” Advances in Neural Information Processing Systems, vol. 38, pp. 24195–24228, 2026

  59. [59]

    VLA-JEPA: Enhancing vision-language-action model with latent world model

    J. Sun, W. Zhang, Z. Qi, S. Ren, Z. Liu, H. Zhu, G. Sun, X. Jin, and Z. Chen, “VLA-JEPA: Enhancing vision-language-action model with latent world model,” 2026. [Online]. Available: https://arxiv.org/abs/2602.10098

  60. [60]

    DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

    Y. Chen, Y. Ge, H. Zhou, M. Ding, Y. Ge, and X. Liu, “DIAL: Decoupling intent and action via latent world modeling for end-to-end VLA,” 2026. [Online]. Available: https://arxiv.org/abs/2603.29844

  61. [61]

    UP-VLA: A unified understanding and prediction model for embodied agent

    J. Zhang, Y. Guo, Y. Hu, X. Chen, X. Zhu, and J. Chen, “UP-VLA: A unified understanding and prediction model for embodied agent,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=V7JPraxi5j

  62. [62]

    Daydreamer: World models for physical robot learning

    P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in Proceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol. 205. PMLR, 14–18 Dec 2023, pp. 2226–2240. [Online]. Available: https://proceedings.mlr.press/v205/wu23c.html

  64. [64]

    Multi-view masked world models for visual robotic manipulation

    Y. Seo, J. Kim, S. James, K. Lee, J. Shin, and P. Abbeel, “Multi-view masked world models for visual robotic manipulation,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23–29 Jul 2023,...

  65. [65]

    Learning view-invariant world models for visual robotic manipulation

    J.-C. Pang, N. Tang, K. Li, Y. Tang, X.-Q. Cai, Z.-Y. Zhang, G. Niu, M. Sugiyama, and Y. Yu, “Learning view-invariant world models for visual robotic manipulation,” in International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, Eds., vol. 2025, 2025, pp. 54853–54876. [Online]. Available: https://proceedings.iclr...

  66. [66]

    Ladi-WM: A latent diffusion-based world model for predictive manipulation

    Y. Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu, “Ladi-WM: A latent diffusion-based world model for predictive manipulation,” in 9th Annual Conference on Robot Learning, 2025. [Online]. Available: https://openreview.net/forum?id=o2w2iiMyEU

  67. [67]

    Lumos: Language-conditioned imitation learning with world models

    I. Nematollahi, B. DeMoss, A. L. Chandra, N. Hawes, W. Burgard, and I. Posner, “Lumos: Language-conditioned imitation learning with world models,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 8219–8225

  68. [68]

    Reward-free world models for online imitation learning

    S. Li, Z. Huang, and H. Su, “Reward-free world models for online imitation learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=owEhpoKBKC

  69. [69]

    Focus: object-centric world models for robotic manipulation

    S. Ferraro, P. Mazzaglia, T. Verbelen, and B. Dhoedt, “Focus: object-centric world models for robotic manipulation,” Frontiers in Neurorobotics, vol. 19, 2025. [Online]. Available: https://www.frontiersin.org/journals/neurorobotics/articles/10.3389/fnbot.2025.1585386

  70. [70]

    Leveraging separated world model for exploration in visually distracted environments

    K. Huang, S. Wan, M. Shao, H.-H. Sun, L. Gan, S. Feng, and D.-C. Zhan, “Leveraging separated world model for exploration in visually distracted environments,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, pp. 82 350–82 37...

  71. [71]

    Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy

    P. Li, H. Wu, Y. Huang, C. Cheang, L. Wang, and T. Kong, “Gr-mg: Leveraging partially-annotated data via multi-modal goal-conditioned policy,” IEEE Robotics and Automation Letters, vol. 10, no. 2, pp. 1912–1919, 2025

  72. [72]

    Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation

    L. Heng, J. Xu, Y. Wang, X. Li, M. Cai, Y. Shen, J. Zhu, G. Ren, and H. Dong, “Imagine2Act: Leveraging object-action motion consistency from imagined goals for robotic manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2509.17125

  73. [73]

    Mind: Learning a dual-system world model for real-time planning and implicit risk analysis

    X. Chi, K. Ge, J. Liu, S. Zhou, P. Jia, Z. He, Y. Liu, T. Li, L. Han, S. Han, S. Zhang, and Y. Guo, “Mind: Learning a dual-system world model for real-time planning and implicit risk analysis,” 2025. [Online]. Available: https://arxiv.org/abs/2506.18897

  74. [74]

    Closed-loop visuomotor control with generative expectation for robotic manipulation

    Q. Bu, J. Zeng, L. Chen, Y. Yang, G. Zhou, J. Yan, P. Luo, H. Cui, Y. Ma, and H. Li, “Closed-loop visuomotor control with generative expectation for robotic manipulation,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. [Online]. Available: https://openreview.net/forum?id=1ptdkwZbMG

  75. [75]

    Eva: Aligning video world models with executable robot actions via inverse dynamics rewards

    R. Wang, Q. Liu, Y. Deng, G. Liu, Z. Liu, and K. Jia, “Eva: Aligning video world models with executable robot actions via inverse dynamics rewards,” 2026. [Online]. Available: https://arxiv.org/abs/2603.17808

  76. [76]

    Video prediction policy: A generalist robot policy with predictive visual representations

    Y. Hu, Y. Guo, P. Wang, X. Chen, Y.-J. Wang, J. Zhang, K. Sreenath, C. Lu, and J. Chen, “Video prediction policy: A generalist robot policy with predictive visual representations,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=c0dhw1du33

  77. [77]

    TD-MPC2: Scalable, robust world models for continuous control

    N. Hansen, H. Su, and X. Wang, “TD-MPC2: Scalable, robust world models for continuous control,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=Oxh5CstDJU

  78. [78]

    Modem: Accelerating visual model-based reinforcement learning with demonstrations

    N. Hansen, Y. Lin, H. Su, X. Wang, V. Kumar, and A. Rajeswaran, “Modem: Accelerating visual model-based reinforcement learning with demonstrations,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=JdTnc9gjVfJ

  79. [79]

    Modem-v2: Visuo-motor world models for real-world robot manipulation

    P. Lancaster, N. Hansen, A. Rajeswaran, and V. Kumar, “Modem-v2: Visuo-motor world models for real-world robot manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 7530–7537

  80. [80]

    Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning

    A. L. Escoriza, N. Hansen, S. Tao, T. Mu, and H. Su, “Multi-stage manipulation with demonstration-augmented reward, policy, and world model learning,” in Forty-second International Conference on Machine Learning, 2025. [Online]. Available: https://openreview.net/forum?id=Bv7LUUYOiq

Showing first 80 references.