
arxiv: 2606.12783 · v1 · pith:7D5PZLQUnew · submitted 2026-06-11 · 💻 cs.AI

A Tutorial on World Models and Physical AI

Pith reviewed 2026-06-27 07:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords world models · physical AI · explicit models · implicit models · predictive structure · robotics · foundation models · planning

The pith

World models share a common predictive structure; explicit and implicit approaches differ in how that structure is represented and exploited.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes world modeling approaches around a common predictive core. Explicit models learn structured dynamics that support rollout-based reasoning and planning. Implicit models encode the same predictive ability inside scalable learned representations. Together, the two paradigms supply a foundation for physical AI in robotics and autonomous driving, moving beyond reactive control. Foundation models are presented as a route to systems that integrate perception, prediction, and action.
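As a rough illustration of what "rollout-based reasoning and planning" can mean for an explicit model, the sketch below plans by random shooting: it imagines many candidate action sequences inside a dynamics model and keeps the best one. This is not taken from the paper. The 1-D point-mass dynamics, the reward, and all function names (`learned_dynamics`, `imagine_return`, `plan_by_random_shooting`) are hypothetical stand-ins, and a hand-written update takes the place of a learned model.

```python
import random


def learned_dynamics(state, action, dt=0.1):
    """Hypothetical stand-in for a learned dynamics model (1-D point mass)."""
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    return (pos, vel)


def reward(state, goal=1.0):
    """Assumed reward: negative distance between position and a goal."""
    return -abs(state[0] - goal)


def imagine_return(state, actions):
    """Roll the model forward over an action sequence and sum the rewards."""
    total = 0.0
    for action in actions:
        state = learned_dynamics(state, action)
        total += reward(state)
    return total


def plan_by_random_shooting(state, horizon=10, candidates=200, seed=0):
    """Sample candidate action sequences and keep the best imagined one."""
    rng = random.Random(seed)
    best_actions, best_return = None, float("-inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = imagine_return(state, actions)
        if ret > best_return:
            best_actions, best_return = actions, ret
    return best_actions, best_return


if __name__ == "__main__":
    start = (0.0, 0.0)
    actions, ret = plan_by_random_shooting(start)
    print("first planned action:", actions[0])
    print("imagined return:", ret)
    print("do-nothing return:", imagine_return(start, [0.0] * 10))
```

All evaluation here happens inside the model, so no real environment step is taken until an action is chosen. A purely reactive controller, by contrast, would map the current observation to an action without imagining any future.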

Core claim

This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

What carries the argument

The coherent framework that unifies explicit and implicit world models by shared predictive structure and differentiates them by representation and exploitation.

Load-bearing premise

Explicit and implicit world models share a common predictive structure that can organize the entire literature into one coherent framework.

What would settle it

A demonstration that the predictive mechanisms underlying explicit rollout models and implicit representation models have no measurable overlap or unifying features.

Figures

Figures reproduced from arXiv: 2606.12783 by Il-Seok Oh.

Figure 1: Multiple imaginative predictions inferred from a single partial observation through a world model (Source: Agência
Figure 2: A conceptual illustration of hierarchical human reasoning supported by world models, inspired by the causal hierarchy
Figure 3: Comparison between (a) a standard MDP, in which the agent interacts directly with the external environment, and (b)
Figure 4: Imagined rollout generation using the world model in Algorithm 3. Starting from an encoded observation, the recurrent
Figure 5: Architectural comparison of latent world models. (a) Ha–Schmidhuber model (RNN-MDN): the encoder is independent
Figure 6: Staged learning procedure of the Ha–Schmidhuber world model. The encoder–decoder pair
Figure 7: Recurrent State-Space Model (RSSM). RSSM represents the system state using a deterministic recurrent state
Figure 8: Planning-oriented world model learning in MuZero. (a) Monte Carlo Tree Search performed in latent space using the
Figure 9: Explicit versus implicit world models. (a) Explicit world models encode world knowledge in an internal dynamics
Figure 10: Example of human-like world knowledge and temporal reasoning exhibited by a VLM from a partial observation.
Figure 11: Inference-time autoregressive generation with latent control in Genie. Starting from an encoded frame, the latent
Figure 12: Stage-wise training procedure of Genie from video data. The encoder
Figure 13: Joint-embedding predictive architectures (JEPA). (a) Generic JEPA learns compatibility in representation space
Figure 14: Reactive (Mode-1) and world model-based (Mode-2) physical AI.
Figure 15: Examples of Mode-1 and Mode-2 physical AI systems. Mode-1 (reactive) systems include commercially deployed
Figure 16: Dreamer vs. DayDreamer: Extension of explicit latent world model learning from simulation in (a) to real-world
Figure 17: Pretraining and adaptation pipeline of V-JEPA 2. Representations are learned from large-scale internet video data
Figure 18: Fine-tuning of V-JEPA 2-AC (adapted from [
Figure 19: Architectural evolution in autonomous driving. (a) Traditional modular pipelines based on hand-engineered world
Figure 20: Simplified architecture of GAIA-1. The world model is realized as an autoregressive predictor in latent token space,
Figure 21: Overall architecture of AD-L-JEPA, where masked BEV regions are predicted in latent space using learnable token
Original abstract

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. This tutorial claims to present a coherent organizational framework that unifies explicit world models (structured dynamics for rollout-based reasoning and planning) and implicit world models (predictive structure encoded in scalable representations) through their shared predictive structure, while differentiating them by representation and exploitation; it positions this as foundational for physical AI in robotics and autonomous driving, notes pathways via foundation models, and identifies open challenges in hierarchical reasoning, long-horizon planning, and autonomous goal formation.

Significance. If the proposed taxonomy accurately and non-selectively captures the literature, the tutorial would provide a useful synthesis for researchers seeking to connect disparate world-modeling paradigms; as a review rather than a source of new theorems or experiments, its value lies in expository organization rather than discovery.

minor comments (1)
  1. [Abstract] The final sentence states the unification claim but does not preview the specific criteria (e.g., representation type, exploitation mechanism) used to differentiate approaches. Previewing them would help readers anticipate the framework's structure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the tutorial and their recommendation to accept. The review accurately captures the manuscript's intent to provide an expository synthesis unifying explicit and implicit world models through predictive structure for physical AI applications.

Circularity Check

0 steps flagged

Tutorial synthesis carries no derivation chain

full rationale

This is a review/tutorial paper whose sole claim is an expository unification of existing literature via a shared predictive-structure taxonomy. No equations, fitted parameters, formal derivations, or new predictions appear; the abstract and structure position the work as synthesis rather than discovery. Consequently none of the enumerated circularity patterns can apply, and the paper is self-contained as an organizational exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Tutorial paper with no new derivations, parameters, or entities introduced.

pith-pipeline@v0.9.1-grok · 5655 in / 893 out tokens · 17736 ms · 2026-06-27T07:26:00.633431+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 12 canonical work pages · 1 internal anchor

  1. [1]

    Saminda Abeyruwan et al. 2025. Gemini Robotics: bringing AI into the physical world. arXiv:2503.20020 [cs.RO]

  2. [2]

    Niket Agarwal et al. 2025. Cosmos world foundation model platform for physical AI. arXiv:2501.03575 [cs.CV]

  3. [3]

    Mahmoud Assran et al. 2023. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  4. [4]

    Mido Assran et al. 2025. V-JEPA 2: self-supervised video models enable understanding, prediction and planning. arXiv:2506.09985 [cs.AI]

  5. [5]

    Adrien Bardes et al. 2024. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471 [cs.CV]

  6. [6]

    Rishi Bommasani et al. 2021. On the opportunities and risks of foundation models. arXiv:2108.07258 [cs.LG]

  7. [7]

    Jake Bruce et al. 2024. Genie: generative interactive environments. arXiv:2402.15391 [cs.LG]

  8. [8]

    Sebastien Bubeck et al. 2023. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv:2303.12712 [cs.CL]

  9. [9]

    Jingtao Ding et al. 2025. Understanding world or predicting future? a comprehensive survey of world models. Comput. Surveys 58, 3 (Sept. 2025), 1–38. doi:10.1145/3746449

  10. [10]

    Jie Feng et al. 2025. A survey of large language model-powered spatial intelligence across scales: advances in embodied agents, smart cities, and earth science. arXiv:2504.09848 [cs.AI]

  11. [11]

    Roya Firoozi et al. 2023. Foundation models in robotics: applications, challenges, and the future. arXiv:2312.07843 [cs.RO]

  12. [12]

    Pascale Fung et al. 2025. Embodied AI agents: modeling the world. arXiv:2506.22355 [cs.AI]

  13. [13]

    Yuan Gao et al. 2025. Foundation models in autonomous driving: a survey on scenario generation and scenario analysis. arXiv:2506.11526 [cs.RO]

  14. [14]

    Dibya Ghosh et al. 2024. Octo: open-source generalist robot policy. arXiv:2405.12213 [cs.RO]

  15. [15]

    Yanchen Guan et al. 2024. World models for autonomous driving: an initial survey. IEEE Transactions on Intelligent Vehicles (May 2024), 1–17. doi:10.1109/TIV.2024.3398357

  16. [16]

    Wes Gurnee and Max Tegmark. 2024. Language models represent space and time. In International Conference on Learning Representations

  17. [17]

    David Ha and Jurgen Schmidhuber. 2018. World models. arXiv:1803.10122 [cs.LG]

    ACM Comput. Surv., Vol. 58, No. 4, Article 111. Publication date: August 2026.

  18. [18]

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2025. Mastering diverse domains through world models. Nature 640, 17 (April 2025), 647–665. doi:10.1038/s41586-025-08744-2

  19. [19]

    Jeff Hawkins. 2021. A Thousand Brains: A New Theory of Intelligence. Basic Books, New York, NY

  20. [20]

    Kaiming He et al. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  21. [21]

    Tairan He et al. 2025. ASAP: aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv:2502.01143 [cs.RO]

  22. [22]

    Dan Hendrycks et al. 2025. A definition of AGI. arXiv:2510.18212 [cs.AI]

  23. [23]

    Yi-Hsuan Hsiao et al. 2025. Aerobatic maneuvers in insect-scale flapping-wing aerial robots via deep-learned robust tube model predictive control. Science Advances 11, 49 (December 2025), 1–13. doi:10.1126/sciadv.aea8716

  24. [24]

    Anthony Hu et al. 2023. GAIA-1: a generative world model for autonomous driving. arXiv:2309.17080 [cs.CV]

  25. [25]

    Philip Nicholas Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, Cambridge, MA

  26. [26]

    Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan, New York, NY

  27. [27]

    Alexander Khazatsky et al. 2024. DROID: a large-scale in-the-wild robot manipulation dataset. arXiv:2403.12945 [cs.RO]

  28. [28]

    Lingdong Kong et al. 2025. 3D and 4D world modeling: a survey. arXiv:2509.07996 [cs.CV]

  29. [29]

    Kristofer D. Kusano et al. 2025. Comparison of Waymo rider-only crash rates by crash type to human benchmarks at 56.7 million miles. Traffic Injury Prevention 26, 1 (May 2025), S8–S20. doi:10.1080/15389588.2025.2499887

  30. [30]

    Yann LeCun. 2022. A path towards autonomous machine intelligence. https://openreview.net/pdf?id=BZ5a1r-kVsf Open Review

  31. [31]

    Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. 2023. Model-based reinforcement learning: a survey. Foundations and Trends in Machine Learning 16, 1 (Jan. 2023), 1–118. doi:10.1561/2200000086

  32. [32]

    Abby O’Neill et al. 2023. Open X-Embodiment: robotic learning datasets and RT-X models. arXiv:2310.08864 [cs.RO]

  33. [33]

    Judea Pearl. 2009. Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press, London, England

  34. [34]

    R. Quian Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried. 2005. Invariant visual representation by single neurons in the human brain. Nature 435, 23 (June 2005), 1102–1107. doi:10.1038/nature03687

  35. [35]

    Lloyd Russel et al. 2025. GAIA-2: a controllable multi-view generative world model for autonomous driving. arXiv:2503.20523 [cs.CV]

  36. [36]

    Julian Schrittwieser et al. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588, 24 (December 2020), 604–612. doi:10.1038/s41586-020-03051-4

  37. [37]

    David Silver et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (December 2018), 1140–1144. doi:10.1126/science.aar6404

  38. [38]

    Zhi Su et al. 2025. HITTER: a humanoid table tennis robot via hierarchical planning and learning. arXiv:2508.21043 [cs.RO]

  39. [39]

    Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd ed.). The MIT Press, London, England

  40. [40]

    Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems

  41. [41]

    Wayve. 2025. GAIA-3: scaling world models to power safety and evaluation. https://wayve.ai/thinking/gaia-3. Blog post

  42. [42]

    Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems

  43. [43]

    Philipp Wu et al. 2022. DayDreamer: world models for physical robot learning. In Proceedings of the Conference on Robot Learning

  44. [44]

    Cong Zhang, Bangyang Wei, Yang Liu, and Samuel Labi. 2026. World model-based long-tail and scenario-specific generation for autonomous driving. Journal of Intelligent and Connected Vehicles (2026). doi:10.26599/JICV.2026.9210080

  45. [45]

    Jingyuan Zhao et al. 2025. A survey of autonomous driving from a deep learning perspective. Comput. Surveys 57, 10 (May 2025), 1–60. doi:10.1145/3729420

  46. [46]

    Wayne Xin Zhao et al. 2023. A survey of large language models. arXiv:2303.18223 [cs.CL]

  47. [47]

    Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Beiyao Sha, and Anna Choromanska. 2025. Self-supervised representation learning with joint embedding predictive architecture for automotive LiDAR object detection. arXiv:2501.04969 [cs.RO]

  48. [48]

    Zheng Zhu et al. 2024. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv:2405.03520 [cs.CV]

  49. [49]

    Yongshuo Zong, Oisin Mac Aodha, and Timothy M. Hospedales. 2025. Self-supervised multimodal learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 47, 7 (July 2025), 5299–5318. doi:10.1109/TPAMI.2024.3429301

    Received 20 February 2026; revised 12 March 2026; accepted 5 June 2026