A Tutorial on World Models and Physical AI

Il-Seok Oh

arxiv: 2606.12783 · v1 · pith:7D5PZLQUnew · submitted 2026-06-11 · 💻 cs.AI

Pith reviewed 2026-06-27 07:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords world models · physical AI · explicit models · implicit models · predictive structure · robotics · foundation models · planning

The pith

World models are unified by a shared predictive structure and differentiated by whether that structure is represented explicitly or implicitly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper organizes world modeling approaches around a common predictive core. Explicit models learn structured dynamics that support rollout-based reasoning and planning. Implicit models encode the same predictive ability inside scalable learned representations. This distinction supplies a foundation for physical AI in robotics and autonomous driving, moving beyond reactive control. Foundation models are presented as a route to systems that integrate perception, prediction, and action.
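To make "rollout-based reasoning and planning" concrete, here is a minimal sketch of an explicit world model used for planning. It is an illustration under stated assumptions, not the paper's method: the point-mass `dynamics` function stands in for a learned transition model, and the names `ACTIONS`, `DT`, `rollout_cost`, and `plan` are hypothetical. The planner imagines every short action sequence inside the model and executes only the first action of the cheapest one.

```python
from itertools import product

# Hypothetical discrete action set and time step for the toy example.
ACTIONS = (-1.0, 0.0, 1.0)
DT = 0.1


def dynamics(state, action):
    """Stand-in for a learned transition model: a 1-D point mass.

    state is (position, velocity); action is an acceleration command.
    """
    pos, vel = state
    vel = vel + DT * action
    pos = pos + DT * vel
    return (pos, vel)


def rollout_cost(state, actions, goal):
    """Roll the model forward and sum squared distance to the goal."""
    cost = 0.0
    for action in actions:
        state = dynamics(state, action)  # imagined step, no real environment
        cost += (state[0] - goal) ** 2
    return cost


def plan(state, goal, horizon=5):
    """Exhaustively score all action sequences of length `horizon`.

    Returns the first action of the best sequence and that sequence's cost.
    """
    best = min(
        product(ACTIONS, repeat=horizon),
        key=lambda seq: rollout_cost(state, seq, goal),
    )
    return best[0], rollout_cost(state, best, goal)


if __name__ == "__main__":
    action, cost = plan((0.0, 0.0), goal=1.0)
    print(action, cost)
```

Starting at rest at position 0 with the goal at 1, the planner picks the accelerating action `1.0`. Exhaustive search is chosen here only because the toy action space is tiny; larger problems typically need sampling or tree search instead.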

Core claim

This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

What carries the argument

The coherent framework that unifies explicit and implicit world models by shared predictive structure and differentiates them by representation and exploitation.
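One hedged way to write down the contrast is shown below. The notation is an illustrative assumption, not taken from the paper: $e$ is an encoder, $f_\theta$ a transition model, $g_\phi$ a predictor in representation space, and $\bar{e}$ a target encoder.

```latex
\begin{align}
  % Explicit: a learned transition model is rolled out step by step
  \hat{z}_0 &= e(o_0), &
  \hat{z}_{t+1} &= f_\theta(\hat{z}_t, a_t), \quad t = 0, \dots, H-1 \\
  % Implicit: prediction happens inside the learned representation
  \mathcal{L}(\phi) &= \bigl\| g_\phi\bigl(e(x_{\mathrm{context}})\bigr)
      - \bar{e}(x_{\mathrm{target}}) \bigr\|^2
\end{align}
```

In both lines the model is trained or used to predict something it has not observed; what differs is whether the prediction is exposed as a state trajectory that a planner can roll out, or stays inside the representation.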

Load-bearing premise

Explicit and implicit world models share a common predictive structure that can organize the entire literature into one coherent framework.

What would settle it

A demonstration that the predictive mechanisms underlying explicit rollout models and implicit representation models have no measurable overlap or unifying features.

Figures

Figures reproduced from arXiv: 2606.12783 by Il-Seok Oh.

**Figure 1.** Multiple imaginative predictions inferred from a single partial observation through a world model (Source: Agência …

**Figure 2.** A conceptual illustration of hierarchical human reasoning supported by world models, inspired by the causal hierarchy …

**Figure 3.** Comparison between (a) a standard MDP, in which the agent interacts directly with the external environment, and (b) …

**Figure 4.** Imagined rollout generation using the world model in Algorithm 3. Starting from an encoded observation, the recurrent …

**Figure 5.** Architectural comparison of latent world models. (a) Ha–Schmidhuber model (RNN-MDN): the encoder is independent …

**Figure 6.** Staged learning procedure of the Ha–Schmidhuber world model. The encoder–decoder pair …

**Figure 7.** Recurrent State-Space Model (RSSM). RSSM represents the system state using a deterministic recurrent state …

**Figure 8.** Planning-oriented world model learning in MuZero. (a) Monte Carlo Tree Search performed in latent space using the …

**Figure 9.** Explicit versus implicit world models. (a) Explicit world models encode world knowledge in an internal dynamics …

**Figure 10.** Example of human-like world knowledge and temporal reasoning exhibited by a VLM from a partial observation.

**Figure 11.** Inference-time autoregressive generation with latent control in Genie. Starting from an encoded frame, the latent …

**Figure 12.** Stage-wise training procedure of Genie from video data. The encoder …

**Figure 13.** Joint-embedding predictive architectures (JEPA). (a) Generic JEPA learns compatibility in representation space …

**Figure 14.** Reactive (Mode-1) and world model-based (Mode-2) physical AI.

**Figure 15.** Examples of Mode-1 and Mode-2 physical AI systems. Mode-1 (reactive) systems include commercially deployed …

**Figure 16.** Dreamer vs. DayDreamer: Extension of explicit latent world model learning from simulation in (a) to real-world …

**Figure 17.** Pretraining and adaptation pipeline of V-JEPA 2. Representations are learned from large-scale internet video data …

**Figure 18.** Fine-tuning of V-JEPA 2-AC (adapted from [ …

**Figure 19.** Architectural evolution in autonomous driving. (a) Traditional modular pipelines based on hand-engineered world …

**Figure 20.** Simplified architecture of GAIA-1. The world model is realized as an autoregressive predictor in latent token space, …

**Figure 21.** Overall architecture of AD-L-JEPA, where masked BEV regions are predicted in latent space using learnable token …

Original abstract

World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A central distinction can be drawn between explicit world models, which learn structured dynamics for rollout-based reasoning and planning, and implicit world models, which encode predictive structure within scalable learned representations. These complementary paradigms provide a foundation for physical AI in domains such as robotics and autonomous driving, enabling intelligence beyond reactive control under real-world constraints. Recent foundation models further suggest a pathway toward unified systems integrating perception, prediction, and action. Despite rapid progress, major challenges remain in hierarchical reasoning, long-horizon planning, and autonomous goal formation, which are critical for advancing toward artificial general intelligence. This tutorial presents a coherent framework in which diverse world modeling approaches are unified through shared predictive structure and differentiated by how such structure is represented and exploited.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A clear tutorial that organizes explicit versus implicit world models around predictive structure, but adds no new methods or results.

This tutorial organizes existing ideas on world models for physical AI by splitting them into explicit models that learn structured dynamics and implicit ones that embed predictions in representations. The central move is to tie both to a shared predictive structure.

It does a decent job laying out how these approaches support prediction, reasoning, and action in robotics and autonomous driving. The discussion of challenges like hierarchical reasoning and long-horizon planning is straightforward and matches what the field already flags.

The main limitation is that the paper is explicitly a synthesis. No new derivations, experiments, or falsifiable claims appear, so its value depends entirely on whether the taxonomy is accurate and balanced in the full text. If the coverage is selective, the unification could feel forced.

This is aimed at newcomers who want a map of the area rather than researchers already working on world models. Experienced readers will likely find the distinctions familiar.

It deserves peer review because a well-structured tutorial can still be useful for entry into the topic, even without original contributions.

Referee Report

0 major / 1 minor

Summary. This tutorial presents an organizational framework that unifies explicit world models (structured dynamics for rollout-based reasoning and planning) and implicit world models (predictive structure encoded in scalable representations) through their shared predictive structure, and differentiates them by how that structure is represented and exploited. It positions the framework as foundational for physical AI in robotics and autonomous driving, notes pathways via foundation models, and identifies open challenges in hierarchical reasoning, long-horizon planning, and autonomous goal formation.

Significance. If the proposed taxonomy accurately and non-selectively captures the literature, the tutorial would provide a useful synthesis for researchers seeking to connect disparate world-modeling paradigms; as a review rather than a source of new theorems or experiments, its value lies in expository organization rather than discovery.

minor comments (1)

[Abstract] The final sentence states the unification claim but does not preview the specific criteria (e.g., representation type, exploitation mechanism) used to differentiate approaches; previewing them would help readers anticipate the framework's structure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the tutorial and their recommendation to accept. The review accurately captures the manuscript's intent to provide an expository synthesis unifying explicit and implicit world models through predictive structure for physical AI applications.

Circularity Check

0 steps flagged

Tutorial synthesis carries no derivation chain

This is a review/tutorial paper whose sole claim is an expository unification of existing literature via a shared predictive-structure taxonomy. No equations, fitted parameters, formal derivations, or new predictions appear; the abstract and structure position the work as synthesis rather than discovery. Consequently none of the enumerated circularity patterns can apply, and the paper is self-contained as an organizational exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Tutorial paper with no new derivations, parameters, or entities introduced.

pith-pipeline@v0.9.1-grok · 5655 in / 893 out tokens · 17736 ms · 2026-06-27T07:26:00.633431+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 12 canonical work pages · 1 internal anchor

[1] Saminda Abeyruwan et al. 2025. Gemini Robotics: bringing AI into the physical world. arXiv:2503.20020 [cs.RO]

[2] Niket Agarwal et al. 2025. Cosmos world foundation model platform for physical AI. arXiv:2501.03575 [cs.CV]

[3] Mahmoud Assran et al. 2023. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[4] Mido Assran et al. 2025. V-JEPA 2: self-supervised video models enable understanding, prediction and planning. arXiv:2506.09985 [cs.AI]

[5] Adrien Bardes et al. 2024. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471 [cs.CV]

[6] Rishi Bommasani et al. 2021. On the opportunities and risks of foundation models. arXiv:2108.07258 [cs.LG]

[7] Jake Bruce et al. 2024. Genie: generative interactive environments. arXiv:2402.15391 [cs.LG]

[8] Sebastien Bubeck et al. 2023. Sparks of artificial general intelligence: early experiments with GPT-4. arXiv:2303.12712 [cs.CL]

[9] Jingtao Ding et al. 2025. Understanding world or predicting future? a comprehensive survey of world models. Comput. Surveys 58, 3 (Sept. 2025), 1–38. doi:10.1145/3746449

[10] Jie Feng et al. 2025. A survey of large language model-powered spatial intelligence across scales: advances in embodied agents, smart cities, and earth science. arXiv:2504.09848 [cs.AI]

[11] Roya Firoozi et al. 2023. Foundation models in robotics: applications, challenges, and the future. arXiv:2312.07843 [cs.RO]

[12] Pascale Fung et al. 2025. Embodied AI agents: modeling the world. arXiv:2506.22355 [cs.AI]

[13] Yuan Gao et al. 2025. Foundation models in autonomous driving: a survey on scenario generation and scenario analysis. arXiv:2506.11526 [cs.RO]

[14] Dibya Ghosh et al. 2024. Octo: open-source generalist robot policy. arXiv:2405.12213 [cs.RO]

[15] Yanchen Guan et al. 2024. World models for autonomous driving: an initial survey. IEEE Transactions on Intelligent Vehicles (May 2024), 1–17. doi:10.1109/TIV.2024.3398357

[16] Wes Gurnee and Max Tegmark. 2024. Language models represent space and time. In International Conference on Learning Representations

[17] David Ha and Jurgen Schmidhuber. 2018. World models. arXiv:1803.10122 [cs.LG]

[18] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. 2025. Mastering diverse domains through world models. Nature 640, 17 (April 2025), 647–665. doi:10.1038/s41586-025-08744-2

[19] Jeff Hawkins. 2018. A Thousand Brains: A New Theory of Intelligence. Basic Books, New York, NY

[20] Kaiming He et al. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[21] Tairan He et al. 2025. ASAP: aligning simulation and real-world physics for learning agile humanoid whole-body skills. arXiv:2502.01143 [cs.RO]

[22] Dan Hendrycks et al. 2025. A definition of AGI. arXiv:2510.18212 [cs.AI]

[23] Yi-Hsuan Hsiao et al. 2025. Aerobatic maneuvers in insect-scale flapping-wing aerial robots via deep-learned robust tube model predictive control. Science Advances 11, 49 (December 2025), 1–13. doi:10.1126/sciadv.aea8716

[24] Anthony Hu et al. 2023. GAIA-1: a generative world model for autonomous driving. arXiv:2309.17080 [cs.CV]

[25] Philip Nicholas Johnson-Laird. 1983. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Harvard University Press, Cambridge, MA

[26] Daniel Kahneman. 2011. Thinking, Fast and Slow. Macmillan, New York, NY

[27] Alexander Khazatsky et al. 2024. DROID: a large-scale in-the-wild robot manipulation dataset. arXiv:2403.12945 [cs.RO]

[28] Lingdong Kong et al. 2025. 3D and 4D world modeling: a survey. arXiv:2509.07996 [cs.CV]

[29] Kristofer D. Kusano et al. 2025. Comparison of Waymo rider-only crash rates by crash type to human benchmarks at 56.7 million miles. Traffic Injury Prevention 26, 1 (May 2025), S8–S20. doi:10.1080/15389588.2025.2499887

[30] Yann LeCun. 2022. A path towards autonomous machine intelligence. https://openreview.net/pdf?id=BZ5a1r-kVsf Open Review

[31] Thomas M. Moerland, Joost Broekens, Aske Plaat, and Catholijn M. Jonker. 2023. Model-based reinforcement learning: a survey. Foundations and Trends in Machine Learning 16, 1 (Jan. 2023), 1–118. doi:10.1561/2200000086

[32] Abby O’Neill et al. 2023. Open X-Embodiment: robotic learning datasets and RT-X models. arXiv:2310.08864 [cs.RO]

[33] Judea Pearl. 2009. Causality: Models, Reasoning, and Inference (2nd. ed.). Cambridge University Press, London, England

[34] R. Quian Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried. 2005. Invariant visual representation by single neurons in the human brain. Nature 435, 23 (June 2005), 1102–1107. doi:10.1038/nature03687

[35] Lloyd Russell et al. 2025. GAIA-2: a controllable multi-view generative world model for autonomous driving. arXiv:2503.20523 [cs.CV]

[36] Julian Schrittwieser et al. 2020. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588, 24 (December 2020), 604–612. doi:10.1038/s41586-020-03051-4

[37] David Silver et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 6419 (December 2018), 1140–1144. doi:10.1126/science.aar6404

[38] Zhi Su et al. 2025. HITTER: a humanoid table tennis robot via hierarchical planning and learning. arXiv:2508.21043 [cs.RO]

[39] Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction (2nd. ed.). The MIT Press, London, England

[40] Ashish Vaswani et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems

[41] Wayve. 2025. GAIA-3: scaling world models to power safety and evaluation. https://wayve.ai/thinking/gaia-3. Blog post

[42] Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems

[43] Philipp Wu et al. 2022. DayDreamer: world models for physical robot learning. In Proceedings of the Conference on Robot Learning

[44] Cong Zhang, Bangyang Wei, Yang Liu, and Samuel Labi. 2026. World model-based long-tail and scenario-specific generation for autonomous driving. Journal of Intelligent and Connected Vehicles (2026). doi:10.26599/JICV.2026.9210080

[45] Jingyuan Zhao et al. 2025. A survey of autonomous driving from a deep learning perspective. Comput. Surveys 57, 10 (May 2025), 1–60. doi:10.1145/3729420

[46] Wayne Xin Zhao et al. 2023. A survey of large language models. arXiv:2303.18223 [cs.CL]

[47] Haoran Zhu, Zhenyuan Dong, Kristi Topollai, Beiyao Sha, and Anna Choromanska. 2025. Self-supervised representation learning with joint embedding predictive architecture for automotive LiDAR object detection. arXiv:2501.04969 [cs.RO]

[48] Zheng Zhu et al. 2024. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv:2405.03520 [cs.CV]

[49] Yongshuo Zong, Oisin Mac Aodha, and Timothy M. Hospedales. 2025. Self-supervised multimodal learning: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 47, 7 (July 2025), 5299–5318. doi:10.1109/TPAMI.2024.3429301

Received 20 February 2026; revised 12 March 2026; accepted 5 June 2026

ACM Comput. Surv., Vol. 58, No. 4, Article 111. Publication date: August 2026.
