pith. machine review for the scientific record.

arxiv: 2604.16592 · v1 · submitted 2026-04-17 · 💻 cs.RO · cs.AI · cs.CV · cs.ET

Recognition: unknown

Human Cognition in Machines: A Unified Perspective of World Models


Pith reviewed 2026-05-10 08:08 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.ET
keywords world models · cognitive architecture theory · unified framework · meta-cognition · motivation · epistemic world models · AI taxonomy · cognitive functions

The pith

A unified framework based on cognitive architecture theory requires world models to incorporate all human cognitive functions including motivation and meta-cognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish a conceptual unified framework for world models that draws directly from Cognitive Architecture Theory to include the full set of cognitive functions: memory, perception, language, reasoning, imagining, motivation, and meta-cognition. This matters because many existing AI systems assert near human-like capabilities without a shared standard for what those capabilities entail or how to measure completeness. The framework distinguishes prior work by which functions each model addresses, applies a taxonomy across video, embodied, and a newly defined epistemic category, and flags large gaps in motivation and meta-cognition. By doing so it supplies concrete directions for future models that aim at scientific discovery and self-aware behavior.

Core claim

The paper establishes that world models can be unified and evaluated by mapping them onto the complete set of cognitive functions supplied by Cognitive Architecture Theory. Prior models are shown to be partial, with motivation (especially intrinsic motivation) and meta-cognition remaining drastically under-researched. The work introduces epistemic world models as a distinct category for agent frameworks that operate over structured knowledge for scientific discovery. The resulting taxonomy, when applied to video, embodied, and epistemic models, identifies specific gaps and proposes targeted research directions to close them.

What carries the argument

The unified conceptual framework that maps every world model onto the full list of cognitive functions from Cognitive Architecture Theory, using this mapping both to classify existing systems and to expose missing elements.

If this is right

  • Any world model claiming human-like cognition must be assessed against all seven cognitive functions rather than a subset.
  • Motivation and meta-cognition constitute the largest and most consequential research gaps that future models must address.
  • Epistemic world models form a new category that focuses on structured knowledge and scientific discovery tasks.
  • The taxonomy supplies a classification scheme that can be applied uniformly to video, embodied, and epistemic world models to guide development.
  • Concrete directions for filling the identified gaps can be pursued to produce more complete agent architectures.
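
The first consequence above, assessing a model against all seven functions rather than a subset, can be sketched as a set difference. This is a minimal illustration, not the paper's method: the `CognitiveFunction` enum and `missing_functions` helper are hypothetical names, and the example coverage is invented purely to show the mechanics.

```python
from enum import Enum


class CognitiveFunction(Enum):
    # The seven functions listed in the paper's abstract.
    MEMORY = "memory"
    PERCEPTION = "perception"
    LANGUAGE = "language"
    REASONING = "reasoning"
    IMAGINING = "imagining"
    MOTIVATION = "motivation"
    META_COGNITION = "meta-cognition"


def missing_functions(covered):
    """Return the functions a model does not address, in enum order."""
    covered = set(covered)
    return [f for f in CognitiveFunction if f not in covered]


# Hypothetical model: this coverage is invented for illustration only.
example_covered = {
    CognitiveFunction.MEMORY,
    CognitiveFunction.PERCEPTION,
    CognitiveFunction.IMAGINING,
}

gaps = missing_functions(example_covered)
print([f.value for f in gaps])
```

Under this reading, a model counts as complete only when `missing_functions` returns an empty list. Deciding whether a given system actually "addresses" a function is the hard part, and this sketch leaves it open.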

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adoption of the framework could create a common evaluation language across different subfields of world-model research.
  • Models that remain incomplete in motivation and meta-cognition may continue to struggle with sustained autonomous exploration.
  • The new epistemic category suggests that world-model research could overlap more directly with automated scientific reasoning systems.
  • Future empirical tests could compare agents built with the full function set against current partial models on discovery-oriented benchmarks.

Load-bearing premise

Cognitive Architecture Theory supplies the complete and correct set of cognitive functions needed to ground and evaluate all world models in AI.

What would settle it

A world model that achieves claimed human-like performance on tasks involving self-reflection, long-term planning, or open-ended discovery while omitting explicit mechanisms for motivation or meta-cognition would falsify the necessity of the full framework.

Figures

Figures reproduced from arXiv: 2604.16592 by Amir Taherin, Arash Akbari, Arman Akbari, David Kaeli, Edmund Yeh, Enfu Nan, Geng Yuan, Haochen Zeng, Jennifer Dy, Juyi Lin, Pu Zhao, Rahul Chowdhury, Sarah Ostadabbas, Sean Duffy, Silvia Zhang, Timothy Rupprecht, Weiwei Chen, Yanzhi Wang, Yifan Cao, Yixiao Chen, Yixin Shen, Yumei He.

Figure 1. Our survey studies the convergence of three different but inter-related fields: human cognition, machine cognition, and World Models.

Figure 2. The taxonomy of World Models covered in our survey correspond to the component parts of cognitive architecture theory [123] they innovate most.

Figure 3. The component-parts of our Unified World Model built from first principles in cognitive architecture theory [123] and meta-cognition [10]. This serves as a conceptual road-map for World Model research.

Figure 4. Above are the typical architectures encountered when reviewing video World Models.

Figure 5. Above are the typical architectures encountered when reviewing embodied World Models.

Figure 6. Above are the typical architecture and Global Workspace frameworks encountered when reviewing World Models for scientific discovery used with a human-in-the-loop subject-matter expert.
Original abstract

This comprehensive report distinguishes prior works by the cognitive functions they innovate. Many works claim an almost "human-like" cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles in Cognitive Architecture Theory (CAT). We present a conceptual unified framework for world models that fully incorporates all the cognitive functions associated with CAT (i.e. memory, perception, language, reasoning, imagining, motivation, and meta-cognition) and identify gaps in the research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and meta-cognition remain drastically under-researched, and we propose concrete directions informed by active inference and global workspace theory to address them. We further introduce Epistemic World Models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied across video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a conceptual unified framework for world models in AI, grounded in Cognitive Architecture Theory (CAT). It claims this framework fully incorporates the cognitive functions of memory, perception, language, reasoning, imagining, motivation, and meta-cognition. The work applies a taxonomy across video, embodied, and epistemic world models, identifies research gaps (particularly in motivation and meta-cognition), proposes directions informed by active inference and global workspace theory, and introduces Epistemic World Models as a new category for structured-knowledge discovery agents.

Significance. If the taxonomy and framework are adopted, the paper could offer a structured lens for evaluating world models against human-like cognitive capabilities and highlight under-explored areas such as intrinsic motivation. The introduction of Epistemic World Models provides a novel categorization that may stimulate targeted research in scientific discovery agents. As a high-level synthesis without new derivations, empirical tests, or formal mappings, its significance lies in guiding future conceptual and experimental work rather than providing immediately actionable technical advances.

major comments (1)
  1. [Abstract] Abstract and framework presentation: The central claim that the proposed framework 'fully incorporates' all listed CAT functions rests on the unexamined selection of one specific enumeration of those functions. The manuscript does not justify this choice against competing cognitive architectures (e.g., ACT-R, SOAR, LIDA, Global Workspace) that differ on whether functions such as attention, emotion, or procedural learning are primitive or emergent, nor does it demonstrate a complete mapping without omissions or unaddressed interactions. This assumption is load-bearing for the 'unified' and 'fully incorporates' assertions.
minor comments (1)
  1. The application of the taxonomy to video, embodied, and epistemic categories would benefit from clearer notation or a summary table distinguishing how each CAT function is realized (or not) in representative prior works.
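
The summary table requested in the minor comment could be generated along the following lines. This is only a sketch under stated assumptions: `coverage_table` is a hypothetical helper, and the row names in the usage note are placeholders, not works or results from the paper.

```python
# The seven functions named in the abstract, in the order given there.
FUNCTIONS = ["memory", "perception", "language", "reasoning",
             "imagining", "motivation", "meta-cognition"]


def coverage_table(rows):
    """Render {work_name: set_of_functions} as a plain-text grid."""
    name_width = max([len("work")] + [len(name) for name in rows])
    header = "work".ljust(name_width) + " | " + " | ".join(FUNCTIONS)
    lines = [header, "-" * len(header)]
    for name, covered in rows.items():
        unknown = set(covered) - set(FUNCTIONS)
        if unknown:
            raise ValueError(f"unknown functions for {name}: {sorted(unknown)}")
        # Mark a covered function with "x", centered under its column header.
        cells = [("x" if f in covered else "").center(len(f)) for f in FUNCTIONS]
        lines.append(name.ljust(name_width) + " | " + " | ".join(cells))
    return "\n".join(lines)
```

For example, `print(coverage_table({"placeholder-video-model": {"memory", "perception"}}))` prints a header row, a rule, and one row with two marked columns. Real rows would have to come from the survey's own classification of each work.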

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding the justification of our selected cognitive functions and the strength of the 'unified' and 'fully incorporates' claims below.

Point-by-point responses
  1. Referee: The central claim that the proposed framework 'fully incorporates' all listed CAT functions rests on the unexamined selection of one specific enumeration of those functions. The manuscript does not justify this choice against competing cognitive architectures (e.g., ACT-R, SOAR, LIDA, Global Workspace) that differ on whether functions such as attention, emotion, or procedural learning are primitive or emergent, nor does it demonstrate a complete mapping without omissions or unaddressed interactions. This assumption is load-bearing for the 'unified' and 'fully incorporates' assertions.

    Authors: We agree that the manuscript would benefit from an explicit justification of the chosen cognitive functions and a clearer qualification of our claims. Our enumeration (memory, perception, language, reasoning, imagining, motivation, and meta-cognition) was selected as a representative synthesis of functions most directly relevant to world modeling in AI, drawing from common elements across CAT literature. However, we acknowledge that the current text does not compare this selection to specific architectures such as ACT-R, SOAR, LIDA, or Global Workspace, nor does it provide a detailed mapping that addresses potential omissions or interactions (e.g., attention or emotion as modulators). In the revised manuscript, we will add a dedicated paragraph in the introduction that (1) motivates the selection by referencing overlaps with major CATs, (2) notes that functions like attention and emotion can emerge from or modulate the core set, and (3) qualifies 'fully incorporates' to mean that the framework supplies structural mechanisms for these functions while recognizing that complete mappings and interaction details remain open for future work. This revision will make the assumptions explicit and reduce the load-bearing nature of the claim.

    revision: yes

Circularity Check

0 steps flagged

No circularity: conceptual synthesis of external CAT literature

full rationale

The paper offers a high-level taxonomy and gap analysis for world models by mapping established cognitive functions from Cognitive Architecture Theory (CAT) onto AI components. No equations, fitted parameters, or derivations appear; the 'fully incorporates' claim is presented as an organizational synthesis of prior external literature rather than a self-referential reduction. No self-citation chains, ansatzes, or renamings of known results are load-bearing for the central assertions. The framework remains self-contained against external benchmarks and does not force its outputs by construction from its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on CAT as the authoritative grounding for cognitive functions and introduces Epistemic World Models without independent empirical support or falsifiable predictions.

axioms (1)
  • domain assumption: Cognitive Architecture Theory provides the complete and necessary list of cognitive functions for evaluating human-like world models.
    The entire unified framework and gap analysis are built directly on this premise.
invented entities (1)
  • Epistemic World Models (no independent evidence)
    purpose: A new category of agent frameworks for scientific discovery that operate over structured knowledge.
    Introduced in the paper as an addition to existing video and embodied world model categories.

pith-pipeline@v0.9.0 · 5550 in / 1299 out tokens · 41120 ms · 2026-05-10T08:08:23.281682+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhyGround: Benchmarking Physical Reasoning in Generative World Models

    cs.CV 2026-05 accept novelty 7.0

    PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

Reference graph

Works this paper leans on

234 extracted references · 137 canonical work pages · cited by 1 Pith paper · 28 internal anchors

  1. [1]

    preprint (2026), https://research.beingbeyond.com/projects/being-h07/being-h07.pdf

    Being-h0.7: A latent world-action model from egocentric videos. preprint (2026), https://research.beingbeyond.com/projects/being-h07/being-h07.pdf

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  3. [3]

    World Simulation with Video Foundation Models for Physical AI

    Ali, A., Bai, J., Bala, M., Balaji, Y., Blakeman, A., Cai, T., Cao, J., Cao, T., Cha, E., Chao, Y.W., et al.: World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062 (2025)

  4. [4]

    Alonso, E., Jelley, A., Micheli, V., Kanervisto, A., Storkey, A., Pearce, T., Fleuret, F.: Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems 37, 58757–58791 (2024)

  5. [5]

    Concrete Problems in AI Safety

    Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in ai safety. arXiv preprint arXiv:1606.06565 (2016)

  6. [6]

    Anderson, J.R., Matessa, M., Lebiere, C.: Act-r: A theory of higher level cognition and its relation to visual attention. Human–Computer Interaction 12(4), 439–462 (1997)

  7. [7]

    Arshid, K., Krayani, A., Marcenaro, L., Gomez, D.M., Regazzoni, C.: Toward autonomous uav swarm navigation: a review of trajectory design paradigms. Sensors (Basel, Switzerland) 25(18), 5877 (2025)

  8. [8]

    Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., Ballas, N.: Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 15619–15629 (2023)

  9. [9]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al.: V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985 (2025)

  10. [10]

    Cambridge University Press (1993)

    Baars, B.J.: A cognitive theory of consciousness. Cambridge University Press (1993)

  11. [11]

    Bagchi, A., Bao, Z., Bharadhwaj, H., Wang, Y.X., Tokmakov, P., Hebert, M.: Walk through paintings: Egocentric world models from internet priors. arXiv preprint arXiv:2601.15284 (2026)

  12. [12]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagination

    Barcellona, L., Zadaianchuk, A., Allegro, D., Papa, S., Ghidoni, S., Gavves, E.: Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. arXiv preprint arXiv:2412.14957 (2024)

  13. [13]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

  14. [14]

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: V-jepa: Latent video prediction for visual representation learning (2023)

  15. [15]

    arXiv preprint arXiv:1912.05510 (2019)

    Berseth, G., Geng, D., Devin, C., Rhinehart, N., Finn, C., Jayaraman, D., Levine, S.: Smirl: Surprise minimizing reinforcement learning in unstable environments. arXiv preprint arXiv:1912.05510 (2019)

  16. [16]

    Bheemaiah, A., Yang, S.: Knowledge graphs as world models for semantic material-aware obstacle handling in autonomous vehicles. arXiv preprint arXiv:2503.21232 (2025)

  17. [17]

    Motus: A Unified Latent Action World Model

    Bi, H., Tan, H., Xie, S., Wang, Z., Huang, S., Liu, H., Zhao, R., Feng, Y., Xiang, C., Rong, Y., et al.: Motus: A unified latent action world model. arXiv preprint arXiv:2512.13030 (2025)

  18. [18]

    Meta-thinking in llms via multi-agent reinforcement learning: A survey

    Bilal, A., Mohsin, M.A., Umer, M., Bangash, M.A.K., Jamshed, M.A.: Meta-thinking in llms via multi-agent reinforcement learning: A survey. arXiv preprint arXiv:2504.14520 (2025)

  19. [19]

    Zero-shot robotic manipulation with pretrained image-editing diffusion models

    Black, K., Nakamoto, M., Atreya, P., Walke, H., Finn, C., Kumar, A., Levine, S.: Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639 (2023)

  20. [20]

    Boggs, J.: Towards visual-symbolic integration in the soar cognitive architecture. Cognitive Systems Research 91, 101353 (2025)

  21. [21]

    arXiv preprint arXiv:2509.19789 (2025)

    Bosio, C., Woelki, G., Hendy, N., Roy, N., Kim, B.: Rdar: Reward-driven agent relevance estimation for autonomous driving. arXiv preprint arXiv:2509.19789 (2025)

  22. [22]

    Bühler, K.: Sprachtheorie, vol. 2. Jena Fischer (1934)

  23. [23]

    Burchi, M., Timofte, R.: Accurate and efficient world modeling with masked latent transformers. arXiv preprint arXiv:2507.04075 (2025)

  24. [24]

    Cao, M., Tang, H., Zhao, H., Han, M., Liu, R., Sun, Q., Chang, X., Reid, I., Liang, X.: Order from chaos: Physical world understanding from glitchy gameplay videos. arXiv preprint arXiv:2601.16471 (2026)

  25. [25]

    WorldVLA: Towards Autoregressive Action World Model

    Cen, J., Yu, C., Yuan, H., Jiang, Y., Huang, S., Guo, J., Li, X., Song, Y., Luo, H., Wang, F., et al.: Worldvla: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539 (2025)

  26. [26]

    Chang, J., Uehara, M., Sreenivas, D., Kidambi, R., Sun, W.: Mitigating covariate shift in imitation learning via offline data with partial coverage. Advances in Neural Information Processing Systems 34, 965–979 (2021)

  27. [27]

    In: International Conference on Product-Focused Software Process Improvement

    Chatlatanagulchai, W., Thonglek, K., Reid, B., Kashiwa, Y., Leelaprute, P., Rungsawang, A., Manaskasemsak, B., Iida, H.: On the use of agentic coding manifests: An empirical study of claude code. In: International Conference on Product-Focused Software Process Improvement. pp. 543–551. Springer (2025)

  28. [28]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Cheang, C.L., Chen, G., Jing, Y., Kong, T., Li, H., Li, Y., Liu, Y., Wu, H., Xu, J., Yang, Y., et al.: Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158 (2024)

  29. [29]

    Large Video Planner Enables Generalizable Robot Control

    Chen, B., Zhang, T., Geng, H., Song, K., Zhang, C., Li, P., Freeman, W.T., Malik, J., Abbeel, P., Tedrake, R., et al.: Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840 (2025)

  30. [30]

    Chen, S., Ma, S., Yu, S., Zhang, H., Zhao, S., Lu, C.: Exploring consciousness in llms: A systematic survey of theories, implementations, and frontier risks (2025), https://arxiv.org/abs/2505.19806

  31. [31]

    Chen, Y., et al.: Drivinggpt: Unifying driving world modeling and planning with multi-modal autoregressive transformers. arXiv preprint arXiv:2412.18607 (2024)

  32. [32]

    Colombatto, C., Fleming, S.M.: Folk psychological attributions of consciousness to large language models. Neuroscience of Consciousness 2024(1), niae013 (2024)

  33. [33]

    Cornelissen, C., Leroux, S., Simoens, P.: Le mumo jepa: Multi-modal self-supervised representation learning with learnable fusion tokens. arXiv preprint arXiv:2603.24327 (2026)

  34. [34]

    Dang, C., et al.: Sparseworld: A flexible, adaptive, and efficient 4d occupancy world model powered by sparse and dynamic queries. arXiv preprint arXiv:2510.17482 (2025)

  35. [35]

    Darwin, C.: The descent of man, and selection in relation to sex, vol. 2. D. Appleton (1872)

  36. [36]

    WW Norton & Company (1998)

    Deacon, T.W.: The symbolic species: The co-evolution of language and the brain. WW Norton & Company (1998)

  37. [37]

    Dentella, V., Günther, F., Murphy, E., Marcus, G., Leivada, E.: Testing ai on language comprehension tasks reveals insensitivity to underlying meaning. Scientific Reports 14(1), 28083 (2024)

  38. [38]

    Destrade, M., Bounou, O., Lidec, Q.L., Ponce, J., LeCun, Y.: Value-guided action planning with jepa world models. arXiv preprint arXiv:2601.00844 (2025)

  39. [39]

    Dream2Flow: Bridging video generation and open-world manipulation with 3D object flow

    Dharmarajan, K., Huang, W., Wu, J., Fei-Fei, L., Zhang, R.: Dream2flow: Bridging video generation and open-world manipulation with 3d object flow. arXiv preprint arXiv:2512.24766 (2025)

  40. [40]

    Ding, J., Zhang, Y., Shang, Y., Zhang, Y., Zong, Z., Feng, J., Yuan, Y., Su, H., Li, N., Sukiennik, N., et al.: Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys 58(3), 1–38 (2025)

  41. [41]

    Doerig, A., Kietzmann, T.C., Allen, E., Wu, Y., Naselaris, T., Kay, K., Charest, I.: High-level visual representations in the human brain are aligned with large language models. Nature Machine Intelligence pp. 1–15 (2025)

  42. [42]

    Harvard university press (1993)

    Donald, M.: Origins of the modern mind: Three stages in the evolution of culture and cognition. Harvard university press (1993)

  43. [43]

    Dong, J., Lyu, Q., Liu, B., Wang, X., Liang, W., Zhang, D., Tu, J., Li, H., Zhao, H., Ding, H., et al.: Learning to model the world: A survey of world models in artificial intelligence. Authorea Preprints (2026)

  44. [44]

    Du, Y., Yang, S., Dai, B., Dai, H., Nachum, O., Tenenbaum, J., Schuurmans, D., Abbeel, P.: Learning universal policies via text-guided video generation. Advances in neural information processing systems 36, 9156–9172 (2023)

  45. [45]

    Dung Nguyen, V., Yang, Z., Buckley, C.L., Ororbia, A.: R-aif: Solving sparse-reward robotic tasks from pixels with active inference and world models. arXiv e-prints pp. arXiv–2409 (2024)

  46. [46]

    arXiv preprint arXiv:2601.06309 (2026)

    Durante, Z., Singh, S., Khatua, A., Agarwal, S., Tan, R., Lee, Y.J., Gao, J., Adeli, E., Fei-Fei, L.: Videoweave: A data-centric approach for efficient video understanding. arXiv preprint arXiv:2601.06309 (2026)

  47. [47]

    Active inference and artificial reasoning

    Friston, K., Da Costa, L., Tschantz, A., Heins, C., Buckley, C., Verbelen, T., Parr, T.: Active inference and artificial reasoning. arXiv preprint arXiv:2512.21129 (2025)

  48. [48]

    Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G., et al.: Active inference and learning. Neuroscience & Biobehavioral Reviews 68, 862–879 (2016)

  49. [49]

    Fung, P., Bachrach, Y., Celikyilmaz, A., Chaudhuri, K., Chen, D., Chung, W., Dupoux, E., Gong, H., et al.: Embodied ai agents: Modeling the world (2025)

  50. [50]

    Adaworld: Learning adaptable world models with latent actions

    Gao, S., Zhou, S., Du, Y., Zhang, J., Gan, C.: Adaworld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938 (2025)

  51. [51]

    Gao, Y., Zhang, Q., Ding, D.W., Zhao, D.: Dream to drive with predictive individual world model. IEEE Transactions on Intelligent Vehicles (2024)

  52. [52]

    Garrido, Q., Nagarajan, T., Terver, B., Ballas, N., LeCun, Y., Rabbat, M.: Learning latent action world models in the wild. arXiv preprint arXiv:2601.05230 (2026)

  53. [53]

    In: Proceedings of the 32nd ACM International Conference on Multimedia

    Ge, Z., Huang, H., Zhou, M., Li, J., Wang, G., Tang, S., Zhuang, Y.: Worldgpt: Empowering llm as multimodal world model. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 7346–7355 (2024)

  54. [54]

    Gentile, G., Morello, G., La Cognata, V., Guarnaccia, M., Cavallaro, S.: Artificial intelligence in transcriptomics: From human-in-the-loop to agentic ai. Journal of Personalized Medicine 16(4), 181 (2026)

  55. [55]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Goff, M., Hogan, G., Hotz, G., du Parc Locmaria, A., Raczy, K., Schäfer, H., Shihadeh, A., Zhang, W., Yousfi, Y.: Learning to drive from a world model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 1964–1973 (2025)

  56. [56]

    Towards an AI co-scientist

    Gottweis, J., Weng, W.H., Daryin, A., Tu, T., Palepu, A., Sirkovic, P., Myaskovsky, A., Weissenberger, F., Rong, K., Tanno, R., et al.: Towards an ai co-scientist. arXiv preprint arXiv:2502.18864 (2025)

  57. [57]

    Gu, Y., Zhang, K., Ning, Y., Zheng, B., Gou, B., Xue, T., Chang, C., Srivastava, S., Xie, Y., Qi, P., et al.: Is your llm secretly a world model of the internet? model-based planning for web agents. arXiv preprint arXiv:2411.06559 (2024)

  58. [58]

    Gumbsch, C., Sajid, N., Martius, G., Butz, M.V.: In: The Twelfth International Conference on Learning Representations (2023)

  59. [59]

    Guo, J., Ma, X., Wang, Y., Yang, M., Liu, H., Li, Q.: Flowdreamer: A rgb-d world model with flow-based motion representations for robot manipulation. IEEE Robotics and Automation Letters 11(3), 2466–2473 (2026)

  60. [60]

    World Models

    Ha, D., Schmidhuber, J.: World models. arXiv preprint arXiv:1803.10122 (2018)

  61. [61]

    LTX-Video: Realtime Video Latent Diffusion

    HaCohen, Y., Chiprut, N., Brazowski, B., Shalem, D., Moshe, D., Richardson, E., Levin, E., Shiran, G., Zabari, N., Gordon, O., et al.: Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103 (2024)

  62. [62]

    Hadfield-Menell, D., Dragan, A.D., Abbeel, P., Russell, S.: The off-switch game. In: AAAI Workshops (2017)

  63. [63]

    TD-MPC2: Scalable, Robust World Models for Continuous Control

    Hansen, N., Su, H., Wang, X.: Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828 (2023)

  64. [64]

    Hierarchical world models as visual whole-body humanoid controllers

    Hansen, N., SV, J., Sobal, V., LeCun, Y., Wang, X., Su, H.: Hierarchical world models as visual whole-body humanoid controllers. arXiv preprint arXiv:2405.18418 (2024)

  65. [65]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Hao, C., Lu, W., Xu, Y., Chen, Y.: Neural motion simulator pushing the limit of world models in reinforcement learning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 27608–27617 (2025)

  66. [67]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., et al.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)

  67. [68]

    Enerverse: Envisioning embodied future space for robotics manipulation

    Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y., Liao, Y., Gao, P., Li, H., Yao, M., et al.: Enerverse: Envisioning embodied future space for robotics manipulation. arXiv preprint arXiv:2501.01895 (2025)

  68. [69]

    SafeDreamer: Safe reinforcement learning with world models

    Huang, W., Ji, J., Xia, C., Zhang, B., Yang, Y.: Safedreamer: Safe reinforcement learning with world models. arXiv preprint arXiv:2307.07176 (2023)

  69. [70]

    PointWorld: Scaling 3D world models for in-the-wild robotic manipulation

    Huang, W., Chao, Y.W., Mousavian, A., Liu, M.Y., Fox, D., Mo, K., Fei-Fei, L.: Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782 (2026)

  70. [71]

    Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self forcing: Bridging the train-test gap in autoregressive video diffusion (2025), https://arxiv.org/abs/2506.08009

  71. [72]

    Huang, Y., Zhang, J., Zou, S., Liu, X., Hu, R., Xu, K.: Ladi-wm: A latent diffusion-based world model for predictive manipulation. arXiv preprint arXiv:2505.11528 (2025)

  72. [73]

    The Platonic Representation Hypothesis

    Huh, M., Cheung, B., Wang, T., Isola, P.: The platonic representation hypothesis. arXiv preprint arXiv:2405.07987 (2024)

  73. [74]

    Ibrahim, L., Cheng, M.: Thinking beyond the anthropomorphic paradigm benefits llm research. arXiv preprint arXiv:2502.09192 (2025)

  74. [75]

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M.Y., et al.: π0.5: a vision-language-action model with open-world generalization (2025)

  75. [76]

    Jang, J., Yoo, M., Yoon, S., Woo, H.: Test-time mixture of world models for embodied agents in dynamic environments. arXiv preprint arXiv:2601.22647 (2026)

  76. [77]

    Dreamgen: Unlocking generalization in robot learning through video world models

    Jang, J., Ye, S., Lin, Z., Xiang, J., Bjorck, J., Fang, Y., Hu, F., Huang, S., Kundalia, K., Lin, Y.C., et al.: Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705 (2025)

  77. [78]

    Self-refining video sampling

    Jang, S., Ki, T., Jo, J., Xie, S., Yoon, J., Hwang, S.J.: Self-refining video sampling. arXiv preprint arXiv:2601.18577 (2026)

  78. [79]

    Jaynes, J.: from the origin of consciousness in the breakdown of the bicameral mind. In: Creative Writing, pp. 541–543. Routledge (2013)

  79. [80]

    IRL-VLA: Training an vision-language-action policy via reward world model

    Jiang, A., Gao, Y., Wang, Y., Sun, Z., Wang, S., Heng, Y., Sun, H., Tang, S., Zhu, L., Chai, J., et al.: Irl-vla: Training an vision-language-action policy via reward world model. arXiv preprint arXiv:2508.06571 (2025)

  80. [81]

    Johnson, S.G., Karimi, A.H., Bengio, Y., Chater, N., Gerstenberg, T., Larson, K., Levine, S., Mitchell, M., Rahwan, I., Schölkopf, B., et al.: Imagining and building wise machines: The centrality of ai metacognition. Trends in Cognitive Sciences (2025)

Showing first 80 references.