{"work":{"id":"07227eee-8445-4c98-bce4-c6a6fd5ed907","openalex_id":null,"doi":null,"arxiv_id":"1803.10122","raw_key":null,"title":"World Models","authors":null,"authors_text":"David Ha, J\\\"urgen Schmidhuber","year":2018,"venue":"cs.LG","abstract":"We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.\n  An interactive version of this paper is available at https://worldmodels.github.io/","external_url":"https://arxiv.org/abs/1803.10122","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T19:51:10.709564+00:00","pith_arxiv_id":"1803.10122","created_at":"2026-05-10T00:49:48.805189+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":false,"display_title":"World Models","render_title":"World Models"},"hub":{"state":{"work_id":"07227eee-8445-4c98-bce4-c6a6fd5ed907","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":141,"external_cited_by_count":null,"distinct_field_count":9,"first_pith_cited_at":"2019-06-20T14:13:12+00:00","last_pith_cited_at":"2026-05-22T14:51:22+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-09T09:14:56.861125+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":36},{"context_role":"method","n":3},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":35},{"context_polarity":"use_method","n":3},{"context_polarity":"unclear","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"World Models","claims":[{"claim_text":"We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.\n  An interactive version of this paper is","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"enables joint optimization, fully exploiting priors from multiple founda- tion models. Extensive experiments demonstrate that our method signifi- cantly outperforms baselines in visual quality and long-term consistency. Keywords:Multimodal World Model·Foundation Model·Video Gen- eration·Representation Alignment 1 Introduction World models enable agents to predict environmental dynamics and plan ac- tions [11,65]. Recent video diffusion models [1,13,22,42,53,65] trained on large- scale datasets w","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 13 [26] Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025. 36 [27] David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 35 [28] Yoav HaCohen, Nisan Chip","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"106, 107, 119, 124, 166], models are urgently required to transition from virtual-world usage to real-world applications. As a result, world models have begun to enter the spotlight, with researchers increasingly focusing on the ability of large models to function in the physical world, moving beyond virtual environments. The concept of world models was initially introduced by [40], and later works such as [5, 12, 42] began to next-frame-predict tasks like video generation and 3D generation as f","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"interpret ambiguous scenarios and guide complex decision- making under uncertainty. However, most existing LLM- or VLM-empowered driving methods follow the paradigm that maps inputs directly to actions, falling short in explaining and capturing the temporal evolution of driving scenes-an essential factor for robust and anticipatory planning. Meanwhile, another stream of research, world model- ing [17], has emerged to simulate the spatio-temporal evo- lution of the scenes, as exemplified by video","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We build on the MuJoCo PointMaze environment (Todorov et al., 2012), using top-down RGB renderings as input. Maze layouts are randomly generated on a10× 10grid with connected free space; models are trained on 25 layouts and evaluated on 20 held-out layouts. Start and goal locations are sampled uniformly with grid-distance separationH, defining easy (D∈ [5, 8]), medium (D∈ [9, 12]), and hard (D∈[13,16]) regimes. Dataset and training details are provided in appendix B.4. Planning and Baselines.Hie","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"large-scale architectures that function as \"World Simulators\" capable of modeling physical laws and long-horizon causal- ities [14], [15]. This progression marks a substantial leap in generative capabilities, enabling models not only to synthesize visual content but to understand and predict the underlying physics of the environment, thereby paving the way for AGI [16], [17]. To fully appreciate this leap, it is essential to understand video generation has the potential to achieve world modeling","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks World Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (24 contexts).","role_counts":[{"n":24,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-18T03:50:43.310057+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"ad438a60-54a3-4183-b69f-20c372ff34ee","orcid":null,"display_name":"David Ha"},{"id":"a3e5dc95-9ce3-49c7-af47-02776872b35f","orcid":null,"display_name":"J\\\"urgen Schmidhuber"}]},"error":null,"updated_at":"2026-05-18T03:50:40.140343+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T09:07:57.550935+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Mastering Diverse Domains through World Models","work_id":"6aeb260f-8c7c-4f9c-b98b-067cd7c59acd","shared_citers":26},{"title":"Dream to Control: Learning Behaviors by Latent Imagination","work_id":"5103f4be-344a-4139-8504-eaa59f5bac9d","shared_citers":16},{"title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","work_id":"a9c28401-f16a-4933-89f0-788e2f94e52b","shared_citers":13},{"title":"GAIA-1: A Generative World Model for Autonomous Driving","work_id":"313484e6-a442-4522-8e19-d07e502844a8","shared_citers":12},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":11},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":9},{"title":"Ahmed Hendawy, Jan Peters, and Carlo D’Eramo","work_id":"360ec5fb-79fd-4490-bc73-3d161609c42d","shared_citers":9},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":9},{"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","shared_citers":9},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":9},{"title":"//arxiv.org/abs/2310.06114","work_id":"16f38691-7ab6-4e23-bba5-6b656579e579","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":7},{"title":"Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523","work_id":"1339e674-d09b-48b4-8e6f-efe55dcab22e","shared_citers":7},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":7},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":7},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":6},{"title":"//arxiv.org/abs/2010.02193","work_id":"154f6f5f-bb34-456d-8107-45d5b51433ce","shared_citers":6},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":6},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":6},{"title":"LeWorld- Model: Stable end-to-end joint-embedding predictive architecture from pixels","work_id":"d00a2f82-c871-4a58-9f34-22167e8efa93","shared_citers":6},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":6},{"title":"Revisiting Feature Prediction for Learning Visual Representations from Video","work_id":"f7251dcf-5341-4915-bfe7-27812387b61a","shared_citers":6},{"title":"World Action Models are Zero-shot Policies","work_id":"9a85fc69-74df-450e-94cd-69d186e9e830","shared_citers":6}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2023},{"n":1,"year":2024},{"n":4,"year":2025},{"n":64,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T09:08:01.831322+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T09:08:06.373828+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"World Models","claims":[{"claim_text":"We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.\n  An interactive version of this paper is","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"enables joint optimization, fully exploiting priors from multiple founda- tion models. Extensive experiments demonstrate that our method signifi- cantly outperforms baselines in visual quality and long-term consistency. Keywords:Multimodal World Model·Foundation Model·Video Gen- eration·Representation Alignment 1 Introduction World models enable agents to predict environmental dynamics and plan ac- tions [11,65]. Recent video diffusion models [1,13,22,42,53,65] trained on large- scale datasets w","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 13 [26] Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video generation.arXiv preprint arXiv:2505.00337, 2025. 36 [27] David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018. 35 [28] Yoav HaCohen, Nisan Chip","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"106, 107, 119, 124, 166], models are urgently required to transition from virtual-world usage to real-world applications. As a result, world models have begun to enter the spotlight, with researchers increasingly focusing on the ability of large models to function in the physical world, moving beyond virtual environments. The concept of world models was initially introduced by [40], and later works such as [5, 12, 42] began to next-frame-predict tasks like video generation and 3D generation as f","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"interpret ambiguous scenarios and guide complex decision- making under uncertainty. However, most existing LLM- or VLM-empowered driving methods follow the paradigm that maps inputs directly to actions, falling short in explaining and capturing the temporal evolution of driving scenes-an essential factor for robust and anticipatory planning. Meanwhile, another stream of research, world model- ing [17], has emerged to simulate the spatio-temporal evo- lution of the scenes, as exemplified by video","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"We build on the MuJoCo PointMaze environment (Todorov et al., 2012), using top-down RGB renderings as input. Maze layouts are randomly generated on a10× 10grid with connected free space; models are trained on 25 layouts and evaluated on 20 held-out layouts. Start and goal locations are sampled uniformly with grid-distance separationH, defining easy (D∈ [5, 8]), medium (D∈ [9, 12]), and hard (D∈[13,16]) regimes. Dataset and training details are provided in appendix B.4. Planning and Baselines.Hie","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"large-scale architectures that function as \"World Simulators\" capable of modeling physical laws and long-horizon causal- ities [14], [15]. This progression marks a substantial leap in generative capabilities, enabling models not only to synthesize visual content but to understand and predict the underlying physics of the environment, thereby paving the way for AGI [16], [17]. To fully appreciate this leap, it is essential to understand video generation has the potential to achieve world modeling","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks World Models because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (24 contexts).","role_counts":[{"n":24,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-18T03:50:43.314508+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"World Models","claims":[{"claim_text":"We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.\n  An interactive version of this paper is","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks World Models because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T09:07:53.620512+00:00"}},"summary":{"title":"World Models","claims":[{"claim_text":"We explore building generative neural network models of popular reinforcement learning environments. Our world model can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.\n  An interactive version of this paper is","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks World Models because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Mastering Diverse Domains through World Models","work_id":"6aeb260f-8c7c-4f9c-b98b-067cd7c59acd","shared_citers":26},{"title":"Dream to Control: Learning Behaviors by Latent Imagination","work_id":"5103f4be-344a-4139-8504-eaa59f5bac9d","shared_citers":16},{"title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning","work_id":"a9c28401-f16a-4933-89f0-788e2f94e52b","shared_citers":13},{"title":"GAIA-1: A Generative World Model for Autonomous Driving","work_id":"313484e6-a442-4522-8e19-d07e502844a8","shared_citers":12},{"title":"OpenVLA: An Open-Source Vision-Language-Action Model","work_id":"3e7e65c5-5aed-4fe9-8414-2092bcb31cc7","shared_citers":11},{"title":"$\\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization","work_id":"d1ad7304-d09a-49bc-809e-846439f6aff9","shared_citers":9},{"title":"Ahmed Hendawy, Jan Peters, and Carlo D’Eramo","work_id":"360ec5fb-79fd-4490-bc73-3d161609c42d","shared_citers":9},{"title":"Auto-Encoding Variational Bayes","work_id":"97d95295-30e1-42b4-bbf6-85f0fa4edb44","shared_citers":9},{"title":"Cosmos World Foundation Model Platform for Physical AI","work_id":"a2dba24c-318d-476a-8b21-4289c265810c","shared_citers":9},{"title":"Wan: Open and Advanced Large-Scale Video Generative Models","work_id":"ad3ebc3b-4224-46c9-b61d-bcf135da0a7c","shared_citers":9},{"title":"//arxiv.org/abs/2310.06114","work_id":"16f38691-7ab6-4e23-bba5-6b656579e579","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"DINOv2: Learning Robust Visual Features without Supervision","work_id":"26b304e5-b54a-4f26-be7e-83299eca52e4","shared_citers":7},{"title":"Gaia-2: A controllable multi-view generative world model for autonomous driving.arXiv preprint arXiv:2503.20523","work_id":"1339e674-d09b-48b4-8e6f-efe55dcab22e","shared_citers":7},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":7},{"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","shared_citers":7},{"title":"$\\pi_0$: A Vision-Language-Action Flow Model for General Robot Control","work_id":"f790abdc-a796-482f-a40d-f8ee035ecfc2","shared_citers":6},{"title":"//arxiv.org/abs/2010.02193","work_id":"154f6f5f-bb34-456d-8107-45d5b51433ce","shared_citers":6},{"title":"Flow Matching for Generative Modeling","work_id":"6edb71c4-5d64-40af-a394-9757ea051a36","shared_citers":6},{"title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","work_id":"b8f5e260-fff5-444e-bcf5-2c42cfefd83d","shared_citers":6},{"title":"LeWorld- Model: Stable end-to-end joint-embedding predictive architecture from pixels","work_id":"d00a2f82-c871-4a58-9f34-22167e8efa93","shared_citers":6},{"title":"Octo: An Open-Source Generalist Robot Policy","work_id":"f9ca0722-8855-48c3-a27a-0eefb7e19253","shared_citers":6},{"title":"Revisiting Feature Prediction for Learning Visual Representations from Video","work_id":"f7251dcf-5341-4915-bfe7-27812387b61a","shared_citers":6},{"title":"World Action Models are Zero-shot Policies","work_id":"9a85fc69-74df-450e-94cd-69d186e9e830","shared_citers":6}],"time_series":[{"n":1,"year":2019},{"n":1,"year":2023},{"n":1,"year":2024},{"n":4,"year":2025},{"n":64,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"ad438a60-54a3-4183-b69f-20c372ff34ee","orcid":null,"display_name":"David Ha","source":"manual","import_confidence":0.72},{"id":"a3e5dc95-9ce3-49c7-af47-02776872b35f","orcid":null,"display_name":"J\\\"urgen Schmidhuber","source":"manual","import_confidence":0.72}]}}