Physically Viable World Models: A Case for Query-Conditioned Embodied AI

Adam J. Thorpe; Cheng-Hsi Hsiao; Hassan Iqbal; Krishna Kumar; Neel P. Bhatt; Stepan Tretiakov; Su Ann Low; Ufuk Topcu; Xingjian Li

arxiv: 2605.30542 · v1 · pith:SR3OEXSZnew · submitted 2026-05-28 · 💻 cs.AI

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

Adam J. Thorpe , Stepan Tretiakov , Cheng-Hsi Hsiao , Su Ann Low , Xingjian Li , Hassan Iqbal , Neel P. Bhatt , Ufuk Topcu

show 1 more author

Krishna Kumar

This is my paper

Pith reviewed 2026-06-29 06:52 UTC · model grok-4.3

classification 💻 cs.AI

keywords embodied AIworld modelsintervention queriesphysical viabilityquery-conditioned modelsmodular componentsorchestratoraction outcomes

0 comments

The pith

Embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query rather than predict observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that observation-predictive world models in embodied AI produce visually plausible but physically incorrect rollouts because distinct physical systems can appear identical from passive views yet diverge when actions intervene. It demonstrates this failure mode through controlled benchmarks that hold the visible scene fixed while changing latent physics, leading models to recommend infeasible actions, mispredict outcomes, or certify unsafe behavior. In response the authors claim that viable models must determine the minimal physical structure needed for a given query and assemble it from modular components covering environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator selects and composes these components, allowing analytic, simulated, learned, or hybrid transitions provided they preserve the distinctions that determine interventional outcomes. This decomposition yields interpretable models whose outputs can be audited directly against the query and supplies both a design principle for new systems and a feasibility test for existing ones.

Core claim

What carries the argument

The query-conditioned physically viable world model, which identifies the simplest physical abstraction sufficient for an intervention query and assembles it from modular components via an autonomous orchestrator.

If this is right

Models remain interpretable because each component can be verified independently against the query.
Outputs become auditable because the model must preserve only the distinctions relevant to the intervention.
When closed-form physics is unavailable the transition model may be analytic, simulated, learned, or hybrid provided it keeps the structure that determines outcomes.
The approach supplies an explicit feasibility test: an existing model is viable only if it preserves the distinctions relevant to the query.
Demonstrations show the method succeeding on queries where current systems fail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The modular decomposition could support incremental verification of safety properties in robotic planning without requiring full physics simulation.
Benchmarks focused on intervention divergence might replace or supplement current visual-prediction metrics in embodied-AI evaluation.
An orchestrator that switches abstractions per query could reduce computational cost by activating only the physics needed for the immediate decision.
The same principle of minimal sufficient abstraction might apply to other domains where passive prediction diverges from active control, such as process control or interactive simulation.

Load-bearing premise

Distinct physical systems can look identical from observations yet produce different results under intervention, so models built only on observation prediction are structurally insufficient.

What would settle it

Controlled benchmarks that fix the visible scene while varying latent physics, where an observation-predictive model recommends an infeasible action or mispredicts an interaction outcome.

Figures

Figures reproduced from arXiv: 2605.30542 by Adam J. Thorpe, Cheng-Hsi Hsiao, Hassan Iqbal, Krishna Kumar, Neel P. Bhatt, Stepan Tretiakov, Su Ann Low, Ufuk Topcu, Xingjian Li.

**Figure 2.** Figure 2: Controlled evaluation scenes used to expose failures of visual world models under latent [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Representative construction of a physically viable world model. A user-specified inter [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Counterfactual evaluation of a ramp-to-cup intervention query under varying release heights, [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Viscosity-dependent pouring under a query-conditioned world model. (a) A fixed probing [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Query-conditioned construction of a physically viable driving model from an observation [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Static-image benchmark for VLM physical prediction. Each scene shows a ball rolling [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Controlled ramp-to-tower rigid-body interactions under latent material variation. Each row [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: Comparison between physical simulation and visual diffusion model in a ball and [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Comparison between physical simulation and visual diffusion model in a ball and two [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison between the ground-truth physical rollout and V-JEPA 2-generated trajectories [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison between the ground-truth physical rollout and the V-JEPA2-generated [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Viscosity-dependent robot-arm pouring. Each row shows a time sequence for a different [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper makes a clear case that observation-predictive world models are structurally limited for embodied tasks but offers only a high-level conceptual framework without new derivations or detailed results.

read the letter

The main point is that world models for embodied AI need to focus on answering intervention queries by preserving the right physical distinctions rather than just predicting future observations. Distinct systems can produce identical visuals yet behave differently under action, so pure predictive models can recommend bad actions or certify unsafe plans.

The paper does well laying out this structural problem and the controlled benchmark idea of holding the visible scene fixed while changing latent physics. The modular breakdown into environment representation, latent state estimation, interventional dynamics, and query response gives a workable way to make models interpretable and auditable. The suggestion that the right abstraction is the simplest one that keeps query-relevant distinctions is a useful design rule.

The soft spots are that the argument stays conceptual. The abstract mentions demonstrations on failing queries but provides no numbers, comparisons, or implementation details to evaluate them. The orchestrator is sketched at a high level without specifics on how it would select or verify components. The definition of physically viable models is tied directly to the intervention requirement, which makes the framing feel somewhat circular.

This is for researchers working on world models and planning in robotics. Readers looking for design principles will get value from the framing even if the execution is light. It deserves peer review so the full paper can be checked for concrete benchmarks and comparisons.

Referee Report

3 major / 1 minor

Summary. The paper claims that world models for embodied AI must be physically viable by representing the physical structure needed to answer intervention queries rather than merely predicting observations. It argues that observation-predictive models are structurally insufficient because distinct physical systems can produce identical observations yet diverge under interventions. The proposed solution is a modular decomposition (environment representation, latent state/parameter estimation, action specification, interventional dynamics, query-level response) assembled by an autonomous orchestrator that selects the simplest sufficient abstraction per query. The manuscript asserts that this yields interpretable, verifiable models and demonstrates the approach on queries where existing systems fail, while outlining a design principle and feasibility test.

Significance. The emphasis on interventional structure over observational prediction aligns with causal reasoning principles and could guide more reliable planning and verification in embodied AI if the modular decomposition is made concrete. The query-conditioned abstraction idea provides a useful design heuristic. However, the manuscript offers only a high-level conceptual argument with no derivations, benchmarks, or implementations, so any significance remains prospective rather than established.

major comments (3)

[Abstract] Abstract: the definition of physically viable world models is constructed directly from the requirement to answer intervention queries ("constructed to answer intervention queries by representing the physical structure governing action outcomes"), so the proposed modular solution largely restates the problem definition rather than deriving an independent criterion or mechanism.
[Abstract] Abstract: the manuscript states that controlled benchmarks are used to expose the structural failure of observation-predictive models and that the approach is demonstrated on queries where existing systems fail, yet no benchmark descriptions, datasets, quantitative results, or specific query examples are provided, leaving the central empirical claim unsupported.
[Abstract] Abstract: the autonomous orchestrator is introduced as the component that identifies the relevant abstraction and composes learned/structured modules, but no architecture, selection algorithm, or adaptation procedure is specified, which is load-bearing for the feasibility of the overall proposal.

minor comments (1)

[Abstract] Abstract: the phrase "when closed-form physics is unavailable, uncertain, or costly" introduces a qualification on transition models without clarifying how the orchestrator would detect or handle these cases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which correctly identify that the manuscript is a high-level conceptual proposal rather than an empirical study. We address each point below, agreeing where the abstract overstates content and clarifying the intent of the definitional and architectural elements. Revisions will focus on precision in claims and scope.

read point-by-point responses

Referee: [Abstract] Abstract: the definition of physically viable world models is constructed directly from the requirement to answer intervention queries ("constructed to answer intervention queries by representing the physical structure governing action outcomes"), so the proposed modular solution largely restates the problem definition rather than deriving an independent criterion or mechanism.

Authors: The definition is deliberately tied to interventional capability because that is the distinguishing requirement for embodied AI. The modular decomposition supplies an independent mechanism by enumerating the minimal components (environment representation, latent state/parameter estimation, action specification, interventional dynamics, query-level response) whose composition must preserve query-relevant physical distinctions. This structure is not automatic in observation-predictive models and therefore constitutes a testable criterion. We will revise the abstract to separate the definitional premise from the modular criterion more explicitly. revision: partial
Referee: [Abstract] Abstract: the manuscript states that controlled benchmarks are used to expose the structural failure of observation-predictive models and that the approach is demonstrated on queries where existing systems fail, yet no benchmark descriptions, datasets, quantitative results, or specific query examples are provided, leaving the central empirical claim unsupported.

Authors: The referee is correct. The manuscript contains no datasets, quantitative results, or formal benchmark descriptions; the referenced "controlled benchmarks" and "demonstrations" are limited to illustrative thought experiments in the text. We will revise the abstract to remove or qualify all language implying empirical evaluation and to characterize the work accurately as a design principle with conceptual illustrations. revision: yes
Referee: [Abstract] Abstract: the autonomous orchestrator is introduced as the component that identifies the relevant abstraction and composes learned/structured modules, but no architecture, selection algorithm, or adaptation procedure is specified, which is load-bearing for the feasibility of the overall proposal.

Authors: We agree that no concrete architecture or algorithm for the orchestrator is provided. The paper advances a design principle and feasibility test rather than an implemented system; the orchestrator is therefore described only at the level of required functionality. We will revise the abstract and relevant sections to state explicitly that detailed orchestrator mechanisms constitute future work and to list the minimal interface properties any such component must satisfy. revision: partial

Circularity Check

1 steps flagged

Physically viable world models defined directly via intervention-query requirement

specific steps

self definitional [Abstract]
"World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. [...] We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response."

The paper first defines physically viable models as those built to answer intervention queries, then claims embodied AI requires models with the precise components needed to fulfill that definition, rendering the architecture a direct restatement of the initial requirement rather than a derived result.

full rationale

The abstract explicitly defines 'physically viable' world models as those 'constructed to answer intervention queries' and then states that embodied AI 'requires' models with exactly the listed modular components for that purpose. This makes the central proposal equivalent to the problem statement by construction rather than an independent derivation. No equations or further technical steps are available in the provided text to inspect for additional reductions, and the logical observation that identical visuals can mask different physics is not itself circular. The overall score reflects partial circularity confined to the definitional framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The argument rests on one key domain assumption about identical appearances with divergent physics and introduces the orchestrator as a new component without independent evidence.

axioms (1)

domain assumption Distinct physical systems can look identical yet diverge under intervention.
Invoked to establish that observation-predictive models are structurally inadequate.

invented entities (1)

autonomous orchestrator no independent evidence
purpose: To identify the relevant abstraction and compose compatible learned and structured components per query.
Proposed as part of the model architecture with no external validation or falsifiable handle provided.

pith-pipeline@v0.9.1-grok · 5847 in / 1242 out tokens · 35424 ms · 2026-06-29T06:52:52.097825+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 48 canonical work pages · 18 internal anchors

[1]

Physically embod- ied Gaussian splatting: A realtime correctable world model for robotics.arXiv preprint arXiv:2406.10788, 2024

Jad Abou-Chakra, Krishan Rana, Feras Dayoub, and Niko Sünderhauf. Physically embod- ied Gaussian splatting: A realtime correctable world model for robotics.arXiv preprint arXiv:2406.10788, 2024

work page arXiv 2024
[2]

Artificial analysis text-to-video leaderboard (open-weights mod- els)

Artificial Analysis. Artificial analysis text-to-video leaderboard (open-weights mod- els). https://artificialanalysis.ai/video/leaderboard/text-to-video? open-weights=true, 2026. Accessed: 2026-04-27

2026
[3]

Artificial analysis text-to-video leaderboard (open-weights mod- els)

Artificial Analysis. Artificial analysis text-to-video leaderboard (open-weights mod- els). https://artificialanalysis.ai/video/leaderboard/image-to-video? open-weights=true, 2026. Accessed: 2026-04-27

2026
[4]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Genesis: A generative and universal physics engine for robotics and beyond, December 2024

Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024

2024
[6]

Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957, 2024

Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957, 2024

work page arXiv 2024
[7]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Interaction networks for learning about objects, relations and physics.Advances in neural information processing systems, 29, 2016

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics.Advances in neural information processing systems, 29, 2016

2016
[9]

Divergence-free smoothed particle hydrodynamics

Jan Bender and Dan Koschier. Divergence-free smoothed particle hydrodynamics. InProceed- ings of the 14th ACM SIGGRAPH/Eurographics symposium on computer animation, pages 147–155, 2015

2015
[10]

Enforcing analytic constraints in neural networks emulating physical systems.Physical review letters, 126(9):098302, 2021

Tom Beucler, Michael Pritchard, Stephan Rasp, Jordan Ott, Pierre Baldi, and Pierre Gentine. Enforcing analytic constraints in neural networks emulating physical systems.Physical review letters, 126(9):098302, 2021

2021
[11]

Pangu-Weather: A 3d high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556, 2022

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-Weather: A 3d high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556, 2022

work page arXiv 2022
[12]

A Compositional Object-Based Approach to Learning Physical Dynamics

Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A composi- tional object-based approach to learning physical dynamics. InInternational Conference on Learning Representations (ICLR), 2017. arXiv:1612.00341

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

PhysGen3D: Crafting a miniature interactive world from a single image

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. PhysGen3D: Crafting a miniature interactive world from a single image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.20746

work page arXiv 2025
[14]

TransDreamer: Reinforcement learning with transformer world models

Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. TransDreamer: Reinforcement learning with transformer world models. InDeep Reinforcement Learning Workshop, NeurIPS 2021,

2021
[15]

Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018. 10

2018
[16]

Physbench: Benchmarking and enhancing vision-language models for physical world understanding.arXiv preprint arXiv:2501.16411, 2025

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world understanding.arXiv preprint arXiv:2501.16411, 2025

work page arXiv 2025
[17]

Lagrangian Neural Networks,

Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. InICLR Workshop on Deep Differential Equations, 2020. arXiv:2003.04630

work page arXiv 2020
[18]

Discovering symbolic models from deep learning with inductive biases.Advances in neural information processing systems, 33:17429–17442, 2020

Miles Cranmer, Alvaro Sanchez Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel, and Shirley Ho. Discovering symbolic models from deep learning with inductive biases.Advances in neural information processing systems, 33:17429–17442, 2020

2020
[19]

Facing off world model backbones: RNNs, transformers, and S4

Fei Deng, Junyeong Park, and Sungjin Ahn. Facing off world model backbones: RNNs, transformers, and S4. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2307.02064

work page arXiv 2023
[20]

Diffusion world model: Fu- ture modeling beyond step-by-step rollout for offline reinforcement learning.arXiv preprint arXiv:2402.03570, 2024

Zihan Ding, Amy Zhang, Yuandong Tian, and Qinqing Zheng. Diffusion world model: Fu- ture modeling beyond step-by-step rollout for offline reinforcement learning.arXiv preprint arXiv:2402.03570, 2024

work page arXiv 2024
[21]

Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax – a differentiable physics engine for large scale rigid body simulation.arXiv preprint arXiv:2106.13281, 2021

work page arXiv 2021
[22]

Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning

Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. InProceedings of the international conference on automated planning and scheduling, volume 30, pages 440–448, 2020

2020
[23]

Greydanus, M

Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2019. arXiv:1906.01563

work page arXiv 2019
[24]

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, et al. " phyworldbench": A comprehensive evaluation of physical realism in text-to-video models.arXiv preprint arXiv:2507.13428, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

T2vphysbench: A first-principles benchmark for physical consistency in text-to-video gen- eration.arXiv preprint arXiv:2505.00337, 2025

Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video gen- eration.arXiv preprint arXiv:2505.00337, 2025

work page arXiv 2025
[26]

MaskViT: Masked visual pre-training for video prediction

Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Fei-Fei Li. MaskViT: Masked visual pre-training for video prediction. InInternational Conference on Learning Representations (ICLR), 2023. arXiv:2206.11894

work page arXiv 2023
[27]

Recurrent world models facilitate policy evolution.Ad- vances in neural information processing systems, 31, 2018

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.Ad- vances in neural information processing systems, 31, 2018

2018
[28]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[31]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

2019
[32]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. InInternational Conference on Learning Representations (ICLR), 2021. arXiv:2010.02193. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021
[33]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

2025
[35]

TD-MPC2: Scalable, Robust World Models for Continuous Control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2310.16828

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

DiffTaichi: Differentiable programming for physical simulation

Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. DiffTaichi: Differentiable programming for physical simulation. InInternational Conference on Learning Representations (ICLR), 2020. arXiv:1910.00935

work page arXiv 2020
[37]

Vid2World: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

work page arXiv 2025
[38]

Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Rynson W. H. Lau. DreamPhysics: Learning physics-based 3d dynamics with video diffusion priors. In AAAI Conference on Artificial Intelligence, 2025. arXiv:2406.01476

work page arXiv 2025
[39]

Planning with Diffusion for Flexible Behavior Synthesis

Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InProceedings of the 39th International Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning Research, pages 9902–9915. PMLR, 2022. arXiv:2205.09991

work page internal anchor Pith review Pith/arXiv arXiv 2022
[40]

gradSim: Differentiable simulation for system identification and visuomotor control

Krishna Murthy Jatavallabhula, Miles Macklin, Florian Golemo, Vikram V oleti, Linda Petrini, Martin Weiss, Breandan Considine, Jerome Parent-Levesque, Kevin Xie, Kenny Erleben, Liam Paull, Florian Shkurti, Derek Nowrouzezahrai, and Sanja Fidler. gradSim: Differentiable simulation for system identification and visuomotor control. InInternational Conference...

work page arXiv 2021
[41]

Integrated task and motion planning in belief space.The International Journal of Robotics Research, 32(9-10):1194–1227, 2013

Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space.The International Journal of Robotics Research, 32(9-10):1194–1227, 2013

2013
[42]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

3D Gaussian Splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023. arXiv:2308.04079

work page arXiv 2023
[44]

Contrastive learning of structured world mod- els

Thomas Kipf, Elise van der Pol, and Max Welling. Contrastive learning of structured world mod- els. InInternational Conference on Learning Representations (ICLR), 2020. arXiv:1911.12247

work page arXiv 2020
[45]

GraphCast: Learning skillful medium-range global weather forecasting.arXiv preprint arXiv:2212.12794, 2022

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. GraphCast: Learning skillful medium-range global weather forecas...

work page arXiv 2022
[46]

Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids

Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B. Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. InInternational Conference on Learning Representations (ICLR), 2019. arXiv:1810.01566

work page internal anchor Pith review Pith/arXiv arXiv 2019
[47]

Ltx-2 model card.https://huggingface.co/Lightricks/LTX-2, 2026

Lightricks. Ltx-2 model card.https://huggingface.co/Lightricks/LTX-2, 2026

2026
[48]

arXiv preprint arXiv:2306.06561 , year=

Yu-Ren Liu, Biwei Huang, Zhengmao Zhu, Honglong Tian, Mingming Gong, Yang Yu, and Kun Zhang. Learning world models with identifiable factorization.arXiv preprint arXiv:2306.06561, 2023. 12

work page arXiv 2023
[49]

PhysFlow: Unleashing the potential of multi-modal foundation models and video diffusion for 4d dynamic physical scene simulation

Zhuoman Liu, Weicai Ye, Yan Luximon, Pengfei Wan, and Di Zhang. PhysFlow: Unleashing the potential of multi-modal foundation models and video diffusion for 4d dynamic physical scene simulation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2411.14423

work page arXiv 2025
[50]

Object-centric learning with slot attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.15055

work page arXiv 2020
[51]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[52]

Transformers are sample-efficient world models

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. InInternational Conference on Learning Representations (ICLR), 2023. arXiv:2209.00588

work page arXiv 2023
[53]

Smoothed particle hydrodynamics.Reports on progress in physics, 68(8):1703– 1759, 2005

Joe J Monaghan. Smoothed particle hydrodynamics.Reports on progress in physics, 68(8):1703– 1759, 2005

2005
[54]

Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

2026
[55]

VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Felix O’Mahony, Roberto Cipolla, and Ayush Tewari. Vdaworld: World modelling via vlm- directed abstraction and simulation.arXiv preprint arXiv:2512.11061, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. FourCastNet: A global data- driven high-resolution weather model using adaptive fourier neural operators.arXiv preprint...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[57]

Battaglia

Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learn- ing mesh-based simulation with graph networks. InInternational Conference on Learning Representations (ICLR), 2021. arXiv:2010.03409

work page arXiv 2021
[58]

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational physics, 378:686–707, 2019

2019
[59]

A VID: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. A VID: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

work page arXiv 2024
[60]

Sanchez-Gonzalez, J

Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter W. Battaglia. Learning to simulate complex physics with graph networks. InInternational Conference on Machine Learning (ICML), 2020. arXiv:2002.09405

work page arXiv 2020
[61]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020. arXiv:1911.08265

work page internal anchor Pith review Pith/arXiv arXiv 2020
[62]

Newton: GPU-accelerated physics simulation for robotics and simulation research, April 2025

The Newton Contributors. Newton: GPU-accelerated physics simulation for robotics and simulation research, April 2025

2025
[63]

PhysGaussian: Physics-integrated 3d gaussians for generative dynamics

Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. PhysGaussian: Physics-integrated 3d gaussians for generative dynamics. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.12198

work page arXiv 2024
[64]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios V ozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025. 13

work page arXiv 2025
[65]

Physion-eval: Evaluating physical realism in generated video via human reasoning.arXiv preprint arXiv:2603.19607, 2026

Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie, Weimin Wang, Yilun Du, James Zou, Jiajun Wu, and Bing Shuai. Physion-eval: Evaluating physical realism in generated video via human reasoning.arXiv preprint arXiv:2603.19607, 2026

work page arXiv 2026
[66]

Physdreamer: Physics-based interaction with 3d objects via video generation

Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

2024
[67]

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, et al. Hierarchical planning with latent world models.arXiv preprint arXiv:2604.03208, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[68]

ChronoDreamer: Action-conditioned world model as an online simulator for robotic planning.arXiv preprint arXiv:2512.18619, 2025

Zhenhao Zhou and Dan Negrut. ChronoDreamer: Action-conditioned world model as an online simulator for robotic planning.arXiv preprint arXiv:2512.18619, 2025. 14 A Platforms for Numerical Experimentation In this section, we describe the computational tools and platforms used in our experimental pipeline. These include foundation models (e.g., vision-langua...

work page arXiv 2025
[69]

In several trials, the model predicts no collision because it states that the target is not in the ball’s path, particularly in the cup and double-jelly scenes

Low-context prompts sometimes produce incorrect physical interpretations of the scene. In several trials, the model predicts no collision because it states that the target is not in the ball’s path, particularly in the cup and double-jelly scenes. Other responses infer incorrect scene geometry or motion entirely, including predicting that objects slide of...
[70]

However, the model still does not consistently or precisely predict the realized outcome

High-context prompts isolate the remaining physical prediction errors.When release and collision are explicitly specified, many low-context ambiguities disappear and the responses usually invoke qualitatively relevant effects including rolling motion, momentum transfer, deformation, frictional dissipation, and water sloshing. However, the model still does...
[71]

However, the responses still frequently describe tipping, toppling, sliding, and spilling as possible outcomes rather than resolving which event occurs

Counterfactual prompts improve directional reasoning but still produce threshold uncer- tainty.The model usually predicts the correct qualitative trend: lower release height reduces impact energy, honey damps sloshing, high friction suppresses sliding, and gelatin dissipates more energy than rigid blocks. However, the responses still frequently describe t...

[1] [1]

Physically embod- ied Gaussian splatting: A realtime correctable world model for robotics.arXiv preprint arXiv:2406.10788, 2024

Jad Abou-Chakra, Krishan Rana, Feras Dayoub, and Niko Sünderhauf. Physically embod- ied Gaussian splatting: A realtime correctable world model for robotics.arXiv preprint arXiv:2406.10788, 2024

work page arXiv 2024

[2] [2]

Artificial analysis text-to-video leaderboard (open-weights mod- els)

Artificial Analysis. Artificial analysis text-to-video leaderboard (open-weights mod- els). https://artificialanalysis.ai/video/leaderboard/text-to-video? open-weights=true, 2026. Accessed: 2026-04-27

2026

[3] [3]

Artificial analysis text-to-video leaderboard (open-weights mod- els)

Artificial Analysis. Artificial analysis text-to-video leaderboard (open-weights mod- els). https://artificialanalysis.ai/video/leaderboard/image-to-video? open-weights=true, 2026. Accessed: 2026-04-27

2026

[4] [4]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Genesis: A generative and universal physics engine for robotics and beyond, December 2024

Genesis Authors. Genesis: A generative and universal physics engine for robotics and beyond, December 2024

2024

[6] [6]

Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957, 2024

Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957, 2024

work page arXiv 2024

[7] [7]

Revisiting Feature Prediction for Learning Visual Representations from Video

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual repre- sentations from video.arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Interaction networks for learning about objects, relations and physics.Advances in neural information processing systems, 29, 2016

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics.Advances in neural information processing systems, 29, 2016

2016

[9] [9]

Divergence-free smoothed particle hydrodynamics

Jan Bender and Dan Koschier. Divergence-free smoothed particle hydrodynamics. InProceed- ings of the 14th ACM SIGGRAPH/Eurographics symposium on computer animation, pages 147–155, 2015

2015

[10] [10]

Enforcing analytic constraints in neural networks emulating physical systems.Physical review letters, 126(9):098302, 2021

Tom Beucler, Michael Pritchard, Stephan Rasp, Jordan Ott, Pierre Baldi, and Pierre Gentine. Enforcing analytic constraints in neural networks emulating physical systems.Physical review letters, 126(9):098302, 2021

2021

[11] [11]

Pangu-Weather: A 3d high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556, 2022

Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Pangu-Weather: A 3d high-resolution model for fast and accurate global weather forecast.arXiv preprint arXiv:2211.02556, 2022

work page arXiv 2022

[12] [12]

A Compositional Object-Based Approach to Learning Physical Dynamics

Michael B. Chang, Tomer Ullman, Antonio Torralba, and Joshua B. Tenenbaum. A composi- tional object-based approach to learning physical dynamics. InInternational Conference on Learning Representations (ICLR), 2017. arXiv:1612.00341

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

PhysGen3D: Crafting a miniature interactive world from a single image

Boyuan Chen, Hanxiao Jiang, Shaowei Liu, Saurabh Gupta, Yunzhu Li, Hao Zhao, and Shenlong Wang. PhysGen3D: Crafting a miniature interactive world from a single image. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.20746

work page arXiv 2025

[14] [14]

TransDreamer: Reinforcement learning with transformer world models

Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. TransDreamer: Reinforcement learning with transformer world models. InDeep Reinforcement Learning Workshop, NeurIPS 2021,

2021

[15] [15]

Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018. 10

2018

[16] [16]

Physbench: Benchmarking and enhancing vision-language models for physical world understanding.arXiv preprint arXiv:2501.16411, 2025

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmarking and enhancing vision-language models for physical world understanding.arXiv preprint arXiv:2501.16411, 2025

work page arXiv 2025

[17] [17]

Lagrangian Neural Networks,

Miles Cranmer, Sam Greydanus, Stephan Hoyer, Peter Battaglia, David Spergel, and Shirley Ho. Lagrangian neural networks. InICLR Workshop on Deep Differential Equations, 2020. arXiv:2003.04630

work page arXiv 2020

[18] [18]

Discovering symbolic models from deep learning with inductive biases.Advances in neural information processing systems, 33:17429–17442, 2020

Miles Cranmer, Alvaro Sanchez Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel, and Shirley Ho. Discovering symbolic models from deep learning with inductive biases.Advances in neural information processing systems, 33:17429–17442, 2020

2020

[19] [19]

Facing off world model backbones: RNNs, transformers, and S4

Fei Deng, Junyeong Park, and Sungjin Ahn. Facing off world model backbones: RNNs, transformers, and S4. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. arXiv:2307.02064

work page arXiv 2023

[20] [20]

Diffusion world model: Fu- ture modeling beyond step-by-step rollout for offline reinforcement learning.arXiv preprint arXiv:2402.03570, 2024

Zihan Ding, Amy Zhang, Yuandong Tian, and Qinqing Zheng. Diffusion world model: Fu- ture modeling beyond step-by-step rollout for offline reinforcement learning.arXiv preprint arXiv:2402.03570, 2024

work page arXiv 2024

[21] [21]

Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem

C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax – a differentiable physics engine for large scale rigid body simulation.arXiv preprint arXiv:2106.13281, 2021

work page arXiv 2021

[22] [22]

Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning

Caelan Reed Garrett, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. Pddlstream: Integrating symbolic planners and blackbox samplers via optimistic adaptive planning. InProceedings of the international conference on automated planning and scheduling, volume 30, pages 440–448, 2020

2020

[23] [23]

Greydanus, M

Sam Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2019. arXiv:1906.01563

work page arXiv 2019

[24] [24]

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, et al. " phyworldbench": A comprehensive evaluation of physical realism in text-to-video models.arXiv preprint arXiv:2507.13428, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[25] [25]

T2vphysbench: A first-principles benchmark for physical consistency in text-to-video gen- eration.arXiv preprint arXiv:2505.00337, 2025

Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, and Jiale Zhao. T2vphysbench: A first-principles benchmark for physical consistency in text-to-video gen- eration.arXiv preprint arXiv:2505.00337, 2025

work page arXiv 2025

[26] [26]

MaskViT: Masked visual pre-training for video prediction

Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, and Fei-Fei Li. MaskViT: Masked visual pre-training for video prediction. InInternational Conference on Learning Representations (ICLR), 2023. arXiv:2206.11894

work page arXiv 2023

[27] [27]

Recurrent world models facilitate policy evolution.Ad- vances in neural information processing systems, 31, 2018

David Ha and Jürgen Schmidhuber. Recurrent world models facilitate policy evolution.Ad- vances in neural information processing systems, 31, 2018

2018

[28] [28]

World Models

David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[29] [29]

LTX-2: Efficient Joint Audio-Visual Foundation Model

Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model.arXiv preprint arXiv:2601.03233, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[30] [30]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1912

[31] [31]

Learning latent dynamics for planning from pixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learning latent dynamics for planning from pixels. InInternational conference on machine learning, pages 2555–2565. PMLR, 2019

2019

[32] [32]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. InInternational Conference on Learning Representations (ICLR), 2021. arXiv:2010.02193. 11

work page internal anchor Pith review Pith/arXiv arXiv 2021

[33] [33]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse control tasks through world models.Nature, 640(8059):647–653, 2025

2025

[35] [35]

TD-MPC2: Scalable, Robust World Models for Continuous Control

Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. InInternational Conference on Learning Representations (ICLR), 2024. arXiv:2310.16828

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

DiffTaichi: Differentiable programming for physical simulation

Yuanming Hu, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. DiffTaichi: Differentiable programming for physical simulation. InInternational Conference on Learning Representations (ICLR), 2020. arXiv:1910.00935

work page arXiv 2020

[37] [37]

Vid2World: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357, 2025

work page arXiv 2025

[38] [38]

Tianyu Huang, Haoze Zhang, Yihan Zeng, Zhilu Zhang, Hui Li, Wangmeng Zuo, and Rynson W. H. Lau. DreamPhysics: Learning physics-based 3d dynamics with video diffusion priors. In AAAI Conference on Artificial Intelligence, 2025. arXiv:2406.01476

work page arXiv 2025

[39] [39]

Planning with Diffusion for Flexible Behavior Synthesis

Michael Janner, Yilun Du, Joshua B. Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. InProceedings of the 39th International Conference on Machine Learning (ICML), volume 162 ofProceedings of Machine Learning Research, pages 9902–9915. PMLR, 2022. arXiv:2205.09991

work page internal anchor Pith review Pith/arXiv arXiv 2022

[40] [40]

gradSim: Differentiable simulation for system identification and visuomotor control

Krishna Murthy Jatavallabhula, Miles Macklin, Florian Golemo, Vikram V oleti, Linda Petrini, Martin Weiss, Breandan Considine, Jerome Parent-Levesque, Kevin Xie, Kenny Erleben, Liam Paull, Florian Shkurti, Derek Nowrouzezahrai, and Sanja Fidler. gradSim: Differentiable simulation for system identification and visuomotor control. InInternational Conference...

work page arXiv 2021

[41] [41]

Integrated task and motion planning in belief space.The International Journal of Robotics Research, 32(9-10):1194–1227, 2013

Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in belief space.The International Journal of Robotics Research, 32(9-10):1194–1227, 2013

2013

[42] [42]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[43] [43]

3D Gaussian Splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for real-time radiance field rendering.ACM Transactions on Graphics, 42(4), 2023. arXiv:2308.04079

work page arXiv 2023

[44] [44]

Contrastive learning of structured world mod- els

Thomas Kipf, Elise van der Pol, and Max Welling. Contrastive learning of structured world mod- els. InInternational Conference on Learning Representations (ICLR), 2020. arXiv:1911.12247

work page arXiv 2020

[45] [45]

GraphCast: Learning skillful medium-range global weather forecasting.arXiv preprint arXiv:2212.12794, 2022

Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. GraphCast: Learning skillful medium-range global weather forecas...

work page arXiv 2022

[46] [46]

Learning Particle Dynamics for Manipulating Rigid Bodies, Deformable Objects, and Fluids

Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B. Tenenbaum, and Antonio Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. InInternational Conference on Learning Representations (ICLR), 2019. arXiv:1810.01566

work page internal anchor Pith review Pith/arXiv arXiv 2019

[47] [47]

Ltx-2 model card.https://huggingface.co/Lightricks/LTX-2, 2026

Lightricks. Ltx-2 model card.https://huggingface.co/Lightricks/LTX-2, 2026

2026

[48] [48]

arXiv preprint arXiv:2306.06561 , year=

Yu-Ren Liu, Biwei Huang, Zhengmao Zhu, Honglong Tian, Mingming Gong, Yang Yu, and Kun Zhang. Learning world models with identifiable factorization.arXiv preprint arXiv:2306.06561, 2023. 12

work page arXiv 2023

[49] [49]

PhysFlow: Unleashing the potential of multi-modal foundation models and video diffusion for 4d dynamic physical scene simulation

Zhuoman Liu, Weicai Ye, Yan Luximon, Pengfei Wan, and Di Zhang. PhysFlow: Unleashing the potential of multi-modal foundation models and video diffusion for 4d dynamic physical scene simulation. InIEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2411.14423

work page arXiv 2025

[50] [50]

Object-centric learning with slot attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2020. arXiv:2006.15055

work page arXiv 2020

[51] [51]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense- based benchmark for video generation.arXiv preprint arXiv:2410.05363, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[52] [52]

Transformers are sample-efficient world models

Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models. InInternational Conference on Learning Representations (ICLR), 2023. arXiv:2209.00588

work page arXiv 2023

[53] [53]

Smoothed particle hydrodynamics.Reports on progress in physics, 68(8):1703– 1759, 2005

Joe J Monaghan. Smoothed particle hydrodynamics.Reports on progress in physics, 68(8):1703– 1759, 2005

2005

[54] [54]

Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do gener- ative video models understand physical principles? InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

2026

[55] [55]

VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Felix O’Mahony, Roberto Cipolla, and Ayush Tewari. Vdaworld: World modelling via vlm- directed abstraction and simulation.arXiv preprint arXiv:2512.11061, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[56] [56]

FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators

Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. FourCastNet: A global data- driven high-resolution weather model using adaptive fourier neural operators.arXiv preprint...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[57] [57]

Battaglia

Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learn- ing mesh-based simulation with graph networks. InInternational Conference on Learning Representations (ICLR), 2021. arXiv:2010.03409

work page arXiv 2021

[58] [58]

Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.Journal of Computational physics, 378:686–707, 2019

2019

[59] [59]

A VID: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

Marc Rigter, Tarun Gupta, Agrin Hilmkil, and Chao Ma. A VID: Adapting video diffusion models to world models.arXiv preprint arXiv:2410.12822, 2024

work page arXiv 2024

[60] [60]

Sanchez-Gonzalez, J

Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter W. Battaglia. Learning to simulate complex physics with graph networks. InInternational Conference on Machine Learning (ICML), 2020. arXiv:2002.09405

work page arXiv 2020

[61] [61]

Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model

Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering Atari, Go, chess and shogi by planning with a learned model.Nature, 588(7839):604–609, 2020. arXiv:1911.08265

work page internal anchor Pith review Pith/arXiv arXiv 2020

[62] [62]

Newton: GPU-accelerated physics simulation for robotics and simulation research, April 2025

The Newton Contributors. Newton: GPU-accelerated physics simulation for robotics and simulation research, April 2025

2025

[63] [63]

PhysGaussian: Physics-integrated 3d gaussians for generative dynamics

Tianyi Xie, Zeshun Zong, Yuxing Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. PhysGaussian: Physics-integrated 3d gaussians for generative dynamics. InIEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2024. arXiv:2311.12198

work page arXiv 2024

[64] [64]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios V ozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv preprint arXiv:2504.02918, 2025. 13

work page arXiv 2025

[65] [65]

Physion-eval: Evaluating physical realism in generated video via human reasoning.arXiv preprint arXiv:2603.19607, 2026

Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie, Weimin Wang, Yilun Du, James Zou, Jiajun Wu, and Bing Shuai. Physion-eval: Evaluating physical realism in generated video via human reasoning.arXiv preprint arXiv:2603.19607, 2026

work page arXiv 2026

[66] [66]

Physdreamer: Physics-based interaction with 3d objects via video generation

Tianyuan Zhang, Hong-Xing Yu, Rundi Wu, Brandon Y Feng, Changxi Zheng, Noah Snavely, Jiajun Wu, and William T Freeman. Physdreamer: Physics-based interaction with 3d objects via video generation. InEuropean Conference on Computer Vision, pages 388–406. Springer, 2024

2024

[67] [67]

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, et al. Hierarchical planning with latent world models.arXiv preprint arXiv:2604.03208, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[68] [68]

ChronoDreamer: Action-conditioned world model as an online simulator for robotic planning.arXiv preprint arXiv:2512.18619, 2025

Zhenhao Zhou and Dan Negrut. ChronoDreamer: Action-conditioned world model as an online simulator for robotic planning.arXiv preprint arXiv:2512.18619, 2025. 14 A Platforms for Numerical Experimentation In this section, we describe the computational tools and platforms used in our experimental pipeline. These include foundation models (e.g., vision-langua...

work page arXiv 2025

[69] [69]

In several trials, the model predicts no collision because it states that the target is not in the ball’s path, particularly in the cup and double-jelly scenes

Low-context prompts sometimes produce incorrect physical interpretations of the scene. In several trials, the model predicts no collision because it states that the target is not in the ball’s path, particularly in the cup and double-jelly scenes. Other responses infer incorrect scene geometry or motion entirely, including predicting that objects slide of...

[70] [70]

However, the model still does not consistently or precisely predict the realized outcome

High-context prompts isolate the remaining physical prediction errors.When release and collision are explicitly specified, many low-context ambiguities disappear and the responses usually invoke qualitatively relevant effects including rolling motion, momentum transfer, deformation, frictional dissipation, and water sloshing. However, the model still does...

[71] [71]

However, the responses still frequently describe tipping, toppling, sliding, and spilling as possible outcomes rather than resolving which event occurs

Counterfactual prompts improve directional reasoning but still produce threshold uncer- tainty.The model usually predicts the correct qualitative trend: lower release height reduces impact energy, honey damps sloshing, high friction suppresses sliding, and gelatin dissipates more energy than rigid blocks. However, the responses still frequently describe t...