ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

Haotian Xue; Lama Moukheiber; Liqian Ma; Yipu Chen; Yongxin Chen; Yuchen Zhu; Zelin Zhao

arxiv: 2605.08567 · v2 · pith:JSS5QPY6new · submitted 2026-05-09 · 💻 cs.CV

Pith reviewed 2026-05-20 23:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: action-conditioned world models · physical dynamics · out-of-distribution generalization · video prediction · benchmark · deformable objects · rigid body dynamics · simulation


The pith

A new benchmark reveals that action-conditioned video world models generalize well only on simple rigid interactions and falter on deformable or high-dimensional physics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ACWM-Phys, a benchmark built in a controllable simulator to test action-conditioned world models on rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. It defines in-distribution and out-of-distribution evaluation protocols that shift interaction patterns or scene setups while keeping the action space fixed. Experiments on the ACWM-DiT model show stronger out-of-distribution performance on visually simple, low-dimensional tasks with clear geometry and larger drops on deformable contacts, high-dimensional control, and complex articulated motion. These results indicate that current models still depend primarily on visual appearance patterns rather than internalized physical rules. Ablations further show that cross-attention aids high-dimensional action conditioning, causal VAEs beat frame-wise encoders, and richer action spaces can aid generalization despite increased difficulty.

Core claim

Through systematic experiments on ACWM-DiT, out-of-distribution generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion, suggesting that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics.

What carries the argument

The ACWM-Phys benchmark, which supplies controlled training and evaluation data across multiple physical regimes together with in-distribution and out-of-distribution protocols inside a fully controllable simulator.
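
To make the in-distribution versus out-of-distribution comparison concrete, here is a minimal sketch of how a per-task generalization gap could be computed and ranked. The function names (`ood_gap`, `rank_by_gap`), the relative-drop definition, and the task labels are assumptions made for illustration; none of them come from the paper, and the scores are placeholders rather than reported results.

```python
def ood_gap(ind_score: float, ood_score: float) -> float:
    """Relative drop from InD to OoD for a higher-is-better metric (e.g. SSIM)."""
    if ind_score <= 0:
        raise ValueError("ind_score must be positive")
    return (ind_score - ood_score) / ind_score


def rank_by_gap(scores: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """Sort tasks from the largest to the smallest relative OoD drop."""
    gaps = {task: ood_gap(ind, ood) for task, (ind, ood) in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Placeholder (InD, OoD) scores for hypothetical tasks, NOT numbers from the paper.
placeholder_scores = {"task_a": (0.90, 0.85), "task_b": (0.80, 0.40)}
ranked = rank_by_gap(placeholder_scores)
```

A relative drop is only one possible choice; an absolute difference would rank tasks differently when their in-distribution scores differ a lot.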

If this is right

World models that rely on visual appearance patterns will keep showing uneven generalization across physical regimes.
Cross-attention layers improve conditioning when action spaces become high-dimensional.
Causal VAEs provide better temporal consistency than frame-wise encoders for video prediction under action control.
Larger action spaces increase modeling difficulty yet supply richer signals that can improve out-of-distribution robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended with real-robot recordings to test whether the observed generalization gaps persist outside simulation.
Architectures that explicitly encode physical constraints might close the performance gap on complex interactions without needing vastly more data.
Similar complexity-dependent generalization patterns may appear when the same evaluation protocols are applied to language-conditioned or multi-agent world models.

Load-bearing premise

The simulation environment faithfully captures essential real-world physical interactions without simulator-specific artifacts, and results on the single tested model generalize to the broader class of action-conditioned world models.

What would settle it

A follow-up experiment in which the same model achieves comparable out-of-distribution accuracy on deformable and articulated tasks even after visual textures and lighting are randomized would indicate that performance does not hinge on appearance cues.

Figures

Figures reproduced from arXiv: 2605.08567 by Haotian Xue, Lama Moukheiber, Liqian Ma, Yipu Chen, Yongxin Chen, Yuchen Zhu, Zelin Zhao.

**Figure 1.** ACWM-Phys provides diverse physical scenes to help answer two questions: how well can ACWMs learn different types of physics, and can they generalize beyond the training distribution? We evaluate both in-distribution prediction and out-of-distribution generalization, such as more/fewer water particles or cubes.

Despite this progress, existing ACWMs and their accompanying benchmarks suffer from a critical …

**Figure 2.** ACWM-Phys dataset overview. Four representative frames per environment across the eight tasks, grouped by physical interaction category. Each row shares a category color (left border and label): rigid-body, deformable, particle, and kinematics. Dataset statistics and action-space definitions are summarized in Appendix A.4.

… be the conditioning context. Under flow matching, we sample z0 ∼ N(0, I), set z1 = …

**Figure 3.** ACWM-DiT architecture. Noisy latent tokens z_{1:T_l} (conditioning frames at σ=0, predicted frames at diffusion step σ) are processed by N stacked DiT blocks with alternating spatial and temporal self-attention, modulated via AdaLN from a joint conditioning signal formed by summing the timestep embedding and the temporally compressed action embedding.
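
The conditioning path named in the Figure 3 caption (timestep embedding plus action embedding, applied through AdaLN) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the helper names, the single-token view, the plain linear projections, and the `(1 + scale)` convention (borrowed from the original DiT formulation) are all assumptions, and the gating terms that DiT blocks usually include are left out.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, bias, x):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adaln_modulate(token, timestep_emb, action_emb,
                   w_scale, b_scale, w_shift, b_shift):
    """Modulate a normalized token with scale/shift predicted from the condition."""
    # Joint conditioning signal: element-wise sum of the two embeddings.
    cond = [t + a for t, a in zip(timestep_emb, action_emb)]
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(token)
    return [n * (1.0 + s) + sh for n, s, sh in zip(normed, scale, shift)]
```

With all projection weights and biases at zero, the modulation reduces to a plain layer norm, which is the usual reason for the `(1 + scale)` form: a zero-initialized block starts out as an identity-like transform.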

**Figure 4.** Case study: Pour Water. GT (top) and predicted (bottom) frames at four evenly spaced timesteps. Two InD episodes (top block) and two OoD episodes (bottom block) with less water (left) and more water (right).

The robot arm closely follows the ground-truth trajectory, indicating accurate prediction of articulated motion. Pour Water is also predicted well overall, although in the OoD setting the model sometim…

**Figure 5.** Case study: Push Cube. GT (top) and predicted (bottom) frames at four evenly spaced timesteps. Two InD episodes (top block) and two OoD episodes (bottom block) show diverse cube configurations, with one cube (left) and four cubes (right). The model accurately tracks cube positions and push trajectories across both distributions.

In contrast, contact-rich deformation, particle dynamics, and high-DoF contro…

**Figure 6.** Auto-regressive generation. The model generates frames 1→37 (blue) conditioned on the first frame, then generates frames 37→T (red) conditioned on the last predicted frame of the first window. GT (top) and predicted (bottom) frames at four evenly spaced timesteps per window.
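
The windowed rollout described in the Figure 6 caption can be sketched as a simple loop. This is an illustration under stated assumptions, not the paper's code: `rollout`, `predict_window`, and `toy_predictor` are hypothetical names, the sketch assumes one action per frame transition, and the toy predictor treats a "frame" as a single number so the example stays runnable. Only the window length of 37 frames is taken from the caption.

```python
def rollout(first_frame, actions, predict_window, window=37):
    """Generate a long sequence by chaining fixed-size prediction windows.

    actions[i] is assumed to drive the transition from frame i to frame i + 1.
    Each window holds `window` frames: one conditioning frame plus up to
    `window - 1` newly predicted frames.
    """
    frames = [first_frame]
    while len(frames) - 1 < len(actions):
        start = len(frames) - 1
        chunk = actions[start:start + window - 1]
        # Condition on the last available frame (ground truth for the first
        # window, the model's own prediction for every later window).
        new_frames = predict_window(frames[-1], chunk)
        if len(new_frames) != len(chunk):
            raise ValueError("predict_window must return one frame per action")
        frames.extend(new_frames)
    return frames


def toy_predictor(cond_frame, actions):
    """Stand-in for a world model: the 'frame' is a scalar state."""
    predicted, state = [], cond_frame
    for action in actions:
        state += action
        predicted.append(state)
    return predicted


frames = rollout(0, [1] * 50, toy_predictor)
```

Because every window after the first is conditioned on a predicted frame, errors in that frame carry into the rest of the rollout, which is why long-horizon quality is usually inspected per window.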

**Figure 7.** SSIM vs. diffusion steps for ACWM-DiT-S (100k training steps). Blue circles: InD test; red squares: OoD test. Higher SSIM is better (↑).

Stack Cube. InD stacking trajectories (e.g., pick-up, transport, and placement) are accurately predicted. Under OoD target placement shifts, the model predicts a plausible but positionally incorrect stack, indicating limited spatial extrapolation beyond training placement co…

**Figure 8.** PSNR vs. diffusion steps for ACWM-DiT-S (100k training steps). Blue circles: InD test; red squares: OoD test. Higher PSNR is better (↑).

Robot Arm. The overlay row (blue-tinted GT ghost over prediction) reveals systematic end-effector position errors under OoD workspace expansion. InD predictions closely match GT joint-angle trajectories; OoD predictions reproduce plausible arm motion but with a consisten…
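
For readers unfamiliar with the PSNR metric plotted in Figure 8, the standard definition is 10·log10(MAX² / MSE). The sketch below applies it to flat lists of pixel values; the function name, the flat-list input, and the default `max_value=1.0` (pixels scaled to [0, 1]) are assumptions for illustration, and the paper's exact evaluation code is not shown in the source.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on [0, 1] pixels gives an MSE of 0.01 and therefore a PSNR of about 20 dB; halving the error adds roughly 6 dB.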

**Figure 9.** Dataset visualizations for all eight ACWM-Phys environments. Left: rigid-body and deformable tasks. Right: particle and kinematics tasks. For each environment, InD (top) and OoD (bottom) ground-truth frames are shown at eight evenly spaced timesteps from a representative episode.

**Figure 10.** Push Rope case study. InD (left) and OoD with longer rope (right).

**Figure 11.** Cloth Move case study. InD (left) and OoD cloth-size shift (right).

**Figure 12.** Push Sand case study. InD (left) and OoD doubled-particle-count (right).

**Figure 13.** Stack Cube case study. InD (left) and OoD placement-shift (right).

**Figure 14.** Robot Arm case study. InD (left) and OoD workspace-expansion (right). Overlay row: GT (blue tint, 45% opacity) over prediction highlights positional error.

**Figure 15.** Reacher case study. InD (left) and OoD corner-sector goals (right). Overlay nearly coincides with Pred, confirming strong geometric generalization.

Original abstract

Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces ACWM-Phys, a benchmark for action-conditioned world models (ACWMs) that evaluates video prediction under diverse physical dynamics (rigid-body, kinematics, deformable-object, and particle) in a controllable simulator. It defines in-distribution and out-of-distribution protocols with controlled shifts and reports experiments on ACWM-DiT showing that OoD generalization is stronger for visually simple, low-dimensional interactions with clear geometry but weaker for deformable contacts, high-dimensional control, and complex articulated motion. The authors interpret the larger drops as evidence that the model relies on visual appearance patterns rather than fully learning underlying physics. Ablations examine cross-attention for action conditioning, causal VAEs versus frame-wise encoders, and the effects of action-space size.

Significance. If the central findings hold, the benchmark supplies a reproducible, controllable testbed that expands coverage beyond egocentric navigation or narrow robotics tasks, while the ablations offer practical guidance on architectural choices for high-dimensional action conditioning and temporal modeling. The work highlights a plausible gap between current ACWMs and robust physical understanding, which could steer subsequent model development.

major comments (1)

[Abstract and §5] Abstract and §5 (results on OoD protocols): the interpretation that larger generalization drops on deformable contacts, high-dimensional control, and complex articulated motion demonstrate reliance on visual appearance patterns rather than physics is not isolated. These regimes are also dynamically more complex; without experiments that hold the underlying physical rules fixed while varying only visual cues (or vice versa), or direct probes of invariants such as momentum conservation or contact-force prediction, the specific causal claim remains under-supported by the reported patterns.

minor comments (2)

The abstract and experimental description report systematic ablations but omit statistical tests, exact training-set sizes, and error bars; adding these would make the quantitative claims more robust.
[Ablations] Clarify how 'effective task complexity' is defined and measured independently of the physical regime, and whether any quantitative metric (e.g., degrees of freedom or contact frequency) is used to support the qualitative distinction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed comments. We address the major comment below and outline the revisions we will make to clarify our claims.


Referee: [Abstract and §5] Abstract and §5 (results on OoD protocols): the interpretation that larger generalization drops on deformable contacts, high-dimensional control, and complex articulated motion demonstrate reliance on visual appearance patterns rather than physics is not isolated. These regimes are also dynamically more complex; without experiments that hold the underlying physical rules fixed while varying only visual cues (or vice versa), or direct probes of invariants such as momentum conservation or contact-force prediction, the specific causal claim remains under-supported by the reported patterns.

Authors: We agree that the observed OoD drops occur in regimes that are also dynamically more complex, and that our interpretation does not isolate reliance on visual patterns from this confounding factor. The manuscript presents the differential generalization as suggestive evidence rather than a definitive causal demonstration; however, we acknowledge that the current wording in the abstract and §5 can be read as stronger than the supporting experiments warrant. In the revised manuscript we will (i) qualify the relevant sentences to state that the patterns are consistent with reliance on visual appearance while noting the role of increased dynamic complexity, and (ii) add a short discussion paragraph outlining the value of future controlled experiments (e.g., fixed physics with varied visual cues, or direct prediction of invariants such as momentum or contact forces) that would more cleanly separate the two explanations.

Revision: partial
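
One of the invariant probes raised in this exchange, momentum conservation, could be checked along the following lines. This is a hedged sketch and not something the paper reports: it assumes 2D object positions and masses are available for each predicted frame (for example from the simulator state or an object tracker), it uses finite differences for velocity, and the function names are hypothetical. It is only meaningful for segments where no external force acts; in tasks with a pusher or robot arm, the momentum injected by the actuator would have to be accounted for separately.

```python
import math


def total_momentum(masses, prev_positions, next_positions, dt):
    """Total 2D momentum between two frames, using finite-difference velocities."""
    px = py = 0.0
    for m, (x0, y0), (x1, y1) in zip(masses, prev_positions, next_positions):
        px += m * (x1 - x0) / dt
        py += m * (y1 - y0) / dt
    return px, py


def momentum_drift(masses, trajectory, dt=1.0):
    """Largest deviation of total momentum from its value at the first step.

    trajectory[t][i] is the (x, y) position of object i at frame t.
    """
    if len(trajectory) < 2:
        raise ValueError("need at least two frames")
    momenta = [
        total_momentum(masses, trajectory[t], trajectory[t + 1], dt)
        for t in range(len(trajectory) - 1)
    ]
    ref_x, ref_y = momenta[0]
    return max(math.hypot(px - ref_x, py - ref_y) for px, py in momenta)
```

Comparing the drift of predicted trajectories against the drift of ground-truth trajectories from the same simulator would separate model error from discretization error in the simulator itself.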

Circularity Check

0 steps flagged

No circularity: empirical benchmark paper with no derivations or self-referential reductions


The paper introduces the ACWM-Phys benchmark and reports controlled experiments on ACWM-DiT, drawing conclusions from observed OoD generalization patterns across physical regimes. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claims rest on empirical results from a new simulation environment rather than reducing to self-defined quantities or prior author work by construction. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper contributes an empirical benchmark rather than a mathematical derivation, so the ledger is light: the main unverified premise is simulator fidelity to real physics.

axioms (1)

Domain assumption: The chosen simulation accurately represents the target physical dynamics without introducing confounding artifacts. Invoked when claiming that observed generalization gaps reflect model limitations rather than simulator mismatch.

pith-pipeline@v0.9.0 · 5829 in / 1211 out tokens · 48778 ms · 2026-05-20T23:33:26.859585+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Walk through paintings: Ego-centric world models from internet priors.arXiv preprint arXiv:2601.15284, 2026

A. Bagchi, Z. Bao, H. Bharadhwaj, Y.-X. Wang, P. Tokmakov, and M. Hebert. Walk through paintings: Egocentric world models from internet priors. arXiv preprint arXiv:2601.15284 , 2026

work page arXiv 2026

[2] [2]

Y. Chen, P. Li, J. Yang, K. He, X. Wu, Y. Xu, K. Wang, J. Liu, N. Liu, Y. Huang, et al. Bridgev2w: Bridging video generation models to embodied world models via embodiment masks. arXiv preprint arXiv:2602.03793, 2026

work page arXiv 2026

[3] [3]

Y. Guo, L. X. Shi, J. Chen, and C. Finn. Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

World Models

D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122 , 2(3):440, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[5] [5]

Training Agents Inside of Scalable World Models

D. Hafner, W. Yan, and T. Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

J. Ho, A. Jain, and P. Abbeel. Denoising diﬀusion probabilistic models. In Advances in Neural Infor- mation Processing Systems , volume 33, 2020. 10

work page 2020

[7] [7]

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diﬀusion models. Advances in neural information processing systems , 35:8633–8646, 2022

work page 2022

[8] [8]

Y. Hong, Y. Mei, C. Ge, Y. Xu, Y. Zhou, S. Bi, Y. Hold-Geoﬀroy, M. Roberts, M. Fisher, E. Shechtman, K. Sunkavalli, F. Liu, Z. Li, and H. Tan. Relic: Interactive video world models with long-horizon memory, 2025

work page 2025

[9] [9]

Hore and D

A. Hore and D. Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010

work page 2010

[10] [10]

Huang, J

S. Huang, J. Wu, Q. Zhou, S. Miao, and M. Long. Vid2world: Crafting video diﬀusion models to interactive world models. arXiv preprint arXiv:2505.14357 , 2025

work page arXiv 2025

[11] [11]

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diﬀusion. arXiv preprint arXiv:2506.08009 , 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Wovr: World models as reliable simulators for post-training vla policies with rl,

Z. Jiang, S. Zhou, Y. Jiang, Z. Huang, M. Wei, Y. Chen, T. Zhou, Z. Guo, H. Lin, Q. Zhang, et al. Wovr: World models as reliable simulators for post-training vla policies with rl. arXiv preprint arXiv:2602.13977, 2026

work page arXiv 2026

[13] [13]

B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, and J. Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385 , 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Karras, M

T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diﬀusion-based generative models. In Advances in Neural Information Processing Systems , 2022

[15]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

[16]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024

[17]

M.-Q. Le, Y. Zhu, V. Kalogeiton, and D. Samaras. What about gravity in video generation? post-training newton’s laws with verifiable rewards. arXiv preprint arXiv:2512.00425, 2025

[18]

C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, and S. Xie. Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop. arXiv preprint arXiv:2503.09595, 2025

[19]

Y. Li, J. Wu, R. Tedrake, J. B. Tenenbaum, and A. Torralba. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566, 2018

[20]

Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[21]

X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[22]

S. Motamed, L. Culp, K. Swersky, P. Jaini, and R. Geirhos. Do generative video models understand physical principles? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 948–958, 2026

[23]

J. Parker-Holder and S. Fruchter. Genie 3: A new frontier for world models. URL https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/. Blog post, 2025

[24]

W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

[25]

G. Savva, O. Michel, D. Lu, S. Waiwitlikhit, T. Meehan, D. Mishra, S. Poddar, J. Lu, and S. Xie. Solaris: Building a multiplayer video world model in minecraft. arXiv preprint arXiv:2602.22208, 2026.

[26]

D. Shah, B. Eysenbach, N. Rhinehart, and S. Levine. Rapid exploration for open-world navigation with latent goal models. In 5th Annual Conference on Robot Learning, 2021

[27]

W. Sun, H. Zhang, H. Wang, J. Wu, Z. Wang, Z. Wang, Y. Wang, J. Zhang, T. Wang, and C. Guo. Worldplay: Towards long-term geometric consistency for real-time interactive world modeling. arXiv preprint arXiv:2512.14614, 2025

[28]

E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

[29]

T. Wan et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

[30]

Wan-Video Team. Wan2.1: Open video foundation models. GitHub repository, 2025. Technical report and weights; project page details evolving

[31]

J. Wang, A. Ma, K. Cao, J. Zheng, J. Feng, Z. Zhang, W. Pang, and X. Liang. Wisa: World simulator assistant for physics-aware text-to-video generation. In Advances in Neural Information Processing Systems, 2025

[32]

Z. Wang, P. Hu, J. Wang, T. J. Zhang, Y. Cheng, L. Chen, Y. Yan, Z. Jiang, H. Li, and X. Liang. Prophy: Progressive physical alignment for dynamic world simulation. arXiv preprint arXiv:2512.05564, 2025

[33]

Z. Wang, X. Wei, B. Li, Z. Guo, J. Zhang, H. Wei, K. Wang, and L. Zhang. Videoverse: How far is your t2v generator from a world model? arXiv preprint arXiv:2510.08398, 2025

[34]

Z. Yang et al. CogVideoX: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[35]

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan, C. Zhu, J. Xiang, et al. World action models are zero-shot policies. arXiv preprint arXiv:2602.15922, 2026

[36]

Y. Yuan, X. Wang, T. Wickremasinghe, Z. Nadir, B. Ma, and S. H. Chan. Newtongen: Physics-consistent and controllable text-to-video generation via neural newtonian dynamics. In International Conference on Learning Representations, 2026

[37]

C. Zhang, D. Cherniavskii, A. Tragoudaras, A. Vozikis, T. Nijdam, D. W. Prinzhorn, M. Bodracska, N. Sebe, A. Zadaianchuk, and E. Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments. arXiv preprint arXiv:2504.02918, 2025

[38]

K. Zhang, C. Xiao, Y. Mei, J. Xu, and V. M. Patel. Think before you diffuse: Llms-guided physics-aware video generation, 2025

[39]

S. Zhou, H. Wang, H. Cheng, J. Li, D. Wang, J. Jiang, Y. Jin, J. Huang, S. Mao, S. Liu, Y. Yang, H. Song, S. Wei, Z. Zhang, P. Huang, S. Liu, Z. Hao, H. Li, Y. Li, W. Zhou, Z. Zhao, Z. He, H. Wen, S. Huang, P. Yun, B. Cheng, P. K. Fu, W. K. Lai, J. Chen, K. Wang, Z. Sun, Z. Li, H. Hu, D. Zhang, C. H. Yuen, B. Wang, Z. Wang, C. Zou, and B. Yang. Physinone:...
