ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
Pith reviewed 2026-05-20 23:33 UTC · model grok-4.3
The pith
A new benchmark reveals that action-conditioned video world models generalize well only on simple rigid interactions and falter on deformable or high-dimensional physics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic experiments on ACWM-DiT indicate that out-of-distribution generalization depends not only on the physical regime but also on effective task complexity. The model generalizes well on visually simple, low-dimensional interactions with clear geometric structure, but suffers larger drops on deformable contacts, high-dimensional control, and complex articulated motion. The authors read this as a sign that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics.
What carries the argument
The ACWM-Phys benchmark, which supplies controlled training and evaluation data across multiple physical regimes together with in-distribution and out-of-distribution protocols inside a fully controllable simulator.
If this is right
- World models that continue to rely on visual patterns will continue to show uneven generalization across physical regimes.
- Cross-attention layers improve conditioning when action spaces become high-dimensional.
- Causal VAEs provide better temporal consistency than frame-wise encoders for video prediction under action control.
- Larger action spaces increase modeling difficulty yet supply richer signals that can improve out-of-distribution robustness.
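The cross-attention point can be made concrete with a small, dependency-free sketch. This is not ACWM-DiT's code; it is a generic single-head cross-attention layer in which latent video tokens act as queries and action tokens act as keys and values. Every name here (`cross_attention`, `random_matrix`, the dimensions, the random weights) is an assumption made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(video_tokens, action_tokens, w_q, w_k, w_v):
    """Let each video token attend over all action tokens.

    Shapes (hypothetical): video_tokens is n_video x d_model,
    action_tokens is n_action x d_action, w_q is d_model x d_head,
    w_k is d_action x d_head, w_v is d_action x d_model.
    """
    q = matmul(video_tokens, w_q)
    k = matmul(action_tokens, w_k)
    v = matmul(action_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    update = matmul(weights, v)
    # Residual connection: add the action-dependent update to each video token.
    conditioned = [
        [x + u for x, u in zip(tok, upd)] for tok, upd in zip(video_tokens, update)
    ]
    return conditioned, weights


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_action, d_head = 8, 3, 4
video_tokens = random_matrix(6, d_model, rng)    # 6 latent video tokens
action_tokens = random_matrix(5, d_action, rng)  # 5 action tokens
w_q = random_matrix(d_model, d_head, rng)
w_k = random_matrix(d_action, d_head, rng)
w_v = random_matrix(d_action, d_model, rng)

conditioned, weights = cross_attention(video_tokens, action_tokens, w_q, w_k, w_v)
print(len(conditioned), len(conditioned[0]), len(weights[0]))
```

Each video token gets its own weighting over the action tokens, so the number of action tokens can grow without changing the video token width. That is one plausible reason such a layer might help as action spaces become high-dimensional, though the page does not describe the paper's exact design.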
Where Pith is reading between the lines
- The benchmark could be extended with real-robot recordings to test whether the observed generalization gaps persist outside simulation.
- Architectures that explicitly encode physical constraints might close the performance gap on complex interactions without needing vastly more data.
- Similar complexity-dependent generalization patterns may appear when the same evaluation protocols are applied to language-conditioned or multi-agent world models.
Load-bearing premise
The simulation environment faithfully captures essential real-world physical interactions without simulator-specific artifacts, and results on the single tested model generalize to the broader class of action-conditioned world models.
What would settle it
A follow-up experiment in which the same model achieves comparable out-of-distribution accuracy on deformable and articulated tasks even after visual textures and lighting are randomized would indicate that performance does not hinge on appearance cues.
Original abstract
Action-conditioned world models (ACWMs) have shown strong promise for video prediction and decision-making. However, existing benchmarks are largely restricted to egocentric navigation or narrow, task-specific robotics datasets, offering only limited coverage of the rich physical interactions required for generalized world understanding. We introduce ACWM-Phys, a new benchmark for evaluating action-conditioned prediction under diverse physical dynamics in a clean, controllable simulation environment with a carefully designed action space. ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics. To evaluate both interpolation and generalization, we design in-distribution and out-of-distribution protocols with controlled shifts in interaction patterns or scene configurations. By building the benchmark in a fully controllable simulator, ACWM-Phys enables precise data collection, reproducible evaluation, and systematic analysis of model capabilities for physically grounded world modeling. Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics. Ablations show that cross-attention improves high-dimensional action conditioning, causal VAEs outperform frame-wise encoders, and larger action spaces are harder to model but can improve generalization by providing richer control signals. These findings guide the design of physically grounded world models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ACWM-Phys, a benchmark for action-conditioned world models (ACWMs) that evaluates video prediction under diverse physical dynamics (rigid-body, kinematics, deformable-object, and particle) in a controllable simulator. It defines in-distribution and out-of-distribution protocols with controlled shifts and reports experiments on ACWM-DiT showing that OoD generalization is stronger for visually simple, low-dimensional interactions with clear geometry but weaker for deformable contacts, high-dimensional control, and complex articulated motion. The authors interpret the larger drops as evidence that the model relies on visual appearance patterns rather than fully learning underlying physics. Ablations examine cross-attention for action conditioning, causal VAEs versus frame-wise encoders, and the effects of action-space size.
Significance. If the central findings hold, the benchmark supplies a reproducible, controllable testbed that expands coverage beyond egocentric navigation or narrow robotics tasks, while the ablations offer practical guidance on architectural choices for high-dimensional action conditioning and temporal modeling. The work highlights a plausible gap between current ACWMs and robust physical understanding, which could steer subsequent model development.
major comments (1)
- [Abstract and §5] Abstract and §5 (results on OoD protocols): the interpretation that larger generalization drops on deformable contacts, high-dimensional control, and complex articulated motion demonstrate reliance on visual appearance patterns rather than physics is not isolated. These regimes are also dynamically more complex; without experiments that hold the underlying physical rules fixed while varying only visual cues (or vice versa), or direct probes of invariants such as momentum conservation or contact-force prediction, the specific causal claim remains under-supported by the reported patterns.
minor comments (2)
- The abstract and experimental description report systematic ablations but omit statistical tests, exact training-set sizes, and error bars; adding these would make the quantitative claims more robust.
- [Ablations] Clarify how 'effective task complexity' is defined and measured independently of the physical regime, and whether any quantitative metric (e.g., degrees of freedom or contact frequency) is used to support the qualitative distinction.
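The request for error bars could be met in several ways; one common option is a percentile bootstrap over per-clip scores. The sketch below assumes each evaluation clip yields one scalar score where higher is better (PSNR is used here only as an example), and reports the in-distribution minus out-of-distribution gap with a confidence interval. The function name `bootstrap_gap_ci` and all numbers are illustrative placeholders, not values from the paper.

```python
import random
import statistics


def bootstrap_gap_ci(id_scores, ood_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(ID) - mean(OoD)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        # Resample clips with replacement, independently for each split.
        id_sample = rng.choices(id_scores, k=len(id_scores))
        ood_sample = rng.choices(ood_scores, k=len(ood_scores))
        gaps.append(statistics.fmean(id_sample) - statistics.fmean(ood_sample))
    gaps.sort()
    lower = gaps[round((alpha / 2) * n_resamples)]
    upper = gaps[round((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(id_scores) - statistics.fmean(ood_scores)
    return point, lower, upper


# Placeholder per-clip scores, invented for this example only.
id_scores = [31.2, 30.5, 32.1, 30.9, 31.7]
ood_scores = [24.3, 22.8, 25.0, 23.5, 24.1]

point, lower, upper = bootstrap_gap_ci(id_scores, ood_scores)
print(f"ID - OoD gap: {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes zero would support the claim that a given regime shows a real generalization drop, and comparing intervals across regimes would indicate whether the drops differ from one another.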
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments. We address the major comment below and outline the revisions we will make to clarify our claims.
- Referee: [Abstract and §5] Abstract and §5 (results on OoD protocols): the interpretation that larger generalization drops on deformable contacts, high-dimensional control, and complex articulated motion demonstrate reliance on visual appearance patterns rather than physics is not isolated. These regimes are also dynamically more complex; without experiments that hold the underlying physical rules fixed while varying only visual cues (or vice versa), or direct probes of invariants such as momentum conservation or contact-force prediction, the specific causal claim remains under-supported by the reported patterns.
  Authors: We agree that the observed OoD drops occur in regimes that are also dynamically more complex, and that our interpretation does not isolate reliance on visual patterns from this confounding factor. The manuscript presents the differential generalization as suggestive evidence rather than a definitive causal demonstration; however, we acknowledge that the current wording in the abstract and §5 can be read as stronger than the supporting experiments warrant. In the revised manuscript we will (i) qualify the relevant sentences to state that the patterns are consistent with reliance on visual appearance while noting the role of increased dynamic complexity, and (ii) add a short discussion paragraph outlining the value of future controlled experiments (e.g., fixed physics with varied visual cues, or direct prediction of invariants such as momentum or contact forces) that would more cleanly separate the two explanations.
  Revision: partial
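A direct probe of an invariant, as raised in this exchange, might look like the following sketch. It assumes that object centroids can be tracked in the predicted frames, that object masses are known from the simulator, and that the probed window contains no external force (including no applied action), so total linear momentum should stay constant. The function names, the 2D setup, and the toy trajectories are all hypothetical.

```python
import math


def total_momentum(positions, masses, dt):
    """Finite-difference total momentum for each pair of consecutive frames.

    positions[t][i] is the (x, y) centroid of object i at frame t.
    """
    momenta = []
    for prev, curr in zip(positions, positions[1:]):
        px = sum(m * (c[0] - p[0]) / dt for m, p, c in zip(masses, prev, curr))
        py = sum(m * (c[1] - p[1]) / dt for m, p, c in zip(masses, prev, curr))
        momenta.append((px, py))
    return momenta


def momentum_drift(positions, masses, dt):
    """Largest deviation from the initial total momentum, relative to its magnitude."""
    momenta = total_momentum(positions, masses, dt)
    p0 = momenta[0]
    reference = math.hypot(p0[0], p0[1]) or 1.0  # avoid dividing by zero
    return max(math.hypot(p[0] - p0[0], p[1] - p0[1]) for p in momenta) / reference


masses = [1.0, 1.0]
dt = 1.0

# Toy trajectory: equal masses exchange velocities in a head-on elastic collision.
consistent = [
    [(0, 0), (3, 0)],
    [(1, 0), (3, 0)],
    [(2, 0), (3, 0)],
    [(2, 0), (4, 0)],
    [(2, 0), (5, 0)],
]

# Toy "prediction" in which the struck object leaves twice as fast as it should.
violating = [
    [(0, 0), (3, 0)],
    [(1, 0), (3, 0)],
    [(2, 0), (3, 0)],
    [(2, 0), (5, 0)],
    [(2, 0), (7, 0)],
]

print(momentum_drift(consistent, masses, dt))  # 0.0
print(momentum_drift(violating, masses, dt))   # 1.0
```

Comparing this drift between in-distribution and out-of-distribution rollouts would test physical consistency separately from pixel-level similarity, which is the kind of separation the referee asks for.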
Circularity Check
No circularity: empirical benchmark paper with no derivations or self-referential reductions
The paper introduces the ACWM-Phys benchmark and reports controlled experiments on ACWM-DiT, drawing conclusions from observed OoD generalization patterns across physical regimes. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claims rest on empirical results from a new simulation environment rather than reducing to self-defined quantities or prior author work by construction. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The chosen simulation accurately represents the target physical dynamics without introducing confounding artifacts.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Paper passage: "Through systematic experiments on ACWM-DiT, we find that OoD generalization depends not only on the physical regime but also on effective task complexity: models generalize well on visually simple, low-dimensional interactions with clear geometric structure, but suffer larger drops on deformable contacts, high-dimensional control, and complex articulated motion. This suggests that the model still relies heavily on visual appearance patterns instead of fully learning the underlying physics."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Paper passage: "ACWM-Phys contains training and evaluation data spanning rigid-body dynamics, kinematics, deformable-object interactions, and particle dynamics."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.