CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3
The pith
Video generation models fail to keep physical predictions consistent when viewpoint or scene changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRONOS is an intervention-based benchmark built in a photorealistic Unreal Engine environment that evaluates counterfactual physical consistency by generating videos where the physical event type stays fixed while viewpoint, scene, object category, and object appearance are systematically varied. Evaluation of recent open-source video generators reveals substantial failures: prediction quality for the same event type is affected by appearance, environment, and particularly by viewpoint changes.
What carries the argument
The CRONOS benchmark, which produces controlled videos by intervening on four visual factors while holding the physical event type constant to test response to those changes.
If this is right
- Current open-source video generators exhibit inconsistent prediction quality for identical physical events under different visual conditions.
- Viewpoint changes produce the strongest degradation in prediction quality among the tested interventions.
- The benchmark supplies a reproducible testbed for measuring how generated video quality changes with specific interventions.
- It sets a concrete target for models that maintain consistent performance across multiple condition changes.
Where Pith is reading between the lines
- Training pipelines may need explicit viewpoint augmentation or consistency objectives to reduce reliance on appearance cues.
- The same intervention design could be applied to test longer-term causal predictions beyond single events.
- Improved performance on this benchmark would indicate better suitability for downstream uses like robotic planning that require viewpoint-invariant physics.
Load-bearing premise
That systematically varying viewpoint, scene, object category, and object appearance while holding the physical event type fixed constitutes a valid test of whether a model has learned underlying causal physical structure rather than superficial correlations.
What would settle it
A video model that produces equally high-quality predictions for the same physical event type, such as a collision, no matter the changes in viewpoint, scene, or object appearance would falsify the claim of substantial failures.
Figures
read the original abstract
Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors - viewpoint, scene, object category, and object appearance - while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes for different interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions. The dataset and code are available at our project page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CRONOS, an intervention-based benchmark built in a photorealistic Unreal Engine environment to evaluate counterfactual physical consistency in video prediction models. It systematically varies viewpoint, scene, object category, and object appearance while holding the underlying physical event type (e.g., collision, fall) fixed, and reports that recent open-source video generators exhibit substantial failures, with prediction quality degrading particularly under viewpoint changes. The work positions this as evidence that models exploit superficial correlations rather than causal physical structure, and releases the dataset and code for reproducibility.
Significance. If the evaluation methodology is robust, CRONOS offers a concrete, reproducible diagnostic for world-model development by providing controlled interventions on factors that should not affect physical event predictions. The open release of data and code is a clear strength that supports further community use and extension.
major comments (2)
- [§3] §3 (Benchmark Construction): The claim that fixing the 3D physical event type while intervening on viewpoint tests for causal physical structure (rather than 2D projection sensitivity) is load-bearing but insufficiently justified. Viewpoint changes alter projected trajectories, occlusions, and velocities even for identical 3D dynamics; without explicit controls for this distribution shift or comparisons against 3D-aware baselines, the observed performance drops cannot be unambiguously attributed to failure to learn underlying physics.
- [§4] §4 (Evaluation and Results): The abstract states that 'prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes,' but the section provides no quantitative details on the exact scoring metric, statistical tests, or ablation controls for viewpoint-induced shifts. This makes it impossible to verify robustness of the 'substantial failures' claim or rule out post-hoc choices in evaluation.
minor comments (2)
- [Abstract] Abstract: 'Counterfactual physical consistency' is used without a concise operational definition that distinguishes it from standard robustness to distribution shift.
- [Related Work] Related Work: Additional citations to prior video prediction benchmarks that also use simulation environments would help situate the novelty of the four-factor intervention design.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the CRONOS benchmark. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
read point-by-point responses
-
Referee: §3 (Benchmark Construction): The claim that fixing the 3D physical event type while intervening on viewpoint tests for causal physical structure (rather than 2D projection sensitivity) is load-bearing but insufficiently justified. Viewpoint changes alter projected trajectories, occlusions, and velocities even for identical 3D dynamics; without explicit controls for this distribution shift or comparisons against 3D-aware baselines, the observed performance drops cannot be unambiguously attributed to failure to learn underlying physics.
Authors: We agree that the justification requires expansion to more clearly separate 3D physical invariance from 2D projection effects. In the revised manuscript we will augment §3 with explicit details on how the Unreal Engine simulation holds world-coordinate dynamics (velocities, collision parameters, gravity) fixed while only camera extrinsics are varied, along with quantitative characterization of the resulting 2D distribution shifts (e.g., changes in projected speed histograms and occlusion statistics). We will also add a limitations paragraph noting the current absence of open-source 3D-aware video generators suitable for direct comparison and will outline how future 3D models could be evaluated within the same protocol. These changes directly address the attribution concern while preserving the benchmark's core design. revision: yes
-
Referee: §4 (Evaluation and Results): The abstract states that 'prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes,' but the section provides no quantitative details on the exact scoring metric, statistical tests, or ablation controls for viewpoint-induced shifts. This makes it impossible to verify robustness of the 'substantial failures' claim or rule out post-hoc choices in evaluation.
Authors: We concur that additional quantitative transparency is needed. The revised §4 will specify the exact scoring metric (a composite of event-type classification accuracy and 3D trajectory consistency error derived from the simulator ground truth), report all results with statistical tests (paired t-tests and ANOVA across intervention conditions with p-values), and include targeted ablations that isolate viewpoint effects by comparing performance on identical 3D events rendered from the original versus altered viewpoints. These additions will allow readers to verify the magnitude and robustness of the reported failures. revision: yes
Circularity Check
Empirical benchmark with no derivation chain or fitted predictions
full rationale
The paper presents an intervention-based benchmark (CRONOS) for evaluating video generators on counterfactual physical consistency using new photorealistic data generated in Unreal Engine. No mathematical derivations, equations, parameter fitting, or first-principles predictions are described; the central claims rest on direct empirical evaluation of existing models across controlled interventions (viewpoint, scene, etc.). This is self-contained against external benchmarks and does not reduce any result to its own inputs or self-citations by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Varying viewpoint, scene, object category, and appearance while keeping physical event type fixed isolates whether a model has learned causal physical structure.
Reference graph
Works this paper leans on
-
[1]
A. Ali, J. Bai, M. Bala, Y . Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y .-W. Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
M. Assran et al. Self-supervised learning from video with joint embedding predictive architec- tures.arXiv preprint arXiv:2301.08243, 2023
- [3]
-
[4]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
VideoPhy: Evaluating Physical Commonsense for Video Generation
H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y . Bitton, C. Jiang, Y . Sun, K.-W. Chang, and A. Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
Videophy-2: A challenging action-centric physical commonsense evaluation in video generation
H. Bansal, C. Peng, Y . Bitton, R. Goldenberg, A. Grover, and K.-W. Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation.arXiv preprint arXiv:2503.06800, 2025
- [7]
-
[8]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y . Levi, Z. English, V . V oleti, A. Letts, V . Jampani, and R. Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[10]
Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025
F. Bordes, Q. Garrido, J. T. Kao, A. Williams, M. Rabbat, and E. Dupoux. Intphys 2: Bench- marking intuitive physics understanding in complex synthetic environments.arXiv preprint arXiv:2506.09849, 2025
-
[11]
SAM 3D: 3Dfy Anything in Images
X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, et al. Sam 3d: 3dfy anything in images.arXiv preprint arXiv:2511.16624, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [12]
- [13]
-
[14]
Epic Games.Unreal Engine, 2025. URLhttps://unrealengine.com. Software
work page 2025
- [15]
- [16]
- [17]
- [18]
-
[19]
D. Ha and J. Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[20]
D. Hendrycks and T. G. Dietterich. Benchmarking neural network robustness to common cor- ruptions and perturbations. In7th International Conference on Learning Representations, ICLR
-
[21]
URLhttps://openreview.net/forum?id=HJz6tiCqYm
OpenReview.net, 2019. URLhttps://openreview.net/forum?id=HJz6tiCqYm
work page 2019
-
[22]
J. Ho, W. Chan, C. Saharia, M. Norouzi, and D. Fleet. Imagen video: High definition video generation with diffusion models.arXiv preprint arXiv:2210.02303, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[23]
J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. InInternational Conference on Learning Representations (ICLR), 2022
work page 2022
- [24]
- [25]
-
[26]
Z. Huang et al. Vbench: Comprehensive benchmark suite for video generative models.arXiv preprint arXiv:2311.17982, 2023
-
[27]
S. Jassim, M. Holubar, A. Richter, C. Wolff, X. Ohmer, and E. Bruni. Grasp: a novel bench- mark for evaluating language grounding and situated physics understanding in multimodal language models. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 6297–6305, 2024
work page 2024
-
[28]
B. Kang, Y . Yue, R. Lu, Z. Lin, Y . Zhao, K. Wang, G. Huang, and J. Feng. How far is video generation from world model: A physical law perspective.arXiv preprint arXiv:2411.02385, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [29]
-
[30]
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Yao, G. Zhu, T. Fang, H. Wu, Y . Ai, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 11
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, and S. Xie. Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop. InInternational Conference on Machine Learning, pages 35685–35709. PMLR, 2025
work page 2025
- [32]
-
[33]
Y . Liu, X. Cun, X. Liu, X. Wang, Y . Zhang, H. Chen, Y . Liu, T. Zeng, R. Chan, and Y . Shan. Evalcrafter: Benchmarking and evaluating large video generation models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024
work page 2024
-
[34]
X. Ma, Y . Wang, G. Jia, X. Chen, Z. Liu, Y .-F. Li, C. Chen, and Y . Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [35]
-
[36]
F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y . Cheng, D. Li, Y . Qiao, and P. Luo. To- wards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Do generative video models understand physical principles?
S. Motamed, L. Culp, K. Swersky, P. Jaini, and R. Geirhos. Do generative video models understand physical principles?, 2025. URLhttps://arxiv.org/abs/2501.09038
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Pearl.Causality: Models, Reasoning and Inference
J. Pearl.Causality: Models, Reasoning and Inference. Cambridge University Press, 2009
work page 2009
-
[40]
Movie Gen: A Cast of Media Foundation Models
A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y . Ma, C.-Y . Chuang, et al. Movie gen: A cast of media foundation models.arXiv preprint arXiv:2410.13720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
T. Ressler-Antal, F. Fundel, M. B. Alaya, S. A. Baumann, F. Krause, M. Gui, and B. Ommer. Dismo: Disentangled motion representations for open-world motion transfer. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[42]
J. Richens and T. Everitt. Robust agents learn causal world models. InThe Twelfth International Conference on Learning Representations, 2024
work page 2024
-
[43]
B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y . Bengio. Toward causal representation learning.Proceedings of the IEEE, 109(5):612–634, 2021
work page 2021
-
[44]
M. Shu, C. Liu, W. Qiu, and A. L. Yuille. Identifying model weakness with adversarial examiner. InProceedings of the AAAI Conference on Artificial Intelligence, pages 11998–12006. AAAI Press, 2020. doi: 10.1609/aaai.v34i07.6876
-
[45]
Make-A-Video: Text-to-Video Generation without Text-Video Data
U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[46]
H. Teng, H. Jia, L. Sun, L. Li, M. Li, M. Tang, S. Han, T. Zhang, W. Zhang, W. Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[47]
H.-Y . Tung, M. Ding, Z. Chen, D. Bear, C. Gan, J. Tenenbaum, D. Yamins, J. Fan, and K. Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties.Advances in Neural Information Processing Systems, 36:67048–67068, 2023
work page 2023
-
[48]
Towards Accurate Generative Models of Video: A New Metric & Challenges
T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. To- wards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717, 2018. 12
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[49]
R. Upadhyay, H. Zhang, J. Solomon, A. Agrawal, P. Boreddy, S. S. Narayana, Y . Ba, A. Wong, C. M. de Melo, and A. Kadambi. Worldbench: Disambiguating physics for diagnostic evaluation of world models.arXiv preprint arXiv:2601.21282, 2026
-
[50]
T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
K. Yi, C. Gan, Y . Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. InInternational Conference on Learning Representations, 2020
work page 2020
-
[53]
C. Zhang, D. Cherniavskii, A. Tragoudaras, A. V ozikis, T. Nijdam, D. W. E. Prinzhorn, M. Bodracska, N. Sebe, A. Zadaianchuk, and E. Gavves. Morpheus: Benchmarking phys- ical reasoning of video generative models with real physical experiments, 2025. URL https://arxiv.org/abs/2504.02918
- [54]
-
[55]
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
D. Zheng, Z. Huang, H. Liu, K. Zou, Y . He, F. Zhang, L. Gu, Y . Zhang, J. He, W.-S. Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755, 2025. 13 A Dataset details The dataset of CRONOS is handcrafted using combinations of object and scene assets. Objects are selected to align wi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.