
arxiv: 2605.23699 · v1 · pith:5TTQ2Y2Ynew · submitted 2026-05-22 · 💻 cs.CV

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3

keywords video prediction · physical consistency · counterfactual evaluation · benchmark · world models · video generation · intervention-based test

The pith

Recent open-source video generators fail to keep physical predictions consistent when the viewpoint or scene changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRONOS, a benchmark that tests whether video prediction models learn underlying physical rules or rely on visual patterns. It creates videos of fixed event types such as collisions or falls while changing viewpoint, scene, object category, and appearance in a photorealistic simulator. Tests on recent open-source generators show that prediction quality drops or varies with these changes, especially viewpoint shifts. A reader would care because reliable world models need to handle the same physics regardless of how the scene looks. The benchmark supplies a controlled way to measure and target this consistency.

Core claim

CRONOS is an intervention-based benchmark built in a photorealistic Unreal Engine environment that evaluates counterfactual physical consistency by generating videos where the physical event type stays fixed while viewpoint, scene, object category, and object appearance are systematically varied. Evaluation of recent open-source video generators reveals substantial failures: prediction quality for the same event type is affected by appearance, environment, and particularly by viewpoint changes.

What carries the argument

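The CRONOS benchmark, which renders controlled videos by intervening on four visual factors (viewpoint, scene, object category, and object appearance) while holding the physical event type constant, so that changes in prediction quality can be traced to those interventions.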
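
The intervention design can be sketched as a small enumeration: start from one base condition and change a single factor at a time while the event type stays fixed. This is a minimal illustration only; the `Condition` class, the `counterfactuals` helper, and every factor value below are hypothetical placeholders, not names or assets from the CRONOS release.

```python
from dataclasses import dataclass, replace

# Placeholder factor values for illustration only; these are not CRONOS asset names.
FACTORS = {
    "viewpoint": ["cam_front", "cam_side"],
    "scene": ["kitchen", "warehouse"],
    "object_category": ["ball", "box"],
    "object_appearance": ["plain", "textured"],
}


@dataclass(frozen=True)
class Condition:
    event: str  # physical event type, e.g. "collision", "fall", "occlusion"
    viewpoint: str
    scene: str
    object_category: str
    object_appearance: str


def counterfactuals(base, factor):
    """Return copies of `base` that differ only in `factor`; the event stays fixed."""
    return [
        replace(base, **{factor: value})
        for value in FACTORS[factor]
        if value != getattr(base, factor)
    ]


base = Condition("collision", "cam_front", "kitchen", "ball", "plain")
for factor in FACTORS:
    for variant in counterfactuals(base, factor):
        print(factor, "->", variant)
```

Changing one factor per variant keeps each comparison against the base condition attributable to a single intervention.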

If this is right

  • Current open-source video generators exhibit inconsistent prediction quality for identical physical events under different visual conditions.
  • Viewpoint changes produce the strongest degradation in prediction quality among the tested interventions.
  • The benchmark supplies a reproducible testbed for measuring how generated video quality changes with specific interventions.
  • It sets a concrete target for models that maintain consistent performance across multiple condition changes.
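
One way to make "prediction quality changes with an intervention" concrete is a per-intervention sensitivity score. The sketch below assumes sensitivity is the mean absolute change in a quality score between a base render and its intervened counterpart; the paper's exact definition is not given on this page, and all scores here are made-up numbers for illustration.

```python
from statistics import mean


def sensitivity(pairs):
    """Mean absolute score change over (base_score, intervened_score) pairs."""
    return mean(abs(base - intervened) for base, intervened in pairs)


# Made-up quality scores in [0, 1]; not results from the paper.
scores = {
    "viewpoint": [(0.80, 0.50), (0.70, 0.45)],
    "appearance": [(0.80, 0.75), (0.70, 0.68)],
}

per_intervention = {name: sensitivity(pairs) for name, pairs in scores.items()}
most_sensitive = max(per_intervention, key=per_intervention.get)
print(per_intervention, most_sensitive)
```

Under this assumed definition, a model with perfectly consistent predictions would score 0 for every intervention type.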

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines may need explicit viewpoint augmentation or consistency objectives to reduce reliance on appearance cues.
  • The same intervention design could be applied to test longer-term causal predictions beyond single events.
  • Improved performance on this benchmark would indicate better suitability for downstream uses like robotic planning that require viewpoint-invariant physics.

Load-bearing premise

That systematically varying viewpoint, scene, object category, and object appearance while holding the physical event type fixed constitutes a valid test of whether a model has learned underlying causal physical structure rather than superficial correlations.

What would settle it

Showing that the evaluated video models produce equally high-quality predictions for the same physical event type, such as a collision, regardless of changes in viewpoint, scene, or object appearance would falsify the claim of substantial failures.

Figures

Figures reproduced from arXiv: 2605.23699 by Adam Kortylewski, León Begiristain, Olaf Dünkel.

Figure 1: The CRONOS Benchmark. A benchmark for evaluating counterfactual physical consistency: whether a model’s predictions of physical events respond appropriately to controlled changes in the visual input.
… visual correlations [38, 41]. Despite rapid progress in video generation, it remains unclear whether modern models acquire such representations or primarily rely on superficial statistical regularities in the…
Figure 2: CRONOS dataset overview. Examples illustrating the three physical events (rows) used in the CRONOS benchmark: collision, fall, and occlusion. For each event instance, we render multiple counterfactual observations by varying factors such as scene context, camera viewpoint, object category, and object appearance. Colored overlays show object trajectories across time, visualizing the underlying motion dynami…
Figure 3: Qualitative comparison of multiple video generation models on CRONOS. Generated futures for a collision event are compared with the ground-truth render for successive frames. While most models preserve coarse scene structure, they fail to maintain consistent object dynamics, exhibiting trajectory drift, incorrect physical interactions, or object identity distortions over time.
Figure 4: Results of the human evaluation study. Model performances positively correlate with human ratings (higher is better), as the Pearson correlation coefficients indicate.
… quality. We follow [6] and evaluate on a scale between 1 (very poor) and 5 (excellent). We select 540 representative videos from various models with diverse objects and scenes, and we collect median-aggregated ratings of three annotators. Mo…
Figure 5: Sensitivity to counterfactual interventions. Sensitivities are averaged across metrics, and lower values indicate lower sensitivity. All evaluated models show substantial variation across intervention types, including appearance changes that alter only objects’ visual properties.
… every metric in both I2V and V2V settings (…
Figure 6: Dataset assets. Overview of Unreal Engine assets used for all rendered sequences: original scenes, object models, and appearance variations.
Object disappearance detection. Videos in which objects vanish abruptly can yield artificially inflated scores on object-centric metrics, since evaluation is only meaningful on frames where the target object is present. At the same time, given our video settings, we m…
Figure 7: Instructions of the user study.
… qualification exam before participation. The 8 hired Prolific annotators received a compensation of 14 £/hour.
E Additional examples. In Figures 9, 10 and 11 we show the complete set of generated videos for the examples in Figure 3.
F Broader impacts. Potential positive impacts. CRONOS is an evaluation benchmark for counterfactual physical consistency in video generation models. Its main int…
Figure 8: GUI used in the user study for annotating video quality.
Figure 9: Additional generated examples.
I Computational resources. We provide an overview of the computational resources in…
Figure 10: Additional generated examples.
Figure 11: Additional generated examples.
Figure 12: Failure rates per metric. For each event, we provide the fraction of videos failing in each metric.
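
The human-evaluation check described around Figure 4 (three annotator ratings per video aggregated by the median, then correlated with an automatic score via Pearson's r) can be sketched as follows. The ratings and metric scores are invented for illustration, and `pearson` is a hand-written helper rather than code from the paper.

```python
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented data: three 1-5 ratings per video and one automatic score per video.
ratings = [(4, 5, 4), (2, 1, 2), (3, 3, 5), (1, 2, 1)]
metric_scores = [0.9, 0.3, 0.6, 0.2]

human = [median(r) for r in ratings]  # median-aggregate the three annotators
r = pearson(metric_scores, human)
print(human, round(r, 3))
```

Median aggregation limits the influence of a single outlying annotator, as in the third video above where one rating of 5 does not move the aggregate away from 3.
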
Original abstract

Video prediction is increasingly viewed as a path toward generalizable world models, yet it remains unclear whether these systems learn underlying causal structure or merely exploit superficial visual correlations for future prediction. We introduce CRONOS, an intervention-based benchmark designed to evaluate counterfactual physical consistency: whether a model's predictions of physical events respond appropriately to controlled changes in the visual input, such as variations of scene context, viewpoint, object appearance, and object category. Built in a photorealistic Unreal Engine environment, CRONOS enables controlled, high-fidelity generation of videos across diverse scenes and dynamics. In contrast to previous benchmarks, CRONOS systematically intervenes on four key factors - viewpoint, scene, object category, and object appearance - while keeping the underlying physical event type, such as a collision, occlusion, or fall, fixed. Our evaluation of recent open-source video generators reveals substantial failures in counterfactual physical consistency: prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes. CRONOS provides a controlled and reproducible testbed for diagnosing how the quality of generated videos changes for different interventions, establishing a concrete target for developing models that perform consistently across changes of multiple conditions. The dataset and code are available at our project page.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CRONOS, an intervention-based benchmark built in a photorealistic Unreal Engine environment to evaluate counterfactual physical consistency in video prediction models. It systematically varies viewpoint, scene, object category, and object appearance while holding the underlying physical event type (e.g., collision, fall) fixed, and reports that recent open-source video generators exhibit substantial failures, with prediction quality degrading particularly under viewpoint changes. The work positions this as evidence that models exploit superficial correlations rather than causal physical structure, and releases the dataset and code for reproducibility.

Significance. If the evaluation methodology is robust, CRONOS offers a concrete, reproducible diagnostic for world-model development by providing controlled interventions on factors that should not affect physical event predictions. The open release of data and code is a clear strength that supports further community use and extension.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The claim that fixing the 3D physical event type while intervening on viewpoint tests for causal physical structure (rather than 2D projection sensitivity) is load-bearing but insufficiently justified. Viewpoint changes alter projected trajectories, occlusions, and velocities even for identical 3D dynamics; without explicit controls for this distribution shift or comparisons against 3D-aware baselines, the observed performance drops cannot be unambiguously attributed to failure to learn underlying physics.
  2. [§4] §4 (Evaluation and Results): The abstract states that 'prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes,' but the section provides no quantitative details on the exact scoring metric, statistical tests, or ablation controls for viewpoint-induced shifts. This makes it impossible to verify robustness of the 'substantial failures' claim or rule out post-hoc choices in evaluation.
minor comments (2)
  1. [Abstract] Abstract: 'Counterfactual physical consistency' is used without a concise operational definition that distinguishes it from standard robustness to distribution shift.
  2. [Related Work] Related Work: Additional citations to prior video prediction benchmarks that also use simulation environments would help situate the novelty of the four-factor intervention design.
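
The first major comment rests on a geometric fact: identical 3D motion projects to different 2D motion under different cameras. A minimal sketch under strong simplifying assumptions (an ideal pinhole camera looking along +z with no rotation, a hypothetical focal length of 1000 px, and a viewpoint change modelled as a pure camera translation) shows the effect on projected speed.

```python
from math import hypot


def project(point, cam_pos, focal_px=1000.0):
    """Pinhole projection for a camera at cam_pos looking along +z (no rotation)."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    return (focal_px * x / z, focal_px * y / z)


def mean_image_speed(trajectory, cam_pos, dt):
    """Average projected speed in pixels per second along a 3D trajectory."""
    pts = [project(p, cam_pos) for p in trajectory]
    steps = [hypot(u1 - u0, v1 - v0) for (u0, v0), (u1, v1) in zip(pts, pts[1:])]
    return sum(steps) / (dt * len(steps))


# Same world-space motion: 1 m/s along x at depth z = 5 m, sampled every 0.1 s.
trajectory = [(0.1 * t, 0.0, 5.0) for t in range(11)]

near = mean_image_speed(trajectory, (0.0, 0.0, 0.0), dt=0.1)   # camera 5 m away
far = mean_image_speed(trajectory, (0.0, 0.0, -5.0), dt=0.1)   # camera 10 m away
print(near, far)
```

With these numbers the projected speed halves when the camera distance doubles, even though the 3D dynamics are unchanged, which is the kind of 2D distribution shift the referee asks the authors to control for.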

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the CRONOS benchmark. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): The claim that fixing the 3D physical event type while intervening on viewpoint tests for causal physical structure (rather than 2D projection sensitivity) is load-bearing but insufficiently justified. Viewpoint changes alter projected trajectories, occlusions, and velocities even for identical 3D dynamics; without explicit controls for this distribution shift or comparisons against 3D-aware baselines, the observed performance drops cannot be unambiguously attributed to failure to learn underlying physics.

    Authors: We agree that the justification requires expansion to more clearly separate 3D physical invariance from 2D projection effects. In the revised manuscript we will augment §3 with explicit details on how the Unreal Engine simulation holds world-coordinate dynamics (velocities, collision parameters, gravity) fixed while only camera extrinsics are varied, along with quantitative characterization of the resulting 2D distribution shifts (e.g., changes in projected speed histograms and occlusion statistics). We will also add a limitations paragraph noting the current absence of open-source 3D-aware video generators suitable for direct comparison and will outline how future 3D models could be evaluated within the same protocol. These changes directly address the attribution concern while preserving the benchmark's core design. revision: yes

  2. Referee: §4 (Evaluation and Results): The abstract states that 'prediction quality for the same physical event type is affected by appearance, environment, and, particularly by viewpoint changes,' but the section provides no quantitative details on the exact scoring metric, statistical tests, or ablation controls for viewpoint-induced shifts. This makes it impossible to verify robustness of the 'substantial failures' claim or rule out post-hoc choices in evaluation.

    Authors: We concur that additional quantitative transparency is needed. The revised §4 will specify the exact scoring metric (a composite of event-type classification accuracy and 3D trajectory consistency error derived from the simulator ground truth), report all results with statistical tests (paired t-tests and ANOVA across intervention conditions with p-values), and include targeted ablations that isolate viewpoint effects by comparing performance on identical 3D events rendered from the original versus altered viewpoints. These additions will allow readers to verify the magnitude and robustness of the reported failures. revision: yes
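
The paired comparison mentioned in this response (the same 3D events scored under the original and an altered viewpoint) can be sketched with the paired t statistic, computed here with the standard library only. The scores are invented, and the sketch stops at the statistic; obtaining a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))  # stdev is the sample std (n - 1)
    return t, n - 1


# Invented per-event quality scores for the same events under two viewpoints.
original_view = [0.80, 0.70, 0.90, 0.60]
altered_view = [0.50, 0.45, 0.70, 0.35]

t_stat, dof = paired_t(original_view, altered_view)
print(round(t_stat, 3), dof)
```

Pairing by event removes between-event variation in difficulty, so the statistic reflects the viewpoint change alone.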

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or fitted predictions

Full rationale

The paper presents an intervention-based benchmark (CRONOS) for evaluating video generators on counterfactual physical consistency using new photorealistic data generated in Unreal Engine. No mathematical derivations, equations, parameter fitting, or first-principles predictions are described; the central claims rest on direct empirical evaluation of existing models across controlled interventions (viewpoint, scene, etc.). This is self-contained against external benchmarks and does not reduce any result to its own inputs or self-citations by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the chosen interventions isolate physical consistency; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Varying viewpoint, scene, object category, and appearance while keeping physical event type fixed isolates whether a model has learned causal physical structure.
    This premise defines what the benchmark measures and is stated in the abstract description of the intervention design.



Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 20 internal anchors

  1. [1]

    A. Ali, J. Bai, M. Bala, Y. Balaji, A. Blakeman, T. Cai, J. Cao, T. Cao, E. Cha, Y.-W. Chao, et al. World simulation with video foundation models for physical ai. arXiv preprint arXiv:2511.00062, 2025.

  2. [2]

    M. Assran et al. Self-supervised learning from video with joint embedding predictive architectures. arXiv preprint arXiv:2301.08243, 2023.

  3. [3]

    T. Ates, M. S. Atesoglu, C. Yigit, I. Kesen, M. Kobas, E. Erdem, A. Erdem, T. Goksun, and D. Yuret. Craft: A benchmark for causal reasoning about forces and interactions, 2022. URL https://arxiv.org/abs/2012.04293.

  4. [4]

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  5. [5]

    H. Bansal, Z. Lin, T. Xie, Z. Zong, M. Yarom, Y. Bitton, C. Jiang, Y. Sun, K.-W. Chang, and A. Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520, 2024.

  6. [6]

    H. Bansal, C. Peng, Y. Bitton, R. Goldenberg, A. Grover, and K.-W. Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. arXiv preprint arXiv:2503.06800, 2025.

  7. [7]

    D. M. Bear, E. Wang, D. Mrowca, F. J. Binder, H.-Y. F. Tung, R. Pramod, C. Holdaway, S. Tao, K. Smith, F.-Y. Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261, 2021.

  8. [8]

    A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, V. Jampani, and R. Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.

  9. [9]

    A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  10. [10]

    F. Bordes, Q. Garrido, J. T. Kao, A. Williams, M. Rabbat, and E. Dupoux. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments. arXiv preprint arXiv:2506.09849, 2025.

  11. [11]

    X. Chen, F.-J. Chu, P. Gleize, K. J. Liang, A. Sax, H. Tang, W. Wang, M. Guo, T. Hardin, X. Li, et al. Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624, 2025.

  12. [12]

    Y. Chen, X. Zhu, and T. Li. A physical coherence benchmark for evaluating video generation models via optical flow-guided frame prediction, 2025. URL https://arxiv.org/abs/2502.05503.

  13. [13]

    O. Dünkel, A. Jesslen, J. Xie, C. Theobalt, C. Rupprecht, and A. Kortylewski. CNS-Bench: Benchmarking image classifier robustness under continuous nuisance shifts. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19978–19988, 2025.

  14. [14]

    Epic Games. Unreal Engine, 2025. URL https://unrealengine.com. Software.

  15. [15]

    W. Feng, J. Li, M. Saxon, T.-j. Fu, W. Chen, and W. Y. Wang. Tc-bench: Benchmarking temporal compositionality in text-to-video and image-to-video generation. arXiv preprint arXiv:2406.08656, 2024.

  16. [16]

    A. Foss, C. Evans, S. Mitts, K. Sinha, A. Rizvi, and J. T. Kao. CausalVQA: A physically grounded causal reasoning benchmark for video models, 2025. URL https://arxiv.org/abs/2506.09943.

  17. [17]

    J. Gu, X. Liu, Y. Zeng, A. Nagarajan, F. Zhu, D. Hong, Y. Fan, Q. Yan, K. Zhou, M.-Y. Liu, and X. E. Wang. PhyWorldBench: A comprehensive evaluation of physical realism in text-to-video models, 2025. URL https://arxiv.org/abs/2507.13428.

  18. [18]

    X. Guo, J. Huo, Z. Shi, Z. Song, J. Zhang, and J. Zhao. T2VPhysBench: A first-principles benchmark for physical consistency in text-to-video generation, 2025. URL https://arxiv.org/abs/2505.00337.

  19. [19]

    D. Ha and J. Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.

  20. [20]

    D. Hendrycks and T. G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In 7th International Conference on Learning Representations, ICLR. OpenReview.net, 2019. URL https://openreview.net/forum?id=HJz6tiCqYm.

  22. [22]

    J. Ho, W. Chan, C. Saharia, M. Norouzi, and D. Fleet. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

  23. [23]

    J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In International Conference on Learning Representations (ICLR), 2022.

  24. [24]

    L. Hu, A. Shankarampeta, Y. Huang, Z. Dai, H. Yu, Y. Zhao, H. Kang, D. Zhao, T. Rosing, and H. Zhang. Benchmarking scientific understanding and reasoning for video generation using VideoScience-Bench, 2025. URL https://arxiv.org/abs/2512.02942.

  25. [25]

    Z. Huang, F. Zhang, X. Xu, Y. He, J. Yu, Z. Dong, Q. Ma, N. Chanpaisit, C. Si, Y. Jiang, et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  26. [26]

    Z. Huang et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.

  27. [27]

    S. Jassim, M. Holubar, A. Richter, C. Wolff, X. Ohmer, and E. Bruni. Grasp: a novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 6297–6305, 2024.

  28. [28]

    B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang, G. Huang, and J. Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385, 2024.

  29. [29]

    N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6013–6022, 2025.

  30. [30]

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Yao, G. Zhu, T. Fang, H. Wu, Y. Ai, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

  31. [31]

    C. Li, O. Michel, X. Pan, S. Liu, M. Roberts, and S. Xie. Pisa experiments: Exploring physics post-training for video diffusion models by watching stuff drop. In International Conference on Machine Learning, pages 35685–35709. PMLR, 2025.

  32. [32]

    D. Li, Y. Fang, Y. Chen, S. Yang, S. Cao, J. Wong, M. Luo, X. Wang, H. Yin, J. E. Gonzalez, et al. Worldmodelbench: Judging video generation models as world models. arXiv preprint arXiv:2502.20694, 2025.

  33. [33]

    Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan. Evalcrafter: Benchmarking and evaluating large video generation models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22139–22149, 2024.

  34. [34]

    X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.

  35. [35]

    Z. Ma, M. Liufu, and G. Gkioxari. Out of sight, out of mind? evaluating state evolution in video world models. arXiv preprint arXiv:2603.13215, 2026.

  36. [36]

    F. Meng, J. Liao, X. Tan, W. Shao, Q. Lu, K. Zhang, Y. Cheng, D. Li, Y. Qiao, and P. Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363, 2024.

  37. [37]

    S. Motamed, L. Culp, K. Swersky, P. Jaini, and R. Geirhos. Do generative video models understand physical principles?, 2025. URL https://arxiv.org/abs/2501.09038.

  38. [38]

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  39. [39]

    J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2009.

  40. [40]

    A. Polyak, A. Zohar, A. Brown, A. Tjandra, A. Sinha, A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.

  41. [41]

    T. Ressler-Antal, F. Fundel, M. B. Alaya, S. A. Baumann, F. Krause, M. Gui, and B. Ommer. Dismo: Disentangled motion representations for open-world motion transfer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  42. [42]

    J. Richens and T. Everitt. Robust agents learn causal world models. In The Twelfth International Conference on Learning Representations, 2024.

  43. [43]

    B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

  44. [44]

    M. Shu, C. Liu, W. Qiu, and A. L. Yuille. Identifying model weakness with adversarial examiner. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11998–12006. AAAI Press, 2020. doi: 10.1609/aaai.v34i07.6876.

  45. [45]

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.

  46. [46]

    H. Teng, H. Jia, L. Sun, L. Li, M. Li, M. Tang, S. Han, T. Zhang, W. Zhang, W. Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025.

  47. [47]

    H.-Y. Tung, M. Ding, Z. Chen, D. Bear, C. Gan, J. Tenenbaum, D. Yamins, J. Fan, and K. Smith. Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36:67048–67068, 2023.

  48. [48]

    T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

  49. [49]

    R. Upadhyay, H. Zhang, J. Solomon, A. Agrawal, P. Boreddy, S. S. Narayana, Y. Ba, A. Wong, C. M. de Melo, and A. Kadambi. Worldbench: Disambiguating physics for diagnostic evaluation of world models. arXiv preprint arXiv:2601.21282, 2026.

  50. [50]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

  51. [51]

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024.

  52. [52]

    K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. In International Conference on Learning Representations, 2020.

  53. [53]

    C. Zhang, D. Cherniavskii, A. Tragoudaras, A. Vozikis, T. Nijdam, D. W. E. Prinzhorn, M. Bodracska, N. Sebe, A. Zadaianchuk, and E. Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments, 2025. URL https://arxiv.org/abs/2504.02918.

  54. [54]

    Q. Zhang, P. Jing, H.-X. Yu, F. Ding, F. Nie, W. Wang, Y. Du, J. Zou, J. Wu, and B. Shuai. Physion-eval: Evaluating physical realism in generated video via human reasoning, 2026. URL https://arxiv.org/abs/2603.19607.

  55. [55]

    D. Zheng, Z. Huang, H. Liu, K. Zou, Y. He, F. Zhang, L. Gu, Y. Zhang, J. He, W.-S. Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

A Dataset details

The dataset of CRONOS is handcrafted using combinations of object and scene assets. Objects are selected to align wi...