arxiv: 2606.28757 · v1 · pith:RHLJK3WPnew · submitted 2026-06-27 · 💻 cs.CV · cs.RO

A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models

Pith reviewed 2026-06-30 09:49 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords world models · physical plausibility · multi-agent collisions · benchmark · momentum conservation · video reconstruction · dynamics evaluation

The pith

World models generate visually realistic multi-agent collisions yet frequently violate momentum and energy conservation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrashTwin as a benchmark that tests whether generative world models obey physical laws when producing multi-agent interaction sequences such as vehicle collisions. Standard evaluations focus on image quality and semantic match, which do not detect whether generated motion respects conservation principles. CrashTwin supplies 37K collision sequences together with a reconstruction method that extracts 3D positions, velocities, and energies directly from uncalibrated video output. When applied to current models, the diagnostics show repeated failures in spatio-temporal consistency and physical conservation even on sequences that look convincing. This matters for any use of world models as simulators, because safety-critical planning requires dynamics that are physically correct rather than merely plausible in appearance.

Core claim

CrashTwin shows that state-of-the-art world models achieve high perceptual quality while committing severe physical violations during complex multi-agent interactions, as quantified by systematic failures across spatio-temporal consistency, momentum and kinetic energy conservation, and world-dynamics integrity.
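
The conservation part of this claim can be made concrete with a small sketch. The paper's exact metric definitions are not reproduced on this page, so the helper names, the 2D ground-plane simplification, and the relative-error normalisation below are illustrative assumptions rather than CrashTwin's implementation. The sketch compares total momentum and total kinetic energy just before and just after contact, assuming a window short enough that external forces can be neglected.

```python
import math

# Hypothetical helper names; masses in kg, velocities as (vx, vy) in m/s.


def total_momentum(masses, velocities):
    """Vector sum of m_i * v_i over all agents."""
    px = sum(m * vx for m, (vx, _) in zip(masses, velocities))
    py = sum(m * vy for m, (_, vy) in zip(masses, velocities))
    return px, py


def total_kinetic_energy(masses, velocities):
    """Sum of 0.5 * m_i * |v_i|^2 over all agents."""
    return sum(0.5 * m * (vx * vx + vy * vy)
               for m, (vx, vy) in zip(masses, velocities))


def conservation_residuals(masses, v_before, v_after):
    """Relative momentum change and relative kinetic-energy change."""
    p0 = total_momentum(masses, v_before)
    p1 = total_momentum(masses, v_after)
    ke0 = total_kinetic_energy(masses, v_before)
    ke1 = total_kinetic_energy(masses, v_after)
    momentum_rel_error = (math.hypot(p1[0] - p0[0], p1[1] - p0[1])
                          / max(math.hypot(p0[0], p0[1]), 1e-9))
    ke_rel_change = (ke1 - ke0) / max(ke0, 1e-9)
    return {"momentum_rel_error": momentum_rel_error,
            "ke_rel_change": ke_rel_change}


masses = [1500.0, 1000.0]
before = [(20.0, 0.0), (0.0, 0.0)]

# Perfectly inelastic collision: both cars leave at the common velocity.
plausible = conservation_residuals(masses, before, [(12.0, 0.0), (12.0, 0.0)])

# Struck car launches forward while the striking car keeps its full speed.
implausible = conservation_residuals(masses, before, [(20.0, 0.0), (15.0, 0.0)])
```

In the first case momentum is unchanged and kinetic energy drops by 40%, which is expected for an inelastic impact. In the second case both momentum and kinetic energy increase with no external input, which is the kind of rollout a conservation check would flag. Kinetic energy is therefore treated asymmetrically here: a loss is normal, a gain is not.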

What carries the argument

The calibration-free reconstruction pipeline that recovers metric-scale 3D kinematics and physical attributes from uncalibrated world-model video rollouts.

If this is right

  • Any world model intended for simulation of safety-critical interactions must be assessed on the three proposed physical dimensions in addition to visual metrics.
  • Models that pass perceptual tests but fail conservation checks cannot be trusted for downstream planning or control tasks.
  • The diagnostic suite supplies concrete numerical targets that can guide training or post-processing to enforce physical consistency.
  • The paired synthetic and real-world collision datasets enable controlled ablation of which interaction types expose the largest physical gaps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding the physical metrics as auxiliary losses during model training could reduce violations without separate post-hoc filtering.
  • The same reconstruction approach might be applied to non-collision multi-agent scenarios to test whether physical failures are limited to high-contact regimes.
  • Deployed simulators could reject or re-sample any rollout that exceeds the reported violation thresholds before using it for decision making.
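
The reject-or-re-sample idea in the last bullet could look roughly like the sketch below. Everything in it is hypothetical: `generate` stands in for a world-model sampling call, `violation_score` for any scalar physical-violation metric, and the threshold value is a placeholder, since no numerical thresholds are reproduced on this page.

```python
def sample_until_plausible(generate, violation_score, threshold, max_attempts=5):
    """Draw rollouts until one scores at or below threshold, or attempts run out.

    Returns (rollout, score, attempts); rollout is None if every draw was rejected,
    in which case score is the score of the last rejected draw.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    score = None
    for attempt in range(1, max_attempts + 1):
        rollout = generate()
        score = violation_score(rollout)
        if score <= threshold:
            return rollout, score, attempt
    return None, score, max_attempts


# Toy stand-ins: three pre-scored rollouts served in order.
candidates = iter([
    {"id": "a", "momentum_rel_error": 0.42},
    {"id": "b", "momentum_rel_error": 0.31},
    {"id": "c", "momentum_rel_error": 0.04},
])

rollout, score, attempts = sample_until_plausible(
    generate=lambda: next(candidates),
    violation_score=lambda r: r["momentum_rel_error"],
    threshold=0.1,  # placeholder value, not taken from the paper
)
```

Capping the number of attempts keeps the cost bounded; a caller that receives `None` has to decide whether to fall back to another simulator or to skip the scenario.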

Load-bearing premise

The calibration-free reconstruction pipeline can reliably recover accurate metric-scale 3D kinematics and physical attributes directly from uncalibrated video rollouts.

What would settle it

Applying the pipeline to controlled synthetic videos whose true 3D trajectories and energies are known and observing large systematic discrepancies between recovered and ground-truth momentum values would show the pipeline cannot support the claimed diagnostics.
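
A minimal sketch of such a check, assuming the recovered and ground-truth momentum magnitudes are already available as paired lists. The function name and the sample numbers are made up for illustration. Reporting the signed bias alongside MAE and RMSE helps separate systematic error, such as a global scale offset, from random scatter.

```python
import math


def error_summary(recovered, ground_truth):
    """MAE, RMSE, signed bias, and mean relative error between two sequences."""
    if len(recovered) != len(ground_truth) or not recovered:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [r - g for r, g in zip(recovered, ground_truth)]
    n = len(diffs)
    # Relative error is only defined where the ground-truth value is nonzero.
    rel = [abs(d) / abs(g) for d, g in zip(diffs, ground_truth) if g != 0]
    return {
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        "bias": sum(diffs) / n,  # far from zero suggests systematic error
        "mean_rel_error": sum(rel) / len(rel) if rel else float("nan"),
    }


# Made-up momentum magnitudes (kg*m/s) for three synthetic collisions.
gt_momentum = [30000.0, 24000.0, 18000.0]
recovered_momentum = [33000.0, 26400.0, 19800.0]  # uniformly 10% too large

summary = error_summary(recovered_momentum, gt_momentum)
```

For this toy input the bias equals the MAE (2400 kg·m/s) and the mean relative error is 10%, the signature of a pure scale error rather than noise.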

Figures

Figures reproduced from arXiv: 2606.28757 by Boris Ivanovic, Junyuan Hong, Lulin Liu, Marco Pavone, Nuo Chen, Wenyan Cong, Yang Zhou, Yan Wang, Yunhao Yang, Zhangyang Wang, Zhengzhong Tu, Zhiwen Fan, Zihao Li, Zihao Zhu, Ziyao Zeng.

Figure 1: CrashTwin Benchmark. Left illustrates the content of CrashTwin and the proposed evaluation pipeline that extracts physical attributes from uncalibrated videos to enable physics-grounded evaluation. Middle shows the two-alternative-forced-choice (2AFC Score), demonstrating that metrics derived from physical dynamics, indicated by colored bars, align more strongly with human preferences for physical realism …

Figure 2: Failure cases of existing world models revealed by our benchmark.

Figure 3: Overview of the CrashTwin framework. We derive safety-critical collision from national pre-crash statistics and instantiate them in both a controllable CARLA suite and a diverse real-world corpus. Current world models often violate physical principles in these rollouts, motivating our physics-grounded evaluation framework that measures spatio-temporal consistency, momentum and energy conservation, and worl…

Figure 4: Global dynamic reconstruction pipeline. The system reconstructs the 3D dynamics of collision participants from uncalibrated accident videos, where camera parameters, scene scale, and ego-motion are not directly available. We combine 3D tracking, video segmentation, metric depth estimation, and visual odometry to obtain coherent per-actor trajectories. Tracking fragments are re-linked to maintain identity c…

Figure 5: Qualitative comparison before and after physics post-training.

Figure 6: Qualitative illustration of typical failure patterns captured by our …

Figure 7: Global dynamic reconstruction stages. For a representative collision, we visualize the intermediate 3D trajectories after each stage of the reconstruction pipeline. From top to bottom, the rows show initial 3D track fragments, re-linked 3D instance tracks guided by SAM2 masks, relative dynamics after metric depth correction, raw global trajectories after ego motion compensation, and the final Kalman-smooth…

Figure 8: Additional qualitative reconstruction results.

Figure 9: User interface for the human 2AFC study. Each page displays two rollouts from the same scenario (left: A, right: B) with anonymized model identities, together with the question “Which video seems more physically reasonable?” and radio buttons for selecting A or B. Annotators can replay both videos before answering, and a progress bar at the top indicates completion status of the session.
Original abstract

Generative world models hold immense promise as scalable simulators for autonomous systems, particularly for synthesizing rare but safety-critical multi-agent interactions, such as vehicle collisions. However, current evaluation paradigms index heavily on visual fidelity and semantic alignment, leaving a critical blind spot: they cannot reliably quantify whether generated dynamics actually obey the fundamental physical laws required for reliable simulation. Assessing this physical plausibility is inherently difficult due to a lack of physical metrics and the challenge of extracting metric-scale kinematics from uncalibrated video rollouts. To bridge this gap, we introduce CrashTwin, a physics-grounded evaluation framework designed to stress-test the physical trustworthiness of world models. CrashTwin couples a diverse dataset of multi-agent collision scenarios, comprising 25K controllable synthetic and 12K in-the-wild real-world collision sequences with a novel calibration-free reconstruction pipeline, enabling the recovery of 3D physical attributes directly from world model rollouts. We propose a diagnostic suite that systematically evaluates three dimensions: spatio-temporal consistency, momentum and kinetic energy conservation, and world-dynamics integrity. Extensive benchmarking of state-of-the-art models reveals a crucial insight: high perceptual quality frequently masks severe physical violations during complex interactions. By quantitatively exposing these failure modes, CrashTwin provides a vital diagnostic tool for developing physically grounded world models capable of reliable real-world simulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CrashTwin, a physics-grounded benchmark for evaluating multi-agent dynamics in generative world models, with emphasis on vehicle collision scenarios. It contributes a dataset of 25K synthetic and 12K real-world collision sequences, a calibration-free reconstruction pipeline to recover 3D physical attributes (positions, velocities, masses) from uncalibrated video rollouts, and a diagnostic suite measuring spatio-temporal consistency, momentum/kinetic energy conservation, and world-dynamics integrity. Benchmarking of state-of-the-art models indicates that high perceptual quality often conceals severe physical violations.

Significance. If the reconstruction pipeline is shown to recover accurate metric-scale kinematics with bounded error, the benchmark would provide a valuable, falsifiable diagnostic for physical trustworthiness in world models intended for simulation. The focus on conservation laws during complex interactions and the scale of the collision dataset are strengths that address a genuine gap beyond visual metrics.

major comments (2)
  1. [Reconstruction pipeline (§3)] Reconstruction pipeline (abstract and §3): The claim that the calibration-free pipeline 'enables the recovery of 3D physical attributes directly' from monocular rollouts is load-bearing for all three diagnostic dimensions, yet no quantitative validation (e.g., MAE on recovered velocities, depths, or masses against ground truth on the 25K synthetic sequences) or error bounds are reported. Systematic bias from depth estimation on collision scenes or domain shift to generated videos would appear as physical violations even if the world model is correct.
  2. [Conservation metrics (§4.2)] Conservation metrics (§4.2): Momentum and kinetic energy conservation checks require absolute metric velocities and masses; any unquantified scale ambiguity or reconstruction variance directly undermines attribution of violations to the models rather than the measurement process. Explicit sensitivity analysis to reconstruction noise should be added.
minor comments (1)
  1. [Abstract / §4] The distinction between 'world-dynamics integrity' and the conservation checks is not immediately clear from the abstract or high-level description; a concise definition or table of metrics would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of validating the reconstruction pipeline and quantifying its impact on the conservation metrics. We address each major comment below and commit to revisions that strengthen the manuscript's claims without altering its core contributions.

Point-by-point responses
  1. Referee: [Reconstruction pipeline (§3)] Reconstruction pipeline (abstract and §3): The claim that the calibration-free pipeline 'enables the recovery of 3D physical attributes directly' from monocular rollouts is load-bearing for all three diagnostic dimensions, yet no quantitative validation (e.g., MAE on recovered velocities, depths, or masses against ground truth on the 25K synthetic sequences) or error bounds are reported. Systematic bias from depth estimation on collision scenes or domain shift to generated videos would appear as physical violations even if the world model is correct.

    Authors: We agree that explicit quantitative validation of the reconstruction pipeline against ground truth is essential to support the load-bearing claims. The 25K synthetic sequences were generated with known metric-scale ground truth (positions, velocities, masses), which enables direct computation of errors. The current manuscript focuses on the pipeline's application to both synthetic and real-world data but does not report MAE or error bounds. We will add a dedicated validation subsection in §3 reporting MAE, RMSE, and relative error for velocities, depths, and masses on the full synthetic set, along with analysis of potential biases in collision scenes (e.g., occlusion, fast motion) and a brief discussion of domain shift considerations for generated videos. This revision will include error bounds and confidence intervals. revision: yes

  2. Referee: [Conservation metrics (§4.2)] Conservation metrics (§4.2): Momentum and kinetic energy conservation checks require absolute metric velocities and masses; any unquantified scale ambiguity or reconstruction variance directly undermines attribution of violations to the models rather than the measurement process. Explicit sensitivity analysis to reconstruction noise should be added.

    Authors: We concur that reconstruction variance must be quantified to ensure violations are attributable to the world models. While the pipeline recovers metric-scale attributes via the synthetic calibration and real-world priors described in §3, we did not include sensitivity analysis in the original submission. We will add this analysis to §4.2 by (i) injecting controlled Gaussian noise at levels matching the reported reconstruction errors into the recovered velocities and masses, (ii) recomputing the conservation violation rates, and (iii) reporting how the diagnostic scores vary with noise magnitude. This will demonstrate robustness and clarify the separation between measurement error and model violations. revision: yes
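
The three-step sensitivity analysis proposed in the second response can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: collisions are reduced to 1D, the noise is multiplicative Gaussian, the violation threshold of 0.1 and the noise levels are placeholders, and all function names are hypothetical. The sketch starts from collisions that conserve momentum by construction, injects noise into masses and velocities, and counts how often a momentum check would wrongly flag them.

```python
import random


def momentum_rel_error(masses, v_before, v_after):
    """Relative change in total 1D momentum across the collision."""
    p0 = sum(m * v for m, v in zip(masses, v_before))
    p1 = sum(m * v for m, v in zip(masses, v_after))
    return abs(p1 - p0) / max(abs(p0), 1e-9)


def consistent_collision(rng):
    """Random 1D perfectly inelastic collision; momentum is conserved
    up to floating-point rounding."""
    m1, m2 = rng.uniform(900.0, 2500.0), rng.uniform(900.0, 2500.0)
    u1, u2 = rng.uniform(5.0, 30.0), rng.uniform(0.0, 5.0)
    v = (m1 * u1 + m2 * u2) / (m1 + m2)
    return [m1, m2], [u1, u2], [v, v]


def false_violation_rate(rel_noise, threshold=0.1, n_trials=2000, seed=0):
    """Fraction of physically valid collisions flagged once noise is injected."""
    rng = random.Random(seed)

    def jitter(x):
        # Multiplicative Gaussian noise with standard deviation rel_noise.
        return x * (1.0 + rng.gauss(0.0, rel_noise))

    flagged = 0
    for _ in range(n_trials):
        masses, v_before, v_after = consistent_collision(rng)
        noisy_masses = [jitter(m) for m in masses]  # one mass estimate per agent
        noisy_before = [jitter(v) for v in v_before]
        noisy_after = [jitter(v) for v in v_after]
        if momentum_rel_error(noisy_masses, noisy_before, noisy_after) > threshold:
            flagged += 1
    return flagged / n_trials


# Placeholder noise levels; a real analysis would use measured reconstruction errors.
rates = {noise: false_violation_rate(noise) for noise in (0.0, 0.01, 0.05, 0.2)}
```

Because every simulated collision is physically valid, any flagged case is attributable to measurement noise alone. Comparing these rates with the violation rates observed on world-model rollouts indicates how much of the reported gap could be explained by the reconstruction rather than by the model.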

Circularity Check

0 steps flagged

No significant circularity; benchmark pipeline is independent contribution

full rationale

The paper introduces CrashTwin as an evaluation framework consisting of a dataset and a calibration-free reconstruction pipeline to extract 3D physical attributes from video rollouts, followed by computation of conservation and consistency metrics. No derivation chain, prediction, or first-principles result is presented that reduces by construction to fitted inputs, self-definitions, or self-citations. The pipeline is framed as a novel enabling tool rather than a result derived from the metrics it produces. The abstract and provided text contain no load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results. This is a standard benchmark paper whose central claims rest on the proposed method's external applicability, not internal equivalence to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that physical conservation laws must hold for trustworthy simulation and that the new pipeline can extract the necessary attributes without calibration; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: World models for autonomous systems must obey fundamental physical laws such as momentum and kinetic energy conservation to be reliable simulators.
    This underpins the entire diagnostic suite and the claim that current models fail.
invented entities (1)
  • CrashTwin framework (no independent evidence)
    purpose: To provide physics-grounded evaluation of world model rollouts
    Newly proposed benchmark and pipeline; no independent evidence outside the paper is described.



Reference graph

Works this paper leans on

73 extracted references · 21 canonical work pages · 14 internal anchors

  1. [1]

    Waymo safety report. Tech. rep., Waymo LLC (Dec 2021),https://storage. googleapis . com / waymo - uploads / files / documents / safety / 2021 - 12 - waymo - safety-report.pdf, accessed 2026-03-04

  2. [2]

    DOT HS810(767), 4 (2007)

    Administration, N.H.T.S., et al.: Pre-crash scenario typology for crash avoidance research. DOT HS810(767), 4 (2007)

  3. [3]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  4. [4]

    In: Asian Conference on Computer Vision

    Aliakbarian, M.S., Saleh, F.S., Salzmann, M., Fernando, B., Petersson, L., Ander- sson, L.: Viena: A driving anticipation dataset. In: Asian Conference on Computer Vision. pp. 449–466. Springer (2018)

  5. [5]

    VideoPhy: Evaluating Physical Commonsense for Video Generation

    Bansal, H., Lin, Z., Xie, T., Zong, Z., Yarom, M., Bitton, Y., Jiang, C., Sun, Y., Chang, K.W., Grover, A.: Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520 (2024)

  6. [6]

    In: Proceedings of the 28th ACM International Conference on Multimedia

    Bao, W., Yu, Q., Kong, Y.: Uncertainty-based traffic accident anticipation with spatio-temporal relational learning. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 2682–2690 (2020)

  7. [7]

    Brach, R.M., Brach, R.M.: A review of impact models for vehicle collision (1987)

  8. [8]

    In: Asian conference on computer vision

    Chan, F.H., Chen, Y.T., Xiang, Y., Sun, M.: Anticipating accidents in dashcam videos. In: Asian conference on computer vision. pp. 136–153. Springer (2016)

  9. [9]

    Cornell University (1997)

    Chatterjee, A.: Rigid body collisions: some general considerations, new collision laws, and some experimental data. Cornell University (1997)

  10. [10]

    SkyReels-V2: Infinite-length Film Generative Model

    Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  12. [12]

    Cong, W., Zhu, H., Wang, P., Liu, B., Xu, D., Wang, K., Pan, D.Z., Wang, Y., Fan, Z., Wang, Z.: Can test-time scaling improve world foundation model? arXiv preprint arXiv:2503.24320 (2025)

  13. [13]

    In: Conference on robot learning

    Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017)

  14. [14]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Fang, J., Li, L.l., Zhou, J., Xiao, J., Yu, H., Lv, C., Xue, J., Chua, T.S.: Abductive ego-view accident video understanding for safe driving perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22030–22040 (2024)

  15. [15]

    IEEE transactions on intelligent transportation systems 23(6), 4959–4971 (2021) A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models 31

    Fang, J., Yan, D., Qiao, J., Xue, J., Yu, H.: Dada: Driver attention prediction in driving accident scenarios. IEEE transactions on intelligent transportation systems 23(6), 4959–4971 (2021) A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models 31

  16. [16]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Gao, Y., Guo, H., Hoang, T., Huang, W., Jiang, L., Kong, F., Li, H., Li, J., Li, L., Li, X., et al.: Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113 (2025)

  17. [17]

    Technical report, Google DeepMind (2026),https: //blog.google/innovation-and-ai/technology/ai/veo-3-1-ingredients-to- video/, released: January 13, 2026

    Google DeepMind: Veo 3.1. Technical report, Google DeepMind (2026),https: //blog.google/innovation-and-ai/technology/ai/veo-3-1-ingredients-to- video/, released: January 13, 2026. Accessed: March 5, 2026

  18. [18]

    arXiv preprint arXiv:2506.00227 (2025)

    Gosselin, A., Luo, G.Y., Lara, L., Golemo, F., Nowrouzezahrai, D., Paull, L., Jolicoeur-Martineau, A., Pal, C.: Ctrl-crash: Controllable diffusion for realistic car crashes. arXiv preprint arXiv:2506.00227 (2025)

  19. [19]

    Green, D.M., Swets, J.A., et al.: Signal detection theory and psychophysics, vol. 1. Wiley New York (1966)

  20. [20]

    In: 2024 IEEE International Conference on Multimedia and Expo (ICME)

    Guo, Z., Zhou, Y., Gou, C.: Drivinggen: Efficient safety-critical driving video gen- eration with latent diffusion models. In: 2024 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2024)

  21. [21]

    In: International conference on machine learning

    Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., Davidson, J.: Learning latent dynamics for planning from pixels. In: International conference on machine learning. pp. 2555–2565. PMLR (2019)

  22. [22]

    HailuoAI: Hailuo.https://hailuoai.video/(2024), accessed: 2025-02-24

  23. [23]

    In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference

    Han, H., Li, S., Chen, J., Yuan, Y., Wu, Y., Deng, Y., Leong, C.T., Du, H., Fu, J., Li, Y., et al.: Video-bench: Human-aligned video generation benchmark. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 18858– 18868 (2025)

  24. [24]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hassan, M., Stapf, S., Rahimi, A., Rezende, P., Haghighi, Y., Brüggemann, D., Katircioglu,I.,Zhang,L.,Chen,X.,Saha,S.,etal.:Gem:Ageneralizableego-vision multimodal world model for fine-grained ego-motion, object dynamics, and scene composition control. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22404–22415 (2025)

  25. [25]

    Advances in neural information processing systems30(2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems30(2017)

  26. [26]

    GAIA-1: A Generative World Model for Autonomous Driving

    Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080 (2023)

  27. [27]

    Iclr1(2), 3 (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: Lora: Low-rank adaptation of large language models. Iclr1(2), 3 (2022)

  28. [28]

    IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 10579–10596 (2024)

    Hu, M., Yin, W., Zhang, C., Cai, Z., Long, X., Chen, H., Wang, K., Yu, G., Shen, C., Shen, S.: Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence46(12), 10579–10596 (2024)

  29. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video gener- ative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  30. [30]

    Kalman, R.E.: A new approach to linear filtering and prediction problems (1960)

  31. [31]

    How Far is Video Generation from World Model: A Physical Law Perspective

    Kang, B., Yue, Y., Lu, R., Lin, Z., Zhao, Y., Wang, K., Huang, G., Feng, J.: How far is video generation from world model: A physical law perspective, 2024. URL https://arxiv. org/abs/2411.023852, 36

  32. [32]

    IEEE Trans- actions on Intelligent Vehicles9(1), 1792–1803 (2023) 32 N

    Karim, M.M., Yin, Z., Qin, R.: An attention-guided multistream feature fusion network for early localization of risky traffic agents in driving videos. IEEE Trans- actions on Intelligent Vehicles9(1), 1792–1803 (2023) 32 N. Chen et al

  33. [33]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., et al.: Mapanything: Univer- sal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414 (2025)

  34. [34]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Kim, H., Lee, K., Hwang, G., Suh, C.: Crash to not crash: Learn to identify dan- gerous vehicles using a simulator. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 978–985 (2019)

  35. [35]

    In: 2025 IEEE International Conference on Robotics and Automation (ICRA)

    Li, C., Zhou, K., Liu, T., Wang, Y., Zhuang, M., Gao, H.a., Jin, B., Zhao, H.: Avd2: Accident video diffusion for accident video description. In: 2025 IEEE International Conference on Robotics and Automation (ICRA). pp. 13289–13296. IEEE (2025)

  36. [36]

    Advances in Neural Information Processing Systems37, 109790–109816 (2024)

    Liao, M., Ye, Q., Zuo, W., Wan, F., Wang, T., Zhao, Y., Wang, J., Zhang, X., et al.: Evaluation of text-to-video generation models: A dynamics perspective. Advances in Neural Information Processing Systems37, 109790–109816 (2024)

  37. [37]

    IEEE Transactions on Intelligent Transportation Systems23(8), 12518–12530 (2021)

    Liu, C., Li, Z., Chang, F., Li, S., Xie, J.: Temporal shift and spatial attention-based two-stream network for traffic risk assessment. IEEE Transactions on Intelligent Transportation Systems23(8), 12518–12530 (2021)

  38. [38]

    Liu, M., Zhang, W.: Is your video language model a reliable judge? arXiv preprint arXiv:2503.05977 (2025)

  39. [39]

    In: European Conference on Computer Vision

    Liu, S., Ren, Z., Gupta, S., Wang, S.: Physgen: Rigid-body physics-grounded image-to-video generation. In: European Conference on Computer Vision. pp. 360–

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22139–22149 (2024)

  41. [41]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  42. [42]

    In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Luo, H., Wang, F.: A simulation-based framework for urban traffic accident detec- tion. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

  43. [43]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    Meng, F., Liao, J., Tan, X., Shao, W., Lu, Q., Zhang, K., Cheng, Y., Li, D., Qiao, Y., Luo, P.: Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363 (2024)

  44. [44]

    https://www.ecfr.gov/current/title-49/subtitle-B/chapter-V/part-563/ section-563.7

    National Highway Traffic Safety Administration: 49 CFR §563.7, Data Elements. https://www.ecfr.gov/current/title-49/subtitle-B/chapter-V/part-563/ section-563.7

  45. [45]

    QuantiPhy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models.arXiv preprint arXiv:2512.19526, 2025

    Puyin, L., Xiang, T., Mao, E., Wei, S., Chen, X., Masood, A., Fei-Fei, L., Adeli, E.: Quantiphy: A quantitative benchmark evaluating physical reasoning abilities of vision-language models. arXiv preprint arXiv:2512.19526 (2025)

  46. [46]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  47. [47]

    AIAA journal3(8), 1445–1450 (1965)

    Rauch, H.E., Tung, F., Striebel, C.T.: Maximum likelihood estimates of linear dynamic systems. AIAA journal3(8), 1445–1450 (1965)

  48. [48]

    SAM 2: Segment Anything in Images and Videos

    Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024)

  49. [49]

    nature163(4148), 688–688 (1949)

    Simpson, E.H.: Measurement of diversity. nature163(4148), 688–688 (1949)

  50. [50]

    Cambridge university press (2018) A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models 33

    Stronge, W.J.: Impact mechanics. Cambridge university press (2018) A Physics-Grounded Benchmark for Multi-Agent Dynamics in World Models 33

  51. [51]

    In: Proceed- ings of the Computer Vision and Pattern Recognition Conference

    Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 8406–8416 (2025)

  52. [52]

    Advances in neural information processing systems34, 16558–16569 (2021)

    Teed, Z., Deng, J.: Droid-slam: Deep visual slam for monocular, stereo, and rgb- d cameras. Advances in neural information processing systems34, 16558–16569 (2021)

  53. [53]

    Playerone: Egocentric world simulator.arXiv preprint arXiv:2506.09995, 2025

    Tu, Y., Luo, H., Chen, X., Bai, X., Wang, F., Zhao, H.: Playerone: Egocentric world simulator. arXiv preprint arXiv:2506.09995 (2025)

  54. [54]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Unterthiner, T., Van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717 (2018)

  55. [55]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)

  56. [56]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Wang, T., Kim, S., Wenxuan, J., Xie, E., Ge, C., Chen, J., Li, Z., Luo, P.: Deepac- cident: A motion and accident prediction benchmark for v2x autonomous driving. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 5599–5606 (2024)

  57. [57]

    In: European conference on computer vision

    Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: Drivedreamer: Towards real-world-drive world models for autonomous driving. In: European conference on computer vision. pp. 55–72. Springer (2024)

  58. [58]

    Wang, Y., Lipson, L., Deng, J.: Sea-raft: Simple, efficient, accurate raft for optical flow. In: European Conference on Computer Vision. pp. 36–54. Springer (2024)

  59. [59]

    Wang, Y., He, J., Fan, L., Li, H., Chen, Y., Zhang, Z.: Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14749–14759 (2024)

  60. [60]

    Warner, C.Y., Smith, G.C., James, M.B., Germane, G.J.: Friction applications in accident reconstruction. Tech. rep., SAE Technical Paper (1983)

  61. [61]

    Wen, Y., Zhao, Y., Liu, Y., Jia, F., Wang, Y., Luo, C., Zhang, C., Wang, T., Sun, X., Zhang, X.: Panacea: Panoramic and controllable video generation for autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6902–6912 (2024)

  62. [62]

    Wu, J.Z., Fang, G., Wu, H., Wang, X., Ge, Y., Cun, X., Zhang, D.J., Liu, J.W., Gu, Y., Zhao, R., et al.: Towards a better metric for text-to-video generation. arXiv preprint arXiv:2401.07781 (2024)

  63. [63]

    Xu, R., Lin, H., Jeon, W., Feng, H., Zou, Y., Sun, L., Gorman, J., Tolstaya, E., Tang, S., White, B., et al.: Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv preprint arXiv:2510.26125 (2025)

  64. [64]

    Xue, Q., Yin, X., Yang, B., Gao, W.: Phyt2v: Llm-guided iterative self-refinement for physics-grounded text-to-video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18826–18836 (2025)

  65. [65]

    Yao, Y., Wang, X., Xu, M., Pu, Z., Wang, Y., Atkins, E., Crandall, D.J.: Dota: Unsupervised detection of traffic anomaly in driving videos. IEEE transactions on pattern analysis and machine intelligence 45(1), 444–459 (2022)

  66. [66]

    Yao, Y., Xu, M., Wang, Y., Crandall, D.J., Atkins, E.M.: Unsupervised traffic accident detection in first-person videos. In: 2019 IEEE/RSJ International conference on intelligent robots and systems (IROS). pp. 273–280. IEEE (2019)

  67. [67]

    You, T., Han, B.: Traffic accident benchmark for causality recognition. In: European Conference on Computer Vision. pp. 540–556. Springer (2020)

  68. [68]

    Zhang, K., Tang, Z., Hu, X., Pan, X., Guo, X., Liu, Y., Huang, J., Yuan, L., Zhang, Q., Long, X.X., et al.: Epona: Autoregressive diffusion world model for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27220–27230 (2025)

  69. [69]

    Zhao, G., Wang, X., Zhu, Z., Chen, X., Huang, G., Bao, X., Wang, X.: Drivedreamer-2: Llm-enhanced world models for diverse driving video generation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 10412–10420 (2025)

  70. [70]

    Zheng, D., Huang, Z., Liu, H., Zou, K., He, Y., Zhang, F., Gu, L., Zhang, Y., He, J., Zheng, W.S., et al.: Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755 (2025)

  71. [71]

    Zheng, W., Chen, W., Huang, Y., Zhang, B., Duan, Y., Lu, J.: Occworld: Learning a 3d occupancy world model for autonomous driving. In: European conference on computer vision. pp. 55–72. Springer (2024)

  72. [72]

    Zhou, J., Peng, H., Lu, J.: Collision model for vehicle motion prediction after light impacts. Vehicle System Dynamics 46(S1), 3–15 (2008)

  73. [73]

    Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: European conference on computer vision. pp. 474–490. Springer (2020)