
arxiv: 2602.13294 · v3 · pith:46QXRGDRnew · submitted 2026-02-09 · 💻 cs.CV · cs.AI

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Pith reviewed 2026-05-22 11:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords physical reasoning · multimodal large language models · video reconstruction · simulator code generation · physical dynamics · benchmark evaluation · VisPhyBench

The pith

Multimodal models generate semantic scene descriptions but fail to infer precise physical parameters or produce consistent dynamics when forced to output executable simulator code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisPhyWorld, a new evaluation framework that tests physical reasoning in multimodal large language models by requiring them to produce runnable simulator code from observed videos rather than answering questions. This code can be executed to reconstruct both appearance and motion, making any physical hypotheses directly inspectable, editable, and falsifiable. The authors accompany the framework with VisPhyBench, a collection of 209 scenes drawn from 108 physical templates, and a protocol that measures how well models recover visual details and physically plausible behavior. Experiments demonstrate that leading models handle semantic understanding adequately yet consistently err when estimating parameters such as mass, friction, or elasticity and when maintaining coherent trajectories over time.

Core claim

Requiring models to emit executable simulator code from visual input separates physical reasoning from rendering and reveals that current state-of-the-art MLLMs, while competent at semantic scene understanding, cannot reliably recover accurate physical parameters or generate dynamics that remain consistent with Newtonian principles across frames.

What carries the argument

Code-driven video reconstruction pipeline that converts a model's generated simulator script into an executable program whose output video is compared against ground-truth appearance and motion.
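As a rough illustration of this pipeline, the sketch below treats a string of generated source as the model's physical hypothesis, executes it, and scores the resulting trajectory against a reference. Everything here is an assumption made for illustration: the paper's backends are Three.js and P5.js rather than Python, the names `GENERATED_SOURCE`, `run_generated`, and `trajectory_error` are hypothetical, and a 1D falling ball stands in for a rendered video.

```python
# Hypothetical stand-in for a model's output: source code for a tiny 1D simulator.
# Plain Python is used only for illustration; it is not the paper's backend.
GENERATED_SOURCE = """
def simulate(n_frames, dt=0.1, g=9.8, y0=10.0):
    ys, y, v = [], y0, 0.0
    for _ in range(n_frames):
        ys.append(y)
        v -= g * dt               # semi-implicit Euler: update velocity first
        y = max(0.0, y + v * dt)  # clamp at the ground plane
    return ys
"""


def run_generated(source, **params):
    """Execute generated source and call its simulate() entry point."""
    namespace = {}
    exec(source, namespace)
    return namespace["simulate"](**params)


def trajectory_error(predicted, reference):
    """Mean absolute difference between two equally long trajectories."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# Pretend ground truth: the same program run with the correct gravity.
reference = run_generated(GENERATED_SOURCE, n_frames=10, g=9.8)
# An edited hypothesis with the wrong gravity, re-run for comparison.
wrong_g = run_generated(GENERATED_SOURCE, n_frames=10, g=1.6)

print(trajectory_error(reference, reference))
print(trajectory_error(wrong_g, reference))
```

Because the hypothesis is explicit code, changing a single parameter such as `g` and re-running shows directly how far the reconstruction drifts from the reference.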

If this is right

  • Models must output explicit, testable physical hypotheses rather than vague descriptions.
  • Evaluation becomes independent of rendering quality because the simulator code itself is inspected.
  • The benchmark supplies quantitative scores for both appearance fidelity and physical plausibility of reconstructed motion.
  • Current top models achieve high success rates on semantic tasks yet low accuracy on parameter estimation and long-term dynamics consistency.
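To make "appearance fidelity" concrete, here is a minimal sketch of one common frame-wise score, PSNR, computed over grayscale frames stored as nested lists. PSNR is chosen here as an assumption for illustration; this page does not list which metrics the benchmark uses, and the function names are hypothetical.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def psnr(frame_a, frame_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical frames."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def per_frame_psnr(video_a, video_b):
    """Score two videos frame by frame (assumes they are already aligned in time)."""
    return [psnr(a, b) for a, b in zip(video_a, video_b)]


# Toy 2x2 frames: one pixel differs by the full dynamic range.
reference_frame = [[0, 0], [0, 0]]
predicted_frame = [[0, 0], [0, 255]]
score = psnr(predicted_frame, reference_frame)
print(score)
```

A pixel-level score like this says nothing about whether the motion is physically plausible, which is why the benchmark is described as reporting the two aspects separately.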

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The framework could be extended to interactive settings where models update code after observing simulated outcomes.
  • It offers a route to diagnose whether failures stem from missing physical priors or from inability to translate priors into executable form.
  • Similar code-generation probes might apply to other domains requiring explicit causal models, such as chemical reaction networks.

Load-bearing premise

That the ability to write and run correct simulator code directly measures committed physical reasoning instead of surface-level pattern matching from training data.

What would settle it

A controlled experiment in which models produce code that matches ground-truth dynamics on seen parameter values but systematically deviates on novel mass or friction values would falsify the claim that the method isolates genuine physical inference.
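The shape of such an experiment can be sketched with a toy case. The example below assumes a block sliding to rest under friction, where the stopping distance is d = v0^2 / (2 mu g), and a hypothetical `memorizing_model` that snaps to the nearest friction value it has seen. The parameter values and names are illustrative assumptions, not taken from the paper.

```python
G = 9.81  # gravitational acceleration in m/s^2


def stopping_distance(v0, mu):
    """Distance a block slides before friction stops it: d = v0^2 / (2 mu g)."""
    return v0 ** 2 / (2 * mu * G)


SEEN_MU = [0.2, 0.4, 0.6]   # friction values assumed to appear in training data
NOVEL_MU = [0.3, 0.5]       # held-out friction values


def memorizing_model(mu_true):
    """Hypothetical model that returns the nearest friction value it has seen."""
    return min(SEEN_MU, key=lambda m: abs(m - mu_true))


def mean_relative_error(mus, v0=3.0):
    """Average relative error in stopping distance over a set of friction values."""
    errors = []
    for mu in mus:
        ref = stopping_distance(v0, mu)
        pred = stopping_distance(v0, memorizing_model(mu))
        errors.append(abs(pred - ref) / ref)
    return sum(errors) / len(errors)


seen_err = mean_relative_error(SEEN_MU)
novel_err = mean_relative_error(NOVEL_MU)
print(seen_err, novel_err)
```

A large gap between `novel_err` and `seen_err` is the signature the proposed experiment looks for: accurate dynamics on familiar parameters, systematic deviation on new ones.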

Figures

Figures reproduced from arXiv: 2602.13294 by Jiarong Liang, Ka-Hei Hui, Max Ku, Ping Nie, Wenhu Chen.

Figure 1: MLLMs struggle to simulate physical dynamics. Under the same inputs, code generated with rigid-body simulation backends (Three.js/P5.js) produces more physically consistent rollouts, whereas non-physics backends (SVG/Manim) often exhibit implausible motion or contact artifacts such as interpenetration.
Figure 2: Unlike traditional VQA paradigms, VisPhyWorld assesses physical understanding by requiring MLLMs to actively reconstruct scenes via executable code, which makes the model's reasoning more explainable than under those paradigms.
VisPhyWorld probes the physical reasoning capabilities of MLLMs through visual-to-code reconstruction. Given two key frames (and optionally object detections), the model p…
Figure 3: VisPhyWorld Framework. (1) System & Data Construction: We process raw video sequences to extract key frames ($I_{\text{start}}$, $I_{\text{later}}$) and detection contexts using multimodal agents. (2) Pipeline & Simulation Flow: An LLM-based agent performs motion analysis and generates raw executable code, which is then sanitized and rendered. (3) Evaluation Benchmark: We propose a multi-metric benchmark integrating semantic and …
Figure 4: Key metrics on VisPhyBench. We compare code-driven reconstruction (multiple MLLMs) against pixel-space baselines (Veo 3.1 and SVD) under the unified evaluation protocol.
… layout and inferred parameters; (iii) an executable program $C \in Y_{\text{code}}$; and (iv) a rendered video $\hat{X} = (\hat{I}_t)_{t=1}^{\hat{T}}$ obtained by executing the program $C$.
3.2. VisPhyWorld Architecture
$(I_{\text{start}}, I_{\text{later}}, D) \xrightarrow{f_{\text{LLM}}} (A, S, C)$ $R_{\text{phys}}$…
Figure 5: This case shows that VisPhyWorld exhibits strong physical grounding, correctly simulating the collision dynamics. More examples are in the Appendix.
Figure 6: GPT-5 reconstructs object identities and collision dynamics most faithfully over time. Pixel-space baselines (Veo-3.1 and SVD/img2vid) generate trajectories with implausible motion/contact events due to the lack of an explicit physics hypothesis.
… precise physical reasoning. GPT-5 in Three.js shows strong physical grounding by correctly simulating the collision dynamics, achieving Gemini 10.0 with DINO 0.…
Figure 8: A detailed case study (ID 2).
Figure 9: A detailed case study (ID 3).
Figure 10: A detailed case study (ID 4).
Figure 11: A detailed case study (ID 5).
Figure 12: A detailed case study (ID 6).
Figure 13: A detailed case study (ID 7).
Figure 14: A detailed case study (ID 8).
Figure 15: Full multimodal LLM prompt template used by VisPhyWorld for both motion analysis and code generation.
B.2. Detection Context D
To reduce ambiguity in object discovery and initialization, VisPhyWorld can optionally provide a structured detection context D for the first keyframe $I_{\text{start}}$. D is a per-sample JSON annotation containing a list of objects with coarse geometry and appearance attributes. All coord…
Figure 16: 3D prompt variant used for dataset 3D. It mirrors …
Figure 17: High-level deterministic rendering protocol used in VisPhyWorld. Low-level implementation details are included in the released codebase.
B.5. Robustness: Automatic Retry and Fallback
To handle syntax errors or runtime exceptions in model-generated programs, we implement a lightweight robustness protocol that ensures evaluation is well-defined for all samples. Error-conditioned single-step repair. If the i…
Figure 18: High-level structure of the fallback template used when both model attempts fail.
Figure 19: Temporal alignment procedure used before computing frame-wise metrics.
Gemini-based Physics & Video Consistency Prompt
You are an expert evaluator of physical simulations and video quality. Compare the provided reference video (Ground Truth) with the generated video. Your goal is to determine if the generated video accurately reconstructs the physical event shown in the reference. Focus on the following d…
Figure 20: Prompt template used for Gemini-based evaluation. Note that the prompt is explicitly designed to penalize physical violations (e.g., incorrect collision logic), ensuring the score reflects physical understanding rather than just perceptual similarity.
Figure 21: Per-scene boxplot distributions on VisPhyBench for representative metric families (higher is better unless marked ↓).
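The retry-and-fallback behaviour excerpted under Figures 17 and 18 (an error-conditioned single-step repair, then a fallback template when both model attempts fail) can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the paper's programs target Three.js/P5.js, whereas this example runs Python strings, and `execute`, `run_with_repair`, and the `frames` convention are hypothetical.

```python
def execute(source):
    """Run a program and return the frames it defines; raises on syntax or runtime errors."""
    namespace = {}
    exec(source, namespace)
    return namespace["frames"]


def run_with_repair(generate, repair, fallback_source):
    """Try the first program, then one error-conditioned repair, then the fallback."""
    source = generate()
    try:
        return execute(source), "first_attempt"
    except Exception as err:
        # Single repair step: the model sees its own program and the error message.
        repaired = repair(source, repr(err))
    try:
        return execute(repaired), "repaired"
    except Exception:
        # Both model attempts failed, so a fixed template keeps evaluation well-defined.
        return execute(fallback_source), "fallback"


FALLBACK = "frames = [0.0, 0.0, 0.0]"
broken = "frames = [0.0, 1.0"           # syntax error: unclosed bracket
fixed = "frames = [0.0, 1.0, 2.0]"

# Case 1: the repair step succeeds.
frames, status = run_with_repair(lambda: broken, lambda src, err: fixed, FALLBACK)
# Case 2: the repair step fails too, so the fallback template is used.
frames_fb, status_fb = run_with_repair(lambda: broken, lambda src, err: broken, FALLBACK)
print(status, status_fb)
```

Returning the status alongside the frames makes it possible to report a validity rate "before fallback" separately from the final scores, which is how the abstract phrases its 97.7% figure.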
Original abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces VisPhyWorld, an execution-based evaluation framework for physical reasoning in MLLMs that requires models to generate executable simulator code from visual observations of scenes. It presents VisPhyBench, consisting of 209 evaluation scenes derived from 108 physical templates, along with a protocol that measures appearance reconstruction and physical motion plausibility. The work reports that the pipeline yields valid reconstructed videos in 97.7% of runs before fallback, and that state-of-the-art MLLMs exhibit strong semantic scene understanding yet struggle to infer accurate physical parameters and to produce consistent physical dynamics.

Significance. If the results hold, the code-driven approach supplies a more direct, inspectable, and falsifiable test of committed physical hypotheses than standard VQA or VoE protocols, which can often be solved without explicit world models. The systematic template-based benchmark construction and high validity rate are strengths that could support reproducible progress on physical commonsense evaluation in multimodal models.

major comments (1)
  1. [Evaluation Protocol and Experiments] The central claim attributes performance gaps specifically to deficits in inferring physical parameters and simulating dynamics. However, the evaluation protocol requires models to emit syntactically valid, engine-compatible simulator code (bodies, forces, integrators, etc.). Without a control condition that supplies textual scene specifications rather than visual input, it is impossible to separate failures due to physical inference from failures due to incomplete knowledge of the simulator API and calling conventions. This entanglement is load-bearing for the attribution in the abstract and experimental claims.
minor comments (1)
  1. [Abstract] The abstract states that the code is available at the GitHub link; the repository should explicitly include the full set of 209 scenes, the 108 templates, and the exact scoring scripts used for the 97.7% validity figure to enable independent verification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive comments on our work. We address the major comment below and describe the revisions we intend to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Evaluation Protocol and Experiments] The central claim attributes performance gaps specifically to deficits in inferring physical parameters and simulating dynamics. However, the evaluation protocol requires models to emit syntactically valid, engine-compatible simulator code (bodies, forces, integrators, etc.). Without a control condition that supplies textual scene specifications rather than visual input, it is impossible to separate failures due to physical inference from failures due to incomplete knowledge of the simulator API and calling conventions. This entanglement is load-bearing for the attribution in the abstract and experimental claims.

    Authors: We appreciate the referee pointing out this potential confounding factor. Our current protocol provides models with detailed textual descriptions of the simulator API, including examples of body definitions, force applications, and integrator usage, in addition to the visual input. The fact that 97.7% of generated programs are syntactically valid and engine-compatible indicates that the models largely understand the API conventions. Nevertheless, we agree that an explicit control condition with textual scene specifications (e.g., providing object types, positions, and initial velocities in text) would help disentangle visual perception issues from physical reasoning deficits. In the revised version, we will conduct and report such a control experiment. We will update the relevant sections, including the abstract, to clarify the scope of our claims based on these additional results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained.

full rationale

The paper introduces VisPhyWorld and VisPhyBench as a new execution-based evaluation framework and dataset for MLLM physical reasoning via code generation. No mathematical derivations, equations, or first-principles predictions appear in the provided text. Central claims rest on direct empirical runs (97.7% validity rate, model comparisons) against externally defined scenes and templates rather than any reduction to fitted parameters, self-definitions, or self-citation chains. The approach is independent of its own outputs and does not rename or smuggle prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis relies on the domain assumption that physical templates yield representative scenes and that code executability equates to testable physical reasoning; no free parameters or invented entities are apparent from the abstract.

axioms (1)
  • domain assumption: Physical templates can generate representative scenes that test genuine physical reasoning when models produce executable code.
    Benchmark comprises 209 scenes derived from 108 physical templates.

pith-pipeline@v0.9.0 · 5735 in / 1209 out tokens · 61515 ms · 2026-05-22T11:03:22.095760+00:00 · methodology



Appendix A. Case Study

Case Study 2. GT Analysis: The scene consists of a blank, white background with no visible ground line, platforms, walls, or other static supports. There are no ramps, pegs, or obstacles; the space appears open and unob...

… and (ii) paired bootstrap confidence intervals over per-scene differences (Table 9). We use paired resampling because all methods are evaluated on the same set of scenes (N = 209), and define "mean improvement" so that positive values indicate better performance by VisPhyWorld (GPT-5, threejs), taking metric direction into account (↑/↓). ...
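The paired bootstrap over per-scene differences described above can be sketched as follows. This is a minimal percentile-bootstrap sketch under stated assumptions: the function name, the resample count, and the 95% level are illustrative choices, not details confirmed by the paper, and the scores in the example are synthetic.

```python
import random


def paired_bootstrap_ci(ours, baseline, higher_is_better=True,
                        n_boot=2000, alpha=0.05, seed=0):
    """Mean improvement and percentile CI from resampling per-scene differences."""
    if len(ours) != len(baseline):
        raise ValueError("paired resampling needs one score per scene for each method")
    # Flip the sign for lower-is-better metrics so positive always means "ours is better".
    sign = 1.0 if higher_is_better else -1.0
    diffs = [sign * (a - b) for a, b in zip(ours, baseline)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample scenes with replacement, keeping each scene's pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lower, upper


# Synthetic scores for three scenes (not results from the paper).
mean_up, lo_up, hi_up = paired_bootstrap_ci([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
mean_down, lo_down, hi_down = paired_bootstrap_ci(
    [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], higher_is_better=False
)
print(mean_up, lo_up, hi_up)
```

Resampling the differences rather than each method's scores independently is what makes the interval "paired": scene-level difficulty cancels out because both methods are always compared on the same scenes.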