VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang; Ka-Hei Hui; Max Ku; Ping Nie; Wenhu Chen

arxiv: 2602.13294 · v3 · pith:46QXRGDRnew · submitted 2026-02-09 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 11:03 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords: physical reasoning · multimodal large language models · video reconstruction · simulator code generation · physical dynamics · benchmark evaluation · VisPhyBench

The pith

Multimodal models generate semantic scene descriptions but fail to infer precise physical parameters or produce consistent dynamics when forced to output executable simulator code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VisPhyWorld, a new evaluation framework that tests physical reasoning in multimodal large language models by requiring them to produce runnable simulator code from observed videos rather than answering questions. This code can be executed to reconstruct both appearance and motion, making any physical hypotheses directly inspectable, editable, and falsifiable. The authors accompany the framework with VisPhyBench, a collection of 209 scenes drawn from 108 physical templates, and a protocol that measures how well models recover visual details and physically plausible behavior. Experiments demonstrate that leading models handle semantic understanding adequately yet consistently err when estimating parameters such as mass, friction, or elasticity and when maintaining coherent trajectories over time.

Core claim

Requiring models to emit executable simulator code from visual input separates physical reasoning from rendering and reveals that current state-of-the-art MLLMs, while competent at semantic scene understanding, cannot reliably recover accurate physical parameters or generate dynamics that remain consistent with Newtonian principles across frames.

What carries the argument

A code-driven video reconstruction pipeline: the simulator script a model generates is executed, and the resulting video is compared against the ground-truth appearance and motion.
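
To make the idea concrete, here is a minimal sketch of why executable code turns a physical guess into something that can be checked. Everything in it is hypothetical: the 1D bouncing-ball simulator, the names `simulate_bounce` and `trajectory_error`, and the parameter values are illustrative stand-ins, not the paper's Three.js/P5.js backends or its metrics.

```python
def simulate_bounce(height, restitution, gravity=9.81, dt=0.01, steps=300):
    """Toy 1D ball drop with explicit Euler steps (hypothetical stand-in simulator)."""
    y, v = height, 0.0
    trajectory = []
    for _ in range(steps):
        v -= gravity * dt
        y += v * dt
        if y < 0.0:
            # Ground contact: mirror the position and damp the velocity.
            y = -y * restitution
            v = -v * restitution
        trajectory.append(y)
    return trajectory


def trajectory_error(predicted, reference):
    """Mean absolute height difference per frame."""
    assert len(predicted) == len(reference)
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


# The reference rollout plays the role of the observed video.
reference = simulate_bounce(height=2.0, restitution=0.8)

# Two candidate hypotheses a model might commit to in code.
good_guess = simulate_bounce(height=2.0, restitution=0.8)
bad_guess = simulate_bounce(height=2.0, restitution=0.3)

error_good = trajectory_error(good_guess, reference)
error_bad = trajectory_error(bad_guess, reference)
print(error_good, error_bad)
```

Because the restitution value is an explicit argument, a wrong guess shows up as a measurable trajectory error, and the guess can be edited and re-run without touching anything else.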

If this is right

Models must output explicit, testable physical hypotheses rather than vague descriptions.
Evaluation becomes independent of rendering quality because the simulator code itself is inspected.
The benchmark supplies quantitative scores for both appearance fidelity and physical plausibility of reconstructed motion.
Current top models achieve high success rates on semantic tasks yet low accuracy on parameter estimation and long-term dynamics consistency.
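
As a rough illustration of a motion score, the sketch below computes an endpoint error (EPE): the mean Euclidean distance between corresponding optical-flow vectors. The page cites RAFT-EPE as the motion/physical plausibility metric, but how the paper extracts and aligns flow is not shown here, so the flat list-of-vectors representation and the function name `endpoint_error` are assumptions.

```python
import math


def endpoint_error(flow_a, flow_b):
    """Mean Euclidean distance between corresponding 2D flow vectors."""
    if len(flow_a) != len(flow_b) or not flow_a:
        raise ValueError("flow fields must be non-empty and the same size")
    total = 0.0
    for (ua, va), (ub, vb) in zip(flow_a, flow_b):
        total += math.hypot(ua - ub, va - vb)
    return total / len(flow_a)


# Made-up flow vectors; in practice these would come from an optical-flow model.
reference_flow = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
generated_flow = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]

epe = endpoint_error(reference_flow, generated_flow)
print(epe)  # (0 + 2 + 5) / 3
```

Lower is better for this kind of score: identical motion gives an error of zero.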

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework could be extended to interactive settings where models update code after observing simulated outcomes.
It offers a route to diagnose whether failures stem from missing physical priors or from inability to translate priors into executable form.
Similar code-generation probes might apply to other domains requiring explicit causal models, such as chemical reaction networks.

Load-bearing premise

That the ability to write and run correct simulator code directly measures committed physical reasoning instead of surface-level pattern matching from training data.

What would settle it

A controlled experiment in which models produce code that matches ground-truth dynamics on seen parameter values but systematically deviates on novel mass or friction values would falsify the claim that the method isolates genuine physical inference.
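
One way such an experiment could be scored is sketched below: tag each scene by whether its parameter values were seen or novel, then compare mean reconstruction error across the two splits. The record layout, the split labels, and the numbers are all made up for illustration and are not results from the paper.

```python
from statistics import mean


def generalization_gap(records):
    """Return (mean seen error, mean novel error, novel minus seen gap)."""
    seen = [r["error"] for r in records if r["split"] == "seen"]
    novel = [r["error"] for r in records if r["split"] == "novel"]
    if not seen or not novel:
        raise ValueError("need at least one record in each split")
    seen_mean, novel_mean = mean(seen), mean(novel)
    return seen_mean, novel_mean, novel_mean - seen_mean


# Hypothetical per-scene errors; not data from the paper.
records = [
    {"split": "seen", "error": 0.10},
    {"split": "seen", "error": 0.14},
    {"split": "novel", "error": 0.40},
    {"split": "novel", "error": 0.52},
]

seen_mean, novel_mean, gap = generalization_gap(records)
print(seen_mean, novel_mean, gap)
```

A large positive gap would point toward pattern matching on familiar parameter values; a gap near zero would be consistent with the method isolating physical inference.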

Figures

Figures reproduced from arXiv: 2602.13294 by Jiarong Liang, Ka-Hei Hui, Max Ku, Ping Nie, Wenhu Chen.

**Figure 1.** MLLMs struggle to simulate physical dynamics. Under the same inputs, code generated with rigid-body simulation backends (Three.js/P5.js) produces more physically consistent rollouts, whereas non-physics backends (SVG/Manim) often exhibit implausible motion or contact artifacts such as interpenetration.
… protocols often rely on recognition-based queries or surface-level judgments, which can obscure w…

**Figure 2.** Unlike traditional VQA paradigms, VisPhyWorld assesses physical understanding by requiring MLLMs to actively reconstruct scenes via executable code, which makes the model's reasoning more explainable.
VisPhyWorld probes the physical reasoning capabilities of MLLMs through visual-to-code reconstruction. Given two key frames (and optionally object detections), the model p…

**Figure 3.** VisPhyWorld Framework. (1) System & Data Construction: We process raw video sequences to extract key frames ($I_{\text{start}}$, $I_{\text{later}}$) and detection contexts using multimodal agents. (2) Pipeline & Simulation Flow: An LLM-based agent performs motion analysis and generates raw executable code, which is then sanitized and rendered. (3) Evaluation Benchmark: We propose a multi-metric benchmark integrating semantic and …

**Figure 4.** Key metrics on VisPhyBench. We compare code-driven reconstruction (multiple MLLMs) against pixel-space baselines (Veo 3.1 and SVD) under the unified evaluation protocol.
… layout and inferred parameters; (iii) an executable program $C \in Y_{\text{code}}$; and (iv) a rendered video $\hat{X} = (\hat{I}_t)_{t=1}^{\hat{T}}$ obtained by executing the executable program $C$. 3.2. VisPhyWorld Architecture: $(I_{\text{start}}, I_{\text{later}}, D) \xrightarrow{f_{\text{LLM}}} (A, S, C)$ $R_{\text{phys}}$…

**Figure 5.** This case shows that VisPhyWorld exhibits strong physical grounding, correctly simulating the collision dynamics. More examples are in the Appendix.

**Figure 6.** GPT-5 reconstructs object identities and collision dynamics most faithfully over time. Pixel-space baselines (Veo-3.1 and SVD/img2vid) generate trajectories with implausible motion/contact events due to the lack of an explicit physics hypothesis.
… precise physical reasoning. GPT-5 in Three.js shows strong physical grounding by correctly simulating the collision dynamics, achieving Gemini 10.0 with DINO 0.…

**Figure 8.** A detailed case study (ID 2).

**Figure 9.** A detailed case study (ID 3).

**Figure 10.** A detailed case study (ID 4).

**Figure 11.** A detailed case study (ID 5).

**Figure 12.** A detailed case study (ID 6).

**Figure 13.** A detailed case study (ID 7).

**Figure 14.** A detailed case study (ID 8).

**Figure 15.** Full multimodal LLM prompt template used by VisPhyWorld for both motion analysis and code generation.
B.2. Detection Context $D$. To reduce ambiguity in object discovery and initialization, VisPhyWorld can optionally provide a structured detection context $D$ for the first keyframe $I_{\text{start}}$. $D$ is a per-sample JSON annotation containing a list of objects with coarse geometry and appearance attributes. All coord…

**Figure 16.** 3D prompt variant used for dataset 3D. It mirrors …

**Figure 17.** High-level deterministic rendering protocol used in VisPhyWorld. Low-level implementation details are included in the released codebase.
B.5. Robustness: Automatic Retry and Fallback. To handle syntax errors or runtime exceptions in model-generated programs, we implement a lightweight robustness protocol that ensures evaluation is well-defined for all samples. Error-conditioned single-step repair. If the i…

**Figure 18.** High-level structure of the fallback template used when both model attempts fail.

**Figure 19.** Temporal alignment procedure used before computing frame-wise metrics.
Gemini-based Physics & Video Consistency Prompt: You are an expert evaluator of physical simulations and video quality. Compare the provided reference video (Ground Truth) with the generated video. Your goal is to determine if the generated video accurately reconstructs the physical event shown in the reference. Focus on the following d…

**Figure 20.** Prompt template used for Gemini-based evaluation. Note that the prompt is explicitly designed to penalize physical violations (e.g., incorrect collision logic), ensuring the score reflects physical understanding rather than just perceptual similarity.

**Figure 21.** Per-scene boxplot distributions on VisPhyBench for representative metric families (higher is better unless marked ↓).

Original abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld
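
The "before fallback" qualifier, together with the Figure 18 caption ("when both model attempts fail"), suggests a two-attempt protocol with a fallback template. The sketch below shows one way such a loop and the resulting validity rate could look. It is an assumption-laden toy: `run_with_repair`, `validity_before_fallback`, the use of Python's `exec` as the "simulator", and the choice to count a repaired second attempt as valid are all hypothetical, not the released implementation.

```python
def run_with_repair(generate, execute, fallback):
    """Try the model's program, allow one error-conditioned repair, then fall back.

    generate(error) -> program text; error is None on the first attempt.
    execute(program) -> frames; raises on syntax or runtime failure.
    Returns (frames, used_fallback).
    """
    error = None
    for _ in range(2):  # initial attempt plus one repair attempt
        program = generate(error)
        try:
            return execute(program), False
        except Exception as exc:  # broad on purpose: any failure triggers repair
            error = str(exc)
    return execute(fallback), True


def validity_before_fallback(outcomes):
    """Fraction of runs that produced output without the fallback template."""
    valid = sum(1 for _, used_fallback in outcomes if not used_fallback)
    return valid / len(outcomes)


def execute(program):
    """Toy executor: run the text as Python and read back a `frames` variable."""
    scope = {}
    exec(program, scope)
    return scope["frames"]


def flaky_model(error):
    # First attempt has a syntax error; the repair attempt is valid.
    return "frames = [0, 1, 2" if error is None else "frames = [0, 1, 2]"


def broken_model(error):
    # Fails on both attempts with a runtime error.
    return "frames = undefined_name"


FALLBACK = "frames = []"

outcomes = [
    run_with_repair(flaky_model, execute, FALLBACK),
    run_with_repair(broken_model, execute, FALLBACK),
]
rate = validity_before_fallback(outcomes)
print(outcomes, rate)
```

The fallback keeps every sample scorable, while the validity rate is computed only over runs that did not need it.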

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VisPhyWorld's code-generation framework for testing MLLM physical reasoning is a clear step past VQA/VoE but leaves physics and coding ability entangled.

The main thing to know is that this paper moves physical reasoning evaluation from passive recognition to requiring models to output runnable simulator code from video input. That forces an explicit, falsifiable commitment to parameters and dynamics rather than a description of what is seen. VisPhyBench, with 209 scenes drawn from 108 templates, and the 97.7% valid reconstruction rate show that the pipeline is technically workable and that current top MLLMs handle scene semantics better than accurate physical inference or consistent simulated motion. Releasing the code is also useful for anyone who wants to inspect or extend the setup. The approach improves on prior benchmarks by making the inferred world model editable and directly testable through execution.

The soft spot is exactly the one in the stress-test note. Success depends on both extracting physical quantities from pixels and emitting syntactically correct, engine-compatible code. Without a text-only baseline that gives scene specs directly, errors cannot be cleanly attributed to physics deficits rather than to gaps in knowing the simulator's object model or calling conventions. The abstract does not describe such a control, so the central claim about physical reasoning remains partly entangled.

This is aimed at people working on multimodal models, robotics simulation, or benchmark design for reasoning. A reader interested in new evaluation protocols would get concrete value from the benchmark construction and the reported gaps. It deserves peer review to tighten the experimental isolation and check the full methods and error analysis.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces VisPhyWorld, an execution-based evaluation framework for physical reasoning in MLLMs that requires models to generate executable simulator code from visual observations of scenes. It presents VisPhyBench, consisting of 209 evaluation scenes derived from 108 physical templates, along with a protocol that measures appearance reconstruction and physical motion plausibility. The work reports that the pipeline yields valid reconstructed videos in 97.7% of runs before fallback, and that state-of-the-art MLLMs exhibit strong semantic scene understanding yet struggle to infer accurate physical parameters and to produce consistent physical dynamics.

Significance. If the results hold, the code-driven approach supplies a more direct, inspectable, and falsifiable test of committed physical hypotheses than standard VQA or VoE protocols, which can often be solved without explicit world models. The systematic template-based benchmark construction and high validity rate are strengths that could support reproducible progress on physical commonsense evaluation in multimodal models.

major comments (1)

[Evaluation Protocol and Experiments] The central claim attributes performance gaps specifically to deficits in inferring physical parameters and simulating dynamics. However, the evaluation protocol requires models to emit syntactically valid, engine-compatible simulator code (bodies, forces, integrators, etc.). Without a control condition that supplies textual scene specifications rather than visual input, it is impossible to separate failures due to physical inference from failures due to incomplete knowledge of the simulator API and calling conventions. This entanglement is load-bearing for the attribution in the abstract and experimental claims.

minor comments (1)

[Abstract] The abstract states that the code is available at the GitHub link; the repository should explicitly include the full set of 209 scenes, the 108 templates, and the exact scoring scripts used for the 97.7% validity figure to enable independent verification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive comments on our work. We address the major comment below and describe the revisions we intend to make to strengthen the manuscript.

Referee: [Evaluation Protocol and Experiments] The central claim attributes performance gaps specifically to deficits in inferring physical parameters and simulating dynamics. However, the evaluation protocol requires models to emit syntactically valid, engine-compatible simulator code (bodies, forces, integrators, etc.). Without a control condition that supplies textual scene specifications rather than visual input, it is impossible to separate failures due to physical inference from failures due to incomplete knowledge of the simulator API and calling conventions. This entanglement is load-bearing for the attribution in the abstract and experimental claims.

Authors: We appreciate the referee pointing out this potential confounding factor. Our current protocol provides models with detailed textual descriptions of the simulator API, including examples of body definitions, force applications, and integrator usage, in addition to the visual input. The fact that 97.7% of generated programs are syntactically valid and engine-compatible indicates that the models largely understand the API conventions. Nevertheless, we agree that an explicit control condition with textual scene specifications (e.g., providing object types, positions, and initial velocities in text) would help disentangle visual perception issues from physical reasoning deficits. In the revised version, we will conduct and report such a control experiment. We will update the relevant sections, including the abstract, to clarify the scope of our claims based on these additional results.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained.

The paper introduces VisPhyWorld and VisPhyBench as a new execution-based evaluation framework and dataset for MLLM physical reasoning via code generation. No mathematical derivations, equations, or first-principles predictions appear in the provided text. Central claims rest on direct empirical runs (97.7% validity rate, model comparisons) against externally defined scenes and templates rather than any reduction to fitted parameters, self-definitions, or self-citation chains. The approach is independent of its own outputs and does not rename or smuggle prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Relies on domain assumption that physical templates yield representative scenes and that code executability equates to testable physical reasoning; no free parameters or invented entities apparent from abstract.

axioms (1)

domain assumption: Physical templates can generate representative scenes that test genuine physical reasoning when models produce executable code.
Benchmark comprises 209 scenes derived from 108 physical templates.

pith-pipeline@v0.9.0 · 5735 in / 1209 out tokens · 61515 ms · 2026-05-22T11:03:22.095760+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction) · tag: unclear
Paper passage: "We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations... Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics."

IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost, washburn_uniqueness_aczel) · tag: unclear
Paper passage: "Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs... motion / physical plausibility (RAFT-EPE)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


doi: 10.1145/3618316. Zhang, T., Kishore, V ., Wu, F., Weinberger, K. Q., and Artzi, Y . Bertscore: Evaluating text generation with bert,

work page doi:10.1145/3618316
[47]

BERTScore: Evaluating Text Generation with BERT

Appendix A. Case Study

Case Study 2

GT Analysis: The scene consists of a blank, white background with no visible ground line, platforms, walls, or other static supports. There are no ramps, pegs, or obstacles; the space appears open and unob...


and (ii) paired bootstrap confidence intervals over per-scene differences (Table 9). We use paired resampling because all methods are evaluated on the same set of scenes (N = 209), and define “mean improvement” so that positive values indicate better performance by VisPhyWorld (GPT-5, threejs), taking metric direction into account (↑/↓).
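The paired resampling described here can be sketched in a few lines. This is a minimal illustration, not the authors' code: the helper name `paired_bootstrap_ci`, the percentile interval, the 95% level, the number of resamples, and the toy scores are all assumptions made for the example. Only the pairing over scenes and the sign convention (positive means better, with lower-is-better metrics flipped) come from the text.

```python
import random
from statistics import mean


def paired_bootstrap_ci(ours, baseline, higher_is_better=True,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Mean improvement and percentile CI over per-scene differences.

    Hypothetical helper: the name, defaults, and percentile method are
    assumptions for illustration, not taken from the paper.
    """
    if len(ours) != len(baseline):
        raise ValueError("scores must be paired scene by scene")
    sign = 1.0 if higher_is_better else -1.0
    # After the sign flip, a positive difference always means 'ours' is better.
    diffs = [sign * (a - b) for a, b in zip(ours, baseline)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample per-scene differences with replacement, so pairs stay together.
    boot_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi


# Toy per-scene scores for a lower-is-better metric (made-up numbers).
ours = [0.20, 0.25, 0.18, 0.30, 0.22]
baseline = [0.28, 0.27, 0.25, 0.29, 0.31]
improvement, ci_low, ci_high = paired_bootstrap_ci(
    ours, baseline, higher_is_better=False
)
print(f"mean improvement = {improvement:.3f}, "
      f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Resampling the per-scene differences, rather than each method's scores independently, keeps scene difficulty matched between the two methods. That matching is the reason for using a paired design when every method is run on the same scenes.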
