NewtPhys: Do Foundation Models Understand Newtonian Physics?

Andrei Bursuc; Raoul de Charette; Sebastian Cavada; Soumava Paul; Tuan-Hung Vu

arxiv: 2606.03986 · v1 · pith:PFIGR4ZHnew · submitted 2026-06-02 · 💻 cs.CV

NewtPhys: Do Foundation Models Understand Newtonian Physics?

Sebastian Cavada , Soumava Paul , Tuan-Hung Vu , Andrei Bursuc , Raoul de Charette This is my paper

Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3

classification 💻 cs.CV

keywords physics reasoningvision-language modelsfoundation modelsNewtonian dynamicsbenchmark datasetmultiview scenesforce predictionper-pixel annotations

0 comments

The pith

A new dataset of real-world scenes with force annotations shows foundation models have limitations in low-level Newtonian physics reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates NewtPhys to move beyond synthetic scenes and high-level event questions by supplying dense 4D annotations of 3D forces and per-pixel physics quantities on multiview images of actual environments. It runs evaluations across 56 vision-language models and 10 vision foundation models to measure performance on tasks that require tracking forces and physical properties over time. The results indicate that existing models fall short on these low-level dynamics even when they handle simpler benchmarks. This setup is meant to support more realistic testing and training for physics-aware vision systems.

Core claim

NewtPhys supplies dense annotations of 3D forces and amodal per-pixel physics, tracking, semantics and geometry quantities across timesteps on real multiview scenes, and evaluation of 56 VLMs plus 10 VFMs on it demonstrates limitations in low-level Newtonian physics reasoning.

What carries the argument

NewtPhys, a 4D dataset built from multiview real-world images and physics-grounded simulations that supplies 3D forces and per-pixel annotations.

If this is right

High-level visual question answering benchmarks overestimate models' grasp of physics.
Dense real-scene annotations become necessary for future physics reasoning tests.
Vision foundation models require additional mechanisms to handle force prediction and temporal physical quantities.
The dataset opens the way for training loops that incorporate explicit physics signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Failure on these tasks suggests models may also lack reliable intuition for embodied tasks such as manipulation or navigation.
Comparing NewtPhys annotations against direct real-world force measurements could quantify any remaining simulation gaps.
Similar annotation pipelines could be applied to fluid or deformable-body scenarios to test broader physical understanding.

Load-bearing premise

The simulations that annotate the real scenes produce 3D forces and per-pixel quantities that match actual Newtonian dynamics without major gaps.

What would settle it

If a model achieves high accuracy on NewtPhys tasks that require predicting forces or per-pixel physics quantities across timesteps, that would indicate the claimed limitations do not hold.

Figures

Figures reproduced from arXiv: 2606.03986 by Andrei Bursuc, Raoul de Charette, Sebastian Cavada, Soumava Paul, Tuan-Hung Vu.

**Figure 1.** Figure 1: NewtPhys studies low-level physics understanding in vision-language models (VLMs) and vision foundation models (VFMs), revealing gaps in Newtonian reasoning. We introduce a physically annotated dataset spanning diverse scenes, objects (rigid and soft) and dynamic interactions captured from static or moving cameras. It provides 2D and 3D annotations in metric units, including materials, kinematics, dynamics… view at source ↗

**Figure 2.** Figure 2: Benchmarks for physical understanding. High-level physical understanding benchmarks (e.g., event ordering, general physics, frame reconstruction) are typically more realistic than low-level ones, which rely on toy simulators. In contrast, NewtPhys provides a realistic benchmark for both high- and low-level physical understanding in real-world scenes. To our knowledge, no benchmark provides dense, pixel-ali… view at source ↗

**Figure 3.** Figure 3: Dataset construction. We construct arbitrary scenarios from spawning up to ten objects (GSO [15]) into various scenes (DL3DV [30]). The resulting 3DGS primitives serve as simulatable particles in Simplicits physical simulator [37], which we highly customize to exhaustively capture Newtonian forces in time and space. Besides realistic renderings, the pipeline outputs 11 ground truth maps capturing pixel-lev… view at source ↗

**Figure 4.** Figure 4: Dataset statistics and VQA samples. Top: Distribution of collisions, material properties, object velocity, camera motion, and material types. Bottom: Examples of VQA tasks including spatial reasoning, mechanics, and material understanding. We provide additional VQA examples in Appendix Sec. C.2. tion questions; Spatial reasoning covers geometrical sensing of size, distances and general scene layout; Viewp… view at source ↗

**Figure 5.** Figure 5: Overall VQA performance. (a) Performance of the largest per family models. The open-source InternVL2.5 78B narrows the gap with closed-source models, which dominate the board, while all models show similar trends. Parentheses indicate average open-source performance. (b) Average performance of all 54 open-source models w.r.t. their overall rank, showing that some tasks are progressing more slowly. Note on … view at source ↗

**Figure 6.** Figure 6: Physics VQA and variations. (a) Subcategories performance reveals that material understanding performance is dominated by ‘material identification’ (42.6%) while properties estimation ranges much below (23.6%– 34.0%). (b) Varying object softness, we notice that VLMs perform better on soft objects. (c) Varying the number of objects shows that some VQAs are more stable than others. they instead perform wel… view at source ↗

**Figure 7.** Figure 7: Evaluation of LLM priors. We design experiments to assess whether and how VLMs rely on LLM prior knowledge. In (a) we report VLM performance in the degenerated VQA scenario where the region of interest is masked out. This sheds light on a handful of LLM-biased VLMs that benefit from not accessing visual data. In (b), various counterfactual scenarios are evaluated. Refer to the text for details. that softe… view at source ↗

**Figure 9.** Figure 9: Common sense correlation per benchmark.Our study reveals a striking evidence that existing benchmarks weakly correlate with physics, with MMVet and OCRBench having a Pearson coefficient as low as r = 0.1. (a) Spatial cues (b) Physical cues [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Effects of spatial and physical cues. For 31 video VLMs, we evaluate the effect of adding (a) spatial cues, which help some smaller models, or (b) physical cues, which strongly improve performance. VLMs still have to locate objects before estimating their physical properties. To ease this, we introduce spatial cues either in the visual form by circling the Region of Interest (ROI) [46] or in the textual f… view at source ↗

**Figure 11.** Figure 11: Experts to Novices prompting. Inspired by physics education research we evaluate 31 video VLMs with questions reformulated with different level of expertise, ranging from child (≈10yo) to expert (physicist) and report their performance w.r.t. the original NewtPhys VQA formulation. This showcases that NewtPhys aligns with graduate prompting and that most VLMs perform better with undergrad formulation. ...… view at source ↗

**Figure 12.** Figure 12: Robustness to intrinsic perturbations. We perturb (a) Young’s Modulus and (b) object scale and observe that VQA accuracy remains largely stable, indicating that our original intrinsics are coherent. change significantly under perturbation. A.3. Rendering At each time step, the Simplicits routine resolves all physics constraint with a Newton solver while saving only the resulting particles states. RGB ren… view at source ↗

**Figure 14.** Figure 14: Detailed datasets statistics. We report here additional dataset statistics, highlighting the variability of NewtPhys dataset along various axes of study [PITH_FULL_IMAGE:figures/full_fig_p013_14.png] view at source ↗

**Figure 15.** Figure 15: shows all the markers used in main paper’s plots and their corresponding models [PITH_FULL_IMAGE:figures/full_fig_p044_15.png] view at source ↗

read the original abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NewtPhys gives a real-image dataset with simulated 4D physics labels and a broad model evaluation, but the sim-to-real accuracy of those labels is unverified and carries the main claims.

read the letter

The main takeaway is a new dataset called NewtPhys that starts from real multiview scenes and layers on dense physics annotations via simulation, then tests 56 VLMs plus 10 VFMs for low-level Newtonian understanding.

What is actually new is the move to real imagery with fine-grained 4D labels for 3D forces, per-pixel physics quantities, tracking, semantics, and geometry across timesteps. Earlier benchmarks stayed mostly synthetic or used simpler VQA formats, so this combination is a clear step forward. They also release the data and code, which is useful.

The evaluation scale is decent and shows models struggling on the tasks, which aligns with the abstract's point about current limitations.

The soft spot is the missing validation for the annotations themselves. The whole argument that the benchmark reveals reasoning failures depends on the simulations producing faithful Newtonian quantities on real scenes. The abstract gives no error analysis, comparison to physical measurements, or handling of effects like friction and sensor noise. If those gaps exist, model errors could trace to annotation mismatch rather than physics understanding. That makes the central claim rest on an assumption that is not yet shown to hold.

This is for people working on physics-aware vision, robotics simulation, or foundation model benchmarks. A reader building or testing such models would get practical value from the released dataset if the labels check out.

It deserves peer review because the dataset construction is concrete and the evaluation is large enough to be worth examining, even if the methods section needs more on annotation fidelity.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces NewtPhys, a 4D dataset constructed from multiview images of real-world scenes augmented via physics-grounded simulations. It supplies dense, fine-grained annotations across timesteps, including 3D forces and amodal per-pixel quantities for physics, tracking, semantics, and geometry. The work benchmarks 56 VLMs (54 open-weight plus 2 closed-source frontier models) and 10 VFMs on low-level Newtonian physics reasoning tasks, reports limitations in current models, and releases the dataset and code publicly to support future physics-grounded vision research.

Significance. If the annotations are shown to be faithful, NewtPhys would provide a useful bridge between synthetic physics benchmarks and real visual complexity, enabling more diagnostic evaluation of physical reasoning in foundation models. The public release of data and code is a clear strength for reproducibility.

major comments (1)

[Abstract / Dataset Construction] Abstract and dataset construction section: the claim that NewtPhys reveals limitations in low-level physics reasoning depends on the simulations producing 3D forces and per-pixel quantities that faithfully match true Newtonian dynamics. The manuscript supplies no quantitative validation (error bounds, comparison to real measurements, or analysis of unmodeled effects such as friction or deformation), leaving open the possibility that observed model errors arise from annotation mismatch rather than reasoning failure. This is load-bearing for the central empirical conclusion.

minor comments (2)

[Experiments] Evaluation protocol: specify the exact question formats, answer formats, and scoring metrics used for the VLM and VFM benchmarks so that the reported limitations can be directly reproduced.
[Figures] Figure and table captions: ensure all annotation examples include explicit scale bars or units for the physics quantities shown.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on NewtPhys. We address the major comment below.

read point-by-point responses

Referee: [Abstract / Dataset Construction] Abstract and dataset construction section: the claim that NewtPhys reveals limitations in low-level physics reasoning depends on the simulations producing 3D forces and per-pixel quantities that faithfully match true Newtonian dynamics. The manuscript supplies no quantitative validation (error bounds, comparison to real measurements, or analysis of unmodeled effects such as friction or deformation), leaving open the possibility that observed model errors arise from annotation mismatch rather than reasoning failure. This is load-bearing for the central empirical conclusion.

Authors: We agree that the fidelity of the physics annotations is central to our conclusions and that the initial manuscript provides no quantitative validation such as error bounds, real-measurement comparisons, or explicit discussion of unmodeled effects. The simulations apply Newtonian mechanics to real multiview geometry via established physics engines, but this does not substitute for direct validation. In the revised manuscript we will add a validation subsection in the dataset construction section that reports available trajectory and force consistency checks, quantifies simulation-to-observation discrepancies where measurable, and discusses limitations arising from friction, deformation, and other unmodeled phenomena. This will clarify whether model errors stem from reasoning shortfalls or annotation mismatch. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and model evaluation with no derivation chain

full rationale

The paper presents NewtPhys as a 4D dataset constructed from multiview real-world images annotated via physics-grounded simulations, followed by empirical benchmarking of 56 VLMs and 10 VFMs. No mathematical derivation, first-principles prediction, parameter fitting, or uniqueness theorem is claimed or present. The central contribution is dataset release and evaluation results; the simulation-to-reality gap noted by the skeptic is a validity assumption, not a reduction of any result to its own inputs by construction. No self-citations, ansatzes, or renamings of known results appear in a load-bearing role for any derivation. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, new axioms, or invented entities are introduced; the work relies on existing simulation software and standard VLM evaluation practices.

pith-pipeline@v0.9.1-grok · 5711 in / 1057 out tokens · 73492 ms · 2026-06-28T10:17:22.021081+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

66 extracted references · 8 canonical work pages · 5 internal anchors

[1]

Physical reasoning in young infants: Seeking explanations for impossible events.British Journal of Developmental Psychology, 1994

Renée Baillargeon. Physical reasoning in young infants: Seeking explanations for impossible events.British Journal of Developmental Psychology, 1994. 1, 2, 9

1994
[2]

Battaglia, Jessica B

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding.Proceedings of the National Academy of Sciences, 2013. 2

2013
[3]

Relational inductive biases, deep learning, and graph networks

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive bi- ases, deep learning, and graph networks.arXiv preprint arXiv:1806.01261, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018
[4]

Physion: Evaluating physical prediction from vision in humans and machines

Daniel Bear, Elias Wang, Damian Mrowca, Felix Je- didja Binder, Hsiao-Yu Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin A Smith, Fan-Yun Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines. InNeurIPS, 2021. 2

2021
[5]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InAAAI, 2020. 5

2020
[7]

The origin of concepts.Journal of Cog- nition and Development, 2000

Susan Carey. The origin of concepts.Journal of Cog- nition and Development, 2000. 1, 2

2000
[8]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021. 5, 10

2021
[9]

Are we on the right way for evaluating large vision-language models? In NIPS, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? In NIPS, 2024. 7

2024
[10]

Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks. InCVPR, 2024. 5

2024
[11]

Categorization and representation of physics problems by experts and novices.Cognitive science, 1981

Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices.Cognitive science, 1981. 9

1981
[12]

Physbench: Benchmark- ing and enhancing vision-language models for physical world understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmark- ing and enhancing vision-language models for physical world understanding. InICLR, 2025. 2, 5

2025
[13]

Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. InCVPR, 2024. 5

2024
[14]

Pin: Positional insert unlocks object localisation abilities in vlms

Michael Dorkenwald, Nimrod Barazani, Cees GM Snoek, and Yuki M Asano. Pin: Positional insert unlocks object localisation abilities in vlms. InCVPR,
[15]

Google scanned objects: A high-quality dataset of 3d scanned household items

Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. InICRA, 2022. 3, 4, 12

2022
[16]

Probing the 3d awareness of visual foundation models

Mohamed El Banani, Amit Raj, Kevis-Kokitsi Mani- nis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. InCVPR, 2024. 2

2024
[17]

Kaolin: A pytorch library for accelerating 3d deep learning research

Clement Fuji Tsang, Maria Shugrina, Jean Fran- cois Lafleche, Or Perel, Charles Loop, Towaki Takikawa, Vismay Modi, Alexander Zook, Jiehan Wang, Wenzheng Chen, Tianchang Shen, Jun Gao, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebare- dian. Kaolin: ...

2024
[18]

Gemini 3.1

Google DeepMind. Gemini 3.1. https://deepmind. google, 2026. Proprietary model; no public technical report. 5

2026
[19]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InCVPR,
[20]

Sugar: Surface- aligned gaussian splatting for efficient 3d mesh recon- struction and high-quality mesh rendering.CVPR,

Antoine Guédon and Vincent Lepetit. Sugar: Surface- aligned gaussian splatting for efficient 3d mesh recon- struction and high-quality mesh rendering.CVPR,
[21]

Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.TOG,

Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, and Maks Ovs- janikov. Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.TOG,
[22]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024. 45

2024
[23]

Position: The platonic representation hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. InICML, 2024. 8

2024
[24]

Phillip Isola, Daniel Zoran, Dilip Krishnan, and Ed- ward H. Adelson. Learning visual groups from co- occurrences in space and time, 2015. 2

2015
[25]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Min- joon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InECCV, 2016. 7

2016
[26]

3d gaussian splatting for real-time radiance field rendering.ACM TG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimküh- ler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TG, 2023. 2, 4, 12

2023
[27]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In ICCV, 2023. 10

2023
[28]

Pisa experiments: Ex- ploring physics post-training for video diffusion models by watching stuff drop

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie. Pisa experiments: Ex- ploring physics post-training for video diffusion models by watching stuff drop. InICML, 2025. 2

2025
[29]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 3, 4

2024
[31]

Visual instruction tuning

HaotianLiu, ChunyuanLi, QingyangWu, andYongJae Lee. Visual instruction tuning. InNeurIPS, 2023. 5

2023
[32]

Mmbench: Is your multi-modal model an all- around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all- around player? InECCV, 2024. 7

2024
[33]

Ocrbench: on the hidden mystery of ocr in large multimodal models

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. SCIS, 2024. 7, 8

2024
[34]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhu- oshu Li, Hao Yang, et al. Deepseek-vl: towards real- world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

Mathvista: Evaluating mathematical reasoning of foundation mod- els in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun- yuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation mod- els in visual contexts. InICLR, 2024. 7

2024
[36]

Intuitive physics: the straight-down belief and its origin.Journal of Experimental Psychology: Learning, Memory, and Cognition, 1983

Michael McCloskey, Allyson Washburn, and Linda Felch. Intuitive physics: the straight-down belief and its origin.Journal of Experimental Psychology: Learning, Memory, and Cognition, 1983. 1, 2

1983
[37]

Simplicits: Mesh-free, geometry- agnostic elastic simulation.ACM TOG, 2024

Vismay Modi, Nicholas Sharp, Or Perel, Shinjiro Sueda, and David IW Levin. Simplicits: Mesh-free, geometry- agnostic elastic simulation.ACM TOG, 2024. 3, 4, 11, 12

2024
[38]

Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video mod- els understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

OpenAI. Gpt-5.5. https://openai.com, 2026. Propri- etary model; no public technical report. 5

2026
[40]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InICLR,
[41]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 5, 10

2021
[42]

Towards robust monoc- ular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.TPAMI, 2022

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monoc- ular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.TPAMI, 2022. 10

2022
[43]

Am-radio: Agglomerative vision founda- tion model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision founda- tion model reduce all domains into one. InCVPR, 2024. 10

2024
[44]

Intphys 2019: A benchmark for visual intuitive physics understanding.TPAMI,

Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys 2019: A benchmark for visual intuitive physics understanding.TPAMI,

2019
[45]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 3

2016
[46]

What does clip know about a red circle? visual prompt engineering for vlms

Aleksandar Shtedritski, Christian Rupprecht, and An- drea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. InICCV, 2023. 8

2023
[47]

Alpha-clip: A clip model focusing on wherever you want

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. InCVPR, 2024. 8

2024
[48]

Tenenbaum, Charles Kemp, Thomas L

Joshua B. Tenenbaum, Charles Kemp, Thomas L. Grif- fiths, and Noah D. Goodman. How to grow a mind: Statistics, structure, and abstraction.Science, 2011. 2

2011
[49]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InNeurIPS, 2024. 5

2024
[50]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Fran- cisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. InICML, 2021. 10

2021
[51]

Physion++: Evaluating physical scene understanding that requires online infer- ence of different physical properties

Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Ju- dith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online infer- ence of different physical properties. InNeurIPS, 2023. 2

2023
[52]

Least-squares estimation of trans- formation parameters between two point patterns

Shinji Umeyama. Least-squares estimation of trans- formation parameters between two point patterns. TPAMI, 2002. 3

2002
[53]

Vggt: Visualgeometrygroundedtransformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, An- drea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visualgeometrygroundedtransformer. InCVPR,
[54]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022. 7

2022
[55]

Galileo: Perceiving physical object properties by integrating a physics engine with deep learning

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. InNeuRIPS, 2015. 2

2015
[56]

Physics 101: Learning physical object properties from unlabeled videos

Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. InBMVC, 2016. 2

2016
[57]

Point-it-out: Benchmarking embodied reasoning for vision language models in multi-stage visual grounding.arXiv preprint arXiv:2509.25794,

Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming- Yu Liu, Yongxin Chen, and Jiaojiao Fan. Point-it-out: Benchmarking embodied reasoning for vision language models in multi-stage visual grounding.arXiv preprint arXiv:2509.25794, 2025. 8

work page arXiv 2025
[58]

Clevrer: Collision events for video representation and reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Ji- ajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. InICLR, 2020. 2

2020
[59]

Mm-vet: Evaluating large multimodal models for integrated capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. InICML, 2024. 7, 8

2024
[60]

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zee- shan Nadir, Bole Ma, and Stanley H. Chan. New- tonGen: Physics-consistent and controllable text-to- video generation via neural newtonian dynamics.arXiv preprint arXiv: 2509.21309, 2025. 2

work page arXiv 2025
[61]

Mmmu: A massive multi- discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi- discipline multimodal understanding and reasoning benchmark for ex...

2024
[62]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023. 10

2023
[63]

A general protocol to probe large vision models for 3d physical understanding

Guanqi Zhan, Chuanxia Zheng, Weidi Xie, and Andrew Zisserman. A general protocol to probe large vision models for 3d physical understanding. InNeurIPS,
[64]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv, 2025. 2

2025
[65]

Phystoolbench: Benchmarking physical tool understanding for mllms.arXiv preprint arXiv:2510.09507, 2025

Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, and Ying-Cong Chen. Phystoolbench: Bench- marking physical tool understanding for mllms.arXiv preprint arXiv:2510.09507, 2025. 2

work page arXiv 2025
[66]

Under- standing tools: Task-oriented object modeling, learning and recognition

Yixin Zhu, Yibiao Zhao, and Song Chun Zhu. Under- standing tools: Task-oriented object modeling, learning and recognition. InCVPR, 2015. 2

2015

[1] [1]

Physical reasoning in young infants: Seeking explanations for impossible events.British Journal of Developmental Psychology, 1994

Renée Baillargeon. Physical reasoning in young infants: Seeking explanations for impossible events.British Journal of Developmental Psychology, 1994. 1, 2, 9

1994

[2] [2]

Battaglia, Jessica B

Peter W. Battaglia, Jessica B. Hamrick, and Joshua B. Tenenbaum. Simulation as an engine of physical scene understanding.Proceedings of the National Academy of Sciences, 2013. 2

2013

[3] [3]

Relational inductive biases, deep learning, and graph networks

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive bi- ases, deep learning, and graph networks.arXiv preprint arXiv:1806.01261, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018

[4] [4]

Physion: Evaluating physical prediction from vision in humans and machines

Daniel Bear, Elias Wang, Damian Mrowca, Felix Je- didja Binder, Hsiao-Yu Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin A Smith, Fan-Yun Sun, et al. Physion: Evaluating physical prediction from vision in humans and machines. InNeurIPS, 2021. 2

2021

[5] [5]

PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InAAAI, 2020. 5

2020

[7] [7]

The origin of concepts.Journal of Cog- nition and Development, 2000

Susan Carey. The origin of concepts.Journal of Cog- nition and Development, 2000. 1, 2

2000

[8] [8]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV, 2021. 5, 10

2021

[9] [9]

Are we on the right way for evaluating large vision-language models? In NIPS, 2024

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, and Feng Zhao. Are we on the right way for evaluating large vision-language models? In NIPS, 2024. 7

2024

[10] [10]

Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vi- sion foundation models and aligning for generic visual- linguistic tasks. InCVPR, 2024. 5

2024

[11] [11]

Categorization and representation of physics problems by experts and novices.Cognitive science, 1981

Michelene TH Chi, Paul J Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices.Cognitive science, 1981. 9

1981

[12] [12]

Physbench: Benchmark- ing and enhancing vision-language models for physical world understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, and Yue Wang. Physbench: Benchmark- ing and enhancing vision-language models for physical world understanding. InICLR, 2025. 2, 5

2025

[13] [13]

Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. InCVPR, 2024. 5

2024

[14] [14]

Pin: Positional insert unlocks object localisation abilities in vlms

Michael Dorkenwald, Nimrod Barazani, Cees GM Snoek, and Yuki M Asano. Pin: Positional insert unlocks object localisation abilities in vlms. InCVPR,

[15] [15]

Google scanned objects: A high-quality dataset of 3d scanned household items

Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. InICRA, 2022. 3, 4, 12

2022

[16] [16]

Probing the 3d awareness of visual foundation models

Mohamed El Banani, Amit Raj, Kevis-Kokitsi Mani- nis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. InCVPR, 2024. 2

2024

[17] [17]

Kaolin: A pytorch library for accelerating 3d deep learning research

Clement Fuji Tsang, Maria Shugrina, Jean Fran- cois Lafleche, Or Perel, Charles Loop, Towaki Takikawa, Vismay Modi, Alexander Zook, Jiehan Wang, Wenzheng Chen, Tianchang Shen, Jun Gao, Krishna Murthy Jatavallabhula, Edward Smith, Artem Rozantsev, Sanja Fidler, Gavriel State, Jason Gorski, Tommy Xiang, Jianing Li, Michael Li, and Rev Lebare- dian. Kaolin: ...

2024

[18] [18]

Gemini 3.1

Google DeepMind. Gemini 3.1. https://deepmind. google, 2026. Proprietary model; no public technical report. 5

2026

[19] [19]

Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. InCVPR,

[20] [20]

Sugar: Surface- aligned gaussian splatting for efficient 3d mesh recon- struction and high-quality mesh rendering.CVPR,

Antoine Guédon and Vincent Lepetit. Sugar: Surface- aligned gaussian splatting for efficient 3d mesh recon- struction and high-quality mesh rendering.CVPR,

[21] [21]

Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.TOG,

Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, and Maks Ovs- janikov. Milo: Mesh-in-the-loop gaussian splatting for detailed and efficient surface reconstruction.TOG,

[22] [22]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InCVPR, 2024. 45

2024

[23] [23]

Position: The platonic representation hypothesis

Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. InICML, 2024. 8

2024

[24] [24]

Phillip Isola, Daniel Zoran, Dilip Krishnan, and Ed- ward H. Adelson. Learning visual groups from co- occurrences in space and time, 2015. 2

2015

[25] [25]

A diagram is worth a dozen images

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Min- joon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. InECCV, 2016. 7

2016

[26] [26]

3d gaussian splatting for real-time radiance field rendering.ACM TG, 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimküh- ler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TG, 2023. 2, 4, 12

2023

[27] [27]

Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment anything. In ICCV, 2023. 10

2023

[28] [28]

Pisa experiments: Ex- ploring physics post-training for video diffusion models by watching stuff drop

Chenyu Li, Oscar Michel, Xichen Pan, Sainan Liu, Mike Roberts, and Saining Xie. Pisa experiments: Ex- ploring physics post-training for video diffusion models by watching stuff drop. InICML, 2025. 2

2025

[29] [29]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 3, 4

2024

[31] [31]

Visual instruction tuning

HaotianLiu, ChunyuanLi, QingyangWu, andYongJae Lee. Visual instruction tuning. InNeurIPS, 2023. 5

2023

[32] [32]

Mmbench: Is your multi-modal model an all- around player? InECCV, 2024

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all- around player? InECCV, 2024. 7

2024

[33] [33]

Ocrbench: on the hidden mystery of ocr in large multimodal models

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models. SCIS, 2024. 7, 8

2024

[34] [34]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhu- oshu Li, Hao Yang, et al. Deepseek-vl: towards real- world vision-language understanding.arXiv preprint arXiv:2403.05525, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

Mathvista: Evaluating mathematical reasoning of foundation mod- els in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun- yuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation mod- els in visual contexts. InICLR, 2024. 7

2024

[36] [36]

Intuitive physics: the straight-down belief and its origin.Journal of Experimental Psychology: Learning, Memory, and Cognition, 1983

Michael McCloskey, Allyson Washburn, and Linda Felch. Intuitive physics: the straight-down belief and its origin.Journal of Experimental Psychology: Learning, Memory, and Cognition, 1983. 1, 2

1983

[37] [37]

Simplicits: Mesh-free, geometry- agnostic elastic simulation.ACM TOG, 2024

Vismay Modi, Nicholas Sharp, Or Perel, Shinjiro Sueda, and David IW Levin. Simplicits: Mesh-free, geometry- agnostic elastic simulation.ACM TOG, 2024. 3, 4, 11, 12

2024

[38] [38]

Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video mod- els understand physical principles?arXiv preprint arXiv:2501.09038, 2025. 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [39]

OpenAI. Gpt-5.5. https://openai.com, 2026. Propri- etary model; no public technical report. 5

2026

[40] [40]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. InICLR,

[41] [41]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 5, 10

2021

[42] [42]

Towards robust monoc- ular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.TPAMI, 2022

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monoc- ular depth estimation: Mixing datasets for zero-shot cross-dataset transfer.TPAMI, 2022. 10

2022

[43] [43]

Am-radio: Agglomerative vision founda- tion model reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. Am-radio: Agglomerative vision founda- tion model reduce all domains into one. InCVPR, 2024. 10

2024

[44] [44]

Intphys 2019: A benchmark for visual intuitive physics understanding.TPAMI,

Ronan Riochet, Mario Ynocente Castro, Mathieu Bernard, Adam Lerer, Rob Fergus, Véronique Izard, and Emmanuel Dupoux. Intphys 2019: A benchmark for visual intuitive physics understanding.TPAMI,

2019

[45] [45]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016. 3

2016

[46] [46]

What does clip know about a red circle? visual prompt engineering for vlms

Aleksandar Shtedritski, Christian Rupprecht, and An- drea Vedaldi. What does clip know about a red circle? visual prompt engineering for vlms. InICCV, 2023. 8

2023

[47] [47]

Alpha-clip: A clip model focusing on wherever you want

Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. InCVPR, 2024. 8

2024

[48] [48]

Tenenbaum, Charles Kemp, Thomas L

Joshua B. Tenenbaum, Charles Kemp, Thomas L. Grif- fiths, and Noah D. Goodman. How to grow a mind: Statistics, structure, and abstraction.Science, 2011. 2

2011

[49] [49]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. InNeurIPS, 2024. 5

2024

[50] [50]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Fran- cisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers & distillation through attention. InICML, 2021. 10

2021

[51] [51]

Physion++: Evaluating physical scene understanding that requires online infer- ence of different physical properties

Hsiao-Yu Tung, Mingyu Ding, Zhenfang Chen, Daniel Bear, Chuang Gan, Josh Tenenbaum, Dan Yamins, Ju- dith Fan, and Kevin Smith. Physion++: Evaluating physical scene understanding that requires online infer- ence of different physical properties. InNeurIPS, 2023. 2

2023

[52] [52]

Least-squares estimation of trans- formation parameters between two point patterns

Shinji Umeyama. Least-squares estimation of trans- formation parameters between two point patterns. TPAMI, 2002. 3

2002

[53] [53]

Vggt: Visualgeometrygroundedtransformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, An- drea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visualgeometrygroundedtransformer. InCVPR,

[54] [54]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InNeurIPS, 2022. 7

2022

[55] [55]

Galileo: Perceiving physical object properties by integrating a physics engine with deep learning

Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. InNeuRIPS, 2015. 2

2015

[56] [56]

Physics 101: Learning physical object properties from unlabeled videos

Jiajun Wu, Joseph J Lim, Hongyi Zhang, Joshua B Tenenbaum, and William T Freeman. Physics 101: Learning physical object properties from unlabeled videos. InBMVC, 2016. 2

2016

[57] [57]

Point-it-out: Benchmarking embodied reasoning for vision language models in multi-stage visual grounding.arXiv preprint arXiv:2509.25794,

Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming- Yu Liu, Yongxin Chen, and Jiaojiao Fan. Point-it-out: Benchmarking embodied reasoning for vision language models in multi-stage visual grounding.arXiv preprint arXiv:2509.25794, 2025. 8

work page arXiv 2025

[58] [58]

Clevrer: Collision events for video representation and reasoning

Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Ji- ajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning. InICLR, 2020. 2

2020

[59] [59]

Mm-vet: Evaluating large multimodal models for integrated capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. InICML, 2024. 7, 8

2024

[60] [60]

Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zee- shan Nadir, Bole Ma, and Stanley H. Chan. New- tonGen: Physics-consistent and controllable text-to- video generation via neural newtonian dynamics.arXiv preprint arXiv: 2509.21309, 2025. 2

work page arXiv 2025

[61] [61]

Mmmu: A massive multi- discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi- discipline multimodal understanding and reasoning benchmark for ex...

2024

[62] [62]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InICCV, 2023. 10

2023

[63] [63]

A general protocol to probe large vision models for 3d physical understanding

Guanqi Zhan, Chuanxia Zheng, Weidi Xie, and Andrew Zisserman. A general protocol to probe large vision models for 3d physical understanding. InNeurIPS,

[64] [64]

Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv, 2025

Chenyu Zhang, Daniil Cherniavskii, Antonios Tragoudaras, Antonios Vozikis, Thijmen Nijdam, Derck WE Prinzhorn, Mark Bodracska, Nicu Sebe, Andrii Zadaianchuk, and Efstratios Gavves. Morpheus: Benchmarking physical reasoning of video generative models with real physical experiments.arXiv, 2025. 2

2025

[65] [65]

Phystoolbench: Benchmarking physical tool understanding for mllms.arXiv preprint arXiv:2510.09507, 2025

Zixin Zhang, Kanghao Chen, Xingwang Lin, Lutao Jiang, Xu Zheng, Yuanhuiyi Lyu, Litao Guo, Yinchuan Li, and Ying-Cong Chen. Phystoolbench: Bench- marking physical tool understanding for mllms.arXiv preprint arXiv:2510.09507, 2025. 2

work page arXiv 2025

[66] [66]

Under- standing tools: Task-oriented object modeling, learning and recognition

Yixin Zhu, Yibiao Zhao, and Song Chun Zhu. Under- standing tools: Task-oriented object modeling, learning and recognition. InCVPR, 2015. 2

2015