Vision Language Models Cannot Reason About Physical Transformation

Bingyang Wang; Dezhi Luo; Hokin Deng; Maijunxian Wang; Pinyuan Feng; Pooyan Rahmanzadehgervi; Siheng Wang; Tianwei Zhao; Yijiang Li; Ziqiao Ma

Review: 2 major objections, 4 minor, 37 references

Current vision-language models fail to track physical quantities that stay the same through visual change.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-15 13:26 UTC pith:KROYXAHZ


arxiv 2603.07109 v2 pith:KROYXAHZ submitted 2026-03-07 cs.AI


classification cs.AI

keywords: vision-language models, physical transformation, conservation, invariance, temporal reasoning, embodied AI, multi-frame evaluation, heuristic bias

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that vision-language models do not yet form the kind of stable physical representations humans use when something changes shape, position, or appearance while keeping the same amount. It introduces ConservationBench: paired video tasks that ask whether number, length, volume, or size is conserved after a transformation, plus matched non-conserving controls where the quantity actually changes. Across 112 models and more than 23,000 questions, accuracy stays near chance; high scores on conservation trials reverse on the controls, revealing a default bias toward "same" rather than genuine tracking. Empty-image and text-only controls show strong language priors for invariance, yet real visual frames make models worse once both task types are balanced. More frames, different sampling, and varied prompting do not fix the failure. The result matters because embodied and dynamic applications require exactly this kind of transformation-invariant physical reasoning.

Core claim

Current VLMs do not maintain transformation-invariant representations of physical properties across dynamic scenes. High accuracy on conservation tasks is typically bought by a default "invariance" heuristic that collapses on matched non-conserving controls, leaving strict pairwise performance below chance for most models (82 of 112).
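A minimal sketch of why a default "same" answer inflates conservation accuracy while collapsing on the controls. It assumes the strict pairwise metric counts a pair as correct only when both the conserving video and its matched control are answered correctly; this page does not restate the paper's definition, so that reading, the answer labels, and the `score` helper are illustrative assumptions rather than the authors' code.

```python
# Hypothetical answer labels; the paper's exact three options are not given here.
CHOICES = ("same", "more", "less")


def score(pairs):
    """Score matched pairs.

    Each pair is (cons_pred, cons_truth, ctrl_pred, ctrl_truth).
    "strict" assumes a pair counts only if BOTH answers are correct.
    """
    n = len(pairs)
    cons = sum(cp == ct for cp, ct, _, _ in pairs) / n
    ctrl = sum(xp == xt for _, _, xp, xt in pairs) / n
    strict = sum(cp == ct and xp == xt for cp, ct, xp, xt in pairs) / n
    return {"conservation": cons, "control": ctrl, "strict": strict}


# Made-up ground truths for four non-conserving controls.
control_truths = ["more", "less", "less", "more"]

# A policy that always answers "same", whatever the frames show.
always_same = [("same", "same", "same", truth) for truth in control_truths]

result = score(always_same)
print(result)  # {'conservation': 1.0, 'control': 0.0, 'strict': 0.0}
```

Under this reading, a model guessing uniformly and independently among three options would land near 1/3 on single trials and near 1/9 (about 11%) on strict pairs; whether the paper defines its chance level this way is an assumption.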

What carries the argument

ConservationBench: 192 conservation videos and 192 matched non-conserving controls across number, length, volume, and size, crossed with frame count, extraction method, and prompting to produce 23,040 trials that force models to decide whether a quantity is preserved.
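The trial count can be sanity-checked with simple arithmetic. The sketch below only uses the three figures quoted above and assumes a full factorial crossing, in which every video appears under every combination of frame count, extraction method, and prompt; the individual factor levels are not stated on this page, so they are left unspecified.

```python
conserving_videos = 192
control_videos = 192
total_trials = 23_040

videos = conserving_videos + control_videos  # 384 videos in total

# If the design is fully crossed, each video is shown under the same
# number of frame-count x extraction x prompt conditions.
conditions_per_video, remainder = divmod(total_trials, videos)

print(videos, conditions_per_video, remainder)  # 384 60 0
```

A zero remainder is consistent with 60 conditions per video, though it does not by itself show how those 60 split across the three factors.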

Load-bearing premise

That clean laboratory videos of four simple quantity transformations, presented as short frame sequences with forced three-choice answers, are a fair diagnostic of whether models can form general transformation-invariant physical representations.

What would settle it

A model family that simultaneously exceeds chance on both conservation trials and their matched non-conserving controls under the same multi-frame, multi-prompt protocol, without trading one accuracy for the other.
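One way to make this falsifier operational is sketched below: require a model to beat chance on conserving trials and on controls separately. The exact one-sided binomial test, the 1/3 chance level, the 0.05 threshold, and the function names are assumptions chosen for illustration; the paper's own statistical procedure may differ.

```python
from math import exp, lgamma, log

CHANCE = 1 / 3  # three forced choices; assumes uniform guessing


def binom_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    def log_pmf(i: int) -> float:
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))


def beats_chance(correct: int, n: int, alpha: float = 0.05) -> bool:
    """One-sided exact test: is accuracy reliably above CHANCE?"""
    return binom_tail(correct, n, CHANCE) < alpha


def passes_falsifier(cons_correct: int, cons_n: int,
                     ctrl_correct: int, ctrl_n: int) -> bool:
    """True only if BOTH trial types exceed chance, with no trade-off."""
    return beats_chance(cons_correct, cons_n) and beats_chance(ctrl_correct, ctrl_n)


# Hypothetical counts, not results from the paper.
biased = passes_falsifier(180, 192, 20, 192)     # strong "same" bias
balanced = passes_falsifier(120, 192, 120, 192)  # above chance on both
print(biased, balanced)
```

The conjunction is the point of the design: a model with a strong "same" bias clears the first test easily and fails the second, so it does not settle the question.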


If this is right

Average accuracy on conservation-style questions is not evidence of physical understanding unless matched non-conserving controls are also passed.
Extra frames, human- or model-selected keyframes, and continuity-oriented prompts do not by themselves produce transformation-invariant reasoning.
Textual priors for quantity invariance dominate; real visual content often interferes rather than corrects.
Static image encoders with weak temporal aggregation are insufficient for object-state tracking over time.
Conservation-style paired tasks remain useful ongoing sanity checks even as models improve on broader physical-reasoning suites.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that maintain and update explicit object-state variables across frames are more likely to close the gap than further scale of current static encoders.
The same deficit should surface in planning, tool-use, and robotic-manipulation settings that require tracking quantity-preserving actions.
Domain-dependent reversals under captions (invariance bias for number/length, change bias for volume/size) suggest language can trigger competing heuristics rather than supply the missing physical model.
Cross-benchmark correlations imply that fixing sequential state integration would lift performance on multiple video and physical-reasoning suites at once.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Solid large-scale diagnostic of a real VLM failure on conservation; the lab-stimulus caveat is real but does not sink the result.


The punchline is simple: across 112 VLMs and 23k trials, models do not track quantity through physical transformations. They sit near chance overall, and the ones that look good on conserving trials collapse on matched non-conserving controls (r ≈ −0.51). Strict pairwise accuracy is under 10% for most models. That pattern is the real contribution.

What is new is the design, not the slogan. Prior physics and counting benches exist, but none pair conserving and non-conserving videos under the same visual contexts, then factorially vary frames, sampling, and prompts, then strip the images to empty or text-only to measure the prior. The empty-image and text-only controls are the cleanest part: models default hard to “same,” and real visuals actually hurt conservation accuracy relative to blank images. Human baseline is ~98%. Scaling does almost nothing for conservation (R² ≈ 0.02). Cosmos Reason fails the same way. The preliminary attention/confidence probe on Qwen2.5-VL-7B (Frame-1 anchoring + overconfident “same” errors) is consistent with the behavioral story even if it is only one model.

Soft spots, in proportion. The stimuli are clean lab videos—fixed camera, constant lighting, no occlusion, forced three-choice. The paper’s own limitations section says so. Cross-bench correlations (Video-MME, PhysBench, etc.) show shared variance but do not prove the same representations would appear under messy interactive conditions. So the title-level claim “cannot reason about physical transformation” is a bit broader than the four properties under these regimes. That is a real caveat, not a fatal one; the inverse pattern and the bias controls still stand on the data they collected. I also do not see a public code/data release in the manuscript, which matters if this is meant to be a lasting sanity check.

Math and stats look ordinary and appropriate (repeated-measures ANOVA with Bonferroni, hybrid answer mapping). Citations are dense but on-topic; self-cites are to related prior work by the same group, not load-bearing circularity.

This is for people building or evaluating VLMs for embodied or video reasoning. It is a diagnostic, not a method paper. I would bring it to reading group, cite the failure pattern when I need a concrete conservation baseline, and send it to peer review. The central empirical claim holds; the over-generalization risk is already flagged by the authors and is the main thing referees should pressure.

Referee Report

2 major / 4 minor

Summary. The paper introduces ConservationBench, a cognitively grounded evaluation of whether VLMs maintain transformation-invariant representations of four physical quantities (number, length, volume, size). It constructs 192 conserving videos and 192 matched non-conserving controls, varies frame count, sampling strategy, and prompting to produce 23,040 trials, and evaluates 112 VLMs. Aggregate accuracy stays near chance; conservation accuracy is negatively correlated with non-conserving accuracy (r = -0.51); strict pairwise success is below chance for 82/112 models; empty-image and text-only controls reveal a strong textual invariance prior that real visual content disrupts rather than corrects; and neither temporal resolution, prompting, nor curated sampling yields balanced performance. A preliminary confidence/attention analysis on one model further suggests Frame-1 anchoring. The authors conclude that current VLMs fail to form transformation-invariant physical representations across dynamic scenes.

Significance. If the result holds, it supplies a clean, falsifiable diagnostic that current multi-frame VLMs lack a core cognitive substrate (conservation) required for reliable physical reasoning in dynamic environments. Strengths include the carefully paired design, human baseline >98%, factorial controls, empty-image/text ablations that isolate textual priors, repeated-measures ANOVA, cross-benchmark correlations, and an open mechanistic probe. The benchmark can serve as an enduring sanity check for future architectures that claim temporally grounded physical understanding. The main interpretive risk is that the laboratory stimuli may overstate a narrow multi-frame matching deficit rather than a general representational failure; the authors already flag this limitation.

major comments (2)

The title-level claim that VLMs “cannot reason about physical transformation” and “fail to maintain transformation-invariant representations … across dynamic scenes” rests on four clean, fixed-camera, no-occlusion laboratory properties (Table 4, §3). While the inverse conserve/non-conserve pattern, empty-image controls and null temporal effects are internally robust, the manuscript does not yet demonstrate that the same deficit appears under richer temporal, interactive or occluded conditions. The Limitations section (§6) acknowledges the lab setting, but the abstract and title do not sufficiently qualify the scope. A modest re-phrasing that ties the claim explicitly to the controlled multi-frame regime would keep the central result intact while preventing over-generalization.
The mechanistic analysis (§4.7, Figs. 16–17) is performed on a single 7B model. The Frame-1 anchoring signature is suggestive, yet the paper treats it as supporting evidence for a general architectural bottleneck. Either expand the probe to at least one additional model family or clearly label the analysis as preliminary and non-generalizable; otherwise the causal interpretation remains under-supported relative to the behavioral claims.

minor comments (4)

Figure 2B caption and surrounding text should state the exact n and whether the correlation is Pearson or Spearman; the value r = -0.510 is given but the test is not named.
In §4.4 the Bonferroni-corrected p-values are reported, yet the corresponding effect sizes (partial η² or Cohen’s d) are omitted; adding them would help readers judge practical significance of the modest frame-count effect on Volume & Size.
Appendix H tables list “Strict (%)” without restating the definition; a one-sentence reminder in the table caption would improve readability.
A few typographical inconsistencies remain (e.g., “V olume” with a space in several places, “Houd´e” accent rendering).
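The first minor comment matters because Pearson and Spearman coefficients can differ on the same data. The sketch below computes both on made-up per-model accuracies; the numbers are hypothetical and are not the paper's values, and the helper functions are written out by hand so the example depends only on the standard library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model accuracies, NOT taken from the paper.
cons_acc = [0.95, 0.90, 0.60, 0.40, 0.35, 0.30]
ctrl_acc = [0.05, 0.20, 0.30, 0.45, 0.40, 0.90]

r_pearson = pearson(cons_acc, ctrl_acc)
r_spearman = spearman(cons_acc, ctrl_acc)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Both coefficients come out negative here, but they are not equal, so a bare "r = -0.510" leaves the reader unsure which quantity is being reported.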

Circularity Check

0 steps flagged

No load-bearing circularity: purely empirical benchmark evaluation with independent controls and external human baseline; minor non-load-bearing self-citations only.


The paper's central claim (VLMs fail to maintain transformation-invariant physical representations) is derived entirely from new empirical measurements on ConservationBench: 23,040 trials across 112 models, inverse conserve/non-conserve correlation (r=-0.51), near-floor strict pairwise accuracy, empty-image/text-only controls isolating textual priors, and null effects of frames/prompts/sampling. These are self-contained against the models' own outputs and a human baseline (98.35%). No parameters are fitted and then re-labeled as predictions; no uniqueness theorems or ansätze are imported; no equations reduce by construction. Self-citations (e.g., Li et al. 2025a, Luo et al. 2025b) appear only in related-work and discussion framing and are not premises for the failure result. The evaluation is therefore non-circular; the only residual softness is external validity of the lab tasks, which is a correctness/scope issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 1 invented entity

Empirical evaluation paper; almost no free parameters or invented physical entities. The main modeling choices are design decisions of the benchmark itself (which properties, how many frames, which prompts) rather than fitted constants. Background cognitive claims about conservation are taken from the developmental literature and treated as domain assumptions.

axioms (3)

domain assumption: Conservation of number, length, volume and size under the scripted transformations is the ground-truth physical fact that a competent reasoner must recover.
Stated in §3.1 and Table 2; taken from Piagetian and developmental literature without re-derivation.
domain assumption: The non-conserving controls alter only the target quantity while holding task-irrelevant visual features matched, so differential performance diagnoses sensitivity to quantity change rather than superficial heuristics.
Design claim of §3.2; correctness of the claim rests on the authors’ video construction.
ad hoc to paper: Three-choice forced-choice accuracy (and the strict pairwise metric) is a valid measure of transformation-invariant representation.
Evaluation protocol of §4.1; alternative open-ended or continuous measures are not explored.

invented entities (1)

ConservationBench (no independent evidence)
purpose: Provide a controlled diagnostic suite of 384 videos and 23,040 questions that isolates conservation reasoning under factorial frame and prompt conditions.
New dataset constructed for this paper; independent evidence will exist only once the videos and code are released and re-used by others.

pith-pipeline@v1.1.0-grok45 · 32906 in / 2344 out tokens · 27245 ms · 2026-07-15T13:26:52.307544+00:00 · methodology


Original abstract

Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promise in embodied applications, whether they genuinely understand physical transformations remains unclear. We introduce ConservationBench evaluating conservation -- whether physical quantities remain invariant under transformations. Spanning four properties with paired conserving/non-conserving scenarios, we generate and evaluate 23,040 questions across 112 VLMs. Results reveal systematic failure: performance remains near chance with improvements on conservation tasks accompanied by drops on controls. Control experiments show strong textual priors favoring invariance, yet models perform worse with actual visual content when performance is balanced across conserving and non-conserving scenarios. Neither temporal resolution, prompting, nor curated sampling helps. These findings show that current VLMs fail to maintain transformation-invariant representations of physical properties across dynamic scenes.


Reference graph

Works this paper leans on

37 extracted references · 1 canonical work page

[1] Azzolini, A., Bai, J., Brandon, H., Cao, J., Chattopadhyay, P., Chen, H., Chu, J., Cui, Y., Diamond, J., Ding, Y., et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558.
[2] Bear, D. M., Wang, E., Mrowca, D., Binder, F. J., Tung, H.-Y. F., Pramod, R., Holdaway, C., Tao, S., Smith, K., Sun, F.-Y., et al. Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261.
[3] Berti, L., Giorgi, F., and Kasneci, G. Emergent abilities in large language models: A survey. arXiv preprint arXiv:2503.05788.
[4] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4.
[5] Buschoff, L. M. S., Voudouris, K., Akata, E., Bethge, M., Tenenbaum, J. B., and Schulz, E. Testing the limits of fine-tuning to improve reasoning in vision language models. arXiv preprint arXiv:2502.15678.
[6] Cai, Z., Wang, Y., Sun, Q., Wang, R., Gu, C., Yin, W., Lin, Z., Yang, Z., Wei, C., Qian, O., et al. Holistic evaluation of multimodal llms on spatial intelligence. arXiv preprint arXiv:2508.13142.
[7] Cheng, K., Li, Y., Xu, F., Zhang, J., Zhou, H., and Liu, Y. Vision-language models can self-improve reasoning via reflection. arXiv preprint arXiv:2411.00855, 2024a. Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., and Bing, L. Videollama 2: Advancing spatial-tempora...
[8] Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM international conference on multimedia, pp. 11198–11201.
[9] Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24108–24118, 2025a. Fu, S., Bonnen, T., Guillory, D., and Darre...
[10] Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., and Sadigh, D. Physically grounded vision-language models for robotic manipulation, 2024a. URL https://arxiv.org/abs/2309.02561. Gao, Q., Li, Y., Lyu, H., Sun, H., Luo, D., and Deng, H. Vision language models see what you want but not what you see. arXiv preprint arXiv:2410.00324...
[11] doi:10.1207/s15327078in0602
[12] Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701.
[13] Jiang, Y., Wang, Y., Zhao, R., Parag, T., Chen, Z., Liao, Z., and Unnikrishnan, J. Videop2r: Video understanding from perception to reasoning. arXiv preprint arXiv:2511.11113.
[14] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[15] Li, Y., Gao, Q., Zhao, T., Wang, B., Sun, H., Lyu, H., Hawkins, R. D., Vasconcelos, N., Golan, T., Luo, D., and Deng, H. Core knowledge deficits in multi-modal language models. arXiv preprint arXiv:2410.10855, 2025a. Li, Y., Wang, B., Zhao, T., Gao, Q., Deng, H., and Luo, D. Evaluating multi-modal language models through concept hacking. In Workshop o...
[16] Liu, Y., Li, Z., Yang, B., Li, C., Yin, X., Liu, C.-l., Jin, L., and Bai, X. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895.
[17] Luo, D., Gao, Q., and Deng, H. Rethinking the simulation vs. rendering dichotomy: No free lunch in spatial world modelling. arXiv preprint arXiv:2510.20835, 2025a. Luo, D., Li, Y., and Deng, H. The philosophical foundations of growing ai like a child. arXiv preprint arXiv:2502.10742, 2025b. Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-vqa:...
[18] Mitchell, M. and Krakauer, D. C. The debate over understanding in ai’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120.
[19] Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z., Vuong, Q., Zhang, T., Lee, T.-W. E., Lee, K.-H., Xu, P., Kirmani, S., Zhu, Y., Zeng, A., Hausman, K., Heess, N., Finn, C., Levine, S., and I...
[20] Newman, K., Wang, S., Zang, Y., Heffren, D., and Sun, C. Do pre-trained vision-language models encode object states? arXiv preprint arXiv:2409.10488.
[21] Patel, M., Gokhale, T., Baral, C., and Yang, Y. Cripp-vqa: Counterfactual reasoning about implicit physical properties via video question answering. arXiv preprint arXiv:2211.03779.
[22] Qiu, S., Guo, S., Song, Z.-Y., Sun, Y., Cai, Z., Wei, J., Luo, T., Yin, Y., Zhang, H., Hu, Y., et al. Phybench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074.
[23] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
[24] Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R., and Nguyen, A. T. Vision language models are blind. arXiv preprint arXiv:2407.06581.
[25] Spelke, E. S., Breinlinger, K., Macomber, J., and Jacobson, K. Origins of knowledge. Psychological review, 99(4):605.
[26] Sun, H., Gao, Q., Lyu, H., Luo, D., Li, Y., and Deng, H. Probing mechanical reasoning in large vision language models. arXiv preprint arXiv:2410.00318.
[27] Sun, H., Yu, S., Li, Y., Gao, Q., Lyu, H., Deng, H., and Luo, D. Probing perceptual constancy in large vision language models. arXiv preprint arXiv:2502.10273.
[28] Viarouge, A., Houdé, O., and Borst, G. The progressive 6-year-old conserver: Numerical saliency and sensitivity as core mechanisms of numerical abstraction in a piaget-like estimation task. Cognition, 190:137–142.
[29] Wang, L., Su, E., Liu, J., Li, P., Xia, P., Xiao, J., Zhang, W., Dai, X., Chen, X., Meng, Y., et al. Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models. arXiv preprint arXiv:2506.17667.
[30] Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
[31] Xu, G., Jin, P., Hao, L., Song, Y., Sun, L., and Yuan, L. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440.
[32] Yu, S., Cho, J., Yadav, P., and Bansal, M. Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36:76749–76771.
[33] Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., and Zou, J. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936.
[34] Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12104–12113.
[35] Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F., and Sun, M. Exploring perceptual limitation of multimodal large language models. arXiv preprint arXiv:2402.07384, 2024a. Zhang, R., Zhang, B., Li, Y., Zhang, H., Sun, Z., Gan, Z., Yang, Y., Pang, R., and Yang, Y. Improve vision lan- ...
[36] Zheng, Z., Yan, X., Chen, Z., Wang, J., Lim, Q. Z. E., Tenenbaum, J. B., and Gan, C. Contphy: Continuum physical concept learning and reasoning from videos. arXiv preprint arXiv:2402.06119.
[37] No significant effects of prompt type or frame count are found for either task category among top models. The extraction method effect observed in the full sample is preserved, with SeViLA leading to significantly worse performance on Volume & Size tasks (p < 0.01), further confirming that neither linguistic scaffolding nor increased temporal information f...

[1] [1]

Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558,

Azzolini, A., Bai, J., Brandon, H., Cao, J., Chattopadhyay, P., Chen, H., Chu, J., Cui, Y ., Diamond, J., Ding, Y ., et al. Cosmos-reason1: From physical common sense to embodied reasoning.arXiv preprint arXiv:2503.15558,

Pith/arXiv arXiv

[2] [2]

M., Wang, E., Mrowca, D., Binder, F

Bear, D. M., Wang, E., Mrowca, D., Binder, F. J., Tung, H.-Y . F., Pramod, R., Holdaway, C., Tao, S., Smith, K., Sun, F.-Y ., et al. Physion: Evaluating physical prediction from vision in humans and machines.arXiv preprint arXiv:2106.08261,

Pith/arXiv arXiv

[3] [3]

Emergent abilities in large language models: A survey.arXiv preprint arXiv:2503.05788,

Berti, L., Giorgi, F., and Kasneci, G. Emergent abilities in large language models: A survey.arXiv preprint arXiv:2503.05788,

Pith/arXiv arXiv

[4] [4]

Bubeck, S., Chadrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y

URLhttps://arxiv.org/abs/2307.15818. Bubeck, S., Chadrasekaran, V ., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y . T., Li, Y ., Lund- berg, S., et al. Sparks of artificial general intelligence: Early experiments with gpt-4,

Pith/arXiv arXiv

[5] [5]

Buschoff, L. M. S., V oudouris, K., Akata, E., Bethge, M., Tenenbaum, J. B., and Schulz, E. Testing the limits of fine- tuning to improve reasoning in vision language models. arXiv preprint arXiv:2502.15678,

Pith/arXiv arXiv

[6] [6]

Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142,

10 Vision Language Models Cannot Reason About Physical Transformation Cai, Z., Wang, Y ., Sun, Q., Wang, R., Gu, C., Yin, W., Lin, Z., Yang, Z., Wei, C., Qian, O., et al. Holistic evaluation of multimodal llms on spatial intelligence.arXiv preprint arXiv:2508.13142,

arXiv

[7] [7]

Cheng, K., Li, Y ., Xu, F., Zhang, J., Zhou, H., and Liu, Y

URL https://arxiv.org/abs/2412.05271. Cheng, K., Li, Y ., Xu, F., Zhang, J., Zhou, H., and Liu, Y . Vision-language models can self-improve reasoning via reflection.arXiv preprint arXiv:2411.00855, 2024a. Cheng, Z., Leng, S., Zhang, H., Xin, Y ., Li, X., Chen, G., Zhu, Y ., Zhang, W., Luo, Z., Zhao, D., and Bing, L. Videollama 2: Advancing spatial-tempora...

Pith/arXiv arXiv

[8] [8]

Duan, H., Yang, J., Qiao, Y ., Fang, X., Chen, L., Liu, Y ., Dong, X., Zang, Y ., Zhang, P., Wang, J., et al

URLhttps://arxiv.org/abs/2303.03378. Duan, H., Yang, J., Qiao, Y ., Fang, X., Chen, L., Liu, Y ., Dong, X., Zang, Y ., Zhang, P., Wang, J., et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pp. 11198– 11201,

Pith/arXiv arXiv

[9] [9]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Fu, C., Dai, Y ., Luo, Y ., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y ., Zhang, M., et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 24108–24118, 2025a. Fu, S., Bonnen, T., Guillory, D., and Darre...

Pith/arXiv arXiv

[10] [10]

Physically grounded vision- language models for robotic manipulation, 2024a

Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Ma- jumdar, A., and Sadigh, D. Physically grounded vision- language models for robotic manipulation, 2024a. URL https://arxiv.org/abs/2309.02561. Gao, Q., Li, Y ., Lyu, H., Sun, H., Luo, D., and Deng, H. Vision language models see what you want but not what you see.arXiv preprint arXiv:2410.00324...

Pith/arXiv arXiv

[11] [11]

doi: 10.1207/s15327078in0602

work page doi:10.1207/s15327078in0602

[12] [12]

B., Dhariwal, P., Gray, S., et al

Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701,

Pith/arXiv arXiv 2010

[13]

Jiang, Y., Wang, Y., Zhao, R., Parag, T., Chen, Z., Liao, Z., and Unnikrishnan, J. Videop2r: Video understanding from perception to reasoning. arXiv preprint arXiv:2511.11113,

[14]

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

[15]

Li, Y., Gao, Q., Zhao, T., Wang, B., Sun, H., Lyu, H., Hawkins, R. D., Vasconcelos, N., Golan, T., Luo, D., and Deng, H. Core knowledge deficits in multi-modal language models. arXiv preprint arXiv:2410.10855, 2025a. Li, Y., Wang, B., Zhao, T., Gao, Q., Deng, H., and Luo, D. Evaluating multi-modal language models through concept hacking. In Workshop o...

[16]

URL https://arxiv.org/abs/2501.10928. Liu, Y., Li, Z., Yang, B., Li, C., Yin, X., Liu, C.-l., Jin, L., and Bai, X. On the hidden mystery of ocr in large multimodal models. arXiv preprint arXiv:2305.07895,

[17]

Luo, D., Gao, Q., and Deng, H. Rethinking the simulation vs. rendering dichotomy: No free lunch in spatial world modelling. arXiv preprint arXiv:2510.20835, 2025a. Luo, D., Li, Y., and Deng, H. The philosophical foundations of growing ai like a child. arXiv preprint arXiv:2502.10742, 2025b. Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-vqa:...

[18]

URL https://arxiv.org/abs/2410.05363. Mitchell, M. and Krakauer, D. C. The debate over understanding in ai’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120,

[19]

URL https://arxiv.org/abs/2501.09038. Nasiriany, S., Xia, F., Yu, W., Xiao, T., Liang, J., Dasgupta, I., Xie, A., Driess, D., Wahid, A., Xu, Z., Vuong, Q., Zhang, T., Lee, T.-W. E., Lee, K.-H., Xu, P., Kirmani, S., Zhu, Y., Zeng, A., Hausman, K., Heess, N., Finn, C., Levine, S., and I...

[20]

URL https://arxiv.org/abs/2402.07872. Newman, K., Wang, S., Zang, Y., Heffren, D., and Sun, C. Do pre-trained vision-language models encode object states? arXiv preprint arXiv:2409.10488,

[21]

URL https://arxiv.org/abs/2004.10796. Patel, M., Gokhale, T., Baral, C., and Yang, Y. Cripp-vqa: Counterfactual reasoning about implicit physical properties via video question answering. arXiv preprint arXiv:2211.03779,

[22]

Qiu, S., Guo, S., Song, Z.-Y., Sun, Y., Cai, Z., Wei, J., Luo, T., Yin, Y., Zhang, H., Hu, Y., et al. Phybench: Holistic evaluation of physical perception and reasoning in large language models. arXiv preprint arXiv:2504.16074,

[23]

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020,

[24]

Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R., and Nguyen, A. T. Vision language models are blind. arXiv preprint arXiv:2407.06581,

[25]

URL https://arxiv.org/abs/2401.15977. Spelke, E. S., Breinlinger, K., Macomber, J., and Jacobson, K. Origins of knowledge. Psychological review, 99(4):605,

[26]

Sun, H., Gao, Q., Lyu, H., Luo, D., Li, Y., and Deng, H. Probing mechanical reasoning in large vision language models. arXiv preprint arXiv:2410.00318,

[27]

Sun, H., Yu, S., Li, Y., Gao, Q., Lyu, H., Deng, H., and Luo, D. Probing perceptual constancy in large vision language models. arXiv preprint arXiv:2502.10273,

[28]

URL https://arxiv.org/abs/2503.19786. Viarouge, A., Houdé, O., and Borst, G. The progressive 6-year-old conserver: Numerical saliency and sensitivity as core mechanisms of numerical abstraction in a piaget-like estimation task. Cognition, 190:137–142,

[29]

Wang, L., Su, E., Liu, J., Li, P., Xia, P., Xiao, J., Zhang, W., Dai, X., Chen, X., Meng, Y., et al. Physunibench: An undergraduate-level physics reasoning benchmark for multimodal models. arXiv preprint arXiv:2506.17667,

[30]

URL https://arxiv.org/abs/2409.12191. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682,

[31]

Xu, G., Jin, P., Hao, L., Song, Y., Sun, L., and Yuan, L. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440,

[32]

URL https://arxiv.org/abs/2503.23368. Yu, S., Cho, J., Yadav, P., and Bansal, M. Self-chained image-language model for video localization and question answering. Advances in Neural Information Processing Systems, 36:76749–76771,

[33]

Yuksekgonul, M., Bianchi, F., Kalluri, P., Jurafsky, D., and Zou, J. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936,

[34]

URL https://arxiv.org/abs/1811.10830. Zhai, X., Kolesnikov, A., Houlsby, N., and Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12104–12113,

[35]

Zhang, J., Hu, J., Khayatkhoei, M., Ilievski, F., and Sun, M. Exploring perceptual limitation of multimodal large language models. arXiv preprint arXiv:2402.07384, 2024a. Zhang, R., Zhang, B., Li, Y., Zhang, H., Sun, Z., Gan, Z., Yang, Y., Pang, R., and Yang, Y. Improve vision lan- ...

[36]

Zheng, Z., Yan, X., Chen, Z., Wang, J., Lim, Q. Z. E., Tenenbaum, J. B., and Gan, C. Contphy: Continuum physical concept learning and reasoning from videos. arXiv preprint arXiv:2402.06119,

[37]

No significant effects of prompt type or frame count are found for either task category among top models. The extraction method effect observed in the full sample is preserved, with SeViLA leading to significantly worse performance on Volume & Size tasks (p < 0.01), further confirming that neither linguistic scaffolding nor increased temporal information f...