PhysBench: Benchmarking and enhancing vision-language models for physical world understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang · 2025 · arXiv 2501.16411

23 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 2 · use dataset: 1

representative citing papers

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

ChronoPhyBench is a new benchmark and dataset for chronological physical dynamics reasoning that combines video-conditioned next-state prediction with VQA to reduce language bias in MLLM evaluation.

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.

Benchmarking Single-Factor Physical Video-to-Audio Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

WMW audits VLMs by requiring typed physical state-transition traces and using a verifier to detect inconsistencies missed by answer-only evaluation, with TraceBank as a released resource of synthetic scenarios.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

cs.CV · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

ESI-Bench shows active exploration outperforms passive observation in multimodal LLMs on spatial tasks but reveals failures from poor action choices and overconfident belief commitment unlike humans.

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

Grounding Video Reasoning in Physical Signals

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

cs.CL · 2026-04-23 · unverdicted · novelty 7.0

OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

SCP: Spatial Causal Prediction in Video

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

cs.CV · 2025-11-14 · unverdicted · novelty 7.0

SandboxVLM enhances VLMs' spatial intelligence by encoding 3D geometry with abstract bounding boxes in a four-stage zero-shot pipeline, yielding an 8.3% improvement on SAT Real benchmark.

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs

cs.DB · 2026-06-04 · unverdicted · novelty 6.0

Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.

$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

A vision-language framework generates text-based rigid-body scene configurations from videos using motion reasoning and optical flow, reporting 0.30 IoU on CLEVRER (7x over baselines) and transfer to 235 real videos.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.

Quantitative Video World Model Evaluation for Geometric-Consistency

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.

From Priors to Perception: Grounding Video-LLMs in Physical Reality

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.

Multimodal Language Models Cannot Spot Spatial Inconsistencies

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.

Physically Viable World Models: A Case for Query-Conditioned Embodied AI

cs.AI · 2026-05-28 · unverdicted · novelty 5.0

Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.

PhysBrain 1.0 Technical Report

cs.RO · 2026-05-14 · unverdicted · novelty 5.0

PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

cs.AI · 2025-12-29 · unverdicted · novelty 5.0

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

cs.CV · 2025-11-23 · unverdicted · novelty 5.0

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence?

cs.CY · 2026-04-14 · unverdicted · novelty 4.0

Longitudinal poll data from 471 students in AI courses shows a shift toward preferring human intelligence, reaching 65% in technical courses and 90% in design courses by 2026.

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

cs.CL · 2026-04-22

citing papers explorer

Showing 23 of 23 citing papers.

ChronoPhyBench: Do MLLMs Truly Understand the World or Merely Exploit Language Priors?
cs.CV · 2026-06-06 · unverdicted · none · ref 43
ChronoPhyBench is a new benchmark and dataset for chronological physical dynamics reasoning that combines video-conditioned next-state prediction with VQA to reduce language bias in MLLM evaluation.

Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them
cs.CV · 2026-06-04 · unverdicted · none · ref 22
PhaseLock extracts motion priors from 2-step inference and enforces them via Latent Delta Guidance to raise physical consistency scores by 6.2 points on average in image-to-video diffusion models.

Benchmarking Single-Factor Physical Video-to-Audio Generation
cs.CV · 2026-05-28 · unverdicted · none · ref 14
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models
cs.CL · 2026-05-28 · unverdicted · none · ref 2
WMW audits VLMs by requiring typed physical state-transition traces and using a verifier to detect inconsistencies missed by answer-only evaluation, with TraceBank as a released resource of synthetic scenarios.

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
cs.CV · 2026-05-18 · unverdicted · none · ref 4 · 2 links
ESI-Bench shows active exploration outperforms passive observation in multimodal LLMs on spatial tasks but reveals failures from poor action choices and overconfident belief commitment unlike humans.

Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
cs.CV · 2026-05-08 · unverdicted · none · ref 13
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

Grounding Video Reasoning in Physical Signals
cs.CV · 2026-04-23 · unverdicted · none · ref 6
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
cs.CL · 2026-04-23 · unverdicted · none · ref 96
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve performance.

SCP: Spatial Causal Prediction in Video
cs.CV · 2026-03-04 · unverdicted · none · ref 13
SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.

Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
cs.CV · 2025-11-14 · unverdicted · none · ref 5
SandboxVLM enhances VLMs' spatial intelligence by encoding 3D geometry with abstract bounding boxes in a four-stage zero-shot pipeline, yielding an 8.3% improvement on SAT Real benchmark.

Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs
cs.DB · 2026-06-04 · unverdicted · none · ref 13
Introduces CausalPhys benchmark with causal graphs and CRFT fine-tuning to improve VLMs' causal physical reasoning accuracy and interpretability.

$\Delta$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos
cs.CV · 2026-05-20 · unverdicted · none · ref 17
A vision-language framework generates text-based rigid-body scene configurations from videos using motion reasoning and optical flow, reporting 0.30 IoU on CLEVRER (7x over baselines) and transfer to 235 real videos.

GeoWorld-VLM: Geometry from World Models for Vision-Language Models
cs.CV · 2026-05-15 · unverdicted · none · ref 11 · 2 links
GeoWorld-VLM aligns VLM image features with intermediate representations from camera-conditioned world models via fine-tuning only the encoder and projector, yielding ~4% gains on What'sUp and VSR spatial benchmarks across two VLM backbones.

Quantitative Video World Model Evaluation for Geometric-Consistency
cs.CV · 2026-05-14 · unverdicted · none · ref 8
PDI-Bench computes 3D projective residuals from segmented and tracked points to quantify geometric inconsistency in AI-generated videos.

From Priors to Perception: Grounding Video-LLMs in Physical Reality
cs.CV · 2026-05-06 · unverdicted · none · ref 11
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.

Multimodal Language Models Cannot Spot Spatial Inconsistencies
cs.CV · 2026-04-01 · unverdicted · none · ref 11
Multimodal LLMs significantly underperform humans at spotting objects that break 3D consistency in multi-view image pairs.

OmniView-Space: Reinforcing Spatial Reasoning via Multi-Perspective Spatial Mapping
cs.CV · 2026-07-01 · unverdicted · none · ref 20
OmniView-Space framework with MPSM, tool-guided reasoning, and distillation achieves SOTA on spatial reasoning benchmarks for MLLMs while reducing external geometry dependencies.

Physically Viable World Models: A Case for Query-Conditioned Embodied AI
cs.AI · 2026-05-28 · unverdicted · none · ref 16
Embodied AI requires query-conditioned world models that select the simplest physical abstraction sufficient to answer intervention queries.

PhysBrain 1.0 Technical Report
cs.RO · 2026-05-14 · unverdicted · none · ref 9
PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
cs.AI · 2025-12-29 · unverdicted · none · ref 24
A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
cs.CV · 2025-11-23 · unverdicted · none · ref 13
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence?
cs.CY · 2026-04-14 · unverdicted · none · ref 8
Longitudinal poll data from 471 students in AI courses shows a shift toward preferring human intelligence, reaching 65% in technical courses and 90% in design courses by 2026.

Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
cs.CL · 2026-04-22 · unreviewed · ref 80
