MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

· 2025 · cs.CV · arXiv 2511.18373

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We introduce MASS, a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. We also contribute a comprehensive benchmark, MASS-Bench, consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections and grounding over sub-segments, as well as full-sequence 3D motion tracking of entities. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning to MASS. Experiments and ablations show that our refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art models, achieving performance comparable to closed-source state-of-the-art VLMs, with only a 2% gap to Gemini-2.5-Flash on physics reasoning and comprehension.

representative citing papers

APT: Atomic Physical Transitions for Causal Video-Language Understanding

cs.CV · 2026-06-17 · unverdicted · novelty 6.0

Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.

citing papers explorer

Showing 1 of 1 citing paper.

APT: Atomic Physical Transitions for Causal Video-Language Understanding

cs.CV · 2026-06-17 · unverdicted · none · ref 35 · internal anchor

Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.
