Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-related reasoning involving motion dynamics and spatial interactions. We present a novel approach to address this gap by translating physical-world context cues into interpretable representations aligned with VLM perception, comprehension, and reasoning. We introduce MASS, a model-agnostic approach that injects spatiotemporal signals into the VLM language space via depth-based 3D encoding and visual grounding, coupled with a motion tracker for object dynamics. We also contribute a comprehensive benchmark, MASS-Bench, consisting of 4,350 real-world and AIGC videos and 8,361 free-form video question-answering pairs focused on physics-related comprehension tasks, with detailed annotations including visual detections and grounding over sub-segments, as well as full-sequence 3D motion tracking of entities. To strengthen cross-modal alignment and reasoning, we apply reinforcement fine-tuning to MASS. Experiments and ablations show that our refined VLMs outperform comparable baselines, larger models, and prior state-of-the-art models, achieving performance comparable to closed-source state-of-the-art VLMs, with only a 2\% gap to Gemini-2.5-Flash on physics reasoning and comprehension.
fields
cs.CV 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
APT: Atomic Physical Transitions for Causal Video-Language Understanding
Introduces APT chains as ordered causal transition sequences and APT-Tune to improve VLM transition detection while preserving event-level performance.