Video LMMs name objects and actions reliably but fail to detect the precise frames and locations of contact and release events, revealing shortcut learning instead of physical grounding.
For each video, we cropped short 10-frame video clips around three temporally separated core interaction events
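The cropping step quoted above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the excerpt does not say how each 10-frame window is positioned relative to its event, so centring the window on the event frame and clamping it to the video boundaries are assumptions, and the names `clip_window` and `crop_clips` are hypothetical.

```python
from typing import List, Sequence, Tuple


def clip_window(event_frame: int, num_frames: int, clip_len: int = 10) -> Tuple[int, int]:
    """Return (start, end) frame indices, end exclusive, for one clip.

    Assumption: the window is centred on the event frame and shifted
    inward when it would run past either end of the video.
    """
    if num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    start = event_frame - clip_len // 2
    # Clamp so the window stays inside [0, num_frames).
    start = max(0, min(start, num_frames - clip_len))
    return start, start + clip_len


def crop_clips(frames: Sequence, event_frames: Sequence[int], clip_len: int = 10) -> List[Sequence]:
    """Crop one fixed-length clip around each annotated event frame."""
    windows = (clip_window(f, len(frames), clip_len) for f in event_frames)
    return [frames[start:end] for start, end in windows]


# Stand-in "video": frame indices instead of decoded images.
video = list(range(100))
# Three hypothetical event frames, one per core interaction event.
clips = crop_clips(video, [2, 50, 98])
```

With real footage, `frames` would be a sequence of decoded images and the event frames would come from the dataset's annotations.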
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection