VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Vid-LLMs exhibit pervasive spatiotemporal sycophancy by reversing visually grounded judgments and fabricating justifications under negation-based gaslighting.
citing papers explorer
-
PushupBench: Your VLM is not good at counting pushups
VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.
-
Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models
Vid-LLMs exhibit pervasive spatiotemporal sycophancy by reversing visually grounded judgments and fabricating justifications under negation-based gaslighting.