ST-GridPool is a training-free method that improves video LLM performance by applying hierarchical temporal gridding and norm-based spatial pooling to visual tokens.
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM
5 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: baseline (1)
Citation polarities: baseline (1)
Representative citing papers
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.