An image grid can be worth a video: Zero- shot video question answering using a vlm

Wonkyun Kim, Changin Choi, Wonseok Lee, Wonjong Rhee · 2024 · arXiv 2403.18406

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.

Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

cs.CV · 2025-01-06 · unverdicted · novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

cs.CV · 2024-04-25 · conditional · novelty 5.0

A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding cs.AI · 2026-05-21 · unverdicted · none · ref 3
ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models cs.CV · 2026-05-03 · unverdicted · none · ref 17
VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 12
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

An image grid can be worth a video: Zero- shot video question answering using a vlm

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer