AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
Q-frame: Query-aware frame selection and multi-resolution adaptation for video-llms
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 6years
2026 6verdicts
UNVERDICTED 6roles
background 1polarities
background 1representative citing papers
KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.
PEEK distills caption-conditioned frame relevance into a lightweight visual model, outperforming adaptive baselines on ActivityNet Captions and MSR-VTT especially at 1-2 frame budgets while adding only 5.2% overhead.
QCA selects compact, query-relevant keyframes from long videos via segment-wise budget allocation and diversity-aware addition, achieving higher accuracy than GPT-4o on LongVideoBench with half the frames.
AdaQ is a training-free adaptive quasi-Gaussian sampling method for keyframe selection that improves long-video understanding in MLLMs and can outperform GPT-4o with 64 frames.
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.
citing papers explorer
-
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.