Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

· 2025 · cs.CV · arXiv 2508.03337

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While keyframe selection is the dominant strategy for mitigating this, we identify a critical flaw: even state-of-the-art selectors produce prompts suffering from significant temporal redundancy, a challenge unique to video that we term 'visual echoes'. This issue leads to context dilution and can paradoxically degrade performance. To address this dual challenge, we propose a novel refinement framework that synergistically combines Adaptive Frame-Pruning(AFP) with a lightweight text-based semantic graph. AFP intelligently prunes 'visual echoes' by adaptively clustering frames, while the semantic graph provides crucial, low-cost semantic compensation. Conducting extensive experiments on the LongVideoBench and Video-MME benchmarks against multiple state-of-the-art selectors, our approach demonstrates a drastic reduction in total input tokens by up to 82.2%. Crucially, by creating a concise, high-quality prompt, our framework not only enhances efficiency but also demonstrates a remarkable ability to robustify and improve the accuracy of upstream selectors, achieving results that are highly competitive with, and often superior to, baselines that use vastly more frames.

representative citing papers

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

citing papers explorer

Showing 1 of 1 citing paper.

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding cs.CV · 2026-04-19 · unverdicted · none · ref 32 · internal anchor
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

fields

years

verdicts

representative citing papers

citing papers explorer