Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan · 2025 · arXiv 2506.22139

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.

PEEK: Picking Essential frames via Efficient Knowledge distillation

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

PEEK distills caption-conditioned frame relevance into a lightweight visual model, outperforming adaptive baselines on ActivityNet Captions and MSR-VTT especially at 1-2 frame budgets while adding only 5.2% overhead.

QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

QCA selects compact, query-relevant keyframes from long videos via segment-wise budget allocation and diversity-aware addition, achieving higher accuracy than GPT-4o on LongVideoBench with half the frames.

Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling

cs.CV · 2026-06-23 · unverdicted · novelty 5.0

AdaQ is a training-free adaptive quasi-Gaussian sampling method for keyframe selection that improves long-video understanding in MLLMs and can outperform GPT-4o with 64 frames.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

cs.CV · 2026-06-05 · unverdicted · novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.

citing papers explorer

Showing 6 of 6 citing papers.

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
cs.CV · 2026-04-09 · unverdicted · none · ref 50

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation
cs.CV · 2026-06-02 · unverdicted · none · ref 26

PEEK: Picking Essential frames via Efficient Knowledge distillation
cs.CV · 2026-05-29 · unverdicted · none · ref 38

QCA: Query- and Content-Aware Keyframe Selection for Long Video Understanding
cs.CV · 2026-07-01 · unverdicted · none · ref 38

Towards Fast and Effective Long Video Understanding of Multimodal Large Language Models via Adaptive Quasi-Gaussian Sampling
cs.CV · 2026-06-23 · unverdicted · none · ref 19

Watch, Remember, Reason: Human-View Video Understanding with MLLMs
cs.CV · 2026-06-05 · unverdicted · none · ref 54
