arXiv preprint arXiv:2407.15047

End-to-End Video Question Answering with Frame Scoring Mechanisms · 2024 · arXiv 2407.15047

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling

cs.CV · 2026-03-24 · unverdicted · novelty 6.0

ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

cs.CV · 2025-12-09 · unverdicted · novelty 6.0

OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, with reported gains such as matching GPT-5-level performance on MLVU while running efficiently on standard GPUs.

CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CRAFT introduces a query-conditioned pipeline with dynamic keyframe selection, ASR, and a hybrid critic loop that achieves top scores on MAGMaR 2026 for grounded multi-video question answering.

Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

citing papers explorer

ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling · cs.CV · 2026-03-24 · unverdicted · none · ref 28
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval · cs.CV · 2025-12-09 · unverdicted · none · ref 23
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering · cs.CV · 2026-05-18 · unverdicted · none · ref 63
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration · cs.CV · 2026-05-01 · unverdicted · none · ref 15
