FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng · 2025 · arXiv 2509.24304

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

VTAgent uses a question-guided agent to anchor keyframes for evidence-aware Video TextVQA, delivering accuracy gains of up to +12 and new SOTA results, both in training-free operation and with SFT and RL.

VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

cs.CV · 2025-12-17 · unverdicted · novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.

citing papers explorer

Showing 4 of 4 citing papers.

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA · cs.CV · 2026-05-06 · unverdicted · none · ref 12
VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority · cs.CV · 2026-05-12 · unverdicted · none · ref 2
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs · cs.CV · 2026-05-01 · unverdicted · none · ref 28 · 2 links
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning · cs.CV · 2025-12-17 · unverdicted · none · ref 21
