pith. machine review for the scientific record.

arxiv: 2605.04870 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: unknown

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords Video TextVQA · keyframe anchoring · evidence localization · multimodal agents · Video-LLMs · visual question answering · video understanding · reinforcement learning

The pith

Localizing question-relevant keyframes is the main bottleneck in Video TextVQA, not reasoning capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video TextVQA models underperform on benchmarks because they struggle to identify which frames contain the textual evidence required to answer a given question. A frame-wise upper-bound analysis, treating a sample as correct if any single frame yields the right answer, produces markedly higher accuracy than standard full-video inference. This gap indicates that evidence localization, rather than the models' ability to reason over text once found, limits performance. The paper introduces a question-guided agent that first anchors the relevant keyframes and then performs answering on those frames. The approach improves results in a training-free regime and, after supervised fine-tuning plus reinforcement learning, delivers average gains of 12.12 accuracy points and 11.15 ANLS points while setting new state-of-the-art numbers.
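
A minimal sketch of the frame-wise upper-bound evaluation described above, assuming a hypothetical `answer_question(frame, question)` wrapper around the Video-LLM and a caller-supplied correctness check; the paper's exact evaluation harness may differ.

```python
from typing import Callable, Sequence

def framewise_upper_bound(
    samples: Sequence[dict],
    answer_question: Callable[[object, str], str],
    is_correct: Callable[[str, Sequence[str]], bool],
) -> float:
    """Count a sample as solved if ANY single frame yields a correct answer.

    Each sample is assumed to carry 'frames', 'question', and 'answers'
    (an illustrative layout). The gap between this score and full-video
    inference is what the review attributes to evidence localization.
    """
    solved = 0
    for sample in samples:
        for frame in sample["frames"]:
            prediction = answer_question(frame, sample["question"])
            if is_correct(prediction, sample["answers"]):
                solved += 1
                break  # one solvable frame is enough for the upper bound
    return solved / max(len(samples), 1)
```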

Core claim

The primary bottleneck in Video TextVQA lies in the localization of key question-relevant evidence frames rather than in the models' reasoning capacity itself. A question-guided agent framework called VTAgent explicitly anchors these keyframes before answering, operating effectively without training and, when combined with supervised fine-tuning and reinforcement learning, producing average improvements of +12.12 in accuracy and +11.15 in ANLS across benchmarks while establishing new state-of-the-art results.

What carries the argument

VTAgent, a question-guided agent framework that explicitly anchors relevant keyframes in a video before performing question answering on the selected frames.
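
The two-stage flow this implies, sketched in the training-free regime: select question-relevant keyframes first, then answer conditioned only on those frames. The method names (`select_keyframes`, `answer_on_frames`) and the fixed frame budget are illustrative assumptions, not the paper's actual interface.

```python
def vtagent_style_inference(video_frames, question, model, max_keyframes=8):
    """Evidence-aware answering: anchor keyframes, then reason over them.

    `model` is assumed to expose two calls backed by the same Video-LLM:
    one that scores frames against the question and returns the top-k,
    and one that answers from a small set of frames. Both are placeholders.
    """
    # Stage 1: question-guided keyframe anchoring.
    keyframes = model.select_keyframes(video_frames, question, k=max_keyframes)

    # Stage 2: keyframe-conditioned answering over the anchored evidence only.
    return model.answer_on_frames(keyframes, question)
```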

If this is right

  • Frame-wise upper-bound performance significantly exceeds direct video-based inference, confirming localization as the dominant limitation.
  • The agent framework surpasses direct video inference even in a training-free setting.
  • Supervised fine-tuning combined with reinforcement learning on the anchored keyframes yields +12.12 accuracy and +11.15 ANLS average gains.
  • New state-of-the-art results are established on existing Video TextVQA benchmarks.
  • Explicit keyframe anchoring plays a critical role in advancing Video TextVQA performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separation of localization and reasoning stages could be applied to other video tasks where relevant information appears only in sparse temporal windows.
  • Integrating the anchoring step directly into the architecture of future Video-LLMs might reduce reliance on post-hoc agent orchestration.
  • Evaluating the method on videos longer than those in current benchmarks would test whether the agent's localization remains effective as temporal span increases.

Load-bearing premise

That the frame-wise upper-bound analysis validly isolates localization as the sole bottleneck and that the agent can reliably select those frames without introducing new errors.

What would settle it

Replacing the agent's selected keyframes with randomly chosen frames of the same count and measuring whether accuracy gains disappear would falsify the claim that precise evidence localization drives the observed improvements.
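
A sketch of that control under the same hypothetical interface as the earlier snippets: swap the anchored keyframes for a uniformly random set of the same size and compare the resulting accuracy.

```python
import random

def random_frame_control(video_frames, question, model, num_keyframes):
    """Ablation: answer from randomly chosen frames of the same count.

    If accuracy with the random frames matches accuracy with the agent's
    anchored keyframes, precise evidence localization is not what drives
    the reported gains; if it collapses, the localization claim holds up.
    """
    k = min(num_keyframes, len(video_frames))
    random_frames = random.sample(list(video_frames), k)
    return model.answer_on_frames(random_frames, question)
```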

Figures

Figures reproduced from arXiv: 2605.04870 by Bo Du, Haibin He, Jing Zhang, Juhua Liu, Maoyuan Ye.

Figure 1: Illustration of the key motivation and effectiveness of VTAgent.

Figure 2: Overview of VTAgent. VTAgent performs keyframe anchoring and keyframe-conditioned reasoning to generate reliable answers from identified keyframes.

Figure 3: Training pipeline of VTAgent.

Figure 5: Analysis of VTAgent. (a) demonstrates high keyframe hit rates, confirming reliable evidence localization; (b) shows superior answer accuracy over the baseline under varying task difficulties, highlighting effective reasoning with keyframe anchoring.
Original abstract

Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
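
For readers unfamiliar with the second metric: the ANLS figures quoted in the abstract refer to Average Normalized Levenshtein Similarity, the standard soft-match score for TextVQA. A minimal implementation with the conventional 0.5 threshold is sketched below; the paper's own ground-truth aggregation may differ in detail.

```python
def _levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a dataset.

    `ground_truths[i]` is the list of acceptable answers for question i;
    per question we keep the best per-answer similarity, zeroed out when
    the normalized edit distance reaches the threshold.
    """
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for gt in answers:
            p, g = pred.strip().lower(), gt.strip().lower()
            nl = _levenshtein(p, g) / max(len(p), len(g), 1)
            sim = 1.0 - nl if nl < threshold else 0.0
            best = max(best, sim)
        scores.append(best)
    return sum(scores) / max(len(scores), 1)
```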

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that Video-LLMs underperform on Video TextVQA primarily due to failure to localize question-relevant keyframes rather than insufficient reasoning capacity. This is supported by a frame-wise upper-bound analysis (counting a sample correct if any single frame yields the right answer) that substantially outperforms direct video inference. The authors propose VTAgent, a question-guided agent framework for explicit keyframe anchoring that operates training-free and surpasses direct inference; with additional SFT and RL it yields average gains of +12.12 accuracy and +11.15 ANLS, establishing new SOTA results on benchmarks. Code release is promised.

Significance. If the upper-bound analysis cleanly isolates localization as the bottleneck and the agent reliably anchors frames without new errors, the work offers a useful diagnostic insight and practical evidence-aware approach for Video TextVQA, shifting emphasis toward explicit localization in multimodal video models. The empirical SOTA claims and planned public code release would strengthen reproducibility and impact if the central assumption holds.

major comments (1)
  1. [Upper-bound analysis] Upper-bound analysis (described in abstract and early sections): The claim that the frame-wise upper-bound isolates localization failure as the sole bottleneck rests on the assumption that the performance gap versus full-video inference arises purely from missing the right frame. However, processing isolated frames (as image inputs) may alter attention patterns, temporal modeling, preprocessing, or context handling compared to video-length inputs, confounding the isolation. This assumption is load-bearing for the paper's central insight and the motivation for the agent framework; additional controls or ablations comparing single-frame vs. video regimes are needed to validate it.
minor comments (1)
  1. [Experimental results] The abstract reports numerical gains and SOTA status but provides limited detail on experimental controls, baseline implementations, statistical significance, or run-to-run variance; expanding these in the results section would improve clarity without altering the core claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and insightful feedback on our manuscript. The concern regarding the upper-bound analysis is well-taken, and we address it directly below while committing to revisions that enhance the rigor of our claims.

Point-by-point responses
  1. Referee: [Upper-bound analysis] Upper-bound analysis (described in abstract and early sections): The claim that the frame-wise upper-bound isolates localization failure as the sole bottleneck rests on the assumption that the performance gap versus full-video inference arises purely from missing the right frame. However, processing isolated frames (as image inputs) may alter attention patterns, temporal modeling, preprocessing, or context handling compared to video-length inputs, confounding the isolation. This assumption is load-bearing for the paper's central insight and the motivation for the agent framework; additional controls or ablations comparing single-frame vs. video regimes are needed to validate it.

    Authors: We acknowledge that input regime differences—such as attention patterns, temporal modeling, preprocessing, and context handling—between isolated frames and full video sequences represent a potential confounding factor in the upper-bound analysis. Nevertheless, the analysis still provides meaningful evidence that the underlying Video-LLM possesses sufficient reasoning capacity when the relevant evidence is explicitly available, as the same model backbone is used in both regimes and the gap persists across multiple benchmarks and architectures. This supports our central motivation that explicit keyframe anchoring can address a primary bottleneck. To directly address the referee's concern, we will revise the manuscript to include: (1) expanded discussion of these potential differences and the limitations of the frame-wise upper bound; (2) new ablations that apply video-consistent preprocessing (e.g., temporal positional encodings and uniform frame sampling) to single-frame inputs for direct comparison against standard video inference; and (3) quantitative results showing that even under matched preprocessing the localization gap remains substantial. These additions will strengthen the isolation of the localization insight and better motivate the VTAgent framework. revision: yes
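
One way such a matched-preprocessing control could be wired up is sketched here; `video_preprocess`, the frame-tiling choice, and the grid size are assumptions for illustration, not the authors' stated protocol.

```python
def single_frame_matched_control(frame, video_preprocess, num_positions=32):
    """Hypothetical control: push one frame through the same video path
    (uniform sampling grid, temporal positional encodings) used for
    full-video inference, so the input regime matches and only the
    visual evidence differs between the two conditions.
    """
    # Tile the single frame across the temporal grid the video path expects.
    clip = [frame] * num_positions
    return video_preprocess(clip)
```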

Circularity Check

0 steps flagged

No significant circularity in empirical analysis or framework

full rationale

The paper conducts an empirical frame-wise upper-bound analysis on existing benchmarks, observes a performance gap versus direct video inference, and proposes an agentic keyframe anchoring framework whose gains are measured via SFT/RL on the same external benchmarks. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claim (localization as bottleneck) rests on direct measurement rather than any construction that reduces the result to its own inputs by definition. This is a standard empirical paper whose derivation chain is observation-to-proposal supported by independent test-set results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced beyond standard components of existing Video-LLMs and agent frameworks.

pith-pipeline@v0.9.0 · 5515 in / 1098 out tokens · 21649 ms · 2026-05-08T16:28:18.743207+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 11 canonical work pages · 7 internal anchors

  1. [1] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report
  2. [2] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
  3. [3] Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., Chen, X., Chen, Z., Chen, Z., Chu, P., et al.: InternLM2 technical report. arXiv preprint arXiv:2403.17297 (2024)
  4. [4] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
  5. [5] Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., et al.: VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. CoRR (2024)
  6. [6] Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.W.: OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles. In: NeurIPS (2025)
  7. [7] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
  8. [8] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645(8081), 633–638 (2025)
  9. [9] He, H., Ye, M., Zhang, J., Liu, J., Du, B., Tao, D.: GoMatching: A simple baseline for video text spotting via long and short term matching. NeurIPS 37, 25663–25686 (2024)
  10. [10] He, H., Zhang, J., Ye, M., Liu, J., Du, B., Tao, D.: GoMatching++: Parameter- and data-efficient arbitrary-shaped video text spotting and benchmarking. arXiv preprint arXiv:2505.22228 (2025)
  11. [11] He, H., Zhong, Q., Liu, J., Du, B., Wang, P., Zhang, J.: SFA: Scan, focus, and amplify toward guidance-aware answering for video TextVQA. arXiv preprint arXiv:2511.20190 (2025)
  12. [12] He, Z., Qu, X., Li, Y., Huang, S., Liu, D., Cheng, Y.: FrameThinker: Learning to think with long videos via multi-turn frame spotlighting. arXiv preprint arXiv:2509.24304 (2025)
  13. [13] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR (2022)
  14. [14] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 5971–5984 (2024)
  15. [15] Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: CVPR, pp. 4122–4134 (2025)
  16. [16] Long, X., Tian, K., Xu, P., Jia, G., Li, J., Yang, S., Shao, Y., Zhang, K., Jiang, C., Xu, H., et al.: AdsQA: Towards advertisement video understanding. In: ICCV, pp. 23396–23407 (2025)
  17. [17] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. NeurIPS 28 (2015)
  18. [18] Sanders, K., Etter, D., Kriz, R., Van Durme, B.: MultiVENT: Multilingual videos of events and aligned natural text. NeurIPS 36, 51065–51079 (2023)
  19. [19] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)
  20. [20] Shi, Y., Wang, H., Xie, W., Zhang, H., Zhao, L., Zhang, Y., Li, X., Fu, C., Wen, Z., Liu, W., et al.: MME-VideoOCR: Evaluating OCR-based capabilities of multimodal LLMs in video scenarios. In: NeurIPS (2025)
  21. [21] Tan, H., Ji, Y., Hao, X., Chen, X., Wang, P., Wang, Z., Zhang, S.: Reason-RFT: Reinforcement fine-tuning for visual reasoning of vision language models. In: NeurIPS (2025)
  22. [22] Tom, G., Mathew, M., Garcia-Bordils, S., Karatzas, D., Jawahar, C.: Reading between the lanes: Text VideoQA on the road. In: International Conference on Document Analysis and Recognition, pp. 137–154. Springer (2023)
  23. [23] Wang, J., Ge, Y., Yan, R., Ge, Y., Lin, K.Q., Tsutsui, S., Lin, X., Cai, G., Wu, J., Shan, Y., et al.: All in one: Exploring unified video-language pre-training. In: CVPR, pp. 6598–6608 (2023)
  24. [24] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. CoRR (2024)
  25. [25] Wang, Y., Li, X., Yan, Z., He, Y., Yu, J., Zeng, X., Wang, C., Ma, C., Huang, H., Gao, J., et al.: InternVideo2.5: Empowering video MLLMs with long and rich context modeling. CoRR (2025)
  26. [26] Wu, W., Zhao, Y., Li, Z., Li, J., Zhou, H., Shou, M.Z., Bai, X.: A large cross-modal video retrieval dataset with reading comprehension. Pattern Recognition 157, 110818 (2025)
  27. [27] Yan, R., Guo, W., Lu, Z., Liu, X., Liu, X., Zhang, Y., Yuan, X.: TOM: Boosting TextVQA by capturing text-oriented keypoints. Knowledge-Based Systems, 115480 (2026)
  28. [28] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
  29. [29] Ye, M., Zhang, J., Liu, J., Liu, C., Yin, B., Liu, C., Du, B., Tao, D.: Hi-SAM: Marrying segment anything model for hierarchical text segmentation. IEEE TPAMI 47(3), 1431–1447 (2024)
  30. [30] Ye, M., Zhang, J., Zhao, S., Liu, J., Liu, T., Du, B., Tao, D.: DeepSolo: Let transformer decoder with explicit points solo for text spotting. In: CVPR, pp. 19348–19357 (2023)
  31. [31] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al.: DAPO: An open-source LLM reinforcement learning system at scale. CoRR (2025)
  32. [32] Zablocki, É., Ben-Younes, H., Pérez, P., Cord, M.: Explainability of deep vision-based autonomous driving systems: Review and challenges. IJCV 130(10), 2425–2452 (2022)
  33. [33] Zhang, C., Tao, Y., Du, K., Ding, W., Wang, B., Liu, J., Wang, W.: Character-level street view text spotting based on deep multisegmentation network for smarter autonomous driving. IEEE Transactions on Artificial Intelligence 3(2), 297–308 (2021)
  34. [34] Zhang, Y., Zeng, G., Shen, H., Wu, D., Zhou, Y., Ma, C.: Track the answer: Extending TextVQA from image to video with spatio-temporal clues. In: AAAI, vol. 39, pp. 10275–10283 (2025)
  35. [35] Zhang, Y., Zeng, G., Wu, D., Shen, H., Li, B., Zhou, Y., Ma, C., Bi, X.: Gather and trace: Rethinking video TextVQA from an instance-oriented perspective. In: ACM MM, pp. 876–885 (2025)
  36. [36] Zhao, M., Li, B., Wang, J., Li, W., Zhou, W., Zhang, L., Xuyang, S., Yu, Z., Yu, X., Li, G., et al.: Towards video text visual question answering: Benchmark and baseline. In: NeurIPS, vol. 35, pp. 35549–35562 (2022)
  37. [37] Zhao, Y., Ma, J., Qi, Z., Xie, Z., Luo, Y., Kang, Q., Shan, Y.: VTLayout: A multimodal approach for video text layout. In: ACM MM, pp. 2775–2784 (2023)
  38. [38] Zheng, C., Liu, S., Li, M., Chen, X.H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., et al.: Group sequence policy optimization. arXiv preprint arXiv:2507.18071 (2025)