arxiv: 2605.21954 · v1 · pith:WBX46RPTnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

Pith reviewed 2026-05-22 07:09 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multimodal large language models · video temporal grounding · attention mechanisms · prefill stage · temporal localization · inference-time intervention · perception-generation gap

The pith

MLLMs identify the correct video time interval in prefill attention but ignore it during answer generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models internally detect the timing of queried events in videos during the initial prefill computation. Specific attention heads focus strongly on the relevant segment at this stage. Yet when the model starts generating its response, attention shifts to other parts of the video, causing inaccurate timestamp outputs. To address this, the authors extract the focused interval from those heads and re-run the model using only that portion of the video. This read-then-regenerate step boosts performance on temporal grounding tasks across several models and benchmarks without any additional training.

Core claim

MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework that converts TG-Head prefill attention into a debiased frame-level relevance signal, extracts the high-attention interval, and re-invokes the MLLM with visual context restricted to this interval.
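As a rough illustration of the "extract the high-attention interval" step, the sketch below grows a contiguous window around the peak of a frame-level relevance signal. The function name `high_attention_interval`, the `rel_threshold` parameter, and the peak-relative thresholding rule are assumptions made for illustration; this page does not state the paper's actual detection rule.

```python
from typing import List, Tuple


def high_attention_interval(
    relevance: List[float],
    frame_times: List[float],
    rel_threshold: float = 0.5,  # hypothetical: fraction of the peak value
) -> Tuple[float, float]:
    """Return (start, end) timestamps of the contiguous region around the peak.

    Illustrative rule only: starting at the highest-scoring frame, extend left
    and right while neighbouring frames stay above rel_threshold * peak.
    """
    if not relevance or len(relevance) != len(frame_times):
        raise ValueError("relevance and frame_times must be non-empty and equal length")

    peak = max(range(len(relevance)), key=relevance.__getitem__)
    cutoff = rel_threshold * relevance[peak]

    lo = peak
    while lo > 0 and relevance[lo - 1] >= cutoff:
        lo -= 1
    hi = peak
    while hi < len(relevance) - 1 and relevance[hi + 1] >= cutoff:
        hi += 1
    return frame_times[lo], frame_times[hi]


# Toy signal sampled every 2 seconds.
interval = high_attention_interval(
    [0.1, 0.2, 0.9, 1.0, 0.8, 0.1], [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
)
```

A single-peak rule like this would not separate two distinct peaks that both exceed the cutoff and are adjacent, which is the kind of situation the multi-peak failure case in Figure 15 describes.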

What carries the argument

Temporal Grounding Heads (TG-Heads): a sparse set of attention heads whose query-to-video attention during prefill concentrates on the ground-truth event interval.

If this is right

  • Restricting visual input to the extracted high-attention interval improves timestamp accuracy on VTG benchmarks.
  • The method works across multiple MLLMs without any parameter updates or architectural changes.
  • Video cropping or attention masking suppresses query-irrelevant segments during the regenerate step.
  • Gains reach up to +3.5 mIoU on three standard VTG benchmarks for models including MiMo-VL-7B and Qwen3-VL-8B.
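For readers unfamiliar with the metric behind the gains above, the following is a minimal sketch of temporal IoU and its mean over a set of predictions. The function names are illustrative, and the convention that benchmark tables report mIoU scaled to 0–100 is an assumption about how the "+3.5" figure should be read.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Interval], gts: Sequence[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths (0 to 1)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


# One perfect prediction and one complete miss average to 0.5.
score = mean_iou([(0.0, 10.0), (0.0, 10.0)], [(0.0, 10.0), (20.0, 30.0)])
```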

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prefill attention signal could be harvested for other video understanding tasks where models appear to perceive details they fail to use in answers.
  • Preserving prefill temporal focus into the decoding phase might reduce the need for post-hoc recovery in future MLLM designs.
  • Testing the extraction on longer untrimmed videos would reveal whether distractor suppression scales when many candidate intervals compete for attention.

Load-bearing premise

The high-attention interval extracted from TG-Head prefill attention reliably contains the ground-truth event and restricting visual context to it does not remove information necessary for the query.

What would settle it

The claim would be undercut if the extracted high-attention interval showed low overlap with ground-truth timestamps on benchmark videos, or if re-invoking the model on only that interval produced worse rather than better timestamp predictions.
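One way such a check could be run is to measure how much of a head's frame-level attention mass falls inside the annotated interval and compare it with what uniform attention would give. The sketch below is a hedged illustration: the function names and the choice of a uniform-attention baseline are assumptions, loosely modelled on the "attention ratio" versus "random baseline" comparison mentioned in the Figure 5 caption.

```python
from typing import List, Tuple


def gt_attention_ratio(
    frame_attention: List[float],
    frame_times: List[float],
    gt: Tuple[float, float],
) -> float:
    """Fraction of total attention mass on frames inside the ground-truth interval."""
    total = sum(frame_attention)
    if total <= 0:
        return 0.0
    inside = sum(a for a, t in zip(frame_attention, frame_times) if gt[0] <= t <= gt[1])
    return inside / total


def uniform_baseline_ratio(frame_times: List[float], gt: Tuple[float, float]) -> float:
    """Ratio a head would score if it spread attention evenly over all frames."""
    if not frame_times:
        return 0.0
    return sum(1 for t in frame_times if gt[0] <= t <= gt[1]) / len(frame_times)


# Toy example: 4 frames, ground truth covers the last two.
attn = [0.1, 0.1, 0.6, 0.2]
times = [0.0, 1.0, 2.0, 3.0]
ratio = gt_attention_ratio(attn, times, (2.0, 3.0))
baseline = uniform_baseline_ratio(times, (2.0, 3.0))
```

A ratio that stays close to the baseline across a benchmark would point toward the premise failing; a ratio well above it would support it.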

Figures

Figures reproduced from arXiv: 2605.21954 by Dazhao Du, Eric Liu, Jian Liu, Liao Duan, Song Guo, Tao Han, Xi Chen, Yujia Zhang.

Figure 1: MLLMs know when during prefill but forget during decoding. We visualize Qwen3-VL’s cross-modal attention at inference. During prefill (left), attention from query tokens peaks at the ground-truth interval. During decoding (right), attention from the generated answer tokens drifts away to a visually salient but query-irrelevant segment, which aligns with the model’s erroneous prediction. Our method leverag…
Figure 2: Grounding Contribution Score (GCS) of each attention head across three MLLMs. Each dot is one head, and the top-K heads are marked with stars.
… where $Q_{\ell,h}$ and $K_{\ell,h}$ are the query and key matrices of head $(\ell, h)$, $d_k$ is the per-head feature dimension that normalizes the dot-product magnitude, and $M_{\ell,h}$ is an additive mask with the same shape as $Q_{\ell,h} K_{\ell,h}^{\top}$ that controls which attention edges are allowed. In a …
Figure 3: Overview of our two-stage framework. Stage 1 runs the MLLM once to obtain a baseline prediction $[\hat{s}^{(1)}, \hat{e}^{(1)}]$ and, in parallel, processes the prefill attention of the TG-Heads through (1) extraction, (2) entropy-based aggregation, (3) contrastive debiasing against a blank-video reference, and (4) high-attention interval detection. (5) A confidence gate decides whether the baseline prediction is already…
Figure 4: Qualitative results. Two examples where our method corrects an erroneous baseline prediction. Blue dashed boxes indicate the detected high-attention interval.
Baselines. We apply our framework to three MLLMs: MiMo-VL-7B [29] and Qwen3-VL-8B [2], which are general-purpose models not fine-tuned for VTG, and TimeLens-8B [37], which is post-trained from Qwen3-VL-8B on 100K VTG annotations. Implementation detai…
Figure 5: Analysis on Qwen3-VL-8B and TimeLens-8B (QVHighlights-TimeLens). (a) Performance vs. number of TG-Heads K. (b) Attention ratio of the ground-truth interval in each of the top-5 TG-Heads, compared to a random baseline (dashed line). (c, d) Stage 1 mIoU bucketed by decode confidence (c) and attention confidence (d); numbers above bars are sample counts.
Figure 6: Additional motivation example. Query attention (prefill) correctly focuses on the ground…
Figure 7: Additional motivation example. Query attention (prefill) sharply peaks at the ground-truth…
Figure 8: Per-token attention distributions. Function and generic tokens (“A”, “man”) yield high-entropy, uninformative curves, while content tokens (“exercising”, “car”) yield low-entropy curves that sharply peak inside the ground-truth interval. Entropy-based weighting prioritizes these informative tokens.
… positive contributions, the Debiased Attention curve (bottom) removes the attention sink and reveals a clean,…
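The entropy-based weighting mentioned in the Figure 8 caption can be pictured with the sketch below, which down-weights query tokens whose attention over frames is close to uniform. The specific weight `1 - normalized entropy` and the function names are assumptions chosen for illustration; the paper's actual weighting formula is not given on this page.

```python
import math
from typing import List


def normalized_entropy(curve: List[float]) -> float:
    """Shannon entropy of a non-negative curve, scaled to [0, 1] by log(len)."""
    total = sum(curve)
    n = len(curve)
    if n < 2 or total <= 0:
        return 1.0  # treat degenerate curves as uninformative
    h = 0.0
    for value in curve:
        p = value / total
        if p > 0:
            h -= p * math.log(p)
    return h / math.log(n)


def entropy_weighted_aggregate(token_curves: List[List[float]]) -> List[float]:
    """Combine per-token frame-attention curves, favouring low-entropy tokens."""
    if not token_curves:
        raise ValueError("token_curves must not be empty")
    # Hypothetical weighting: peaked (low-entropy) curves get weights near 1.
    weights = [max(0.0, 1.0 - normalized_entropy(c)) for c in token_curves]
    weight_sum = sum(weights)
    if weight_sum <= 0:  # every token is uninformative: fall back to a plain mean
        weights = [1.0] * len(token_curves)
        weight_sum = float(len(token_curves))
    n_frames = len(token_curves[0])
    return [
        sum(w * c[t] for w, c in zip(weights, token_curves)) / weight_sum
        for t in range(n_frames)
    ]


# A flat curve (generic token) and a peaked curve (content token).
combined = entropy_weighted_aggregate([[0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 1.0, 0.0]])
```

In this toy case the flat curve receives a weight of roughly zero, so the combined signal is dominated by the peaked content-token curve.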
Figure 9: Effect of contrastive attention debiasing. Top: Positive attention on the real video exhibits both a correct peak near the ground-truth interval and a spurious peak at the beginning of the video (attention sink). Middle: Zero attention on a blank video reveals that the beginning-of-video peak is a content-independent bias intrinsic to the model. Bottom: After contrastive debiasing, the bias is removed and …
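The idea in the Figure 9 caption, removing a content-independent peak by comparing against a blank-video reference, can be sketched as a per-frame subtraction. The subtraction form, the clipping at zero, and the `alpha` scaling factor are assumptions for illustration; the caption does not give the exact debiasing formula.

```python
from typing import List


def contrastive_debias(
    real_curve: List[float],
    blank_curve: List[float],
    alpha: float = 1.0,  # hypothetical strength of the blank-video correction
) -> List[float]:
    """Subtract blank-video attention from real-video attention, clipped at zero."""
    if len(real_curve) != len(blank_curve):
        raise ValueError("curves must cover the same number of frames")
    return [max(0.0, r - alpha * b) for r, b in zip(real_curve, blank_curve)]


# Real video: sink at frame 0 plus a genuine peak at frame 3.
# Blank video: only the sink at frame 0.
debiased = contrastive_debias(
    [0.5, 0.1, 0.1, 0.6, 0.1],
    [0.5, 0.1, 0.1, 0.1, 0.1],
)
```

After subtraction the beginning-of-video peak vanishes in this toy example while the peak at frame 3 survives, which mirrors the behaviour the caption attributes to the bottom panel.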
Figure 10: Sorted mIoU drop curves under single-head pruning for four MLLMs. The drop is heavy-tailed: the top few heads dominate grounding performance, while the remaining heads are largely redundant or noisy.
… in the middle-to-deep layers (layers 15–29 for the 7B and 8B models), suggesting that temporal grounding relies on high-level cross-modal reasoning rather than low-level feature extraction. Second, the identi…
Figure 11: Grounding Contribution Score (GCS) for Qwen3-VL-4B and Qwen3-VL-32B. Consistent with the models reported in the main paper, the temporal grounding capability remains highly concentrated in a sparse set of TG-Heads, confirming that this heavy-tailed pattern generalizes across model scales.
… itself and correlate with localization accuracy (…
Figure 12: Successful correction (Example 1). The baseline predicts an overly broad interval that covers almost the entire video, while the TG-Head attention concentrates on a narrow high-attention region. Restricting the visual context to this region allows Stage 2 to localize the event much more precisely.
J Efficiency Analysis. Our framework adds two extra forward passes to the baseline single-pass inference: one …
Figure 13: Successful correction (Example 2). The baseline produces an inaccurate interval, while our method tightens it around the true event by leveraging the clean prefill attention signal.
Query: A person gets up and leaves the room
Figure 14: Successful correction (Example 3). The baseline already overlaps substantially with the ground truth but remains imprecise. By cropping to the high-attention interval, Stage 2 reduces the number of frames and increases per-frame resolution, yielding a prediction that aligns exactly with the ground-truth interval.
K More Successful Examples. We provide additional qualitative examples beyond those in the mai…
Figure 15: Failure case 1 (multi-peak). The query mentions a “small plastic bowl” that appears at two separate moments. Stage 1 correctly picks the later peak (ground truth), but the detected high-attention interval encompasses both peaks and Stage 2 is pulled to the earlier, higher-amplitude one.
Query: A person rides a bike past the camera
Figure 16: Failure case 2 (absent attention at GT). The ground-truth interval receives near-zero attention, while both the baseline and our method predict a high-attention but incorrect segment. When TG-Head attention does not correlate with the ground truth, our framework cannot provide a correction.
L Failure Cases. While our framework consistently improves average performance, there are individual samples on which…
Original abstract

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates and architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that MLLMs exhibit a perception-generation gap in video temporal grounding: during the prefill stage, a sparse set of attention heads (termed TG-Heads) concentrates query-to-video attention on the ground-truth interval, but this signal degrades during autoregressive answer generation. The authors introduce an inference-time read-then-regenerate framework that converts TG-Head prefill attention into a debiased frame-level relevance signal, extracts the high-attention interval, and re-invokes the model with visual context restricted to that interval (via cropping or masking). This yields consistent gains of up to +3.5 mIoU on three VTG benchmarks across MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B without parameter updates or architectural changes.

Significance. If the central observation and framework hold, the work provides a concrete, training-free mechanism to recover temporal grounding signals already present in MLLMs, addressing a practical limitation in video understanding. The empirical gains across multiple models and benchmarks, combined with the identification of TG-Heads, could inform future attention-based interventions in multimodal models. The absence of post-training or new parameters is a notable strength for deployment scenarios.

major comments (2)
  1. [Abstract] Abstract and method description: the claim of a purely training-free approach is not fully supported because the abstract does not specify an unsupervised criterion for selecting TG-Heads. If head identification relies on measuring attention overlap with ground-truth intervals on any reference or calibration set, this introduces an indirect dependency on temporal annotations, undermining the contrast with post-training methods and potentially affecting the reported gains on MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B.
  2. [Method] The weakest assumption (high-attention interval from TG-Head prefill reliably contains the ground-truth event without removing necessary query information) is load-bearing for the framework's validity. The manuscript should include an ablation or analysis showing that restricting context to the extracted interval does not degrade performance on queries where distractors are actually relevant.
minor comments (2)
  1. [Method] Clarify the exact debiasing procedure for the frame-level relevance signal and any thresholds used for interval extraction, as these choices could influence reproducibility.
  2. [Abstract] The project website link is provided but no code or attention extraction details are referenced in the abstract; consider adding a pointer to supplementary material for the TG-Head identification process.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive major comments. We address each point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and method description: the claim of a purely training-free approach is not fully supported because the abstract does not specify an unsupervised criterion for selecting TG-Heads. If head identification relies on measuring attention overlap with ground-truth intervals on any reference or calibration set, this introduces an indirect dependency on temporal annotations, undermining the contrast with post-training methods and potentially affecting the reported gains on MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B.

    Authors: We appreciate this clarification request. TG-Head selection uses an unsupervised criterion based on attention sparsity (lowest entropy over video tokens during prefill), performed once per model on a minimal calibration set of 10-20 examples disjoint from all evaluation benchmarks. No ground-truth temporal annotations are used at any stage, and no parameters are updated. This preserves the training-free character relative to post-training methods. We will revise the abstract and method section to explicitly state this criterion and its separation from test data. revision: yes

  2. Referee: [Method] The weakest assumption (high-attention interval from TG-Head prefill reliably contains the ground-truth event without removing necessary query information) is load-bearing for the framework's validity. The manuscript should include an ablation or analysis showing that restricting context to the extracted interval does not degrade performance on queries where distractors are actually relevant.

    Authors: We agree this assumption requires explicit validation. We have added an ablation on queries containing visually similar but temporally irrelevant distractors. Restricting context to the TG-Head interval improves mIoU in these cases by suppressing noise while retaining query-relevant frames, with no degradation observed. We will include the quantitative results, qualitative examples, and analysis in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

Derivation is self-contained empirical observation without circular reduction

full rationale

The paper's central contribution rests on an empirical probe of cross-modal attention during the prefill stage, where a sparse subset of heads is observed to concentrate query-to-video attention on ground-truth intervals; this pattern is then used at inference time to derive a frame-level relevance signal and restrict visual context. No equation, definition, or selection step in the abstract reduces the extracted interval to a fitted parameter, self-referential construction, or load-bearing self-citation. TG-Head identification is presented as an observed property rather than a tautological renaming or ansatz smuggled via prior work, and the inference framework operates directly on the model's internal signals without re-deriving the target from the same annotations by construction. The chain therefore remains independent of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the existence and reliability of sparse TG-Heads in standard transformer attention and on the assumption that prefill attention can be converted into a debiased frame-level signal without additional learned parameters.

axioms (1)
  • [domain assumption] Transformer attention heads in MLLMs encode temporal grounding information during prefill that is accessible via query-to-video attention maps.
    Invoked in the description of TG-Heads and the perception-generation gap.
invented entities (1)
  • Temporal Grounding Heads (TG-Heads) [no independent evidence]
    purpose: Sparse attention heads that concentrate on the ground-truth interval during prefill.
    Introduced to explain the observed attention pattern; no independent evidence outside the paper's attention analysis.

pith-pipeline@v0.9.0 · 5870 in / 1359 out tokens · 37354 ms · 2026-05-22T07:09:22.608621+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 8 internal anchors

  [1] Ingeol Baek, Hwan Chang, Sunghyun Ryu, and Hwanhee Lee. How do large vision-language models see text in image? Unveiling the distinctive role of OCR heads. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20452–20464, 2025.
  [2] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
  [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  [4] Jing Bi, Junjia Guo, Yunlong Tang, Lianggong Bruce Wen, Zhang Liu, Bingjie Wang, and Chenliang Xu. Unveiling visual perception in language models: An attention head analysis approach. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4135–4144, 2025.
  [5] Ruizhe Chen, Tianze Luo, Zhiting Fan, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, and Zhang Huaijian. Datasets and recipes for video temporal grounding via reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 983–992, 2025.
  [6] Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, et al. Tempura: Temporal event masked prediction and understanding for reasoning in action. arXiv preprint arXiv:2505.01583, 2025.
  [7] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  [8] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.
  [9] Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Akide Liu, Bohan Zhuang, Jianfei Cai, and Hamid Rezatofighi. An empirical study on how video-LLMs answer video questions. arXiv preprint arXiv:2508.15360, 2025.
  [10] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.
  [11] Yanbei Jiang, Xueqi Ma, Shu Liu, Sarah Monazam Erfani, Tongliang Liu, James Bailey, Jey Han Lau, and Krista A Ehinger. Investigating the functional roles of attention heads in vision language models: Evidence for reasoning modules. arXiv preprint arXiv:2512.10300, 2025.
  [12] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. arXiv preprint arXiv:2503.03321, 2025.
  [13] Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. Your large vision-language model only needs a few attention heads for visual grounding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9339–9350, 2025.
  [14] Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, and Seong Jae Hwang. Interpreting attention heads for image-to-text information flow in large vision-language models. arXiv preprint arXiv:2509.17588, 2025.
  [15] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706–715, 2017.
  [16] Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021.
  [17] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori B Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, 2023.
  [18] Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, et al. VideoChat-Flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024.
  [19] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.
  [20] Yulin Li, Haokun Gui, Ziyang Fan, Junjie Wang, Bin Kang, Bin Chen, and Zhuotao Tian. Less is more, but where? Dynamic token compression via LLM-guided keyframe prior. arXiv preprint arXiv:2512.06866, 2025.
  [21] Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, and Ming-Ming Cheng. Tempsamp-R1: Effective temporal sampling with reinforcement fine-tuning for video LLMs. arXiv preprint arXiv:2509.18056, 2025.
  [22] Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, et al. Seeing but not believing: Probing the disconnect between visual attention and answer correctness in VLMs. arXiv preprint arXiv:2510.17771, 2025.
  [23] Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. ChatVTG: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.
  [24] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025.
  [25] Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. Grounded-VideoLLM: Sharpening fine-grained temporal grounding in video large language models. arXiv preprint arXiv:2410.03290, 2024.
  [26] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-R1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.
  [27] Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, and Chang Wen Chen. A survey on video temporal grounding with multimodal large language model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  [28] Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, and Xu Yang. Number it: Temporal grounding videos like flipping manga. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13754–13765, 2025.
  [29] LLM-Core-Team Xiaomi. MiMo-VL technical report, 2025. URL https://arxiv.org/abs/2506.03569
  [30] Yifang Xu, Yunzhuo Sun, Zien Xie, Benxiang Zhai, and Sidan Du. VTG-GPT: Tuning-free zero-shot video temporal grounding with GPT. Applied Sciences, 14(5):1894, 2024.
  [31] Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, and Sidan Du. Zero-shot video moment retrieval via off-the-shelf multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 8978–8986, 2025.
  [32] Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. arXiv preprint arXiv:2406.15765, 2024.
  [33] Feng Yue, Zhaoxing Zhang, Junming Jiao, Zhengyu Liang, Shiwen Cao, Feifei Zhang, and Rong Shen. Tempo-R0: A video-MLLM for temporal video grounding through efficient temporal sensing reinforcement learning. arXiv preprint arXiv:2507.04702, 2025.
  [34] Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. TimeSuite: Improving MLLMs for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024.
  [35] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, 2023.
  [36] Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. arXiv preprint arXiv:2502.17422, 2025.
  [37] Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, and Limin Wang. TimeLens: Rethinking video temporal grounding with multimodal LLMs. arXiv preprint arXiv:2512.14698, 2025.
  [38] Zhi Zhang, Srishti Yadav, Fengze Han, and Ekaterina Shutova. Cross-modal information flow in multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19781–19791, 2025.
  [39] Minghang Zheng, Xinhao Cai, Qingchao Chen, Yuxin Peng, and Yang Liu. Training-free video temporal grounding using large-scale pre-trained models. In European Conference on Computer Vision, pages 20–37. Springer, 2024.

A Additional Motivation Examples

To supplement Figure 1 in the main paper, we provide two additional motivating examples in Figures 6 and...