Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding

Jia-Hong Huang; Jiaqi Li; Minzhe Ni; Shuntian Zheng; Xiaoman Lu; Yixian Shen; Yu Guan

arxiv: 2603.05663 · v4 · pith:N7KVDAN4new · submitted 2026-03-05 · 💻 cs.CV

Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, Yu Guan

classification 💻 cs.CV

keywords evidence · tokens · pruning · semvid · cross-frame · temporal · token · training-free

Video Temporal Grounding (VTG) localizes the temporal boundaries of query-relevant moments in long, untrimmed videos, which makes video-language-model inference prohibitively expensive. While recent training-free token pruning has shown success in video question answering, naively applying these objectives to VTG causes drastic degradation, because VTG depends crucially on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: evidence retention, which keeps query-critical patches, especially around event boundaries, and connectivity strength, which preserves cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens that capture meaningful transitions and serve as cross-frame relays, and context tokens for scene continuity. Extensive experiments show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% of the visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets. Our code is available at https://github.com/JiaqiLi404/SemVID
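The per-frame budget step described in the abstract can be illustrated with a small sketch. This is not the SemVID implementation: the abstract gives no formulas, so the function name `allocate_frame_budgets`, the mixing weight `alpha`, the per-frame floor `min_per_frame`, and the linear blend of the two scores are all assumptions made for illustration. It only shows one plausible way to split a global token budget across frames from a query-relevance score and an inter-frame-variation score while keeping every frame above a minimum.

```python
import numpy as np


def allocate_frame_budgets(relevance, variation, keep_ratio, tokens_per_frame,
                           alpha=0.5, min_per_frame=1):
    """Split a global token budget across frames (illustrative only).

    relevance, variation: non-negative per-frame scores (hypothetical inputs).
    keep_ratio: fraction of all visual tokens to keep, e.g. 0.125.
    alpha, min_per_frame: assumed knobs, not taken from the paper.
    """
    relevance = np.asarray(relevance, dtype=float)
    variation = np.asarray(variation, dtype=float)
    num_frames = relevance.shape[0]
    total = int(round(keep_ratio * num_frames * tokens_per_frame))

    def normalize(x):
        s = x.sum()
        # Fall back to a uniform split when all scores are zero.
        return x / s if s > 0 else np.full(len(x), 1.0 / len(x))

    # Assumed blend: a convex combination of the two normalized scores.
    weights = alpha * normalize(relevance) + (1.0 - alpha) * normalize(variation)

    # Reserve a floor for every frame so no segment is pruned away entirely.
    spare = total - min_per_frame * num_frames
    if spare < 0:
        raise ValueError("budget too small for the requested per-frame floor")

    raw = weights * spare
    budgets = min_per_frame + np.floor(raw).astype(int)

    # Largest-remainder rounding: hand leftover tokens to the biggest fractions.
    leftover = int(total - budgets.sum())
    order = np.argsort(-(raw - np.floor(raw)), kind="stable")
    budgets[order[:leftover]] += 1

    # A frame cannot keep more tokens than it has; any surplus is simply
    # dropped in this sketch rather than redistributed.
    return np.minimum(budgets, tokens_per_frame)


budgets = allocate_frame_budgets(
    relevance=[0.1, 0.9, 0.5, 0.1],
    variation=[0.2, 0.2, 0.8, 0.2],
    keep_ratio=0.125,
    tokens_per_frame=64,
)
print(budgets, budgets.sum())
```

With a 12.5% keep ratio, four frames of 64 tokens give a global budget of 32 tokens. The frame that scores well on both relevance and variation receives the largest share, and the floor keeps low-scoring frames from being emptied, which is the over-pruning failure the abstract warns about.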

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning
cs.LG · 2026-06 · unverdicted · novelty 6.0

SpecFlow represents intermediate visual thoughts in fixed-size DCT space and uses classifier-free guidance to steer updates from textual thoughts, achieving up to 2.1x lower computation and KV cache costs.

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
cs.CV · 2026-05 · unverdicted · novelty 6.0

EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.

Fast Transformer Inference on ARM-Based HMPSoCs
cs.AR · 2026-06 · unverdicted · novelty 4.0

Extends ARM-CL with transformer kernels and CPU-GPU cooperative inference to achieve up to 3x faster inference and 15.72% additional latency reduction on ARM HMPSoCs.