Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
Video Temporal Grounding (VTG) localizes the temporal boundaries of query-relevant moments in long, untrimmed videos, which makes video-language-model inference prohibitively expensive. While recent training-free token pruning has shown success in video question answering, naively applying these objectives to VTG causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: evidence retention, which keeps query-critical patches, especially around event boundaries, and connectivity strength, which preserves cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and context tokens for scene continuity. Extensive experiments show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% of the visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets. Our code is available at https://github.com/JiaqiLi404/SemVID
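The abstract says per-frame budgets balance query relevance against inter-frame variation so that no segment is over-pruned, but it does not give the formula. The following is a minimal illustrative sketch of one way such an allocation could work, not the SemVID implementation: the function name `allocate_frame_budgets`, the mixing weight `alpha`, the per-frame floor `min_tokens`, and the largest-remainder rounding are all assumptions introduced here for illustration.

```python
import numpy as np


def allocate_frame_budgets(relevance, variation, total_budget,
                           alpha=0.5, min_tokens=1):
    """Split `total_budget` tokens across frames (hypothetical sketch).

    relevance: per-frame query-relevance scores (non-negative).
    variation: per-frame inter-frame variation scores (non-negative).
    alpha: assumed weight trading relevance against variation.
    min_tokens: assumed floor so that no frame is pruned to zero tokens.
    """
    relevance = np.asarray(relevance, dtype=float)
    variation = np.asarray(variation, dtype=float)
    n = relevance.shape[0]
    if variation.shape[0] != n:
        raise ValueError("relevance and variation must have equal length")
    if total_budget < n * min_tokens:
        raise ValueError("total_budget is too small for the per-frame floor")

    def normalize(x):
        # Turn scores into a distribution; fall back to uniform if all zero.
        x = np.clip(x, 0.0, None)
        s = x.sum()
        return x / s if s > 0 else np.full(n, 1.0 / n)

    # Convex combination of the two normalized signals (sums to 1).
    score = alpha * normalize(relevance) + (1.0 - alpha) * normalize(variation)

    # Reserve the floor, then share the remaining tokens proportionally.
    spare = total_budget - n * min_tokens
    ideal = score * spare
    budgets = np.floor(ideal).astype(int)

    # Largest-remainder rounding so the budgets sum exactly to total_budget.
    leftover = int(spare - budgets.sum())
    order = np.argsort(-(ideal - budgets), kind="stable")
    budgets[order[:leftover]] += 1
    return budgets + min_tokens
```

With a target retention ratio such as the 12.5% quoted above, `total_budget` would be that fraction of the video's full visual-token count. The floor is what keeps low-scoring frames from disappearing entirely, which is one plausible reading of "avoid over-pruned segments".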
Forward citations
Cited by 3 Pith papers
- Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning
  SpecFlow represents intermediate visual thoughts in fixed-size DCT space and uses classifier-free guidance to steer updates from textual thoughts, achieving up to 2.1x lower computation and KV cache costs.
- EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
  EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under a fixed budget, with reported gains in accuracy and speed.
- Fast Transformer Inference on ARM-Based HMPSoCs
  Extends ARM-CL with transformer kernels and CPU-GPU cooperative inference to achieve up to 3x faster inference and 15.72% additional latency reduction on ARM HMPSoCs.