Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Jeongseok Hyun, Sukjun Hwang, Su Ho Han, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Joon-Young Lee, Seon Joo Kim, Minho Shim · 2025 · arXiv 2507.07990

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

cs.CV · 2026-05-05 · unverdicted · novelty 5.0

Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs

cs.CV · 2025-09-09 · unverdicted · novelty 5.0

Video Parallel Scaling improves VideoLLM performance by aggregating outputs from parallel inferences on complementary disjoint frame subsets, effectively contracting the Chinchilla scaling law via uncorrelated visual evidence.

citing papers explorer

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
cs.CV · 2026-05-19 · unverdicted · none · ref 8

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
cs.CV · 2026-05-05 · unverdicted · none · ref 34

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
cs.CV · 2025-09-09 · unverdicted · none · ref 10
