pith. machine review for the scientific record.

arxiv: 2605.10050 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: no theorem link

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords token pruning · VideoLLMs · temporal redundancy · long-form video understanding · training-free method · efficient inference

The pith

EchoPrune prunes video tokens scored as temporal echoes of prior frames, letting VideoLLMs process up to 20 times more frames at a fixed token budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EchoPrune as a training-free way to cut visual tokens in Video Large Language Models processing long videos. It treats tokens that reconstruct well from the previous frame as mere temporal echoes rather than new content, scoring each token by its query relevance and its reconstruction error across frames. Tokens with high error or strong query match are kept while predictable ones are dropped. This keeps task-relevant and novel temporal cues intact even as the number of observed frames rises sharply. Tests on multiple models and benchmarks report both higher accuracy and faster inference under the same token limit for the language model.
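A minimal sketch of the selection rule this describes, assuming per-token visual features and a query embedding are already extracted; the cosine-similarity relevance, the plain sum of the two scores, and the global top-k selection are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def echo_prune_sketch(frame_tokens, query_emb, budget):
    """Illustrative token selection: keep tokens that are query-relevant
    or poorly reconstructed from the previous frame (i.e., not 'echoes')."""
    scores, index = [], []
    prev = None
    for k, toks in enumerate(frame_tokens):          # toks: (N, D), L2-normalized
        relevance = toks @ query_emb                 # (i) query-guided relevance (cosine)
        if prev is None:
            recon_err = np.ones(len(toks))           # first frame: everything is novel
        else:
            sim = toks @ prev.T                      # cross-frame similarity to frame k-1
            recon_err = 1.0 - sim.max(axis=1)        # low error -> likely temporal echo
        scores.append(relevance + recon_err)         # assumed: simple additive score
        index.extend((k, i) for i in range(len(toks)))
        prev = toks
    order = np.argsort(np.concatenate(scores))[-budget:]
    return sorted(index[i] for i in order)           # (frame, token) pairs to keep
```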

Core claim

EchoPrune scores visual tokens by query-guided crossmodal relevance combined with temporal reconstruction error measured via correspondence matching and echo matching across consecutive frames; tokens with low reconstruction error are interpreted as temporal echoes and pruned, preserving task-relevant cues and temporal novelty while allowing up to 20 times more frames under a fixed LLM-side visual token budget.

What carries the argument

The dual scoring of tokens by query-guided crossmodal relevance and temporal reconstruction error via correspondence and echo matching across frames.

If this is right

  • VideoLLMs can ingest longer videos with finer temporal sampling without raising the token count passed to the language model.
  • Performance improves on video understanding tasks because more frames supply additional evidence while redundant tokens are removed.
  • Inference speeds up, particularly during prefilling, since fewer tokens reach the LLM decoder.
  • The method applies to existing VideoLLMs without any retraining or architectural changes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The reconstruction-based view of redundancy could extend to pruning in other sequential data like audio streams or time-series sensor inputs.
  • Focusing on temporal novelty might reduce hallucination rates in VideoLLMs by limiting exposure to predictable but uninformative frames.
  • Combining EchoPrune with query-agnostic compression techniques could further lower token budgets for very long videos.

Load-bearing premise

Tokens that reconstruct well from the previous frame via correspondence and echo matching contain no critical task-relevant or temporal information that pruning would lose.
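A minimal sketch of the reconstruction test behind this premise, following the paper's restriction of candidate matches to a local spatial neighborhood (e.g., 3×3) in the previous frame with a softmax over candidates; the scaled dot-product similarity and the L2 error of the softmax-weighted reconstruction are illustrative assumptions.

```python
import numpy as np

def reconstruction_error(tokens_k, tokens_prev, grid_h, grid_w, radius=1, tau=0.1):
    """Temporal reconstruction error of each patch token in frame k.

    tokens_k, tokens_prev: (grid_h * grid_w, D) patch tokens of frames k and k-1.
    radius: neighborhood radius in the patch grid (1 -> a 3x3 window).
    tau:    softmax temperature over candidate matches.
    Low error marks a token that is well explained by frame k-1 (a temporal echo).
    """
    dim = tokens_k.shape[1]
    errors = np.empty(grid_h * grid_w)
    for i in range(grid_h * grid_w):
        r, c = divmod(i, grid_w)
        # candidate patches of frame k-1 inside the local spatial neighborhood
        cand = [pr * grid_w + pc
                for pr in range(max(0, r - radius), min(grid_h, r + radius + 1))
                for pc in range(max(0, c - radius), min(grid_w, c + radius + 1))]
        sims = tokens_prev[cand] @ tokens_k[i] / np.sqrt(dim)   # assumed similarity
        weights = np.exp((sims - sims.max()) / tau)
        weights /= weights.sum()                                # soft matching weights
        recon = weights @ tokens_prev[cand]                     # reconstruct token i
        errors[i] = np.linalg.norm(tokens_k[i] - recon)         # assumed L2 error
    return errors
```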

What would settle it

A video clip and query where pruning tokens with low reconstruction error from the prior frame causes the model to produce an incorrect answer that the unpruned version answered correctly.
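A sketch of how such a counterexample could be checked; `answer_with_videollm` and `echo_prune` are hypothetical stand-ins supplied by the caller, not an existing API.

```python
def premise_fails(frames, query, gold, answer_with_videollm, echo_prune, budget):
    """True if pruning low-reconstruction-error tokens flips a correct answer.

    answer_with_videollm(visual_input, query) -> str     # hypothetical model call
    echo_prune(visual_input, query, budget)   -> pruned visual input (hypothetical)
    """
    unpruned_answer = answer_with_videollm(frames, query)
    pruned_answer = answer_with_videollm(echo_prune(frames, query, budget), query)
    return unpruned_answer == gold and pruned_answer != gold
```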

Figures

Figures reproduced from arXiv: 2605.10050 by Aleksei Tiulpin, Jiameng Li, Jiezhang Cao, Matthew B. Blaschko, Minye Wu.

Figure 1
Figure 1. Left: EchoPrune extends the temporal resolution with more visible frames for fine-grained video understanding, then selects pivot tokens to fit the budgets (F32 is short for 32 frames). Right: Under the default token footprint, EchoPrune scales visible frames up to boost SOTA performance. view at source ↗
Figure 2
Figure 2. Overview. EchoPrune identifies the most informative tokens via a three-fold decomposition: Step 1. Crossmodal relevance guided by user query (r); Step 2. Spatial motion based on correspondence matching (δ_corr); Step 3. Temporal novelty driven by echo matching (δ_echo). view at source ↗
Figure 3
Figure 3. Temporal reconstruction. view at source ↗
Figure 4
Figure 4. (no caption recovered) view at source ↗
Figure 5
Figure 5. Ablation study on VideoMME (F160) for LLaVA-OV-7B (top) and Qwen2.5VL-7B (bottom). See discussions on Qwen3VL in App. B.3. view at source ↗
Figure 6
Figure 6. Temporal resolution under different sampling rates. The light-colored frames represent dropped clips under decreasing temporal resolution. Due to the large sampling interval, the pivot frames are fully skipped in the default F32 (the top row). view at source ↗
Figure 7
Figure 7. Interpretability of EchoPrune (LLaVA-OV-7B). To avoid undersampling, we first allow sufficient visible frames and then prune tokens (F192/16.7%) to reach the budgets of full-frame F32. view at source ↗
Figure 8
Figure 8. Ablation study on Qwen3VL. view at source ↗
Figure 10
Figure 10. Performance under various FPS. view at source ↗
Figure 12
Figure 12. Frame matching. Correspondence (left) refers to the history patch in the same location. Neighbor (middle) and Holistic (right) of echo matching leverage the patches in a neighbor region or from the full frame, respectively. view at source ↗
Figure 13
Figure 13. Pruning effects on LLaVA-OV-7B (VideoMME, F160/20%). Our pruning effectively eliminates query-irrelevant patches while preserving pivot slots. Specifically, we track landmarks consistently (top) and identify new events of varying head directions (bottom). view at source ↗
read the original abstract

Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
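As a reading aid, one plausible way to write the combined criterion the abstract describes, using the relevance term r and the two reconstruction-error terms δ_corr and δ_echo named in Figure 2; the balancing factor λ and the additive form are assumptions, not the paper's stated equation.

```latex
% Assumed illustrative form, not the paper's exact scoring rule.
\[
  S\!\left(v_i^{k}, T\right)
  \;=\; \lambda\, r\!\left(v_i^{k}, T\right)
  \;+\; (1-\lambda)\,\Bigl[\delta_{\mathrm{corr}}\!\left(v_i^{k}\right)
        + \delta_{\mathrm{echo}}\!\left(v_i^{k}\right)\Bigr],
  \qquad \text{keep the top-}B\text{ tokens by } S.
\]
```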

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it. The pith above is the substance; this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes EchoPrune, a training-free token pruning technique for Video Large Language Models (VideoLLMs). It interprets redundant visual tokens as 'temporal echoes' that can be reconstructed from previous frames via correspondence and echo matching. Tokens are retained if they have high query-guided cross-modal relevance or high temporal reconstruction error. This approach aims to allow VideoLLMs to process up to 20 times more frames under the same token budget, leading to performance gains of +8.6% and inference speedups of 5.6x for prefilling on Qwen2.5VL-7B, evaluated on six video understanding benchmarks with models including LLaVA-OV, Qwen2.5VL, and Qwen3VL.

Significance. If the empirical claims hold, the work could meaningfully advance efficient long-form video understanding in multimodal models by enabling higher temporal resolution at fixed token budgets. The training-free design and the framing of redundancy as temporal echoes represent conceptual strengths that distinguish it from segment-level merging heuristics. These aspects could support broader adoption in resource-constrained VideoLLM deployments.

major comments (3)
  1. [Abstract] The headline performance claims (+8.6% improvement and 5.6x prefilling speedup) are stated without any reference to baselines, number of frames processed in the comparison, error bars, statistical tests, or data-exclusion rules. These omissions make it impossible to evaluate whether the gains are attributable to the proposed pruning or to other factors.
  2. [Abstract] The scoring procedure that combines query-guided cross-modal relevance with temporal reconstruction error (via correspondence matching and echo matching) is described only at a conceptual level. No explicit formula, weighting scheme, or pruning threshold is provided, which is load-bearing for both reproducibility and for verifying that the method actually preserves task-relevant temporal information.
  3. [Abstract] The core assumption that tokens with low reconstruction error from the prior frame are safely redundant is presented without ablations, failure-case analysis, or evidence that correspondence matching does not discard subtle motion, lighting changes, or query-critical details. This assumption directly determines which tokens are pruned and therefore requires explicit support.
minor comments (1)
  1. [Abstract] The phrase 'extensive experiments' is used but no specific benchmark names or per-benchmark breakdowns are supplied, which would help readers assess the breadth of the evaluation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below. Where the abstract can be clarified without exceeding length constraints, we will revise it; for deeper methodological details, we will ensure the main text provides explicit support and cross-references.

read point-by-point responses
  1. Referee: [Abstract] The headline performance claims (+8.6% improvement and 5.6x prefilling speedup) are stated without any reference to baselines, number of frames processed in the comparison, error bars, statistical tests, or data-exclusion rules. These omissions make it impossible to evaluate whether the gains are attributable to the proposed pruning or to other factors.

    Authors: We agree the abstract is concise and would benefit from added context. The full manuscript reports results against uniform sampling and prior token-pruning baselines, using up to 20x more frames under a fixed token budget, with averages across six benchmarks and multiple models. We will revise the abstract to briefly note the primary baselines and the 20x frame scaling to make the claims more interpretable while preserving brevity. revision: yes

  2. Referee: [Abstract] The scoring procedure that combines query-guided cross-modal relevance with temporal reconstruction error (via correspondence matching and echo matching) is described only at a conceptual level. No explicit formula, weighting scheme, or pruning threshold is provided, which is load-bearing for both reproducibility and for verifying that the method actually preserves task-relevant temporal information.

    Authors: The abstract intentionally summarizes the method at a high level. The complete paper supplies the explicit scoring formula (a weighted sum of query-guided cross-modal relevance and temporal reconstruction error), the correspondence/echo matching procedures, the weighting coefficients, and the adaptive threshold selection. We will update the abstract with a short reference to the combined scoring function and direct readers to the method section for the full equations. revision: yes

  3. Referee: [Abstract] The core assumption that tokens with low reconstruction error from the prior frame are safely redundant is presented without ablations, failure-case analysis, or evidence that correspondence matching does not discard subtle motion, lighting changes, or query-critical details. This assumption directly determines which tokens are pruned and therefore requires explicit support.

    Authors: We acknowledge that the abstract alone does not contain supporting evidence. The manuscript includes ablations isolating the reconstruction-error term, qualitative token visualizations, and quantitative checks confirming that motion and query-relevant details are retained via the joint scoring. To strengthen the presentation, we will add a concise paragraph on potential edge cases (e.g., rapid lighting shifts) and how the combined relevance-plus-error criterion mitigates them. revision: yes

Circularity Check

0 steps flagged

No circularity; abstract presents heuristic insight and scoring rule without equations or self-referential reductions

full rationale

The provided abstract contains no equations, no derivation chain, and no citations (self or otherwise). EchoPrune is introduced as a training-free heuristic that scores tokens by combining query-guided cross-modal relevance with temporal reconstruction error via correspondence and echo matching. The central claim—that low-reconstruction-error tokens are redundant echoes—is framed as an interpretive insight rather than a quantity derived from fitted parameters or prior results. No step reduces by construction to its own inputs, and the reported gains (+8.6% performance, 5.6x speedup) are presented as empirical outcomes on external benchmarks. This satisfies the default expectation of a non-circular proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on standard computer-vision assumptions about frame correspondence and cross-modal alignment; it introduces the conceptual framing of temporal echoes without additional free parameters or external validation mentioned.

axioms (1)
  • domain assumption: Visual tokens from consecutive frames can be meaningfully compared via correspondence matching and reconstruction error
    Invoked as the basis for identifying temporal redundancy in the core method description.
invented entities (1)
  • temporal echoes · no independent evidence
    purpose: Conceptual label for redundant tokens that are predictable from prior frames
    New interpretive framing introduced to motivate the pruning rule; no independent falsifiable prediction supplied in the abstract.

pith-pipeline@v0.9.0 · 5562 in / 1446 out tokens · 60167 ms · 2026-05-12T02:37:39.029113+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 8 internal anchors

  1. [1]

    Divprune: Diversity-based visual token pruning for large multimodal models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. Divprune: Diversity-based visual token pruning for large multimodal models. In CVPR, 2025

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [3]

    Flashvlm: Text-guided visual token selection for large multimodal models

    Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, and Keze Wang. Flashvlm: Text-guided visual token selection for large multimodal models. arXiv preprint arXiv:2512.20561, 2025

  4. [4]

    The use of mmr, diversity-based reranking for reordering documents and producing summaries

    Jaime Carbonell and Jade Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998

  5. [5]

    Unified spatiotemporal token compression for video-llms at ultra-low retention

    Junhao Du, Jialong Xue, Anqi Li, Jincheng Dai, and Guo Lu. Unified spatiotemporal token compression for video-llms at ultra-low retention. In CVPR, 2026

  6. [6]

    Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging

    Ziyang Fan, Keyu Chen, Ruilong Xing, Yulin Li, Li Jiang, and Zhuotao Tian. Flashvid: Efficient video large language models via training-free tree-based spatiotemporal token merging. In ICLR, 2026

  7. [7]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In CVPR, 2025

  8. [8]

    Framefusion: Combining similarity and importance for video token reduction on large vision language models

    Tianyu Fu, Tengxuan Liu, Qinghao Han, Guohao Dai, Shengen Yan, Huazhong Yang, Xuefei Ning, and Yu Wang. Framefusion: Combining similarity and importance for video token reduction on large vision language models. In ICCV, 2025

  9. [9]

    Submodular functions and optimization, volume 58

    Satoru Fujishige. Submodular functions and optimization, volume 58. Elsevier, 2005

  10. [10]

    Echoing-Pixels: Cross-modal adaptive token reduction for efficient audio-visual LLMs

    Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, and Jingjing Chen. Echoing-pixels: Cross-modal adaptive token reduction for efficient audio-visual llms. arXiv preprint arXiv:2512.10324, 2025

  11. [11]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  12. [12]

    KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models

    Haifeng Huang and Yang Li. Kitoke: Kernel-based interval-aware token compression for video large language models. arXiv preprint arXiv:2604.03414, 2026

  13. [13]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pages 6700–6709, 2019

  14. [14]

    Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning

    Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, and Huan Li. Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning. In EMNLP, 2025

  15. [15]

    See the forest for the trees: Loosely speculative decoding via visual-semantic guidance for efficient inference of video llms

    Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou, Gang Chen, and Huan Li. See the forest for the trees: Loosely speculative decoding via visual-semantic guidance for efficient inference of video llms. In ACL, 2026

  16. [16]

    Compression tells intelligence: Visual coding, visual token technology, and the unification

    Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, and Wenjun Zeng. Compression tells intelligence: Visual coding, visual token technology, and the unification. arXiv preprint arXiv:2601.20742, 2026

  17. [17]

    Forestprune: High-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling

    Shaobo Ju, Baiyang Song, Tao Chen, Jiapeng Zhang, Qiong Wu, Chao Chang, HuaiXi Wang, Yiyi Zhou, and Rongrong Ji. Forestprune: High-ratio visual token compression for video multimodal large language models via spatial-temporal forest modeling. In CVPR Findings, 2026

  18. [18]

    Parallelvlm: Lossless video-llm acceleration with visual alignment aware parallel speculative decoding

    Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, and Cong Wang. Parallelvlm: Lossless video-llm acceleration with visual alignment aware parallel speculative decoding. In CVPR, 2026

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024

  20. [20]

    MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

    Jiameng Li, Aleksei Tiulpin, and Matthew B. Blaschko. MI-pruner: Crossmodal mutual information-guided token pruner for efficient mllms. arXiv preprint arXiv:2604.03072, 2026

  21. [21]

    Keeping the evidence chain: Semantic evidence allocation for training-free token pruning in video temporal grounding

    Jiaqi Li, Shuntian Zheng, Yixian Shen, Jia-Hong Huang, Xiaoman Lu, Minzhe Ni, and Yu Guan. Keeping the evidence chain: Semantic evidence allocation for training-free token pruning in video temporal grounding. arXiv preprint arXiv:2603.05663, 2026

  22. [22]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022

  23. [23]

    Resprune: Text-conditioned subspace reconstruction for visual token pruning in large vision-language models

    Xu Li, Yi Zheng, Yuxuan Liang, Zhe Liu, Xiaolei Chen, Haotian Chen, Rui Zhu, and Xiangyang Xue. Resprune: Text-conditioned subspace reconstruction for visual token pruning in large vision-language models. arXiv preprint arXiv:2603.21105, 2026

  24. [24]

    EAGLE: Speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE: Speculative sampling requires rethinking feature uncertainty. In ICML, 2024

  25. [25]

    EAGLE-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In EMNLP, 2024

  26. [26]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. In NeurIPS, 2025

  27. [27]

    Resadapt: Adaptive resolution for efficient multimodal reasoning

    Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, and Kang Liu. Resadapt: Adaptive resolution for efficient multimodal reasoning. arXiv preprint arXiv:2603.28610, 2026

  28. [28]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In CVPR, 2024

  29. [29]

    Video compression commander: Plug-and-play inference acceleration for video large language models

    Xuyang Liu, Yiyu Wang, Junpeng Ma, and Linfeng Zhang. Video compression commander: Plug-and-play inference acceleration for video large language models. In EMNLP, 2025

  30. [30]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022

  31. [31]

    γ−mod: Exploring mixture-of-depth adaptation for multimodal large language models

    Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, and Rongrong Ji. γ−mod: Exploring mixture-of-depth adaptation for multimodal large language models. In ICLR, 2025

  32. [32]

    Mmg-vid: Maximizing marginal gains at segment-level and token-level for efficient video llms

    Junpeng Ma, Qizhe Zhang, Ming Lu, Zhibin Wang, Qiang Zhou, Jun Song, and Shanghang Zhang. Mmg-vid: Maximizing marginal gains at segment-level and token-level for efficient video llms. In AAAI, 2026

  33. [33]

    Gift: Global irreplaceability frame targeting for efficient video understanding

    Junpeng Ma, Sashuai Zhou, Guanghao Li, Xin Gao, Yue Cao, Hengyu Zeng, Yuxiang Yan, Zhibin Wang, Jun Song, Bo Zheng, et al. Gift: Global irreplaceability frame targeting for efficient video understanding. In CVPR, 2026

  34. [34]

    Apet: Approximation-error guided token compression for efficient vlms

    Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, and Hairong Zheng. Apet: Approximation-error guided token compression for efficient vlms. In CVPR, 2026

  35. [35]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. In NeurIPS, 2023

  36. [36]

    Certain topics in telegraph transmission theory

    Harry Nyquist. Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers, 47(2):617–644, 1928

  37. [37]

    Does your vision-language model get lost in the long video sampling dilemma?

    Tianyuan Qu, Longxiang Tang, Bohao Peng, Senqiao Yang, Bei Yu, and Jiaya Jia. Does your vision-language model get lost in the long video sampling dilemma? In ICCV, 2025

  38. [38]

    Clustering by fast search and find of density peaks

    Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. science, 344(6191):1492–1496, 2014

  39. [39]

    Cope-videolm: Leveraging codec primitives for efficient video language modeling

    Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, and Mihai Dusmanu. Cope-videolm: Leveraging codec primitives for efficient video language modeling. arXiv preprint arXiv:2602.13191, 2026

  40. [40]

    Holitom: Holistic token merging for fast video large language models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models. In NeurIPS, 2025

  41. [41]

    FastVID: Dynamic density pruning for fast video large language models

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, pengzhang liu, Sicheng Zhao, and Guiguang Ding. FastVID: Dynamic density pruning for fast video large language models. In NeurIPS, 2025

  42. [42]

    Attend before attention: Efficient and scalable video understanding via autoregressive gazing

    Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M Chan, et al. Attend before attention: Efficient and scalable video understanding via autoregressive gazing. In CVPR, 2026

  43. [43]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, pages 8317–8326, 2019

  44. [44]

    Moviechat+: Question-aware sparse memory for long video question answering

    Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering. TPAMI, 2025

  45. [45]

    Onevision-encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence

    Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, et al. Onevision-encoder: Codec-aligned sparsity as a foundational principle for multimodal intelligence. arXiv preprint arXiv:2602.08683, 2026

  46. [46]

    Adaptive keyframe sampling for long video understanding

    Xi Tang, Jihao Qiu, Lingxi Xie, Yunjie Tian, Jianbin Jiao, and Qixiang Ye. Adaptive keyframe sampling for long video understanding. In CVPR, 2025

  47. [47]

    Dycoke: Dynamic compression of tokens for fast video large language models

    Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Dycoke: Dynamic compression of tokens for fast video large language models. In CVPR, 2025

  48. [48]

    Omnizip: Audio- guided dynamic token compression for fast omnimodal large language models

    Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian Liu, and Huan Wang. Omnizip: Audio- guided dynamic token compression for fast omnimodal large language models. In CVPR, 2026

  49. [49]

    Qwen2.5-VL, January 2025

    Qwen Team. Qwen2.5-vl, January 2025. URL https://qwenlm.github.io/blog/qwen2.5-vl/

  50. [50]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000

  51. [51]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In NeurIPS, 2022

  52. [52]

    Pixelprune: Pixel-level adaptive visual token reduction via predictive coding

    Nan Wang, Zhiwei Jin, Chen Chen, and Haonan Lu. Pixelprune: Pixel-level adaptive visual token reduction via predictive coding. arXiv preprint arXiv:2604.00886, 2026

  53. [53]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In ICCV, 2025

  54. [54]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  55. [55]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. In NeurIPS, 2024

  56. [56]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024

  57. [57]

    Vision transformers with self-distilled registers

    Zipeng Yan, Yinjie Chen, Chong Zhou, Bo Dai, and Andrew Luo. Vision transformers with self-distilled registers. In NeurIPS, 2025

  58. [58]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...

  59. [59]

    Visionzip: Longer is better but not necessary in vision language models

    Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. Visionzip: Longer is better but not necessary in vision language models. In CVPR, 2025

  60. [60]

    Visionthink: Smart and efficient vision language model via reinforcement learning

    Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, and Jiaya Jia. Visionthink: Smart and efficient vision language model via reinforcement learning. In NeurIPS, 2025

  61. [61]

    Visiontrim: Unified vision token compression for training-free mllm acceleration

    Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, and Jianke Zhu. Visiontrim: Unified vision token compression for training-free mllm acceleration. In ICLR, 2026

  62. [62]

    Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration

    Zhongzhi Yu, Zheng Wang, Yonggan Fu, Huihong Shi, Khalid Shaikh, and Yingyan Celine Lin. Unveiling and harnessing hidden attention sinks: Enhancing large language models without training through attention calibration. In ICML, 2024

  63. [63]

    Unicomp: Rethinking video compression through informational uniqueness

    Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, and Lin Ma. Unicomp: Rethinking video compression through informational uniqueness. In CVPR, 2026

  64. [64]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023

  65. [65]

    Unified spatio-temporal token scoring for efficient video vlms

    Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, and Sangho Lee. Unified spatio-temporal token scoring for efficient video vlms. arXiv preprint arXiv:2603.18004, 2026

  66. [66]

    p-mod: Building mixture-of-depths mllms via progressive ratio decay

    Jun Zhang, Desen Meng, Zhengming Zhang, Zhenpeng Huang, Tao Wu, and Limin Wang. p-mod: Building mixture-of-depths mllms via progressive ratio decay. In ICCV, 2025

  67. [67]

    Lmms-eval: Reality check on the evaluation of large multimodal models

    Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, et al. Lmms-eval: Reality check on the evaluation of large multimodal models. In Findings of the Association for Computational Linguistics: NAACL 2025, 2025

  68. [68]

    Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms

    Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond text-visual attention: Exploiting visual cues for effective token pruning in vlms. In ICCV, 2025

  69. [69]

    Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms

    Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, and Shanghang Zhang. Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms. In NeurIPS, 2025

  70. [70]

    Sparsevlm: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. In ICML, 2025

  71. [71]

    MLVU: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In CVPR, 2025