EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
Pith reviewed 2026-05-12 02:37 UTC · model grok-4.3
The pith
EchoPrune prunes video tokens identified as temporal echoes of prior frames, letting VideoLLMs process up to 20 times more frames at a fixed token budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EchoPrune scores visual tokens by combining query-guided crossmodal relevance with temporal reconstruction error, measured via correspondence matching and echo matching across consecutive frames. Tokens with low reconstruction error are interpreted as temporal echoes and pruned, preserving task-relevant cues and temporal novelty while allowing up to 20 times more frames under a fixed LLM-side visual token budget.
What carries the argument
The dual scoring of tokens by query-guided crossmodal relevance and temporal reconstruction error via correspondence and echo matching across frames.
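The abstract describes this scoring only conceptually, so the following is a minimal NumPy sketch of what such a dual score could look like, not the paper's actual procedure: cosine similarity to a query embedding stands in for cross-modal relevance, nearest-neighbour matching against the previous frame stands in for correspondence/echo matching, and the weight alpha and keep_ratio are invented free parameters.

```python
import numpy as np

def echo_prune_scores(prev_tokens, curr_tokens, query_emb, alpha=0.5):
    """Score current-frame tokens by (i) query relevance and (ii) temporal
    reconstruction error. All formulas here are assumptions for illustration."""
    # (i) query-guided relevance: cosine similarity between each token and the query embedding
    q = query_emb / np.linalg.norm(query_emb)
    t = curr_tokens / np.linalg.norm(curr_tokens, axis=1, keepdims=True)
    relevance = t @ q

    # (ii) temporal reconstruction error: "reconstruct" each current token by its
    # best-matching previous-frame token (a stand-in for correspondence/echo matching)
    p = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    match = (t @ p.T).argmax(axis=1)
    recon_error = np.linalg.norm(curr_tokens - prev_tokens[match], axis=1)

    # min-max normalise both terms and combine with the assumed weight alpha
    rel_n = (relevance - relevance.min()) / (relevance.max() - relevance.min() + 1e-8)
    err_n = (recon_error - recon_error.min()) / (recon_error.max() - recon_error.min() + 1e-8)
    return alpha * rel_n + (1 - alpha) * err_n

def prune_frame(prev_tokens, curr_tokens, query_emb, keep_ratio=0.1):
    """Keep the top-scoring tokens; low scorers are treated as temporal echoes."""
    scores = echo_prune_scores(prev_tokens, curr_tokens, query_emb)
    k = max(1, int(keep_ratio * len(curr_tokens)))
    keep_idx = np.sort(np.argsort(scores)[-k:])   # preserve spatial order of kept tokens
    return curr_tokens[keep_idx], keep_idx
```

Under this reading, a token is dropped only when it is both well explained by the previous frame and weakly related to the query, which is the behaviour the abstract attributes to the method.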
If this is right
- VideoLLMs can ingest longer videos with finer temporal sampling without raising the token count passed to the language model.
- Performance improves on video understanding tasks because more frames supply additional evidence while redundant tokens are removed.
- Inference speeds up, particularly during prefilling, since fewer tokens reach the LLM decoder.
- The method applies to existing VideoLLMs without any retraining or architectural changes.
Where Pith is reading between the lines
- The reconstruction-based view of redundancy could extend to pruning in other sequential data like audio streams or time-series sensor inputs.
- Focusing on temporal novelty might reduce hallucination rates in VideoLLMs by limiting exposure to predictable but uninformative frames.
- Combining EchoPrune with query-agnostic compression techniques could further lower token budgets for very long videos.
Load-bearing premise
Tokens that reconstruct well from the previous frame via correspondence and echo matching contain no critical task-relevant or temporal information that pruning would lose.
What would settle it
A video clip and query where pruning tokens with low reconstruction error from the prior frame causes the model to produce an incorrect answer that the unpruned version answered correctly.
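A minimal harness for that check, as a sketch only; `answer` and `prune_echo_tokens` are hypothetical stand-ins for a VideoLLM inference call and an EchoPrune-style pruner, not APIs from the paper.

```python
def falsification_test(model, frames, query, ground_truth, answer, prune_echo_tokens):
    """Run one clip+query with and without echo pruning and flag the refuting case.

    `answer(model, frames, query)` and `prune_echo_tokens(frames, query)` are
    hypothetical callables the caller must supply; they are not defined by the paper.
    """
    unpruned = answer(model, frames, query)
    pruned = answer(model, prune_echo_tokens(frames, query), query)

    # The load-bearing premise fails on this example only if pruning
    # flips a correct answer into an incorrect one.
    refuted = (unpruned == ground_truth) and (pruned != ground_truth)
    return {"unpruned": unpruned, "pruned": pruned, "premise_refuted": refuted}
```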
Original abstract
Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as the dense frame sampling introduces massive visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos equally as static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
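For intuition on the fixed-budget arithmetic behind the 20x figure, here is a back-of-envelope sketch; the per-frame token count and baseline frame count are illustrative assumptions, not numbers taken from the paper.

```python
tokens_per_frame = 196                          # assumed vision-encoder output per frame
baseline_frames = 32                            # assumed dense-sampling baseline
budget = baseline_frames * tokens_per_frame     # fixed LLM-side visual token budget

keep_ratio = 0.05                               # keep 1 in 20 tokens per frame after pruning
frames_with_pruning = int(budget / (keep_ratio * tokens_per_frame))

print(frames_with_pruning)                      # 640 frames
print(frames_with_pruning / baseline_frames)    # 20.0 -> 20x more frames, same budget
```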
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EchoPrune, a training-free token pruning technique for Video Large Language Models (VideoLLMs). It interprets redundant visual tokens as 'temporal echoes' that can be reconstructed from previous frames via correspondence and echo matching. Tokens are retained if they have high query-guided cross-modal relevance or high temporal reconstruction error. This approach aims to allow VideoLLMs to process up to 20 times more frames under the same token budget, leading to performance gains of +8.6% and inference speedups of 5.6x for prefilling on Qwen2.5VL-7B, evaluated on six video understanding benchmarks with models including LLaVA-OV, Qwen2.5VL, and Qwen3VL.
Significance. If the empirical claims hold, the work could meaningfully advance efficient long-form video understanding in multimodal models by enabling higher temporal resolution at fixed token budgets. The training-free design and the framing of redundancy as temporal echoes represent conceptual strengths that distinguish it from segment-level merging heuristics. These aspects could support broader adoption in resource-constrained VideoLLM deployments.
major comments (3)
- [Abstract] The headline performance claims (+8.6% improvement and 5.6x prefilling speedup) are stated without any reference to baselines, the number of frames processed in the comparison, error bars, statistical tests, or data-exclusion rules. These omissions make it impossible to evaluate whether the gains are attributable to the proposed pruning or to other factors.
- [Abstract] The scoring procedure that combines query-guided cross-modal relevance with temporal reconstruction error (via correspondence matching and echo matching) is described only at a conceptual level. No explicit formula, weighting scheme, or pruning threshold is provided, which is load-bearing both for reproducibility and for verifying that the method actually preserves task-relevant temporal information.
- [Abstract] The core assumption that tokens with low reconstruction error from the prior frame are safely redundant is presented without ablations, failure-case analysis, or evidence that correspondence matching does not discard subtle motion, lighting changes, or query-critical details. This assumption directly determines which tokens are pruned and therefore requires explicit support.
minor comments (1)
- [Abstract] The phrase 'extensive experiments' is used, but no specific benchmark names or per-benchmark breakdowns are supplied; listing them would help readers assess the breadth of the evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below. Where the abstract can be clarified without exceeding length constraints, we will revise it; for deeper methodological details, we will ensure the main text provides explicit support and cross-references.
Point-by-point responses
- Referee: [Abstract] The headline performance claims (+8.6% improvement and 5.6x prefilling speedup) are stated without any reference to baselines, the number of frames processed in the comparison, error bars, statistical tests, or data-exclusion rules. These omissions make it impossible to evaluate whether the gains are attributable to the proposed pruning or to other factors.
Authors: We agree the abstract is concise and would benefit from added context. The full manuscript reports results against uniform sampling and prior token-pruning baselines, using up to 20x more frames under a fixed token budget, with averages across six benchmarks and multiple models. We will revise the abstract to briefly note the primary baselines and the 20x frame scaling to make the claims more interpretable while preserving brevity. revision: yes
- Referee: [Abstract] The scoring procedure that combines query-guided cross-modal relevance with temporal reconstruction error (via correspondence matching and echo matching) is described only at a conceptual level. No explicit formula, weighting scheme, or pruning threshold is provided, which is load-bearing both for reproducibility and for verifying that the method actually preserves task-relevant temporal information.
Authors: The abstract intentionally summarizes the method at a high level. The complete paper supplies the explicit scoring formula (a weighted sum of query-guided cross-modal relevance and temporal reconstruction error), the correspondence/echo matching procedures, the weighting coefficients, and the adaptive threshold selection. We will update the abstract with a short reference to the combined scoring function and direct readers to the method section for the full equations. revision: yes
- Referee: [Abstract] The core assumption that tokens with low reconstruction error from the prior frame are safely redundant is presented without ablations, failure-case analysis, or evidence that correspondence matching does not discard subtle motion, lighting changes, or query-critical details. This assumption directly determines which tokens are pruned and therefore requires explicit support.
Authors: We acknowledge that the abstract alone does not contain supporting evidence. The manuscript includes ablations isolating the reconstruction-error term, qualitative token visualizations, and quantitative checks confirming that motion and query-relevant details are retained via the joint scoring. To strengthen the presentation, we will add a concise paragraph on potential edge cases (e.g., rapid lighting shifts) and how the combined relevance-plus-error criterion mitigates them. revision: yes
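For readers who want the shape of what the second response above promises, one plausible form of a weighted-sum score with an adaptive (quantile-based) threshold is sketched below; the weight alpha and the quantile rule are assumptions of this sketch, since neither the abstract nor the rebuttal states the actual equations.

```latex
% Assumed form only: s_i combines relevance and reconstruction error; tokens above
% an adaptive quantile threshold are kept (\rho is the keep ratio).
s_i = \alpha \,\mathrm{rel}(v_i, q) + (1-\alpha)\,\bigl\| v_i - \hat{v}_i^{\,t-1} \bigr\|_2,
\qquad
\text{keep } v_i \iff s_i \ge \mathrm{quantile}\bigl(\{s_j\}_{j=1}^{N},\, 1-\rho\bigr)
```

Here v_i is a visual token of frame t, \hat{v}_i^{t-1} its reconstruction from frame t-1 via correspondence/echo matching, q the query embedding, and \rho the keep ratio; the actual coefficients and threshold rule would have to come from the paper's method section.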
Circularity Check
No circularity; the abstract presents a heuristic insight and a scoring rule without equations or self-referential reductions.
Full rationale
The provided abstract contains no equations, no derivation chain, and no citations (self or otherwise). EchoPrune is introduced as a training-free heuristic that scores tokens by combining query-guided cross-modal relevance with temporal reconstruction error via correspondence and echo matching. The central claim—that low-reconstruction-error tokens are redundant echoes—is framed as an interpretive insight rather than a quantity derived from fitted parameters or prior results. No step reduces by construction to its own inputs, and the reported gains (+8.6% performance, 5.6x speedup) are presented as empirical outcomes on external benchmarks. This satisfies the default expectation of a non-circular proposal.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: visual tokens from consecutive frames can be meaningfully compared via correspondence matching and reconstruction error.
invented entities (1)
- Temporal echoes: no independent evidence.