
arxiv: 2607.02484 · v1 · pith:BVLSC4HLnew · submitted 2026-07-02 · 💻 cs.CV · cs.AI

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Pith reviewed 2026-07-03 14:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual token pruning, entropy-aware pruning, submodular maximization, vision-language models, textual noise, multimodal benchmarks, token compression, cross-modal scoring

The pith

Entropy filtering of textual noise followed by spatially-prioritized submodular selection lets vision-language models keep fine-grained visual cues under tight token budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two reasons standard visual token pruning loses performance: textual instructions spread noise that corrupts cross-modal relevance scores for image patches, and simple top-k picks create fragmented coverage. It introduces Entropy-Aware Dense Pruning to first compute statistical entropy on the scores and remove the noisy parts, then solve a submodular maximization problem that includes a spatial prior so the kept tokens form a coherent, non-redundant set. The resulting method is tested on multiple multimodal benchmarks with strict token limits. A reader would care because the approach directly targets why efficiency gains usually come at the cost of accuracy on detailed queries. If the method works as described, models could process images faster while still answering precise questions that depend on small visual details.

Core claim

EADP reformulates pruning as structured compression. It first uses statistical entropy to quantify and filter textual noise, producing a robust fine-grained instruction relevance score. It then casts token selection as submodular maximization with a spatial prior to guarantee a holistic non-redundant visual representation. Experiments show the framework improves the accuracy-efficiency trade-off of VLMs, preserves fine-grained cues under strict budgets, and reaches state-of-the-art results on challenging multimodal benchmarks.
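
The page gives no formula for the entropy filter, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a hypothetical matrix `sim` of text-token-to-patch similarities, turns each text token's row into a distribution over patches, and keeps the text tokens whose distribution is least dispersed. The function names, the `keep_ratio` parameter, and the max-aggregation over kept tokens are all illustrative choices.

```python
import math

def softmax(row, temperature=1.0):
    # Numerically stable softmax over one text token's patch similarities.
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    # Shannon entropy divided by log(N): 0 for one-hot, 1 for uniform.
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def entropy_filtered_relevance(sim, keep_ratio=0.5):
    """sim[t][n]: similarity of text token t to image patch n (assumed input).

    Returns (score, kept, entropies), where score[n] is a per-patch relevance
    built only from the lowest-entropy text tokens.
    """
    probs = [softmax(row) for row in sim]
    entropies = [normalized_entropy(p) for p in probs]
    # Assumption: treat the most dispersed (highest-entropy) tokens as noise.
    n_keep = max(1, round(keep_ratio * len(sim)))
    kept = sorted(range(len(sim)), key=lambda t: entropies[t])[:n_keep]
    n_patches = len(sim[0])
    # Assumption: a patch is as relevant as its strongest kept text token says.
    score = [max(probs[t][n] for t in kept) for n in range(n_patches)]
    return score, kept, entropies
```

With two text tokens that each focus on one patch and two that spread evenly over all patches, `keep_ratio=0.5` keeps the focused pair, and the resulting score peaks at the two patches they point to.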

What carries the argument

Entropy-Aware Dense Pruning (EADP), a two-stage process that cleans cross-modal scores via statistical entropy before performing submodular maximization under a spatial prior for token selection.
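
To make the second stage concrete, here is a hedged sketch of greedy submodular selection. The page does not state the paper's objective, so this example swaps in a relevance-weighted facility-location function, F(S) = sum_i w_i * max_{j in S} k(i, j), with a Gaussian kernel over grid positions standing in for the spatial prior. The names `greedy_spatial_coverage`, `top_k`, and `sigma` are hypothetical. For non-negative weights this objective is monotone submodular, so the classical greedy algorithm carries its usual (1 - 1/e) approximation guarantee.

```python
import math

def greedy_spatial_coverage(relevance, grid_w, budget, sigma=1.0):
    """Greedily pick `budget` patch indices from a row-major grid of width `grid_w`.

    relevance[i] must be non-negative (assumed per-patch relevance score).
    """
    n = len(relevance)
    coords = [(i // grid_w, i % grid_w) for i in range(n)]

    def kernel(i, j):
        # Gaussian spatial affinity between patches i and j (assumed prior).
        dr = coords[i][0] - coords[j][0]
        dc = coords[i][1] - coords[j][1]
        return math.exp(-(dr * dr + dc * dc) / (2.0 * sigma * sigma))

    covered = [0.0] * n  # best affinity from each patch to any selected patch
    selected = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: relevance-weighted coverage improvement.
            gain = sum(relevance[i] * max(0.0, kernel(i, j) - covered[i])
                       for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        covered = [max(covered[i], kernel(i, best_j)) for i in range(n)]
    return selected

def top_k(relevance, budget):
    # Baseline: keep the highest-scoring patches regardless of position.
    return sorted(range(len(relevance)), key=lambda i: -relevance[i])[:budget]
```

On a 4x4 grid with two adjacent high-relevance patches in one corner and a slightly weaker one in the opposite corner, `top_k` with a budget of 2 spends both slots on the adjacent pair, while the greedy selection takes one patch from each corner. That is the redundancy-versus-coverage difference the paper attributes to naive Top-K.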

If this is right

  • VLMs retain fine-grained visual information even when forced to a small fixed number of tokens.
  • Token selection avoids both noise corruption in scoring and spatial clustering of kept patches.
  • The accuracy-efficiency trade-off improves relative to prior pruning techniques on the same models.
  • State-of-the-art results appear on multimodal benchmarks that stress fine detail and dense instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-plus-submodular pattern could be tested on other selection tasks where instructions introduce noise, such as audio or video token pruning.
  • Real-time deployment settings with hard latency constraints would gain the most from the reported token reduction without accuracy loss.
  • If the spatial prior is the main source of the coverage gain, simpler diversity penalties might achieve similar results at lower computational cost.
  • Extending the entropy measure to multi-turn conversations could reveal whether accumulated textual noise grows and requires stronger filtering.

Load-bearing premise

Statistical entropy can reliably identify and remove the textual noise that distorts cross-modal relevance scores for individual image patches.

What would settle it

A controlled test that measures whether the entropy-cleaned scores correlate more strongly with human patch-relevance judgments than raw cross-modal scores, or that compares the submodular selections against top-k selections on the same budget for retention of task-critical patches.
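
Both halves of that test reduce to small, standard measurements. The sketch below shows one way they could be computed, assuming hypothetical inputs: per-patch score lists, human relevance ratings for the same patches, and a set of task-critical patch indices. It uses Spearman rank correlation (with average ranks for ties) for the first half and a simple retention fraction for the second. None of these helper names come from the paper.

```python
import math

def average_ranks(values):
    # Rank values from 1..n, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(scores, human_ratings):
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(average_ranks(scores), average_ranks(human_ratings))

def retention(selected, critical):
    # Fraction of task-critical patches that survive pruning.
    critical = set(critical)
    return len(critical & set(selected)) / len(critical)
```

The comparison would then be `spearman(cleaned, human)` against `spearman(raw, human)`, and `retention(submodular_pick, critical)` against `retention(top_k_pick, critical)` at the same token budget.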

Figures

Figures reproduced from arXiv: 2607.02484 by Wei Shen, Xuankun Yang, Xuehui Wang.

Figure 1: (a) illustrates a limitation of global guidance: it tends to attend to background regions. (b) highlights the dispersion phenomenon caused by textual noise. (c) reveals the issues of feature fragmentation and selection redundancy.
Figure 2: Overview of the EADP. EADP acts as a plug-and-play module compressing N visual tokens into a highly informative subset of K tokens for the downstream LLM. Stage 1: An entropy-guided denoising mechanism filters out high-entropy textual noise to get the dense guidance score S_D. This is fused with the global EOS score S_G to yield a robust instruction relevance score S_I. Stage 2: After refining S_I via gau…

Abstract

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that visual token pruning in VLMs is hindered by textual noise dispersion corrupting cross-modal scores and by feature fragmentation from naive selection. It proposes Entropy-Aware Dense Pruning (EADP), which first applies statistical entropy to filter textual noise and produce robust fine-grained relevance scores, then reformulates selection as submodular maximization incorporating a spatial prior to yield holistic non-redundant tokens. The abstract asserts that this improves the accuracy-efficiency trade-off and achieves SoTA results on challenging multimodal benchmarks under strict token budgets.

Significance. If the central claims are substantiated, the work would offer a structured compression approach that combines information-theoretic filtering with combinatorial optimization, potentially enabling more reliable preservation of fine-grained visual information in token-limited VLM inference.

major comments (2)
  1. [Abstract] The assumption that statistical entropy reliably quantifies and removes textual noise corrupting dense cross-modal scoring is load-bearing for the entire pipeline, yet remains unexamined; if noise manifests as systematic semantic misalignment rather than high-entropy dispersion, the filtering step will not produce a robust relevance matrix and the subsequent submodular maximization cannot guarantee preservation of fine-grained cues.
  2. [Abstract] The statements 'Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off ... while achieving SoTA performance' are unsupported by any data, tables, ablations, error bars, or quantitative results in the manuscript text.
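
The first comment rests on a basic property of Shannon entropy: it depends only on the shape of a distribution, not on where the mass sits. A tiny illustration, using made-up attention distributions over four patches, shows why a text token that is sharply focused on the wrong patch would pass any entropy threshold that a correctly focused token passes.

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical attention of three text tokens over four patches,
# where patch 0 is the one the question is actually about.
on_target = [0.85, 0.05, 0.05, 0.05]   # peaked on the relevant patch
off_target = [0.05, 0.05, 0.05, 0.85]  # equally peaked, on the wrong patch
dispersed = [0.25, 0.25, 0.25, 0.25]   # spread out: what the filter targets

h_on = shannon_entropy(on_target)
h_off = shannon_entropy(off_target)
h_dispersed = shannon_entropy(dispersed)
```

Here `h_on` and `h_off` are equal up to floating-point rounding, and both are lower than `h_dispersed`. An entropy filter therefore removes the dispersed token but cannot separate the misaligned token from the aligned one, which is the failure mode the referee describes.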

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the specific comments on the abstract. We respond to each major comment below and indicate planned revisions where appropriate.

  1. Referee: [Abstract] The assumption that statistical entropy reliably quantifies and removes textual noise corrupting dense cross-modal scoring is load-bearing for the entire pipeline, yet remains unexamined; if noise manifests as systematic semantic misalignment rather than high-entropy dispersion, the filtering step will not produce a robust relevance matrix and the subsequent submodular maximization cannot guarantee preservation of fine-grained cues.

    Authors: We agree that the entropy filtering step is central and that its behavior under different noise regimes merits direct examination. The manuscript motivates the choice via the observed dispersion of textual noise in cross-modal scores and shows end-to-end gains, but does not isolate the entropy component with targeted ablations or score visualizations. In revision we will add an analysis subsection that (i) compares entropy-filtered versus raw relevance matrices on sample instructions and (ii) reports an ablation replacing entropy with alternative noise-robustness heuristics, thereby testing the assumption more explicitly.

    Revision: yes

  2. Referee: [Abstract] The statements 'Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off ... while achieving SoTA performance' are unsupported by any data, tables, ablations, error bars, or quantitative results in the manuscript text.

    Authors: The full manuscript contains a complete Experiments section (including tables reporting accuracy-efficiency trade-offs, ablations on the submodular objective and spatial prior, error bars across multiple runs, and SoTA comparisons on the cited multimodal benchmarks under strict token budgets). The abstract summarizes those results. If the experimental content was inadvertently omitted from the version the referee received, we will ensure the revised submission makes the link between abstract claims and specific tables/figures explicit.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on empirical design choices without self-referential reductions


The provided abstract and description introduce EADP via two design steps—entropy-based noise filtering to produce relevance scores, followed by submodular maximization with a spatial prior—but contain no equations, fitted parameters, self-citations, or derivations that reduce any claim to its own inputs by construction. No self-definitional loops, renamed predictions, or load-bearing self-citations appear. The framework is presented as a reformulation justified by identified bottlenecks and validated experimentally, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.

pith-pipeline@v0.9.1-grok · 5709 in / 1059 out tokens · 29498 ms · 2026-07-03T14:45:10.173619+00:00 · methodology

