Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Wei Shen; Xuankun Yang; Xuehui Wang

arxiv: 2607.02484 · v1 · pith:BVLSC4HLnew · submitted 2026-07-02 · 💻 cs.CV · cs.AI


Pith reviewed 2026-07-03 14:45 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords visual token pruning · entropy-aware pruning · submodular maximization · vision-language models · textual noise · multimodal benchmarks · token compression · cross-modal scoring


The pith

Entropy filtering of textual noise followed by spatially-prioritized submodular selection lets vision-language models keep fine-grained visual cues under tight token budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two reasons standard visual token pruning loses performance: textual instructions spread noise that corrupts cross-modal relevance scores for image patches, and simple top-k picks create fragmented coverage. It introduces Entropy-Aware Dense Pruning to first compute statistical entropy on the scores and remove the noisy parts, then solve a submodular maximization problem that includes a spatial prior so the kept tokens form a coherent, non-redundant set. The resulting method is tested on multiple multimodal benchmarks with strict token limits. A reader would care because the approach directly targets why efficiency gains usually come at the cost of accuracy on detailed queries. If the method works as described, models could process images faster while still answering precise questions that depend on small visual details.

Core claim

EADP reformulates pruning as structured compression. It first uses statistical entropy to quantify and filter textual noise, producing a robust fine-grained instruction relevance score. It then casts token selection as submodular maximization with a spatial prior to guarantee a holistic non-redundant visual representation. Experiments show the framework improves the accuracy-efficiency trade-off of VLMs, preserves fine-grained cues under strict budgets, and reaches state-of-the-art results on challenging multimodal benchmarks.
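
The first stage described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the text-to-patch similarity matrix `sim`, the softmax `temperature`, and the `keep_ratio` rule for discarding high-entropy text tokens are all hypothetical names and choices introduced here, since the paper's equations are not reproduced on this page.

```python
import numpy as np

def entropy_filtered_relevance(sim, keep_ratio=0.5, temperature=0.1):
    """sim: (T, N) similarity between T text tokens and N image patches.

    Returns an (N,) patch relevance score built only from the text tokens
    whose distribution over patches is most concentrated (lowest entropy),
    plus the per-token entropies for inspection.
    """
    sim = np.asarray(sim, dtype=np.float64)
    # Softmax over patches for each text token (numerically stabilised).
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Shannon entropy per text token; high entropy means a dispersed row.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    # Keep the lowest-entropy text tokens (assumed rule, not from the paper).
    n_keep = max(1, int(round(keep_ratio * sim.shape[0])))
    kept = np.argsort(entropy)[:n_keep]
    # Average the kept rows into one relevance score per patch.
    return probs[kept].mean(axis=0), entropy

# Toy input: one focused text token and one dispersed text token.
sim = np.array([
    [5.0, 0.0, 0.0, 0.0],   # concentrates on patch 0
    [1.0, 1.0, 1.0, 1.0],   # uniform over all patches
])
score, entropy = entropy_filtered_relevance(sim, keep_ratio=0.5)
```

In this toy case the uniform row has the higher entropy and is dropped, so the resulting score is driven by the focused token alone.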

What carries the argument

Entropy-Aware Dense Pruning (EADP), a two-stage process that cleans cross-modal scores via statistical entropy before performing submodular maximization under a spatial prior for token selection.
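
The second stage can be illustrated with a standard greedy routine. The objective below is an assumption chosen for illustration, not the paper's: a relevance term plus a facility-location coverage term, where a Gaussian kernel over patch grid positions stands in for the spatial prior. The names `greedy_select`, `lam`, and `sigma` are hypothetical. For non-negative relevance this objective is monotone submodular, so the classic result of Nemhauser, Wolsey, and Fisher applies: greedy selection under a cardinality budget reaches at least a (1 - 1/e) fraction of the optimum.

```python
import numpy as np

def greedy_select(relevance, feats, grid_hw, k, lam=1.0, sigma=2.0):
    """Greedily maximise
        F(S) = lam * sum_{i in S} relevance[i] + sum_j max_{i in S} K[i, j]
    where K combines feature similarity with spatial proximity."""
    h, w = grid_hw
    n = h * w
    relevance = np.asarray(relevance, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    # Non-negative feature similarity keeps the coverage term monotone.
    sim = np.clip(feats @ feats.T, 0.0, None)
    # Gaussian kernel on grid coordinates (stand-in for the spatial prior).
    ys, xs = np.divmod(np.arange(n), w)
    pos = np.stack([ys, xs], axis=1).astype(np.float64)
    d2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)
    K = sim * np.exp(-d2 / (2.0 * sigma ** 2))

    selected = []
    chosen = np.zeros(n, dtype=bool)
    covered = np.zeros(n)            # covered[j] = max_{i in S} K[i, j]
    for _ in range(min(k, n)):
        # Marginal gain of adding each candidate i to the current set.
        gain = lam * relevance + np.maximum(K - covered, 0.0).sum(axis=1)
        gain[chosen] = -np.inf       # never pick the same token twice
        best = int(np.argmax(gain))
        selected.append(best)
        chosen[best] = True
        covered = np.maximum(covered, K[best])
    return selected

# Toy 1x4 grid: patches 0 and 1 are near-duplicates with high relevance.
relevance = np.array([0.9, 0.8, 0.3, 0.2])
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
picked = greedy_select(relevance, feats, grid_hw=(1, 4), k=2)
```

Top-k by relevance alone would keep the redundant pair 0 and 1. The greedy routine keeps patch 0 and then prefers patch 2, because patch 1 adds almost no new coverage once patch 0 is selected.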

If this is right

VLMs retain fine-grained visual information even when forced to a small fixed number of tokens.
Token selection avoids both noise corruption in scoring and spatial clustering of kept patches.
The accuracy-efficiency trade-off improves relative to prior pruning techniques on the same models.
State-of-the-art results appear on multimodal benchmarks that stress fine detail and dense instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entropy-plus-submodular pattern could be tested on other selection tasks where instructions introduce noise, such as audio or video token pruning.
Real-time deployment settings with hard latency constraints would gain the most from the reported token reduction without accuracy loss.
If the spatial prior is the main source of the coverage gain, simpler diversity penalties might achieve similar results at lower computational cost.
Extending the entropy measure to multi-turn conversations could reveal whether accumulated textual noise grows and requires stronger filtering.
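
As a concrete reference point for the "simpler diversity penalty" mentioned in the list above, here is a minimal sketch in the spirit of maximal marginal relevance. It is an editorial baseline, not something the paper proposes: `mmr_select` and `penalty` are hypothetical names, and the routine deliberately omits any spatial term.

```python
import numpy as np

def mmr_select(relevance, feats, k, penalty=0.5):
    """Pick k tokens by relevance minus a penalty on similarity to kept tokens."""
    relevance = np.asarray(relevance, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = feats @ feats.T            # cosine similarity between tokens
    n = len(relevance)
    chosen = np.zeros(n, dtype=bool)
    max_sim = np.zeros(n)            # similarity to the closest kept token
    order = []
    for _ in range(min(k, n)):
        score = relevance - penalty * max_sim
        score[chosen] = -np.inf      # exclude tokens that are already kept
        best = int(np.argmax(score))
        order.append(best)
        chosen[best] = True
        max_sim = np.maximum(max_sim, sim[best])
    return order

# Tokens 0 and 1 share a feature direction; token 2 is distinct.
mmr_picked = mmr_select([0.9, 0.8, 0.5], [[1, 0], [1, 0], [0, 1]], k=2)
```

Comparing such a baseline against the full method at the same token budget would show how much of the coverage gain actually depends on the spatial prior.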

Load-bearing premise

Statistical entropy can reliably identify and remove the textual noise that distorts cross-modal relevance scores for individual image patches.

What would settle it

A controlled test that measures whether the entropy-cleaned scores correlate more strongly with human patch-relevance judgments than raw cross-modal scores, or that compares the submodular selections against top-k selections on the same budget for retention of task-critical patches.
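
Both checks can be computed with a few lines of NumPy. The sketch below assumes per-patch human relevance judgments and a set of task-critical patch indices are available; the helper names `spearman` and `retention` are hypothetical, and the numbers are synthetic toy values, not measurements from the paper.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (no tie correction; assumes distinct values)."""
    ra = np.argsort(np.argsort(a)).astype(np.float64)
    rb = np.argsort(np.argsort(b)).astype(np.float64)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

def retention(selected, critical):
    """Fraction of task-critical patch indices that survive pruning."""
    critical = set(critical)
    return len(critical & set(selected)) / len(critical)

# Synthetic toy scores for five patches (illustration only).
human   = np.array([0.9, 0.1, 0.7, 0.3, 0.5])   # human relevance judgments
raw     = np.array([0.2, 0.8, 0.6, 0.4, 0.5])   # raw cross-modal scores
cleaned = np.array([0.8, 0.2, 0.6, 0.3, 0.5])   # entropy-cleaned scores

rho_raw = spearman(human, raw)
rho_cleaned = spearman(human, cleaned)
kept_fraction = retention(selected=[0, 2, 5], critical=[0, 1])
```

The premise would be supported if `rho_cleaned` exceeds `rho_raw` on real annotations, and if the submodular selections show higher `retention` than top-k selections at the same budget.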

Figures

Figures reproduced from arXiv: 2607.02484 by Wei Shen, Xuankun Yang, Xuehui Wang.

**Figure 1.** (a) illustrates a limitation of global guidance: it tends to attend to background regions. (b) highlights the dispersion phenomenon caused by textual noise. (c) reveals the issues of feature fragmentation and selection redundancy.

**Figure 2.** Overview of EADP. EADP acts as a plug-and-play module compressing N visual tokens into a highly informative subset of K tokens for the downstream LLM. Stage 1: An entropy-guided denoising mechanism filters out high-entropy textual noise to get the dense guidance score $S_D$. This is fused with the global EOS score $S_G$ to yield a robust instruction relevance score $S_I$. Stage 2: After refining $S_I$ via gau…

Original abstract

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EADP combines entropy filtering for textual noise with spatial-prior submodular selection for VLM token pruning, but the SoTA claims rest on experiments not visible in the provided text.


The core idea is to treat visual token pruning as a two-step structured compression task: first use statistical entropy on cross-modal scores to strip out textual noise, then replace top-k with submodular maximization that includes a spatial prior so the kept tokens stay holistic and non-redundant.

That combination is the main thing on offer. Existing pruning work already uses relevance scores and sometimes diversity terms, but the explicit entropy step aimed at noise dispersion plus the submodular formulation with spatial structure looks like a targeted response to the failure modes the authors flag for dense instructions.

The paper states the problem clearly enough and the method is concrete. The stress-test concern about entropy missing semantically localized misalignment rather than high-entropy dispersion is fair to raise; if the noise in practice is systematic misalignment instead of dispersed high-entropy tokens, the first stage could pass corrupted scores to the submodular step and the claimed robustness would not follow. The abstract does not show ablations or numbers that would let a reader check this.

No equations, derivations, or results appear in the text we have, so the soundness of the performance claims cannot be assessed yet. The citation pattern and prior-work discussion are not visible either.

This is for researchers working on efficient VLMs who already know the token-budget problem. A reader looking for a new pruning recipe might borrow the entropy-plus-submodular framing even if the specific results need checking.

It should go to peer review. The problem is practical, the proposed fix is well-specified, and the experiments (once shown) can be evaluated directly.

Referee Report

2 major / 0 minor

Summary. The paper claims that visual token pruning in VLMs is hindered by textual noise dispersion corrupting cross-modal scores and by feature fragmentation from naive selection. It proposes Entropy-Aware Dense Pruning (EADP), which first applies statistical entropy to filter textual noise and produce robust fine-grained relevance scores, then reformulates selection as submodular maximization incorporating a spatial prior to yield holistic non-redundant tokens. The abstract asserts that this improves the accuracy-efficiency trade-off and achieves SoTA results on challenging multimodal benchmarks under strict token budgets.

Significance. If the central claims are substantiated, the work would offer a structured compression approach that combines information-theoretic filtering with combinatorial optimization, potentially enabling more reliable preservation of fine-grained visual information in token-limited VLM inference.

major comments (2)

[Abstract] The assumption that statistical entropy reliably quantifies and removes textual noise corrupting dense cross-modal scoring is load-bearing for the entire pipeline, yet remains unexamined; if noise manifests as systematic semantic misalignment rather than high-entropy dispersion, the filtering step will not produce a robust relevance matrix and the subsequent submodular maximization cannot guarantee preservation of fine-grained cues.
[Abstract] The statements 'Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off ... while achieving SoTA performance' are unsupported by any data, tables, ablations, error bars, or quantitative results in the manuscript text.
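
The first comment rests on a basic property of entropy that is easy to verify: it depends only on the shape of a distribution, not on where its mass sits. The toy distributions below are invented for illustration, and `shannon_entropy` is a hypothetical helper name. A score row peaked on the wrong patch has the same entropy as one peaked on the right patch, so an entropy threshold alone cannot tell them apart.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (in nats) of a non-negative vector, after normalising."""
    p = np.asarray(p, dtype=np.float64)
    p = p / p.sum()
    nz = p[p > 0]                    # skip zeros, treating 0 * log 0 as 0
    return float(-(nz * np.log(nz)).sum())

# Suppose patch 0 is the correct target for the instruction.
aligned    = [0.94, 0.02, 0.02, 0.02]   # peaked on the correct patch
misaligned = [0.02, 0.02, 0.02, 0.94]   # equally peaked, on the wrong patch
dispersed  = [0.25, 0.25, 0.25, 0.25]   # spread evenly over all patches

h_aligned = shannon_entropy(aligned)
h_misaligned = shannon_entropy(misaligned)
h_dispersed = shannon_entropy(dispersed)
```

Only the dispersed row stands out as high-entropy. The misaligned row would pass any filter that the aligned row passes, which is the failure mode the referee asks the authors to examine.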

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and the specific comments on the abstract. We respond to each major comment below and indicate planned revisions where appropriate.


Referee: [Abstract] The assumption that statistical entropy reliably quantifies and removes textual noise corrupting dense cross-modal scoring is load-bearing for the entire pipeline, yet remains unexamined; if noise manifests as systematic semantic misalignment rather than high-entropy dispersion, the filtering step will not produce a robust relevance matrix and the subsequent submodular maximization cannot guarantee preservation of fine-grained cues.

Authors: We agree that the entropy filtering step is central and that its behavior under different noise regimes merits direct examination. The manuscript motivates the choice via the observed dispersion of textual noise in cross-modal scores and shows end-to-end gains, but does not isolate the entropy component with targeted ablations or score visualizations. In revision we will add an analysis subsection that (i) compares entropy-filtered versus raw relevance matrices on sample instructions and (ii) reports an ablation replacing entropy with alternative noise-robustness heuristics, thereby testing the assumption more explicitly. revision: yes

Referee: [Abstract] The statements 'Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off ... while achieving SoTA performance' are unsupported by any data, tables, ablations, error bars, or quantitative results in the manuscript text.

Authors: The full manuscript contains a complete Experiments section (including tables reporting accuracy-efficiency trade-offs, ablations on the submodular objective and spatial prior, error bars across multiple runs, and SoTA comparisons on the cited multimodal benchmarks under strict token budgets). The abstract summarizes those results. If the experimental content was inadvertently omitted from the version the referee received, we will ensure the revised submission makes the link between abstract claims and specific tables/figures explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: method relies on empirical design choices without self-referential reductions


The provided abstract and description introduce EADP via two design steps—entropy-based noise filtering to produce relevance scores, followed by submodular maximization with a spatial prior—but contain no equations, fitted parameters, self-citations, or derivations that reduce any claim to its own inputs by construction. No self-definitional loops, renamed predictions, or load-bearing self-citations appear. The framework is presented as a reformulation justified by identified bottlenecks and validated experimentally, remaining self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described or can be extracted.



Reference graph

Works this paper leans on

76 extracted references · 47 canonical work pages · 25 internal anchors

[1]

Flamingo: a Visual Language Model for Few-Shot Learning

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Men- sch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., van den Driessche, G., Mugford, K., Sifre, L., Soyer, H., Doersch, C., Gupta, A., Stanczyk, P., Noh, H., Gontijo Lo...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: Divprune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9392–9401 (2025) 1, 4, 11, 12, 25

2025
[3]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., et al.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report (2025),https://arxiv.org/abs/2502.13923 1, 10, 12, 24

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

In: European Conference on Computer Vision

Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision- language models. In: European Conference on Computer Vision. pp. 19–35. Springer (2024) 1, 4, 11, 25

2024
[7]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024) 1, 12

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Chen, Z., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238 (2023) 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014) 3

work page internal anchor Pith review Pith/arXiv arXiv 2014
[10]

In: Linzen, T., Chrupała, G., Belinkov, Y., Hupkes, D

Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? an analysis of BERT’s attention. In: Linzen, T., Chrupała, G., Belinkov, Y., Hupkes, D. (eds.) Proceedings of the 2019 ACL Workshop BlackboxNLP: AnalyzingandInterpretingNeuralNetworksforNLP.AssociationforComputational Linguistics, Florence, Italy (Aug 2019).https://doi.org/10.1865...

work page doi:10.18653/v1/w19-4828 2019
[11]

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2021), https://arxiv.org/abs/2010.119291 Supplementary Materials 33

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

In: Proceedings of the 32nd ACM International Conference on Multimedia

Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al.: Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 11198–11201 (2024) 11

2024
[13]

where to look

Duan, Y., Li, A., Li, Y., Li, L., Wang, P.: Gridprune: From" where to look" to" what to select" in visual token pruning for mllms. arXiv preprint arXiv:2511.10081 (2025) 1, 4, 7

work page arXiv 2025
[14]

Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., Wu, Y., Ji, R., Shan, C., He, R.: Mme: A comprehensive evaluation benchmark for multimodal large language models (2025),https://arxiv.org/abs/ 2306.1339410, 22

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., Chen, P., Li, Y., Lin, S., Zhao, S., Li, K., Xu, T., Zheng, X., Chen, E., Shan, C., He, R., Sun, X.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis (2025),https://arxiv.org/abs/ 2405.2107511, 23

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6325–6334 (2017).https://doi.org/10.1109/CVPR.2017.67010, 21

work page doi:10.1109/cvpr.2017.67010 2017
[17]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., Manocha, D., Zhou, T.: Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1437...

2024
[18]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Guan, T., Wang, Z., Fu, P., Guo, Z., Shen, W., Zhou, K., Yue, T., Duan, C., Sun, H., Jiang, Q., et al.: A token-level text image foundation model for document understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23210–23220 (2025) 1

2025
[19]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Guan, T., Yang, Z., Wan, J., Yang, M., Guo, Z., Hu, Z., Luo, R., Chen, R., Jiang, S., Wang, P., et al.: Codepercept: Code-grounded visual stem perception for mllms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33542–33552 (2026) 1

2026
[20]

Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people (2018), https://arxiv.org/abs/1802.0821810, 21

work page internal anchor Pith review Pith/arXiv arXiv 2018
[21]

Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning andcompositionalquestionanswering.In:2019IEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR). pp. 6693–6702 (2019).https://doi.org/ 10.1109/CVPR.2019.0068610, 21

work page doi:10.1109/cvpr.2019.0068610 2019
[22]

arXiv:2102.05918 , year=

Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q.V., Sung, Y., Li, Z., Duerig, T.: ALIGN: Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918 (2021) 3

work page arXiv 2021
[23]

Kembhavi, A., Salvato, M., Kolve, E., Seo, M., Hajishirzi, H., Farhadi, A.: A diagram is worth a dozen images (2016),https://arxiv.org/abs/1603.07396 10, 22

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

Large Language Models are Zero-Shot Reasoners

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 (2022) 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

In: Tractability (2014), https://api.semanticscholar.org/CorpusID:61074902 34 X

Krause, A., Golovin, D.: Submodular function maximization. In: Tractability (2014), https://api.semanticscholar.org/CorpusID:61074902 34 X. Wang et al

2014
[26]

LLaVA-OneVision: Easy Visual Task Transfer

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024) 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023) 3, 4

Li, J., Li, D., Xiong, C., Hoi, S.C.H.: BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023) 3, 4

2023
[28]

In: IEEE Conf

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Lou, P., Wang, L., Qiao, Y.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22195–22206 (2024).https://doi.org/10. 1109/CVPR52733.2024.0209511, 23

work page arXiv 2024
[29]

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., Chang, K.W., Gao, J.: Grounded language-image pre-training (2022),https://arxiv.org/abs/2112.038574

work page arXiv 2022
[30]

In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 3

Li, W., Chen, L., Dai, D., Zhu, Z., Tan, M., Yuan, L., Li, L., Wang, J., Liu, J.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 3

2023
[31]

In: Proceedings of the European Conference on Computer Vision (ECCV) (2020) 3

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: OSCAR: Object-semantics aligned pre-training for vision-language tasks. In: Proceedings of the European Conference on Computer Vision (ECCV) (2020) 3

2020
[32]

arXiv preprint arXiv:2508.07871 (2025) 2, 4

Li, Y., Yang, J., Shen, Z., Han, L., Xu, H., Tang, R.: Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning. arXiv preprint arXiv:2508.07871 (2025) 2, 4

work page arXiv 2025
[33]

Li, Y., Wang, H., Duan, Y., Xu, H., Li, X.: Exploring visual interpretability for contrastive language-image pre-training (2022), https://arxiv.org/abs/2209. 0704618

2022
[34]

In: Bouamor, H., Pino, J., Bali, K

Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.R.: Evaluating object hallucination in large vision-language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Nat- ural Language Processing. Association for Computational Linguistics, Singa- pore (Dec 2023). https://doi.org/10.18653/v1/2023.em...

work page doi:10.18653/v1/2023.emnlp-main.20 2023
[35]

In: Proceedings of the 2024 conference on empirical methods in natural language processing

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024) 1

2024
[36]

In: Annual Meeting of the Association for Computational Linguistics (2011), https://api.semanticscholar.org/CorpusID:3203712

Lin, H.C., Bilmes, J.A.: A class of submodular functions for document summariza- tion. In: Annual Meeting of the Association for Computational Linguistics (2011), https://api.semanticscholar.org/CorpusID:3203712

2011
[37]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Lin, Z., Lin, M., Lin, L., Ji, R.: Boosting multimodal large language models with visual tokens withdrawal for rapid inference. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5334–5342 (2025) 1, 4

2025
[38]

io/blog/2024-01-30-llava-next/1, 10, 24

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., Lee, Y.J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https://llava-vl.github. io/blog/2024-01-30-llava-next/1, 10, 24

2024
[39]

In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 1, 3, 4, 10, 12, 24

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2023) 1, 3, 4, 10, 12, 24

2023
[40]

HiPrune: Hierarchical Attention for Efficient Token Pruning in Vision-Language Models

Liu, J., Du, F., Zhu, G., Lian, N., Li, J., Chen, B.: Hiprune: Training-free visual token pruning via hierarchical attention in vision-language models. arXiv preprint arXiv:2508.00553 (2025) 2, 4, 11, 12, 25 Supplementary Materials 35

work page internal anchor Pith review Pith/arXiv arXiv 2025
[41]

Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., Chen, K., Lin, D.: Mmbench: Is your multi-modal model an all-around player? (2024),https://arxiv.org/abs/2307.0628110, 22

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Science China Information Sciences67(12) (Dec 2024).https://doi.org/10.1007/s11432- 024-4235-6,http://dx.doi.org/10.1007/s11432-024-4235-610, 22

Liu, Y., Li, Z., Huang, M., Yang, B., Yu, W., Li, C., Yin, X.C., Liu, C.L., Jin, L., Bai, X.: Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences67(12) (Dec 2024).https://doi.org/10.1007/s11432- 024-4235-6,http://dx.doi.org/10.1007/s11432-024-4235-610, 22

work page doi:10.1007/s11432- 2024
[43]

In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 2507–2521. Curran Associat...

2022
[44]

Advances in Applied Probability7(1), 83–122 (1975) 9

Macchi, O.: The coincidence approach to stochastic point processes. Advances in Applied Probability7(1), 83–122 (1975) 9

1975
[45]

In: Muresan, S., Nakov, P., Villavicencio, A

Masry, A., Long, D.X., Tan, J.Q., Joty, S., Hoque, E.: ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, Dublin, Ireland (May 2022). https://doi.org/1...

work page doi:10.18653/v1/2022.findings- 2022
[46]

Mathew, M., Bagal, V., Tito, R.P., Karatzas, D., Valveny, E., Jawahar, C.V.: Infographicvqa (2021),https://arxiv.org/abs/2104.1275611, 22

work page arXiv 2021
[47]

Mathew, M., Karatzas, D., Jawahar, C.V.: Docvqa: A dataset for vqa on document images (2021),https://arxiv.org/abs/2007.0039811, 22

work page arXiv 2021
[48]

Mathematical Programming14, 265–294 (1978),https://api.semanticscholar.org/CorpusID:2068004252, 18

Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming14, 265–294 (1978),https://api.semanticscholar.org/CorpusID:2068004252, 18

1978
[49]

In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021) 2, 3, 18, 24

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML) (2021) 2, 3, 18, 24

2021
[50]

Transactions of the Association for Computa- tional Linguistics8(2020)

Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computa- tional Linguistics8(2020). https://doi.org/10.1162/tacl_a_00349 , https: //aclanthology.org/2020.tacl-1.54/5

work page doi:10.1162/tacl_a_00349 2020
[51]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Shang, Y., Cai, M., Xu, B., Lee, Y.J., Yan, Y.: Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22857–22867 (2025) 1, 4, 11, 25

2025
[52]

Shao, K., Tao, K., Qin, C., You, H., Sui, Y., Wang, H.: Holitom: Holistic token merging for fast video large language models (2025),https://arxiv.org/abs/ 2505.2133412

work page arXiv 2025
[53]

In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8309–8318 (2019). https://doi.org/10.1109/CVPR.2019.0085110, 21

work page doi:10.1109/cvpr.2019.0085110 2019
[54]

In: Proceedings of the 31st International Conference on Computational Linguistics

Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal llms. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 7614–7623 (2025) 1, 4, 11, 19, 26 36 X. Wang et al

2025
[55]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS) (2017) 1, 3
[56]

Vig, J., Belinkov, Y.: Analyzing the structure of attention in a transformer language model. In: Linzen, T., Chrupała, G., Belinkov, Y., Hupkes, D. (eds.) Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Florence, Italy (Aug 2019), https://aclanthology.org/W19-4808/ 2
[57]

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems (NeurIPS) (2022) 4
[58]

Wen, Z., Gao, Y., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., Zhang, L.: Stop looking for “important tokens” in multimodal language models: Duplication matters more. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 9972–9991 (2025) 1, 4, 11, 25
[59]

Wu, H., Li, D., Chen, B., Li, J.: LongVideoBench: A benchmark for long-context interleaved video-language understanding. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 28828–28857. Curran Associates, Inc. (2024). https://doi.org/10.52202/0790...

work page doi:10.52202/079017-090711 2024
[60]

Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M.: Efficient streaming language models with attention sinks (2024), https://arxiv.org/abs/2309.17453 5
[61]

Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y., Cao, Y., He, C., Wang, J., Wu, F., Lin, D.: Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction (2025), https://arxiv.org/abs/2410.17247 11, 25
[62]

Yang, S., Chen, Y., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: Visionzip: Longer is better but not necessary in vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19792–19802 (2025) 1, 4, 11, 25
[63]

Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., Xu, C.: Filip: Fine-grained interactive language-image pre-training (2021), https://arxiv.org/abs/2111.07783 4
[64]

Ye, R., Jin, W., Wang, H., Wu, Y., Xu, J., Liu, Y., Luo, P.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023) 3
[65]

Ye, W., Wu, Q., Lin, W., Zhou, Y.: Fit and prune: Fast and training-free visual token pruning for multi-modal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 22128–22136 (2025) 1, 4
[66]

Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., Wang, L.: Mm-vet: Evaluating large multimodal models for integrated capabilities (2024), https://arxiv.org/abs/2308.02490 10, 23
[67]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training (2023), https://arxiv.org/abs/2303.15343 24
[68]

Zhang, H., Lyu, M., He, C., Ao, Y., Lin, Y.: Trimtokenator: Towards adaptive visual token pruning for large multimodal models. arXiv preprint arXiv:2509.00320 (2025) 2, 4
[69]

Zhang, H., Lyu, M., Huang, B., Ao, Y., Lin, Y.: Trimtokenator-lc: Towards adaptive visual token pruning for large multimodal models with long contexts (2025), https://arxiv.org/abs/2512.22748 2
[70]

TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models

Zhang, H., Ou, C., Yan, D., Wang, P., Yan, Q., Li, Y., Xiao, R., Shen, C.: Pio-fvlm: Rethinking training-free visual token reduction for vlm acceleration from an inference-objective perspective. arXiv preprint arXiv:2602.04657 (2026) 1, 4
[71]

Zhang, K., Li, B., Zhang, P., Pu, F., Cahyono, J.A., Hu, K., Liu, S., Zhang, Y., Yang, J., Li, C., Liu, Z.: Lmms-eval: Reality check on the evaluation of large multimodal models (2024), https://arxiv.org/abs/2407.12772 11
[72]

Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., Zhang, S.: Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms. arXiv preprint arXiv:2506.10967 (2025) 2, 4, 7, 11, 12, 19, 25, 26
[73]

Zhang, Y., Fan, C.K., Ma, J., Zheng, W., Huang, T., Cheng, K., Gudovskiy, D., Okuno, T., Nakata, Y., Keutzer, K., et al.: Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417 (2024) 1, 4, 11, 25
[74]

Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data (2025), https://arxiv.org/abs/2410.02713 10, 13, 24
[75]

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023) 3
[76]

Zou, X., Lu, D., Wang, Y., Yan, Y., Lyu, Y., Zheng, X., Zhang, L., Hu, X.: Don’t just chase "highlighted tokens" in mllms: Revisiting visual holistic context retention (2025), https://arxiv.org/abs/2510.02912 2
