pith. sign in

arxiv: 2606.24156 · v1 · pith:ZMQPNYINnew · submitted 2026-06-23 · 💻 cs.CV

Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction

Pith reviewed 2026-06-26 01:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual token reductionmultimodal large language modelsattention prior correctiontraining-free pruningmodel-induced biastoken efficiency
0
0 comments X

The pith

Null token probe corrects model-induced prior in attention to improve visual token pruning for MLLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention scores used for visual token pruning in multimodal LLMs are dominated by a model-induced prior that favors certain task-agnostic regions even without any instruction, which suppresses scores for tokens that actually carry task-specific information. PriorTR fixes this by estimating the prior attention map with a null token and contrasting it against the instruction-conditioned attention to quantify the extra information each token provides. The null token is inserted directly into the attention block so both the prior and the task posterior are obtained in one forward pass. If the approach holds, pruning decisions become more reliable at low token counts, preserving accuracy while cutting computation.

Core claim

PriorTR separates task-conditioned attention from the model-induced prior by contrasting the attention distribution with that estimated from a null token inserted as an instruction-agnostic probe, thereby identifying visual tokens that contribute usable information beyond the prior for more effective pruning.

What carries the argument

Null token inserted as an instruction-agnostic probe in the attention block to estimate the model-induced prior attention map and subtract it from the task-conditioned map in a single forward pass.

If this is right

  • PriorTR improves the accuracy-efficiency trade-off over strong training-free baselines, especially under aggressive token budgets.
  • The method delivers consistent gains across multiple multimodal benchmarks and different MLLMs.
  • Both the prior and task-conditioned distributions are obtained without duplicated propagation through the model.
  • No additional training or fine-tuning is required for the correction to take effect.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same null-token contrast could be tested on other attention-based selection tasks inside transformers where an unconditional bias may suppress conditional signals.
  • Combining the prior correction with existing quantization or KV-cache optimizations might produce additive speed-ups on edge devices.
  • If the prior is stable across inputs, the null-token computation could be cached or approximated for further efficiency.

Load-bearing premise

That a single null token inserted in the attention block accurately captures the full model-induced prior distribution without changing how task-conditioned attention is computed for the real instruction.

What would settle it

Compare the attention scores produced when the null token is present against the attention distribution obtained by running the model on the same visual input with no text instruction at all; large mismatch would indicate the probe fails to isolate the prior.

Figures

Figures reproduced from arXiv: 2606.24156 by Jianwei Yin, Jingcai Guo, Taotao Cai, Yuxiang Cai, Zengjie Chen, Zhi Chen.

Figure 1
Figure 1. Figure 1: Left: Visualization of text-to-visual attention maps across LLM decoding layers for a sample instruction (“What is the person doing in this image?”). (a) Null instruction: even without textual input, the model consistently attends to certain instruction-agnostic regions, revealing a strong model-induced prior. (b) Task instruction: attention distribution is partially influenced by the model-induced prior. … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PriorTR. Given an input image and a task instruction, PriorTR performs a single forward pass up to layer L of the shared LLM decoder. At layer L, attention from the null token to visual tokens estimates the model-induced saliency prior, while attention from instruction tokens to visual tokens estimates the task-conditioned posterior. Visual tokens are then scored by their V-usable information c… view at source ↗
Figure 3
Figure 3. Figure 3: Performance comparison under different token budgets on LLaVA-1.5-13B. These radar plots illustrate normalized performance [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of prior, posterior, and prior-corrected attention maps. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative examples from the 12 image understanding benchmarks used in our evaluation. Each panel shows a sample image [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Effect of pruning layer (K=64, LLaVA-1.5-7B). The dashed line denotes the full-token baseline. PriorTR is stable across depths and consistently outperforms FastV; FastV degrades noticeably at shallow layers. B.4. Effect of Pruning Layer We study how the choice of pruning layer L affects down￾stream performance [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Failure cases at K=64 (↓ 88.9%). Representative examples where the full-token baseline is correct but both FastV and PriorTR fail, across counting, left–right spatial relations, and depth ordering—a budget-level ceiling rather than a method-specific weakness. What does the large black-and-white sign mounted on the pole say? Base line The large black-and-white sign mounted on the pole says "One Way." FastV … view at source ↗
Figure 8
Figure 8. Figure 8: FastV completely swaps the two signs, reading the large sign as [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: A bright white horse monopolizes FastV’s prior-driven attention, leaving the trolley’s front panel with virtually no token budget. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: FastV’s prior-biased attention concentrates on the prominent teddy bear, leaving insufficient tokens for the sheep’s woolly [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: No bat is present, yet FastV hallucinates [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
read the original abstract

Visual token reduction has emerged as an effective strategy for accelerating Multimodal Large Language Models (MLLMs). Many existing methods prune tokens by ranking text-visual attention scores. However, we show that attention is often dominated by a model-induced prior: even without textual instruction, MLLMs tend to focus on certain task-agnostic regions. Consequently, the attention scores of instruction-conditioned tokens are suppressed, increasing the risk that these tokens are discarded during pruning. To address this issue, we propose Prior-Corrected Token Reduction (PriorTR), a training-free token reduction method that explicitly separates task-conditioned attention from the model-induced prior. PriorTR estimates the attention map of the prior, and contrasts it with the task-conditioned attention distribution to measure the additional usable information contributed by each visual token. Importantly, PriorTR computes both the model-induced prior and the task-conditioned posterior within a single forward pass by introducing a null token that serves as an instruction-agnostic probe in the attention block. This design avoids duplicated propagation. Extensive experiments across multiple multimodal benchmarks and MLLMs demonstrate that PriorTR consistently improves the trade-off between accuracy and efficiency over strong training-free baselines, particularly under aggressive token budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce Prior-Corrected Token Reduction (PriorTR), a training-free method for visual token reduction in MLLMs. It identifies that attention is dominated by a model-induced prior even without instructions, and proposes using a single null token in the attention block to estimate this prior and contrast it with the task-conditioned attention to select tokens with additional usable information. This is said to improve the accuracy-efficiency trade-off over baselines, particularly at low token budgets.

Significance. If the separation of prior and task-conditioned attention holds without distortion, PriorTR could offer a simple, training-free enhancement to existing attention-based pruning methods for MLLMs, improving efficiency at aggressive reduction rates while preserving accuracy. The single-pass design avoids extra forward passes, which is a practical strength.

major comments (1)
  1. [Abstract and method description] Abstract and method description: The central mechanism relies on the null token providing a clean estimate of the model-induced prior without altering the task-conditioned attention. However, inserting an additional token into the attention computation changes the softmax denominator for all tokens (as per standard scaled dot-product attention), so the posterior attention map is not identical to the no-null case. This distortion could contaminate the contrast signal used to identify 'additional usable information', which is load-bearing for the claimed separation and the reported gains under aggressive budgets. The manuscript should include analysis quantifying this effect or a method to mitigate it.
minor comments (1)
  1. [Abstract] The abstract states that PriorTR 'consistently improves' the trade-off but provides no quantitative details on the magnitude of gains or the specific baselines compared; adding a brief summary of key results would strengthen the claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on the potential impact of the null token on the attention computation. We address the concern point by point below.

read point-by-point responses
  1. Referee: [Abstract and method description] Abstract and method description: The central mechanism relies on the null token providing a clean estimate of the model-induced prior without altering the task-conditioned attention. However, inserting an additional token into the attention computation changes the softmax denominator for all tokens (as per standard scaled dot-product attention), so the posterior attention map is not identical to the no-null case. This distortion could contaminate the contrast signal used to identify 'additional usable information', which is load-bearing for the claimed separation and the reported gains under aggressive budgets. The manuscript should include analysis quantifying this effect or a method to mitigate it.

    Authors: We agree that inserting the null token alters the softmax normalization, so the task-conditioned attention scores are not identical to those computed without it. This is a mathematically accurate observation. In the revised manuscript we will add a dedicated analysis section that (i) quantifies the magnitude of the change in attention scores with versus without the null token across layers and heads, and (ii) measures how much the final token-ranking decisions differ when the contrast is computed from the distorted versus undistorted maps. Preliminary internal checks indicate the distortion is small relative to the prior-versus-task gap we exploit, but we will report these numbers explicitly and discuss whether a simple re-normalization correction is warranted. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines PriorTR as an explicit algorithmic procedure: insert a null token as an instruction-agnostic probe, compute the prior attention map and task-conditioned map in one forward pass, then contrast them to rank tokens by additional usable information. No equations or claims in the abstract reduce any output quantity to a fitted parameter or to the input definition itself. No self-citations are invoked as load-bearing uniqueness theorems, and the method is presented as training-free without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained and independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The method rests on the assumption that attention can be cleanly decomposed into prior and task-conditioned parts via a null token; no free parameters are mentioned.

axioms (1)
  • domain assumption Attention scores in transformer blocks can be separated into a model-induced prior component and a task-conditioned component.
    This separation is the explicit basis for the contrast step in PriorTR.
invented entities (1)
  • null token no independent evidence
    purpose: Instruction-agnostic probe to estimate model-induced prior attention map
    Newly introduced element in the attention block for this purpose.

pith-pipeline@v0.9.1-grok · 5752 in / 1117 out tokens · 22686 ms · 2026-06-26T01:07:42.399073+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

94 extracted references · 3 linked inside Pith

  1. [1]

    In: Pro- ceedings of the IEEE/CVF international conference on computer vision

    Agrawal, H., Desai, K., Wang, Y ., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 8948–8957 (2019)

  2. [2]

    Advances in neural infor- mation processing systems35, 23716–23736 (2022)

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural infor- mation processing systems35, 23716–23736 (2022)

  3. [3]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Alvar, S.R., Singh, G., Akbari, M., Zhang, Y .: Di- vprune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9392– 9401 (2025)

  4. [4]

    arXiv preprint arXiv:2511.21631 (2025)

    Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3- vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    Advances in neural information processing systems33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

  6. [6]

    In: European Conference on Computer Vision

    Chen, B., Cai, Y ., Luo, Y ., Zhang, Y ., Chen, Z.: Spec- tral evolution-guided token pruning in multimodal large language models. In: European Conference on Computer Vision. Springer (2026)

  7. [7]

    In: European Conference on Com- puter Vision

    Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision- language models. In: European Conference on Com- puter Vision. pp. 19–35. Springer (2024)

  8. [8]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: In- ternvl: Scaling up vision foundation models and align- ing for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  9. [9]

    In: W ACV 2020

    Chen, Z., Li, J., Luo, Y ., Huang, Z., Yang, Y .: Canzsl: Cycle-consistent adversarial networks for zero-shot learning from natural language. In: W ACV 2020. pp. 874–883 (2020)

  10. [10]

    IEEE Transactions on Multimedia (2026)

    Chen, Z., Luo, Y ., Huang, Z., Li, J., Wang, S., Yu, X.: Distributed zero-shot learning for visual recognition. IEEE Transactions on Multimedia (2026)

  11. [11]

    ICCV 2021 (2021)

    Chen, Z., Luo, Y ., Qiu, R., Wang, S., Huang, Z., Li, J., Zhang, Z.: Semantics disentangling for generalized zero-shot learning. ICCV 2021 (2021)

  12. [12]

    IEEE Transactions on Multimedia (2022)

    Chen, Z., Luo, Y ., Wang, S., Li, J., Huang, Z.: Gsm- flow: Generation shifts mitigating flow for generalized zero-shot learning. IEEE Transactions on Multimedia (2022)

  13. [13]

    In: ACM MM 2021 (2021)

    Chen, Z., Luo, Y ., Wang, S., Qiu, R., Li, J., Huang, Z.: Mitigating generation shifts for generalized zero-shot learning. In: ACM MM 2021 (2021)

  14. [14]

    In: ACM MM

    Chen, Z., Wang, S., Li, J., Huang, Z.: Rethinking gen- erative zero-shot learning: An ensemble learning per- spective for recognising visual patches. In: ACM MM

  15. [15]

    3413–3421 (2020)

    pp. 3413–3421 (2020)

  16. [16]

    Pattern Recognition p

    Chen, Z., Yu, X., Tao, X., Li, Y ., Huang, Z.: Cluster- aware prompt ensemble learning for few-shot vision– language model adaptation. Pattern Recognition p. 112596 (2025)

  17. [17]

    In: ACM MM 2023 (2023)

    Chen, Z., Zhang, P., Li, J., Wang, S., Huang, Z.: Zero- shot learning by harnessing adversarial samples. In: ACM MM 2023 (2023)

  18. [18]

    In: ICCV 2025 (2025)

    Chen, Z., Zhao, Z., Guo, J., Li, J., Huang, Z.: Svip: Semantically contextualized visual patches for zero- shot learning. In: ICCV 2025 (2025)

  19. [19]

    Pattern Recog- nition p

    Chen, Z., Zhao, Z., Luo, Y ., Li, Y ., Tao, X., Huang, Z.: Fastedit: Fast text-guided single-image editing via semantic-aware diffusion fine-tuning. Pattern Recog- nition p. 112583 (2025)

  20. [20]

    arXiv preprint arXiv:2507.06261 (2025)

    Comanici, G., Bieber, E., Schaekermann, M., Pasu- pat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  21. [21]

    In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing

    Fan, Y ., Zhao, A., Fu, J., Tong, J., Su, H., Pan, Y ., Zhang, W., Shen, X.: Visipruner: Decoding discontin- uous cross-modal dynamics for efficient multimodal llms. In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing. pp. 18896–18913 (2025)

  22. [22]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)

    Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)

  23. [23]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grau- man, K., Luo, J., Bigham, J.P.: Vizwiz grand chal- lenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3608–3617 (2018)

  24. [24]

    arXiv preprint arXiv:2411.176862(3) (2024)

    Han, Y ., Liu, X., Ding, P., Wang, D., Chen, H., Yan, Q., Huang, S.: Rethinking token reduction in mllms: Towards a unified paradigm for training-free accelera- tion. arXiv preprint arXiv:2411.176862(3) (2024)

  25. [25]

    arXiv preprint arXiv:2410.08584 (2024)

    He, Y ., Chen, F., Liu, J., Shao, W., Zhou, H., Zhang, K., Zhuang, B.: Zipvl: Efficient large vision- language models with dynamic token sparsification. arXiv preprint arXiv:2410.08584 (2024)

  26. [26]

    arXiv preprint arXiv:2509.00419 (2025)

    Hu, L., Shang, F., Feng, W., Wan, L.: Lightvlm: Ac- celeraing large multimodal models with pyramid to- ken merging and kv cache compression. arXiv preprint arXiv:2509.00419 (2025)

  27. [27]

    In: European conference on computer vision

    Huang, K., Zou, H., Xi, Y ., Wang, B., Xie, Z., Yu, L.: Ivtp: Instruction-guided visual token pruning for large vision-language models. In: European conference on computer vision. pp. 214–230. Springer (2024)

  28. [28]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion

    Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion. pp. 6700–6709 (2019)

  29. [29]

    arXiv preprint arXiv:2410.21276 (2024)

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  30. [30]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Jang, Y ., Song, Y ., Yu, Y ., Kim, Y ., Kim, G.: Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2758– 2766 (2017)

  31. [31]

    arXiv preprint arXiv:2503.11549 (2025)

    Jeddi, A., Baghbanzadeh, N., Dolatabadi, E., Taati, B.: Similarity-aware token pruning: Your vlm but faster. arXiv preprint arXiv:2503.11549 (2025)

  32. [32]

    In: Proceedings of the 29th symposium on operating systems principles

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y ., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serv- ing with pagedattention. In: Proceedings of the 29th symposium on operating systems principles. pp. 611– 626 (2023)

  33. [33]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Li, B., Ge, Y ., Ge, Y ., Wang, G., Wang, R., Zhang, R., Shan, Y .: Seed-bench: Benchmarking multi- modal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 13299–13308 (2024)

  34. [34]

    arXiv e-prints pp

    Li, D., Yang, Z., Lu, S.: Todre: Visual token prun- ing via diversity and task awareness for efficient large vision-language models. arXiv e-prints pp. arXiv– 2505 (2025)

  35. [35]

    In: Proceedings of the Computer Vision and Pat- tern Recognition Conference

    Li, S., Hu, Y ., Ning, X., Liu, X., Hong, K., Jia, X., Li, X., Yan, Y ., Ran, P., Dai, G., et al.: Mbq: Modality- balanced quantization for large vision-language mod- els. In: Proceedings of the Computer Vision and Pat- tern Recognition Conference. pp. 4167–4177 (2025)

  36. [36]

    In: ACM MM 2025, pp

    Li, X., Zhang, D., Du, Z., Zhu, L., Chen, Z., Li, J.: Pataug: Augmentation of augmentation for test- time adaptation. In: ACM MM 2025, pp. 5080–5089 (2025)

  37. [37]

    IEEE Transactions on Pattern Analysis and Machine Intelli- gence (2025)

    Li, Y ., Zhang, Y ., Wang, C., Zhong, Z., Chen, Y ., Chu, R., Liu, S., Jia, J.: Mini-gemini: Mining the poten- tial of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelli- gence (2025)

  38. [38]

    In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing

    Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision- language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing. pp. 292–305 (2023)

  39. [39]

    arXiv preprint arXiv:2501.14204 (2025)

    Liang, X., Guan, C., Lu, J., Chen, H., Wang, H., Hu, H.: Dynamic token reduction during gen- eration for vision language models. arXiv preprint arXiv:2501.14204 (2025)

  40. [40]

    NeurIPS2024 (2024)

    Lim, J.S., Chen, Z., Chen, Z., Baktashmotlagh, M., Yu, X., Huang, Z., Luo, Y .: Dipex: Dispersing prompt expansion for class-agnostic object detection. NeurIPS2024 (2024)

  41. [41]

    In: Proceed- ings of the 2024 conference on empirical methods in natural language processing

    Lin, B., Ye, Y ., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual repre- sentation by alignment before projection. In: Proceed- ings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)

  42. [42]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, H., Li, C., Li, Y ., Lee, Y .J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  43. [43]

    Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., Lee, Y .J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https:// llava-vl.github.io/blog/2024-01-30- llava-next/

  44. [44]

    Liu, H., Li, C., Wu, Q., Lee, Y .J.: Visual instruction tuning (2023)

  45. [45]

    arXiv preprint arXiv:2410.07278 (2024)

    Liu, Y ., Wu, F., Li, R., Tang, Z., Li, K.: Par: Prompt- aware token reduction method for efficient large multimodal models. arXiv preprint arXiv:2410.07278 (2024)

  46. [46]

    Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W., Yuan, Y ., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)

  47. [47]

    Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)

    Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to ex- plain: Multimodal reasoning via thought chains for science question answering. Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)

  48. [48]

    In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition

    Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark re- quiring external knowledge. In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. pp. 3195–3204 (2019)

  49. [49]

    In: Proceedings of the IEEE international conference on computer vision

    Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k enti- ties: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)

  50. [50]

    OpenAI blog1(8), 9 (2019)

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsuper- vised multitask learners. OpenAI blog1(8), 9 (2019)

  51. [51]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Shang, Y ., Cai, M., Xu, B., Lee, Y .J., Yan, Y .: Llava- prumerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22857–22867 (2025)

  52. [52]

    In: AAAI 2026 (2026)

    Shao, Y ., Lin, D., Yan, M., Chen, S., Zeng, F., Liao, M., Ma, A., Yan, Z., Wang, H., Wang, Y ., et al.: Tr-dq: Time-rotation diffusion quantization. In: AAAI 2026 (2026)

  53. [53]

    In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision

    Shao, Z., Wang, M., Yu, Z., Pan, W., Yang, Y ., Wei, T., Zhang, H., Mao, N., Chen, W., Yu, J.: Growing a twig to accelerate large vision-language models. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20064–20074 (2025)

  54. [54]

    arXiv preprint arXiv:2305.17455 (2023)

    Shi, D., Tao, C., Rao, A., Yang, Z., Yuan, C., Wang, J.: Crossget: Cross-guided ensemble of tokens for accel- erating vision-language transformers. arXiv preprint arXiv:2305.17455 (2023)

  55. [55]

    Si, G., Yin, H., Li, X., Liao, W., He, T., Peng, P., Zhu, W., et al.: Infoprune: Revisiting visual token pruning from an information-theoretic perspective (2025)

  56. [56]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Singh, A., Natarajan, V ., Shah, M., Jiang, Y ., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)

  57. [57]

    In: Proceedings of the 31st International Conference on Computational Linguistics

    Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal llms. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 7614–7623 (2025)

  58. [58]

    CVPR 2022 (2022)

    Su, H., Li, J., Chen, Z., Zhu, L., Lu, K.: Distinguish- ing unseen from seen for generalized zero-shot learn- ing. CVPR 2022 (2022)

  59. [59]

    arXiv preprint arXiv:2602.03060 (2026)

    Sun, Z., Ma, Y ., Liu, G., Chen, Y ., Tang, X., Hu, Y ., Xu, Y .: Ivc-prune: Revealing the implicit visual coordinates in lvlms for vision token pruning. arXiv preprint arXiv:2602.03060 (2026)

  60. [60]

    In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence

    Vasu, P.K.A., Faghri, F., Li, C.L., Koc, C., True, N., Antony, A., Santhanam, G., Gabriel, J., Grasch, P., Tuzel, O., et al.: Fastvlm: Efficient vision encod- ing for vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence. pp. 19769–19780 (2025)

  61. [61]

    Advances in Neural Informa- tion Processing Systems37, 114553–114573 (2024)

    Wang, C., Wang, Z., Xu, X., Tang, Y ., Zhou, J., Lu, J.: Q-vlm: Post-training quantization for large vision-language models. Advances in Neural Informa- tion Processing Systems37, 114553–114573 (2024)

  62. [62]

    arXiv preprint arXiv:2505.19235 (2025)

    Wang, Q., Ye, H., Chung, M.Y ., Liu, Y ., Lin, Y ., Kuo, M., Ma, M., Zhang, J., Chen, Y .: Core- matching: A co-adaptive sparse inference framework with token and neuron pruning for comprehensive ac- celeration of vision-language models. arXiv preprint arXiv:2505.19235 (2025)

  63. [63]

    In: IJCAI 2025 (2025)

    Wang, T., Guo, J., Li, D., Chen, Z.: On the discrim- ination and consistency for exemplar-free class incre- mental learning. In: IJCAI 2025 (2025)

  64. [64]

    In: CVPR 2026 Findings (2026)

    Wang, W., Guo, J., Cai, Y ., Chen*, Z.: Learning multi- modal prototypes for cross-domain few-shot object detection. In: CVPR 2026 Findings (2026)

  65. [65]

    arXiv preprint arXiv:2602.17196 (2026)

    Wang, Y ., Wu, J., Ni, Z., Yang, C., Liu, Y ., Yang, L., Zhou, Y ., Wen, Y ., He, L.: Entropy- prune: Matrix entropy guided visual token pruning for multimodal large language models. arXiv preprint arXiv:2602.17196 (2026)

  66. [66]

    Wen, Z., Gao, Y ., Li, W., He, C., Zhang, L.: To- ken pruning in multimodal large language models: Are we solving the right problem? In: Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025. pp. 15537–15549. Association for Computational Lin- guistics (2025)

  67. [67]

    important tokens

    Wen, Z., Gao, Y ., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., Zhang, L.: Stop looking for “important tokens” in multimodal language models: Duplication matters more. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing. pp. 9972–9991 (2025)

  68. [68]

    In: Findings of the Association for Com- putational Linguistics: EMNLP 2024

    Wu, Q., Xu, W., Liu, W., Tan, T., Liujianfeng, L., Li, A., Luan, J., Wang, B., Shang, S.: Mobilevlm: A vision-language model for better intra-and inter-ui un- derstanding. In: Findings of the Association for Com- putational Linguistics: EMNLP 2024. pp. 10231– 10251 (2024)

  69. [69]

    arXiv preprint arXiv:2502.00791 (2025)

    Xing, L., Wang, A.J., Yan, R., Shu, X., Tang, J.: Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791 (2025)

  70. [70]

    In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)

    Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y ., Cao, Y ., He, C., Wang, J., Wu, F., et al.: Pyramid- drop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)

  71. [71]

    In: Pro- ceedings of the 25th ACM international conference on Multimedia

    Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y .: Video question answering via gradually refined attention over appearance and motion. In: Pro- ceedings of the 25th ACM international conference on Multimedia. pp. 1645–1653 (2017)

  72. [72]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Xu, J., Mei, T., Yao, T., Rui, Y .: Msr-vtt: A large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5288–5296 (2016)

  73. [73]

    In: International Conference on Learning Representations (2020)

    Xu, Y ., Zhao, S., Song, J., Stewart, R., Ermon, S.: A theory of usable information under computational constraints. In: International Conference on Learning Representations (2020)

  74. [74]

    In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference

    Yang, C., Sui, Y ., Xiao, J., Huang, L., Gong, Y ., Li, C., Yan, J., Bai, Y ., Sadayappan, P., Hu, X., et al.: Topv: Compatible token pruning with inference time opti- mization for fast and low-memory multimodal vision language model. In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference. pp. 19803– 19813 (2025)

  75. [75]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, C., Dong, X., Zhu, X., Su, W., Wang, J., Tian, H., Chen, Z., Wang, W., Lu, L., Dai, J.: Pvc: Pro- gressive visual token compression for unified image and video processing in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24939–24949 (2025)

  76. [76]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yang, S., Chen, Y ., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: Visionzip: Longer is better but not nec- essary in vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19792–19802 (2025)

  77. [77]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Ye, W., Wu, Q., Lin, W., Zhou, Y .: Fit and prune: Fast and training-free visual token pruning for multi- modal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 22128–22136 (2025)

  78. [78]

    In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition

    Ye, X., Gan, Y ., Ge, Y ., Zhang, X.P., Tang, Y .: Atp- llava: Adaptive token pruning for large vision lan- guage models. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. pp. 24972–24982 (2025)

  79. [79]

    In: ACM MM 2022 (2022)

    You, F., Li, J., Chen, Z., Zhu, L.: Pixel exclu- sion: Uncertainty-aware boundary discovery for ac- tive cross-domain semantic segmentation. In: ACM MM 2022 (2022)

  80. [80]

    In: ACM MM 2021

    You, F., Li, J., Zhu, L., Chen, Z., Huang, Z.: Domain adaptive semantic segmentation without source data. In: ACM MM 2021. pp. 3293–3302 (2021)

Showing first 80 references.