Accelerating Multimodal Large Language Models with Prior-Corrected Token Reduction
Pith reviewed 2026-06-26 01:07 UTC · model grok-4.3
The pith
Null token probe corrects model-induced prior in attention to improve visual token pruning for MLLMs
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PriorTR separates task-conditioned attention from the model-induced prior by contrasting the attention distribution with that estimated from a null token inserted as an instruction-agnostic probe, thereby identifying visual tokens that contribute usable information beyond the prior for more effective pruning.
What carries the argument
Null token inserted as an instruction-agnostic probe in the attention block to estimate the model-induced prior attention map and subtract it from the task-conditioned map in a single forward pass.
If this is right
- PriorTR improves the accuracy-efficiency trade-off over strong training-free baselines, especially under aggressive token budgets.
- The method delivers consistent gains across multiple multimodal benchmarks and different MLLMs.
- Both the prior and task-conditioned distributions are obtained without duplicated propagation through the model.
- No additional training or fine-tuning is required for the correction to take effect.
Where Pith is reading between the lines
- The same null-token contrast could be tested on other attention-based selection tasks inside transformers where an unconditional bias may suppress conditional signals.
- Combining the prior correction with existing quantization or KV-cache optimizations might produce additive speed-ups on edge devices.
- If the prior is stable across inputs, the null-token computation could be cached or approximated for further efficiency.
Load-bearing premise
That a single null token inserted in the attention block accurately captures the full model-induced prior distribution without changing how task-conditioned attention is computed for the real instruction.
What would settle it
Compare the attention scores produced when the null token is present against the attention distribution obtained by running the model on the same visual input with no text instruction at all; large mismatch would indicate the probe fails to isolate the prior.
Figures
read the original abstract
Visual token reduction has emerged as an effective strategy for accelerating Multimodal Large Language Models (MLLMs). Many existing methods prune tokens by ranking text-visual attention scores. However, we show that attention is often dominated by a model-induced prior: even without textual instruction, MLLMs tend to focus on certain task-agnostic regions. Consequently, the attention scores of instruction-conditioned tokens are suppressed, increasing the risk that these tokens are discarded during pruning. To address this issue, we propose Prior-Corrected Token Reduction (PriorTR), a training-free token reduction method that explicitly separates task-conditioned attention from the model-induced prior. PriorTR estimates the attention map of the prior, and contrasts it with the task-conditioned attention distribution to measure the additional usable information contributed by each visual token. Importantly, PriorTR computes both the model-induced prior and the task-conditioned posterior within a single forward pass by introducing a null token that serves as an instruction-agnostic probe in the attention block. This design avoids duplicated propagation. Extensive experiments across multiple multimodal benchmarks and MLLMs demonstrate that PriorTR consistently improves the trade-off between accuracy and efficiency over strong training-free baselines, particularly under aggressive token budgets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce Prior-Corrected Token Reduction (PriorTR), a training-free method for visual token reduction in MLLMs. It identifies that attention is dominated by a model-induced prior even without instructions, and proposes using a single null token in the attention block to estimate this prior and contrast it with the task-conditioned attention to select tokens with additional usable information. This is said to improve the accuracy-efficiency trade-off over baselines, particularly at low token budgets.
Significance. If the separation of prior and task-conditioned attention holds without distortion, PriorTR could offer a simple, training-free enhancement to existing attention-based pruning methods for MLLMs, improving efficiency at aggressive reduction rates while preserving accuracy. The single-pass design avoids extra forward passes, which is a practical strength.
major comments (1)
- [Abstract and method description] Abstract and method description: The central mechanism relies on the null token providing a clean estimate of the model-induced prior without altering the task-conditioned attention. However, inserting an additional token into the attention computation changes the softmax denominator for all tokens (as per standard scaled dot-product attention), so the posterior attention map is not identical to the no-null case. This distortion could contaminate the contrast signal used to identify 'additional usable information', which is load-bearing for the claimed separation and the reported gains under aggressive budgets. The manuscript should include analysis quantifying this effect or a method to mitigate it.
minor comments (1)
- [Abstract] The abstract states that PriorTR 'consistently improves' the trade-off but provides no quantitative details on the magnitude of gains or the specific baselines compared; adding a brief summary of key results would strengthen the claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the potential impact of the null token on the attention computation. We address the concern point by point below.
read point-by-point responses
-
Referee: [Abstract and method description] Abstract and method description: The central mechanism relies on the null token providing a clean estimate of the model-induced prior without altering the task-conditioned attention. However, inserting an additional token into the attention computation changes the softmax denominator for all tokens (as per standard scaled dot-product attention), so the posterior attention map is not identical to the no-null case. This distortion could contaminate the contrast signal used to identify 'additional usable information', which is load-bearing for the claimed separation and the reported gains under aggressive budgets. The manuscript should include analysis quantifying this effect or a method to mitigate it.
Authors: We agree that inserting the null token alters the softmax normalization, so the task-conditioned attention scores are not identical to those computed without it. This is a mathematically accurate observation. In the revised manuscript we will add a dedicated analysis section that (i) quantifies the magnitude of the change in attention scores with versus without the null token across layers and heads, and (ii) measures how much the final token-ranking decisions differ when the contrast is computed from the distorted versus undistorted maps. Preliminary internal checks indicate the distortion is small relative to the prior-versus-task gap we exploit, but we will report these numbers explicitly and discuss whether a simple re-normalization correction is warranted. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper defines PriorTR as an explicit algorithmic procedure: insert a null token as an instruction-agnostic probe, compute the prior attention map and task-conditioned map in one forward pass, then contrast them to rank tokens by additional usable information. No equations or claims in the abstract reduce any output quantity to a fitted parameter or to the input definition itself. No self-citations are invoked as load-bearing uniqueness theorems, and the method is presented as training-free without renaming known results or smuggling ansatzes. The derivation chain is therefore self-contained and independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Attention scores in transformer blocks can be separated into a model-induced prior component and a task-conditioned component.
invented entities (1)
-
null token
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: Pro- ceedings of the IEEE/CVF international conference on computer vision
Agrawal, H., Desai, K., Wang, Y ., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P.: Nocaps: Novel object captioning at scale. In: Pro- ceedings of the IEEE/CVF international conference on computer vision. pp. 8948–8957 (2019)
2019
-
[2]
Advances in neural infor- mation processing systems35, 23716–23736 (2022)
Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y ., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural infor- mation processing systems35, 23716–23736 (2022)
2022
-
[3]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Alvar, S.R., Singh, G., Akbari, M., Zhang, Y .: Di- vprune: Diversity-based visual token pruning for large multimodal models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9392– 9401 (2025)
2025
-
[4]
arXiv preprint arXiv:2511.21631 (2025)
Bai, S., Cai, Y ., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3- vl technical report. arXiv preprint arXiv:2511.21631 (2025)
Pith/arXiv arXiv 2025
-
[5]
Advances in neural information processing systems33, 1877–1901 (2020)
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)
1901
-
[6]
In: European Conference on Computer Vision
Chen, B., Cai, Y ., Luo, Y ., Zhang, Y ., Chen, Z.: Spec- tral evolution-guided token pruning in multimodal large language models. In: European Conference on Computer Vision. Springer (2026)
2026
-
[7]
In: European Conference on Com- puter Vision
Chen, L., Zhao, H., Liu, T., Bai, S., Lin, J., Zhou, C., Chang, B.: An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision- language models. In: European Conference on Com- puter Vision. pp. 19–35. Springer (2024)
2024
-
[8]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: In- ternvl: Scaling up vision foundation models and align- ing for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
2024
-
[9]
In: W ACV 2020
Chen, Z., Li, J., Luo, Y ., Huang, Z., Yang, Y .: Canzsl: Cycle-consistent adversarial networks for zero-shot learning from natural language. In: W ACV 2020. pp. 874–883 (2020)
2020
-
[10]
IEEE Transactions on Multimedia (2026)
Chen, Z., Luo, Y ., Huang, Z., Li, J., Wang, S., Yu, X.: Distributed zero-shot learning for visual recognition. IEEE Transactions on Multimedia (2026)
2026
-
[11]
ICCV 2021 (2021)
Chen, Z., Luo, Y ., Qiu, R., Wang, S., Huang, Z., Li, J., Zhang, Z.: Semantics disentangling for generalized zero-shot learning. ICCV 2021 (2021)
2021
-
[12]
IEEE Transactions on Multimedia (2022)
Chen, Z., Luo, Y ., Wang, S., Li, J., Huang, Z.: Gsm- flow: Generation shifts mitigating flow for generalized zero-shot learning. IEEE Transactions on Multimedia (2022)
2022
-
[13]
In: ACM MM 2021 (2021)
Chen, Z., Luo, Y ., Wang, S., Qiu, R., Li, J., Huang, Z.: Mitigating generation shifts for generalized zero-shot learning. In: ACM MM 2021 (2021)
2021
-
[14]
In: ACM MM
Chen, Z., Wang, S., Li, J., Huang, Z.: Rethinking gen- erative zero-shot learning: An ensemble learning per- spective for recognising visual patches. In: ACM MM
-
[15]
3413–3421 (2020)
pp. 3413–3421 (2020)
2020
-
[16]
Pattern Recognition p
Chen, Z., Yu, X., Tao, X., Li, Y ., Huang, Z.: Cluster- aware prompt ensemble learning for few-shot vision– language model adaptation. Pattern Recognition p. 112596 (2025)
2025
-
[17]
In: ACM MM 2023 (2023)
Chen, Z., Zhang, P., Li, J., Wang, S., Huang, Z.: Zero- shot learning by harnessing adversarial samples. In: ACM MM 2023 (2023)
2023
-
[18]
In: ICCV 2025 (2025)
Chen, Z., Zhao, Z., Guo, J., Li, J., Huang, Z.: Svip: Semantically contextualized visual patches for zero- shot learning. In: ICCV 2025 (2025)
2025
-
[19]
Pattern Recog- nition p
Chen, Z., Zhao, Z., Luo, Y ., Li, Y ., Tao, X., Huang, Z.: Fastedit: Fast text-guided single-image editing via semantic-aware diffusion fine-tuning. Pattern Recog- nition p. 112583 (2025)
2025
-
[20]
arXiv preprint arXiv:2507.06261 (2025)
Comanici, G., Bieber, E., Schaekermann, M., Pasu- pat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
Pith/arXiv arXiv 2025
-
[21]
In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing
Fan, Y ., Zhao, A., Fu, J., Tong, J., Su, H., Pan, Y ., Zhang, W., Shen, X.: Visipruner: Decoding discontin- uous cross-modal dynamics for efficient multimodal llms. In: Proceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing. pp. 18896–18913 (2025)
2025
-
[22]
In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)
Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. In: The Thirty-ninth Annual Conference on Neural Information Processing Sys- tems Datasets and Benchmarks Track (2025)
2025
-
[23]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grau- man, K., Luo, J., Bigham, J.P.: Vizwiz grand chal- lenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3608–3617 (2018)
2018
-
[24]
arXiv preprint arXiv:2411.176862(3) (2024)
Han, Y ., Liu, X., Ding, P., Wang, D., Chen, H., Yan, Q., Huang, S.: Rethinking token reduction in mllms: Towards a unified paradigm for training-free accelera- tion. arXiv preprint arXiv:2411.176862(3) (2024)
arXiv 2024
-
[25]
arXiv preprint arXiv:2410.08584 (2024)
He, Y ., Chen, F., Liu, J., Shao, W., Zhou, H., Zhang, K., Zhuang, B.: Zipvl: Efficient large vision- language models with dynamic token sparsification. arXiv preprint arXiv:2410.08584 (2024)
arXiv 2024
-
[26]
arXiv preprint arXiv:2509.00419 (2025)
Hu, L., Shang, F., Feng, W., Wan, L.: Lightvlm: Ac- celeraing large multimodal models with pyramid to- ken merging and kv cache compression. arXiv preprint arXiv:2509.00419 (2025)
arXiv 2025
-
[27]
In: European conference on computer vision
Huang, K., Zou, H., Xi, Y ., Wang, B., Xie, Z., Yu, L.: Ivtp: Instruction-guided visual token pruning for large vision-language models. In: European conference on computer vision. pp. 214–230. Springer (2024)
2024
-
[28]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion
Hudson, D.A., Manning, C.D.: Gqa: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recogni- tion. pp. 6700–6709 (2019)
2019
-
[29]
arXiv preprint arXiv:2410.21276 (2024)
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
Pith/arXiv arXiv 2024
-
[30]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Jang, Y ., Song, Y ., Yu, Y ., Kim, Y ., Kim, G.: Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2758– 2766 (2017)
2017
-
[31]
arXiv preprint arXiv:2503.11549 (2025)
Jeddi, A., Baghbanzadeh, N., Dolatabadi, E., Taati, B.: Similarity-aware token pruning: Your vlm but faster. arXiv preprint arXiv:2503.11549 (2025)
arXiv 2025
-
[32]
In: Proceedings of the 29th symposium on operating systems principles
Kwon, W., Li, Z., Zhuang, S., Sheng, Y ., Zheng, L., Yu, C.H., Gonzalez, J., Zhang, H., Stoica, I.: Efficient memory management for large language model serv- ing with pagedattention. In: Proceedings of the 29th symposium on operating systems principles. pp. 611– 626 (2023)
2023
-
[33]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition
Li, B., Ge, Y ., Ge, Y ., Wang, G., Wang, R., Zhang, R., Shan, Y .: Seed-bench: Benchmarking multi- modal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 13299–13308 (2024)
2024
-
[34]
arXiv e-prints pp
Li, D., Yang, Z., Lu, S.: Todre: Visual token prun- ing via diversity and task awareness for efficient large vision-language models. arXiv e-prints pp. arXiv– 2505 (2025)
2025
-
[35]
In: Proceedings of the Computer Vision and Pat- tern Recognition Conference
Li, S., Hu, Y ., Ning, X., Liu, X., Hong, K., Jia, X., Li, X., Yan, Y ., Ran, P., Dai, G., et al.: Mbq: Modality- balanced quantization for large vision-language mod- els. In: Proceedings of the Computer Vision and Pat- tern Recognition Conference. pp. 4167–4177 (2025)
2025
-
[36]
In: ACM MM 2025, pp
Li, X., Zhang, D., Du, Z., Zhu, L., Chen, Z., Li, J.: Pataug: Augmentation of augmentation for test- time adaptation. In: ACM MM 2025, pp. 5080–5089 (2025)
2025
-
[37]
IEEE Transactions on Pattern Analysis and Machine Intelli- gence (2025)
Li, Y ., Zhang, Y ., Wang, C., Zhong, Z., Chen, Y ., Chu, R., Liu, S., Jia, J.: Mini-gemini: Mining the poten- tial of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelli- gence (2025)
2025
-
[38]
In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing
Li, Y ., Du, Y ., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision- language models. In: Proceedings of the 2023 Confer- ence on Empirical Methods in Natural Language Pro- cessing. pp. 292–305 (2023)
2023
-
[39]
arXiv preprint arXiv:2501.14204 (2025)
Liang, X., Guan, C., Lu, J., Chen, H., Wang, H., Hu, H.: Dynamic token reduction during gen- eration for vision language models. arXiv preprint arXiv:2501.14204 (2025)
arXiv 2025
-
[40]
NeurIPS2024 (2024)
Lim, J.S., Chen, Z., Chen, Z., Baktashmotlagh, M., Yu, X., Huang, Z., Luo, Y .: Dipex: Dispersing prompt expansion for class-agnostic object detection. NeurIPS2024 (2024)
2024
-
[41]
In: Proceed- ings of the 2024 conference on empirical methods in natural language processing
Lin, B., Ye, Y ., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual repre- sentation by alignment before projection. In: Proceed- ings of the 2024 conference on empirical methods in natural language processing. pp. 5971–5984 (2024)
2024
-
[42]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, H., Li, C., Li, Y ., Lee, Y .J.: Improved baselines with visual instruction tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
2024
-
[43]
Liu, H., Li, C., Li, Y ., Li, B., Zhang, Y ., Shen, S., Lee, Y .J.: Llava-next: Improved reasoning, ocr, and world knowledge (January 2024),https:// llava-vl.github.io/blog/2024-01-30- llava-next/
2024
-
[44]
Liu, H., Li, C., Wu, Q., Lee, Y .J.: Visual instruction tuning (2023)
2023
-
[45]
arXiv preprint arXiv:2410.07278 (2024)
Liu, Y ., Wu, F., Li, R., Tang, Z., Li, K.: Par: Prompt- aware token reduction method for efficient large multimodal models. arXiv preprint arXiv:2410.07278 (2024)
arXiv 2024
-
[46]
Liu, Y ., Duan, H., Zhang, Y ., Li, B., Zhang, S., Zhao, W., Yuan, Y ., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? In: European conference on computer vision. pp. 216–233. Springer (2024)
2024
-
[47]
Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to ex- plain: Multimodal reasoning via thought chains for science question answering. Advances in Neural In- formation Processing Systems35, 2507–2521 (2022)
2022
-
[48]
In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition
Marino, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Ok-vqa: A visual question answering benchmark re- quiring external knowledge. In: Proceedings of the IEEE/cvf conference on computer vision and pattern recognition. pp. 3195–3204 (2019)
2019
-
[49]
In: Proceedings of the IEEE international conference on computer vision
Plummer, B.A., Wang, L., Cervantes, C.M., Caicedo, J.C., Hockenmaier, J., Lazebnik, S.: Flickr30k enti- ties: Collecting region-to-phrase correspondences for richer image-to-sentence models. In: Proceedings of the IEEE international conference on computer vision. pp. 2641–2649 (2015)
2015
-
[50]
OpenAI blog1(8), 9 (2019)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsuper- vised multitask learners. OpenAI blog1(8), 9 (2019)
2019
-
[51]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Shang, Y ., Cai, M., Xu, B., Lee, Y .J., Yan, Y .: Llava- prumerge: Adaptive token reduction for efficient large multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 22857–22867 (2025)
2025
-
[52]
In: AAAI 2026 (2026)
Shao, Y ., Lin, D., Yan, M., Chen, S., Zeng, F., Liao, M., Ma, A., Yan, Z., Wang, H., Wang, Y ., et al.: Tr-dq: Time-rotation diffusion quantization. In: AAAI 2026 (2026)
2026
-
[53]
In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision
Shao, Z., Wang, M., Yu, Z., Pan, W., Yang, Y ., Wei, T., Zhang, H., Mao, N., Chen, W., Yu, J.: Growing a twig to accelerate large vision-language models. In: Pro- ceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20064–20074 (2025)
2025
-
[54]
arXiv preprint arXiv:2305.17455 (2023)
Shi, D., Tao, C., Rao, A., Yang, Z., Yuan, C., Wang, J.: Crossget: Cross-guided ensemble of tokens for accel- erating vision-language transformers. arXiv preprint arXiv:2305.17455 (2023)
arXiv 2023
-
[55]
Si, G., Yin, H., Li, X., Liao, W., He, T., Peng, P., Zhu, W., et al.: Infoprune: Revisiting visual token pruning from an information-theoretic perspective (2025)
2025
-
[56]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Singh, A., Natarajan, V ., Shah, M., Jiang, Y ., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
2019
-
[57]
In: Proceedings of the 31st International Conference on Computational Linguistics
Song, D., Wang, W., Chen, S., Wang, X., Guan, M.X., Wang, B.: Less is more: A simple yet effective token reduction method for efficient multi-modal llms. In: Proceedings of the 31st International Conference on Computational Linguistics. pp. 7614–7623 (2025)
2025
-
[58]
CVPR 2022 (2022)
Su, H., Li, J., Chen, Z., Zhu, L., Lu, K.: Distinguish- ing unseen from seen for generalized zero-shot learn- ing. CVPR 2022 (2022)
2022
-
[59]
arXiv preprint arXiv:2602.03060 (2026)
Sun, Z., Ma, Y ., Liu, G., Chen, Y ., Tang, X., Hu, Y ., Xu, Y .: Ivc-prune: Revealing the implicit visual coordinates in lvlms for vision token pruning. arXiv preprint arXiv:2602.03060 (2026)
arXiv 2026
-
[60]
In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence
Vasu, P.K.A., Faghri, F., Li, C.L., Koc, C., True, N., Antony, A., Santhanam, G., Gabriel, J., Grasch, P., Tuzel, O., et al.: Fastvlm: Efficient vision encod- ing for vision language models. In: Proceedings of the Computer Vision and Pattern Recognition Confer- ence. pp. 19769–19780 (2025)
2025
-
[61]
Advances in Neural Informa- tion Processing Systems37, 114553–114573 (2024)
Wang, C., Wang, Z., Xu, X., Tang, Y ., Zhou, J., Lu, J.: Q-vlm: Post-training quantization for large vision-language models. Advances in Neural Informa- tion Processing Systems37, 114553–114573 (2024)
2024
-
[62]
arXiv preprint arXiv:2505.19235 (2025)
Wang, Q., Ye, H., Chung, M.Y ., Liu, Y ., Lin, Y ., Kuo, M., Ma, M., Zhang, J., Chen, Y .: Core- matching: A co-adaptive sparse inference framework with token and neuron pruning for comprehensive ac- celeration of vision-language models. arXiv preprint arXiv:2505.19235 (2025)
arXiv 2025
-
[63]
In: IJCAI 2025 (2025)
Wang, T., Guo, J., Li, D., Chen, Z.: On the discrim- ination and consistency for exemplar-free class incre- mental learning. In: IJCAI 2025 (2025)
2025
-
[64]
In: CVPR 2026 Findings (2026)
Wang, W., Guo, J., Cai, Y ., Chen*, Z.: Learning multi- modal prototypes for cross-domain few-shot object detection. In: CVPR 2026 Findings (2026)
2026
-
[65]
arXiv preprint arXiv:2602.17196 (2026)
Wang, Y ., Wu, J., Ni, Z., Yang, C., Liu, Y ., Yang, L., Zhou, Y ., Wen, Y ., He, L.: Entropy- prune: Matrix entropy guided visual token pruning for multimodal large language models. arXiv preprint arXiv:2602.17196 (2026)
arXiv 2026
-
[66]
Wen, Z., Gao, Y ., Li, W., He, C., Zhang, L.: To- ken pruning in multimodal large language models: Are we solving the right problem? In: Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1, 2025. pp. 15537–15549. Association for Computational Lin- guistics (2025)
2025
-
[67]
important tokens
Wen, Z., Gao, Y ., Wang, S., Zhang, J., Zhang, Q., Li, W., He, C., Zhang, L.: Stop looking for “important tokens” in multimodal language models: Duplication matters more. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Process- ing. pp. 9972–9991 (2025)
2025
-
[68]
In: Findings of the Association for Com- putational Linguistics: EMNLP 2024
Wu, Q., Xu, W., Liu, W., Tan, T., Liujianfeng, L., Li, A., Luan, J., Wang, B., Shang, S.: Mobilevlm: A vision-language model for better intra-and inter-ui un- derstanding. In: Findings of the Association for Com- putational Linguistics: EMNLP 2024. pp. 10231– 10251 (2024)
2024
-
[69]
arXiv preprint arXiv:2502.00791 (2025)
Xing, L., Wang, A.J., Yan, R., Shu, X., Tang, J.: Vision-centric token compression in large language model. arXiv preprint arXiv:2502.00791 (2025)
arXiv 2025
-
[70]
In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)
Xing, L., Huang, Q., Dong, X., Lu, J., Zhang, P., Zang, Y ., Cao, Y ., He, C., Wang, J., Wu, F., et al.: Pyramid- drop: Accelerating your large vision-language models via pyramid visual redundancy reduction. In: In Pro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2025)
2025
-
[71]
In: Pro- ceedings of the 25th ACM international conference on Multimedia
Xu, D., Zhao, Z., Xiao, J., Wu, F., Zhang, H., He, X., Zhuang, Y .: Video question answering via gradually refined attention over appearance and motion. In: Pro- ceedings of the 25th ACM international conference on Multimedia. pp. 1645–1653 (2017)
2017
-
[72]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Xu, J., Mei, T., Yao, T., Rui, Y .: Msr-vtt: A large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5288–5296 (2016)
2016
-
[73]
In: International Conference on Learning Representations (2020)
Xu, Y ., Zhao, S., Song, J., Stewart, R., Ermon, S.: A theory of usable information under computational constraints. In: International Conference on Learning Representations (2020)
2020
-
[74]
In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference
Yang, C., Sui, Y ., Xiao, J., Huang, L., Gong, Y ., Li, C., Yan, J., Bai, Y ., Sadayappan, P., Hu, X., et al.: Topv: Compatible token pruning with inference time opti- mization for fast and low-memory multimodal vision language model. In: Proceedings of the Computer Vi- sion and Pattern Recognition Conference. pp. 19803– 19813 (2025)
2025
-
[75]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Yang, C., Dong, X., Zhu, X., Su, W., Wang, J., Tian, H., Chen, Z., Wang, W., Lu, L., Dai, J.: Pvc: Pro- gressive visual token compression for unified image and video processing in large vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24939–24949 (2025)
2025
-
[76]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Yang, S., Chen, Y ., Tian, Z., Wang, C., Li, J., Yu, B., Jia, J.: Visionzip: Longer is better but not nec- essary in vision language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19792–19802 (2025)
2025
-
[77]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Ye, W., Wu, Q., Lin, W., Zhou, Y .: Fit and prune: Fast and training-free visual token pruning for multi- modal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 22128–22136 (2025)
2025
-
[78]
In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition
Ye, X., Gan, Y ., Ge, Y ., Zhang, X.P., Tang, Y .: Atp- llava: Adaptive token pruning for large vision lan- guage models. In: Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition. pp. 24972–24982 (2025)
2025
-
[79]
In: ACM MM 2022 (2022)
You, F., Li, J., Chen, Z., Zhu, L.: Pixel exclu- sion: Uncertainty-aware boundary discovery for ac- tive cross-domain semantic segmentation. In: ACM MM 2022 (2022)
2022
-
[80]
In: ACM MM 2021
You, F., Li, J., Zhu, L., Chen, Z., Huang, Z.: Domain adaptive semantic segmentation without source data. In: ACM MM 2021. pp. 3293–3302 (2021)
2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.