Pith · machine review for the scientific record

arxiv: 2604.07812 · v1 · submitted 2026-04-09 · 💻 cs.CV

Recognition: unknown

HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual token pruning · multimodal large language models · attention head importance · training-free inference optimization · vision-language benchmarks · token reduction · inference latency

The pith

Different attention heads capture distinct visual semantics in multimodal models, enabling targeted pruning of 80% of visual tokens while retaining 96% of the original accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention heads in multimodal large language models do not contribute equally to image understanding. Individual heads capture distinct visual semantics and matter to varying degrees, so weighting them by importance and combining those weights with text-guided attention scores lets the system identify and keep only the most relevant visual tokens. The pruning requires no retraining or fine-tuning and can be dropped into existing models. If the approach holds, it directly cuts inference time and memory use, making large vision-language models practical for real-time or resource-constrained hardware: most visual tokens are removed, yet task performance is preserved.

Core claim

HAWK is a training-free visual token pruning method that uses head importance weights to reflect the distinct roles of different attention heads in visual processing, then combines those weights with text-guided attention to score and retain crucial tokens. When applied to models such as Qwen2.5-VL, it prunes 80.2% of visual tokens while keeping 96.0% of original accuracy, reduces end-to-end latency to 74.4% of baseline, and lowers GPU memory across tested models, outperforming prior pruning techniques on mainstream vision-language benchmarks.

What carries the argument

Head importance weights that quantify each attention head's distinct contribution to visual semantics, multiplied with text-guided attention scores to rank visual token significance for pruning.
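
A minimal sketch of what this scoring step could look like, assuming per-head text-to-visual attention maps are available. The function name, tensor shapes, and the mean-over-text-queries aggregation below are illustrative assumptions, not the paper's exact formulation.

    import torch

    def head_aware_token_selection(attn, head_weights, keep_ratio=0.2):
        # attn:         [num_heads, num_text, num_visual] cross-attention from
        #               text tokens (queries) to visual tokens (keys)
        # head_weights: [num_heads] importance weight per attention head,
        #               e.g. estimated offline on a calibration set
        # keep_ratio:   fraction of visual tokens to retain (0.2 ~ prune 80%)
        per_head = attn.mean(dim=1)                              # [num_heads, num_visual]
        scores = (head_weights[:, None] * per_head).sum(dim=0)   # [num_visual]
        num_keep = max(1, int(keep_ratio * scores.numel()))
        keep_idx = torch.topk(scores, num_keep).indices
        return torch.sort(keep_idx).values                       # kept positions, in order

    # toy usage: 16 heads, 8 text tokens, 576 visual tokens
    attn = torch.rand(16, 8, 576)
    head_w = torch.softmax(torch.rand(16), dim=0)
    kept = head_aware_token_selection(attn, head_w, keep_ratio=0.2)   # ~115 tokens kept

The weighted sum over heads is one simple way to let more important heads dominate the ranking; the paper's actual combination rule may differ.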

If this is right

  • The method applies directly to multiple mainstream multimodal models without modification or fine-tuning.
  • Pruning 80% of visual tokens cuts end-to-end latency to roughly three-quarters of the original while preserving nearly all accuracy.
  • GPU memory consumption decreases across tested models as fewer tokens are processed.
  • Task-relevant tokens are retained while redundant ones are removed, maintaining performance on vision-language benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar head-specialization patterns may appear in other transformer architectures, suggesting the weighting idea could transfer beyond multimodal models.
  • Real-time visual reasoning on edge devices becomes more feasible once token counts drop this sharply.
  • Head importance scores might shift with task type, opening a path to dynamic, task-adaptive pruning at inference time.

Load-bearing premise

Different attention heads inherently capture distinct visual semantics and play distinct roles, so their importance weights can guide effective token pruning without any training.

What would settle it

Running the same models and benchmarks with head-importance pruning versus uniform or random pruning and finding that accuracy falls more sharply under the uniform and random baselines than under HAWK.
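
A sketch of the control conditions that comparison implies, under the same assumed attention shapes as the scoring sketch above; these baselines are hypothetical stand-ins, not implementations reported in the paper.

    import torch

    def uniform_head_prune(attn, keep_ratio=0.2):
        # Same text-guided scoring, but every attention head counts equally.
        scores = attn.mean(dim=(0, 1))                 # [num_visual]
        k = max(1, int(keep_ratio * scores.numel()))
        return torch.topk(scores, k).indices

    def random_prune(num_visual, keep_ratio=0.2, seed=0):
        # Head- and text-agnostic baseline: keep a random subset of tokens.
        g = torch.Generator().manual_seed(seed)
        k = max(1, int(keep_ratio * num_visual))
        return torch.randperm(num_visual, generator=g)[:k]

Swapping these selectors in for the head-weighted one on the same benchmarks would isolate how much of the accuracy retention is attributable to head awareness rather than to text guidance alone.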

Figures

Figures reproduced from arXiv: 2604.07812 by Jian Yang, Mengjie Zhang, Qihui Zhu, Shuangwu Chen, Tao Zhang, Xianzhi Yu, Xiaobin Tan, Yang Liu, Yinfei Pan, Yuchen Wang, Zhenhua Dong, Zijian Wen.

Figure 1: Comparison of visual token pruning methods on multiple …
Figure 2: Left. The ablation process of the visual attention head. After ablating the attention heads as described above, the model's visual comprehension ability noticeably declines. Right. Ablation results of visual attention heads. Different heads exhibit varying impacts on visual tasks, and their importance shows consistent trends across multiple datasets.
Figure 3: Overview of HAWK. We first employ a text-guided attention mechanism to assess the relevance of each visual embedding …
Figure 4: Ablation study results under different pruning ratios.
Figure 5: Visual Head Ablation Study. Comparison of results across different benchmarks relative to the Base model. The red dashed line indicates the baseline performance (1.0). As discussed in the main text, masking the visibility of visual tokens for specific attention heads results in a consistent pattern of performance variation across diverse tasks. …
Figure 6: Cross-model analysis of visual head ablation.
Figure 7: Qualitative Results: Heatmaps and Response Comparison. We present attention heatmaps to visualize the visual tokens retained by our HAWK, where redder regions indicate higher attention scores, alongside a qualitative comparison of the generated responses against other baseline methods. (Continued on the next page.)
Figure 7 (continued).
original abstract

In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HAWK, a training-free visual token pruning technique for multimodal large language models that assigns importance weights to attention heads based on their observed specialization in visual semantics and combines these with text-guided attention scores to rank and drop redundant visual tokens. The central empirical claim is that the approach achieves state-of-the-art accuracy retention while delivering substantial efficiency gains; specifically, on Qwen2.5-VL it preserves 96.0% of baseline accuracy after removing 80.2% of visual tokens, reduces end-to-end latency to 74.4% of the original, and lowers GPU memory usage across tested models. The method is presented as plug-and-play for existing MLLMs without any fine-tuning or additional training.

Significance. If the head-importance mechanism proves robust to calibration-set choice and generalizes beyond the reported benchmarks, the work would offer a practical, training-free route to lower inference cost in vision-language models, which is valuable for real-time and edge deployment. The explicit release of code is a clear strength that supports reproducibility. The quantitative claims (96% retention at >80% pruning) are large enough to be impactful if they survive scrutiny on held-out data and stronger baselines.

major comments (2)
  1. [Abstract] The headline result (96.0% accuracy retention after 80.2% token pruning on Qwen2.5-VL) is load-bearing for the SOTA claim, yet the abstract does not specify whether head-importance weights are computed per-sample, averaged over a fixed calibration set, or derived from the evaluation distribution itself. This detail is required to assess whether the reported numbers reflect a general architectural property or dataset-specific head specialization.
  2. [Method] Method section (presumed §3): the core modeling assumption that 'different heads may capture distinct visual semantics and inherently play distinct roles' is used to motivate per-head weighting, but no ablation is referenced that isolates this choice (e.g., HAWK versus an otherwise identical text-guided pruner that ignores head identity). Without such a control, it remains unclear whether the performance edge is attributable to head awareness or to the text-guided attention component alone.
minor comments (2)
  1. [Abstract] The latency figure (74.4% of original) should clarify whether pruning overhead is included in the end-to-end measurement and on which hardware the numbers were obtained.
  2. A small table or plot showing head-importance variance across layers or heads on a held-out calibration set would help readers evaluate the empirical basis for the 'distinct roles' observation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and experimental rigor that we will address in the revision. Below we respond point by point to the major comments.

point-by-point responses
  1. Referee: [Abstract] The headline result (96.0% accuracy retention after 80.2% token pruning on Qwen2.5-VL) is load-bearing for the SOTA claim, yet the abstract does not specify whether head-importance weights are computed per-sample, averaged over a fixed calibration set, or derived from the evaluation distribution itself. This detail is required to assess whether the reported numbers reflect a general architectural property or dataset-specific head specialization.

    Authors: We agree that the abstract should explicitly state how the head-importance weights are obtained. In the HAWK method, these weights are computed once by averaging attention scores over a fixed calibration set of samples drawn from the training distribution; they are not recomputed per input or derived from the evaluation set. This design choice ensures the weights reflect general head specialization rather than test-set leakage. We will revise the abstract to include this description; a schematic sketch of this calibration step appears after these responses. revision: yes

  2. Referee: [Method] Method section (presumed §3): the core modeling assumption that 'different heads may capture distinct visual semantics and inherently play distinct roles' is used to motivate per-head weighting, but no ablation is referenced that isolates this choice (e.g., HAWK versus an otherwise identical text-guided pruner that ignores head identity). Without such a control, it remains unclear whether the performance edge is attributable to head awareness or to the text-guided attention component alone.

    Authors: We acknowledge that an explicit ablation isolating the contribution of head-specific importance weights would strengthen the paper. The current manuscript motivates the design from observed head specialization but does not report a direct comparison against a uniform-head, text-guided-only variant. We will add this ablation study in the revised version, evaluating both variants on the same benchmarks and reporting the accuracy and efficiency differences to quantify the benefit of head awareness. revision: yes
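
A compact sketch of the calibration step as the first response describes it (weights computed once over a fixed calibration set); the aggregation statistic below is chosen for illustration and may differ from the authors' actual choice.

    import torch

    def estimate_head_importance(calibration_attn):
        # calibration_attn: iterable of [num_heads, num_text, num_visual] tensors,
        # one per calibration sample drawn from the training distribution.
        total, n = None, 0
        for attn in calibration_attn:
            per_head = attn.mean(dim=(1, 2))           # each head's average attention mass
            total = per_head if total is None else total + per_head
            n += 1
        weights = total / n                            # computed once, reused at inference
        return weights / weights.sum()                 # normalized head-importance weights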

Circularity Check

0 steps flagged

No significant circularity; method is an algorithmic procedure with empirical validation

full rationale

The paper presents HAWK as a training-free algorithmic procedure that computes per-head importance weights from attention patterns and applies text-guided signals to rank and prune visual tokens. No equations, derivations, or central claims reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. Performance numbers (e.g., 96% accuracy retention at 80.2% pruning) are reported as direct empirical outcomes on standard benchmarks rather than tautological predictions. The approach is validated against external benchmarks rather than by reference to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that attention heads differ in visual semantics; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Different attention heads capture distinct visual semantics and play distinct roles in visual processing
    This is the key observation stated in the abstract that motivates the head importance weighting.

pith-pipeline@v0.9.0 · 5591 in / 1177 out tokens · 53739 ms · 2026-05-10T17:16:37.369399+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 19 canonical work pages · 13 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    Divprune: Diversity-based visual token pruning for large multimodal models

    Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, and Yong Zhang. Divprune: Diversity-based visual token pruning for large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9392–9401, 2025. 1, 3, 5

  3. [3]

    Qwen2.5-vl technical report, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025.

  4. [4]

    Matryoshka multimodal models,

    Mu Cai, Jianwei Yang, Jianfeng Gao, and Yong Jae Lee. Matryoshka multimodal models. arXiv preprint arXiv:2405.17430, 2024. 2, 3

  5. [5]

    InternLM2 Technical Report

    Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297, 2024. 2

  6. [6]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. In European Conference on Computer Vision, pages 19–35. Springer, 2024. 2, 3, 5

  7. [7]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023. 1

  8. [8]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 11198–11201, 2024. 5

  9. [9]

    Feather the throttle: Revisiting visual token pruning for vision-language model acceleration, 2025

    Mark Endo, Xiaohan Wang, and Serena Yeung-Levy. Feather the throttle: Revisiting visual token pruning for vision-language model acceleration, 2025. 4

  10. [10]

    MME: A comprehensive evaluation benchmark for multimodal large language models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji, Caifeng Shan, and Ran He. MME: A comprehensive evaluation benchmark for multimodal large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track...

  11. [11]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 5, 2

  12. [12]

    Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

    Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, et al. Hallusionbench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,...

  13. [13]

    Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images

    Zonghao Guo, Ruyi Xu, Yuan Yao, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, and Gao Huang. Llava-uhd: an lmm perceiving any aspect ratio and high-resolution images. In European Conference on Computer Vision, pages 390–406. Springer, 2024. 1

  14. [14]

    Worldsense: Evaluating real-world omni-modal understanding for multimodal llms

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omni-modal understanding for multimodal llms. 2025. 5, 2

  15. [15]

    Similarity-aware token pruning: Your vlm but faster.arXiv preprint arXiv:2503.11549, 2025

    Ahmadreza Jeddi, Negin Baghbanzadeh, Elham Dolatabadi, and Babak Taati. Similarity-aware token pruning: Your vlm but faster. arXiv preprint arXiv:2503.11549, 2025. 1, 3

  16. [16]

    From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023

    Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models. arXiv preprint arXiv:2310.08825, 2023. 2

  17. [17]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer, 2016. 5, 2

  18. [18]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 1, 3

  19. [19]

    Tokenpacker: Efficient visual projector for multimodal llm.International Journal of Computer Vision, pages 1–19, 2025

    Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, and Lei Zhang. Tokenpacker: Efficient visual projector for multimodal llm. International Journal of Computer Vision, pages 1–19, 2025. 2, 3

  20. [20]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023. 5, 1, 2

  21. [21]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023. 1, 3

  22. [22]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 1, 3

  23. [23]

    Llava-next: Improved reasoning, ocr, and world knowledge, 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. 1, 3

  24. [24]

    Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024. 5, 1, 2

  25. [25]

    Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.

  26. [26]

    The coincidence approach to stochastic point processes.Advances in Applied Probability, 7(1):83–122,

    Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.

  27. [27]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. 5, 1, 2

  28. [28]

    Ocr-vqa: Visual question answering by reading text in images

    Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019. 5, 1, 2

  29. [29]

    Llava-prumerge: Adaptive token reduction for efficient large multimodal models

    Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025. 2

  30. [30]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 5, 2

  31. [31]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024. 1

  32. [32]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1

  33. [33]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017. 1

  34. [34]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 1, 2, 3

  35. [36]

    Stop looking for important tokens in multimodal language models: Duplication matters more.arXiv preprint arXiv:2502.11494, 2025

    Zichen Wen, Yifeng Gao, et al. Stop looking for important tokens in multimodal language models: Duplication matters more. arXiv preprint arXiv:2502.11494, 2025. 1

  36. [37]

    Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024

    xAI. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. Introduces the RealWorldQA benchmark. Accessed: 2026-03-13. 5, 1, 2

  37. [38]

    PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

    Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. Pyramiddrop: Accelerating your large vision-language models via pyramid visual redundancy reduction. arXiv preprint arXiv:2410.17247, 2024. 2

  38. [39]

    VisionZip: Longer is Better but Not Necessary in Vision Language Models

    Senqiao Yang, Yukang Chen, et al. Visionzip: Longer is better but not necessary in vision language models. arXiv preprint arXiv:2412.04467, 2024. 1

  39. [40]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1

  40. [41]

    Fit and prune: fast and training-free visual token pruning for multimodal large language models

    Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and prune: fast and training-free visual token pruning for multimodal large language models. AAAI Press, 2025. 3

  41. [42]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023. 1

  42. [43]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024. 3

  43. [44]

    Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms

    Qizhe Zhang, Mengzhen Liu, Lichen Li, Ming Lu, Yuan Zhang, Junwen Pan, Qi She, and Shanghang Zhang. Beyond attention or similarity: Maximizing conditional diversity for token pruning in mllms. arXiv preprint arXiv:2506.10967, 2025.

  44. [45]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713, 2024. 1

  45. [46]

    SparseVLM: Visual token sparsification for efficient vision-language model inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, and Shanghang Zhang. SparseVLM: Visual token sparsification for efficient vision-language model inference. In Forty-second International Conference on Machine Learning, 2025. 3

  46. [47]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 1, 2, 3, 5, 6
