HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
Pith reviewed 2026-05-10 17:16 UTC · model grok-4.3
The pith
Different attention heads capture distinct visual semantics in multimodal models, enabling targeted pruning of 80% of visual tokens while retaining 96% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HAWK is a training-free visual token pruning method that uses head importance weights to reflect the distinct roles of different attention heads in visual processing, then combines those weights with text-guided attention to score and retain crucial tokens. When applied to models such as Qwen2.5-VL, it prunes 80.2% of visual tokens while keeping 96.0% of original accuracy, reduces end-to-end latency to 74.4% of baseline, and lowers GPU memory across tested models, outperforming prior pruning techniques on mainstream vision-language benchmarks.
What carries the argument
Head importance weights that quantify each attention head's distinct contribution to visual semantics, multiplied with text-guided attention scores to rank visual token significance for pruning.
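The weighting-times-attention scoring described above can be sketched in a few lines. The array layout, the mean over text positions, and the top-k selection below are illustrative assumptions, not the paper's exact formulation:

```python
def score_visual_tokens(attn, head_weights):
    """Combine per-head text-guided attention with head-importance weights.

    attn[h][t][v]: attention of text token t to visual token v in head h.
    head_weights[h]: importance weight of head h.
    Returns one combined significance score per visual token.
    """
    num_visual = len(attn[0][0])
    scores = [0.0] * num_visual
    for h, w in enumerate(head_weights):
        for v in range(num_visual):
            # Mean text-guided attention on this visual token, for this head.
            mean_attn = sum(row[v] for row in attn[h]) / len(attn[h])
            scores[v] += w * mean_attn
    return scores

def prune(scores, keep_ratio=0.2):
    """Sorted indices of the top keep_ratio fraction of visual tokens."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[:k])
```

At keep_ratio=0.2 this retains roughly one in five visual tokens, matching the ~80% pruning rate the paper reports.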
If this is right
- The method applies directly to multiple mainstream multimodal models without modification or fine-tuning.
- Pruning 80% of visual tokens cuts end-to-end latency to roughly three-quarters of the original while preserving nearly all accuracy.
- GPU memory consumption decreases across tested models as fewer tokens are processed.
- Task-relevant tokens are retained while redundant ones are removed, maintaining performance on vision-language benchmarks.
Where Pith is reading between the lines
- Similar head-specialization patterns may appear in other transformer architectures, suggesting the weighting idea could transfer beyond multimodal models.
- Real-time visual reasoning on edge devices becomes more feasible once token counts drop this sharply.
- Head importance scores might shift with task type, opening a path to dynamic, task-adaptive pruning at inference time.
Load-bearing premise
Different attention heads inherently capture distinct visual semantics and play distinct roles, so their importance weights can guide effective token pruning without any training.
What would settle it
Running the same models and benchmarks with head-importance pruning versus uniform or random pruning and finding that accuracy falls more sharply under the baselines than under HAWK.
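Before running full benchmarks, that comparison could be staged cheaply by checking how far each scoring rule's retained token set diverges from a reference selection. This harness is a hypothetical proxy, not an experiment from the paper:

```python
def top_k(scores, keep_ratio):
    """Set of indices for the highest-scoring tokens at keep_ratio."""
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return set(ranked[:k])

def retention_overlap(candidate_scores, reference_scores, keep_ratio=0.2):
    """Fraction of reference-selected tokens the candidate also keeps.

    Identical scorings overlap fully; a random scorer overlaps the
    reference only at roughly the keep_ratio level by chance.
    """
    ref = top_k(reference_scores, keep_ratio)
    cand = top_k(candidate_scores, keep_ratio)
    return len(cand & ref) / len(ref)
```

Benchmark accuracy under each retained set remains the decisive test; the overlap only signals when two policies select materially different tokens.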
Original abstract
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token significance, HAWK effectively retains task-relevant visual tokens while removing redundant ones. The proposed HAWK is entirely training-free and can be seamlessly applied to various MLLMs. Extensive experiments on multiple mainstream vision-language benchmarks demonstrate that HAWK achieves state-of-the-art accuracy. When applied to Qwen2.5-VL, HAWK retains 96.0% of the original accuracy after pruning 80.2% of the visual tokens. Additionally, it reduces end-to-end latency to 74.4% of the original and further decreases GPU memory usage across the tested models. The code is available at https://github.com/peppery77/HAWK.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces HAWK, a training-free visual token pruning technique for multimodal large language models that assigns importance weights to attention heads based on their observed specialization in visual semantics and combines these with text-guided attention scores to rank and drop redundant visual tokens. The central empirical claim is that the approach achieves state-of-the-art accuracy retention while delivering substantial efficiency gains; specifically, on Qwen2.5-VL it preserves 96.0% of baseline accuracy after removing 80.2% of visual tokens, reduces end-to-end latency to 74.4% of the original, and lowers GPU memory usage across tested models. The method is presented as plug-and-play for existing MLLMs without any fine-tuning or additional training.
Significance. If the head-importance mechanism proves robust to calibration-set choice and generalizes beyond the reported benchmarks, the work would offer a practical, training-free route to lower inference cost in vision-language models, which is valuable for real-time and edge deployment. The explicit release of code is a clear strength that supports reproducibility. The quantitative claims (96% retention at >80% pruning) are large enough to be impactful if they survive scrutiny on held-out data and stronger baselines.
major comments (2)
- [Abstract] The headline result (96.0% accuracy retention after 80.2% token pruning on Qwen2.5-VL) is load-bearing for the SOTA claim, yet the abstract does not specify whether head-importance weights are computed per-sample, averaged over a fixed calibration set, or derived from the evaluation distribution itself. This detail is required to assess whether the reported numbers reflect a general architectural property or dataset-specific head specialization.
- [Method] Method section (presumed §3): the core modeling assumption that 'different heads may capture distinct visual semantics and inherently play distinct roles' is used to motivate per-head weighting, but no ablation is referenced that isolates this choice (e.g., HAWK versus an otherwise identical text-guided pruner that ignores head identity). Without such a control, it remains unclear whether the performance edge is attributable to head awareness or to the text-guided attention component alone.
minor comments (2)
- [Abstract] The latency figure (74.4% of original) should clarify whether pruning overhead is included in the end-to-end measurement and on which hardware the numbers were obtained.
- A small table or plot showing head-importance variance across layers or heads on a held-out calibration set would help readers evaluate the empirical basis for the 'distinct roles' observation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of clarity and experimental rigor that we will address in the revision. Below we respond point by point to the major comments.
Point-by-point responses
Referee: [Abstract] The headline result (96.0% accuracy retention after 80.2% token pruning on Qwen2.5-VL) is load-bearing for the SOTA claim, yet the abstract does not specify whether head-importance weights are computed per-sample, averaged over a fixed calibration set, or derived from the evaluation distribution itself. This detail is required to assess whether the reported numbers reflect a general architectural property or dataset-specific head specialization.
Authors: We agree that the abstract should explicitly state how the head-importance weights are obtained. In the HAWK method, these weights are computed once by averaging attention scores over a fixed calibration set of samples drawn from the training distribution; they are not recomputed per input or derived from the evaluation set. This design choice ensures the weights reflect general head specialization rather than test-set leakage. We will revise the abstract to include this description. revision: yes
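A minimal sketch of the calibration step the authors describe (weights computed once over a fixed calibration set, then reused unchanged at inference), assuming attention mass is averaged per head and normalized across heads; the normalization is an illustrative choice:

```python
def head_importance(calib_attn):
    """Estimate per-head importance weights from a fixed calibration set.

    calib_attn[s][h][t][v]: attention in sample s, head h, from text
    position t to visual token v. The result is computed once and
    never touches the evaluation distribution.
    """
    num_heads = len(calib_attn[0])
    mass = [0.0] * num_heads
    for sample in calib_attn:
        for h in range(num_heads):
            # Mean attention mass this head places on visual tokens.
            vals = [a for row in sample[h] for a in row]
            mass[h] += sum(vals) / len(vals)
    total = sum(mass) or 1.0  # guard against an all-zero calibration set
    return [m / total for m in mass]
```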
Referee: [Method] Method section (presumed §3): the core modeling assumption that 'different heads may capture distinct visual semantics and inherently play distinct roles' is used to motivate per-head weighting, but no ablation is referenced that isolates this choice (e.g., HAWK versus an otherwise identical text-guided pruner that ignores head identity). Without such a control, it remains unclear whether the performance edge is attributable to head awareness or to the text-guided attention component alone.
Authors: We acknowledge that an explicit ablation isolating the contribution of head-specific importance weights would strengthen the paper. The current manuscript motivates the design from observed head specialization but does not report a direct comparison against a uniform-head, text-guided-only variant. We will add this ablation study in the revised version, evaluating both variants on the same benchmarks and reporting the accuracy and efficiency differences to quantify the benefit of head awareness. revision: yes
Circularity Check
No significant circularity: the method is an algorithmic procedure with empirical validation.
Full rationale
The paper presents HAWK as a training-free algorithmic procedure that computes per-head importance weights from attention patterns and applies text-guided signals to rank and prune visual tokens. No equations, derivations, or central claims reduce by construction to fitted parameters, self-referential definitions, or load-bearing self-citations. Performance numbers (e.g., 96% accuracy retention at 80.2% pruning) are reported as direct empirical outcomes on standard benchmarks rather than tautological predictions, so the claims rest on external evaluation rather than on the method's own construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: different attention heads capture distinct visual semantics and play distinct roles in visual processing.