
arxiv: 2605.19218 · v1 · pith:Y3B5GQDPnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

Pith reviewed 2026-05-20 07:39 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · KV cache compression · key channel pruning · structured pruning · PCA rotation · efficient inference · attention kernel · token sparsity

The pith

RotateK aligns token-varying key channel importances via online PCA rotation so that lightweight head-wise masks can prune them accurately while preserving visual tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models suffer severe KV cache pressure because images encode into thousands of tokens. The paper claims that compressing the channel dimension of keys, rather than discarding tokens, lets more visual content survive at a fixed memory budget. It establishes that an online PCA-based rotation can align per-token channel importances into one shared low-dimensional subspace. This alignment turns head-wise pruning masks into an accurate yet hardware-efficient option, supported by a fused attention kernel that runs directly on the resulting sparse keys. Experiments on two VLM backbones show better accuracy and lower decoding latency than earlier channel-pruning techniques, with further gains when combined with token pruning.

Core claim

We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.

What carries the argument

online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace
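
A minimal NumPy sketch of this mechanism, under stated assumptions: the rotation is taken to be the eigenbasis of the uncentered second-moment matrix of one head's Keys, and the head-wise mask keeps the leading `keep` rotated channels. The function names, the synthetic data, and the choice of uncentered PCA are illustrative and are not taken from the paper, whose online procedure and Triton kernel are not reproduced here.

```python
import numpy as np

def pca_rotation(keys: np.ndarray) -> np.ndarray:
    """Orthogonal d x d matrix whose columns are principal directions of
    `keys` (tokens x d), sorted by decreasing energy. Hypothetical helper."""
    # Assumption: uncentered second moment, since attention logits use raw Keys.
    second_moment = keys.T @ keys / keys.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues ascending
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order]

def prune_rotated_keys(keys: np.ndarray, keep: int):
    """Rotate one head's Keys and keep the same leading channels for every token."""
    R = pca_rotation(keys)
    return (keys @ R)[:, :keep], R

rng = np.random.default_rng(0)
n_tokens, d, keep = 512, 64, 16

# Synthetic Keys: a low-dimensional signal hidden in a random basis, plus noise.
hidden_basis = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :keep]
keys = rng.standard_normal((n_tokens, keep)) @ hidden_basis.T
keys += 0.01 * rng.standard_normal((n_tokens, d))
query = rng.standard_normal(d)

pruned_keys, R = prune_rotated_keys(keys, keep)

# A full rotation of Keys and query leaves the attention logits unchanged.
full_logits = keys @ query
rotated_logits = (keys @ R) @ (R.T @ query)

# Decode-time view: rotate the query, then use only the kept channels.
approx_logits = pruned_keys @ (query @ R)[:keep]
rel_err = np.linalg.norm(approx_logits - full_logits) / np.linalg.norm(full_logits)
print(f"relative logit error with {keep}/{d} channels: {rel_err:.4f}")
```

Because R is orthogonal, rotating both Keys and query preserves the logits; only the truncation to `keep` channels introduces error, and that error stays small when the Keys concentrate in a shared subspace.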

If this is right

  • Joint token-channel pruning outperforms token-only baselines at identical KV cache budgets.
  • Decoding runs faster because the fused kernel processes sparse-channel keys directly.
  • More visual tokens remain available without raising memory usage by shifting budget from tokens to channels.
  • Accuracy holds better on fine-grained tasks than with prior unstructured or plain head-wise key pruning.
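
The budget shift in the third bullet can be checked with simple arithmetic. The sketch below assumes that Keys and Values occupy equal memory per token, that only Key channels are pruned, that the ratios quoted in the Figure 1 caption are keep ratios, and that storage for the cached rotation matrices is negligible. These are reading assumptions, not statements from the paper, and `kv_budget` is a hypothetical helper.

```python
def kv_budget(token_keep: float, key_channel_keep: float = 1.0) -> float:
    """Fraction of the dense KV cache retained.

    Assumes K and V are the same size per token and only K channels are pruned;
    rotation-matrix storage is ignored.
    """
    return token_keep * (key_channel_keep + 1.0) / 2.0

# Settings read off the Figure 1 caption (interpreted as keep ratios).
settings = {
    "FastV, token-only 0.20": (0.20, 1.0),
    "FastV 0.30 token + RotateK 0.25 channel": (0.30, 0.25),
    "VisionZip, token-only 0.22": (0.22, 1.0),
    "VisionZip 0.35 token + RotateK 0.25 channel": (0.35, 0.25),
}

for name, (token_keep, channel_keep) in settings.items():
    print(f"{name}: {kv_budget(token_keep, channel_keep):.3f} of dense cache")
```

Under these assumptions the joint settings come to roughly 0.19 and 0.22 of the dense cache while keeping 1.5x to 1.6x as many visual tokens as the token-only settings, which is consistent with the caption's "similar KV cache budgets".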

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same rotation alignment could apply to other attention-heavy models where per-token channel importance also varies.
  • If the shared subspace remains stable at larger model scales, the approach would support longer context windows without proportional cache growth.

Load-bearing premise

An online PCA-based rotation can reliably align token-dependent channel importance into a shared low-dimensional subspace across varying inputs, models, and tasks.

What would settle it

Run head-wise key pruning on fine-grained VLM perception tasks both with and without the online PCA rotation; if the rotated version shows no accuracy improvement or higher degradation at the same sparsity level, the alignment claim does not hold.
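
The shape of that comparison can be illustrated on synthetic data. The sketch below measures Key reconstruction error under one shared mask, with and without a rotation, and assumes uncentered PCA via SVD as the rotation. The helper name and the data are hypothetical, and this is not the benchmark protocol itself. Note that a PCA basis can never do worse than the original basis on reconstruction error, so the decisive evidence has to come from downstream task accuracy, not from this quantity.

```python
import numpy as np

def headwise_prune_error(keys: np.ndarray, keep: int, rotate: bool) -> float:
    """Relative Frobenius error after keeping `keep` channels with one mask
    shared by all tokens, optionally in a PCA-rotated basis."""
    d = keys.shape[1]
    if rotate:
        # Right singular vectors act as the rotation (uncentered PCA, an assumption).
        _, _, vt = np.linalg.svd(keys, full_matrices=False)
        basis = vt.T
    else:
        basis = np.eye(d)
    coords = keys @ basis
    energy = (coords ** 2).sum(axis=0)      # per-channel energy over all tokens
    kept = np.argsort(energy)[-keep:]       # the same channels for every token
    approx = coords[:, kept] @ basis[:, kept].T
    return float(np.linalg.norm(keys - approx) / np.linalg.norm(keys))

rng = np.random.default_rng(0)
n_tokens, d, rank = 256, 32, 8

# Synthetic Keys whose signal lies in a subspace not aligned with the channel axes.
mixing = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :rank]
keys = rng.standard_normal((n_tokens, rank)) @ mixing.T
keys += 0.01 * rng.standard_normal((n_tokens, d))

err_plain = headwise_prune_error(keys, keep=rank, rotate=False)
err_rotated = headwise_prune_error(keys, keep=rank, rotate=True)
print(f"plain head-wise mask: {err_plain:.3f}, rotated head-wise mask: {err_rotated:.3f}")
```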

Figures

Figures reproduced from arXiv: 2605.19218 by Beomseok Kang, Donghwee Son, Dongwon Jo, Jae-Joon Kim, Jiwon Song.

Figure 1: Token Pruning only vs. Token and Channel Pruning. (a) Comparison between token pruning only (red) and joint token-channel pruning with RotateK (green) under similar KV cache budgets. For FastV, we compare 0.20 token sparsity against FastV (0.30× token) + RotateK (0.25× channel). For VisionZip, we compare 0.22 token sparsity against VisionZip (0.35× token) + RotateK (0.25× channel). (b) Unlike token pruning…
Figure 2: Key Channel Pruning Masks. (a) Visualization of the original visual Key states, where few channels exhibit strong outlier patterns across tokens. (b) ThinK applies a shared head-wise mask, retaining the same channels for all tokens. (c) SparK applies token-wise masks that vary across tokens, reflecting token-dependent channel importance. The heterogeneous patterns suggest that informative channels are not …
Figure 3: High-level Idea of RotateK. (a) In the original channel basis, different visual tokens exhibit different low-importance channels, making structured head-wise pruning ineffective. RotateK applies an orthogonal rotation R to align token-dependent importance into a shared channel basis, where most tokens consistently exhibit low activations on the same channels. (b) Visualization of the rotated Key states aft…
Figure 4: Inference Flow of RotateK. (a) Following visual token compression, RotateK applies head-wise rotation-based channel pruning to visual Key states during prefill, caching the compressed KV states together with the corresponding rotation matrices. (b) During decoding, queries are transformed using the cached rotations, while a fused attention kernel (see Appendix C for details) combines full-channel attention…
Figure 5: Accuracy–KV Cache Trade-offs. Accuracy–memory trade-offs on ChartQA using LLaVA-NeXT-8B (orange) and Qwen2.5-VL-7B (green) under varying token and channel sparsity ratios. Compared to token-only pruning (black), RotateK jointly prunes tokens and Key channels, consistently achieving higher accuracy under the same KV cache budget.
Figure 6: Channel Pruning Robustness. Accuracy under increasing channel sparsity ratios for ThinK, SparK, and RotateK integrated with FastV and VisionZip on LLaVA-NeXT-8B and Qwen2.5-VL-7B. While ThinK rapidly degrades at high sparsity ratios, RotateK consistently maintains strong performance even at aggressive channel pruning levels, demonstrating improved robustness of structured Key channel pruning.
Figure 7: Latency and Memory Analysis. Comparison of RotateK, ThinK, and SparK on prefill latency, decoding latency, KV cache memory, GPU peak memory, and generation throughput. (a,c) RotateK introduces negligible prefill overhead across varying batch sizes and sequence lengths. (b,d) Unlike ThinK and SparK, which reconstruct full-channel Keys before attention, RotateK directly performs sparse-channel attention and …
Original abstract

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RotateK, a rotation-based structured Key channel pruning framework for Vision-Language Models to mitigate KV cache pressure during inference. RotateK utilizes an online PCA-based rotation to align token-dependent channel importances into a shared low-dimensional subspace, facilitating accurate pruning with lightweight head-wise masks. A fused Triton attention kernel is introduced to handle sparse-channel Keys efficiently. Experiments on two representative VLM backbones indicate that RotateK outperforms prior Key channel pruning methods in both accuracy and decoding latency, with additional benefits from joint token-channel pruning at matched KV cache budgets.

Significance. Should the online PCA rotation prove effective in creating a stable shared subspace, this work would provide a valuable complementary strategy to token pruning for preserving visual content in VLMs under memory constraints. The structured, hardware-friendly pruning and custom kernel implementation represent practical advances for efficient multimodal model deployment.

major comments (2)
  1. [§3.2] §3.2 (online PCA rotation description): the central assumption that this rotation reliably aligns token-dependent channel importances into one shared low-dimensional subspace across token distributions and modalities lacks quantitative support; the manuscript should report explained-variance ratios post-rotation and subspace stability metrics (e.g., principal-component variation between visual and textual tokens) to validate that head-wise masks incur no substantial information loss.
  2. [§4] §4 (experiments) and Table 1: while consistent outperformance is claimed on two backbones, the absence of ablations isolating the rotation's contribution, error bars across runs, or sensitivity analysis to input modality mix makes it difficult to assess whether the reported accuracy-latency gains are robust or load-bearing for the framework's novelty.
minor comments (2)
  1. [§3] The notation for the rotation matrix and head-wise mask application could be clarified with an explicit equation reference in the method section to improve readability.
  2. Consider expanding the related-work discussion to include more recent KV-cache compression techniques for VLMs.
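
The two diagnostics requested in major comment 1 could be computed along the following lines. This is a sketch under stated assumptions: the "explained variance" here is the uncentered energy ratio captured by the top `rank` directions, subspace stability is measured by principal angles between the visual-token and text-token subspaces, and all names and data are hypothetical and do not come from the manuscript.

```python
import numpy as np

def top_subspace(keys: np.ndarray, rank: int):
    """Return (d x rank orthonormal basis, energy ratio captured by that basis)."""
    _, s, vt = np.linalg.svd(keys, full_matrices=False)
    ratio = float((s[:rank] ** 2).sum() / (s ** 2).sum())
    return vt[:rank].T, ratio

def principal_angles(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between the column spans of orthonormal a and b."""
    sigma = np.linalg.svd(a.T @ b, compute_uv=False)
    return np.arccos(np.clip(sigma, -1.0, 1.0))

rng = np.random.default_rng(0)
d, rank = 32, 8

# Synthetic stand-ins: visual and text Keys that share one subspace, plus noise.
shared = np.linalg.qr(rng.standard_normal((d, d)))[0][:, :rank]
visual_keys = rng.standard_normal((400, rank)) @ shared.T + 0.01 * rng.standard_normal((400, d))
text_keys = rng.standard_normal((100, rank)) @ shared.T + 0.01 * rng.standard_normal((100, d))

visual_basis, evr_visual = top_subspace(visual_keys, rank)
text_basis, evr_text = top_subspace(text_keys, rank)
angles = principal_angles(visual_basis, text_basis)

print(f"energy ratio (visual): {evr_visual:.4f}, (text): {evr_text:.4f}")
print(f"largest principal angle: {np.degrees(angles.max()):.2f} degrees")
```

Small angles would indicate that a rotation fitted on visual tokens also suits textual tokens; large angles would support the referee's concern about modality-dependent subspaces.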

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (online PCA rotation description): the central assumption that this rotation reliably aligns token-dependent channel importances into one shared low-dimensional subspace across token distributions and modalities lacks quantitative support; the manuscript should report explained-variance ratios post-rotation and subspace stability metrics (e.g., principal-component variation between visual and textual tokens) to validate that head-wise masks incur no substantial information loss.

    Authors: We agree that additional quantitative evidence would strengthen the validation of the rotation step. In the revised manuscript we will report explained-variance ratios after the online PCA rotation together with subspace stability metrics, including principal-component variation between visual and textual tokens. These results are intended to test directly whether head-wise masks preserve information with negligible loss across modalities.

    Revision: yes

  2. Referee: [§4] §4 (experiments) and Table 1: while consistent outperformance is claimed on two backbones, the absence of ablations isolating the rotation's contribution, error bars across runs, or sensitivity analysis to input modality mix makes it difficult to assess whether the reported accuracy-latency gains are robust or load-bearing for the framework's novelty.

    Authors: We acknowledge the benefit of further experimental controls. The revised version will add ablations that isolate the rotation's contribution, include error bars computed over multiple runs, and provide sensitivity analysis across different visual-textual modality ratios. These additions will clarify the robustness of the observed gains and the role of the rotation within the overall framework.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: method applies standard PCA to pruning without reducing claims to inputs by construction

full rationale

The paper introduces RotateK as a practical framework that applies an online PCA-based rotation to align token-dependent channel importance for head-wise Key pruning in VLMs, followed by a fused Triton kernel. No equations, derivations, or load-bearing steps are shown that define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on self-citation chains for uniqueness or ansatz. The approach builds on established PCA and structured pruning techniques with external validation through experiments on VLM backbones; the central claim of improved accuracy-latency trade-off is presented as an empirical outcome rather than a self-referential reduction. This is the expected non-finding for an applied systems paper whose contributions are algorithmic and implementation-driven.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the online PCA rotation for subspace alignment, which is introduced as the key innovation without specified free parameters or external validation in the abstract.

axioms (1)
  • Domain assumption: Online PCA-based rotation aligns token-dependent channel importance into a shared low-dimensional subspace.
    This alignment is the core mechanism that enables accurate head-wise pruning while remaining hardware-friendly.
invented entities (1)
  • RotateK framework (no independent evidence)
    Purpose: rotation-based structured Key channel pruning with a fused Triton kernel.
    New method and implementation introduced to address the identified trade-off.

pith-pipeline@v0.9.0 · 5731 in / 1305 out tokens · 50212 ms · 2026-05-20T07:39:13.023387+00:00 · methodology

