LeanK: Learnable K Cache Channel Pruning for Efficient Decoding

Chengruidong Zhang; Huiqiang Jiang; Jianyong Wang; Lili Qiu; Yike Zhang; Yuqing Yang; Zhiyuan He

arxiv: 2508.02215 · v1 · pith:G7KTNBN3new · submitted 2025-08-04 · 💻 cs.LG · cs.AI · cs.CL

Yike Zhang, Zhiyuan He, Huiqiang Jiang, Chengruidong Zhang, Yuqing Yang, Jianyong Wang, Lili Qiu

classification 💻 cs.LG cs.AI cs.CL

keywords cache leank decoding attention channel channels long-context memory

Large language models (LLMs) enable long-context tasks but face efficiency challenges due to the growing key-value (KV) cache. We propose LeanK, a learning-based method that prunes unimportant key (K) cache channels by leveraging static channel sparsity. With a novel two-stage training process, LeanK learns a channel-wise static mask that satisfies a specified sparsity ratio and hardware alignment requirements. LeanK reduces GPU memory and accelerates decoding without sacrificing accuracy. Experiments demonstrate up to 70% K cache and 16%-18% V cache memory reduction. A custom decoding kernel enables a 1.3x speedup for attention computation. We also provide insights into model channels and attention heads during long-context inference by analyzing the learned importance distribution. Our code is available at https://aka.ms/LeanK.
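
To make the idea of a static channel mask concrete, here is a minimal single-head sketch in plain Python. It is not the authors' training procedure or decoding kernel: the mask `keep_idx`, the helper names, and the toy vectors are all hypothetical, and the mask is hand-picked rather than learned. It only illustrates that if the K cache stores a fixed subset of channels, the query can be sliced with the same indices at decode time, so the attention scores are computed over fewer channels.

```python
import math

def prune_channels(vec, keep_idx):
    """Keep only the channels listed in keep_idx (a static index list)."""
    return [vec[i] for i in keep_idx]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(q, k_cache, v_cache, keep_idx, head_dim):
    """One decoding step of single-head attention.

    k_cache holds keys that already contain only the kept channels;
    the query is sliced with the same indices before the dot product.
    """
    q_kept = prune_channels(q, keep_idx)
    scale = 1.0 / math.sqrt(head_dim)
    scores = [scale * sum(a * b for a, b in zip(q_kept, k)) for k in k_cache]
    weights = softmax(scores)
    out_dim = len(v_cache[0])
    return [sum(w * v[j] for w, v in zip(weights, v_cache)) for j in range(out_dim)]

# Toy data (assumption): channels 1 and 3 of every key are exactly zero,
# so dropping them leaves the attention scores unchanged.
head_dim = 4
keep_idx = [0, 2]  # hypothetical static mask, not a learned one
keys = [[1.0, 0.0, 2.0, 0.0], [0.5, 0.0, -1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
q = [1.0, 3.0, 0.5, -2.0]

k_cache_pruned = [prune_channels(k, keep_idx) for k in keys]
out_pruned = decode_step(q, k_cache_pruned, values, keep_idx, head_dim)
out_full = decode_step(q, keys, values, list(range(head_dim)), head_dim)

# Fraction of K cache entries saved by this toy mask.
k_cache_saving = 1 - len(keep_idx) / head_dim
```

In this toy case the pruned and full outputs agree because the dropped channels contribute nothing to the scores. In a real model the dropped channels are only approximately unimportant, which is why the mask has to be learned and the accuracy checked empirically.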

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...