CoDec: Prefix-Shared Decoding Kernel for LLMs

Chao Fang; Chengying Huan; Chen Tian; Guihai Chen; Kun Yang; Mo Zhou; Rong Gu; Rui Ning; Shaobo Ma; Sheng Zhong

arxiv: 2505.17694 · v2 · pith:QYRSOM4Gnew · submitted 2025-05-23 · 💻 cs.LG

Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian

keywords: attention, access, codec, computation, kernel, memory, stage, decode

Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix. Meanwhile, attention computation in the decode stage, which becomes a critical bottleneck with increasing context lengths, is a memory-intensive process requiring heavy memory access on the key-value (KV) cache of the prefixes. Therefore, in this paper, we explore the potential of prefix-sharing in the attention computation of the decode stage. However, the tree structure of the prefix-sharing mechanism presents significant challenges for attention computation: it must efficiently process shared KV cache access patterns while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose CoDec, a dedicated attention kernel that combines the memory access of shared prefixes in the decode stage. CoDec delivers two key innovations: a novel shared-prefix attention kernel that optimizes the memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that CoDec achieves an average $1.9\times$ speedup and $120.9\times$ memory access reduction compared to the state-of-the-art FlashDecoding kernel for attention computation in the decode stage, and a $3.8\times$ improvement in end-to-end time per output token compared to vLLM.
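The idea of combining shared-prefix work in decode attention can be illustrated with a small sketch. It is not the CoDec kernel, and the abstract does not describe the kernel's internals. It only shows the standard log-sum-exp merge that lets attention over a shared prefix and over a request's own suffix be computed separately and then combined exactly. All function names here (`partial_attention`, `merge`, `shared_prefix_decode`) are hypothetical, and the plain-Python loops stand in for what a GPU kernel would do with the prefix KV loaded once for all requests.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV segment.

    Returns the segment-normalized output and the log-sum-exp of the
    scaled scores, which is what a later merge step needs.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge(o1, lse1, o2, lse2):
    """Combine two partial results into attention over both segments."""
    m = max(lse1, lse2)
    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(o1, o2)]

def shared_prefix_decode(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    """One decode step for several requests that share a prefix.

    Assumption for illustration: every request attends to the same
    prefix KV plus its own suffix KV. The prefix KV is referenced once
    for the whole batch rather than once per request.
    """
    outputs = []
    for q, sk, sv in zip(queries, suffix_ks, suffix_vs):
        o_pre, lse_pre = partial_attention(q, prefix_k, prefix_v)
        o_suf, lse_suf = partial_attention(q, sk, sv)
        outputs.append(merge(o_pre, lse_pre, o_suf, lse_suf))
    return outputs

# Two requests sharing a three-token prefix, each with its own suffix.
prefix_k = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
prefix_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
suffix_ks = [[[0.2, 0.2]], [[-0.3, 0.1], [0.4, 0.0]]]
suffix_vs = [[[2.0, 1.0]], [[0.0, 2.0], [1.0, 1.0]]]
queries = [[0.5, -0.5], [1.0, 0.25]]
outputs = shared_prefix_decode(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
```

Because the merge is exact, each merged output matches ordinary attention over the concatenated prefix and suffix. The saving in a real kernel would come from reading the shared prefix KV from memory once per batch instead of once per request.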

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching
cs.LG 2026-06 unverdicted novelty 5.0

SCD replaces raw KV cache transmission with compact semantic codes via reuse and patching to achieve up to 2.65x TTFT speedup while staying within 5% F1 of oracle quality.