A Simple and Effective L_2 Norm-Based Strategy for KV Cache Compression
4 Pith papers cite this work.
years
2026: 4
representative citing papers
QCFuse achieves full-prefill quality in RAG with 1.7x average prefill speedup over full prefill and 1.5x over ProphetKV via compressed query-aware cache fusion.
VaSE improves KV cache eviction accuracy for reasoning models by over 4% versus prior eviction methods at 4x compression through value-magnitude protection and stochastic diversity.
TGV-KV uses text-vision budgeting, weighted ranking, and prioritised retention to evict KV cache in VLMs while retaining 99.2% accuracy at 5% budget on VizWiz-VQA.
citing papers explorer
- EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV enables exact sparse KV-cache decoding for entmax attention via support-aware page selection and a Gaussian threshold estimator, matching full attention quality at a fraction of the cache size with up to 5.43x speedup.
- Value-Aware Stochastic KV Cache Eviction for Reasoning Models
VaSE improves KV cache eviction accuracy for reasoning models by over 4% versus prior eviction methods at 4x compression through value-magnitude protection and stochastic diversity.