pith. machine review for the scientific record.

arxiv: 2510.00231 · v2 · submitted 2025-09-30 · 💻 cs.LG · cs.AI

Recognition: unknown

The Pitfalls of KV Cache Compression

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
keywords: compression · cache · leakage · multi-instruction · eviction · factors · general · identify
read the original abstract

KV cache compression promises increased throughput and efficiency with negligible loss in performance. While the gains in throughput are indisputable and recent literature has indeed shown minimal degradation on particular benchmarks, in general the consequences of compression in realistic scenarios such as multi-instruction prompting have been insufficiently studied. In this paper, we identify several pitfalls that practitioners should be aware of when deploying KV cache compressed LLMs. We evaluate five KV cache compression methods (StreamingLLM, SnapKV, TOVA, H2O, and K-Norm) on Llama3.1 8B and Qwen2.5 14B under multi-instruction prompting with IFEval. Importantly, we show that certain instructions degrade much more rapidly with compression, effectively causing them to be completely ignored by the LLM. As a practical example, we highlight system prompt leakage as a case study, empirically demonstrating the impact of compression on leakage and general instruction-following. We identify several factors that contribute to system prompt leakage: compression method, instruction order, and KV eviction bias. We then propose simple changes to KV cache eviction policies that can reduce the impact of these factors and improve the overall performance in multi-instruction tasks.
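The eviction policies the abstract discusses can be illustrated with a small sketch. The snippet below is not the paper's implementation; it shows a StreamingLLM-style policy (retain the initial "attention sink" tokens plus a recent sliding window) and a hypothetical variant that additionally pins caller-marked instruction positions so they are never evicted, in the spirit of reducing instruction loss. The function names, parameters, and the pinning idea are illustrative assumptions.

```python
# Minimal sketch (not the paper's method) of a StreamingLLM-style
# KV cache eviction policy: keep the first `n_sink` "attention sink"
# positions plus the most recent `window` positions, evict the rest.

def evict(seq_len: int, n_sink: int = 4, window: int = 8) -> list[int]:
    """Return the token positions to KEEP in the KV cache."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))                   # cache fits; keep all
    sinks = list(range(n_sink))                       # initial sink tokens
    recent = list(range(seq_len - window, seq_len))   # sliding window
    return sinks + recent

# Hypothetical tweak: also pin instruction-token positions supplied by the
# caller so system-prompt instructions are never evicted mid-sequence.
def evict_pinned(seq_len: int, pinned: set[int],
                 n_sink: int = 4, window: int = 8) -> list[int]:
    keep = set(evict(seq_len, n_sink, window))
    keep |= {p for p in pinned if p < seq_len}
    return sorted(keep)
```

With `seq_len=20`, `evict` keeps positions 0-3 and 12-19; `evict_pinned` with `pinned={5}` additionally retains position 5, which plain eviction would drop.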

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...

  2. KV Cache Offloading for Context-Intensive Tasks

    cs.LG 2026-04 unverdicted novelty 6.0

    KV offloading degrades performance on context-intensive tasks due to low-rank key projections and unreliable landmarks, but a simpler alternative strategy restores accuracy across LLM families.

  3. Security Considerations for Multi-agent Systems

    cs.CR 2026-03 unverdicted novelty 6.0

    No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.