FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

· 2025 · cs.LG · arXiv 2511.00868

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

citing papers explorer

Showing 2 of 2 citing papers.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training cs.LG · 2026-04-23 · unverdicted · none · ref 123 · internal anchor
Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference cs.LG · 2026-05-08 · unverdicted · none · ref 43 · internal anchor
Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer