pith:EM2A7KL7
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
LLMs can cut KV cache memory by profiling attention heads once and evicting tokens selectively per head type.
arxiv:2310.01801 v4 · 2023-10-03 · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{EM2A7KL7DVBG3VHQBSRAVTUMG5}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
We conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and employing the standard KV cache only for attention heads that broadly attend to all tokens. Moreover, because lightweight attention profiling guides the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates a substantial reduction in GPU memory consumption with negligible loss in generation quality.
The approach rests on the assumption that the attention-head structures identified by a single lightweight profiling pass remain stable and sufficient to guide token eviction across diverse generation tasks and contexts, without materially degrading output quality or requiring any model updates.
FastGen adaptively compresses LLM KV caches via lightweight attention profiling: evicting long-range contexts on local heads, non-special tokens on special-token heads, and retaining full caches on broad-attention heads, yielding substantial memory savings with negligible quality loss.
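The per-head policy selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`keep_indices`, `profile_head`), the three policy labels, the `window` size, and the 0.95 attention-mass `threshold` are all hypothetical choices made for illustration, and a single attention row stands in for a full profiling pass.

```python
from typing import List, Set

def keep_indices(policy: str, seq_len: int, special: Set[int], window: int) -> List[int]:
    """Return the token positions one head would retain in its KV cache."""
    if policy == "local":
        # Keep only the most recent `window` tokens; evict long-range context.
        return list(range(max(0, seq_len - window), seq_len))
    if policy == "special":
        # Keep only special-token positions; discard everything else.
        return sorted(i for i in special if i < seq_len)
    # "full": standard KV cache, nothing evicted.
    return list(range(seq_len))

def profile_head(attn_row: List[float], special: Set[int],
                 window: int, threshold: float = 0.95) -> str:
    """Pick the smallest cache whose kept tokens cover `threshold` of the
    attention mass (assumed criterion); fall back to the full cache."""
    seq_len = len(attn_row)
    candidates = [(p, keep_indices(p, seq_len, special, window))
                  for p in ("special", "local")]
    candidates.sort(key=lambda c: len(c[1]))  # try the cheapest policy first
    for policy, kept in candidates:
        if sum(attn_row[i] for i in kept) >= threshold:
            return policy
    return "full"

# Toy attention rows over six tokens; position 0 is treated as a special token.
special_positions = {0}
local_head = [0.0, 0.0, 0.0, 0.0, 0.1, 0.9]
special_head = [0.97, 0.01, 0.0, 0.0, 0.01, 0.01]
broad_head = [1 / 6] * 6

for name, row in [("local", local_head), ("special", special_head), ("broad", broad_head)]:
    policy = profile_head(row, special_positions, window=2)
    kept = keep_indices(policy, len(row), special_positions, window=2)
    print(name, policy, kept)
```

In this toy setup the profiling decision is made once per head and then reused to decide which positions stay in the cache, which mirrors the claim that no fine-tuning or re-training is involved.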
Receipt and verification
| First computed | 2026-05-17T23:38:14.272361Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
23340fa97f1d426dd4f00ca20ace8c37532352596ea1ea91a591c8ed76947c51
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/EM2A7KL7DVBG3VHQBSRAVTUMG5 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 23340fa97f1d426dd4f00ca20ace8c37532352596ea1ea91a591c8ed76947c51
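The hashing step in the command above can also be written as a small standalone Python function, which makes the canonicalization explicit: keys are sorted, separators are compact, and non-ASCII characters are left unescaped before SHA-256 is applied. The function name `canonical_sha256` and the sample dictionaries are hypothetical; the sketch only demonstrates that key order does not affect the digest and does not reproduce the hash listed for this record.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    canonical = json.dumps(record, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two dictionaries with the same content but different key order.
a = {"source": {"id": "2310.01801", "kind": "arxiv", "version": 4},
     "schema_version": "1.0"}
b = {"schema_version": "1.0",
     "source": {"version": 4, "kind": "arxiv", "id": "2310.01801"}}

# Canonicalization removes the ordering difference, so the digests agree.
print(canonical_sha256(a) == canonical_sha256(b))  # True
```

To check a real record, the parsed `canonical_record` object from the API response would be passed to the function and the result compared with the published canonical hash.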
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "0e193d84667f6bfe1d3dd00a5dcff1a21a7121bcb148bb9dae7ceb420b7365e4",
"cross_cats_sorted": [],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.CL",
"submitted_at": "2023-10-03T05:17:08Z",
"title_canon_sha256": "67dcb3a627ccfd1e7e1ed9f026f773a287848ff13092ed123185a24c13fd96f3"
},
"schema_version": "1.0",
"source": {
"id": "2310.01801",
"kind": "arxiv",
"version": 4
}
}