CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation
Pith reviewed 2026-05-21 13:27 UTC · model grok-4.3
The pith
By compiling retention signals offline from a calibration corpus, CompilerKV turns noisy per-prompt estimates into fast lookups that improve compressed KV performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CompilerKV compiles corrective tables for per-head reliability and prompt-level compression sensitivity offline from a calibration corpus. This reduces online correction after the standard observation-window scan to O(1) lookups plus a budget clamp. The resulting tables behave as portable architectural priors whose rankings transfer across disjoint corpora on four backbones with mean Spearman correlation 0.90, while direct model-to-model transfer costs only 0.4 to 0.8 LongBench points on average. At a 512-token budget the method attains compressed-SOTA on all four backbones and improves over the strongest prefill-only baseline by 1.67 points on average.
What carries the argument
Offline-compiled retention tables that encode per-head reliability and compression sensitivity for O(1) online lookups after an observation window.
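The online path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `select_tokens`, the per-head weighting rule, the `"low"`/`"high"` sensitivity buckets, and the way the budget is clamped are all hypothetical, since this page does not give the table layout or the clamp rule.

```python
def select_tokens(window_scores, head_table, sens_table, prompt_bucket,
                  base_budget, max_budget):
    """Pick which prefill tokens to keep, using precompiled tables.

    window_scores[h][t]: score token t received from the observation
    window in head h (assumed to come from the standard scan).
    head_table[h]: offline-compiled reliability weight for head h (hypothetical).
    sens_table[bucket]: offline-compiled budget multiplier (hypothetical).
    """
    n_heads = len(window_scores)
    n_tokens = len(window_scores[0])

    # One table lookup per head: weight each head's scores by its reliability.
    combined = [
        sum(head_table[h] * window_scores[h][t] for h in range(n_heads))
        for t in range(n_tokens)
    ]

    # One table lookup per prompt: scale the budget, then clamp it.
    budget = int(base_budget * sens_table[prompt_bucket])
    budget = max(1, min(budget, max_budget, n_tokens))

    # Keep the top-scoring tokens, returned in positional order.
    top = sorted(range(n_tokens), key=lambda t: combined[t], reverse=True)[:budget]
    return sorted(top)


# Made-up inputs: 2 heads, 6 prompt tokens.
window_scores = [
    [0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
    [0.4, 0.0, 0.3, 0.1, 0.1, 0.1],
]
head_table = [1.0, 0.25]
sens_table = {"low": 1.0, "high": 2.0}

kept_high = select_tokens(window_scores, head_table, sens_table, "high",
                          base_budget=2, max_budget=3)
kept_low = select_tokens(window_scores, head_table, sens_table, "low",
                         base_budget=2, max_budget=3)
```

In this toy setup the "high" bucket asks for 4 tokens and is clamped to 3, while the "low" bucket keeps the base budget of 2. Nothing is estimated from the prompt beyond the observation-window scores themselves.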
If this is right
- Retention rankings from the compiled tables transfer across disjoint corpora with mean Spearman correlation of 0.90.
- Tables transfer directly from one model to another at a cost of only 0.4 to 0.8 LongBench points on average.
- The performance gap widens under pressure: at a fixed 512/32k cache ratio (1.56 percent of prefill KV states retained), CompilerKV remains the strongest compressed method through 128k-token RULER contexts.
- Batch-16 serving stays feasible at 32k inputs where the full KV cache triggers out-of-memory errors.
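The 1.56 percent figure follows from the 512/32k ratio if "32k" is read as 32 × 1024 = 32768 tokens, which is an assumption here. The sketch below also shows what a fixed ratio would imply at 128k, assuming the budget scales linearly with context length; the helper names are illustrative.

```python
def retained_fraction(budget_tokens, context_tokens):
    """Fraction of prefill KV entries kept after compression."""
    return budget_tokens / context_tokens


def budget_at_fixed_ratio(context_tokens, ref_budget=512, ref_context=32 * 1024):
    """Token budget at another context length, holding the ratio fixed."""
    return context_tokens * ref_budget // ref_context


frac = retained_fraction(512, 32 * 1024)          # 0.015625, i.e. about 1.56%
budget_128k = budget_at_fixed_ratio(128 * 1024)   # 2048 tokens
```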
Where Pith is reading between the lines
- The portability of the tables suggests that attention head behaviors contain stable, model-intrinsic regularities that can be pre-extracted once and reused across many users.
- The same offline compilation idea could be tested on eviction policies that continue to act during the decoding phase rather than only at prefill.
- Model releases might include pre-compiled tables as optional artifacts to simplify high-performance inference for downstream users.
- Extending the calibration corpus to include more diverse task types would test how far the cross-prompt regularity assumption generalizes.
Load-bearing premise
Corrective signals such as per-head reliability and prompt-level compression sensitivity exhibit far higher cross-prompt regularity than within-prompt signal-to-noise, allowing effective offline compilation from a calibration corpus.
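One way to probe this premise is a naive variance decomposition for a single head: compare how much repeated estimates scatter inside one prompt with how much the prompt means differ from each other. The sketch below assumes that several noisy estimates per prompt are available (for example from different observation windows); that setup, the function name, and the numbers are hypothetical and not taken from the paper.

```python
from statistics import mean, pvariance


def variance_components(estimates):
    """Split the spread of one head's reliability estimates.

    estimates[p] is a list of repeated noisy estimates on prompt p.
    Returns (within, between):
      within  = average variance of estimates inside a prompt
      between = variance of the per-prompt means
    Note: 'between' is a naive estimate that still contains some
    within-prompt noise, so it overstates true prompt-to-prompt variation.
    """
    prompt_means = [mean(e) for e in estimates]
    within = mean(pvariance(e) for e in estimates)
    between = pvariance(prompt_means)
    return within, between


# Made-up data: 3 prompts, 3 repeated estimates each.
estimates = [
    [0.70, 0.90, 0.80],
    [0.60, 0.80, 1.00],
    [0.75, 0.85, 0.95],
]
within, between = variance_components(estimates)
```

If `within` is much larger than `between`, as in this toy data, a value pooled over many prompts is a better estimate than a single prompt's own reading, which is the situation the premise describes.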
What would settle it
If retention decisions made from online estimates on a fresh prompt consistently outperform the pre-compiled tables on held-out test prompts from the same distribution, the claimed benefit of offline compilation would be falsified.
Original abstract
Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce \textsc{CompilerKV}, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to $O(1)$ lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman $\bar\rho{=}0.90$), and direct model-to-model table transfer costs only $0.4$--$0.8$ LongBench points on average. At a 512-token budget, \textsc{CompilerKV} attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by $+1.67$ points on average (task-bootstrap 95\% CI $[+1.08,+2.37]$). Pressure regimes amplify the gap: under a fixed $512/32k$ cache ratio, CompilerKV remains the strongest compressed method through 128k RULER ($\sim\!73$ vs.\ FullKV $\sim\!79$, SnapKV $\sim\!38$); on 32k NIAH it reaches $0.89$ vs.\ SnapKV $0.42$; and at 32k input, retaining only $1.56\%$ of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CompilerKV, a prefill-only KV compression policy for LLMs that compiles retention tables offline from a calibration corpus instead of estimating per-head reliability and prompt-level compression sensitivity online from a single prompt. It argues these signals show higher cross-prompt regularity than within-prompt noise, enabling O(1) lookups at inference. Experiments on four backbones report that the tables transfer across disjoint corpora (mean Spearman ρ̄=0.90) and models (0.4–0.8 point loss), and at a 512-token budget CompilerKV achieves compressed-SOTA with +1.67 average gain (task-bootstrap 95% CI [+1.08, +2.37]) over the strongest baseline, with further gains under long-context pressure regimes.
Significance. If the cross-prompt regularity premise holds and the tables function as portable architectural priors, the approach could simplify KV cache management by eliminating per-prompt online correction, improving efficiency in memory-constrained serving. The reported transfer results and quantitative gains with confidence intervals on multiple backbones provide concrete evidence of practical utility if the experimental controls are robust.
major comments (2)
- [Abstract] Abstract: The claim that corrective signals exhibit 'far higher cross-prompt regularity than within-prompt signal-to-noise' is load-bearing for preferring offline compilation, yet the reported mean Spearman ρ̄=0.90 demonstrates only stability of relative rankings across corpora; it does not compare absolute reliability estimates against those obtainable from online single-prompt observation windows, leaving open whether prompt-specific deviations are negligible enough to justify freezing the tables.
- [Results] Experimental section (implied by results on calibration and transfer): The performance advantage at 512-token budget and under 512/32k cache ratios rests on the calibration corpus being representative and disjoint; without explicit quantification of within-prompt vs. between-prompt variance on the tested backbones or tasks, the justification for O(1) lookup over stronger online estimators remains incompletely supported.
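The first major comment turns on a property of Spearman correlation: it depends only on rank order, so two tables can agree perfectly in ranking while disagreeing in absolute value. A small tie-free sketch with made-up per-head reliabilities illustrates this; the helper functions and data are illustrative only.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    """Spearman rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical per-head reliabilities compiled from two corpora:
# same ordering of heads, clearly different magnitudes.
corpus_a = [0.91, 0.40, 0.75, 0.10, 0.55]
corpus_b = [0.50, 0.22, 0.41, 0.05, 0.30]

rho = spearman(corpus_a, corpus_b)
max_abs_gap = max(abs(a - b) for a, b in zip(corpus_a, corpus_b))
```

Here `rho` is 1.0 even though the largest absolute gap exceeds 0.4, which is why a high rank correlation alone does not bound prompt-specific deviations in absolute reliability.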
minor comments (1)
- [Abstract] The abstract mentions 'task-bootstrap 95% CI' but does not specify the number of tasks or bootstrap procedure details, which would aid reproducibility assessment.
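For context, one common reading of "task-bootstrap" is a percentile bootstrap that resamples tasks with replacement and recomputes the mean gain each time. The sketch below follows that reading as an assumption; the paper's actual procedure, task count, and per-task gains are not given on this page, so the numbers are invented.

```python
import random


def task_bootstrap_ci(per_task_gains, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean gain, resampling over tasks."""
    rng = random.Random(seed)
    n = len(per_task_gains)
    # Resample n tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(per_task_gains, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Invented per-task score gains over a baseline (not the paper's data).
gains = [2.1, 0.4, 1.9, 3.0, 0.8, 1.5, 2.4, 1.2]
sample_mean = sum(gains) / len(gains)
lo, hi = task_bootstrap_ci(gains)
```

Reporting the number of tasks, the number of resamples, and whether the interval is percentile-based would be enough for a reader to reproduce an interval of this kind.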
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of the statistical justification for offline compilation, and we respond to each point below while committing to targeted revisions that strengthen the manuscript without altering its core claims.
- Referee: [Abstract] Abstract: The claim that corrective signals exhibit 'far higher cross-prompt regularity than within-prompt signal-to-noise' is load-bearing for preferring offline compilation, yet the reported mean Spearman ρ̄=0.90 demonstrates only stability of relative rankings across corpora; it does not compare absolute reliability estimates against those obtainable from online single-prompt observation windows, leaving open whether prompt-specific deviations are negligible enough to justify freezing the tables.
  Authors: We agree that the reported Spearman correlation primarily establishes consistency in relative rankings across corpora rather than a head-to-head comparison of absolute reliability values from single-prompt online windows. Because the retention policy operates on ranked priorities under a fixed budget, this ranking stability is the operative property for compression decisions. The observed transfer performance (minimal degradation on disjoint corpora and 0.4–0.8 point loss on model transfer) provides supporting evidence that prompt-specific deviations are small enough to justify the frozen tables. To directly address the comparison, we will add a new subsection in the revised manuscript that contrasts variance in per-prompt reliability estimates against the compiled tables.
  Revision: yes
- Referee: [Results] Experimental section (implied by results on calibration and transfer): The performance advantage at 512-token budget and under 512/32k cache ratios rests on the calibration corpus being representative and disjoint; without explicit quantification of within-prompt vs. between-prompt variance on the tested backbones or tasks, the justification for O(1) lookup over stronger online estimators remains incompletely supported.
  Authors: The referee correctly notes that an explicit within- versus between-prompt variance decomposition would more rigorously support the preference for O(1) lookups. Our current evidence rests on the high cross-corpus Spearman correlation together with the empirical transfer results and the reported performance gains (including task-bootstrap confidence intervals). These outcomes are consistent with between-prompt variance being subordinate to the stable architectural signal. We will revise the experimental section to include a direct variance analysis on the four backbones and tasks, thereby providing the requested quantification and clarifying why the offline tables outperform stronger online baselines in the tested regimes.
  Revision: yes
Circularity Check
No significant circularity; empirical transfer on disjoint data keeps derivation self-contained
The paper's core argument—that per-head reliability and compression sensitivity exhibit higher cross-prompt regularity—is supported by direct measurement of ranking transfer (mean Spearman ρ̄=0.90) across explicitly disjoint corpora and modest model-to-model transfer loss. Retention tables are compiled from a calibration set and evaluated on separate test corpora and backbones, so the reported +1.67 average gain at 512-token budget is an out-of-sample empirical result rather than a quantity forced by construction from the same inputs. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided derivation chain; the statistical unit choice is justified by observable regularity rather than by re-using the target performance metric.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Corrective signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise
invented entities (1)
- compiled retention tables as portable architectural priors (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We frame the compilation of compression policies as an offline reinforcement learning (RL) problem... Conservative Q-Learning... Head Heterogeneity Table... Risk-Adaptive Threshold Gating
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 4.1 (Stability-Oriented Attention Approximation Bound)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.