CompilerKV: Risk-Adaptive KV Compression via Offline Experience Compilation

Baoliang Tian; Chengzhi Wang; Haijun Zhang; Ning Yang; Yibo Liu

arxiv: 2602.08686 · v2 · pith:SAV3OGFSnew · submitted 2026-02-09 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 13:27 UTC · model grok-4.3

keywords KV cache compression · large language models · long context inference · offline compilation · retention policy · prefill-only methods · cache budget optimization · serving efficiency

The pith

By compiling retention signals offline from a calibration corpus, CompilerKV turns noisy per-prompt estimates into fast lookups that improve compressed KV performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that signals for deciding which key-value states to retain, such as per-head reliability and prompt-level compression sensitivity, show more consistency across prompts than noise within any single prompt. By compiling these signals into tables from a calibration set in advance, the method replaces online estimation with constant-time lookups after a short observation window. This change produces better retention decisions under a fixed token budget and yields the highest scores among prefill-only compression approaches on four different model backbones. A sympathetic reader would care because the approach keeps long-context accuracy higher while using far less memory during inference and serving. The tables also prove portable, transferring rankings across separate data sets and even between models with only small losses.

Core claim

CompilerKV compiles corrective tables for per-head reliability and prompt-level compression sensitivity offline from a calibration corpus. This reduces online correction after the standard observation-window scan to O(1) lookups plus a budget clamp. The resulting tables behave as portable architectural priors whose rankings transfer across disjoint corpora on four backbones with mean Spearman correlation 0.90, while direct model-to-model transfer costs only 0.4 to 0.8 LongBench points on average. At a 512-token budget the method attains compressed-SOTA on all four backbones and improves over the strongest prefill-only baseline by 1.67 points on average.
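To make the "O(1) lookups plus a budget clamp" step concrete, here is a minimal Python sketch under stated assumptions. The table layout, the names `HEAD_RELIABILITY`, `RISK_BIN_EDGES`, `RISK_THRESHOLD` and `select_tokens`, the multiplicative correction, and every numeric value are hypothetical illustrations; the summary above does not say how the paper's tables are structured or applied.

```python
from bisect import bisect_right

# Hypothetical compiled artifacts (illustrative values, not from the paper).
HEAD_RELIABILITY = {(0, 0): 1.2, (0, 1): 0.7}  # (layer, head) -> score multiplier
RISK_BIN_EDGES = [0.3, 0.6]                    # bin boundaries for a prompt-level risk statistic
RISK_THRESHOLD = [0.10, 0.05, 0.02]            # one retention threshold per risk bin


def select_tokens(scores, layer, head, risk, budget):
    """Return the sorted indices of the tokens kept for one attention head.

    scores: per-token importance from the observation-window scan.
    risk:   a scalar prompt-level sensitivity statistic (assumed to be given).
    budget: maximum number of tokens this head may retain.
    """
    # Constant-time dictionary lookup; unknown heads fall back to no correction.
    weight = HEAD_RELIABILITY.get((layer, head), 1.0)
    # Lookup over a small, fixed number of bins.
    threshold = RISK_THRESHOLD[bisect_right(RISK_BIN_EDGES, risk)]

    corrected = [s * weight for s in scores]
    keep = [i for i, s in enumerate(corrected) if s >= threshold]

    # Budget clamp: if too many tokens pass the threshold, keep the top `budget`.
    keep.sort(key=lambda i: corrected[i], reverse=True)
    return sorted(keep[:budget])


print(select_tokens([0.5, 0.01, 0.2, 0.03], layer=0, head=0, risk=0.7, budget=2))
```

In this sketch the only per-prompt work beyond the scan itself is two lookups and one capped sort, which is the property the core claim emphasizes; how the real method combines the two tables may differ.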

What carries the argument

Offline-compiled retention tables that encode per-head reliability and compression sensitivity for O(1) online lookups after an observation window.

If this is right

Retention rankings from the compiled tables transfer across disjoint corpora with mean Spearman correlation of 0.90.
Tables transfer directly from one model to another at a cost of only 0.4 to 0.8 LongBench points on average.
The performance gap widens under pressure, remaining strongest at 128k context lengths and when retaining only 1.56 percent of prefill KV states.
Batch-16 serving stays feasible at 32k inputs where the full KV cache triggers out-of-memory errors.
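The ranking-transfer figure above is a Spearman correlation, which compares orderings rather than raw values. A minimal pure-Python sketch of that statistic follows; the function names and the example scores are made up for illustration, and the paper's exact protocol (which rankings are compared, and how the mean over backbones is taken) is not given in this summary.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(sxx * syy)


# Made-up per-head scores compiled from two hypothetical disjoint corpora.
table_a = [0.91, 0.40, 0.75, 0.10]
table_b = [0.88, 0.35, 0.52, 0.60]
print(spearman(table_a, table_b))
```

A value near 1 means the two corpora order the heads almost identically even if the raw scores differ, which is why a high correlation speaks to ranking stability rather than to agreement in absolute values.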

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The portability of the tables suggests that attention head behaviors contain stable, model-intrinsic regularities that can be pre-extracted once and reused across many users.
The same offline compilation idea could be tested on eviction policies that continue to act during the decoding phase rather than only at prefill.
Model releases might include pre-compiled tables as optional artifacts to simplify high-performance inference for downstream users.
Extending the calibration corpus to include more diverse task types would test how far the cross-prompt regularity assumption generalizes.

Load-bearing premise

Corrective signals such as per-head reliability and prompt-level compression sensitivity exhibit far higher cross-prompt regularity than within-prompt signal-to-noise, allowing effective offline compilation from a calibration corpus.

What would settle it

If retention decisions made from online estimates on a fresh prompt consistently outperform the pre-compiled tables on held-out test prompts from the same distribution, the claimed benefit of offline compilation would be falsified.

Figures

Figures reproduced from arXiv: 2602.08686 by Baoliang Tian, Chengzhi Wang, Haijun Zhang, Ning Yang, Yibo Liu.

**Figure 1.** Overview of the CompilerKV Framework. The framework consists of three integrated stages: (1) computing a noise-resilient baseline score to filter out transient distractions; (2) modulating the ranking via a compiled Head Heterogeneity Table to strictly govern the functional differences among attention heads; and (3) querying a Risk-Gating Table to dynamically calibrate the retention threshold based on the …

**Figure 2.** Performance vs. KV Cache Size. Comparison of average accuracy on LongBench across different budget constraints. Our method (CompilerKV) degrades most gracefully, maintaining usability even at extreme compression ratios where baselines fail.

**Figure 3.** Needle-in-a-Haystack Pressure Test on Mistral-7B. Visual comparison of retrieval accuracy (Green=100%, Red=0%) across varying context lengths (x-axis) and needle depths (y-axis). FullKV (a) sets the upper bound. While baselines like StreamingLLM (b) and SnapKV (c) struggle with long-range dependencies, and DynamicKV (e) shows fragmentation at extreme lengths, our method CompilerKV (f) maintains a robust re…

Original abstract

Prefill-only KV compression freezes a token subset at the end of prefill and decodes from it without further eviction. The retention decision is therefore irreversible, yet existing methods estimate the corrective signals it relies on, per-head reliability and prompt-level compression sensitivity, online from a single noisy prompt. We argue this is the wrong statistical unit: these signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise. We introduce CompilerKV, a KV-retention policy whose corrective tables are compiled offline from a calibration corpus, reducing online correction after the standard observation-window scan to O(1) lookups plus a budget clamp. We find that compiled retention tables behave as portable architectural priors: rankings transfer across disjoint corpora on four backbones (mean Spearman ρ̄=0.90), and direct model-to-model table transfer costs only 0.4–0.8 LongBench points on average. At a 512-token budget, CompilerKV attains compressed-SOTA on all four backbones, improving over the strongest prefill-only baseline by +1.67 points on average (task-bootstrap 95% CI [+1.08, +2.37]). Pressure regimes amplify the gap: under a fixed 512/32k cache ratio, CompilerKV remains the strongest compressed method through 128k RULER (~73 vs. FullKV ~79, SnapKV ~38); on 32k NIAH it reaches 0.89 vs. SnapKV 0.42; and at 32k input, retaining only 1.56% of the prefill KV, batch-16 serving remains feasible where FullKV is OOM.
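The 1.56% figure in the abstract can be checked with a few lines of arithmetic. The sketch below assumes "32k" means 32 × 1024 = 32768 tokens, which is the reading under which 512 retained tokens come out at 1.56%; the helper names are hypothetical.

```python
def retained_fraction(budget_tokens, context_tokens):
    """Fraction of prefill KV entries kept under a fixed token budget."""
    return budget_tokens / context_tokens


def budget_at_fixed_ratio(context_tokens, ratio):
    """Token budget implied by holding the cache ratio constant."""
    return round(context_tokens * ratio)


ratio = retained_fraction(512, 32 * 1024)
print(f"{ratio:.4%}")                            # 1.5625%
print(budget_at_fixed_ratio(128 * 1024, ratio))  # 2048
```

If the 512/32k ratio is held fixed as the context grows, as the abstract's wording suggests, a 128k-token input would correspond to a budget of 2048 tokens under the same assumption.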

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes CompilerKV, a prefill-only KV compression policy for LLMs that compiles retention tables offline from a calibration corpus instead of estimating per-head reliability and prompt-level compression sensitivity online from a single prompt. It argues these signals show higher cross-prompt regularity than within-prompt noise, enabling O(1) lookups at inference. Experiments on four backbones report that the tables transfer across disjoint corpora (mean Spearman ρ̄=0.90) and models (0.4–0.8 point loss), and at a 512-token budget CompilerKV achieves compressed-SOTA with +1.67 average gain (task-bootstrap 95% CI [+1.08, +2.37]) over the strongest baseline, with further gains under long-context pressure regimes.

Significance. If the cross-prompt regularity premise holds and the tables function as portable architectural priors, the approach could simplify KV cache management by eliminating per-prompt online correction, improving efficiency in memory-constrained serving. The reported transfer results and quantitative gains with confidence intervals on multiple backbones provide concrete evidence of practical utility if the experimental controls are robust.

major comments (2)

[Abstract] Abstract: The claim that corrective signals exhibit 'far higher cross-prompt regularity than within-prompt signal-to-noise' is load-bearing for preferring offline compilation, yet the reported mean Spearman ρ̄=0.90 demonstrates only stability of relative rankings across corpora; it does not compare absolute reliability estimates against those obtainable from online single-prompt observation windows, leaving open whether prompt-specific deviations are negligible enough to justify freezing the tables.
[Results] Experimental section (implied by results on calibration and transfer): The performance advantage at 512-token budget and under 512/32k cache ratios rests on the calibration corpus being representative and disjoint; without explicit quantification of within-prompt vs. between-prompt variance on the tested backbones or tasks, the justification for O(1) lookup over stronger online estimators remains incompletely supported.

minor comments (1)

[Abstract] The abstract mentions 'task-bootstrap 95% CI' but does not specify the number of tasks or bootstrap procedure details, which would aid reproducibility assessment.
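For readers unfamiliar with the term, one common way to build a task-level bootstrap interval is to resample whole tasks with replacement and take percentiles of the resampled mean gain. The sketch below shows that generic percentile procedure; it is an assumption about what "task-bootstrap" means here, not the paper's documented method, and the per-task gains are made-up numbers.

```python
import random


def task_bootstrap_ci(per_task_gains, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean gain, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(per_task_gains)
    means = []
    for _ in range(n_resamples):
        sample = [per_task_gains[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-task score differences (method minus baseline), one entry per task.
gains = [1.0, 2.0, 3.0, 0.5, 2.5]
print(task_bootstrap_ci(gains))
```

The width of such an interval depends directly on how many tasks are resampled, which is why the task count and resampling details matter for judging the reported interval.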

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of the statistical justification for offline compilation, and we respond to each point below while committing to targeted revisions that strengthen the manuscript without altering its core claims.

Referee: [Abstract] Abstract: The claim that corrective signals exhibit 'far higher cross-prompt regularity than within-prompt signal-to-noise' is load-bearing for preferring offline compilation, yet the reported mean Spearman ρ̄=0.90 demonstrates only stability of relative rankings across corpora; it does not compare absolute reliability estimates against those obtainable from online single-prompt observation windows, leaving open whether prompt-specific deviations are negligible enough to justify freezing the tables.

Authors: We agree that the reported Spearman correlation primarily establishes consistency in relative token rankings across corpora rather than a head-to-head comparison of absolute reliability values from single-prompt online windows. Because the retention policy operates on ranked priorities under a fixed budget, this ranking stability is the operative property for compression decisions. The observed transfer performance (minimal degradation on disjoint corpora and 0.4–0.8 point loss on model transfer) provides supporting evidence that prompt-specific deviations are small enough to justify the frozen tables. To directly address the comparison, we will add a new subsection in the revised manuscript that contrasts variance in per-prompt reliability estimates against the compiled tables. revision: yes
Referee: [Results] Experimental section (implied by results on calibration and transfer): The performance advantage at 512-token budget and under 512/32k cache ratios rests on the calibration corpus being representative and disjoint; without explicit quantification of within-prompt vs. between-prompt variance on the tested backbones or tasks, the justification for O(1) lookup over stronger online estimators remains incompletely supported.

Authors: The referee correctly notes that an explicit within- versus between-prompt variance decomposition would more rigorously support the preference for O(1) lookups. Our current evidence rests on the high cross-corpus Spearman correlation together with the empirical transfer results and the reported performance gains (including task-bootstrap confidence intervals). These outcomes are consistent with between-prompt variance being subordinate to the stable architectural signal. We will revise the experimental section to include a direct variance analysis on the four backbones and tasks, thereby providing the requested quantification and clarifying why the offline tables outperform stronger online baselines in the tested regimes. revision: yes
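One simple form the promised variance analysis could take is a one-way decomposition of per-head reliability estimates into a stable per-head component and prompt-specific deviation. The sketch below is only an illustration of that idea: the function name, the equal-prompts-per-head layout, and the data are all assumptions, and the authors may choose a different analysis.

```python
from statistics import fmean, pvariance


def stable_fraction(estimates):
    """Share of total variance explained by head identity.

    estimates[h][p] is a reliability estimate for head h measured on prompt p.
    Every head is assumed to be measured on the same number of prompts.
    """
    head_means = [fmean(row) for row in estimates]
    grand_mean = fmean(head_means)
    # Variance of the per-head means: what a frozen offline table can capture.
    stable = fmean([(m - grand_mean) ** 2 for m in head_means])
    # Average variance across prompts within a head: what a frozen table misses.
    prompt_specific = fmean([pvariance(row) for row in estimates])
    total = stable + prompt_specific
    if total == 0:
        raise ValueError("all estimates are identical; the fraction is undefined")
    return stable / total


# Made-up estimates for three heads on four prompts.
estimates = [
    [0.90, 0.88, 0.93, 0.91],
    [0.40, 0.45, 0.38, 0.41],
    [0.65, 0.60, 0.70, 0.66],
]
print(stable_fraction(estimates))
```

A fraction close to 1 would support freezing the tables, while a low fraction would indicate that prompt-specific deviations carry much of the signal and that online estimation has something to add.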

Circularity Check

0 steps flagged

No significant circularity; empirical transfer on disjoint data keeps derivation self-contained

full rationale

The paper's core argument—that per-head reliability and compression sensitivity exhibit higher cross-prompt regularity—is supported by direct measurement of ranking transfer (mean Spearman ρ̄=0.90) across explicitly disjoint corpora and modest model-to-model transfer loss. Retention tables are compiled from a calibration set and evaluated on separate test corpora and backbones, so the reported +1.67 average gain at 512-token budget is an out-of-sample empirical result rather than a quantity forced by construction from the same inputs. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided derivation chain; the statistical unit choice is justified by observable regularity rather than by re-using the target performance metric.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; the approach rests primarily on the cross-prompt regularity assumption and the claim that compiled tables act as portable priors. No explicit free parameters or additional invented entities are detailed beyond the retention tables themselves.

axioms (1)

Domain assumption: Corrective signals exhibit far higher cross-prompt regularity than within-prompt signal-to-noise.
This is the central argument given for preferring offline compilation over online estimation.

invented entities (1)

Compiled retention tables as portable architectural priors (no independent evidence)
purpose: To enable O(1) online lookups for KV retention decisions that transfer across corpora and models
Introduced as the core output of offline compilation; transfer is claimed but independent evidence beyond the reported experiments is not provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We frame the compilation of compression policies as an offline reinforcement learning (RL) problem... Conservative Q-Learning... Head Heterogeneity Table... Risk-Adaptive Threshold Gating
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

Relation between the paper passage and the cited Recognition theorem.

Theorem 4.1 (Stability-Oriented Attention Approximation Bound)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 10 internal anchors

[1] Behrouz, A., Zhong, P., and Mirrokni, V. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663.
[2] Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
[3] Child, R. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
[4] Dasigi, P., Lo, K., Beltagy, I., Cohan, A., Smith, N. A., and Gardner, M. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011.
[5] Du, W., Jiang, L., Tao, K., Liu, X., and Wang, H. Which heads matter for reasoning? RL-guided KV cache compression. arXiv preprint arXiv:2510.08525.
[6] Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV: Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550.
[7] Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., and Xiao, W. Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258, 2024.
[8] Gliwa, B., Mochol, I., Biesek, M., and Wawer, A. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237.
[9] Guo, Z., Kamigaito, H., and Watanabe, T. Attention score is not all you need for token importance indicator in KV cache reduction: Value also matters. arXiv preprint arXiv:2406.12335.
[10] Huang, L., Cao, S., Parulian, N., Ji, H., and Wang, L. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
[11] Li, X. and Roth, D. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics.
[12] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.
[13] Liu, T., Xu, C., and McAuley, J. RepoBench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023a. Liu, Z., Wang, J., Dao, T., Zhou, T., Yuan, B., Song, Z., Shrivastava, A., Zhang, C., Tian, Y., Re, C., et al. Deja vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machi...
[14] Tang, C., Liu, J., Xu, H., and Huang, L. Adaptive KV-cache compression without manually setting budget. arXiv preprint arXiv:2509.03136.
[15] Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context LLM inference. arXiv preprint arXiv:2406.10774.
[16] Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418.
[17] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
[18] Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819.
[19] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380.
[20] Zhou, X., Wang, W., Zeng, M., Guo, J., Liu, X., Shen, L., Zhang, M., and Ding, L. DynamicKV: Task-aware adaptive KV cache compression for long context LLMs. arXiv preprint arXiv:2412.14838.
