Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Munsik Kim

arxiv: 2605.25085 · v2 · pith:YME2XZVAnew · submitted 2026-05-24 · 💻 cs.IT · cs.AI· cs.LG· math.IT

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Munsik Kim This is my paper

Pith reviewed 2026-06-29 23:44 UTC · model grok-4.3

classification 💻 cs.IT cs.AIcs.LGmath.IT

keywords KV cache compressionautoregressive language modelssequential Wyner-Ziv codingpolynomial truncation sensitivitysliding window cacherate-distortion boundscontext truncationsuffix-only policies

0 comments

The pith

Under a polynomial truncation-sensitivity assumption, suffix-only KV cache policies require memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the next-token distribution in autoregressive language models decays polynomially rather than geometrically in its sensitivity to truncating distant context. It frames online KV cache compression as sequential Wyner-Ziv source coding on the model's filtration, with the next token as side information. Under this assumption, it proves that sliding-window policies attain distortion ε with window size O(ε^{-1/α}) while a matching converse shows Ω(ε^{-1/α}) is necessary inside the suffix-only class, for overall Θ scaling. A reader cares because the result supplies an explicit memory-distortion tradeoff that explains why recency-based eviction outperforms random retention by orders of magnitude at fixed budget.

Core claim

Under the polynomial truncation-sensitivity assumption, the per-token memory requirement of suffix-only cache policies is characterized: a sliding-window scheme attains distortion ε with window w = O(ε^{-1/α}), and under an additional two-sided Bayes-risk condition a converse shows w = Ω(ε^{-1/α}) is necessary within this policy class, so the scaling is Θ(ε^{-1/α}) for suffix-only policies. An explicit block-Markov scheme achieves the upper bound.

What carries the argument

The polynomial truncation-sensitivity assumption that next-token distribution sensitivity to context truncation decays as a power law with truncation distance, inside a sequential Wyner-Ziv formulation of KV cache compression.

If this is right

Sliding-window caches achieve the optimal scaling among all suffix-only policies under the stated assumptions.
An explicit block-Markov scheme attains the upper bound, with rate-of-convergence exponent matching the converse under forward-decay and regularity hypotheses.
Recency-based eviction (sliding or sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget.
The fitted polynomial law predicts the observed degradation curves of concrete cache policies across models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Whether recurrent or propagating cache summaries can beat the Θ(ε^{-1/α}) scaling for suffix-only policies remains open by the paper's own statement.
The empirical power-law fit recovered independently from sink-plus-recent KL measurements suggests the assumption may hold across a range of model sizes and families.
Position-preserving ablation confirms the decay is not an artifact of positional encodings, pointing toward a test on models without such encodings.

Load-bearing premise

The next-token distribution's sensitivity to context truncation decays polynomially rather than geometrically with truncation distance.

What would settle it

A measurement showing that the KL divergence between next-token distributions with and without the oldest w tokens decays exponentially in w rather than polynomially would falsify the assumption and collapse the Θ(ε^{-1/α}) scaling claim.

Figures

Figures reproduced from arXiv: 2605.25085 by Munsik Kim.

**Figure 1.** Figure 1: Sequential Wyner–Ziv compression of the KV cache. The encoder ϕt maps past tokens to Cˆt; the decoder ψt reconstructs p˜t from (Cˆ≤t, Qt). Distortion DKL(pt ∥ p˜t). 3. Preliminaries and Problem Formulation Notation. Let X1, X2, . . . ∈ V denote tokens with |V| = V , and let Ft := σ(X1, . . . , Xt). The language model is a fixed function f producing logits Zt = f(X<t) via L attention layers; the next-token… view at source ↗

**Figure 2.** Figure 2: Measured sink-plus-recent KL decay on Qwen2.5-0.5B. The scheme follows the power law D ∝ k −2α where α is the TVdecay exponent of [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: Cache-policy degradation on Qwen2.5-0.5B (Natural domain, position-preserving, 100 prefixes). Recency-based policies (sliding, sink-plus-recent) suppress distortion by ∼100× relative to Random-K at equal budget; the lightweight last-query attentionscore heavy-hitter sits in between. Every policy decays as a power law in k, as predicted by polynomial truncation sensitivity. Sliding and sink-plus-recent co… view at source ↗

**Figure 5.** Figure 5: Long-window TV decay to w = 8192 (Qwen2.5-0.5B, position-preserving, prefix length 16,500). A single power law describes the data across more than five octaves; the w = 8192 point sits on the trend rather than collapsing, because the long prefix keeps the window below ≈ 50% of the context (avoiding the boundary artifact that affected shorter-prefix sweeps). The Natural exponent (α = 0.36) is close to the s… view at source ↗

**Figure 6.** Figure 6: KL versus TV2 for sink-plus-recent on Qwen2.5-0.5B (Natural), at the lightest and heaviest smoothing levels. Both lie on a through-the-origin line, confirming the locally quadratic KL–TV relation; the fitted exponent ratio αKL/αTV equals 2.00 at every µ ∈ {10−4 , 10−2 , 0.1, 0.3}, so the exponent doubling is robust across smoothing. B. Notation We collect the principal notation, organized by category. C. P… view at source ↗

**Figure 7.** Figure 7: Measured TV decay on Qwen2.5-0.5B (positionpreserving protocol, 100 prefixes per domain). Both Natural (NLTK Gutenberg) and Code (GitHub Python) domains exhibit power-law decay TVdw ∝ w −α . Power-law fit yields log-RMSE 0.14 (Natural) and 0.08 (Code), versus log-RMSE 0.31 and 0.20 for an exponential fit. The natural-language exponent exceeds the code exponent (αnat = 0.44 > αcode = 0.38); the same expone… view at source ↗

**Figure 8.** Figure 8: Sensitivity ratio across 60 layers (Lemma O.11). Skip connections with LbLLN = 0.1 reduce the ratio from ≈ 320 to ≈ 1.8, supporting uniform dc as approximately optimal. error satisfies ∥a (L) t − aˆ (L) t ∥ 2 2 ≤ X L ℓ=1 s (ℓ) δ (ℓ) t , s (ℓ) := Y L m=ℓ+1 (1 + ϵm)(L (m) g ) 2 , for any {ϵm > 0}, where the (1 + ϵm) factors arise from the cross-terms in the squared norm. Taking ϵm → 0 (valid when the per-lay… view at source ↗

read the original abstract

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$-$3$B parameters, we find that the next-token distribution's sensitivity to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding \emph{polynomial truncation-sensitivity} assumption, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and -- under an additional two-sided Bayes-risk condition -- a converse shows $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The empirical case for polynomial (not geometric) truncation sensitivity is the real takeaway; the tight Theta memory scaling needs an extra Bayes-risk condition that is not automatic from the main assumption.

read the letter

The main thing to know is that the paper reports a consistent empirical finding: next-token prediction sensitivity to context truncation decays polynomially across four models from two families, with the power-law fit beating exponential by an order of magnitude in extrapolation. They recover the exponent independently from a sink-plus-recent KL measurement and rule out positional-encoding artifacts with an ablation. That part looks like careful measurement work.

The theory then treats KV cache compression as sequential Wyner-Ziv coding on the model's filtration, with the next query as side information. Under the polynomial assumption a sliding-window policy reaches distortion ε with window size O(ε^{-1/α}). The matching lower bound requires an additional two-sided Bayes-risk condition that the authors note is not implied by truncation sensitivity alone, so the claimed Θ scaling is conditional. Their block-Markov scheme meets the upper bound, but its convergence exponent only matches the converse under further forward-decay and regularity hypotheses.

The soft spot is exactly that gap on the necessity side: without the extra condition the result reduces to an O upper bound plus the empirical observation. The paper leaves open whether recurrent or propagating summaries can beat the scaling, which is honest but limits how far the practical claim goes. The empirical degradation curves do line up with the polynomial prediction and show recency policies beating random retention by roughly two orders of magnitude at equal budget.

This is for people working on KV-cache compression and memory-efficient inference. The empirical checks are worth referee time even if the converse stays conditional. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The paper empirically demonstrates that next-token distribution sensitivity to context truncation in autoregressive LMs decays polynomially (rather than geometrically) across multiple models, and under a corresponding polynomial truncation-sensitivity assumption formulates KV cache compression as sequential Wyner-Ziv coding; it shows that suffix-only policies (e.g., sliding-window) attain distortion ε with window size O(ε^{-1/α}), and under an additional two-sided Bayes-risk condition establishes a matching Ω lower bound, yielding Θ(ε^{-1/α}) scaling, with an explicit block-Markov scheme achieving the upper bound (whose exponent matches the converse only under further hypotheses).

Significance. If the polynomial assumption and bounds hold, the result supplies the first information-theoretic scaling law for suffix-only KV cache compression, explaining why recency-based policies outperform random retention by orders of magnitude and guiding practical memory-distortion trade-offs; the cross-model empirical support and independent exponent recovery are notable strengths, though the additional conditions required for tightness limit the scope of the central characterization.

major comments (2)

[Abstract / main result] Abstract and main result statement: the Θ(ε^{-1/α}) characterization for suffix-only policies rests on the two-sided Bayes-risk condition for the converse, which the text explicitly states is additional and 'not implied by truncation sensitivity alone' (as also noted for the block-Markov exponent match); without a proof relating this condition to the core polynomial assumption or an empirical verification on the induced filtration, the necessity direction is not established and the claim reduces to an O upper bound.
[Empirical section] Empirical verification paragraph: the claim that the fitted exponent α is 'recovered independently from a sink-plus-recent KL measurement' is central to supporting the polynomial law, yet no quantitative match statistic, confidence interval, or extrapolation error comparison between the two measurements is provided; this leaves open whether the recovery is statistically robust or coincidental.

minor comments (1)

[Problem formulation] Notation for the filtration and side-information query should be introduced with an explicit diagram or equation reference in the problem formulation to aid readability for readers outside information theory.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point by point below. Both points identify areas where the presentation can be strengthened, and we will incorporate revisions accordingly.

read point-by-point responses

Referee: [Abstract / main result] Abstract and main result statement: the Θ(ε^{-1/α}) characterization for suffix-only policies rests on the two-sided Bayes-risk condition for the converse, which the text explicitly states is additional and 'not implied by truncation sensitivity alone' (as also noted for the block-Markov exponent match); without a proof relating this condition to the core polynomial assumption or an empirical verification on the induced filtration, the necessity direction is not established and the claim reduces to an O upper bound.

Authors: We agree that the Θ(ε^{-1/α}) scaling for suffix-only policies is established only conditionally on the two-sided Bayes-risk assumption, as already stated in the manuscript. The upper bound O(ε^{-1/α}) holds under the polynomial truncation-sensitivity assumption alone. To make the conditional nature of the converse fully explicit, we will revise the abstract and the statement of the main result to foreground that the matching lower bound requires the additional condition. No proof linking the Bayes-risk condition to truncation sensitivity is available, and we do not claim one. revision: yes
Referee: [Empirical section] Empirical verification paragraph: the claim that the fitted exponent α is 'recovered independently from a sink-plus-recent KL measurement' is central to supporting the polynomial law, yet no quantitative match statistic, confidence interval, or extrapolation error comparison between the two measurements is provided; this leaves open whether the recovery is statistically robust or coincidental.

Authors: We acknowledge the absence of quantitative support for the independent exponent recovery. In the revised manuscript we will add R² values and 95% confidence intervals for the exponents obtained from both the direct truncation-sensitivity fit and the sink-plus-recent KL measurement, together with a direct comparison of extrapolation error on held-out context lengths. These additions will allow readers to assess the statistical robustness of the recovery. revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical bounds derived from stated assumption with independent empirical motivation

full rationale

The paper motivates a polynomial truncation-sensitivity assumption via empirical fits (power-law decay verified independently via multiple measurements and ablations) but then derives the O(ε^{-1/α}) upper bound and conditional Ω lower bound as mathematical consequences of that assumption plus an explicitly noted extra Bayes-risk condition. The scaling expression uses the fitted α as a parameter of the assumption rather than re-deriving or predicting the same fitted quantity. The abstract itself flags that the block-Markov exponent match requires further hypotheses 'not implied by truncation sensitivity alone,' avoiding any claim that the full Θ result follows from the core assumption. No self-citations, self-definitional steps, or fitted-input-as-prediction reductions appear in the provided derivation outline.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirically observed polynomial decay (treated as an assumption for the bounds) together with standard sequential Wyner-Ziv theory; the only free parameter is the fitted exponent α.

free parameters (1)

exponent α = empirically fitted
Power-law exponent governing decay of next-token sensitivity to truncation distance; obtained by fitting to empirical measurements.

axioms (1)

domain assumption Polynomial truncation-sensitivity assumption
Next-token distribution sensitivity to context truncation decays polynomially with truncation distance.

pith-pipeline@v0.9.1-grok · 5876 in / 1301 out tokens · 43092 ms · 2026-06-29T23:44:11.536412+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 12 canonical work pages · 9 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
[2]

Ainslie, J. et al. GQA : Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP, 2023

2023
[3]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Allal, L. B., Lozhkov, A., Bakouch, E., Bl \'a gojevi \'c , G. M., Penedo, G., Kydl \' c ek, H., et al. Smollm2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

H., and Zoccolan, D

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019

2019
[5]

and Tatikonda, S

Behmin, M. and Tatikonda, S. Multi-task rate-distortion: theory and applications. IEEE Trans. Inf. Theory, 2022

2022
[6]

Bennett, W. R. Spectra of quantized signals. Bell System Technical Journal, 27 0 (3): 0 446--472, 1948

1948
[7]

Concentration Inequalities: A Nonasymptotic Theory of Independence

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013
[8]

Bradley, R. C. Basic properties of strong mixing conditions: A survey and some open questions. Probability Surveys, 2: 0 107--144, 2005

2005
[9]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., and Xiao, W. PyramidKV : Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006

2006
[11]

D., Stavrou, P

Charalambous, C. D., Stavrou, P. A., and Ahmed, N. U. Nonanticipative rate distortion function and relations to filtering theory. IEEE Transactions on Automatic Control, 59 0 (4): 0 937--952, 2013

2013
[12]

LongLoRA : Efficient fine-tuning of long-context large language models

Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. LongLoRA : Efficient fine-tuning of long-context large language models. In International Conference on Learning Representations (ICLR), 2024

2024
[13]

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006

2006
[14]

and K \"o rner, J

Csisz \'a r, I. and K \"o rner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2nd edition, 2011

2011
[15]

and Shields, P

Csisz \'a r, I. and Shields, P. C. Information theory and statistics: A tutorial. Now Publishers, 2004

2004
[16]

Lipschitz normalization for self-attention layers with application to graph neural networks

Dasoulas, G., Scaman, K., and Virmaux, A. Lipschitz normalization for self-attention layers with application to graph neural networks. In International Conference on Machine Learning (ICML), 2021

2021
[17]

DeepSeek-V2 : A strong, economical, and efficient mixture-of-experts language model, 2024

DeepSeek-AI . DeepSeek-V2 : A strong, economical, and efficient mixture-of-experts language model, 2024

2024
[18]

Attention is not all you need: Pure attention loses rank doubly exponentially with depth

Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp.\ 2793--2803, 2021

2021
[19]

Subgeometric rates of convergence of f -ergodic strong Markov processes

Douc, R., Fort, G., and Guillin, A. Subgeometric rates of convergence of f -ergodic strong Markov processes. Stochastic Processes and their Applications, 119 0 (3): 0 897--923, 2009

2009
[20]

Markov Chains

Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer, 2018

2018
[21]

and Kim, Y.-H

El Gamal, A. and Kim, Y.-H. Network Information Theory. Cambridge University Press, 2011

2011
[22]

Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV : Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3 0 (1): 0 100--118, 1975

1975
[24]

Model tells you what to discard: Adaptive KV cache compression for LLMs

Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive KV cache compression for LLMs . In International Conference on Learning Representations (ICLR), 2024

2024
[25]

Geiger, B. C. and Koch, T. Rate-distortion dimension of stochastic processes. In IEEE International Symposium on Information Theory (ISIT), 2016. arXiv:1607.06792

work page internal anchor Pith review Pith/arXiv arXiv 2016
[26]

Asymptotically optimal block quantization

Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25 0 (4): 0 373--380, 1979

1979
[27]

A mathematical perspective on transformers

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023

work page arXiv 2023
[28]

Transformer feed-forward layers are key-value memories

Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In EMNLP, 2021

2021
[29]

Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B , 2023

2023
[30]

GEAR : An efficient KV cache compression recipe for near-lossless generative inference of LLM , 2024

Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. GEAR : An efficient KV cache compression recipe for near-lossless generative inference of LLM , 2024

2024
[31]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[32]

and Dembo, A

Kawabata, T. and Dembo, A. The rate-distortion dimension of sets and measures. IEEE Transactions on Information Theory, 40 0 (5): 0 1564--1572, 1994

1994
[33]

The Lipschitz constant of self-attention

Kim, H., Papamakarios, G., and Mnih, A. The Lipschitz constant of self-attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp.\ 5562--5571, 2021

2021
[34]

and Tuncel, E

Kostina, V. and Tuncel, E. Multiterminal source coding: fundamental limits and algorithms. Foundations and Trends in Communications and Information Theory, 2022

2022
[35]

and Verd \'u , S

Kostina, V. and Verd \'u , S. Fixed-length lossy compression in the finite blocklength regime. IEEE Transactions on Information Theory, 58 0 (6): 0 3309--3338, 2012

2012
[36]

SnapKV : LLM knows what you are looking for before generation

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV : LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024
[37]

KIVI : a tuning-free asymmetric 2bit quantization for KV cache, 2024

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI : a tuning-free asymmetric 2bit quantization for KV cache, 2024

2024
[38]

Liu, Z. et al. Scissorhands : Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023
[39]

and Wagner, A

Mahmood, A. and Wagner, A. B. Minimax rate-distortion. IEEE Transactions on Information Theory, 2024. arXiv:2202.04481

work page arXiv 2024
[40]

Massey, J. L. Causality, feedback and directed information. In Proceedings of the International Symposium on Information Theory and its Applications (ISITA), pp.\ 303--305, 1990

1990
[41]

The Llama 3 Herd of Models

Meta AI . The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

and Hutter, M

Phuong, M. and Hutter, M. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238, 2022

work page arXiv 2022
[43]

Piantadosi, S. T. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin and Review, 21: 0 1112--1130, 2014

2014
[44]

The intrinsic dimension of images and its impact on learning

Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations (ICLR), 2021

2021
[45]

Universal coding, information, prediction, and estimation

Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30 0 (4): 0 629--636, 1984

1984
[46]

Fast Transformer Decoding: One Write-Head is All You Need

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911
[47]

Quest : Query-aware sparsity for efficient long-context LLM inference

Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest : Query-aware sparsity for efficient long-context LLM inference. In International Conference on Machine Learning (ICML), 2024

2024
[48]

Dispersion of Gaussian Sources with Memory and an Extension to Abstract Sources

Tasci, E. and Kostina, V. Dispersion of Gaussian sources with memory and an extension to abstract sources. arXiv preprint arXiv:2602.09176, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

Tsybakov, A. B. Introduction to nonparametric estimation. Springer, 2009

2009
[50]

N., Kaiser, L., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 2017

2017
[51]

Probability with Martingales

Williams, D. Probability with Martingales. Cambridge University Press, 1991

1991
[52]

Witsenhausen, H. S. Indirect rate distortion problems. IEEE Transactions on Information Theory, 26 0 (5): 0 518--521, 1980

1980
[53]

Wyner, A. D. and Ziv, J. The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory, 22 0 (1): 0 1--10, 1976

1976
[54]

Efficient streaming language models with attention sinks

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

2024
[55]

Yang, A. et al. Qwen2.5: A party of foundation models. In arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

Zhang, Z. et al. H2O : Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

[2] [2]

Ainslie, J. et al. GQA : Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP, 2023

2023

[3] [3]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Allal, L. B., Lozhkov, A., Bakouch, E., Bl \'a gojevi \'c , G. M., Penedo, G., Kydl \' c ek, H., et al. Smollm2: When smol goes big -- data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

H., and Zoccolan, D

Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019

2019

[5] [5]

and Tatikonda, S

Behmin, M. and Tatikonda, S. Multi-task rate-distortion: theory and applications. IEEE Trans. Inf. Theory, 2022

2022

[6] [6]

Bennett, W. R. Spectra of quantized signals. Bell System Technical Journal, 27 0 (3): 0 446--472, 1948

1948

[7] [7]

Concentration Inequalities: A Nonasymptotic Theory of Independence

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013

2013

[8] [8]

Bradley, R. C. Basic properties of strong mixing conditions: A survey and some open questions. Probability Surveys, 2: 0 107--144, 2005

2005

[9] [9]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Cai, Z., Zhang, Y., Gao, B., Liu, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Chang, B., Hu, J., and Xiao, W. PyramidKV : Dynamic KV cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

and Lugosi, G

Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006

2006

[11] [11]

D., Stavrou, P

Charalambous, C. D., Stavrou, P. A., and Ahmed, N. U. Nonanticipative rate distortion function and relations to filtering theory. IEEE Transactions on Automatic Control, 59 0 (4): 0 937--952, 2013

2013

[12] [12]

LongLoRA : Efficient fine-tuning of long-context large language models

Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., and Jia, J. LongLoRA : Efficient fine-tuning of long-context large language models. In International Conference on Learning Representations (ICLR), 2024

2024

[13] [13]

Cover, T. M. and Thomas, J. A. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006

2006

[14] [14]

and K \"o rner, J

Csisz \'a r, I. and K \"o rner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2nd edition, 2011

2011

[15] [15]

and Shields, P

Csisz \'a r, I. and Shields, P. C. Information theory and statistics: A tutorial. Now Publishers, 2004

2004

[16] [16]

Lipschitz normalization for self-attention layers with application to graph neural networks

Dasoulas, G., Scaman, K., and Virmaux, A. Lipschitz normalization for self-attention layers with application to graph neural networks. In International Conference on Machine Learning (ICML), 2021

2021

[17] [17]

DeepSeek-V2 : A strong, economical, and efficient mixture-of-experts language model, 2024

DeepSeek-AI . DeepSeek-V2 : A strong, economical, and efficient mixture-of-experts language model, 2024

2024

[18] [18]

Attention is not all you need: Pure attention loses rank doubly exponentially with depth

Dong, Y., Cordonnier, J.-B., and Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp.\ 2793--2803, 2021

2021

[19] [19]

Subgeometric rates of convergence of f -ergodic strong Markov processes

Douc, R., Fort, G., and Guillin, A. Subgeometric rates of convergence of f -ergodic strong Markov processes. Stochastic Processes and their Applications, 119 0 (3): 0 897--923, 2009

2009

[20] [20]

Markov Chains

Douc, R., Moulines, E., Priouret, P., and Soulier, P. Markov Chains. Springer, 2018

2018

[21] [21]

and Kim, Y.-H

El Gamal, A. and Kim, Y.-H. Network Information Theory. Cambridge University Press, 2011

2011

[22] [22]

Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV : Optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference. arXiv preprint arXiv:2407.11550, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3 0 (1): 0 100--118, 1975

1975

[24] [24]

Model tells you what to discard: Adaptive KV cache compression for LLMs

Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive KV cache compression for LLMs . In International Conference on Learning Representations (ICLR), 2024

2024

[25] [25]

Geiger, B. C. and Koch, T. Rate-distortion dimension of stochastic processes. In IEEE International Symposium on Information Theory (ISIT), 2016. arXiv:1607.06792

work page internal anchor Pith review Pith/arXiv arXiv 2016

[26] [26]

Asymptotically optimal block quantization

Gersho, A. Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25 0 (4): 0 373--380, 1979

1979

[27] [27]

A mathematical perspective on transformers

Geshkovski, B., Letrouit, C., Polyanskiy, Y., and Rigollet, P. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023

work page arXiv 2023

[28] [28]

Transformer feed-forward layers are key-value memories

Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In EMNLP, 2021

2021

[29] [29]

Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B , 2023

2023

[30] [30]

GEAR : An efficient KV cache compression recipe for near-lossless generative inference of LLM , 2024

Kang, H., Zhang, Q., Kundu, S., Jeong, G., Liu, Z., Krishna, T., and Zhao, T. GEAR : An efficient KV cache compression recipe for near-lossless generative inference of LLM , 2024

2024

[31] [31]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[32] [32]

and Dembo, A

Kawabata, T. and Dembo, A. The rate-distortion dimension of sets and measures. IEEE Transactions on Information Theory, 40 0 (5): 0 1564--1572, 1994

1994

[33] [33]

The Lipschitz constant of self-attention

Kim, H., Papamakarios, G., and Mnih, A. The Lipschitz constant of self-attention. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp.\ 5562--5571, 2021

2021

[34] [34]

and Tuncel, E

Kostina, V. and Tuncel, E. Multiterminal source coding: fundamental limits and algorithms. Foundations and Trends in Communications and Information Theory, 2022

2022

[35] [35]

and Verd \'u , S

Kostina, V. and Verd \'u , S. Fixed-length lossy compression in the finite blocklength regime. IEEE Transactions on Information Theory, 58 0 (6): 0 3309--3338, 2012

2012

[36] [36]

SnapKV : LLM knows what you are looking for before generation

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV : LLM knows what you are looking for before generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024

2024

[37] [37]

KIVI : a tuning-free asymmetric 2bit quantization for KV cache, 2024

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI : a tuning-free asymmetric 2bit quantization for KV cache, 2024

2024

[38] [38]

Liu, Z. et al. Scissorhands : Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[39] [39]

and Wagner, A

Mahmood, A. and Wagner, A. B. Minimax rate-distortion. IEEE Transactions on Information Theory, 2024. arXiv:2202.04481

work page arXiv 2024

[40] [40]

Massey, J. L. Causality, feedback and directed information. In Proceedings of the International Symposium on Information Theory and its Applications (ISITA), pp.\ 303--305, 1990

1990

[41] [41]

The Llama 3 Herd of Models

Meta AI . The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

and Hutter, M

Phuong, M. and Hutter, M. Formal algorithms for transformers. arXiv preprint arXiv:2207.09238, 2022

work page arXiv 2022

[43] [43]

Piantadosi, S. T. Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin and Review, 21: 0 1112--1130, 2014

2014

[44] [44]

The intrinsic dimension of images and its impact on learning

Pope, P., Zhu, C., Abdelkader, A., Goldblum, M., and Goldstein, T. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations (ICLR), 2021

2021

[45] [45]

Universal coding, information, prediction, and estimation

Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30 0 (4): 0 629--636, 1984

1984

[46] [46]

Fast Transformer Decoding: One Write-Head is All You Need

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1911

[47] [47]

Quest : Query-aware sparsity for efficient long-context LLM inference

Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest : Query-aware sparsity for efficient long-context LLM inference. In International Conference on Machine Learning (ICML), 2024

2024

[48] [48]

Dispersion of Gaussian Sources with Memory and an Extension to Abstract Sources

Tasci, E. and Kostina, V. Dispersion of Gaussian sources with memory and an extension to abstract sources. arXiv preprint arXiv:2602.09176, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Tsybakov, A. B. Introduction to nonparametric estimation. Springer, 2009

2009

[50] [50]

N., Kaiser, L., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 2017

2017

[51] [51]

Probability with Martingales

Williams, D. Probability with Martingales. Cambridge University Press, 1991

1991

[52] [52]

Witsenhausen, H. S. Indirect rate distortion problems. IEEE Transactions on Information Theory, 26 0 (5): 0 518--521, 1980

1980

[53] [53]

Wyner, A. D. and Ziv, J. The rate-distortion function for source coding with side information at the decoder. IEEE Transactions on Information Theory, 22 0 (1): 0 1--10, 1976

1976

[54] [54]

Efficient streaming language models with attention sinks

Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. In International Conference on Learning Representations (ICLR), 2024

2024

[55] [55]

Yang, A. et al. Qwen2.5: A party of foundation models. In arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[56] [56]

Zhang, Z. et al. H2O : Heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023

2023