pith. machine review for the scientific record.

arxiv: 2604.17935 · v1 · submitted 2026-04-20 · 💻 cs.LG · cs.AI · cs.CC

Recognition: unknown

How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 05:14 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CC
keywords KV cache · pointer chasing · Transformer depth · cache compression · multi-hop reasoning · lower bounds · locality-respecting · inference efficiency

The pith

A KV cache of size s forces Transformer depth to scale roughly as ⌈k/s⌉ · log₂ n/(Hmp) for k-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper models multi-step reasoning as k-hop pointer chasing over n tokens and asks how small a shared KV cache of size s can be before the required model depth L grows. It proves an unconditional lower bound of Ω(max(⌈k/s⌉, log n/(Hmp))), an upper bound via windowed pointer doubling, and a bandwidth barrier showing that standard distinguishability arguments cannot exceed ⌈k/s⌉ once per-layer information reaches log n. The central conjecture is a stronger product lower bound Ω(⌈k/s⌉ · ⌈log₂ n/(Hmp)⌉) that would follow if a probabilistic step on cache traces and pointer chains holds. If correct, aggressive KV compression multiplies the layers needed to preserve reasoning ability.
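The underlying task is easy to state concretely. A minimal sketch of k-hop pointer chasing (the function name and the toy instance are illustrative, not taken from the paper):

```python
import random

def k_hop(pointers, start, k):
    """Follow the pointer array k times from `start`: the answer to a
    k-hop pointer chasing instance is the k-th iterate of the map."""
    cur = start
    for _ in range(k):
        cur = pointers[cur]
    return cur

# Toy instance: a random permutation over n = 16 tokens, k = 8 hops.
rng = random.Random(0)
n, k = 16, 8
perm = list(range(n))
rng.shuffle(perm)
answer = k_hop(perm, start=0, k=k)
```

A Transformer with an unbounded cache can resolve one hop per layer (the serial strategy, L = k); the paper's question is what happens when only s key-value pairs survive eviction.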

Core claim

Any Transformer solving k-hop pointer chasing with n ≥ 4k and cache size s ≤ √n/4 requires depth L = Ω(⌈k/s⌉ · ⌈log₂ n/(Hmp)⌉) under the conjectured product bound, with an unconditional lower bound of L = Ω(max(⌈k/s⌉, log n/(Hmp))) and a matching upper bound of L = O(min(k, ⌈k/s⌉ log s) · log n/(mp)) via windowed pointer doubling; adaptive locality-respecting caches achieve error s/n independent of the number of doubling stages, while oblivious caches suffer exponentially amplified error.
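Plugging numbers into the stated bounds makes the tradeoff concrete. A sketch that treats the hidden Ω/O constants as 1 (an assumption for illustration, not the paper's calculation):

```python
import math

def depth_bounds(n, k, s, H, m, p):
    """Evaluate the claimed depth bounds at one parameter setting,
    with all hidden constants in the Omega/O notation set to 1.

    Returns (unconditional max lower bound, conjectured product lower
    bound, windowed-pointer-doubling upper bound)."""
    info = math.log2(n) / (H * m * p)                 # per-layer term log2(n)/(Hmp)
    lower_max = max(math.ceil(k / s), info)           # Ω(max(⌈k/s⌉, log n/(Hmp)))
    lower_prod = math.ceil(k / s) * math.ceil(info)   # conjectured Ω(⌈k/s⌉·⌈log₂ n/(Hmp)⌉)
    log_s = max(1, math.ceil(math.log2(s)))           # guard against s = 1
    upper = min(k, math.ceil(k / s) * log_s) * math.log2(n) / (m * p)
    return lower_max, lower_prod, upper

# Example: n = 65536 tokens, k = 32 hops, cache s = 8, H = m = 1, p = 16 bits.
lm, lp, ub = depth_bounds(n=65536, k=32, s=8, H=1, m=1, p=16)
```

At this setting both lower bounds give depth 4 while the upper bound gives 12, exhibiting the O(log s) gap the paper leaves open.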

What carries the argument

The k-hop pointer chasing task on n tokens, under a shared KV cache of size s with a locality-respecting cache controller.

If this is right

  • When Hmp ≥ log₂ n, no lower bound provable by per-window counting can exceed ⌈k/s⌉.
  • Adaptive caches achieve error exactly s/n, independent of the number T of doubling stages, while random (oblivious) caches give error exponential in T.
  • The upper-bound construction shows that windowed pointer doubling achieves near-optimal depth, up to log s factors.
  • Closing the conjecture requires only upgrading the max lower bound to a product via one probabilistic step.
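The two error-scaling claims can be evaluated directly from the abstract's formulas; the parameter choices below are illustrative, not from the paper:

```python
import math

def oblivious_bound(n, s, T):
    """Abstract's stated bound for an oblivious (random) cache over
    T = ceil(log2 k) doubling stages: (s/(n-T))^T + 2*T^3/n."""
    return (s / (n - T)) ** T + 2 * T ** 3 / n

def adaptive_error(n, s):
    """Adaptive locality-respecting caches: exactly s/n, for any T."""
    return s / n

# At n = 10^6, s = 10^3, k = 256 (so T = 8), the T-dependent first
# summand of the oblivious bound is already astronomically small.
T = math.ceil(math.log2(256))
obl = oblivious_bound(n=10**6, s=10**3, T=T)
ada = adaptive_error(n=10**6, s=10**3)
```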

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result implies that for fixed cache size, longer reasoning chains require proportionally deeper models unless cache eviction can be made more adaptive.
  • The bandwidth barrier suggests that simply adding more heads or precision yields diminishing returns for proving stronger depth lower bounds beyond log n bits.
  • Heavy-hitter style eviction works well in practice because its adaptive error stays linear in s/n rather than growing with stages.
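The last bullet refers to heavy-hitter eviction in the H2O style. A minimal, hypothetical version of such a policy, keeping the s positions with the largest cumulative attention mass (not the paper's formal controller):

```python
import heapq

def heavy_hitter_evict(cache, scores, s):
    """Keep the s cache entries with the largest cumulative attention
    scores; evict the rest. `cache` maps position -> (key, value) and
    `scores` maps position -> accumulated attention weight."""
    keep = heapq.nlargest(s, cache, key=lambda pos: scores[pos])
    return {pos: cache[pos] for pos in keep}

# Four cached positions, budget s = 2: positions 1 and 2 survive.
cache = {0: ("k0", "v0"), 1: ("k1", "v1"), 2: ("k2", "v2"), 3: ("k3", "v3")}
scores = {0: 0.1, 1: 0.9, 2: 0.5, 3: 0.2}
kept = heavy_hitter_evict(cache, scores, s=2)
```

Because the scores track actual access patterns, this policy is locality-respecting in the paper's sense, which is what places it on the s/n side of the separation.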

Load-bearing premise

The cache controller respects locality of access, and the joint distribution of cache trace and pointer chain satisfies the probabilistic independence needed to multiply the two factors in the lower bound.

What would settle it

A concrete Transformer of depth much smaller than ⌈k/s⌉ · log n/(Hmp) that solves k-hop pointer chasing with high probability under a locality-respecting cache of size s would falsify the conjectured product bound.

Figures

Figures reproduced from arXiv: 2604.17935 by Xiao Wang.

Figure 1
Figure 1. Experimental validation on PC_{16,k} (5000 random permutations per configuration, no training). (a) Hardcoded Transformer with oracle cache: sharp 0% → 100% transition at L = k. (b) Windowed PD vs. serial for k = 8: cache reduces required depth from L = 8 to L = 3–4. (c) Random (oblivious) cache: exponential error amplification with k, consistent with Theorem 3.8's (s/(n − T))^T decay rate (cf. Remark 3.12 for…) view at source ↗
Figure 2
Figure 2. Depth–cache tradeoff for k ∈ {4, 8, 16}. Gray dashed: serial strategy (L = k, cache-independent). Blue: windowed pointer doubling upper bound. Red: lower bound ⌈k/s⌉. Shaded blue region: the O(log s) gap between lower and upper bounds. view at source ↗
read the original abstract

The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
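The upper bound's main primitive is pointer doubling: squaring the pointer map halves the remaining hop count, so k hops collapse into ⌈log₂ k⌉ rounds. A cache-free sketch of that skeleton (the paper's windowed variant additionally restricts each round to the s cached positions; that restriction is omitted here):

```python
def k_hop_by_doubling(pointers, start, k):
    """Resolve k hops (k a power of two, for simplicity) in log2(k)
    squaring rounds instead of k serial lookups."""
    jump = list(pointers)        # jump[i] = landing spot after 2^t hops
    t = k
    while t > 1:
        jump = [jump[jump[i]] for i in range(len(jump))]  # square the map
        t //= 2
    return jump[start]

# Cyclic chain over n = 16: 8 hops from 0 land on 8, in 3 rounds not 8.
chain = [(i + 1) % 16 for i in range(16)]
result = k_hop_by_doubling(chain, 0, 8)
```

Each squaring round reads the full map, which is what a size-s cache cannot do; the windowing in the paper is how the construction pays the extra ⌈k/s⌉ · log s factor.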

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes depth-cache tradeoffs for KV-compressed Transformers on k-hop pointer chasing over n tokens, under a locality-respecting cache of size s, attention dimension m, H heads, and p-bit precision. It conjectures a product lower bound L = Ω(⌈k/s⌉ ⋅ ⌈log₂ n/(Hmp)⌉) for n ≥ 4k and s ≤ √n/4 (isolating the gap to a probabilistic argument on the joint distribution of cache traces and pointer chains), proves an unconditional upper bound L = O(min(k, ⌈k/s⌉ log s) ⋅ log n/(mp)) via windowed pointer doubling, proves a max lower bound L = Ω(max(⌈k/s⌉, log n/(Hmp))), establishes a bandwidth barrier when Hmp ≳ log n, and shows that adaptive locality-respecting caches achieve Pr[error] = s/n (independent of doubling stages) while oblivious caches suffer exponential degradation.

Significance. The unconditional upper bound, max lower bound, and adaptive-vs-oblivious error separation are solid contributions that already explain empirical advantages of heavy-hitter eviction for multi-hop reasoning and give concrete guidance on cache sizing. If the conjectured product bound is closed, the work would deliver tight, parameter-free tradeoffs linking cache compression to reasoning depth, with direct implications for efficient long-context inference.

major comments (2)
  1. Abstract and §1 (main results): the conjectured product lower bound relies on an unproven probabilistic step linking the locality-respecting cache trace distribution to the random pointer chain; until this gap is closed, the claim should be revised to present only the proven Ω(max(⌈k/s⌉, log n/(Hmp))) bound as the unconditional lower bound, with the product form stated strictly as a conjecture.
  2. Bandwidth barrier (result 2): the argument that per-window distinguishability (including reachability and bandwidth) cannot exceed ⌈k/s⌉ once Hmp ≥ log₂ n is load-bearing for the claim that stronger bounds require lifting external communication-complexity results; a self-contained reduction or explicit counter-example construction inside the Cache-Transformer model would make this section self-sufficient.
minor comments (1)
  1. Notation and definitions: H, m, p, and the precise definition of the locality-respecting controller should be restated once in the introduction with forward references to the formal model, to improve readability for readers outside communication complexity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. We appreciate the positive assessment of the unconditional upper bound, max lower bound, and adaptive-oblivious separation. We agree that a major revision is appropriate and will clarify the presentation of the lower bounds while strengthening the bandwidth barrier discussion as outlined below.

read point-by-point responses
  1. Referee: Abstract and §1 (main results): the conjectured product lower bound relies on an unproven probabilistic step linking the locality-respecting cache trace distribution to the random pointer chain; until this gap is closed, the claim should be revised to present only the proven Ω(max(⌈k/s⌉, log n/(Hmp))) bound as the unconditional lower bound, with the product form stated strictly as a conjecture.

    Authors: We agree. The product lower bound remains conjectural precisely because of the identified probabilistic gap on the joint distribution of cache traces and pointer chains. In the revised manuscript we will update the abstract and §1 to state the unconditional lower bound as L = Ω(max(⌈k/s⌉, log n/(Hmp))), while presenting the product form L = Ω(⌈k/s⌉ ⋅ ⌈log₂ n/(Hmp)⌉) explicitly as a conjecture and isolating the remaining probabilistic step. This change makes the proven results unambiguous without altering any technical claims. revision: yes

  2. Referee: Bandwidth barrier (result 2): the argument that per-window distinguishability (including reachability and bandwidth) cannot exceed ⌈k/s⌉ once Hmp ≥ log₂ n is load-bearing for the claim that stronger bounds require lifting external communication-complexity results; a self-contained reduction or explicit counter-example construction inside the Cache-Transformer model would make this section self-sufficient.

    Authors: We thank the referee for this suggestion. The barrier follows because, when Hmp ≥ log₂ n, each window's attention can distinguish at most O(1) bits per token on average given the precision limit, so information flow across the ⌈k/s⌉ windows is capped regardless of depth. To make the section self-contained we will add an explicit counter-example: a single-window construction in which the cache state is restricted to s positions and the number of distinguishable pointer configurations is bounded by 2^{O(Hmp)}, which is O(1) once Hmp ≥ log n. This shows directly that no per-window counting argument (reachability or bandwidth) can exceed ⌈k/s⌉, while still noting that tighter product bounds require external communication-complexity machinery. revision: yes
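The counting in this response can be made concrete. A sketch of the barrier arithmetic as described, with the constant in 2^{O(Hmp)} taken as 1 (an assumption for illustration):

```python
import math

def per_window_counting_bound(n, k, s, H, m, p):
    """Best depth lower bound a per-window distinguishability count can
    certify under the barrier argument: each window distinguishes at
    most 2^(Hmp) configurations, so once Hmp >= log2(n) a single window
    can name any of the n pointers and the bound caps at ceil(k/s)."""
    windows = math.ceil(k / s)
    bits_needed = math.log2(n)       # bits to name one of n pointers
    per_window = H * m * p           # bits one window can distinguish
    if per_window >= bits_needed:
        return windows               # barrier regime: cap at ⌈k/s⌉
    return windows * math.ceil(bits_needed / per_window)
```

For n = 256 and k/s = 4, raising precision from p = 1 to p = 2 (with H = 1, m = 4) crosses the Hmp = log₂ n threshold and the certifiable bound drops from 8 to the capped 4, which is the diminishing-returns behavior the barrier describes.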

Circularity Check

0 steps flagged

No circularity: bounds derived from external communication complexity and probability without self-referential reductions

full rationale

The paper's unconditional lower bound L = Ω(max(⌈k/s⌉, log n/(Hmp))) follows from per-window distinguishability counting, the upper bound from explicit windowed pointer doubling construction, and the adaptive/oblivious error scaling from direct probabilistic calculation on cache traces. The conjectured product lower bound is isolated to one unproven probabilistic step on joint distributions, which the paper flags as a gap rather than closing by definition or self-citation. All steps rely on standard external results (communication complexity for pointer chasing) and do not reduce any claimed quantity to a fitted input or prior self-citation by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard mathematical assumptions from communication complexity and one domain-specific assumption about cache controllers; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Cache controller is locality-respecting
    Stated to be satisfied by all standard KV-compression methods; invoked for both lower and upper bounds.
  • standard math Standard communication-complexity lower bounds for pointer chasing apply to per-window distinguishability
    Used to derive the bandwidth barrier and max-bound.

pith-pipeline@v0.9.0 · 5742 in / 1344 out tokens · 38114 ms · 2026-05-10T05:14:15.423098+00:00 · methodology

discussion (0)

