arxiv: 2605.22884 · v1 · pith:ER3T5QZYnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Pith reviewed 2026-05-25 05:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Tensor Cache · KV cache · sliding window attention · outer-product memory · fast weights · long-context modeling · eviction · associative recall

The pith

Tensor Cache keeps information from evicted tokens accessible by compressing them into a fixed-size outer-product memory read via matrix multiplication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The KV cache of an autoregressive Transformer grows linearly with context length; sliding-window attention caps memory but discards every evicted token. Tensor Cache adds a second-level cache that accumulates evicted key-value pairs in a fixed-size per-layer outer-product matrix and serves future queries through the identity q(k⊗v)=⟨q,k⟩v, under which multiplying a query by a stored outer product returns the value scaled by the query-key similarity. A learned scalar gate blends the exact local attention output with this compressed memory read, and per-head decay and write rates are trained end-to-end. Across systems scaling, associative recall, long-context language modeling, and memory-capacity diagnostics, the paper reports an improved memory-quality frontier over bounded-state baselines.
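The mechanism summarized above can be sketched as a single-head toy in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the class name `TensorCacheSketch`, the convex-blend form of the gate, and the fixed scalar values for `decay`, `write_rate`, and `gate` are all hypothetical (the paper learns these parameters and does not give the gate's exact form in the abstract).

```python
import math
from collections import deque


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class TensorCacheSketch:
    """Single-head toy: L1 = sliding window of exact KV pairs, L2 = matrix A."""

    def __init__(self, d_k, d_v, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.window = window
        self.l1 = deque()                             # recent (k, v) pairs, oldest first
        self.A = [[0.0] * d_v for _ in range(d_k)]    # fixed-size L2 memory
        # Assumption: fixed scalars stand in for the learned parameters.
        self.decay, self.write_rate, self.gate = decay, write_rate, gate

    def append(self, k, v):
        """Add a new pair; on overflow, write the evicted pair into A."""
        self.l1.append((k, v))
        if len(self.l1) > self.window:
            k_old, v_old = self.l1.popleft()
            for i, ki in enumerate(k_old):
                for j, vj in enumerate(v_old):
                    # A <- decay * A + write_rate * (k_old outer v_old)
                    self.A[i][j] = self.decay * self.A[i][j] + self.write_rate * ki * vj

    def read(self, q):
        """Blend softmax attention over L1 with the linear read q A from L2.

        Requires at least one pair in the window.
        """
        scale = 1.0 / math.sqrt(len(q))
        scores = [dot(q, k) * scale for k, _ in self.l1]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        d_v = len(self.A[0])
        local = [sum(wi * v[j] for wi, (_, v) in zip(w, self.l1)) / z for j in range(d_v)]
        memory = [sum(q[i] * self.A[i][j] for i in range(len(q))) for j in range(d_v)]
        g = self.gate  # assumed convex blend; the paper's gate may differ
        return [(1 - g) * lo + g * me for lo, me in zip(local, memory)]


cache = TensorCacheSketch(d_k=2, d_v=2, window=2, decay=1.0, write_rate=1.0, gate=0.5)
cache.append([1.0, 0.0], [5.0, 0.0])   # falls out of the window below
cache.append([0.0, 1.0], [0.0, 1.0])
cache.append([0.0, 1.0], [0.0, 3.0])   # evicts the first pair into A
out = cache.read([1.0, 0.0])
print(out)
```

In this toy run the query matches only the evicted key, so the first output coordinate comes entirely from the L2 read, while the second comes from the two pairs still in the window. The total state stays at `window` KV pairs plus one `d_k × d_v` matrix regardless of how many tokens are appended.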

Core claim

Only the KV pairs that leave the sliding window are routed into a fixed-size outer-product matrix A. Future queries read A with a single matrix multiplication that realizes the linear-attention identity, and a trained scalar gate fuses the two levels, so the model retains access to a larger effective context while total state size stays bounded. Separately, the paper shows that the common chunked-mean update rule introduces C²−C spurious cross-token outer products per chunk, and replaces it with a parallel weighted-sum scan that matches per-token writes within float32 epsilon.

What carries the argument

Eviction-conditioned outer-product matrix A serving as L2 cache, read by the linear-attention identity q(k⊗v)=⟨q,k⟩v and fused to L1 sliding-window attention through a learned scalar gate.
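The read identity is easy to verify numerically. The following dependency-free check uses made-up vectors (all values are illustrative assumptions) and shows both the single-pair identity and its consequence for a memory that sums several outer products: one vector-matrix product returns the similarity-weighted sum of the stored values.

```python
def outer(k, v):
    """Outer product k ⊗ v as a len(k) x len(v) nested list."""
    return [[ki * vj for vj in v] for ki in k]


def vec_mat(q, A):
    """Row vector q times matrix A."""
    return [sum(q[i] * A[i][j] for i in range(len(q))) for j in range(len(A[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


q = [1.0, 2.0, -1.0]
k1, v1 = [0.5, 0.0, 1.0], [2.0, -2.0]
k2, v2 = [1.0, 1.0, 0.0], [0.0, 4.0]

# Single pair: q (k ⊗ v) equals <q, k> v.
lhs = vec_mat(q, outer(k1, v1))
rhs = [dot(q, k1) * x for x in v1]

# Summed memory: q A equals the sum over i of <q, k_i> v_i.
A = mat_add(outer(k1, v1), outer(k2, v2))
read_A = vec_mat(q, A)
read_sum = [dot(q, k1) * a + dot(q, k2) * b for a, b in zip(v1, v2)]

print(lhs, rhs)          # both sides of the single-pair identity
print(read_A, read_sum)  # both sides for the two-pair memory
```

Note that the read weights are raw inner products rather than softmax probabilities, which is why keys that are not orthogonal to the query contribute interference to the result.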

If this is right

  • Models can process longer effective contexts at fixed memory cost.
  • Associative recall accuracy rises for facts that fall outside the active window.
  • Memory-capacity diagnostics show higher usable state without increasing the bound.
  • A parallel weighted-sum scan closes the gap between chunked training and per-token writes to within float32 epsilon.
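To make the "fixed memory cost" point concrete, here is a back-of-the-envelope state-size comparison. It is a sketch under assumptions: W = 512 is taken from the Figure 4 caption, but the head dimensions (64) and the choice of one d_k × d_v matrix per head are hypothetical, and the count ignores the handful of gate, decay, and write-rate scalars.

```python
def full_kv_floats(context_len, d_k, d_v):
    """Floats stored per head per layer by an unbounded KV cache."""
    return context_len * (d_k + d_v)


def tensor_cache_floats(window, d_k, d_v):
    """Window of exact KV pairs (L1) plus one d_k x d_v matrix (L2)."""
    return window * (d_k + d_v) + d_k * d_v


W, d_k, d_v = 512, 64, 64   # W from Figure 4; head dims are assumed
for L in (1024, 32768):
    print(L, full_kv_floats(L, d_k, d_v), tensor_cache_floats(W, d_k, d_v))
```

Under these assumptions the bounded state is 69,632 floats per head per layer at every context length, of which the L2 matrix contributes 4,096, while the full KV cache grows from 131,072 floats at L = 1K to 4,194,304 at L = 32K.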

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same eviction-fed outer-product structure could be tested as a drop-in addition to other linear or sparse attention variants.
  • Because the L2 matrix is updated only on evictions, the approach may combine naturally with retrieval-augmented generation pipelines that already maintain external stores.
  • If the per-head write rates learn to suppress uninformative tokens, the method might reduce the effective noise in the compressed memory compared with uniform decay.

Load-bearing premise

The learned scalar gate and per-head decay/write-rate parameters can be trained to combine the L2 outer-product reads with L1 attention without instability or large approximation errors.

What would settle it

Run the same long-context language-modeling benchmark with and without the L2 cache; if perplexity or downstream accuracy shows no improvement or a clear degradation once the gate and parameters are optimized, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.22884 by Antonio Torralba, Daniel Karl I. Weidele, Kabir Swain, Mauro Martino, Sijie Han.

Figure 1: Tensor Cache. Each layer keeps a local KV ring buffer (L1) and a fixed-size memory A (L2). On eviction, (k, v) is written into A via an outer-product update; queries read both paths and fuse via a learned gate.
1. Introduction: Autoregressive Transformer inference caches per-layer keys and values (KV) so the prefix is not recomputed at every step (Vaswani et al., 2017; Shazeer, 2019), but retained KV state …
Figure 2: One streaming step of Tensor Cache, shown left to right. (1) Local window full: the local KV ring buffer holds the most recent W key/value pairs as a new pair (k_t, v_t) arrives. (2) Evict oldest: the displaced pair (k_old, v_old) is popped from the buffer to make room. (3) Update Tensor Cache: the evicted pair is written into the L2 attention memory A via the outer-product (or optional delta-rule) update A ← …
Figure 4: Streaming-decode throughput vs context length on the OpenWebText long-context evaluation (130M params, W = 512; median decode tokens-per-second over four eval seeds). Full KV starts highest (∼127 tok/s at L = 1K) but degrades with context and crosses below Window KV at L = 32K (∼106 vs. ∼115 tok/s). All bounded methods remain approximately flat across the full range: Window KV (∼115 tok/s), StreamingLLM …
Figure 5: Long-context quality. Streaming NLL versus evaluation context length for each method. Solid lines (foreground) show the OpenWebText evaluation …
Original abstract

Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce Tensor Cache, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A \leftarrow \lambda A + \eta(\bar k \otimes \bar v)$ silently introduces $C^2 - C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory-quality frontier over bounded-state baselines.
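The C²−C count in the abstract follows from expanding the product of means: k̄⊗v̄ = (1/C²) Σ_i Σ_j k_i⊗v_j has C² terms, of which only the C terms with i = j pair a key with its own value. The short check below makes this concrete for a chunk of C = 4; the vectors are made-up illustrative values, and the decay and write-rate factors are left out so that only the cross-term structure is visible.

```python
from itertools import product


def outer(k, v):
    return [[ki * vj for vj in v] for ki in k]


def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_scale(A, s):
    return [[s * a for a in row] for row in A]


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Illustrative chunk of C = 4 evicted (k, v) pairs.
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = len(ks)
zero = [[0.0, 0.0], [0.0, 0.0]]

# Chunked-mean write content: mean(k) ⊗ mean(v).
chunked = outer(mean(ks), mean(vs))

# Expand into matched (i == j) and cross-token (i != j) outer products.
matched, spurious = zero, zero
cross_pairs = 0
for i, j in product(range(C), repeat=2):
    term = mat_scale(outer(ks[i], vs[j]), 1.0 / C**2)
    if i == j:
        matched = mat_add(matched, term)
    else:
        spurious = mat_add(spurious, term)
        cross_pairs += 1

print(cross_pairs)                               # number of cross-token terms
print(chunked == mat_add(matched, spurious))     # expansion reproduces the mean product
print(spurious)                                  # content that per-token writes never store
```

The `spurious` matrix is what a later query would read back as associations between a key and some other token's value, which is the artifact the paper attributes to the chunked-mean shortcut.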

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Tensor Cache, a two-level caching scheme for autoregressive Transformers: a first-level sliding-window softmax attention cache (L1) paired with a second-level fixed-size outer-product associative memory (L2) that stores KV pairs evicted from the window. Evicted pairs are written into a per-layer matrix A and read via the linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i; a learned scalar gate fuses L1 and L2 outputs, and per-head decay/write-rate parameters are trained end-to-end. The authors replace the common chunked-mean training shortcut (which introduces C²−C spurious cross terms) with an exact parallel weighted-sum scan. They claim this construction improves the memory-quality frontier over bounded-state baselines on systems scaling, associative recall, long-context LM, and capacity diagnostics.
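Why a weighted sum can stand in for per-token writes: unrolling the recurrence A_t = λ A_{t−1} + η (k_t ⊗ v_t) over a chunk of C tokens gives A_C = λ^C A_0 + η Σ_i λ^(C−i) (k_i ⊗ v_i), so each token's outer product only needs a decay weight that depends on its position. The sketch below checks this closed form against sequential writes. It assumes a single λ and η shared across the chunk and runs in Python's float64; the function names are hypothetical and the paper's actual scan kernel is not reproduced here.

```python
def sequential_writes(A0, ks, vs, decay, write_rate):
    """Apply A <- decay * A + write_rate * (k ⊗ v) one token at a time."""
    A = [row[:] for row in A0]
    for k, v in zip(ks, vs):
        A = [[decay * A[a][b] + write_rate * k[a] * v[b]
              for b in range(len(v))] for a in range(len(k))]
    return A


def weighted_sum_write(A0, ks, vs, decay, write_rate):
    """Closed form: decay^C * A0 plus a position-weighted sum of outer products."""
    C = len(ks)
    # Token i (zero-based) is decayed C - 1 - i more times after it is written.
    weights = [write_rate * decay ** (C - 1 - i) for i in range(C)]
    return [[decay ** C * A0[a][b]
             + sum(w * k[a] * v[b] for w, k, v in zip(weights, ks, vs))
             for b in range(len(A0[0]))] for a in range(len(A0))]


# Illustrative values only.
A0 = [[0.2, -0.1], [0.0, 0.3]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
vs = [[1.0, 2.0], [3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

seq = sequential_writes(A0, ks, vs, decay=0.9, write_rate=0.5)
par = weighted_sum_write(A0, ks, vs, decay=0.9, write_rate=0.5)
max_diff = max(abs(s - p) for rs, rp in zip(seq, par) for s, p in zip(rs, rp))
print(max_diff)
```

Because the weighted sum has no dependence between tokens, it can be evaluated for a whole chunk at once, and unlike the chunked-mean rule it contains no i ≠ j terms.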

Significance. If the empirical improvements hold, the work supplies a practical, bounded-memory mechanism that retains evicted information through associative outer-product storage rather than discarding it, addressing a core scaling limitation of KV caches while preserving exact local attention for recent tokens. The explicit correction of the chunked-mean artifact and the use of standard trainable components are strengths; the approach is falsifiable via the listed benchmarks and could influence efficient long-context architectures if the gains prove robust.

major comments (2)
  1. [Abstract] Abstract: the central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.
  2. [Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.
minor comments (2)
  1. [Abstract] Notation: the per-layer matrix A is introduced without an explicit equation defining its update rule or dimensions; adding a compact definition would improve clarity.
  2. [Abstract] The manuscript states that the outer-product memory and read identity are 'well-known'; a brief citation to the relevant linear-attention literature would help readers locate the foundation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and address the major comments point by point below.

  1. Referee: [Abstract] Abstract: the central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.

    Authors: The abstract is a concise summary of the work and its outcomes, following standard practice. The quantitative results supporting the memory-quality frontier claim, including direct comparisons to bounded-state baselines with controls and error bars where reported, are presented in full in the Experiments section across systems scaling, associative recall, long-context LM, and capacity diagnostics.

    Revision: no

  2. Referee: [Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.

    Authors: The manuscript shows successful end-to-end training on all reported tasks. We acknowledge that no dedicated analysis of gradient stability, error accumulation over eviction chains, or associated failure modes is included. We will add this analysis (gradient norm tracking and error diagnostics) to the revised version.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper explicitly states that the outer-product memory and linear-attention read identity are well-known external results, with the contribution being their application to eviction-conditioned L2 caching plus replacement of the chunked-mean shortcut by an exact parallel weighted-sum scan. No derivation step reduces to a fitted parameter renamed as prediction, no self-citation is load-bearing for the central claim, and the learned scalar gate and per-head rates are standard end-to-end trainable components whose performance is evaluated on external benchmarks. The derivation chain is therefore self-contained against independent identities and does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard linear-attention identity and introduces trainable per-head decay and write-rate parameters; no new physical entities are postulated.

free parameters (1)
  • per-head decay and write-rate parameters
    Trained end-to-end to control memory updates and forgetting; specific values not reported in abstract.
axioms (1)
  • [standard math] The linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i holds and enables matrix-multiplication reads from the outer-product matrix A
    Invoked to justify reading evicted information from the L2 cache.

pith-pipeline@v0.9.0 · 5817 in / 1313 out tokens · 26920 ms · 2026-05-25T05:48:06.047927+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 16 internal anchors

  1. P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000), 2000.
  2. T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
  3. M. J. Kearns.
  4. Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
  5. R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000.
  6. Suppressed for Anonymity.
  7. A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
  8. A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
  9. Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion. Efficient Memory Management for Large Language Model Serving with … 2023.
  10. Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen. 2023.
  11. Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453.
  12. Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and Chen, Tianlong and Zheng, Lianmin and Cai, Ruisi and Song, Zhao and Tian, Yuandong and R. … H₂O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv preprint arXiv:2306.14048.
  13. Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming. 2024.
  14. Goel, Raghavv and Park, Junyoung and Gagrani, Mukul and Jones, Dalton and Morse, Matthew and Langston, Harper and Lee, Mingu and Lott, Chris. 2025.
  15. Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.
  16. Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint arXiv:2310.01889.
  17. Ding, Jiayu and Ma, Shuming and Dong, Li and Zhang, Xingxing and Huang, Shaohan and Wang, Wenhui and Zheng, Nanning and Wei, Furu. 2023.
  18. Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc and Salakhutdinov, Ruslan. 2019.
  19. Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint arXiv:1911.05507.
  20. Memorizing Transformers. arXiv preprint arXiv:2203.08913.
  21. Using Fast Weights to Attend to the Recent Past. arXiv preprint arXiv:1610.06258.
  22. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
  23. Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621.
  24. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv preprint arXiv:2403.19887.
  25. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691.
  26. Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143.
  27. Linear Transformers Are Secretly Fast Weight Programmers. arXiv preprint arXiv:2102.11174.
  28. Recurrent Memory Transformer. arXiv preprint arXiv:2207.06881.
  29. Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2402.09398.
  30. OpenWebText Corpus. 2019.
  31. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508.
  32. RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.
  33. Decoupled Weight Decay Regularization. International Conference on Learning Representations.
  34. Beck, Maximilian and P. … Thirty-eighth Conference on Neural Information Processing Systems.
  35. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. Advances in Neural Information Processing Systems.
  36. Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
  37. Adaptive Switching Circuits. 1960 IRE WESCON Convention Record, 1960.
  38. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation, 1992.
  39. Attention Is All You Need. Advances in Neural Information Processing Systems.
  40. Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150.
  41. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  42. Efficient Streaming Language Models with Attention Sinks. International Conference on Learning Representations.
  43. Liu, Zichang and Desai, Aditya and Liao, Fangshuo and Wang, Weitao and Xie, Victor and Xu, Zhaozhuo and Kyrillidis, Anastasios and Shrivastava, Anshumali. Scissorhands: Exploiting the Persistence of Importance Hypothesis for …
  44. Compressive Transformers for Long-Range Sequence Modelling. International Conference on Learning Representations.
  45. Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation.
  46. Using Fast Weights to Attend to the Recent Past. Advances in Neural Information Processing Systems.
  47. Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran… Transformers are … International Conference on Machine Learning.
  48. Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning.
  49. Extending Context Window of Large Language Models via Positional Interpolation. arXiv preprint arXiv:2306.15595.
  50. Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico.