Tensor Cache: Eviction-conditioned Associative Memory for Transformers
Pith reviewed 2026-05-25 05:48 UTC · model grok-4.3
The pith
Tensor Cache keeps information from evicted tokens accessible by compressing them into a fixed-size outer-product memory read via matrix multiplication.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Only the KV pairs that leave the sliding window are routed into a fixed-size outer-product matrix A. Future queries read A with a single matrix multiplication that realizes the linear-attention identity, and a trained scalar gate fuses the two levels, so the model retains access to a larger effective context while total state size stays bounded. Separately, replacing the common chunked-mean update rule with a parallel weighted-sum scan removes the spurious cross-token products that the shortcut introduces.
What carries the argument
Eviction-conditioned outer-product matrix A serving as L2 cache, read by the linear-attention identity q(k⊗v)=⟨q,k⟩v and fused to L1 sliding-window attention through a learned scalar gate.
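The read identity can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the sizes, the random data, and all variable names are assumptions chosen for illustration. It verifies q(k⊗v)=⟨q,k⟩v for one pair and then shows that a single multiplication against the summed matrix returns the same result as unnormalised linear attention over every stored pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 5  # hypothetical key size, value size, number of stored pairs

keys = rng.standard_normal((n, d_k))
values = rng.standard_normal((n, d_v))
q = rng.standard_normal(d_k)

# Single pair: q (k ⊗ v) equals <q, k> v.
single_lhs = q @ np.outer(keys[0], values[0])
single_rhs = (q @ keys[0]) * values[0]

# Memory holding all pairs: A = sum_i k_i ⊗ v_i, shape (d_k, d_v).
A = keys.T @ values

# One matrix multiplication reads every stored pair at once.
read_from_memory = q @ A
# Explicit unnormalised linear attention over the same pairs.
read_explicit = (keys @ q) @ values

identity_holds = np.allclose(single_lhs, single_rhs)
reads_match = np.allclose(read_from_memory, read_explicit)
print(identity_holds, reads_match, A.shape)
```

The shape of A depends only on d_k and d_v, not on the number of pairs written, which is why the state stays bounded as more tokens are evicted.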
If this is right
- Models can process longer effective contexts at fixed memory cost.
- Associative recall accuracy rises for facts that fall outside the active window.
- Memory-capacity diagnostics show higher usable state without increasing the bound.
- Replacing the chunked-mean shortcut with the parallel weighted-sum scan matches per-token writes to within float32 epsilon; the gate and per-head rates are still trained end-to-end.
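The fixed-memory-cost point in the list above can be made concrete with a small element count. This is a rough sketch under stated assumptions: the head dimensions and window length are hypothetical, the function names are invented for illustration, and one d_k × d_v matrix per head is assumed (the abstract only says the matrix is per-layer).

```python
def full_kv_elements(context_len: int, d_k: int, d_v: int) -> int:
    """Per-head elements stored by an unbounded KV cache."""
    return context_len * (d_k + d_v)


def two_level_elements(context_len: int, window: int, d_k: int, d_v: int) -> int:
    """Per-head elements for a sliding window (L1) plus one d_k x d_v matrix (L2)."""
    l1 = min(context_len, window) * (d_k + d_v)
    l2 = d_k * d_v
    return l1 + l2


d_k = d_v = 64   # hypothetical head dimensions
window = 1024    # hypothetical window length

for n in (1_024, 8_192, 65_536):
    print(n, full_kv_elements(n, d_k, d_v), two_level_elements(n, window, d_k, d_v))
```

Under these assumptions the full cache grows from 131,072 to 8,388,608 elements per head as the context goes from 1,024 to 65,536 tokens, while the two-level state stays at 135,168.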
Where Pith is reading between the lines
- The same eviction-fed outer-product structure could be tested as a drop-in addition to other linear or sparse attention variants.
- Because the L2 matrix is updated only on evictions, the approach may combine naturally with retrieval-augmented generation pipelines that already maintain external stores.
- If the per-head write rates learn to suppress uninformative tokens, the method might reduce the effective noise in the compressed memory compared with uniform decay.
Load-bearing premise
The learned scalar gate and per-head decay/write-rate parameters can be trained to combine the L2 outer-product reads with L1 attention without instability or large approximation errors.
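One narrow piece of this premise can be checked directly: whether the L2 state itself stays bounded. The sketch below assumes a per-token write of the form A ← λA + η(k⊗v) with a constant decay λ in [0, 1), which mirrors the update shape quoted in the abstract but is not confirmed as the paper's exact rule. Under that assumption, if every write satisfies ‖k‖‖v‖ ≤ B, a geometric-series argument gives ‖A‖ ≤ ηB/(1−λ) when A starts at zero. All sizes and rates here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, steps = 8, 8, 500     # hypothetical sizes
lam, eta = 0.95, 0.1            # hypothetical decay and write rate

A = np.zeros((d_k, d_v))
B = 0.0                         # largest ||k|| * ||v|| seen across all writes
norms = []
for _ in range(steps):
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    B = max(B, np.linalg.norm(k) * np.linalg.norm(v))
    A = lam * A + eta * np.outer(k, v)
    norms.append(np.linalg.norm(A))   # Frobenius norm of the state

# Geometric-series bound: eta * B * (1 + lam + lam^2 + ...) = eta * B / (1 - lam).
bound = eta * B / (1.0 - lam)
print(max(norms) <= bound)
```

This only addresses forward-pass boundedness of the state. It says nothing about gradient stability or about how the gate behaves during training, which are the parts of the premise the referee report flags as unanalysed.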
What would settle it
Run the same long-context language-modeling benchmark with and without the L2 cache; if perplexity or downstream accuracy shows no improvement or a clear degradation once the gate and parameters are optimized, the central claim is falsified.
Original abstract
Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce Tensor Cache, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i) = \langle q_t, k_i \rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A \leftarrow \lambda A + \eta(\bar{k} \otimes \bar{v})$ silently introduces $C^2 - C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory-quality frontier over bounded-state baselines.
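The cross-term count in the abstract can be reproduced in a few lines. The sketch below is illustrative only: the chunk length, sizes, decay, and write rate are hypothetical, and the per-token rule A ← λA + η(k_t ⊗ v_t) is assumed by analogy with the chunked rule. It compares per-token writes, the closed-form decay-weighted sum obtained by unrolling that recurrence, and the chunked-mean shortcut. The closed form here is a plain weighted sum, not the paper's parallel scan implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_k, d_v = 4, 6, 3          # hypothetical chunk length and head sizes
lam, eta = 0.9, 0.5            # hypothetical decay and write rate

K = rng.standard_normal((C, d_k))
V = rng.standard_normal((C, d_v))
A0 = rng.standard_normal((d_k, d_v))

# Reference: one write per token, A <- lam * A + eta * (k_t ⊗ v_t).
A_seq = A0.copy()
for k_t, v_t in zip(K, V):
    A_seq = lam * A_seq + eta * np.outer(k_t, v_t)

# Closed form of the same recurrence: a decay-weighted sum over the chunk.
w = lam ** np.arange(C - 1, -1, -1)        # w_t = lam^(C-1-t)
A_weighted = lam**C * A0 + eta * (K * w[:, None]).T @ V

# Chunked-mean shortcut: a single write of mean(k) ⊗ mean(v).
A_mean = lam * A0 + eta * np.outer(K.mean(axis=0), V.mean(axis=0))

# mean(k) ⊗ mean(v) expands into C*C pair products; only C of them have i == j.
pair_terms = [(i, j) for i in range(C) for j in range(C)]
cross_terms = [p for p in pair_terms if p[0] != p[1]]
expanded = sum(np.outer(K[i], V[j]) for i, j in pair_terms) / C**2

print(np.allclose(A_seq, A_weighted))   # weighted sum reproduces per-token writes
print(np.allclose(A_seq, A_mean))       # chunked mean does not
print(len(cross_terms), C * C - C)      # number of mismatched (i, j) products
```

The mismatched products pair a key from one token with the value of another, so a later query matching k_i would also retrieve fragments of v_j for j ≠ i.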
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Tensor Cache, a two-level caching scheme for autoregressive Transformers: a first-level sliding-window softmax attention cache (L1) paired with a second-level fixed-size outer-product associative memory (L2) that stores KV pairs evicted from the window. Evicted pairs are written into a per-layer matrix A and read via the linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i; a learned scalar gate fuses L1 and L2 outputs, and per-head decay/write-rate parameters are trained end-to-end. The authors replace the common chunked-mean training shortcut (which introduces C² − C spurious cross terms) with an exact parallel weighted-sum scan. They claim this construction improves the memory-quality frontier over bounded-state baselines on systems scaling, associative recall, long-context LM, and capacity diagnostics.
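The data flow described in this summary can be sketched as a toy single-head cache. Everything below is an illustration under assumptions, not the paper's implementation: the class name, the fixed scalar gate applied as a convex combination, the constant decay and write rate, and the scaled dot-product softmax over the window are all choices made here for the example.

```python
from collections import deque

import numpy as np


class TwoLevelCacheSketch:
    """Toy single-head cache: exact window (L1) plus outer-product memory (L2)."""

    def __init__(self, window, d_k, d_v, decay=0.99, write_rate=1.0, gate=0.5):
        self.window = window
        self.l1 = deque()                  # recent (k, v) pairs, newest last
        self.A = np.zeros((d_k, d_v))      # fixed-size L2 state
        self.decay, self.write_rate, self.gate = decay, write_rate, gate

    def append(self, k, v):
        self.l1.append((k, v))
        if len(self.l1) > self.window:
            # Only the pair leaving the window is written into L2.
            k_old, v_old = self.l1.popleft()
            self.A = self.decay * self.A + self.write_rate * np.outer(k_old, v_old)

    def read(self, q):
        # Assumes the window holds at least one token.
        K = np.stack([k for k, _ in self.l1])
        V = np.stack([v for _, v in self.l1])
        scores = K @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        l1_out = weights @ V               # softmax attention over the window
        l2_out = q @ self.A                # single matmul read of evicted pairs
        # Assumed gate form: convex combination with a fixed scalar.
        return (1.0 - self.gate) * l1_out + self.gate * l2_out


rng = np.random.default_rng(0)
cache = TwoLevelCacheSketch(window=4, d_k=8, d_v=8)
for _ in range(10):
    cache.append(rng.standard_normal(8), rng.standard_normal(8))
out = cache.read(rng.standard_normal(8))
print(len(cache.l1), cache.A.shape, out.shape)
```

After ten appends with a window of four, six pairs have been evicted into A while the window still holds the four most recent pairs exactly. In the paper the gate, decay, and write rate are learned; here they are constants to keep the sketch short.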
Significance. If the empirical improvements hold, the work supplies a practical, bounded-memory mechanism that retains evicted information through associative outer-product storage rather than discarding it, addressing a core scaling limitation of KV caches while preserving exact local attention for recent tokens. The explicit correction of the chunked-mean artifact and the use of standard trainable components are strengths; the approach is falsifiable via the listed benchmarks and could influence efficient long-context architectures if the gains prove robust.
major comments (2)
- [Abstract] The central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.
- [Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.
minor comments (2)
- [Abstract] Notation: the per-layer matrix A is introduced without an explicit equation defining its update rule or dimensions; adding a compact definition would improve clarity.
- [Abstract] The manuscript states that the outer-product memory and read identity are 'well-known'; a brief citation to the relevant linear-attention literature would help readers locate the foundation.
Simulated Author's Rebuttal
We thank the referee for the constructive review and address the major comments point by point below.
- Referee: [Abstract] The central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.
  Authors: The abstract is a concise summary of the work and its outcomes, following standard practice. The quantitative results supporting the memory-quality frontier claim (including direct comparisons to bounded-state baselines, with controls and error bars where reported) are presented in full in the Experiments section across systems scaling, associative recall, long-context LM, and capacity diagnostics.
  Revision: no
- Referee: [Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.
  Authors: The manuscript shows successful end-to-end training on all reported tasks. We acknowledge that no dedicated analysis of gradient stability, error accumulation over eviction chains, or associated failure modes is included. We will add this analysis (gradient norm tracking and error diagnostics) to the revised version.
  Revision: yes
Circularity Check
No significant circularity
The paper explicitly states that the outer-product memory and linear-attention read identity are well-known external results, with the contribution being their application to eviction-conditioned L2 caching plus replacement of the chunked-mean shortcut by an exact parallel weighted-sum scan. No derivation step reduces to a fitted parameter renamed as prediction, no self-citation is load-bearing for the central claim, and the learned scalar gate and per-head rates are standard end-to-end trainable components whose performance is evaluated on external benchmarks. The derivation chain is therefore self-contained against independent identities and does not collapse by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-head decay and write-rate parameters
axioms (1)
- [standard math] The linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i holds and enables matrix-multiplication reads from the outer-product matrix A
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: exploiting the linear-attention identity q_t(k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i ... A ← λA + η(k_w ⊗ v_w) ... parallel weighted-sum scan A ← λ^C A + η Σ w_t (k_t ⊗ v_t)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: per-head decay and write-rate parameters are trained end-to-end
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.