Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Antonio Torralba; Daniel Karl I. Weidele; Kabir Swain; Mauro Martino; Sijie Han

arxiv: 2605.22884 · v1 · pith:ER3T5QZYnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-25 05:48 UTC · model grok-4.3


keywords: Tensor Cache · KV cache · sliding window attention · outer-product memory · fast weights · long-context modeling · eviction · associative recall


The pith

Tensor Cache keeps information from evicted tokens accessible by compressing them into a fixed-size outer-product memory read via matrix multiplication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive Transformers face a linear growth in KV cache size with longer contexts, while sliding-window attention caps memory but loses all evicted tokens. Tensor Cache adds a second-level cache that stores evicted key-value pairs as a per-layer outer-product matrix and retrieves them for future queries using the identity that turns an outer product into a scaled value vector. A learned scalar gate blends the exact local attention output with this compressed memory read, and per-head decay and write rates are trained jointly. Experiments across scaling, associative recall, long-context modeling, and capacity tests show the approach raises the achievable quality for any given memory budget over pure bounded baselines.

Core claim

Only the KV pairs that leave the sliding window are routed into a fixed-size outer-product matrix A. Future queries read A with a single matrix multiplication that realizes the linear-attention identity, and a trained scalar gate fuses the two levels. The model thereby retains access to a larger effective context while total state size stays bounded. The paper also identifies the spurious cross-token products introduced by the common chunked-mean update rule and removes them with a parallel weighted-sum scan that matches per-token writes.

What carries the argument

Eviction-conditioned outer-product matrix A serving as L2 cache, read by the linear-attention identity q(k⊗v)=⟨q,k⟩v and fused to L1 sliding-window attention through a learned scalar gate.
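The read identity can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: the dimensions `d_k`, `d_v`, the pair count `n`, and the random data are illustrative assumptions, and decay and write rates are left out so that A is a plain sum of outer products.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 8, 4, 5  # hypothetical key dim, value dim, number of evicted pairs

keys = rng.standard_normal((n, d_k))
values = rng.standard_normal((n, d_v))
q = rng.standard_normal(d_k)

# Single-pair identity: q (k ⊗ v) = <q, k> v
single_lhs = q @ np.outer(keys[0], values[0])
single_rhs = (q @ keys[0]) * values[0]

# Memory holding all n pairs as a sum of outer products (no decay in this sketch).
A = np.zeros((d_k, d_v))
for k, v in zip(keys, values):
    A += np.outer(k, v)

# One matrix multiplication reads every stored pair at once.
read = q @ A
expected = (keys @ q) @ values  # sum_i <q, k_i> v_i
```

Because the read is linear in A, the cost of `q @ A` depends only on `d_k` and `d_v`, not on how many pairs were written.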

If this is right

Models can process longer effective contexts at fixed memory cost.
Associative recall accuracy rises for facts that fall outside the active window.
Memory-capacity diagnostics show higher usable state without increasing the bound.
The parallel weighted-sum scan matches per-token writes to within float32 epsilon, so the chunked-mean artifact no longer affects training.
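To make the fixed-memory point concrete, here is a rough state-size count. It is a sketch under assumptions: the layer count, head count, and head dimension are hypothetical, and A is assumed to be one `d_head × d_head` matrix per head, which this page does not confirm. Only W = 512 comes from the Figure 4 caption.

```python
def full_kv_floats(context_len, layers, heads, d_head):
    # Keys and values for every token in the context.
    return 2 * context_len * layers * heads * d_head

def tensor_cache_floats(window, layers, heads, d_head):
    # Window KV (L1) plus one d_head x d_head memory per head (assumed shape of A).
    return layers * heads * (2 * window * d_head + d_head * d_head)

# Hypothetical model shape; W = 512 as in the Figure 4 caption.
layers, heads, d_head, window = 12, 12, 64, 512

full_at_32k = full_kv_floats(32768, layers, heads, d_head)
bounded = tensor_cache_floats(window, layers, heads, d_head)
ratio = full_at_32k / bounded
```

The bounded count has no dependence on context length, while the full KV count grows linearly with it.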

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same eviction-fed outer-product structure could be tested as a drop-in addition to other linear or sparse attention variants.
Because the L2 matrix is updated only on evictions, the approach may combine naturally with retrieval-augmented generation pipelines that already maintain external stores.
If the per-head write rates learn to suppress uninformative tokens, the method might reduce the effective noise in the compressed memory compared with uniform decay.

Load-bearing premise

The learned scalar gate and per-head decay/write-rate parameters can be trained to combine the L2 outer-product reads with L1 attention without instability or large approximation errors.
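One way to see why bounded rates help stability: if the decay λ stays below 1 and each write has bounded norm, the memory norm cannot exceed η/(1 − λ). The sketch below assumes a sigmoid parametrization for the gate, decay, and write rate and unit-norm keys and values; none of these choices is stated on this page, and the logit values are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_heads, d_k, d_v, steps = 4, 8, 8, 2000  # hypothetical sizes

# Hypothetical unconstrained per-head parameters; sigmoid keeps each rate in (0, 1).
decay = sigmoid(np.array([2.0, 3.0, 4.0, 5.0]))
write = sigmoid(np.array([0.0, 0.0, 1.0, 1.0]))
gate = sigmoid(0.0)  # scalar gate, 0.5 for a zero logit

A = np.zeros((n_heads, d_k, d_v))
for _ in range(steps):
    k = rng.standard_normal((n_heads, d_k))
    v = rng.standard_normal((n_heads, d_v))
    k /= np.linalg.norm(k, axis=1, keepdims=True)  # unit-norm keys (assumption)
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # unit-norm values (assumption)
    outer = np.einsum("hi,hj->hij", k, v)          # per-head k ⊗ v, Frobenius norm 1
    A = decay[:, None, None] * A + write[:, None, None] * outer

norms = np.linalg.norm(A, axis=(1, 2))  # Frobenius norm per head
bounds = write / (1.0 - decay)          # geometric-series bound eta / (1 - lambda)
```

The bound follows from the triangle inequality applied to each update; it says nothing about gradient behavior, which is the part the referee report asks the authors to analyze.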

What would settle it

Run the same long-context language-modeling benchmark with and without the L2 cache; if perplexity or downstream accuracy shows no improvement or a clear degradation once the gate and parameters are optimized, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.22884 by Antonio Torralba, Daniel Karl I. Weidele, Kabir Swain, Mauro Martino, Sijie Han.

**Figure 1.** Tensor Cache. Each layer keeps a local KV ring buffer (L1) and a fixed-size memory A (L2). On eviction, (k, v) is written into A via an outer-product update; queries read both paths and fuse via a learned gate.

**Figure 2.** One streaming step of Tensor Cache, shown left to right. (1) Local window full: the local KV ring buffer holds the most recent W key/value pairs as a new pair (k_t, v_t) arrives. (2) Evict oldest: the displaced pair (k_old, v_old) is popped from the buffer to make room. (3) Update Tensor Cache: the evicted pair is written into the L2 attention memory A via the outer-product (or optional delta-rule) update A ← …

**Figure 4.** Streaming-decode throughput vs. context length on the OpenWebText long-context evaluation (130M params, W = 512; median decode tokens-per-second over four eval seeds). Full KV starts highest (∼127 tok/s at L = 1K) but degrades with context and crosses below Window KV at L = 32K (∼106 vs. ∼115 tok/s). All bounded methods remain approximately flat across the full range: Window KV (∼115 tok/s), StreamingLLM …

**Figure 5.** Long-context quality. Streaming NLL versus evaluation context length for each method. Solid lines (foreground) show the OpenWebText evaluation (…
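The streaming step described in the Figure 2 caption can be sketched for a single head as follows. This is an illustration under assumptions, not the authors' implementation: the class name `TensorCacheHead` is hypothetical, the gate is modeled as a convex blend, decay is applied once per eviction, fixed constants stand in for the learned gate, decay, and write rate, and the optional delta-rule update is omitted.

```python
from collections import deque
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TensorCacheHead:
    """Single-head sketch: L1 ring buffer plus L2 outer-product memory."""

    def __init__(self, d_k, d_v, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.buf = deque()                # L1: most recent `window` (k, v) pairs
        self.window = window
        self.A = np.zeros((d_k, d_v))     # L2: fixed-size memory
        self.decay, self.write_rate, self.gate = decay, write_rate, gate
        self.evictions = 0

    def step(self, q, k, v):
        # (1)-(2) If the window is full, evict the oldest pair.
        if len(self.buf) == self.window:
            k_old, v_old = self.buf.popleft()
            # (3) Write only the evicted pair into A via an outer-product update.
            self.A = self.decay * self.A + self.write_rate * np.outer(k_old, v_old)
            self.evictions += 1
        self.buf.append((k, v))

        # L1 read: exact softmax attention over the local window.
        K = np.stack([kk for kk, _ in self.buf])
        V = np.stack([vv for _, vv in self.buf])
        local = softmax(K @ q / np.sqrt(len(q))) @ V

        # L2 read: one matrix multiplication against the compressed memory.
        memory = q @ self.A

        # Fuse the two paths (convex blend is an assumption of this sketch).
        return self.gate * local + (1.0 - self.gate) * memory

rng = np.random.default_rng(0)
head = TensorCacheHead(d_k=8, d_v=4, window=3)
for _ in range(10):
    out = head.step(rng.standard_normal(8), rng.standard_normal(8), rng.standard_normal(4))
```

With a window of 3 and 10 steps, the first three tokens fill the buffer and each later token triggers one eviction, so A receives seven writes while the buffer never holds more than three pairs.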

Original abstract

Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce \emph{Tensor Cache}, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\!\leftarrow\!\lambda A\!+\!\eta(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory--quality frontier over bounded-state baselines.
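The count of C² − C spurious terms follows from expanding the outer product of the chunk means. The derivation below is a sketch for one chunk of C evicted pairs; the closed form in the last line assumes that λ and η are constant within the chunk, which gives the weights w_t = λ^{C−t}. The paper's exact weights are not given on this page.

```latex
\begin{align*}
\bar k &= \frac{1}{C}\sum_{t=1}^{C} k_t, \qquad
\bar v = \frac{1}{C}\sum_{t=1}^{C} v_t \\
\bar k \otimes \bar v
  &= \frac{1}{C^{2}} \sum_{s=1}^{C}\sum_{t=1}^{C} k_s \otimes v_t
   = \frac{1}{C^{2}} \Big[
       \underbrace{\sum_{t=1}^{C} k_t \otimes v_t}_{C \text{ matched terms}}
     + \underbrace{\sum_{s \neq t} k_s \otimes v_t}_{C^{2}-C \text{ cross terms}}
     \Big] \\
A_t &= \lambda A_{t-1} + \eta\,(k_t \otimes v_t), \quad t = 1,\dots,C
  \;\;\Longrightarrow\;\;
A_C = \lambda^{C} A_0 + \eta \sum_{t=1}^{C} \lambda^{C-t}\,(k_t \otimes v_t)
\end{align*}
```

The cross terms pair the key of one token with the value of another, so a query that matches k_s also retrieves a share of every other v_t in the chunk. The unrolled per-token recurrence contains matched terms only.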

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Tensor Cache adds an eviction-only outer-product L2 to sliding-window attention and fixes the chunked-mean cross-term artifact, but the abstract supplies no numbers so the claimed gains stay unverified.


The main new pieces are the strict eviction conditioning on the L2 outer-product cache and the replacement of the chunked-mean update with an exact parallel weighted-sum scan. The scan removes the C²-C spurious cross terms that the mean shortcut silently adds, and that correction is a clean, self-contained improvement anyone using these memories should note. The L1/L2 fusion uses a learned scalar gate plus per-head decay and write rates trained end-to-end, which is standard but applied here to keep evicted information accessible without growing the window. The read itself is the usual linear-attention identity, so cost stays low. The paper does well by making the L2 cache receive only what the window evicts rather than mixing everything, and by calling out the training artifact explicitly. The soft spot is the complete absence of numbers: the abstract claims better memory-quality frontiers on scaling, associative recall, long-context LM, and capacity tests, yet gives no baselines, deltas, error bars, or controls. Without those, it is impossible to judge whether the gate integrates the two levels reliably or whether the per-head parameters actually move the frontier. The math on the cache and the scan looks internally consistent and does not collapse into a fitted quantity. This is aimed at people building bounded-memory long-context systems who already know linear attention and sliding windows. A reader who wants a practical L2 extension and the training fix would get value from trying the construction. It deserves peer review because the idea is concrete and the training correction is verifiable on its own, even if the empirical claims need the full tables to assess.
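The difference between the two update rules is easy to reproduce numerically. The sketch below compares sequential per-token writes, a closed-form weighted sum, and the chunked-mean shortcut on random data. The chunk size, dimensions, and the constant values of λ and η are assumptions, the weights w_t = λ^(C−t) follow from unrolling the per-token rule with constant rates, and the check runs in float64 rather than the float32 setting the abstract refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_k, d_v = 4, 6, 3   # hypothetical chunk size and dimensions
lam, eta = 0.9, 0.5     # constant decay and write rate (assumption)

K = rng.standard_normal((C, d_k))
V = rng.standard_normal((C, d_v))
A0 = rng.standard_normal((d_k, d_v))

# Reference: one write per token, A <- lam * A + eta * (k_t ⊗ v_t).
A_seq = A0.copy()
for k, v in zip(K, V):
    A_seq = lam * A_seq + eta * np.outer(k, v)

# Weighted sum over the chunk: A <- lam^C * A + eta * sum_t w_t (k_t ⊗ v_t).
w = lam ** np.arange(C - 1, -1, -1)  # w_t = lam^(C - t) for t = 1..C
A_sum = lam ** C * A0 + eta * np.einsum("t,ti,tj->ij", w, K, V)

# Chunked-mean shortcut: A <- lam * A + eta * (mean(k) ⊗ mean(v)).
A_mean = lam * A0 + eta * np.outer(K.mean(axis=0), V.mean(axis=0))

cross_terms = C * C - C  # spurious k_s ⊗ v_t products with s != t
```

Here `A_sum` agrees with `A_seq` up to rounding, while `A_mean` does not; the weighted sum is computed without a sequential loop over tokens, which is what makes a parallel formulation possible.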

Referee Report

2 major / 2 minor

Summary. The paper introduces Tensor Cache, a two-level caching scheme for autoregressive Transformers: a first-level sliding-window softmax attention cache (L1) paired with a second-level fixed-size outer-product associative memory (L2) that stores KV pairs evicted from the window. Evicted pairs are written into a per-layer matrix A and read via the linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i; a learned scalar gate fuses L1 and L2 outputs, and per-head decay/write-rate parameters are trained end-to-end. The authors replace the common chunked-mean training shortcut (which introduces C² − C spurious cross terms) with an exact parallel weighted-sum scan. They claim this construction improves the memory-quality frontier over bounded-state baselines on systems scaling, associative recall, long-context LM, and capacity diagnostics.

Significance. If the empirical improvements hold, the work supplies a practical, bounded-memory mechanism that retains evicted information through associative outer-product storage rather than discarding it, addressing a core scaling limitation of KV caches while preserving exact local attention for recent tokens. The explicit correction of the chunked-mean artifact and the use of standard trainable components are strengths; the approach is falsifiable via the listed benchmarks and could influence efficient long-context architectures if the gains prove robust.

major comments (2)

[Abstract] The central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.
[Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.

minor comments (2)

[Abstract] Notation: the per-layer matrix A is introduced without an explicit equation defining its update rule or dimensions; adding a compact definition would improve clarity.
[Abstract] The manuscript states that the outer-product memory and read identity are 'well-known'; a brief citation to the relevant linear-attention literature would help readers locate the foundation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and address the major comments point by point below.


Referee: [Abstract] The central claim that Tensor Cache 'improves the memory-quality frontier' is asserted without any quantitative results, baselines, error bars, or controls in the provided text; the soundness of the empirical contribution cannot be assessed from the given material.

Authors: The abstract is a concise summary of the work and its outcomes, following standard practice. The quantitative results supporting the memory-quality frontier claim (including direct comparisons to bounded-state baselines, with controls and error bars where reported) are presented in full in the Experiments section across systems scaling, associative recall, long-context LM, and capacity diagnostics.

revision: no

Referee: [Abstract (paragraph describing the gate and end-to-end training)] The integration of the learned scalar gate with L1 softmax attention and L2 outer-product reads is presented as stable under end-to-end training, yet no analysis of gradient stability, approximation error accumulation, or failure modes under long eviction chains is supplied; this is load-bearing for the claim that evicted information remains accessible without large errors.

Authors: The manuscript shows successful end-to-end training on all reported tasks. We acknowledge that no dedicated analysis of gradient stability, error accumulation over eviction chains, or associated failure modes is included. We will add this analysis (gradient norm tracking and error diagnostics) to the revised version.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper explicitly states that the outer-product memory and linear-attention read identity are well-known external results, with the contribution being their application to eviction-conditioned L2 caching plus replacement of the chunked-mean shortcut by an exact parallel weighted-sum scan. No derivation step reduces to a fitted parameter renamed as prediction, no self-citation is load-bearing for the central claim, and the learned scalar gate and per-head rates are standard end-to-end trainable components whose performance is evaluated on external benchmarks. The derivation chain is therefore self-contained against independent identities and does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the standard linear-attention identity and introduces trainable per-head decay and write-rate parameters; no new physical entities are postulated.

free parameters (1)

per-head decay and write-rate parameters
Trained end-to-end to control memory updates and forgetting; specific values not reported in abstract.

axioms (1)

[standard math] The linear-attention identity q_t (k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i holds and enables matrix-multiplication reads from the outer-product matrix A.
Invoked to justify reading evicted information from the L2 cache.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: exploiting the linear-attention identity q_t(k_i ⊗ v_i) = ⟨q_t, k_i⟩ v_i ... A ← λA + η(k_w ⊗ v_w) ... parallel weighted-sum scan A ← λ^C A + η Σ w_t (k_t ⊗ v_t)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: per-head decay and write-rate parameters are trained end-to-end

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 16 internal anchors

[1] P. Langley. Proceedings of the 17th International Conference on Machine Learning (ICML 2000). 2000.
[2] T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980.
[3] M. J. Kearns.
[4] Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2000.
[6] Suppressed for Anonymity.
[7] A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981.
[8] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959.
[9] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with … 2023.
[10] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2023.
[11] Efficient Streaming Language Models with Attention Sinks. arXiv preprint arXiv:2309.17453.
[12] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, R. … H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv preprint arXiv:2306.14048.
[13] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024.
[14] Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, and Chris Lott. 2025.
[15] Gemma 2: Improving Open Language Models at a Practical Size. arXiv preprint arXiv:2408.00118.
[16] Ring Attention with Blockwise Transformers for Near-Infinite Context. arXiv preprint arXiv:2310.01889.
[17] Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. 2023.
[18] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019.
[19] Compressive Transformers for Long-Range Sequence Modelling. arXiv preprint arXiv:1911.05507.
[20] Memorizing Transformers. arXiv preprint arXiv:2203.08913.
[21] Using Fast Weights to Attend to the Recent Past. arXiv preprint arXiv:1610.06258.
[22] Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
[23] Retentive Network: A Successor to Transformer for Large Language Models. arXiv preprint arXiv:2307.08621.
[24] Jamba: A Hybrid Transformer-Mamba Language Model. arXiv preprint arXiv:2403.19887.
[25] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv preprint arXiv:2307.08691.
[26] Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention. arXiv preprint arXiv:2404.07143.
[27] Linear Transformers Are Secretly Fast Weight Programmers. arXiv preprint arXiv:2102.11174.
[28] Recurrent Memory Transformer. arXiv preprint arXiv:2207.06881.
[29] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2402.09398.
[30] OpenWebText Corpus. 2019.
[31] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508.
[32] RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.
[33] Decoupled Weight Decay Regularization. International Conference on Learning Representations.
[34] Maximilian Beck, P. … Thirty-eighth Conference on Neural Information Processing Systems.
[35] Parallelizing Linear Transformers with the Delta Rule over Sequence Length. Advances in Neural Information Processing Systems.
[36] Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
[37] Adaptive Switching Circuits. 1960 IRE WESCON Convention Record. 1960.
[38] Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation. 1992.
[39] Attention Is All You Need. Advances in Neural Information Processing Systems.
[40] Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150.
[41] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[42] Efficient Streaming Language Models with Attention Sinks. International Conference on Learning Representations.
[43] Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the Persistence of Importance Hypothesis for …
[44] Compressive Transformers for Long-Range Sequence Modelling. International Conference on Learning Representations.
[45] Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation.
[46] Using Fast Weights to Attend to the Recent Past. Advances in Neural Information Processing Systems.
[47] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Fran. Fleuret. Transformers are … International Conference on Machine Learning.
[48] Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning.
[49] Extending Context Window of Large Language Models via Positional Interpolation. arXiv preprint arXiv:2306.15595.
[50] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole.

[1] [1]

Langley , title =

P. Langley , title =. Proceedings of the 17th International Conference on Machine Learning (ICML 2000) , address =. 2000 , pages =

work page 2000

[2] [2]

T. M. Mitchell. The Need for Biases in Learning Generalizations. 1980

work page 1980

[3] [3]

M. J. Kearns , title =

work page

[4] [4]

Machine Learning: An Artificial Intelligence Approach, Vol. I. 1983

work page 1983

[5] [5]

R. O. Duda and P. E. Hart and D. G. Stork. Pattern Classification. 2000

work page 2000

[6] [6]

Suppressed for Anonymity , author=

work page

[7] [7]

Newell and P

A. Newell and P. S. Rosenbloom. Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition. 1981

work page 1981

[8] [8]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959

[9] [9]

and Zhang, Hao and Stoica, Ion , journal =

Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph E. and Zhang, Hao and Stoica, Ion , journal =. Efficient Memory Management for Large Language Model Serving with. 2023 , doi =

work page 2023

[10] [10]

2023 , doi =

Liu, Yuhan and Li, Hanchen and Cheng, Yihua and Ray, Siddhant and Huang, Yuyang and Zhang, Qizheng and Du, Kuntai and Yao, Jiayi and Lu, Shan and Ananthanarayanan, Ganesh and Maire, Michael and Hoffmann, Henry and Holtzman, Ari and Jiang, Junchen , journal =. 2023 , doi =

work page 2023

[11] [11]

Efficient Streaming Language Models with Attention Sinks

Efficient Streaming Language Models with Attention Sinks , author =. arXiv preprint arXiv:2309.17453 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

Zhang, Zhenyu and Sheng, Ying and Zhou, Tianyi and Chen, Tianlong and Zheng, Lianmin and Cai, Ruisi and Song, Zhao and Tian, Yuandong and R. arXiv preprint arXiv:2306.14048 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

2024 , doi =

Li, Yuhong and Huang, Yingbing and Yang, Bowen and Venkitesh, Bharat and Locatelli, Acyr and Ye, Hanchen and Cai, Tianle and Lewis, Patrick and Chen, Deming , journal =. 2024 , doi =

work page 2024

[14] [14]

2025 , doi =

Goel, Raghavv and Park, Junyoung and Gagrani, Mukul and Jones, Dalton and Morse, Matthew and Langston, Harper and Lee, Mingu and Lott, Chris , journal =. 2025 , doi =

work page 2025

[15] [15]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma 2: Improving Open Language Models at a Practical Size , author =. arXiv preprint arXiv:2408.00118 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[16] [16]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Ring Attention with Blockwise Transformers for Near-Infinite Context , author =. arXiv preprint arXiv:2310.01889 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

2023 , doi =

Ding, Jiayu and Ma, Shuming and Dong, Li and Zhang, Xingxing and Huang, Shaohan and Wang, Wenhui and Zheng, Nanning and Wei, Furu , journal =. 2023 , doi =

work page 2023

[18] [18]

2019 , doi =

Dai, Zihang and Yang, Zhilin and Yang, Yiming and Carbonell, Jaime and Le, Quoc and Salakhutdinov, Ruslan , journal =. 2019 , doi =

work page 2019

[19] [19]

Compressive Transformers for Long-Range Sequence Modelling

Compressive Transformers for Long-Range Sequence Modelling , author =. arXiv preprint arXiv:1911.05507 , year =

work page internal anchor Pith review Pith/arXiv arXiv 1911

[20] [20]

arXiv preprint arXiv:2203.08913 , year =

Memorizing Transformers , author =. arXiv preprint arXiv:2203.08913 , year =

work page arXiv

[21] [21]

Using Fast Weights to Attend to the Recent Past

Using Fast Weights to Attend to the Recent Past , author =. arXiv preprint arXiv:1610.06258 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Linear-Time Sequence Modeling with Selective State Spaces , author =. arXiv preprint arXiv:2312.00752 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Retentive Network: A Successor to Transformer for Large Language Models

Retentive Network: A Successor to Transformer for Large Language Models , author =. arXiv preprint arXiv:2307.08621 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Jamba: A Hybrid Transformer-Mamba Language Model

Jamba: A Hybrid Transformer-Mamba Language Model , author =. arXiv preprint arXiv:2403.19887 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning , author =. arXiv preprint arXiv:2307.08691 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention , author =. arXiv preprint arXiv:2404.07143 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[27] Linear Transformers Are Secretly Fast Weight Programmers. arXiv preprint arXiv:2102.11174.

[28] Recurrent Memory Transformer. arXiv preprint arXiv:2207.06881.

[29] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference. arXiv preprint arXiv:2402.09398.

[30] OpenWebText Corpus. 2019.

[31] LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv preprint arXiv:2308.14508.

[32] RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv preprint arXiv:2104.09864.

[33] Decoupled Weight Decay Regularization. International Conference on Learning Representations.

[34] Beck, Maximilian and P. Thirty-eighth Conference on Neural Information Processing Systems.

[35] Parallelizing Linear Transformers with the Delta Rule over Sequence Length. Advances in Neural Information Processing Systems.

[36] Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.

[37] Adaptive Switching Circuits. 1960 IRE WESCON Convention Record, 1960.

[38] Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation, 1992.

[39] Attention Is All You Need. Advances in Neural Information Processing Systems.

[40] Fast Transformer Decoding: One Write-Head is All You Need. arXiv preprint arXiv:1911.02150.

[41] Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.

[42] Efficient Streaming Language Models with Attention Sinks. International Conference on Learning Representations.

[43] Liu, Zichang and Desai, Aditya and Liao, Fangshuo and Wang, Weitao and Xie, Victor and Xu, Zhaozhuo and Kyrillidis, Anastasios and Shrivastava, Anshumali. Scissorhands: Exploiting the Persistence of Importance Hypothesis for

[44] Compressive Transformers for Long-Range Sequence Modelling. International Conference on Learning Representations.

[45] Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks. Neural Computation.

[46] Using Fast Weights to Attend to the Recent Past. Advances in Neural Information Processing Systems.

[47] Katharopoulos, Angelos and Vyas, Apoorv and Pappas, Nikolaos and Fleuret, Fran. Transformers are. International Conference on Machine Learning.

[48] Linear Transformers Are Secretly Fast Weight Programmers. International Conference on Machine Learning.

[49] Extending Context Window of Large Language Models via Positional Interpolation. arXiv preprint arXiv:2306.15595.

[50] Peng, Bowen and Quesnelle, Jeffrey and Fan, Honglu and Shippole, Enrico.