Sparse Attention as Compact Kernel Regression

arxiv: 2601.22766 · v3 · submitted 2026-01-30 · 💻 cs.LG

Sparse Attention as Compact Kernel Regression

Saul Santos , Nuno Gon\c{c}alves , Daniel C. McNamee , Marcos Treviso , Andr\'e F.T Martins This is my paper

Pith reviewed 2026-05-16 09:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords sparse attentionkernel regressionNadaraya-Watson estimatorentmaxEpanechnikov kerneltransformerlanguage modeling

0 comments p. Extension

The pith

Sparse attention corresponds to compact kernel regression via the Nadaraya-Watson estimator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a formal link between sparse attention mechanisms and regression with compact kernels that have bounded support. Normalized ReLU attention matches Epanechnikov kernel regression under fixed normalization, while sparsemax attention matches the same kernel under adaptive normalization. More generally, kernels such as biweight and triweight correspond to α-entmax attention for α equal to one plus one over n, with the Gaussian-softmax pair recovered in the limit as n grows large. This view shows sparsity as a direct result of kernel design rather than an added heuristic and supplies a systematic route to new attention variants. Experiments replace standard attention with a kernel-regression version inside Memory Mosaics and obtain competitive results on language modeling, in-context learning, and length generalization.

Core claim

We establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, widely used kernels in nonparametric density estimation—including Epanechnikov, biweight, and triweight—correspond to α-entmax attention with α = 1 + 1/n for n in the natural numbers, while the softmax/Gaussian relationship emerges in the limit n to infinity.

What carries the argument

The Nadaraya-Watson kernel regression estimator applied to compact kernels, which produces normalized attention weights that match the regression formula exactly.

If this is right

Normalized ReLU attention is produced by Epanechnikov kernel regression under fixed normalization.
Sparsemax attention is produced by the same kernel under adaptive normalization.
α-entmax attention with α = 1 + 1/n matches the biweight, triweight and related compact kernels.
The softmax attention mechanism is recovered as the limiting case when the kernel parameter n tends to infinity.
A kernel-regression replacement for attention inside Memory Mosaics achieves competitive performance on language modeling, in-context learning and length generalization tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other compact kernels outside the entmax parameterization could be used to construct attention layers with different smoothness or cutoff behaviors.
The bounded-support property of these kernels may directly improve length generalization by restricting effective attention range without explicit masking.
Analogous kernel correspondences might be derived for memory or retrieval mechanisms that operate outside the standard transformer attention block.

Load-bearing premise

The normalized attention weights exactly reproduce the Nadaraya-Watson estimator for the chosen compact kernel and its normalization scheme.

What would settle it

Compute the explicit Nadaraya-Watson weights for the Epanechnikov kernel on a set of query-key dot products and check whether they equal the normalized ReLU attention weights on the same inputs.

Figures

Figures reproduced from arXiv: 2601.22766 by Andr\'e F.T Martins, Daniel C. McNamee, Marcos Treviso, Nuno Gon\c{c}alves, Saul Santos.

**Figure 1.** Figure 1: Comparison between classical kernel functions and their corresponding attention activations. The left plot shows normalized one-dimensional kernel functions commonly used in kernel regression evaluated on an input u ∈ R n . The right panel shows different attention transformations applied to a 3D score vector u = [0, u1, u2]. Each transformation allows an interpretation with corresponding kernels shown on … view at source ↗

**Figure 2.** Figure 2: Validation loss of Memory Mosaic models with different kernels. Comparison on the BabiStories dataset for varying model depths. The horizontal axis shows the number of training iterations. Importantly, since b > 0, in the worst case where all attention scores equal m, at least one rectified score will be strictly positive, and so the denominator in (21) cannot be zero. Hence, ReLUmax avoids the 0/0 degener… view at source ↗

**Figure 3.** Figure 3: Training and validation loss of Memory Mosaic models with different kernels. Comparison on the BabiStories dataset for varying model depths. The horizontal axis shows the number of training iterations. variants. Multi-query Multi-token Associative Recall (MQMTAR). The Multi-query Multi-token Associative Recall task evaluates a model’s ability to store and retrieve multiple key–value associations within a … view at source ↗

**Figure 4.** Figure 4: Illustration of the MQMTAR, reverse, and sort tasks. In the MQMTAR task, Soft colors indicate different keys, values, and queries for visual clarity. In the reverse task, the model receives a sequence of integers and must output them in reverse order. In the sort task, the model receives a sequence of integers and must output the sequence in ascending order. C. Additional Experiments C.1. Out-Of-Distributi… view at source ↗

**Figure 5.** Figure 5: Prediction performance on the Simple English Wikipedia dataset using models trained on the BABISTORIES corpus. The plot reports the per-token average loss as a function of the token position within a 512-token input sequence. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

read the original abstract

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper maps compact kernels to sparse attention forms like normalized ReLU and alpha-entmax, with competitive experiments on a kernel-based transformer.

read the letter

The main takeaway is that normalized ReLU attention corresponds to Epanechnikov kernel regression under fixed normalization, sparsemax under adaptive normalization, and alpha-entmax covers biweight and triweight kernels at alpha equal to 1 plus 1 over n. This extends the softmax-Gaussian link as n goes to infinity and frames sparsity as a natural outcome of kernel support rather than an add-on rule. The authors then build Memory Mosaics, a transformer variant that uses these kernel attentions, and report competitive results on language modeling, in-context learning, and length generalization tasks. That gives the work some practical grounding beyond the theory. The derivations appear to follow standard Nadaraya-Watson form once the fixed versus adaptive distinction is made, so the central correspondence holds up on the terms the paper sets. The experiments are straightforward and show the approach is usable, though they do not include exhaustive ablations against every recent sparse attention baseline. The main soft spot is that the fixed normalization needs an explicit formula in the main text to make the match to the classic estimator unmistakable; without it, readers might wonder if the link is to a slightly altered regression estimator. Overall this is aimed at people working on attention design and kernel methods who want a principled route to sparsity. A reader looking for alternatives to top-k or other heuristics would get concrete mappings and a working implementation to try. The thinking is clear and the citations to kernel regression literature are on point. Send it to peer review so the normalization details and experiment controls can be checked in full.

Referee Report

3 major / 2 minor

Summary. The manuscript claims a formal correspondence between sparse attention mechanisms and compact (bounded-support) kernel regression via the Nadaraya-Watson estimator. It asserts that normalized ReLU and sparsemax attention arise exactly from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively; more generally, kernels such as Epanechnikov, biweight, and triweight correspond to α-entmax attention for α = 1 + 1/n (with softmax/Gaussian recovered as n → ∞). Experiments with a kernel-regression-based transformer variant (Memory Mosaics) report competitive results on language modeling, in-context learning, and length generalization.

Significance. If the exact correspondences hold, the work supplies a principled nonparametric-statistics foundation for sparse attention design, explains the origin of sparsity from kernel support, and supplies non-heuristic alternatives to top-k or other associative-memory mechanisms. The Memory Mosaics experiments provide initial empirical grounding, though their strength is tied to the precision of the theoretical matches.

major comments (3)

[§3.2] §3.2 (normalized ReLU definition): the abstract states that normalized ReLU attention 'arises from Epanechnikov kernel regression under fixed normalization.' Standard Nadaraya-Watson is always adaptive (output = ∑ K_i v_i / ∑ K_i). If the paper's fixed normalization divides by a data-independent constant rather than the realized sum of kernel weights, the claimed exact match to the NW estimator does not hold; this is load-bearing for the central correspondence.
[§3.3] §3.3 (general α-entmax correspondence): the mapping from kernel order n to α = 1 + 1/n is presented as exact, yet the derivation appears to rely on a specific choice of normalization and support truncation. Provide the explicit step that shows the kernel weight function equals the α-entmax output for arbitrary n, including the handling of the partition function.
[Table 2] Table 2 / §5.1 (Memory Mosaics results): the reported gains over softmax baselines are modest and lack controls that isolate the kernel choice from other architectural differences (e.g., memory size, normalization scheme). Without these ablations the empirical support for the kernel-regression framing remains indirect.

minor comments (2)

[Abstract] Notation for 'fixed' versus 'adaptive' normalization is introduced without a one-sentence contrast in the abstract or §2; a brief parenthetical would aid readability.
[Figure 1] Figure 1 caption should explicitly state the kernel and normalization used for each attention variant shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate.

read point-by-point responses

Referee: [§3.2] §3.2 (normalized ReLU definition): the abstract states that normalized ReLU attention 'arises from Epanechnikov kernel regression under fixed normalization.' Standard Nadaraya-Watson is always adaptive (output = ∑ K_i v_i / ∑ K_i). If the paper's fixed normalization divides by a data-independent constant rather than the realized sum of kernel weights, the claimed exact match to the NW estimator does not hold; this is load-bearing for the central correspondence.

Authors: We appreciate the referee's observation. The manuscript distinguishes 'fixed' and 'adaptive' normalizations precisely to address this. Fixed normalization divides by a constant (the kernel integral), yielding a non-adaptive estimator. We will revise the abstract and §3.2 to explicitly note that this is a fixed-normalization kernel regression, not the standard adaptive NW estimator. The correspondence to Epanechnikov kernel still holds under this variant. revision: yes
Referee: [§3.3] §3.3 (general α-entmax correspondence): the mapping from kernel order n to α = 1 + 1/n is presented as exact, yet the derivation appears to rely on a specific choice of normalization and support truncation. Provide the explicit step that shows the kernel weight function equals the α-entmax output for arbitrary n, including the handling of the partition function.

Authors: We will include a detailed derivation in the revised §3.3. Starting from the kernel K(x) = c_n (1 - x^2)_+^n for |x| < 1, the normalized weights are K(x_i) / Z where Z is the sum. This form matches the α-entmax output p_i = [ (α-1)^{-1} (z_i - τ) ]_+ ^{1/(α-1)} with α = 1 + 1/n, where the threshold τ and partition function Z are determined by the support truncation to ensure the expression is non-negative. The explicit steps equate the power and the normalization constant for general n. revision: yes
Referee: [Table 2] Table 2 / §5.1 (Memory Mosaics results): the reported gains over softmax baselines are modest and lack controls that isolate the kernel choice from other architectural differences (e.g., memory size, normalization scheme). Without these ablations the empirical support for the kernel-regression framing remains indirect.

Authors: We concur that the experiments lack isolating ablations, and the gains are modest. The Memory Mosaics architecture bundles multiple modifications. We will revise §5.1 to acknowledge this limitation and state that the results provide preliminary support for the framework, with the primary contribution being the theoretical correspondence. We are unable to perform additional experiments for this revision. revision: partial

Circularity Check

0 steps flagged

Correspondence extends kernel regression theory to sparse attention without reducing to self-definition or fitted inputs

full rationale

The paper derives normalized ReLU and sparsemax attention as arising from Epanechnikov kernel regression (with fixed vs. adaptive normalization) and generalizes to α-entmax for other compact kernels. These steps equate attention outputs to Nadaraya-Watson estimators by direct substitution of the kernel forms into the regression formula, which is a standard mathematical correspondence rather than a self-referential loop or parameter fit. No load-bearing claim reduces to a self-citation chain or renames a known result as novel; the unification is self-contained against external nonparametric estimation benchmarks. Minor score accounts for possible normalization variants noted in skeptic analysis, but these do not collapse the central equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the kernel regression interpretation of attention as the starting point, plus the specific normalization choices that produce the claimed equivalences.

axioms (1)

domain assumption Attention mechanisms correspond to the Nadaraya-Watson kernel regression estimator.
This is the foundational link invoked from recent prior work.

pith-pipeline@v0.9.0 · 5533 in / 1147 out tokens · 56985 ms · 2026-05-16T09:21:14.795764+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively... α-entmax attention with α=1+1/n ... K(u)∝[1−∥u∥²]ᵣ₊ where r=(α−1)⁻¹
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

softmax attention corresponding to a Gaussian kernel... rectified polynomial kernels

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

[1]

Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P

URLhttps://arxiv.org/abs/2401.12973. Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P. Transformers need glasses! information over-squashing in language tasks.Advances in Neural Information Processing Systems, 37:98111–98142,

work page arXiv
[2]

Titans: Learning to Memorize at Test Time

Behrouz, A., Zhong, P., and Mirrokni, V . Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

URL http://arxiv.org/abs/2004. 05150. arXiv: 2004.05150. Epanechnikov, V . A. Non-parametric estimation of a multi- variate probability density.Theory of Probability & Its Ap- plications, 14(1):153–158,

work page internal anchor Pith review Pith/arXiv arXiv 2004
[5]

URLhttps://doi.org/10.1137/1114019

doi: 10.1137/1114019. URLhttps://doi.org/10.1137/1114019. Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and R ´e, C. Hungry Hungry Hippos: Towards lan- guage modeling with state space models. InInternational Conference on Learning Representations,

work page doi:10.1137/1114019
[7]

Key-value memory in the brain.arXiv preprint arXiv:2501.02950, 2025

URL https://arxiv.org/abs/2501.02950. Preprint, Jan

work page arXiv
[8]

Epub 2020 Aug

doi: 10.1016/j.neuron.2020.07.011. Epub 2020 Aug

work page doi:10.1016/j.neuron.2020.07.011 2020
[9]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

doi: 10.18653/v1/2021.sustainlp-1.5

Association for Computational Linguis- tics. doi: 10.18653/v1/2021.sustainlp-1.5. URL https: //aclanthology.org/2021.sustainlp-1.5/. H¨ardle, W.Applied nonparametric regression. Number

work page doi:10.18653/v1/2021.sustainlp-1.5 2021
[11]

URL https://doi.org/10.1137/ 1109020

doi: 10.1137/1109020. URL https://doi.org/10.1137/ 1109020. Nakanishi, K. M. Scalable-softmax is superior for attention. arXiv preprint arXiv:2501.19399,

work page doi:10.1137/1109020
[12]

Santos, S

URL http: //jmlr.org/papers/v26/24-1961.html. Santos, S. J. R., Niculae, V ., McNamee, D. C., and Martins, A. Sparse and structured hopfield networks. InForty-first International Conference on Machine Learning,

work page 1961
[13]

neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf

URL https://proceedings. neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf. Sun, Y ., Dong, L., Huang, S., Ma, S., Xia, Y ., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models,

work page 2015
[14]

Tsai, Y .-H

doi: 10.1037/0735-7044.100.2.147. Tsai, Y .-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. InEMNLP,

work page doi:10.1037/0735-7044.100.2.147
[15]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

doi: 10.1146/ annurev.psych.53.100901.135114. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30,

work page arXiv
[16]

URL https://papers.nips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract. html. Vasylenko, P., Pitorro, H., Martins, A. F. T., and Treviso, M. Long-context generalization with sparse attention,

work page 2017
[17]

Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R

URLhttps://arxiv.org/abs/2506.16640. Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R. Softmax is not enough (for sharp size gener- alisation). InForty-second International Conference on Machine Learning,

work page arXiv
[18]

Test-time regression: a unifying framework for design- ing sequence models with associative memory,

URL https://arxiv.org/ abs/2501.12352. Watson, G. S. Smooth regression analysis.Sankhy ¯a: The Indian Journal of Statistics, Series A (1961-2002), 26(4): 359–372,

work page arXiv 1961
[19]

URL http://www

ISSN 0581572X. URL http://www. jstor.org/stable/25049340. Wu, D., Hu, J. Y .-C., Li, W., Chen, B.-Y ., and Liu, H. Stanhop: Sparse tandem hopfield model for memory- enhanced time series prediction. InThe Twelfth Interna- tional Conference on Learning Representations,

work page arXiv
[21]

Gated Linear Attention Transformers with Hardware-Efficient Training

URL https://arxiv.org/abs/2312.06635. Yassa, M. A. and Stark, C. E. L. Pattern separation in the hippocampus.Trends in Neurosciences, 34(10):515–525,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Zaheer, M., Guruganesh, G., Dubey, K

doi: 10.1016/j.tins.2011.06.006. Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Al- berti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer se- quences.Advances in Neural Information Processing Systems, 33,

work page doi:10.1016/j.tins.2011.06.006 2011
[23]

Memory mosaics

Zhang, J., Nolte, N., Sadhukhan, R., Chen, B., and Bottou, L. Memory mosaics. In Yue, Y ., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.),International Conference on Representation Learning, volume 2025, pp. 36412–36433,

work page 2025
[24]

iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference

URLhttps://proceedings. iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference. pdf. 11 Sparse Attention as Compact Kernel Regression A. Proof of Proposition 1 Proof. Let q≡k n and r= 1 α−1. Since ∥ki∥=∥q∥= 1 , we have 1 2 ∥ki −q∥ 2 = 1−k ⊤ i q. The r-weight polynomial kernel is defined as: Kh(ki −q) = 1− ∥ki −q∥ 2 h2 r + (...

work page 2025
[25]

All models use the GPT-2 tokenizer (Radford et al.,

(Zhang et al., 2025), complementing the results in the main text (Figure 3). All models use the GPT-2 tokenizer (Radford et al.,

work page 2025
[26]

Experiments cover different numbers of parallel layers of persistent and contextual memories

and share a sequence length of 512 tokens. Experiments cover different numbers of parallel layers of persistent and contextual memories. For single-layer models, we vary k∈16,32,64 and report results for the best-performing value (k= 32 ), which is then used for all multi-layer configurations. Each contextual memory unit computes leaky-average keys and on...

work page 2019
[27]

B.3. Length Generalization For biological agents, episodic memory scales with the length of life rather than a fixed context window, placing memory retrieval inherently in the long-context regime and requiring continual length generalization. In such settings, dense aggregation over all stored memories would lead to severe interference, motivating the use...

work page 1986

[1] [1]

Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P

URLhttps://arxiv.org/abs/2401.12973. Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P. Transformers need glasses! information over-squashing in language tasks.Advances in Neural Information Processing Systems, 37:98111–98142,

work page arXiv

[2] [2]

Titans: Learning to Memorize at Test Time

Behrouz, A., Zhong, P., and Mirrokni, V . Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [4]

URL http://arxiv.org/abs/2004. 05150. arXiv: 2004.05150. Epanechnikov, V . A. Non-parametric estimation of a multi- variate probability density.Theory of Probability & Its Ap- plications, 14(1):153–158,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[4] [5]

URLhttps://doi.org/10.1137/1114019

doi: 10.1137/1114019. URLhttps://doi.org/10.1137/1114019. Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and R ´e, C. Hungry Hungry Hippos: Towards lan- guage modeling with state space models. InInternational Conference on Learning Representations,

work page doi:10.1137/1114019

[5] [7]

Key-value memory in the brain.arXiv preprint arXiv:2501.02950, 2025

URL https://arxiv.org/abs/2501.02950. Preprint, Jan

work page arXiv

[6] [8]

Epub 2020 Aug

doi: 10.1016/j.neuron.2020.07.011. Epub 2020 Aug

work page doi:10.1016/j.neuron.2020.07.011 2020

[7] [9]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [10]

doi: 10.18653/v1/2021.sustainlp-1.5

Association for Computational Linguis- tics. doi: 10.18653/v1/2021.sustainlp-1.5. URL https: //aclanthology.org/2021.sustainlp-1.5/. H¨ardle, W.Applied nonparametric regression. Number

work page doi:10.18653/v1/2021.sustainlp-1.5 2021

[9] [11]

URL https://doi.org/10.1137/ 1109020

doi: 10.1137/1109020. URL https://doi.org/10.1137/ 1109020. Nakanishi, K. M. Scalable-softmax is superior for attention. arXiv preprint arXiv:2501.19399,

work page doi:10.1137/1109020

[10] [12]

Santos, S

URL http: //jmlr.org/papers/v26/24-1961.html. Santos, S. J. R., Niculae, V ., McNamee, D. C., and Martins, A. Sparse and structured hopfield networks. InForty-first International Conference on Machine Learning,

work page 1961

[11] [13]

neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf

URL https://proceedings. neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf. Sun, Y ., Dong, L., Huang, S., Ma, S., Xia, Y ., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models,

work page 2015

[12] [14]

Tsai, Y .-H

doi: 10.1037/0735-7044.100.2.147. Tsai, Y .-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. InEMNLP,

work page doi:10.1037/0735-7044.100.2.147

[13] [15]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

doi: 10.1146/ annurev.psych.53.100901.135114. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30,

work page arXiv

[14] [16]

URL https://papers.nips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract. html. Vasylenko, P., Pitorro, H., Martins, A. F. T., and Treviso, M. Long-context generalization with sparse attention,

work page 2017

[15] [17]

Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R

URLhttps://arxiv.org/abs/2506.16640. Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R. Softmax is not enough (for sharp size gener- alisation). InForty-second International Conference on Machine Learning,

work page arXiv

[16] [18]

Test-time regression: a unifying framework for design- ing sequence models with associative memory,

URL https://arxiv.org/ abs/2501.12352. Watson, G. S. Smooth regression analysis.Sankhy ¯a: The Indian Journal of Statistics, Series A (1961-2002), 26(4): 359–372,

work page arXiv 1961

[17] [19]

URL http://www

ISSN 0581572X. URL http://www. jstor.org/stable/25049340. Wu, D., Hu, J. Y .-C., Li, W., Chen, B.-Y ., and Liu, H. Stanhop: Sparse tandem hopfield model for memory- enhanced time series prediction. InThe Twelfth Interna- tional Conference on Learning Representations,

work page arXiv

[18] [21]

Gated Linear Attention Transformers with Hardware-Efficient Training

URL https://arxiv.org/abs/2312.06635. Yassa, M. A. and Stark, C. E. L. Pattern separation in the hippocampus.Trends in Neurosciences, 34(10):515–525,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [22]

Zaheer, M., Guruganesh, G., Dubey, K

doi: 10.1016/j.tins.2011.06.006. Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Al- berti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer se- quences.Advances in Neural Information Processing Systems, 33,

work page doi:10.1016/j.tins.2011.06.006 2011

[20] [23]

Memory mosaics

Zhang, J., Nolte, N., Sadhukhan, R., Chen, B., and Bottou, L. Memory mosaics. In Yue, Y ., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.),International Conference on Representation Learning, volume 2025, pp. 36412–36433,

work page 2025

[21] [24]

iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference

URLhttps://proceedings. iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference. pdf. 11 Sparse Attention as Compact Kernel Regression A. Proof of Proposition 1 Proof. Let q≡k n and r= 1 α−1. Since ∥ki∥=∥q∥= 1 , we have 1 2 ∥ki −q∥ 2 = 1−k ⊤ i q. The r-weight polynomial kernel is defined as: Kh(ki −q) = 1− ∥ki −q∥ 2 h2 r + (...

work page 2025

[22] [25]

All models use the GPT-2 tokenizer (Radford et al.,

(Zhang et al., 2025), complementing the results in the main text (Figure 3). All models use the GPT-2 tokenizer (Radford et al.,

work page 2025

[23] [26]

Experiments cover different numbers of parallel layers of persistent and contextual memories

and share a sequence length of 512 tokens. Experiments cover different numbers of parallel layers of persistent and contextual memories. For single-layer models, we vary k∈16,32,64 and report results for the best-performing value (k= 32 ), which is then used for all multi-layer configurations. Each contextual memory unit computes leaky-average keys and on...

work page 2019

[24] [27]

B.3. Length Generalization For biological agents, episodic memory scales with the length of life rather than a fixed context window, placing memory retrieval inherently in the long-context regime and requiring continual length generalization. In such settings, dense aggregation over all stored memories would lead to severe interference, motivating the use...

work page 1986