pith. sign in

arxiv: 2601.22766 · v3 · submitted 2026-01-30 · 💻 cs.LG

Sparse Attention as Compact Kernel Regression

Pith reviewed 2026-05-16 09:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse attentionkernel regressionNadaraya-Watson estimatorentmaxEpanechnikov kerneltransformerlanguage modeling
0
0 comments X p. Extension

The pith

Sparse attention corresponds to compact kernel regression via the Nadaraya-Watson estimator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a formal link between sparse attention mechanisms and regression with compact kernels that have bounded support. Normalized ReLU attention matches Epanechnikov kernel regression under fixed normalization, while sparsemax attention matches the same kernel under adaptive normalization. More generally, kernels such as biweight and triweight correspond to α-entmax attention for α equal to one plus one over n, with the Gaussian-softmax pair recovered in the limit as n grows large. This view shows sparsity as a direct result of kernel design rather than an added heuristic and supplies a systematic route to new attention variants. Experiments replace standard attention with a kernel-regression version inside Memory Mosaics and obtain competitive results on language modeling, in-context learning, and length generalization.

Core claim

We establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, widely used kernels in nonparametric density estimation—including Epanechnikov, biweight, and triweight—correspond to α-entmax attention with α = 1 + 1/n for n in the natural numbers, while the softmax/Gaussian relationship emerges in the limit n to infinity.

What carries the argument

The Nadaraya-Watson kernel regression estimator applied to compact kernels, which produces normalized attention weights that match the regression formula exactly.

If this is right

  • Normalized ReLU attention is produced by Epanechnikov kernel regression under fixed normalization.
  • Sparsemax attention is produced by the same kernel under adaptive normalization.
  • α-entmax attention with α = 1 + 1/n matches the biweight, triweight and related compact kernels.
  • The softmax attention mechanism is recovered as the limiting case when the kernel parameter n tends to infinity.
  • A kernel-regression replacement for attention inside Memory Mosaics achieves competitive performance on language modeling, in-context learning and length generalization tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other compact kernels outside the entmax parameterization could be used to construct attention layers with different smoothness or cutoff behaviors.
  • The bounded-support property of these kernels may directly improve length generalization by restricting effective attention range without explicit masking.
  • Analogous kernel correspondences might be derived for memory or retrieval mechanisms that operate outside the standard transformer attention block.

Load-bearing premise

The normalized attention weights exactly reproduce the Nadaraya-Watson estimator for the chosen compact kernel and its normalization scheme.

What would settle it

Compute the explicit Nadaraya-Watson weights for the Epanechnikov kernel on a set of query-key dot products and check whether they equal the normalized ReLU attention weights on the same inputs.

Figures

Figures reproduced from arXiv: 2601.22766 by Andr\'e F.T Martins, Daniel C. McNamee, Marcos Treviso, Nuno Gon\c{c}alves, Saul Santos.

Figure 1
Figure 1. Figure 1: Comparison between classical kernel functions and their corresponding attention activations. The left plot shows normalized one-dimensional kernel functions commonly used in kernel regression evaluated on an input u ∈ R n . The right panel shows different attention transformations applied to a 3D score vector u = [0, u1, u2]. Each transformation allows an interpretation with corresponding kernels shown on … view at source ↗
Figure 2
Figure 2. Figure 2: Validation loss of Memory Mosaic models with different kernels. Comparison on the BabiStories dataset for varying model depths. The horizontal axis shows the number of training iterations. Importantly, since b > 0, in the worst case where all attention scores equal m, at least one rectified score will be strictly positive, and so the denominator in (21) cannot be zero. Hence, ReLUmax avoids the 0/0 degener… view at source ↗
Figure 3
Figure 3. Figure 3: Training and validation loss of Memory Mosaic models with different kernels. Comparison on the BabiStories dataset for varying model depths. The horizontal axis shows the number of training iterations. variants. Multi-query Multi-token Associative Recall (MQMTAR). The Multi-query Multi-token Associative Recall task evalu￾ates a model’s ability to store and retrieve multiple key–value associations within a … view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the MQMTAR, reverse, and sort tasks. In the MQMTAR task, Soft colors indicate different keys, values, and queries for visual clarity. In the reverse task, the model receives a sequence of integers and must output them in reverse order. In the sort task, the model receives a sequence of integers and must output the sequence in ascending order. C. Additional Experiments C.1. Out-Of-Distributi… view at source ↗
Figure 5
Figure 5. Figure 5: Prediction performance on the Simple English Wikipedia dataset using models trained on the BABISTORIES corpus. The plot reports the per-token average loss as a function of the token position within a 512-token input sequence. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims a formal correspondence between sparse attention mechanisms and compact (bounded-support) kernel regression via the Nadaraya-Watson estimator. It asserts that normalized ReLU and sparsemax attention arise exactly from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively; more generally, kernels such as Epanechnikov, biweight, and triweight correspond to α-entmax attention for α = 1 + 1/n (with softmax/Gaussian recovered as n → ∞). Experiments with a kernel-regression-based transformer variant (Memory Mosaics) report competitive results on language modeling, in-context learning, and length generalization.

Significance. If the exact correspondences hold, the work supplies a principled nonparametric-statistics foundation for sparse attention design, explains the origin of sparsity from kernel support, and supplies non-heuristic alternatives to top-k or other associative-memory mechanisms. The Memory Mosaics experiments provide initial empirical grounding, though their strength is tied to the precision of the theoretical matches.

major comments (3)
  1. [§3.2] §3.2 (normalized ReLU definition): the abstract states that normalized ReLU attention 'arises from Epanechnikov kernel regression under fixed normalization.' Standard Nadaraya-Watson is always adaptive (output = ∑ K_i v_i / ∑ K_i). If the paper's fixed normalization divides by a data-independent constant rather than the realized sum of kernel weights, the claimed exact match to the NW estimator does not hold; this is load-bearing for the central correspondence.
  2. [§3.3] §3.3 (general α-entmax correspondence): the mapping from kernel order n to α = 1 + 1/n is presented as exact, yet the derivation appears to rely on a specific choice of normalization and support truncation. Provide the explicit step that shows the kernel weight function equals the α-entmax output for arbitrary n, including the handling of the partition function.
  3. [Table 2] Table 2 / §5.1 (Memory Mosaics results): the reported gains over softmax baselines are modest and lack controls that isolate the kernel choice from other architectural differences (e.g., memory size, normalization scheme). Without these ablations the empirical support for the kernel-regression framing remains indirect.
minor comments (2)
  1. [Abstract] Notation for 'fixed' versus 'adaptive' normalization is introduced without a one-sentence contrast in the abstract or §2; a brief parenthetical would aid readability.
  2. [Figure 1] Figure 1 caption should explicitly state the kernel and normalization used for each attention variant shown.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating revisions where appropriate.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (normalized ReLU definition): the abstract states that normalized ReLU attention 'arises from Epanechnikov kernel regression under fixed normalization.' Standard Nadaraya-Watson is always adaptive (output = ∑ K_i v_i / ∑ K_i). If the paper's fixed normalization divides by a data-independent constant rather than the realized sum of kernel weights, the claimed exact match to the NW estimator does not hold; this is load-bearing for the central correspondence.

    Authors: We appreciate the referee's observation. The manuscript distinguishes 'fixed' and 'adaptive' normalizations precisely to address this. Fixed normalization divides by a constant (the kernel integral), yielding a non-adaptive estimator. We will revise the abstract and §3.2 to explicitly note that this is a fixed-normalization kernel regression, not the standard adaptive NW estimator. The correspondence to Epanechnikov kernel still holds under this variant. revision: yes

  2. Referee: [§3.3] §3.3 (general α-entmax correspondence): the mapping from kernel order n to α = 1 + 1/n is presented as exact, yet the derivation appears to rely on a specific choice of normalization and support truncation. Provide the explicit step that shows the kernel weight function equals the α-entmax output for arbitrary n, including the handling of the partition function.

    Authors: We will include a detailed derivation in the revised §3.3. Starting from the kernel K(x) = c_n (1 - x^2)_+^n for |x| < 1, the normalized weights are K(x_i) / Z where Z is the sum. This form matches the α-entmax output p_i = [ (α-1)^{-1} (z_i - τ) ]_+ ^{1/(α-1)} with α = 1 + 1/n, where the threshold τ and partition function Z are determined by the support truncation to ensure the expression is non-negative. The explicit steps equate the power and the normalization constant for general n. revision: yes

  3. Referee: [Table 2] Table 2 / §5.1 (Memory Mosaics results): the reported gains over softmax baselines are modest and lack controls that isolate the kernel choice from other architectural differences (e.g., memory size, normalization scheme). Without these ablations the empirical support for the kernel-regression framing remains indirect.

    Authors: We concur that the experiments lack isolating ablations, and the gains are modest. The Memory Mosaics architecture bundles multiple modifications. We will revise §5.1 to acknowledge this limitation and state that the results provide preliminary support for the framework, with the primary contribution being the theoretical correspondence. We are unable to perform additional experiments for this revision. revision: partial

Circularity Check

0 steps flagged

Correspondence extends kernel regression theory to sparse attention without reducing to self-definition or fitted inputs

full rationale

The paper derives normalized ReLU and sparsemax attention as arising from Epanechnikov kernel regression (with fixed vs. adaptive normalization) and generalizes to α-entmax for other compact kernels. These steps equate attention outputs to Nadaraya-Watson estimators by direct substitution of the kernel forms into the regression formula, which is a standard mathematical correspondence rather than a self-referential loop or parameter fit. No load-bearing claim reduces to a self-citation chain or renames a known result as novel; the unification is self-contained against external nonparametric estimation benchmarks. Minor score accounts for possible normalization variants noted in skeptic analysis, but these do not collapse the central equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the kernel regression interpretation of attention as the starting point, plus the specific normalization choices that produce the claimed equivalences.

axioms (1)
  • domain assumption Attention mechanisms correspond to the Nadaraya-Watson kernel regression estimator.
    This is the foundational link invoked from recent prior work.

pith-pipeline@v0.9.0 · 5533 in / 1147 out tokens · 56985 ms · 2026-05-16T09:21:14.795764+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 4 internal anchors

  1. [1]

    Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P

    URLhttps://arxiv.org/abs/2401.12973. Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., Madeira Ara ´ujo, J., Vitvitskyi, O., Pascanu, R., and Veliˇckovi´c, P. Transformers need glasses! information over-squashing in language tasks.Advances in Neural Information Processing Systems, 37:98111–98142,

  2. [2]

    Titans: Learning to Memorize at Test Time

    Behrouz, A., Zhong, P., and Mirrokni, V . Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663,

  3. [4]

    URL http://arxiv.org/abs/2004. 05150. arXiv: 2004.05150. Epanechnikov, V . A. Non-parametric estimation of a multi- variate probability density.Theory of Probability & Its Ap- plications, 14(1):153–158,

  4. [5]

    URLhttps://doi.org/10.1137/1114019

    doi: 10.1137/1114019. URLhttps://doi.org/10.1137/1114019. Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and R ´e, C. Hungry Hungry Hippos: Towards lan- guage modeling with state space models. InInternational Conference on Learning Representations,

  5. [7]

    Key-value memory in the brain.arXiv preprint arXiv:2501.02950, 2025

    URL https://arxiv.org/abs/2501.02950. Preprint, Jan

  6. [8]

    Epub 2020 Aug

    doi: 10.1016/j.neuron.2020.07.011. Epub 2020 Aug

  7. [9]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces.arXiv preprint arXiv:2312.00752,

  8. [10]

    doi: 10.18653/v1/2021.sustainlp-1.5

    Association for Computational Linguis- tics. doi: 10.18653/v1/2021.sustainlp-1.5. URL https: //aclanthology.org/2021.sustainlp-1.5/. H¨ardle, W.Applied nonparametric regression. Number

  9. [11]

    URL https://doi.org/10.1137/ 1109020

    doi: 10.1137/1109020. URL https://doi.org/10.1137/ 1109020. Nakanishi, K. M. Scalable-softmax is superior for attention. arXiv preprint arXiv:2501.19399,

  10. [12]

    Santos, S

    URL http: //jmlr.org/papers/v26/24-1961.html. Santos, S. J. R., Niculae, V ., McNamee, D. C., and Martins, A. Sparse and structured hopfield networks. InForty-first International Conference on Machine Learning,

  11. [13]

    neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf

    URL https://proceedings. neurips.cc/paper_files/paper/2015/file/ 8fb21ee7a2207526da55a679f0332de2-Paper.pdf. Sun, Y ., Dong, L., Huang, S., Ma, S., Xia, Y ., Xue, J., Wang, J., and Wei, F. Retentive network: A successor to transformer for large language models,

  12. [14]

    Tsai, Y .-H

    doi: 10.1037/0735-7044.100.2.147. Tsai, Y .-H. H., Bai, S., Yamada, M., Morency, L.-P., and Salakhutdinov, R. Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel. InEMNLP,

  13. [15]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

    doi: 10.1146/ annurev.psych.53.100901.135114. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need.Advances in neural information processing systems, 30,

  14. [16]

    URL https://papers.nips.cc/paper/2017/hash/ 3f5ee243547dee91fbd053c1c4a845aa-Abstract. html. Vasylenko, P., Pitorro, H., Martins, A. F. T., and Treviso, M. Long-context generalization with sparse attention,

  15. [17]

    Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R

    URLhttps://arxiv.org/abs/2506.16640. Veliˇckovi´c, P., Perivolaropoulos, C., Barbero, F., and Pas- canu, R. Softmax is not enough (for sharp size gener- alisation). InForty-second International Conference on Machine Learning,

  16. [18]

    Test-time regression: a unifying framework for design- ing sequence models with associative memory,

    URL https://arxiv.org/ abs/2501.12352. Watson, G. S. Smooth regression analysis.Sankhy ¯a: The Indian Journal of Statistics, Series A (1961-2002), 26(4): 359–372,

  17. [19]

    URL http://www

    ISSN 0581572X. URL http://www. jstor.org/stable/25049340. Wu, D., Hu, J. Y .-C., Li, W., Chen, B.-Y ., and Liu, H. Stanhop: Sparse tandem hopfield model for memory- enhanced time series prediction. InThe Twelfth Interna- tional Conference on Learning Representations,

  18. [21]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    URL https://arxiv.org/abs/2312.06635. Yassa, M. A. and Stark, C. E. L. Pattern separation in the hippocampus.Trends in Neurosciences, 34(10):515–525,

  19. [22]

    Zaheer, M., Guruganesh, G., Dubey, K

    doi: 10.1016/j.tins.2011.06.006. Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Al- berti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer se- quences.Advances in Neural Information Processing Systems, 33,

  20. [23]

    Memory mosaics

    Zhang, J., Nolte, N., Sadhukhan, R., Chen, B., and Bottou, L. Memory mosaics. In Yue, Y ., Garg, A., Peng, N., Sha, F., and Yu, R. (eds.),International Conference on Representation Learning, volume 2025, pp. 36412–36433,

  21. [24]

    iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference

    URLhttps://proceedings. iclr.cc/paper_files/paper/2025/file/ 59c3ac496e6fe7be0c2c2b95014e2210-Paper-Conference. pdf. 11 Sparse Attention as Compact Kernel Regression A. Proof of Proposition 1 Proof. Let q≡k n and r= 1 α−1. Since ∥ki∥=∥q∥= 1 , we have 1 2 ∥ki −q∥ 2 = 1−k ⊤ i q. The r-weight polynomial kernel is defined as: Kh(ki −q) = 1− ∥ki −q∥ 2 h2 r + (...

  22. [25]

    All models use the GPT-2 tokenizer (Radford et al.,

    (Zhang et al., 2025), complementing the results in the main text (Figure 3). All models use the GPT-2 tokenizer (Radford et al.,

  23. [26]

    Experiments cover different numbers of parallel layers of persistent and contextual memories

    and share a sequence length of 512 tokens. Experiments cover different numbers of parallel layers of persistent and contextual memories. For single-layer models, we vary k∈16,32,64 and report results for the best-performing value (k= 32 ), which is then used for all multi-layer configurations. Each contextual memory unit computes leaky-average keys and on...

  24. [27]

    B.3. Length Generalization For biological agents, episodic memory scales with the length of life rather than a fixed context window, placing memory retrieval inherently in the long-context regime and requiring continual length generalization. In such settings, dense aggregation over all stored memories would lead to severe interference, motivating the use...