Sparse Attention as Compact Kernel Regression

· 2026 · cs.LG · arXiv 2601.22766

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.

representative citing papers

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

stat.ML · 2026-05-18 · unverdicted · novelty 7.0

Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.

citing papers explorer

Showing 1 of 1 citing paper.

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity stat.ML · 2026-05-18 · unverdicted · none · ref 14 · internal anchor
Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.

Sparse Attention as Compact Kernel Regression

fields

years

verdicts

representative citing papers

citing papers explorer