Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.
Transformers are minimax optimal nonparametric in-context learners
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
verdicts
UNVERDICTED 3representative citing papers
A model-independent upper bound on generalization gap is established that depends solely on the Rényi entropy of the data-generating distribution for histogram-determined algorithms such as ERM.
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
citing papers explorer
-
Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity
Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.