Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.
Transformers are minimax optimal nonparametric in-context learners
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
A model-independent upper bound on generalization gap is established that depends solely on the Rényi entropy of the data-generating distribution for histogram-determined algorithms such as ERM.
Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.
citing papers explorer
Overfitting has a limitation: a model-independent generalization gap bound based on Rényi entropy