Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.
2 Pith papers cite this work. Polarity classification is still indexing.
fields
stat.ML 2
years
2026 2
verdicts
UNVERDICTED 2
representative citing papers
Geometric tempering yields exponential convergence bounds for both Wasserstein and Fisher-Rao flows but produces no speedup in the Fisher-Rao metric, with new adaptive schedules derived from the tempered dynamics.
citing papers explorer
- Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity
  Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.
- Properties and limitations of geometric tempering for gradient flow dynamics
  Geometric tempering yields exponential convergence bounds for both Wasserstein and Fisher-Rao flows but produces no speedup in the Fisher-Rao metric, with new adaptive schedules derived from the tempered dynamics.