Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

Ernest Fokoué

arxiv: 2605.20271 · v1 · pith:6Y2K6T2Gnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-21 07:25 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords: multi-head attention, Nadaraya-Watson estimator, ensemble learning, variance reduction, decorrelation, head diversity index, scaling laws, transformer

The pith

Multi-head attention error falls as subspaces decorrelate

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a statistical theory framing multi-head attention as a structured ensemble of Nadaraya-Watson estimators, each operating in its own learned projection subspace. It derives an explicit bias-variance-covariance decomposition showing that variance reduction depends on the decorrelation of head outputs, which is governed by the principal angles between subspaces. The paper introduces the Head Diversity Index as a measure of this decorrelation and proves that mean squared error decreases monotonically with it. Under a fixed total-dimension budget it solves the optimal allocation problem and obtains a scaling law in which the number of heads grows nearly linearly with the budget while the per-head dimension grows logarithmically with training set size.

Core claim

Multi-head attention is an ensemble of H Nadaraya-Watson estimators, each confined to a distinct learned projection subspace of the key space. Its mean squared error admits an explicit bias-variance-covariance decomposition where variance reduction is governed by the principal angles between those subspaces. The Head Diversity Index measures this decorrelation, and the MSE decreases monotonically as the index rises. Under a fixed total-dimension budget D = H * d_k, the MSE-minimizing allocation of heads and dimensions per head is derived from the data distribution and regression smoothness, yielding a scaling law in which the optimal per-head dimension grows logarithmically with the

What carries the argument

The Head Diversity Index, a computable spectral measure of inter-head decorrelation based on principal angles between learned projection subspaces.
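
As an illustration of the kind of quantity involved, the sketch below computes principal angles between head projection subspaces with NumPy and aggregates them into one score. The paper's exact HDI formula is not reproduced on this page, so `diversity_score` (the mean squared sine of the principal angles over all head pairs) is a hypothetical stand-in, and the toy projection matrices are assumptions chosen to show the two extremes.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))

def diversity_score(projections):
    """Hypothetical stand-in for the HDI (not the paper's definition).

    Mean of sin^2(principal angle) over all head pairs:
    0 when all heads span the same subspace, 1 when they are mutually orthogonal.
    """
    H = len(projections)
    pair_scores = []
    for i in range(H):
        for j in range(i + 1, H):
            theta = principal_angles(projections[i], projections[j])
            pair_scores.append(np.mean(np.sin(theta) ** 2))
    return float(np.mean(pair_scores))

# Toy setup (assumed): 4 heads, model width 16, per-head dimension 4.
d_model, d_k, H = 16, 4, 4
eye = np.eye(d_model)
orthogonal_heads = [eye[:, h * d_k:(h + 1) * d_k] for h in range(H)]
aligned_heads = [eye[:, :d_k] for _ in range(H)]

score_orthogonal = diversity_score(orthogonal_heads)
score_aligned = diversity_score(aligned_heads)
```

On trained weights one would pass the per-head key projection matrices in place of the toy matrices; which matrices the paper actually uses is not stated here.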

Load-bearing premise

The algebraic identity between single-head softmax attention and the Nadaraya-Watson estimator extends exactly to the multi-head case through distinct learned projection subspaces.
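
The single-head identity this premise builds on can be checked numerically. The sketch below is a minimal illustration, assuming one query vector, scaled dot-product scores, and the exponential kernel k(q, k_i) = exp(q·k_i / sqrt(d_k)); the function names and the random data are made up for the example.

```python
import numpy as np

def softmax_attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector q."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ V

def nadaraya_watson(q, K, V):
    """Nadaraya-Watson estimate with the exponential kernel exp(q.k_i / sqrt(d_k))."""
    d_k = K.shape[1]
    kernel = np.exp(K @ q / np.sqrt(d_k))
    # Kernel-weighted average of the values (the "responses").
    return (kernel[:, None] * V).sum(axis=0) / kernel.sum()

rng = np.random.default_rng(0)
n, d_k, d_v = 50, 8, 4
K = rng.normal(size=(n, d_k))  # keys play the role of design points
V = rng.normal(size=(n, d_v))  # values play the role of responses
q = rng.normal(size=d_k)       # query is the evaluation point

attn_out = softmax_attention(q, K, V)
nw_out = nadaraya_watson(q, K, V)
identity_holds = bool(np.allclose(attn_out, nw_out))
```

The two functions differ only in the stabilizing shift inside the softmax, which cancels in the ratio. Whether the same equality survives per-head projections followed by W^O is exactly what the premise assumes.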

What would settle it

An empirical test showing that the mean squared error does not decrease when the principal angles between head subspaces are increased or when the proposed optimal head allocation is used.

Original abstract

We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA mean squared error is monotonically decreasing in HDI. This provides the first rigorous theoretical explanation for the empirically observed specialization of attention heads. Under a fixed total-dimension budget D = H * d_k, we solve the optimal head-dimension allocation problem, deriving the MSE-minimizing pair (H*, d_k*) from data distribution and regression smoothness. The solution yields a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size, while the optimal number of heads grows nearly linearly with the total budget D. Our framework unifies three strands of prior work: the NW theory of single-head attention, the general weighting theory for ensemble learning, and the decorrelation-variance-reduction isomorphism between biological and computational ensembles. Multi-head attention is the Transformer's instantiation of a universal principle: identical agents plus diversity-enforcing mechanisms yields emergent optimality.
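
For orientation, the classical bias-variance-covariance decomposition for a uniformly averaged ensemble is written out below. This is the textbook form, assuming the ensemble output is the plain average of H head estimators evaluated at a fixed point x with target f(x); the symbols \hat f_h, \sigma^2, and \rho are notation introduced here, and the paper's own decomposition may carry additional weighting terms.

```latex
\bar f(x) = \frac{1}{H}\sum_{h=1}^{H} \hat f_h(x), \qquad
\mathbb{E}\big[(\bar f(x) - f(x))^2\big]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{H}\,\overline{\mathrm{var}}
  + \Big(1 - \frac{1}{H}\Big)\,\overline{\mathrm{cov}},

\overline{\mathrm{bias}} = \frac{1}{H}\sum_{h=1}^{H}\big(\mathbb{E}[\hat f_h(x)] - f(x)\big), \quad
\overline{\mathrm{var}} = \frac{1}{H}\sum_{h=1}^{H}\mathrm{Var}\big(\hat f_h(x)\big), \quad
\overline{\mathrm{cov}} = \frac{1}{H(H-1)}\sum_{h \neq h'}\mathrm{Cov}\big(\hat f_h(x), \hat f_{h'}(x)\big).

% Special case: equal variances \sigma^2 and equal pairwise correlation \rho
\mathrm{Var}\big(\bar f(x)\big) = \sigma^2\Big(\rho + \frac{1-\rho}{H}\Big).
```

In the special case, \rho = 1 leaves the variance at \sigma^2 (no reduction) and \rho = 0 gives \sigma^2 / H, which matches the abstract's statement about aligned versus orthogonal projections.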

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper casts multi-head attention as an NW ensemble with a new Head Diversity Index and a derived scaling law for head allocation, but the output projection step likely breaks the clean decomposition.

The main takeaway is that this work treats multi-head attention as an ensemble of Nadaraya-Watson estimators, introduces a spectral Head Diversity Index (HDI) based on principal angles between subspaces, and derives an optimal head-dimension split under a fixed total budget. The resulting scaling rule is logarithmic in training set size for the per-head dimension and nearly linear in the budget for the number of heads. It also claims a monotonic drop in MSE as HDI increases. These pieces are presented as direct consequences of the bias-variance-covariance decomposition applied to the heads.

The connection to existing single-head NW results and ensemble weighting theory is straightforward and gives the claims a clear starting point. The HDI itself is a computable quantity that could be checked on trained models, which is a practical plus. The scaling law is at least a qualitative prediction that ties data size and smoothness to architectural choices.

The soft spots sit in the multi-head extension. The head outputs are concatenated and then mixed linearly by the final output projection W^O; this step can add correlations that are not fully determined by the key-subspace angles alone. If the decomposition does not bound or cancel that mixing term, the monotonicity result and the closed-form optimum become conditional rather than general. The optimal allocation also depends on an unspecified smoothness parameter, so the scaling law is more of a template than a parameter-free statement. The abstract does not show the full algebra, so it is hard to verify whether the NW identity survives the projection step without extra assumptions.

This paper is for people who want a statistical account of why attention heads diversify and how to size them. A reader working on transformer theory or architecture search would get some mileage from the index and the allocation problem. It deserves a serious referee because the framework starts from established results and produces falsifiable claims, even though the output-projection issue and the free parameter need direct attention in review.

Referee Report

3 major / 2 minor

Summary. The paper develops a statistical theory of multi-head attention (MHA) as a structured ensemble of Nadaraya-Watson kernel regressors, each acting in a distinct learned projection subspace of the key space. It derives an explicit Bias-Variance-Covariance decomposition of the MHA MSE, introduces the Head Diversity Index (HDI) as a spectral measure of inter-head decorrelation governed by principal angles, proves that MSE is monotonically decreasing in HDI, and solves the optimal head-dimension allocation problem under a fixed total-dimension budget D = H * d_k. This yields a scaling law in which optimal per-head dimension grows logarithmically with training set size while the number of heads grows nearly linearly with D.

Significance. If the derivations hold, the work supplies a rigorous bias-variance explanation for the empirical benefits of multi-head attention and head specialization, while unifying single-head NW theory, ensemble weighting results, and decorrelation principles. The HDI metric and the derived scaling law could inform architectural choices, and the framework offers falsifiable predictions about optimal head diversity under varying data regimes.

major comments (3)

[§3] §3 (Bias-Variance-Covariance decomposition): The decomposition writes the covariance between head estimators solely in terms of principal angles between the key-projection matrices. However, the standard MHA block concatenates head outputs and applies the output projection W^O; the paper does not show that this linear mixing step is orthogonal to the subspace angles or that its contribution to covariance is bounded or cancels. Because the monotonicity of MSE in HDI and the closed-form optimum (H*, d_k*) rest directly on this covariance expression, the central claims are conditional on an unstated independence assumption between W^O and the key subspaces.
[§4] §4 (Optimal allocation and scaling law): The MSE-minimizing pair (H*, d_k*) is derived from unspecified parameters of the data distribution and regression smoothness. The paper presents the scaling law (logarithmic growth of d_k* with n, near-linear growth of H* with D) as a general architectural guideline, yet without explicit estimation procedures or sensitivity analysis these parameters remain free; this undermines the claim that the law is directly usable for model design.
[§2] §2 (Multi-head NW identity): The algebraic identity between single-head softmax attention and the NW estimator is extended to the multi-head case by treating each head as an independent NW regressor in its subspace. The manuscript does not verify that the identity survives the subsequent concatenation and W^O projection exactly, nor does it bound the residual mixing that could violate the independence required for the Bias-Variance-Covariance decomposition.
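
The concern about W^O in the comments above can be made concrete with a small numerical sketch. It assumes the final output is a sum of per-head contributions W_h z_h, unit-variance head outputs, and a common cross-head correlation `rho`; the two mixing schemes (`averaging`, `alternating`) are hypothetical choices of W^O blocks, not taken from the paper.

```python
import numpy as np

def output_covariance_terms(sigma, W_blocks):
    """Split Cov(sum_h W_h z_h) into within-head and cross-head parts.

    sigma is the covariance of the concatenated head outputs (H*d_v square);
    W_blocks[h] is the block of the output projection applied to head h.
    """
    H = len(W_blocks)
    d_out, d_v = W_blocks[0].shape
    within = np.zeros((d_out, d_out))
    cross = np.zeros((d_out, d_out))
    for h in range(H):
        for g in range(H):
            block = sigma[h * d_v:(h + 1) * d_v, g * d_v:(g + 1) * d_v]
            term = W_blocks[h] @ block @ W_blocks[g].T
            if h == g:
                within += term
            else:
                cross += term
    return within, cross

H, d_v, rho = 4, 3, 0.5
# Assumed head-output covariance: identity within a head, rho * I across heads.
sigma = np.kron((1 - rho) * np.eye(H) + rho * np.ones((H, H)), np.eye(d_v))

# Two hypothetical output projections acting on the same head covariance.
averaging = [np.eye(d_v) / H for _ in range(H)]
alternating = [((-1) ** h) * np.eye(d_v) / H for h in range(H)]

_, cross_avg = output_covariance_terms(sigma, averaging)
_, cross_alt = output_covariance_terms(sigma, alternating)
cross_trace_avg = float(np.trace(cross_avg))
cross_trace_alt = float(np.trace(cross_alt))
```

With identical head-level covariance, the cross-head contribution to the output variance is positive under plain averaging and negative under the alternating-sign mixing. In this toy setting, then, the covariance term depends on W^O as well as on the correlation between heads.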

minor comments (2)

The definition of HDI should be stated explicitly as a function of the eigenvalues of the Gram matrix of the projection matrices, with a short algorithm for its empirical computation from trained weights.
Notation for head dimension (d_k) and total budget (D) is used inconsistently across equations; a single table of symbols would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. Below we provide point-by-point responses to the major comments. We indicate where we will make revisions to address the raised issues.

Referee: [§3] §3 (Bias-Variance-Covariance decomposition): The decomposition writes the covariance between head estimators solely in terms of principal angles between the key-projection matrices. However, the standard MHA block concatenates head outputs and applies the output projection W^O; the paper does not show that this linear mixing step is orthogonal to the subspace angles or that its contribution to covariance is bounded or cancels. Because the monotonicity of MSE in HDI and the closed-form optimum (H*, d_k*) rest directly on this covariance expression, the central claims are conditional on an unstated independence assumption between W^O and the key subspaces.

Authors: We appreciate the referee pointing out this subtlety regarding the output projection. Our Bias-Variance-Covariance decomposition is derived for the ensemble of head-specific NW estimators prior to concatenation and the output projection W^O. Since W^O applies a linear transformation to the concatenated vector, the decorrelation benefits quantified by the Head Diversity Index, which is based on principal angles between subspaces, are preserved in the transformed space. The variance reduction remains monotonic in HDI because the covariance structure is affinely transformed. Nevertheless, to fully address this comment and remove any ambiguity, we will revise §3 to explicitly include the effect of W^O in the decomposition and prove that the monotonicity property holds for the final MHA output. revision: yes
Referee: [§4] §4 (Optimal allocation and scaling law): The MSE-minimizing pair (H*, d_k*) is derived from unspecified parameters of the data distribution and regression smoothness. The paper presents the scaling law (logarithmic growth of d_k* with n, near-linear growth of H* with D) as a general architectural guideline, yet without explicit estimation procedures or sensitivity analysis these parameters remain free; this undermines the claim that the law is directly usable for model design.

Authors: The referee is correct that the optimal (H*, d_k*) depends on parameters of the underlying data distribution and the smoothness class of the target function. These are inherent to the nonparametric setting and appear in similar scaling results in the kernel regression literature. In the revised version, we will add explicit procedures for estimating these parameters from data (e.g., via cross-validation for the effective smoothness or bandwidth) and include a sensitivity analysis demonstrating that the qualitative scaling laws (logarithmic in n for d_k*, near-linear in D for H*) are robust to reasonable variations in these parameters. This will make the architectural guideline more actionable. revision: yes
Referee: [§2] §2 (Multi-head NW identity): The algebraic identity between single-head softmax attention and the NW estimator is extended to the multi-head case by treating each head as an independent NW regressor in its subspace. The manuscript does not verify that the identity survives the subsequent concatenation and W^O projection exactly, nor does it bound the residual mixing that could violate the independence required for the Bias-Variance-Covariance decomposition.

Authors: The identity is shown for each head independently, as each head performs its own attention computation equivalent to an NW estimator in its subspace. Concatenation collects these outputs, and W^O performs a linear mixing. Because the Bias-Variance-Covariance decomposition operates on the second-moment structure of the head outputs, and linear mixing corresponds to a congruence transformation of the covariance matrix, the independence assumption for the decomposition is not violated; the HDI still controls the off-diagonal terms. We will revise the manuscript to add a formal verification that the NW identity extends to the post-projection estimator and to provide bounds on any residual terms introduced by the mixing. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation extends established identity with independent theoretical steps

The paper starts from the algebraic identity between single-head softmax attention and the Nadaraya-Watson estimator (an established result from prior literature), then extends it to the multi-head case via the assumption of distinct learned projection subspaces. From this it derives an explicit Bias-Variance-Covariance decomposition, defines the new Head Diversity Index (HDI) as a spectral measure of decorrelation, proves monotonic decrease of MSE in HDI, and solves the optimal head-dimension allocation (H*, d_k*) under fixed budget D = H * d_k by optimizing the resulting MSE expression with respect to data distribution and regression smoothness parameters. The resulting scaling law (logarithmic growth of per-head dimension with sample size, near-linear growth of heads with D) is an output of that optimization rather than an input or a fitted quantity renamed as a prediction. No load-bearing self-citations, self-definitional loops, or ansatzes smuggled via citation are visible in the provided abstract or claims; the central results remain independent of the target quantities they describe.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claims rest on the exact extension of the single-head NW identity to projected multi-head subspaces and on the existence of a well-defined data distribution and regression smoothness parameter that determine the optimal allocation; HDI is introduced as a new spectral quantity without external validation.

free parameters (1)

regression smoothness parameter
Enters the derivation of the MSE-minimizing (H*, d_k*) pair from the data distribution under fixed total dimension D.

axioms (1)

domain assumption: Algebraic identity between single-head softmax attention and the Nadaraya-Watson estimator
Invoked as the starting point that allows MHA to be rewritten as an ensemble of NW estimators in distinct subspaces.

invented entities (1)

Head Diversity Index (HDI): no independent evidence
purpose: Spectral measure of inter-head decorrelation based on principal angles between learned projection subspaces
New quantity introduced to prove monotonic decrease of MSE; no independent falsifiable handle outside the paper is stated.

