Transformers are minimax optimal nonparametric in-context learners

2024 · arXiv 2408.12186

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

stat.ML · 2026-05-18 · unverdicted · novelty 7.0

Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.

Overfitting has a limitation: a model-independent generalization gap bound based on Rényi entropy

stat.ML · 2025-05-30 · unverdicted · novelty 6.0

A model-independent upper bound on the generalization gap is established for histogram-determined algorithms such as ERM; the bound depends solely on the Rényi entropy of the data-generating distribution.

Hallucinations are inevitable but can be made statistically negligible

cs.CL · 2025-02-15 · unverdicted · novelty 6.0

Hallucinations are inevitable on an infinite set of inputs but can be made statistically negligible with sufficient training data quality and quantity.

citing papers explorer

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

stat.ML · 2026-05-18 · unverdicted · none · ref 11

Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension scaling laws under fixed total dimension.
