Efficient and Minimax Optimal In-context Nonparametric Regression with Transformers

Ioana Popescu; Michelle Ching; Nico Smith; Richard J. Samworth; Tianyi Ma; William G. Underwood

arxiv: 2601.15014 · v2 · pith:ZPFVD2JYnew · submitted 2026-01-21 · 📊 stat.ML · cs.LG· math.ST· stat.TH

Efficient and Minimax Optimal In-context Nonparametric Regression with Transformers

Michelle Ching , Ioana Popescu , Nico Smith , Tianyi Ma , William G. Underwood , Richard J. Samworth This is my paper

Pith reviewed 2026-05-21 15:46 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.STstat.TH

keywords in-context learningnonparametric regressiontransformersminimax optimalityHölder smooth functionslocal polynomial estimationgradient descent

0 comments

The pith

A transformer with Θ(log n) parameters pretrained on Ω(n^{2α/(2α+d)} log³ n) sequences achieves the minimax optimal rate O(n^{-2α/(2α+d)}) for α-Hölder nonparametric regression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that pretrained transformers can perform in-context nonparametric regression at the best possible statistical rate for Hölder smooth functions. With n examples in d dimensions, a model using only Θ(log n) parameters and a polynomial number of pretraining sequences reaches mean squared error of order n to the power of minus 2α over 2α plus d. This bound improves on earlier work by using substantially fewer parameters and less pretraining data. The result follows from showing that the transformer can internally approximate a local polynomial estimator.

Core claim

With n in-context examples and d-dimensional regression covariates, a pretrained transformer with Θ(log n) parameters and Ω(n^{2α/(2α+d)} log³ n) pretraining sequences can achieve the minimax optimal rate of convergence O(n^{-2α/(2α+d)}) in mean squared error for α-Hölder smooth regression functions. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.

What carries the argument

Approximating local polynomial estimators via a kernel-weighted polynomial basis followed by gradient descent.

If this is right

Transformers attain the statistical minimax rate for nonparametric regression in the in-context setting.
Only a logarithmic number of parameters in n is required.
Pretraining on a polynomial number of sequences in the effective sample size suffices.
The approach requires fewer resources than earlier transformer bounds for the same task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The architecture and pretraining together allow the model to simulate classical kernel-based statistical estimators.
Analogous approximation strategies could yield optimal in-context performance for related tasks such as classification or density estimation.
Reducing pretraining scale while preserving optimality may be possible through refined architectural choices.

Load-bearing premise

A transformer can approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent under the described pretraining regime.

What would settle it

Measuring whether the transformer's in-context mean squared error on α-Hölder data fails to scale as O(n^{-2α/(2α+d)}) for large n in fixed d and α.

read the original abstract

We study in-context learning for nonparametric regression with $\alpha$-H\"older smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax optimal rate of convergence $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper claims a transformer with only Θ(log n) parameters can hit the minimax rate for Hölder nonparametric regression after targeted pretraining, by approximating local polynomials inside the model.

read the letter

The main thing to know is that the authors get the optimal rate O(n^{-2α/(2α+d)}) for α-Hölder regression using a pretrained transformer whose parameter count stays logarithmic in n and whose pretraining uses roughly n to the power 2α/(2α+d) times log cubed n sequences. That is a clear improvement on the resource counts in earlier transformer in-context results they cite. The construction works by showing the transformer can build a kernel-weighted polynomial basis of degree floor(α) and then run gradient descent on the resulting local least-squares problem, all inside the attention layers after pretraining. If the details check out, this gives a concrete mechanism for why in-context learning can recover classical nonparametric estimators without needing model size or data to grow with n in the usual way. The engagement with minimax theory and the explicit parameter and sample bounds are the parts that feel solid and worth citing if the proofs hold. The soft spot is exactly the one the stress test flags: with parameter count fixed at Θ(log n), it is not immediate how the fixed-depth, fixed-width attention stack encodes both the kernel weighting and the iterative solver without hidden dimension or number of heads growing with n or α. The abstract states the result but leaves the circuit-level argument for the full text; if the proof only shows that some algorithm exists and then asserts the transformer can simulate it without an explicit low-parameter implementation, the rate transfer is not automatic. I would still send this to referees because the claim is sharp, the literature comparison is direct, and the question of whether transformers can match optimal statistical rates with minimal overhead is worth settling. Readers working on theoretical in-context learning or on bridging transformers to classical nonparametric statistics will get the most out of it.

Referee Report

2 major / 2 minor

Summary. The manuscript proves that, for α-Hölder smooth regression functions in d dimensions, a transformer pretrained on Ω(n^{2α/(2α+d)} log³ n) sequences and possessing only Θ(log n) parameters can, given n in-context examples, achieve the minimax-optimal mean-squared-error rate O(n^{-2α/(2α+d)}). The argument proceeds by showing that the transformer realizes an efficient approximation to the local-polynomial estimator: it constructs a kernel-weighted polynomial feature basis of degree ⌊α⌋ and then performs gradient-descent iterations on the resulting linear system.

Significance. If the central construction is rigorous, the result materially improves upon prior in-context-learning bounds by reducing both the transformer parameter count to logarithmic in n and the pretraining sample complexity while still attaining the classical nonparametric rate. The explicit reduction to a well-understood statistical estimator (local polynomials) supplies a concrete, falsifiable mechanism that explains why attention can succeed at this task.

major comments (2)

[§3] §3 (Transformer Architecture and Approximation): the claim that a fixed-depth, Θ(log n)-parameter attention network can simultaneously encode kernel evaluations and run a sufficient number of gradient-descent steps on the weighted polynomial system must be made fully explicit. If the proof only establishes existence of some (non-transformer) algorithm that performs these operations and then asserts that a transformer can simulate it, the minimax rate does not automatically transfer; an explicit parameter-efficient circuit for the kernel weights and the iterative solver is required.
[Theorem 1] Theorem 1 (Main Rate Statement): the pretraining regime Ω(n^{2α/(2α+d)} log³ n) is asserted to be substantially smaller than previous results, yet no quantitative comparison table or citation to the exact prior bounds appears in the statement or proof sketch. Without this, it is impossible to verify the claimed improvement in sample complexity.

minor comments (2)

[§2] Notation for the Hölder ball and the precise definition of the local-polynomial degree ⌊α⌋ should be introduced before the main theorem rather than deferred to the appendix.
[Figure 1] Figure 1 (or the schematic of the attention circuit) would benefit from explicit labels indicating which heads compute kernel weights and which implement the GD update.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We respond to each major comment below.

read point-by-point responses

Referee: [§3] §3 (Transformer Architecture and Approximation): the claim that a fixed-depth, Θ(log n)-parameter attention network can simultaneously encode kernel evaluations and run a sufficient number of gradient-descent steps on the weighted polynomial system must be made fully explicit. If the proof only establishes existence of some (non-transformer) algorithm that performs these operations and then asserts that a transformer can simulate it, the minimax rate does not automatically transfer; an explicit parameter-efficient circuit for the kernel weights and the iterative solver is required.

Authors: Section 3 provides an explicit, constructive reduction: specific attention heads compute the kernel weights via a softmax-based approximation to the Hölder kernel (with precision controlled by Θ(log n) parameters), the value vectors encode the polynomial features of degree ⌊α⌋, and a fixed number of subsequent layers implement the gradient-descent updates on the resulting Gram matrix via repeated linear transformations. The depth is independent of n and the total parameter count remains Θ(log n). This is a direct simulation inside the transformer, not a black-box existence claim. To further address the concern we will insert a short algorithmic box and a remark mapping each statistical operation to the corresponding transformer layer. revision: partial
Referee: [Theorem 1] Theorem 1 (Main Rate Statement): the pretraining regime Ω(n^{2α/(2α+d)} log³ n) is asserted to be substantially smaller than previous results, yet no quantitative comparison table or citation to the exact prior bounds appears in the statement or proof sketch. Without this, it is impossible to verify the claimed improvement in sample complexity.

Authors: We agree that an explicit side-by-side comparison would make the improvement easier to verify. In the revision we will add a table (and the corresponding citations) in the introduction that lists the transformer parameter count and pretraining sample complexity of the main prior in-context-learning results for nonparametric regression, thereby quantifying the reduction to Θ(log n) parameters and Ω(n^{2α/(2α+d)} log³ n) sequences. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is a self-contained existence proof via approximation arguments.

full rationale

The paper claims an existence result: a transformer with Θ(log n) parameters, after sufficient pretraining, can implement a kernel-weighted polynomial basis of degree ⌊α⌋ and run gradient descent to approximate the local polynomial estimator, thereby attaining the minimax rate O(n^{-2α/(2α+d)}). This is presented as a direct mathematical construction and approximation analysis rather than a reduction of the target rate to a fitted parameter or a self-citation chain. No equations or steps in the provided abstract and description equate the final convergence rate to the input pretraining sequences by definition, nor do they rename a known empirical pattern or smuggle an ansatz via prior self-work. The central claim retains independent content as an explicit (if technically demanding) simulation argument inside the transformer architecture.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on the standard Holder smoothness assumption from nonparametric statistics and on the unproven-in-abstract claim that transformers can realize the local polynomial mechanism through their architecture and gradient descent.

axioms (1)

domain assumption Regression functions belong to the alpha-Holder class for some alpha > 0.
This is the classical assumption that delivers the minimax rate n^{-2alpha/(2alpha+d)} in d dimensions.

pith-pipeline@v0.9.0 · 5679 in / 1364 out tokens · 57474 ms · 2026-05-21T15:46:09.358954+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat induction / embed_strictMono unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theorem 3.1 … Θ(log n) parameters … L := ⌈C log(en)⌉

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity
stat.ML 2026-05 unverdicted novelty 7.0

Multi-head attention is an ensemble of Nadaraya-Watson estimators whose MSE decreases monotonically with a new spectral Head Diversity Index measuring subspace decorrelation, yielding optimal head count and dimension ...