A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.
It represents the theoretical performance limit for inference under the MTD model assumptions, providing a gold-standard benchmark against which other estimators can be compared
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Transformers Learn Latent Mixture Models In-Context via Mirror Descent
A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.