A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.
Using equation 43 we get the upper bound∥c t∥2 2 ≤m·1 2 =mfor everyt
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Transformers Learn Latent Mixture Models In-Context via Mirror Descent
A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.