A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.
Jorge Pérez, Pablo Barceló, and Javier Marinkovic
3 Pith papers cite this work.
verdicts
UNVERDICTED 3
representative citing papers
Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.
LZ78 sources are almost stationary ergodic processes satisfying a Shannon-McMillan-Breiman property and local i.i.d. convergence, yet their finite-state compressibility exceeds the entropy rate by a Jensen gap.
citing papers explorer
- Transformers Learn Latent Mixture Models In-Context via Mirror Descent
  A three-layer transformer exactly implements one step of mirror descent on latent mixture weights for next-token prediction, yielding a first-order approximation to the Bayes-optimal estimator.
- Pre-trained Large Language Models Learn Hidden Markov Models In-context
  Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.
- The LZ78 Source
  LZ78 sources are almost stationary ergodic processes satisfying a Shannon-McMillan-Breiman property and local i.i.d. convergence, yet their finite-state compressibility exceeds the entropy rate by a Jensen gap.