Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Francesco D'Angelo; Nicolas Flammarion

arxiv: 2604.10848 · v1 · submitted 2026-04-12 · 💻 cs.LG

Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Francesco D'Angelo , Nicolas Flammarion This is my paper

Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3

classification 💻 cs.LG

keywords in-context learningtransformersmirror descentmixture modelslatent variablesattentionbayes optimal

0 comments

The pith

A three-layer transformer exactly implements one step of Mirror Descent to learn latent mixture weights in context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that transformers solve an in-context learning problem by estimating unobserved mixture weights that determine which past tokens influence the next prediction. It formalizes the task using Mixture of Transition Distributions, where a latent variable selects the active transition rule from the context. An explicit construction demonstrates that a three-layer transformer performs the Mirror Descent update on these weights, and the resulting predictor approximates the Bayes-optimal one to first order. Training transformers from scratch produces attention patterns and predictive distributions that match this construction, with deeper models reaching performance similar to multiple Mirror Descent steps. This links the attention mechanism to an optimization procedure for learning token importance.

Core claim

We give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor.

What carries the argument

The explicit three-layer transformer that realizes the Mirror Descent update on the mixture weights of a Mixture of Transition Distributions.

If this is right

Deeper transformers can approximate multiple steps of Mirror Descent and thereby improve in-context estimation of the mixture weights.
Attention patterns learned by the model directly encode the updated mixture weights that determine token importance.
The predictive distribution of the trained transformer converges to the Bayes-optimal predictor as the number of in-context examples grows.
Gradient descent on the transformer parameters discovers solutions consistent with the Mirror Descent construction rather than unrelated mechanisms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

In-context learning may often reduce to implicit optimization of latent parameters rather than pure pattern matching.
The same layer-wise mechanism could be extended to implement other first-order optimization methods for different latent-variable models.
This view suggests testable predictions for how attention heads should behave on sequence tasks with known latent structure.

Load-bearing premise

The observed sequences must be generated exactly by a mixture of transition distributions controlled by a single latent variable, and the transformer must be able to realize the precise mirror-descent weight update inside its fixed layers.

What would settle it

Train a transformer on data generated from the mixture model and check whether its attention weights fail to match the mixture weights produced by one Mirror Descent step or whether its predictions deviate from the first-order Bayes-optimal approximation.

Figures

Figures reproduced from arXiv: 2604.10848 by Francesco D'Angelo, Nicolas Flammarion.

**Figure 1.** Figure 1: To predict the final word, the model must infer the causal relevance of past tokens. Our MTD framework models this by separating a static, context-free unigram (π ⋆ ) from dynamic, context-dependent weights (λ) that are inferred in-context. The model learns to assign high weights to the causally relevant positions (’dog’) and (’ball’), activating their respective slices of the unigram (e.g., π ⋆ (dog, ·) f… view at source ↗

**Figure 2.** Figure 2: MTD for m=2. The selection of this lag is a random event, governed by the mixture weights λ = (λ1, . . . , λm) with λg ≥ 0 for all g and Pm g=1 λg = 1, such that the probability of choosing lag g is given by P(Zt = g) = λg. Once the lag Zt = g is sampled, the next token Yt is generated from a first-order transition that depends only on the state at the sampled position, Yt−g. This is captured by a transit… view at source ↗

**Figure 3.** Figure 3: Regularized Estimator. Comparison with Bayes and MLE estimators. The first-order equivalence in Theorem 1 holds only for short sequences, where the log-likelihood gradient is small. For longer sequences, neglected higher-order terms become significant, and the one-step estimator diverges from the Bayesian mean. Empirically, however, a few additional Mirror Descent steps substantially reduce this gap (se… view at source ↗

**Figure 4.** Figure 4: Comparison of Trained and Constructed Transformers. Left: Attention maps of the trained transformer (disentangled and standard) versus our theoretical construction (seq. length 64). Right (top): KL divergence to the ground truth transition probabilities for the trained transformers, the constructed transformer, and the one-step MD estimator across sequence lengths. Right (bottom): First-layer attention so… view at source ↗

**Figure 5.** Figure 5: Multi-Step MD vs. 5-Layer Transformer. KL divergence to the ground-truth transition probabilities for a 5-layer trained transformer, the k-step MD estimators and the Bayes-optimal estimator across sequence lengths. Results multi-step MD: To investigate whether deeper Transformers can learn to implement multiple steps of Mirror Descent, we plot in [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗

**Figure 6.** Figure 6: KL divergence to the ground-truth We report the KL divergence to the ground truth transition probabilities for the trained transformers, the constructed transformer, and the one-step MD estimator across sequence lengths Left: order 3. Right: order 5. 21 [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗

**Figure 7.** Figure 7: Layer-1 Attention Softmax vs. True Transition matrices for orders 3 and 5. For both orders (m = 3 top 3 rows, m = 5 bottom 3 rows), each panel reports the learned first-layer attention (softmax of WA⊤ 1 ) alongside the true transition matrix; the average row-wise KL divergence is reported in the panel title. Sequence lengths increase left-to-right, top-to-bottom. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of Trained and Constructed Transformers (attention grids). Top: Attention maps of the trained transformer (disentangled) versus our theoretical construction (seq. length 64, MTD order m = 3). Bottom: Same as top but for MTD order m = 5. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

read the original abstract

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper constructs a three-layer transformer that exactly performs one step of mirror descent for learning latent mixture weights in sequences, with a proof of first-order Bayes optimality, but the precise realization in standard layers is the part to double-check.

read the letter

The key thing to know is that this paper gives an explicit three-layer transformer construction that implements one step of mirror descent on the weights of a latent mixture of transition distributions, and proves the output is a first-order approximation to the Bayes optimal predictor. They do a good job laying out the Mixture of Transition Distributions framework, where a hidden variable chooses which transition rule applies to the next token. The construction maps the transformer layers directly to the mirror descent update rule, which is more specific than most prior work on in-context learning. The experiments back this up by showing that randomly initialized transformers trained on the task develop attention patterns and transition matrices that match the theoretical one, and that adding depth gets performance closer to running multiple mirror descent steps. The main soft spot is around whether the implementation is truly exact. The mirror descent step involves an elementwise exponential and renormalization, which the paper must realize using the softmax attention and the feed-forward network with its nonlinearities. If the construction uses particular weight initializations or input encodings to make this happen precisely, it works, but it could be fragile or only hold under the paper's assumptions on the data scale. The stress-test concern about softmax and ReLU not naturally producing the required log-ratio or projection steps is worth a close look in the full proof. The assumption that the data comes exactly from this mixture model is also strong, though standard for this kind of analysis. This paper is for people studying the mechanisms behind in-context learning and trying to connect transformer behavior to classical optimization algorithms. It has enough of a novel construction and empirical support to deserve serious peer review, even if the exactness details need referee scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Mixture of Transition Distributions framework to model in-context learning, where a latent variable selects which transition rule governs the next token. It gives an explicit construction of a three-layer transformer that exactly implements one step of mirror descent (KL divergence) on the unobserved mixture weights, proves that the resulting predictor is a first-order approximation to the Bayes-optimal estimator, and shows empirically that transformers trained from scratch produce attention patterns, predictive distributions, and transition matrices consistent with the construction (with deeper models approaching multi-step mirror descent performance).

Significance. If the explicit construction and first-order approximation hold, the work supplies a concrete mechanistic account of how attention layers can realize optimization steps in-context, directly linking transformer architecture to mirror descent on mixture weights. The provision of a parameter-free explicit mapping, the proof of the Bayes approximation, and the empirical match to trained models constitute clear strengths that would advance understanding of in-context learning beyond descriptive observations.

major comments (2)

[§3] §3 (Construction): the claim that the three-layer transformer exactly implements the mirror-descent update w' = w ⊙ exp(η ∇) / Z requires explicit verification that the softmax attention and ReLU/GELU feed-forward blocks realize the precise element-wise exponentiation and normalization without hidden scaling assumptions or temperature approximations; the standard dot-product attention computes softmax over similarities rather than direct log-ratios, so the construction must demonstrate how the fixed weights produce the exact multiplicative form.
[§5] §5 (Proof of first-order Bayes approximation): the error analysis establishing that one exact mirror-descent step yields a first-order approximation to the Bayes-optimal predictor must be checked against the data-generating assumptions (exact mixture of transition distributions); if the transformer implementation introduces any discrepancy in the Bregman projection step, the O(η) claim no longer follows directly and the subsequent empirical validation of predictive distributions would rest on an inexact surrogate.

minor comments (2)

[§2.1] §2.1: the notation for the latent-variable distribution p(z_t | context) could be cross-referenced to the mixture-weight update rule to improve readability.
[Figure 3] Figure 3: the comparison of learned versus constructed attention maps would benefit from quantitative distance metrics (e.g., total variation) in addition to visual inspection.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the detailed, constructive comments on the construction and proof. We address each major comment below and indicate the revisions we will make to improve clarity and explicitness.

read point-by-point responses

Referee: [§3] §3 (Construction): the claim that the three-layer transformer exactly implements the mirror-descent update w' = w ⊙ exp(η ∇) / Z requires explicit verification that the softmax attention and ReLU/GELU feed-forward blocks realize the precise element-wise exponentiation and normalization without hidden scaling assumptions or temperature approximations; the standard dot-product attention computes softmax over similarities rather than direct log-ratios, so the construction must demonstrate how the fixed weights produce the exact multiplicative form.

Authors: We thank the referee for emphasizing the need for explicit verification. Section 3 of the manuscript provides the full parameter assignments for the three-layer transformer. The attention layers use fixed weights chosen so that the dot-product similarities reduce exactly to the required log-ratios of the mixture weights and gradients; the subsequent softmax then implements the precise multiplicative update w' = w ⊙ exp(η ∇) / Z with no temperature scaling or hidden factors. The feed-forward blocks employ ReLU (or GELU) to realize the element-wise operations and normalization without approximation. In the revision we will add a dedicated paragraph with a line-by-line computation trace confirming that the architecture produces the exact mirror-descent step under the stated weight choices. revision: yes
Referee: [§5] §5 (Proof of first-order Bayes approximation): the error analysis establishing that one exact mirror-descent step yields a first-order approximation to the Bayes-optimal predictor must be checked against the data-generating assumptions (exact mixture of transition distributions); if the transformer implementation introduces any discrepancy in the Bregman projection step, the O(η) claim no longer follows directly and the subsequent empirical validation of predictive distributions would rest on an inexact surrogate.

Authors: We appreciate the referee's careful scrutiny of the error analysis. The data-generating process is defined as an exact mixture of transition distributions, and the three-layer construction realizes the mirror-descent update (including the KL Bregman projection) exactly, with no discrepancy between the mathematical step and the transformer operations. The O(η) approximation to the Bayes-optimal predictor therefore follows directly from the standard first-order Taylor expansion of mirror descent. In the revision we will insert an explicit remark in Section 5 stating that the transformer implementation introduces no additional error in the projection step, thereby preserving the claimed approximation order, and we will add a cross-reference to the construction details in Section 3. revision: yes

Circularity Check

0 steps flagged

Explicit construction of transformer as mirror-descent step is independent of fitted inputs

full rationale

The paper's central claim rests on an explicit, architecture-level construction that maps fixed transformer weights and layers directly onto one iteration of mirror descent on mixture weights. This mapping is presented as a first-principles equivalence derived from the attention and feed-forward operations, without any parameter fitting to data, self-referential definition of the target estimator, or load-bearing self-citation that would make the result tautological. The subsequent proof that the estimator approximates the Bayes-optimal predictor is likewise derived from the properties of that explicit update rule rather than from any empirical fit or renamed known result. No step reduces the claimed derivation to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that next-token data is generated by a latent mixture of transition distributions and that a fixed transformer architecture can exactly realize one mirror-descent step on the mixture weights.

axioms (1)

domain assumption Next-token sequences are generated by a mixture of transition distributions whose latent variable selects the active rule at each step.
This modeling choice is introduced to formalize token importance estimation as in-context learning of mixture weights.

pith-pipeline@v0.9.0 · 5479 in / 1175 out tokens · 46950 ms · 2026-05-10T15:18:53.561459+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1]

Transformers on markov data: Constant depth suffices,

URLhttps://openreview.net/forum?id=sLkj91HIZU. Jorge P´erez, Pablo Barcel ´o, and Javier Marinkovic. Attention is turing complete.The Journal of Machine Learning Research, 22(1):3463–3497, 2021. Adrian E Raftery. A model for high-order markov chains.Journal of the Royal Statistical Society Series B: Statistical Methodology, 47(3):528–539, 1985. Nived Raja...

work page arXiv 2021
[2]

It represents the theoretical performance limit for inference under the MTD model assumptions, providing a gold-standard benchmark against which other estimators can be compared

= 1 K KX k=1 p(Yt+1 =j|y t 1,λ (k)) = 1 K KX k=1 mX g=1 λ(k) g π(yt+1−g, j) ! .(20) This estimate converges to the true Bayes optimal predictive distribution asK→ ∞. It represents the theoretical performance limit for inference under the MTD model assumptions, providing a gold-standard benchmark against which other estimators can be compared. H ALGORITHMS...

work page 2026
[3]

no evidence

Thus, the posterior probability simplifies: P(Zt =g|y,λ (k)) =P(Z t =g|y t 1,λ (k)). Using Bayes’ theorem: γ(k) t (g) =P(Z t =g|y t 1,λ (k)) = P(yt |Z t =g,y t−1 1 ,λ (k))P(Zt =g|y t−1 1 ,λ (k)) P(yt |y t−1 1 ,λ (k)) = π(yt−g, yt)λ(k) g Pm h=1 P(yt, Zt =h|y t−1 1 ,λ (k)) = λ(k) g π(yt−g, yt) Pm h=1 λ(k) h π(yt−h, yt) .(22) The E-step involves calculating ...

work page 2026
[4]

Using equation 43 we get the upper bound∥c t∥2 2 ≤m·1 2 =mfor everyt

Hence ctc⊤ t op =∥c t∥2 2 = mX g=1 c2 t,g. Using equation 43 we get the upper bound∥c t∥2 2 ≤m·1 2 =mfor everyt. For the denominator, by equation 42 and sincePm g=1 λg = 1, st(λ) = mX g=1 λgct,g ≥ mX g=1 λgcmin =c min. Therefore for everyt, ctc⊤ t op st(λ)2 ≤ m c2 min . Summing overt=m+ 1, . . . , Tyields ∇2f(λ) op ≤ TX t=m+1 m c2 min = (T−m) m c2 min , w...

work page 2026
[5]

Using this inequality we obtain, for everyt, m2 S2 t ∥ct∥2 2 ≤ m2 S2 t S2 t =m 2

Observe the elementary inequality ∥ct∥2 2 ≤ mX g=1 ct,g 2 =S 2 t , which holds because( P i ai)2 = P i a2 i + 2P i<j aiaj ≥ P i a2 i for nonnegativea i. Using this inequality we obtain, for everyt, m2 S2 t ∥ct∥2 2 ≤ m2 S2 t S2 t =m 2. Summing overt=m+ 1, . . . , Tyields ∇2f(λ ∗) op ≤ TX t=m+1 m2 = (T−m)m 2, which proves equation 46. The remaining claims f...

work page 2026

[1] [1]

Transformers on markov data: Constant depth suffices,

URLhttps://openreview.net/forum?id=sLkj91HIZU. Jorge P´erez, Pablo Barcel ´o, and Javier Marinkovic. Attention is turing complete.The Journal of Machine Learning Research, 22(1):3463–3497, 2021. Adrian E Raftery. A model for high-order markov chains.Journal of the Royal Statistical Society Series B: Statistical Methodology, 47(3):528–539, 1985. Nived Raja...

work page arXiv 2021

[2] [2]

It represents the theoretical performance limit for inference under the MTD model assumptions, providing a gold-standard benchmark against which other estimators can be compared

= 1 K KX k=1 p(Yt+1 =j|y t 1,λ (k)) = 1 K KX k=1 mX g=1 λ(k) g π(yt+1−g, j) ! .(20) This estimate converges to the true Bayes optimal predictive distribution asK→ ∞. It represents the theoretical performance limit for inference under the MTD model assumptions, providing a gold-standard benchmark against which other estimators can be compared. H ALGORITHMS...

work page 2026

[3] [3]

no evidence

Thus, the posterior probability simplifies: P(Zt =g|y,λ (k)) =P(Z t =g|y t 1,λ (k)). Using Bayes’ theorem: γ(k) t (g) =P(Z t =g|y t 1,λ (k)) = P(yt |Z t =g,y t−1 1 ,λ (k))P(Zt =g|y t−1 1 ,λ (k)) P(yt |y t−1 1 ,λ (k)) = π(yt−g, yt)λ(k) g Pm h=1 P(yt, Zt =h|y t−1 1 ,λ (k)) = λ(k) g π(yt−g, yt) Pm h=1 λ(k) h π(yt−h, yt) .(22) The E-step involves calculating ...

work page 2026

[4] [4]

Using equation 43 we get the upper bound∥c t∥2 2 ≤m·1 2 =mfor everyt

Hence ctc⊤ t op =∥c t∥2 2 = mX g=1 c2 t,g. Using equation 43 we get the upper bound∥c t∥2 2 ≤m·1 2 =mfor everyt. For the denominator, by equation 42 and sincePm g=1 λg = 1, st(λ) = mX g=1 λgct,g ≥ mX g=1 λgcmin =c min. Therefore for everyt, ctc⊤ t op st(λ)2 ≤ m c2 min . Summing overt=m+ 1, . . . , Tyields ∇2f(λ) op ≤ TX t=m+1 m c2 min = (T−m) m c2 min , w...

work page 2026

[5] [5]

Using this inequality we obtain, for everyt, m2 S2 t ∥ct∥2 2 ≤ m2 S2 t S2 t =m 2

Observe the elementary inequality ∥ct∥2 2 ≤ mX g=1 ct,g 2 =S 2 t , which holds because( P i ai)2 = P i a2 i + 2P i<j aiaj ≥ P i a2 i for nonnegativea i. Using this inequality we obtain, for everyt, m2 S2 t ∥ct∥2 2 ≤ m2 S2 t S2 t =m 2. Summing overt=m+ 1, . . . , Tyields ∇2f(λ ∗) op ≤ TX t=m+1 m2 = (T−m)m 2, which proves equation 46. The remaining claims f...

work page 2026