Distinct mechanisms underlying in-context learning in transformers
Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3
The pith
Transformers implement in-context learning with two distinct multi-layer subcircuits whose use depends on training data diversity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A transformer trained on a finite set S of discrete Markov chains exhibits four algorithmic phases characterized by memorization versus generalization and use of 1-point versus 2-point statistics. These phases are realized by multi-layer subcircuits that implement context-adaptive computations through two distinct mechanisms. The phase boundaries K1* and K2* are set by data diversity K = |S|, with K1* arising from kinetic competition between subcircuits and K2* from a representational bottleneck. A symmetry-constrained theory of training dynamics accounts for the sharp transition to 2-point generalization and the structure of the loss landscape that permits generalization.
What carries the argument
Multi-layer subcircuits that realize two qualitatively distinct mechanisms for context-adaptive computation and that together implement the four phases (memorization or generalization, using 1-point or 2-point statistics).
If this is right
- Below K1* the network memorizes instead of generalizing.
- Above K2* the network switches to using two-point statistics for generalization.
- Minimal models can be constructed that isolate the essential features of each subcircuit motif.
- The symmetry-constrained theory predicts the conditions under which generalization occurs and identifies the relevant features of the loss landscape.
Where Pith is reading between the lines
- The same phase structure and subcircuit distinction could be tested in transformers trained on language or image data that contain natural sequential statistics.
- If the kinetic-competition and representational-bottleneck boundaries generalize, they could be used to predict which mechanism a model will adopt for a given training regime.
- Designing training curricula that deliberately cross or avoid these boundaries might let practitioners select the desired in-context mechanism.
Load-bearing premise
That the four phases and two subcircuit mechanisms seen on a finite set of discrete Markov chains capture how in-context learning works in transformers trained on wider or continuous data distributions.
What would settle it
Train the same transformer architecture on continuous or real-world sequential data and check whether the same four phases, two subcircuit mechanisms, and the same K1* and K2* boundaries appear.
Original abstract
Modern distributed networks, notably transformers, acquire a remarkable ability (termed 'in-context learning') to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes or generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits and the second ($K_2^\ast$) is set by a representational bottleneck. A symmetry-constrained theory of a transformer's training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that transformers trained on a finite set S of discrete Markov chains exhibit exactly four algorithmic phases of in-context learning, defined by combinations of memorization vs. generalization and use of 1-point vs. 2-point statistics. These phases are realized by two qualitatively distinct multi-layer subcircuit motifs, separated by boundaries K1* (set by kinetic competition between subcircuits) and K2* (set by a representational bottleneck). A symmetry-constrained theory of training dynamics is said to explain the sharp 1-point to 2-point transition and key features of the loss landscape.
Significance. If the mechanistic dissection holds, the work is significant for providing concrete subcircuit-level accounts of context-adaptive computation in transformers and for isolating the roles of data diversity K = |S| in selecting among mechanisms. The use of minimal models to extract the essential features of each motif is a clear strength, as is the explicit scoping to enumerable 1- and 2-point statistics on finite discrete chains.
major comments (2)
- [Theory section] The symmetry-constrained theory of training dynamics is presented as explanatory for the sharp transition at K1*, yet it is unclear from the provided description whether the kinetic-competition boundary is derived independently or reduces to quantities fitted from the same experimental runs that define the phases (see abstract and the theory section). This risks circularity in the account of how the loss landscape permits generalization.
- [Results on subcircuits] The central claim that the four phases are implemented by two qualitatively distinct multi-layer subcircuit mechanisms rests on identification of motifs for the finite discrete Markov-chain regime. The manuscript must demonstrate that these motifs do not collapse or require entirely different circuits when the data distribution is continuous or high-dimensional, as the current construction rules out neither possibility.
minor comments (2)
- [Abstract] The abstract states that 'a complete mechanistic characterization' is provided, but the methods for identifying and verifying the subcircuits (e.g., via ablation, activation patching, or circuit discovery) are not summarized; a brief methods paragraph would improve accessibility.
- [Introduction] Notation for K1* and K2* is introduced without an explicit equation linking them to the loss landscape or to the symmetry constraints; adding a short definitional equation would clarify the boundaries.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to improve the presentation.
-
Referee: [Theory section] The symmetry-constrained theory of training dynamics is presented as explanatory for the sharp transition at K1*, yet it is unclear from the provided description whether the kinetic-competition boundary is derived independently or reduces to quantities fitted from the same experimental runs that define the phases (see abstract and the theory section). This risks circularity in the account of how the loss landscape permits generalization.
Authors: We appreciate the referee pointing out this potential ambiguity. The symmetry-constrained theory is constructed by imposing the symmetries of the Markov chain ensemble on the transformer's training dynamics, which allows us to derive the kinetic competition between subcircuits and predict the boundary K1* without reference to specific experimental data. The experiments then serve to test and illustrate these theoretical predictions. To make this separation explicit, we will revise the theory section to include a more detailed derivation of the boundary from symmetry arguments alone, followed by a comparison to experimental results. This should resolve any perception of circularity. revision: yes
-
Referee: [Results on subcircuits] The central claim that the four phases are implemented by two qualitatively distinct multi-layer subcircuit mechanisms rests on identification of motifs for the finite discrete Markov-chain regime. The manuscript must demonstrate that these motifs do not collapse or require entirely different circuits when the data distribution is continuous or high-dimensional, as the current construction rules out neither possibility.
Authors: The manuscript is scoped to the finite discrete Markov chain regime, as emphasized in the abstract, where K = |S| is finite and the statistics are discrete and enumerable. In this setting, we provide a complete characterization of the two distinct subcircuit mechanisms. We agree that it is an open question whether these motifs generalize to continuous or high-dimensional distributions, and our current results do not address or rule out alternative circuits in those cases. We will add a limitations paragraph in the discussion to explicitly state the scope and suggest that investigating continuous distributions is an important direction for future research. This does not alter the conclusions for the discrete case studied. revision: partial
Circularity Check
No significant circularity detected in derivation chain
The paper's core claims rest on empirical identification of four phases and two subcircuit motifs in transformers trained on a finite discrete Markov chain set S, with boundaries K1* and K2* delineated by kinetic competition and representational bottleneck, plus a symmetry-constrained training dynamics theory. These are presented as observations and explanatory models derived from the training runs and minimal models, without any quoted reduction where a 'prediction' or first-principles result is definitionally equivalent to fitted inputs or prior self-citations. The derivation remains self-contained against the stated scope; no load-bearing step collapses to tautology or renaming of known results by construction. External benchmarks or code reproduction would be needed to assess broader validity, but none is required for circularity.
Axiom & Free-Parameter Ledger
free parameters (2)
- K1*
- K2*
axioms (2)
- ad hoc to paper: Transformers trained on finite discrete Markov chains exhibit exactly four algorithmic phases of in-context learning
- domain assumption: The observed subcircuits implement context-adaptive computation via 1-point or 2-point statistics
Reference graph
Works this paper leans on
-
[1]
Memorization and task vectors. $\phi$: task vector, a representation of the generating chain produced internally by the $M_2$ transformer. $D_\phi$: dimension of the task vector in the minimal model. II: Transformer architecture
-
[2]
Data generation. We follow a meta-training setup, where the parameters of a transformer are optimized on data generated from multiple distinct stochastic processes ('tasks'). Each process is specified by a Markovian transition matrix over $C$ states. The transformer is trained on a fixed set of chains $S = \{T^{(1)}, T^{(2)}, \dots, T^{(K)}\}$. These transition matrices ...
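The data-generation setup described above can be sketched as follows. This is a minimal illustration and not the authors' code: the excerpt is truncated before it states the distribution from which the transition matrices are drawn, so the symmetric Dirichlet prior (`alpha`), the uniform initial state, and the function names `sample_chains` and `sample_sequence` are all assumptions made here for illustration.

```python
import numpy as np

def sample_chains(K, C, rng, alpha=1.0):
    """Sample K transition matrices over C states.

    Assumption: every column T[:, mu] (the next-state distribution given
    current state mu) is drawn from a symmetric Dirichlet(alpha).
    Returns an array of shape (K, C, C) with T[k, tau, mu] = P(tau | mu).
    """
    cols = rng.dirichlet(alpha * np.ones(C), size=(K, C))  # (K, mu, tau)
    return np.transpose(cols, (0, 2, 1))                   # (K, tau, mu)

def sample_sequence(T, N, rng):
    """Draw a length-N state sequence from one chain T of shape (C, C)."""
    C = T.shape[0]
    seq = np.empty(N, dtype=int)
    seq[0] = rng.integers(C)  # assumption: uniform initial state
    for n in range(1, N):
        seq[n] = rng.choice(C, p=T[:, seq[n - 1]])
    return seq

rng = np.random.default_rng(0)
S = sample_chains(K=8, C=4, rng=rng)   # the fixed training set of chains
k = rng.integers(len(S))               # pick a task
seq = sample_sequence(S[k], N=32, rng=rng)
```

In this sketch the data diversity K is simply the number of matrices held in `S`; a generalization evaluation would call `sample_chains` again to draw fresh chains from the same prior.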
-
[3]
Network architecture. Here, we describe our primary architecture, which is the two-layer transformer illustrated in Figure 2a. A transformer takes a sequence of vector-embedded states as input and produces a probability distribution over the next state as output [1]. Since the data generating process generates a sequence over $C$ states, each sequence is fir...
-
[4]
Training Process. The parameters of the model are optimized by minimizing the auto-regressive cross-entropy sequence prediction loss, $L_{\text{train}}(\hat{\pi}_\theta) = \left\langle -\frac{1}{N} \sum_{n=1}^{N} \log \hat{\pi}_\theta(s_{n+1} \mid S_n) \right\rangle_{T \sim S,\; S_{N+1} \sim T}$ (II13), where $S_n = (s_1, s_2, \dots, s_n)$ for all $n$. We drop the argument and write $L_{\text{train}}(\hat{\pi}) = L_{\text{train}}$ when the predictor the loss is evaluated for is clear from conte...
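The loss in Eq. II13 can be evaluated for any predictor, not only a transformer. The sketch below assumes a hypothetical callable `predict(prefix)` that returns a probability vector over the next state; the names `sequence_nll` and `train_loss` are introduced here for illustration and do not come from the paper.

```python
import numpy as np

def sequence_nll(predict, seq):
    """-(1/N) * sum_{n=1..N} log predict(seq[:n])[seq[n]] for N + 1 states."""
    N = len(seq) - 1
    total = 0.0
    for n in range(1, N + 1):
        probs = predict(seq[:n])       # distribution over the next state
        total -= np.log(probs[seq[n]])
    return total / N

def train_loss(predict, sequences):
    """Average the per-sequence loss over a batch of sampled sequences."""
    return float(np.mean([sequence_nll(predict, s) for s in sequences]))

# Sanity check with a predictor that ignores its context entirely.
C = 3
uniform = lambda prefix: np.full(C, 1.0 / C)
seqs = [np.array([0, 1, 2, 1, 0]), np.array([2, 2, 1, 0, 1])]
loss = train_loss(uniform, seqs)
```

A context-free uniform predictor gives a loss of log C, which is a convenient baseline when comparing the four predictors discussed in the following entries.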
-
[5]
Bayes Predictors. Across task diversities and over the course of training, we find that a transformer's behavior can be well-characterized by four algorithms (illustrated in Figure 1c). These algorithms are specific implementations of the Bayes-optimal predictor, which first infers the underlying transition matrix from an observed sequence $S_N$ and then predicts the distribu...
-
[6]
Memorization. We model memorization by ideal predictors that have complete information about the transition matrices in $S$. In this case, memorizing predictors correspond to the general Bayes predictor (Eq. III2) when the prior $P(T)$ matches $S$. Making the substitution $P(T) = \frac{1}{K} \sum_{k=1}^{K} \delta(T - T^{(k)})$, we have $\hat{\pi}^{\text{Mem}}_n(\tau \mid \mu) = \frac{1}{K} \sum_{k=1}^{K} \frac{P(S_n \mid T^{(k)})}{P(S_n)} T^{(k)}_{\tau\mu}$...
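The memorizing predictor above is a posterior-weighted mixture of the K known chains. A minimal sketch follows, assuming the storage convention `chains[k, tau, mu] = P(tau | mu)` and assuming the initial-state probability is identical for every chain so that it cancels in the posterior; neither detail is confirmed by the truncated excerpt, and the function name is hypothetical.

```python
import numpy as np

def memorizing_predictor(chains, prefix):
    """Posterior-weighted next-state distribution over K known chains.

    chains: array (K, C, C) with chains[k, tau, mu] = P(next=tau | current=mu).
    prefix: observed states s_1..s_n as integers.
    """
    prefix = np.asarray(prefix)
    prev, nxt = prefix[:-1], prefix[1:]
    # log P(S_n | T^(k)) up to the shared initial-state term (assumption).
    # Zero-probability transitions give -inf, i.e. posterior weight 0.
    loglik = np.log(chains[:, nxt, prev]).sum(axis=1)   # shape (K,)
    w = np.exp(loglik - loglik.max())                    # stable normalization
    w /= w.sum()                                         # P(T^(k) | S_n)
    mu = prefix[-1]
    return w @ chains[:, :, mu]                          # shape (C,)

# Two chains over C = 2 states: one mostly stays put, one mostly switches.
stay = np.array([[0.9, 0.1], [0.1, 0.9]])
switch = np.array([[0.1, 0.9], [0.9, 0.1]])
probs = memorizing_predictor(np.stack([stay, switch]), [0, 0, 0, 0])
```

After three observed self-transitions the posterior sits almost entirely on the first chain, so the prediction is close to that chain's column for the current state.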
-
[7]
Generalization. We model generalization by ideal predictors that predict the next state given complete knowledge of the data distribution $D_T$ from which the transition matrices in $S$ are sampled. Generalizing predictors thus correspond to the Bayes-optimal predictors (Eq. III2) when the assumed prior $P(T)$ matches $D_T$. a. 1-point generalization. For 1-point statist...
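To make the distinction between 1-point and 2-point statistics concrete, the sketch below contrasts a predictor that uses only state frequencies with one that uses transition counts out of the current state. The excerpt is truncated before the paper's exact expressions, so the symmetric Dirichlet smoothing (`alpha`) and both function names are assumptions for illustration, not the authors' definitions.

```python
import numpy as np

def one_point_predictor(prefix, C, alpha=1.0):
    """Uses only how often each state has appeared (1-point statistics)."""
    counts = np.bincount(prefix, minlength=C)
    return (counts + alpha) / (counts.sum() + C * alpha)

def two_point_predictor(prefix, C, alpha=1.0):
    """Uses counts of transitions out of the current state (2-point statistics)."""
    prefix = np.asarray(prefix)
    mu = prefix[-1]
    followers = prefix[1:][prefix[:-1] == mu]   # states that followed mu
    counts = np.bincount(followers, minlength=C)
    return (counts + alpha) / (counts.sum() + C * alpha)

prefix = np.array([0, 1, 0, 1, 0, 1, 0])
p1 = one_point_predictor(prefix, C=2)   # nearly balanced between the states
p2 = two_point_predictor(prefix, C=2)   # strongly favors state 1 after a 0
```

On a strictly alternating sequence the two predictors disagree sharply, which is the kind of behavioral gap that separates the 1-point and 2-point phases.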
-
[8]
Scaling of the four Bayesian predictors with $K$ and $N$. Figure A1 compares the loss incurred by the four Bayesian predictors as a function of data diversity $K$ and sequence length $N$ on training sequences. FIG. A1 (panels a, b): Scaling of the four predictor losses with data diversity $K$ and sequence length $N$ computed over a large batch of sequences and averaged over 8 task dis...
-
[9]
Behavioral Readouts. We evaluated the training and generalization loss of each model checkpoint on a sample of $8 \times K$ and 2048 sequences respectively, where the generalization loss is given by $L_{\text{gen}} = \left\langle -\frac{1}{N} \sum_{n=1}^{N} \log \hat{\pi}_\theta(s_{n+1} \mid S_n) \right\rangle_{T \sim D_T,\; S_{N+1} \sim T}$ (IV1). This allowed us to compare the loss values through training to the loss of each predictor on the same sequ...
-
[10]
Mechanistic Readouts. Only the attention layers of the transformer architecture can mix information along the sequence dimension. Consequently, for the model to infer nearest-neighbor 2-point correlations (i.e., bigrams) in a sequence, at least one attention layer must attend to the previous state. To identify when an attention layer exhibits this behavi...
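One simple way to quantify whether an attention layer attends to the previous state is to average the attention weight that each position places on its immediate predecessor. The excerpt is truncated before the paper's own readout is defined, so the metric below and the name `previous_token_score` are hypothetical stand-ins rather than the authors' measure.

```python
import numpy as np

def previous_token_score(attn):
    """Mean weight that query position n places on key position n - 1.

    attn[n, i] is the attention weight from query n to key i (rows sum to 1).
    """
    attn = np.asarray(attn)
    return float(np.mean(np.diagonal(attn, offset=-1)))

N = 4
# Uniform causal attention: position n spreads weight evenly over keys 0..n.
uniform = np.tril(np.ones((N, N)))
uniform /= uniform.sum(axis=1, keepdims=True)

# Idealized previous-token head (position 0 can only attend to itself).
prev_head = np.eye(N, k=-1)
prev_head[0, 0] = 1.0

score_uniform = previous_token_score(uniform)
score_prev = previous_token_score(prev_head)
```

The score equals 1 for an idealized previous-token head and decays with sequence length for uniform attention, so tracking it across checkpoints would indicate when such a head forms.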
-
[11]
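Path Expansion. We first explain the nature of the circuit edges we consider. Recall the construction of the residual stream after each block: $x^{(0)}_n = W_E x_n$ (V1), $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n})$ (V2), $x^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) + \mathrm{MLP}^{(1)}(y^{(1)}_n)$ (V3), $y^{(2)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) + \mathrm{MLP}^{(1)}(y^{(1)}_n) + \mathrm{Att}^{(2)}(x^{(1)}_{\le n})$ (V4), $x^{(2)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) +$...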
-
[12]
Circuit tracing. We first measured the importance of each connection in producing the observed transformer behavior. To do so, we developed a custom Python implementation of the model forward pass that explicitly exposes the vector passed along each layer connection. This allows the passed vectors along all edges to be cached during a forward pass of the u...
-
[13]
Reduction of the Network and Circuits. We first introduce simplifications at the architectural level. (i) Fixed one-hot embeddings. Since sequence states are embedded as orthogonal one-hot vectors, we eliminate the embedding matrix $W_E$ and directly work with the one-hot token representations. (ii) Disentangled value subspaces. We assume that the value matr...
-
[14]
Symmetry Reduction. In our case, since the generative model for transition matrices $T$ is symmetric over the $C$ token classes, all $C$ token classes are statistically identical. This symmetry implies that the four $W$ matrices will have equal diagonal terms and equal off-diagonal terms. The off-diagonal terms can be set to zero as they contribute a constant offset; for ...
-
[15]
Numerical validation. a. Training details. Denote the last token in the sequence, $x_N$, as $\mu$. The cross-entropy loss averaged over input sequences given transition matrix $T$ is $L = -\left\langle \sum_{\mu,\tau} p_\mu T_{\tau\mu} \log \pi_\tau(x_{1:N}) \right\rangle$ (VII12), where $T_{\tau\mu}$ is the probability that the next token is $\tau$ given that the current token is $\mu$ and $p_\mu$ is the stationary probability of token $\mu$ for transition matrix $T$. ...
-
[16]
The SA-transformer is trained using stochastic gradient descent (SGD) with learning rate 1 and a batch size of 256. b. Training results. The training loss is shown in Figure 5b, and shows that the SA-transformer reproduces the abrupt learning of the full network (compare with Figure 2b for $K = 1024$, where the transformer rapidly learns the 1-Gen solution and...
-
[17]
The network first enters $G_1$ before entering $G_2$. We show that $w_A \to 0$ and $w_B = w_C = w_D = 1/3$ in $G_1$, whereas the rest of the parameters remain at zero. FIG. A4: Parameters of the attention circuits defined in equation VII13 at different iterations. The corresponding loss dynamics is given in Figure 5(b).
-
[18]
Next, we show that $G_2$ corresponds to $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$ in the SA-transformer. The other terms involving $w_A$, $w_B$, $w_D$ do not contribute to the solution after convergence.
-
[19]
Third, we show that accurately capturing the training dynamics of $\beta, \delta$ requires a careful computation of expectations when expanding the loss in a Taylor series near the 1-Gen solution. While the 2-Gen solution involves a second-order term in the loss of the form $\beta\delta$ (which would imply a saddle-point at the origin $\beta, \delta = 0$), there are subtle first-order contribu...
-
[20]
Competition between $x_N$ and the 1-Gen solution. The network nearly implements the 1-Gen solution at initialization except for a contribution due to the first term involving $w_A$ in equation VII14. At initialization, we have $A^{(\ell)}_{ji} = 1/i$ for $\ell = 1, 2$. Plugging this into equation VII14, we observe that the terms involving $w_B, w_C, w_D$ compute the 1-point statistics. S...
-
[21]
The 2-Gen solution after convergence. Now, we show that 2-Gen can be implemented by setting all other parameters except $w_C, \delta, \beta$ to zero. Specifically, $w_C = 1$, $\delta \to \infty$, $\beta \to \infty$ corresponds to the 2-Gen solution. Recall that, by definition, $\delta = P^{(1)}_{-1}$ and $\beta = \beta^{(2)}_3$. When all other parameters are set to zero, from equation VII14, we have $\hat{\pi} = \sum_{i \le N} A^{(2)}_{iN} x_i$, where ...
-
[22]
Acquisition of the 2-Gen solution. We now examine the kinetics of acquisition of the 2-Gen solution $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$. To do this, we expand the loss to first order in $\beta$ and $\delta$ around the unigram solution, $L(\beta, \delta) = L_{\text{1-Gen}} - c_\beta \beta - c_\delta \delta + \dots$ (VIII9). Recall from Section VIII1 that the 1-Gen solution corresponds to $w_A = 0$, $w_B = w_C = w_D = 1/3$ and the rest of the para...
-
[23]
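Testing predictions. a. Loss landscape. From the analysis in the previous sections, the model can be expressed as $\hat{\pi}_\tau = w_A \delta_{\mu\tau} + w_B \sum_{i \le N} A^{(1)}_{iN} \delta_{\tau s_i} + w_C \sum_{i \le N} A^{(2)}_{iN} \delta_{\tau s_i} + w_D \sum_{i \le N} \sum_{j \le i} A^{(2)}_{iN} A^{(1)}_{ji} \delta_{\tau s_j}$ (VIII25), where $A^{(1)}_{ji} = \frac{(e^{\delta} - 1)\,\delta_{j(i-1)} + 1}{i + e^{\delta} - 1}$ and $A^{(2)}_{ji} = \frac{\exp\left(\beta \sum_{k \le j} A^{(1)}_{kj} \delta_{s_i s_k}\right)}{\sum_{j'} \exp\left(\beta \sum_{k' \le j'} A^{(1)}_{k'j'} \delta_{s_i s_{k'}}\right)}$ (VIII26). Thus t...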
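The reduced predictor in Eqs. VIII25 and VIII26 is small enough to evaluate directly. The sketch below is one reading of those equations and not the authors' implementation: arrays are stored as `A[query, key]` (the transpose of the paper's index order), both attention layers are assumed causal, and the first layer is written as a masked softmax, which matches the closed form for every position that has a predecessor. The function names are hypothetical.

```python
import numpy as np

def softmax_rows(scores, mask):
    """Row-wise softmax restricted to entries where mask is True."""
    z = np.where(mask, scores, -np.inf)
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def layer1_attention(N, delta):
    """A1[i, j]: weight of query i on key j <= i, with score delta on j = i - 1."""
    causal = np.tril(np.ones((N, N), dtype=bool))
    return softmax_rows(delta * np.eye(N, k=-1), causal)

def sa_predict(seq, C, w, beta, delta):
    """Next-state distribution at the last position; w = (w_A, w_B, w_C, w_D)."""
    seq = np.asarray(seq)
    N = len(seq)
    X = np.eye(C)[seq]                                   # one-hot states (N, C)
    causal = np.tril(np.ones((N, N), dtype=bool))
    A1 = layer1_attention(N, delta)
    same = (seq[:, None] == seq[None, :]).astype(float)  # [s_i == s_k]
    # score2[i, j] = beta * sum_k A1[j, k] * [s_i == s_k]
    A2 = softmax_rows(beta * same @ A1.T, causal)
    q = N - 1                                            # last position
    w_A, w_B, w_C, w_D = w
    return (w_A * X[q] + w_B * A1[q] @ X + w_C * A2[q] @ X
            + w_D * A2[q] @ (A1 @ X))

seq = [1, 0, 1, 0, 1, 0]
p_init = sa_predict(seq, C=2, w=(0, 1 / 3, 1 / 3, 1 / 3), beta=0.0, delta=0.0)
p_2gen = sa_predict(seq, C=2, w=(0, 0, 1, 0), beta=50.0, delta=50.0)
```

With `beta = delta = 0` both attention maps are uniform and the output is a mixture of frequency-based averages, whereas large `beta` and `delta` with `w_C = 1` concentrate the second layer on states that followed earlier occurrences of the current state. Position 0 has no predecessor and attends to itself in this sketch, which is an edge-case choice made here.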
-
[24]
Ablating the first-order contribution in $\delta$. In equation VIII24, the dynamics of $\delta$ is governed by two terms of which only the first depends on $\beta$. If we set $F_1 = 0$, the dynamics change qualitatively. $F_1$ is non-zero due to subtle correlations between $s_{N-1}$ and $s_{N+1}$. We remove these correlations by resampling the token at position $N-1$ after generating the sequence, th...
-
[25]
The time to transition from the 1-Gen to the 2-Gen solution. Finally, a key result of our theory is an estimate for the number of iterations required to transition from the 1-Gen to the 2-Gen solution, which we call $\tau_{\text{2-Gen}}$. From equation VIII24, $\beta$ grows at a constant rate until it reaches the nonlinear regime, after which it increases rapidly and saturates sh...
-
[26]
Task Vector. In the 2-Mem circuit, MLP2 primarily reads from two inputs to produce the logit: MLP1 and Att2, the latter of which averages the outputs of MLP1 over the sequence. Consider the fluctuations of these inputs over sequences sampled from the same chain. MLP1 can only read the current state and the output of Att1, the latter of which is almost excl...
-
[27]
Representation Geometry. Since MLP2 infers the current task condition from $\phi$, it is desirable for instantiations of $\phi$ to be separable when the underlying sequences are from different tasks. We computed t-SNE for $\phi$ and were able to observe task-specific clustering forming through training on data diversities up to $K = 128$. The task vectors for this computation are...
-
[28]
Patching. We propose that MLP2 makes predictions using only the task information provided by $\phi$, and test this by analyzing the transformer behavior when the $\phi$ task information is incongruent with any other alternative source of task information. The transformer first performed a forward pass on a batch of sequences $S_A$ (Condition $S_A$). During the forward pass, t...
-
[29]
Information Content. In $M_2$, the first layer attention weight concentrates on the previous token, $A^{(1)}_{ni} \approx \delta(n - i - 1)$, in which case Att1 may be seen as forming a representation of 2-point statistics in the residual stream $x^{(1)}_n = x_n + W^{(1)}_V x_{n-1} := f(x_n, x_{n-1})$. The traced circuit indicates that MLP1 reads these pairs and produces an output vector for each that is then aggregate...
-
[30]
As in Figure 7a, each seed has been shifted along the horizontal axis by the value of $K_1^\ast$ determined for that seed by the $\varphi^{(2)}_\beta$ threshold criterion. Results derived using the same models used for Figure 7a.
-
[31]
Task Injection. To implement a memorizing predictor, the model must in part learn to differentiate between sequences from different tasks. The rate at which the model learns this can be increased by providing perfect task information to the model. To do this, we allowed the model to learn $D$-dimensional embeddings for each chain that were injected into MLP1 ...
-
[32]
Gradient Reweighting. We sought to perturb the model by modifying the effective learning rate for the 2-Gen circuit independent of the 1-Mem circuit. One way to accomplish this is by reducing the rate at which the first attention layer learns to attend to the previous token. We implemented a modified attention head in Python with a fixed weight factor w ∈ [0,1...
-
[33]
Path Expansion in the 2-Mem Phase. The circuit-tracing results for $M_2$ in Figure A3 indicate that the full transformer computation can be reduced to the dominant information flow shown schematically in Figure 8(c). This reduced path can be written as $x^{(0)}_n = W_E x_n$ (XII1), $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n})$ (XII2), $x^{(1)}_n = x^{(0)}_n + \mathrm{MLP}^{(1)}(y^{(1)}_n)$ (XII3), $y^{(2)}_n$ ...
-
[34]
Minimal Network Construction. First, we assume that Att1 extracts the preceding state and maps it to a subspace orthogonal to that of the current state. This allows us to replace $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) \longrightarrow y^{(1)}_n = x^{(0)}_n \oplus x^{(0)}_{n-1}$ (XII7), where $\oplus$ denotes concatenation into orthogonal subspaces. Second, we assume that Att2 performs a uniform poolin...
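The two simplifications above (current and preceding state concatenated into orthogonal subspaces, followed by uniform pooling) can be sketched as plain array operations. This is an illustration under stated assumptions: the zero vector used before the first position, the causal running mean used for the pooling, and the names `pair_features` and `uniform_causal_pool` are choices made here, since the excerpt is truncated before the construction is fully specified.

```python
import numpy as np

def pair_features(seq, C):
    """y_n = x_n (+) x_{n-1}: current and preceding one-hot states, concatenated."""
    X = np.eye(C)[np.asarray(seq)]                 # (N, C)
    # Assumption: the first position has no predecessor, so use a zero vector.
    prev = np.vstack([np.zeros((1, C)), X[:-1]])
    return np.concatenate([X, prev], axis=1)       # (N, 2C)

def uniform_causal_pool(H):
    """Running mean over positions: out[n] = mean(H[:n + 1])."""
    steps = np.arange(1, len(H) + 1)[:, None]
    return np.cumsum(H, axis=0) / steps

Y = pair_features([2, 0, 1], C=3)
# In the minimal model the pooled quantity would be the MLP1 output computed
# from each row of Y; the raw features stand in for it here.
pooled = uniform_causal_pool(Y)
```

Pooling per-pair vectors uniformly over the sequence yields a summary that depends on the empirical pair statistics of the context, which is the role the text assigns to the task vector.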
-
[35]
Phase Diagram. Figures 8d and 8f show that the minimal model training loss can fall below the 2-Gen baseline $L_{\text{2-Gen}}$ while the generalization loss remains significantly above it. This separation demonstrates that the minimal network successfully reproduces the 2-Mem predictor. As data diversity $K$ increases, Figure 8d exhibits a clear crossover behavior: for sm...
-
[36]
Dependence of $K_2^\ast$ on MLP Capacity. We denote the estimated critical data diversity for the minimal model by $\hat{K}_2^\ast$ to distinguish it from $K_2^\ast$ in the full model. We hypothesize that MLP1 is primarily responsible for extracting a task vector from the sequence, while MLP2 must memorize the collection of task-specific transition matrices and select the appropr...