Distinct mechanisms underlying in-context learning in transformers

Cole Gibson; Gautam Reddy; Wenping Cui

arXiv: 2604.12151 · v1 · submitted 2026-04-14 · 💻 cs.LG · cond-mat.dis-nn · cond-mat.stat-mech


Pith reviewed 2026-05-10 15:04 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · cond-mat.stat-mech

keywords: in-context learning, transformers, algorithmic phases, subcircuits, Markov chains, mechanistic interpretability, data diversity, generalization


The pith

Transformers implement in-context learning with two distinct multi-layer subcircuits whose use depends on training data diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers trained on a finite collection of discrete Markov chains develop four algorithmic phases for adapting their computations to input context. The phases are distinguished by whether the network memorizes or generalizes and by whether it extracts one-point or two-point statistics from the context. These phases arise from two qualitatively different multi-layer subcircuits that each perform context-adaptive computation in their own manner. The switch points between phases are fixed by the number of distinct chains K in the training set, with one transition produced by competition between the subcircuits and the other by a bottleneck in how information can be represented inside the network. A symmetry-based account of the training dynamics explains the abrupt change from one-point to two-point generalization.

Core claim

A transformer trained on a finite set S of discrete Markov chains exhibits four algorithmic phases characterized by memorization versus generalization and use of 1-point versus 2-point statistics. These phases are realized by multi-layer subcircuits that implement context-adaptive computations through two distinct mechanisms. The phase boundaries K1* and K2* are set by data diversity K = |S|, with K1* arising from kinetic competition between subcircuits and K2* from a representational bottleneck. A symmetry-constrained theory of training dynamics accounts for the sharp transition to 2-point generalization and the structure of the loss landscape that permits generalization.

What carries the argument

Multi-layer subcircuits that realize two qualitatively distinct mechanisms for context-adaptive computation, one for each combination of memorization/generalization and 1-point/2-point statistics.

If this is right

Below K1* the network memorizes instead of generalizing.
Above K2* the network switches to using two-point statistics for generalization.
Minimal models can be constructed that isolate the essential features of each subcircuit motif.
The symmetry-constrained theory predicts the conditions under which generalization occurs and identifies the relevant features of the loss landscape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phase structure and subcircuit distinction could be tested in transformers trained on language or image data that contain natural sequential statistics.
If the kinetic-competition and representational-bottleneck boundaries generalize, they could be used to predict which mechanism a model will adopt for a given training regime.
Designing training curricula that deliberately cross or avoid these boundaries might let practitioners select the desired in-context mechanism.

Load-bearing premise

That the four phases and two subcircuit mechanisms seen on a finite set of discrete Markov chains capture how in-context learning works in transformers trained on wider or continuous data distributions.

What would settle it

Train the same transformer architecture on continuous or real-world sequential data and check whether the same four phases, two subcircuit mechanisms, and the same K1* and K2* boundaries appear.

Figures

Figures reproduced from arXiv: 2604.12151 by Cole Gibson, Gautam Reddy, Wenping Cui.

**Figure 1.** (a) Data generating process for training and out-of-distribution sequences. Training sequences are generated ...

**Figure 2.** (a) Training dynamics of the transformer training loss (top) and generalization loss (bottom) at low, ...

**Figure 3.** Circuit tracing. (a) The transformer may be unrolled into a directed graph with each block taking input ...

**Figure 4.** (a) A 2-D t-SNE visualization of task vectors ...

**Figure 5.** (a) Schematic of the SA-Transformer for generalization. (b) Loss dynamics for the SA-Transformer trained ...

**Figure 6.** (a) Loss landscape in the ...

**Figure 7.** (a) The maximum value of the induction head parameter ...

**Figure 8.** (a) The time between the onset of ...

**Figure 9.** A schematic summarizing the factors that ...

Original abstract

Modern distributed networks, notably transformers, acquire a remarkable ability (termed 'in-context learning') to adapt their computation to input statistics, such that a fixed network can be applied to data from a broad range of systems. Here, we provide a complete mechanistic characterization of this behavior in transformers trained on a finite set $S$ of discrete Markov chains. The transformer displays four algorithmic phases, characterized by whether the network memorizes or generalizes, and whether it uses 1-point or 2-point statistics. We show that the four phases are implemented by multi-layer subcircuits that exemplify two qualitatively distinct mechanisms for implementing context-adaptive computations. Minimal models isolate the key features of both motifs. Memorization and generalization phases are delineated by two boundaries that depend on data diversity, $K = |S|$. The first ($K_1^\ast$) is set by a kinetic competition between subcircuits and the second ($K_2^\ast$) is set by a representational bottleneck. A symmetry-constrained theory of a transformer's training dynamics explains the sharp transition from 1-point to 2-point generalization and identifies key features of the loss landscape that allow the network to generalize. Put together, we show that transformers develop distinct subcircuits to implement in-context learning and identify conditions that favor certain mechanisms over others.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

On finite discrete Markov chains, this paper identifies four ICL phases implemented by two distinct multi-layer subcircuits, with boundaries set by kinetic competition and a representational bottleneck.


The paper shows that transformers trained on a finite set of discrete Markov chains fall into four algorithmic phases depending on whether they memorize or generalize and whether they rely on 1-point or 2-point statistics. These phases are realized by two qualitatively different multi-layer subcircuit motifs, and the transitions between them are explained by a symmetry-constrained account of the loss landscape and training dynamics. Minimal models are used to isolate the key features of each motif, with the first boundary K1* arising from competition between subcircuits and the second K2* from a representational bottleneck that depends on data diversity K = |S|.

Referee Report

2 major / 2 minor

Summary. The paper claims that transformers trained on a finite set S of discrete Markov chains exhibit exactly four algorithmic phases of in-context learning, defined by combinations of memorization vs. generalization and use of 1-point vs. 2-point statistics. These phases are realized by two qualitatively distinct multi-layer subcircuit motifs, separated by boundaries K1* (set by kinetic competition between subcircuits) and K2* (set by a representational bottleneck). A symmetry-constrained theory of training dynamics is said to explain the sharp 1-point to 2-point transition and key features of the loss landscape.

Significance. If the mechanistic dissection holds, the work is significant for providing concrete subcircuit-level accounts of context-adaptive computation in transformers and for isolating the roles of data diversity K = |S| in selecting among mechanisms. The use of minimal models to extract the essential features of each motif is a clear strength, as is the explicit scoping to enumerable 1- and 2-point statistics on finite discrete chains.

major comments (2)

[Theory section] The symmetry-constrained theory of training dynamics is presented as explanatory for the sharp transition at K1*, yet it is unclear from the provided description whether the kinetic-competition boundary is derived independently or reduces to quantities fitted from the same experimental runs that define the phases (see abstract and the theory section). This risks circularity in the account of how the loss landscape permits generalization.
[Results on subcircuits] The central claim that the four phases are implemented by two qualitatively distinct multi-layer subcircuit mechanisms rests on identification of motifs for the finite discrete Markov-chain regime. The manuscript must demonstrate that these motifs do not collapse or require entirely different circuits when the data distribution is continuous or high-dimensional, as the current construction rules out neither possibility.

minor comments (2)

[Abstract] The abstract states that 'a complete mechanistic characterization' is provided, but the methods for identifying and verifying the subcircuits (e.g., via ablation, activation patching, or circuit discovery) are not summarized; a brief methods paragraph would improve accessibility.
[Introduction] Notation for K1* and K2* is introduced without an explicit equation linking them to the loss landscape or to the symmetry constraints; adding a short definitional equation would clarify the boundaries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to improve the presentation.


Referee: [Theory section] The symmetry-constrained theory of training dynamics is presented as explanatory for the sharp transition at K1*, yet it is unclear from the provided description whether the kinetic-competition boundary is derived independently or reduces to quantities fitted from the same experimental runs that define the phases (see abstract and the theory section). This risks circularity in the account of how the loss landscape permits generalization.

Authors: We appreciate the referee pointing out this potential ambiguity. The symmetry-constrained theory is constructed by imposing the symmetries of the Markov chain ensemble on the transformer's training dynamics, which allows us to derive the kinetic competition between subcircuits and predict the boundary K1* without reference to specific experimental data. The experiments then serve to test and illustrate these theoretical predictions. To make this separation explicit, we will revise the theory section to include a more detailed derivation of the boundary from symmetry arguments alone, followed by a comparison to experimental results. This should resolve any perception of circularity.
Revision: yes
Referee: [Results on subcircuits] The central claim that the four phases are implemented by two qualitatively distinct multi-layer subcircuit mechanisms rests on identification of motifs for the finite discrete Markov-chain regime. The manuscript must demonstrate that these motifs do not collapse or require entirely different circuits when the data distribution is continuous or high-dimensional, as the current construction rules out neither possibility.

Authors: The manuscript is scoped to the finite discrete Markov chain regime, as emphasized in the abstract, where K = |S| is finite and the statistics are discrete and enumerable. In this setting, we provide a complete characterization of the two distinct subcircuit mechanisms. We agree that it is an open question whether these motifs generalize to continuous or high-dimensional distributions, and our current results do not address or rule out alternative circuits in those cases. We will add a limitations paragraph in the discussion to explicitly state the scope and suggest that investigating continuous distributions is an important direction for future research. This does not alter the conclusions for the discrete case studied.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper's core claims rest on empirical identification of four phases and two subcircuit motifs in transformers trained on a finite discrete Markov chain set S, with boundaries K1* and K2* delineated by kinetic competition and representational bottleneck, plus a symmetry-constrained training dynamics theory. These are presented as observations and explanatory models derived from the training runs and minimal models, without any quoted reduction where a 'prediction' or first-principles result is definitionally equivalent to fitted inputs or prior self-citations. The derivation remains self-contained against the stated scope; no load-bearing step collapses to tautology or renaming of known results by construction. External benchmarks or code reproduction would be needed to assess broader validity, but none is required for circularity.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that discrete Markov chains are representative of general in-context learning and on the ad-hoc classification of observed behaviors into exactly four exhaustive phases.

free parameters (2)

K1*
First phase boundary, a critical value of the data diversity K, set by kinetic competition between subcircuits
K2*
Second phase boundary, a critical value of the data diversity K, set by a representational bottleneck

axioms (2)

ad hoc to paper: Transformers trained on finite discrete Markov chains exhibit exactly four algorithmic phases of in-context learning
Invoked to structure the entire characterization
domain assumption: The observed subcircuits implement context-adaptive computation via 1-point or 2-point statistics
Core modeling choice for the mechanistic analysis

pith-pipeline@v0.9.0 · 5539 in / 1472 out tokens · 72309 ms · 2026-05-10T15:04:05.157163+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

[1]

Memorization and task vectors. $\phi$: task vector, a representation of the generating chain produced internally by the $M_2$ transformer. $D_\phi$: dimension of the task vector in the minimal model. II: Transformer architecture

[2]

Data generation. We follow a meta-training setup, where the parameters of a transformer are optimized on data generated from multiple distinct stochastic processes (‘tasks’). Each process is specified by a Markovian transition matrix over $C$ states. The transformer is trained on a fixed set of chains $S = \{T^{(1)}, T^{(2)}, \ldots, T^{(K)}\}$. These transition matrices ...

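The data-generating setup described in this snippet can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the snippet is cut off before it says how the transition matrices are drawn, so the symmetric Dirichlet prior on each column, the uniform initial state, and the helper names `sample_chains` and `sample_sequence` are all hypothetical.

```python
import numpy as np

def sample_chains(K, C, alpha=1.0, rng=None):
    """Draw K transition matrices over C states, indexed T[k, tau, mu].

    Assumption: each column T[k, :, mu] (next-state distribution given the
    current state mu) is drawn from a symmetric Dirichlet(alpha).
    """
    rng = np.random.default_rng(rng)
    # dirichlet returns probability vectors along the last axis: [k, mu, tau]
    T = rng.dirichlet(alpha * np.ones(C), size=(K, C))
    return np.transpose(T, (0, 2, 1))  # reindex to [k, tau, mu]

def sample_sequence(T, N, rng=None):
    """Sample N states from one chain with T[tau, mu] = P(next=tau | current=mu)."""
    rng = np.random.default_rng(rng)
    C = T.shape[0]
    seq = np.empty(N, dtype=int)
    seq[0] = rng.integers(C)  # assumption: uniform initial state
    for n in range(1, N):
        seq[n] = rng.choice(C, p=T[:, seq[n - 1]])
    return seq

S = sample_chains(K=4, C=3, rng=0)              # the fixed training set of chains
k = np.random.default_rng(1).integers(len(S))   # pick one training chain at random
seq = sample_sequence(S[k], N=20, rng=2)
```

In this sketch, training sequences come from the fixed set `S`, while out-of-distribution sequences would come from freshly drawn chains, for example `sample_chains(1, 3)[0]`.
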
[3]

Network architecture. Here, we describe our primary architecture, which is the two-layer transformer illustrated in Figure 2a. A transformer takes a sequence of vector-embedded states as input and produces a probability distribution over the next state as output [1]. Since the data generating process generates a sequence over $C$ states, each sequence is fir...

[4]

Training Process. The parameters of the model are optimized by minimizing the auto-regressive cross-entropy sequence prediction loss,

$$L_{\mathrm{train}}(\hat\pi_\theta) = \left\langle -\frac{1}{N} \sum_{n=1}^{N} \log \hat\pi_\theta(s_{n+1} \mid S_n) \right\rangle_{T \sim S,\; S_{N+1} \sim T}, \tag{II13}$$

where $S_n = (s_1, s_2, \ldots, s_n)$ for all $n$. We drop the argument and write $L_{\mathrm{train}}(\hat\pi) = L_{\mathrm{train}}$ when the predictor the loss is evaluated for is clear from conte...

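A Monte Carlo estimate of the loss in Eq. II13 can be sketched for any next-state predictor. The callable interface (`predict(context)` returning a probability vector) and the uniform placeholder predictor are assumptions made for illustration. They stand in for the transformer, which is not reproduced here.

```python
import numpy as np

def sequence_loss(predict, seq):
    """-(1/N) * sum_{n=1}^{N} log pi(s_{n+1} | S_n) for one sequence of length N+1."""
    N = len(seq) - 1
    # seq[:n] is the prefix S_n and seq[n] is the state s_{n+1} to be predicted
    logp = [np.log(predict(seq[:n])[seq[n]]) for n in range(1, N + 1)]
    return -np.mean(logp)

def train_loss(predict, sequences):
    """Average the per-sequence loss over a batch (the outer average in Eq. II13)."""
    return float(np.mean([sequence_loss(predict, s) for s in sequences]))

# Hypothetical predictor used only to exercise the loss: uniform over C states.
C = 3
uniform = lambda context: np.full(C, 1.0 / C)
batch = [np.array([0, 1, 2, 1, 0]), np.array([2, 2, 1, 0, 0])]
loss = train_loss(uniform, batch)  # equals log(C) for the uniform predictor
```

The same function evaluates a generalization loss if the batch is drawn from chains outside the training set.
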
[5]

Bayes Predictors. Across task diversities and over the course of training, we find that a transformer’s behavior can be well-characterized by four algorithms (illustrated in Figure 1c). These algorithms are specific implementations of the Bayes-optimal predictor, which first infers the underlying transition matrix from an observed sequence $S_N$ and then predicts the distribu...

[6]

Memorization. We model memorization by ideal predictors that have complete information about the transition matrices in $S$. In this case, memorizing predictors correspond to the general Bayes predictor (Eq. III2) when the prior $P(T)$ matches $S$. Making the substitution $P(T) = \frac{1}{K} \sum_{k=1}^{K} \delta(T - T^{(k)})$, we have $\hat\pi^{\mathrm{Mem}}_n(\tau \mid \mu) = \frac{1}{K} \sum_{k=1}^{K} \frac{P(S_n \mid T^{(k)})}{P(S_n)} \, T^{(k)}_{\tau\mu}$...

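The memorizing predictor amounts to a posterior-weighted average of the known transition matrices. The sketch below assumes a uniform prior over the K chains (as in the substitution above) and an initial-state probability that is the same for every chain, so it cancels in the posterior. The function name and the two toy chains are hypothetical.

```python
import numpy as np

def mem_predictor(chains, context):
    """Posterior-weighted next-state distribution over a known set of chains.

    chains[k][tau, mu] = P(next=tau | current=mu); context is the prefix S_n.
    """
    chains = np.asarray(chains)
    context = np.asarray(context)
    prev, nxt = context[:-1], context[1:]
    # log P(S_n | T^(k)) up to the initial-state term (assumed equal for all k)
    loglik = np.log(chains[:, nxt, prev]).sum(axis=1)
    post = np.exp(loglik - loglik.max())
    post /= post.sum()                  # P(T^(k) | S_n) under a uniform prior
    mu = context[-1]
    return post @ chains[:, :, mu]      # sum_k P(T^(k) | S_n) * T^(k)[:, mu]

# Two toy chains over 3 states (illustrative values only).
T0 = np.full((3, 3), 0.05)
T1 = np.full((3, 3), 0.05)
for mu in range(3):
    T0[(mu + 1) % 3, mu] = 0.9   # mostly cycles 0 -> 1 -> 2 -> 0
    T1[mu, mu] = 0.9             # mostly stays in place
p = mem_predictor([T0, T1], [0, 1, 2, 0, 1])
```

Because the context follows the cycling pattern, the posterior concentrates on `T0` and the most likely next state from state 1 is state 2.
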
[7]

Generalization. We model generalization by ideal predictors that predict the next state given complete knowledge of the data distribution $D_T$ from which the transition matrices in $S$ are sampled. Generalizing predictors thus correspond to the Bayes-optimal predictors (Eq. III2) when the assumed prior $P(T)$ matches $D_T$. a. 1-point generalization. For 1-point statist...

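The difference between 1-point and 2-point statistics can be made concrete with two simple count-based predictors. The snippet is truncated before it gives the paper's formulas, so both functions below are hypothetical stand-ins: they use add-alpha smoothing, which is the posterior predictive for transition counts only if each column of the transition matrix has a symmetric Dirichlet(alpha) prior. That prior is an assumption here.

```python
import numpy as np

def one_point_predictor(context, C, alpha=1.0):
    """Smoothed unigram frequencies of the context (1-point statistics only)."""
    counts = np.bincount(context, minlength=C)
    return (counts + alpha) / (counts.sum() + alpha * C)

def two_point_predictor(context, C, alpha=1.0):
    """Smoothed counts of transitions out of the current state (2-point statistics)."""
    context = np.asarray(context)
    mu = context[-1]
    followers = context[1:][context[:-1] == mu]  # states observed right after mu
    counts = np.bincount(followers, minlength=C)
    return (counts + alpha) / (counts.sum() + alpha * C)

ctx = np.array([0, 1, 0, 1, 0, 2, 0])
p1 = one_point_predictor(ctx, C=3)   # state 0 is the most frequent state
p2 = two_point_predictor(ctx, C=3)   # state 1 most often follows state 0
```

On this context the two predictors disagree about the most likely next state, which is the kind of behavioral difference that separates the 1-point and 2-point phases.
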
[8]

Scaling of the four Bayesian predictors with $K$ and $N$. Figure A1 compares the loss incurred by the four Bayesian predictors as a function of data diversity $K$ and sequence length $N$ on training sequences. FIG. A1: Scaling of the four predictor losses with data diversity $K$ and sequence length $N$ computed over a large batch of sequences and averaged over 8 task dis...

[9]

We find plateaus in both the training and generalization losses, which correspond closely with loss values of the predictors (Figure 2a)

Behavioral Readouts. We evaluated the training and generalization loss of each model checkpoint on a sample of $8 \times K$ and 2048 sequences respectively, where the generalization loss is given by

$$L_{\mathrm{gen}} = \left\langle -\frac{1}{N} \sum_{n=1}^{N} \log \hat\pi_\theta(s_{n+1} \mid S_n) \right\rangle_{T \sim D_T,\; S_{N+1} \sim T}. \tag{IV1}$$

This allowed us to compare the loss values through training to the loss of each predictor on the same sequ...

[10]

Mechanistic Readouts. Only the attention layers of the transformer architecture can mix information along the sequence dimension. Consequently, for the model to infer nearest-neighbor 2-point correlations (i.e., bigrams) in a sequence, at least one attention layer must attend to the previous state. To identify when an attention layer exhibits this behavi...

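One possible way to quantify "attends to the previous state" is the mean attention weight placed on position n-1. The snippet is truncated before it states the paper's actual criterion, so the score below and the name `prev_token_score` are assumptions for illustration.

```python
import numpy as np

def prev_token_score(attn):
    """Mean attention weight from each query n >= 1 to key n-1.

    attn[n, i] is the causal, row-normalized weight query n puts on key i.
    """
    attn = np.asarray(attn)
    return float(np.mean(np.diagonal(attn, offset=-1)))

N = 6
# Uniform causal attention: query n spreads weight 1/(n+1) over keys 0..n.
uniform = np.tril(np.ones((N, N)))
uniform /= uniform.sum(axis=1, keepdims=True)
# Idealized previous-token head: every query n >= 1 attends only to key n-1.
shifted = np.eye(N, k=-1)
shifted[0, 0] = 1.0  # the first token can only attend to itself

score_uniform = prev_token_score(uniform)
score_shifted = prev_token_score(shifted)
```

A score near 1 indicates a previous-token head, while uniform attention gives a score that shrinks as the context grows.
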
[11]

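Path Expansion. We first explain the nature of the circuit edges we consider. Recall the construction of the residual stream after each block:

$$\begin{aligned}
x^{(0)}_n &= W_E x_n && \text{(V1)} \\
y^{(1)}_n &= x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big) && \text{(V2)} \\
x^{(1)}_n &= x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big) + \mathrm{MLP}^{(1)}\big(y^{(1)}_n\big) && \text{(V3)} \\
y^{(2)}_n &= x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big) + \mathrm{MLP}^{(1)}\big(y^{(1)}_n\big) + \mathrm{Att}^{(2)}\big(x^{(1)}_{\le n}\big) && \text{(V4)}
\end{aligned}$$

$x^{(2)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big) +$ ...
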
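The residual-stream construction in Eqs. V1 to V4 can be sketched as a forward pass that records the additive output of each block. The attention and MLP blocks here are toy stand-ins (uniform causal averaging and `tanh`), not the trained model. The final line for the second MLP is assumed by analogy, since the source equation is truncated.

```python
import numpy as np

def forward_with_edges(x0, att1, mlp1, att2, mlp2):
    """Two-block residual forward pass that caches each block's additive output."""
    edges = {"embed": x0}
    edges["att1"] = att1(x0)
    y1 = x0 + edges["att1"]          # Eq. V2
    edges["mlp1"] = mlp1(y1)
    x1 = y1 + edges["mlp1"]          # Eq. V3
    edges["att2"] = att2(x1)
    y2 = x1 + edges["att2"]          # Eq. V4
    edges["mlp2"] = mlp2(y2)         # assumed form of the truncated last equation
    x2 = y2 + edges["mlp2"]
    return x2, edges

def causal_mean(x):
    """Toy stand-in for attention: uniform causal averaging over positions."""
    return np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]

x0 = np.random.default_rng(0).normal(size=(5, 4))  # (positions, model dimension)
x2, edges = forward_with_edges(x0, causal_mean, np.tanh, causal_mean, np.tanh)
```

Since the final residual stream is the sum of all cached terms, any single term can be replaced or removed to probe how much that connection matters.
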
[12]

Circuit tracing. We first measured the importance of each connection in producing the observed transformer behavior. To do so, we developed a custom Python implementation of the model forward pass that explicitly exposes the vector passed along each layer connection. This allows the passed vectors along all edges to be cached during a forward pass of the u...

[13]

Reduction of the Network and Circuits. We first introduce simplifications at the architectural level. (i) Fixed one-hot embeddings. Since sequence states are embedded as orthogonal one-hot vectors, we eliminate the embedding matrix $W_E$ and directly work with the one-hot token representations. (ii) Disentangled value subspaces. We assume that the value matr...

[14]

Symmetry Reduction. In our case, since the generative model for transition matrices $T$ is symmetric over the $C$ token classes, all $C$ token classes are statistically identical. This symmetry implies that the four $W$ matrices will have equal diagonal terms and equal off-diagonal terms. The off-diagonal terms can be set to zero as they contribute a constant offset; for ...

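The "constant offset" remark can be checked numerically in a simple setting. The sketch assumes that a symmetric matrix W acts on a one-hot token and that the result feeds a softmax. The snippet is truncated before it says where the offset is absorbed, so that setting and the values of `a` and `b` are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

C = 4
a, b = 2.0, 0.7
# Symmetric form: equal diagonal terms a and equal off-diagonal terms b.
W = b * np.ones((C, C)) + (a - b) * np.eye(C)
x = np.eye(C)[2]  # one-hot token

full = softmax(W @ x)
# Dropping the off-diagonal part leaves (a - b) on the diagonal only.
reduced = softmax((a - b) * np.eye(C) @ x)
```

For a one-hot input, `W @ x` equals `(a - b) * x` plus the constant vector `b`, and a softmax is unchanged by adding the same constant to every entry.
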
[15]

Numerical validation. a. Training details. Denote the last token in the sequence, $x_N$, as $\mu$. The cross-entropy loss averaged over input sequences given transition matrix $T$ is

$$L = -\left\langle \sum_{\mu,\tau} p_\mu T_{\tau\mu} \log \pi_\tau(x_{1:N}) \right\rangle, \tag{VII12}$$

where $T_{\tau\mu}$ is the probability that the next token is $\tau$ given that the current token is $\mu$ and $p_\mu$ is the stationary probability of token $\mu$ for transition matrix $T$. ...

[16]

The SA-transformer is trained using stochastic gradient descent (SGD) with learning rate 1 and a batch size of 256. b. Training results. The training loss is shown in Figure 5b, and shows that the SA-transformer reproduces the abrupt learning of the full network (compare with Figure 2b for $K = 1024$, where the transformer rapidly learns the 1-Gen solution and...

[17]

The network first enters $G_1$ before entering $G_2$. We show that $w_A \to 0$ and $w_B = w_C = w_D = 1/3$ in $G_1$, whereas the rest of the parameters remain at zero. FIG. A4: Parameters of the attention circuits defined in equation VII13 at different iterations. The corresponding loss dynamics are given in Figure 5(b).

[18]

Next, we show that $G_2$ corresponds to $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$ in the SA-transformer. The other terms involving $w_A$, $w_B$, $w_D$ do not contribute to the solution after convergence.

[19]

Third, we show that accurately capturing the training dynamics of $\beta$, $\delta$ requires a careful computation of expectations when expanding the loss in a Taylor series near the 1-Gen solution. While the 2-Gen solution involves a second-order term in the loss of the form $\beta\delta$ (which would imply a saddle-point at the origin $\beta, \delta = 0$), there are subtle first-order contribu...

[20]

Competition between $x_N$ and the 1-Gen solution. The network nearly implements the 1-Gen solution at initialization except for a contribution due to the first term involving $w_A$ in equation VII14. At initialization, we have $A^{(\ell)}_{ji} = 1/i$ for $\ell = 1, 2$. Plugging this into equation VII14, we observe that the terms involving $w_B$, $w_C$, $w_D$ compute the 1-point statistics. S...

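The claim that uniform attention weights $A_{ji} = 1/i$ yield 1-point statistics can be illustrated directly. The sketch assumes one-hot state vectors and that $A_{ji}$ is the weight position $i$ places on position $j \le i$. Both are readings of the snippet rather than details confirmed by it, and the function name is hypothetical.

```python
import numpy as np

def uniform_causal_attention(N):
    """A[j, i] = 1/(i+1) for j <= i (0-based): query i averages keys 0..i."""
    A = np.triu(np.ones((N, N)))
    return A / np.arange(1, N + 1)[None, :]

seq = np.array([0, 2, 2, 1, 2])
C = 3
X = np.eye(C)[seq]                 # one-hot states, shape (N, C)
A = uniform_causal_attention(len(seq))
pooled = A.T @ X                   # row i holds sum_j A[j, i] * x_j
unigram = np.bincount(seq, minlength=C) / len(seq)
```

The last row of `pooled` equals the empirical state frequencies of the whole sequence, so with uniform attention the pooled terms carry only unigram information.
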
[21]

The 2-Gen solution after convergence. Now, we show that 2-Gen can be implemented by setting all other parameters except $w_C$, $\delta$, $\beta$ to zero. Specifically, $w_C = 1$, $\delta \to \infty$, $\beta \to \infty$ corresponds to the 2-Gen solution. Recall that, by definition, $\delta = P^{(1)}_{-1}$ and $\beta = \beta^{(2)}_{3}$. When all other parameters are set to zero, from equation VII14, we have $\hat\pi = \sum_{i \le N} A^{(2)}_{iN} x_i$, where ...

[22]

Acquisition of the 2-Gen solution. We now examine the kinetics of acquisition of the 2-Gen solution $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$. To do this, we expand the loss to first order in $\beta$ and $\delta$ around the unigram solution,

$$L(\beta, \delta) = L_{\text{1-Gen}} - c_\beta \beta - c_\delta \delta + \ldots \tag{VIII9}$$

Recall from Section VIII1 that the 1-Gen solution corresponds to $w_A = 0$, $w_B = w_C = w_D = 1/3$ and the rest of the para...

[23]

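Testing predictions. a. Loss landscape. Through the analysis in the previous sections, we can express the model as

$$\hat\pi_\tau = w_A \delta_{\mu\tau} + w_B \sum_{i \le N} A^{(1)}_{iN} \delta_{\tau s_i} + w_C \sum_{i \le N} A^{(2)}_{iN} \delta_{\tau s_i} + w_D \sum_{i \le N} \sum_{j \le i} A^{(2)}_{iN} A^{(1)}_{ji} \delta_{\tau s_j}, \tag{VIII25}$$

where

$$A^{(1)}_{ji} = \frac{(e^{\delta} - 1)\,\delta_{j(i-1)} + 1}{i + e^{\delta} - 1} \quad \text{and} \quad A^{(2)}_{ji} = \frac{\exp\left(\beta \sum_{k \le j} A^{(1)}_{kj} \delta_{s_i s_k}\right)}{\sum_{j'} \exp\left(\beta \sum_{k' \le j'} A^{(1)}_{k'j'} \delta_{s_i s_{k'}}\right)}. \tag{VIII26}$$

Thus t...
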
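The reduced model in Eqs. VIII25 and VIII26 is small enough to evaluate directly. The sketch below is one reading of the extracted formulas, not the authors' implementation: it takes $A_{ji}$ to be the weight that query position $i$ places on key position $j \le i$, evaluates the prediction at the last position only, and uses large finite values of $\beta$ and $\delta$ in place of the limits. The function name and test sequence are hypothetical.

```python
import numpy as np

def sa_predictor(seq, C, wA, wB, wC, wD, beta, delta):
    """Evaluate Eq. VIII25 at the last position N (math indices are 1-based)."""
    seq = np.asarray(seq)
    N = len(seq)
    X = np.eye(C)[seq]                        # X[i-1, tau] = delta_{tau, s_i}
    e = np.expm1(delta)                       # e^delta - 1
    i = np.arange(1, N + 1)
    # A1[j-1, i-1] = ((e^delta - 1) * [j == i-1] + 1) / (i + e^delta - 1), j <= i
    A1 = (np.triu(np.ones((N, N))) + e * np.eye(N, k=1)) / (i + e)[None, :]
    # score[j-1] = beta * sum_{k <= j} A1[k, j] * delta_{s_N, s_k}
    match = (seq == seq[-1]).astype(float)
    score = beta * (match @ A1)
    A2 = np.exp(score - score.max())
    A2 /= A2.sum()                            # A2[i-1] = A^(2)_{iN}
    return (wA * X[-1]                        # copy of the current token
            + wB * (A1[:, -1] @ X)            # layer-1 pooling
            + wC * (A2 @ X)                   # layer-2 pooling
            + wD * (A2 @ (A1.T @ X)))         # composition of both layers

seq = [0, 1, 0, 2, 0, 1, 0]
# Near the stated 2-Gen limit: w_C = 1 with large beta and delta.
p_2gen = sa_predictor(seq, 3, 0.0, 0.0, 1.0, 0.0, beta=50.0, delta=20.0)
# At the stated 1-Gen point: w_B = w_C = w_D = 1/3 with beta = delta = 0.
p_1gen = sa_predictor(seq, 3, 0.0, 1/3, 1/3, 1/3, beta=0.0, delta=0.0)
```

In this example the current state 0 was followed by states 1, 2, and 1, so `p_2gen` is close to the bigram frequencies `[0, 2/3, 1/3]`. `p_1gen` depends only on where each state occurs in the context and is a normalized distribution.
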
[24]

Ablating the first-order contribution in $\delta$. In equation VIII24, the dynamics of $\delta$ is governed by two terms of which only the first depends on $\beta$. If we set $F_1 = 0$, the dynamics change qualitatively. $F_1$ is non-zero due to subtle correlations between $s_{N-1}$ and $s_{N+1}$. We remove these correlations by resampling the token at position $N-1$ after generating the sequence, th...

[25]

The time to transition from the 1-Gen to the 2-Gen solution. Finally, a key result of our theory is an estimate for the number of iterations required to transition from the 1-Gen to the 2-Gen solution, which we call $\tau_{\text{2-Gen}}$. From equation VIII24, $\beta$ grows at a constant rate until it reaches the nonlinear regime, after which it increases rapidly and saturates sh...

[26]

Task Vector. In the 2-Mem circuit, MLP2 primarily reads from two inputs to produce the logit: MLP1 and Att2, the latter of which averages the outputs of MLP1 over the sequence. Consider the fluctuations of these inputs over sequences sampled from the same chain. MLP1 can only read the current state and the output of Att1, the latter of which is almost excl...

[27]

Representation Geometry. Since MLP2 infers the current task condition from $\phi$, it is desirable for instantiations of $\phi$ to be separable when the underlying sequences are from different tasks. We computed t-SNE for $\phi$ and were able to observe task-specific clustering forming through training on data diversities up to $K = 128$. The task vectors for this computation are...

[28]

Patching. We propose that MLP2 makes predictions using only the task information provided by $\phi$, and test this by analyzing the transformer behavior when the $\phi$ task information is incongruent with any other alternative source of task information. The transformer first performed a forward pass on a batch of sequences $S_A$ (Condition $S_A$). During the forward pass, t...

[29]

Information Content. In $M_2$, the first-layer attention weight concentrates on the previous token, $A^{(1)}_{ni} \approx \delta(n - i - 1)$, in which case Att1 may be seen as forming a representation of 2-point statistics in the residual stream, $x^{(1)}_n = x_n + W^{(1)}_V x_{n-1} := f(x_n, x_{n-1})$. The traced circuit indicates that MLP1 reads these pairs and produces an output vector for each that is then aggregated across the sequence by Att2 to form the task vector $\phi$ ...

[30]

As in Figure 7a, each seed has been shifted along the horizontal axis by the value of $K_1^\ast$ determined for that seed by the $\phi^{(2)}_\beta$ threshold criteria. Results derived using the same models used for Figure 7a.

[31]

Task Injection. To implement a memorizing predictor, the model must in part learn to differentiate between sequences from different tasks. The rate at which the model learns this can be increased by providing perfect task information to the model. To do this, we allowed the model to learn $D$-dimensional embeddings for each chain that were injected into MLP1 ...

[32]

Gradient Reweighting. We sought to perturb the model by modifying the effective learning rate for the 2-Gen circuit independent of the 1-Mem circuit. One way to accomplish this is by reducing the rate at which the first attention layer learns to attend to the previous token. We implemented a modified attention head in Python with a fixed weight factor w ∈ [0,1...

[33]

Path Expansion in the 2-Mem Phase. The circuit-tracing results for $M_2$ in Figure A3 indicate that the full transformer computation can be reduced to the dominant information flow shown schematically in Figure 8(c). This reduced path can be written as

$$\begin{aligned}
x^{(0)}_n &= W_E x_n, && \text{(XII1)} \\
y^{(1)}_n &= x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big), && \text{(XII2)} \\
x^{(1)}_n &= x^{(0)}_n + \mathrm{MLP}^{(1)}\big(y^{(1)}_n\big), && \text{(XII3)}
\end{aligned}$$

$y^{(2)}_n$ ...

[34]

Minimal Network Construction. First, we assume that Att1 extracts the preceding state and maps it to a subspace orthogonal to that of the current state. This allows us to replace

$$y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}\big(x^{(0)}_{\le n}\big) \;\longrightarrow\; y^{(1)}_n = x^{(0)}_n \oplus x^{(0)}_{n-1}, \tag{XII7}$$

where $\oplus$ denotes concatenation into orthogonal subspaces. Second, we assume that Att2 performs a uniform poolin...

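The two assumptions above (Att1 exposes the previous state alongside the current one, and Att2 pools uniformly) suggest a simple picture of how a task vector could arise. In the sketch below the learned MLP1 is replaced by a fixed outer-product feature of each (current, previous) pair. This is a hypothetical stand-in chosen so the pooled vector is easy to interpret, not the paper's minimal model.

```python
import numpy as np

def task_vector(seq, C):
    """Uniformly pool a per-position feature of (current, previous) state pairs.

    Assumption: MLP1 is replaced by a fixed outer-product feature, so the
    result is the flattened, normalized table of bigram counts.
    """
    seq = np.asarray(seq)
    X = np.eye(C)[seq]
    # Feature for position n >= 2: outer product of x_n and x_{n-1}.
    pairs = np.einsum("nc,nd->ncd", X[1:], X[:-1])
    return pairs.mean(axis=0).ravel()   # uniform pooling over positions

phi = task_vector([0, 1, 0, 1, 2], C=3)
```

With this stand-in, sequences drawn from the same chain give similar pooled vectors, and the entry at index `current * C + previous` is the fraction of observed transitions of that type.
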
[35]

Phase Diagram. Figures 8d and 8f show that the minimal model training loss can fall below the 2-Gen baseline $L_{\text{2-Gen}}$ while the generalization loss remains significantly above it. This separation demonstrates that the minimal network successfully reproduces the 2-Mem predictor. As data diversity $K$ increases, Figure 8d exhibits a clear crossover behavior: for sm...

[36]

Dependence of $K_2^\ast$ on MLP Capacity. We denote the estimated critical data diversity for the minimal model by $\hat{K}_2^\ast$ to distinguish it from $K_2^\ast$ in the full model. We hypothesize that MLP1 is primarily responsible for extracting a task vector from the sequence, while MLP2 must memorize the collection of task-specific transition matrices and select the appropr...


[1] [1]

Memorization and task vectors φTask vector, representation of the generating chain produced internally by theM 2 transformer Dφ Dimension of the task vector in the minimal model 19 II: Transformer architecture

work page

[2] [2]

Each process is specified by a Markovian transition matrix overC states

Data generation We follow a meta-training setup, where the parameters of a transformer are optimized on data generated from multiple distinct stochastic processes (‘tasks’). Each process is specified by a Markovian transition matrix overC states. The transformer is trained on a fixed set of chainsS={T (1), T (2), . . . , T(K) }. These transition matrices ...

[3]

A transformer takes a sequence of vector-embedded states as input and produces a probability distribution over the next state as output [1]

Network architecture. Here, we describe our primary architecture, which is the two-layer transformer illustrated in Figure 2a. A transformer takes a sequence of vector-embedded states as input and produces a probability distribution over the next state as output [1]. Since the data generating process generates a sequence over $C$ states, each sequence is fir...

[4]

Training Process. The parameters of the model are optimized by minimizing the auto-regressive cross-entropy sequence prediction loss, $\mathcal{L}_{\mathrm{train}}(\hat\pi_\theta) = \left\langle -\frac{1}{N}\sum_{n=1}^{N}\log\hat\pi_\theta(s_{n+1}\mid S_n)\right\rangle_{T\sim\mathcal{S},\ S_{N+1}\sim T}$, (II13) where $S_n = (s_1, s_2, \ldots, s_n)$ for all $n$. We drop the argument and write $\mathcal{L}_{\mathrm{train}}(\hat\pi) = \mathcal{L}_{\mathrm{train}}$ when the predictor for which the loss is evaluated is clear from conte...
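The loss in (II13) can be written out for an arbitrary predictor. The sketch below is illustrative only: `predictor` is a hypothetical callable that maps a context $(s_1, \ldots, s_n)$ to a list of next-state probabilities, and the average over chains and sequences is replaced by a plain mean over a supplied list of sequences.

```python
import math

def sequence_loss(predictor, seq):
    """Average of -log p(s_{n+1} | s_1..s_n) over n = 1..N for a sequence of length N + 1."""
    N = len(seq) - 1
    total = 0.0
    for n in range(1, N + 1):
        probs = predictor(seq[:n])        # distribution over the next state
        total -= math.log(probs[seq[n]])  # seq[n] is s_{n+1} in 1-based notation
    return total / N

def train_loss(predictor, sequences):
    """Mean per-sequence loss, standing in for the average over chains and sequences."""
    return sum(sequence_loss(predictor, s) for s in sequences) / len(sequences)

def uniform_predictor(C):
    """Hypothetical baseline that ignores the context entirely."""
    return lambda context: [1.0 / C] * C

loss = train_loss(uniform_predictor(3), [[0, 1, 2, 0], [2, 2, 1, 0]])
```

A context-free uniform predictor over $C$ states incurs a loss of $\log C$, which gives a convenient sanity check before plugging in a trained model.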

[5]

Bayes Predictors. Across task diversities and over the course of training, we find that a transformer’s behavior can be well-characterized by four algorithms (illustrated in Figure 1c). These algorithms are specific implementations of the Bayes-optimal predictor, which first infers the underlying transition matrix from an observed sequence $S_N$ and then predicts the distribu...

[6]

In this case, memorizing predictors correspond to the general Bayes predictor (Eq. III2) when the prior $P(T)$ matches $\mathcal{S}$

Memorization. We model memorization by ideal predictors that have complete information about the transition matrices in $\mathcal{S}$. In this case, memorizing predictors correspond to the general Bayes predictor (Eq. III2) when the prior $P(T)$ matches $\mathcal{S}$. Making the substitution $P(T) = \frac{1}{K}\sum_{k=1}^{K}\delta(T - T^{(k)})$, we have $\hat\pi^{\mathrm{Mem}}_n(\tau\mid\mu) = \frac{1}{K}\sum_{k=1}^{K}\frac{P(S_n\mid T^{(k)})}{P(S_n)}\,T^{(k)}_{\tau\mu}$...
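A memorizing predictor of this form is a posterior-weighted mixture over the $K$ known chains. The sketch below is an illustration under stated assumptions, not the paper's code: the likelihood uses transitions only (a 2-point version, with the initial-state probability assumed identical across chains so that it cancels), all matrix entries are assumed strictly positive, and `T[mu][tau]` is an assumed row convention.

```python
import math

def log_likelihood(seq, T):
    """log P(seq | T) from transitions only; T[mu][tau] = P(next = tau | current = mu)."""
    return sum(math.log(T[a][b]) for a, b in zip(seq, seq[1:]))

def memorizing_predictor(seq, task_set):
    """Next-state distribution averaged over the posterior on a finite set of chains."""
    logs = [log_likelihood(seq, T) for T in task_set]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]  # subtract the max for numerical stability
    Z = sum(weights)
    posterior = [w / Z for w in weights]       # the uniform prior 1/K cancels here
    mu = seq[-1]
    C = len(task_set[0])
    return [sum(p * T[mu][tau] for p, T in zip(posterior, task_set)) for tau in range(C)]

# Two hypothetical chains: one tends to stay put, the other tends to flip.
T_stay = [[0.9, 0.1], [0.1, 0.9]]
T_flip = [[0.1, 0.9], [0.9, 0.1]]
pred = memorizing_predictor([0, 0, 0, 0, 0], [T_stay, T_flip])
```

After four observed self-transitions the posterior concentrates on the "stay" chain, so the prediction is close to that chain's own row.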

[7]

Generalizing predictors thus correspond to the Bayes-optimal predictors (Eq. III2) when the assumed prior $P(T)$ matches $\mathcal{D}_T$

Generalization. We model generalization by ideal predictors that predict the next state given complete knowledge of the data distribution $\mathcal{D}_T$ from which the transition matrices in $\mathcal{S}$ are sampled. Generalizing predictors thus correspond to the Bayes-optimal predictors (Eq. III2) when the assumed prior $P(T)$ matches $\mathcal{D}_T$. a. 1-point generalization. For 1-point statist...

[8]

Scaling of the four Bayesian predictors with $K$ and $N$. Figure A1 compares the loss incurred by the four Bayesian predictors as a function of data diversity $K$ and sequence length $N$ on training sequences. FIG. A1: Scaling of the four predictor losses with data diversity $K$ and sequence length $N$, computed over a large batch of sequences and averaged over 8 task dis...

[9]

We find plateaus in both the training and generalization losses, which correspond closely with loss values of the predictors (Figure 2a)

Behavioral Readouts. We evaluated the training and generalization loss of each model checkpoint on a sample of $8 \times K$ and 2048 sequences, respectively, where the generalization loss is given by $\mathcal{L}_{\mathrm{gen}} = \left\langle -\frac{1}{N}\sum_{n=1}^{N}\log\hat\pi_\theta(s_{n+1}\mid S_n)\right\rangle_{T\sim\mathcal{D}_T,\ S_{N+1}\sim T}$. (IV1) This allowed us to compare the loss values through training to the loss of each predictor on the same sequ...

[10]

Consequently, for the model to infer nearest-neighbor 2-point correlations (i.e., bigrams) in a sequence, at least one attention layer must attend to the previous state

Mechanistic Readouts. Only the attention layers of the transformer architecture can mix information along the sequence dimension. Consequently, for the model to infer nearest-neighbor 2-point correlations (i.e., bigrams) in a sequence, at least one attention layer must attend to the previous state. To identify when an attention layer exhibits this behavi...
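One simple way to quantify previous-state attention is to average the attention weight that each query position places on its immediate predecessor. The excerpt is cut off before it describes the readout actually used, so the function below is a hypothetical diagnostic and not necessarily the paper's measure; `attn[n][i]` is assumed to hold the causal attention weight from query position `n` to key position `i <= n` (0-based).

```python
def previous_token_score(attn):
    """Mean attention weight on position n - 1, averaged over query positions n >= 1."""
    weights = [attn[n][n - 1] for n in range(1, len(attn))]
    return sum(weights) / len(weights)

# Two toy attention patterns over 4 positions (rows are query positions).
uniform = [[1.0 / (n + 1)] * (n + 1) for n in range(4)]
prev_only = [[1.0]] + [[0.0] * (n - 1) + [1.0, 0.0] for n in range(1, 4)]

score_uniform = previous_token_score(uniform)
score_prev = previous_token_score(prev_only)
```

Under this definition a head that attends only to the previous position scores 1, while uniform causal attention scores well below that, so the value separates the two regimes.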

[11]

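Path Expansion. We first explain the nature of the circuit edges we consider. Recall the construction of the residual stream after each block: $x^{(0)}_n = W_E x_n$ (V1), $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n})$ (V2), $x^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) + \mathrm{MLP}^{(1)}(y^{(1)}_n)$ (V3), $y^{(2)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) + \mathrm{MLP}^{(1)}(y^{(1)}_n) + \mathrm{Att}^{(2)}(x^{(1)}_{\le n})$ (V4), $x^{(2)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) +$...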

[12]

To do so, we developed a custom Python implementation of the model forward pass that explicitly exposes the vector passed along each layer connection

Circuit tracing. We first measured the importance of each connection in producing the observed transformer behavior. To do so, we developed a custom Python implementation of the model forward pass that explicitly exposes the vector passed along each layer connection. This allows the passed vectors along all edges to be cached during a forward pass of the u...

[13]

Reduction of the Network and Circuits. We first introduce simplifications at the architectural level.
(i) Fixed one-hot embeddings. Since sequence states are embedded as orthogonal one-hot vectors, we eliminate the embedding matrix $W_E$ and directly work with the one-hot token representations.
(ii) Disentangled value subspaces. We assume that the value matr...

[14]

This symmetry implies that the four $W$ matrices will have equal diagonal terms and equal off-diagonal terms

Symmetry Reduction. In our case, since the generative model for transition matrices $T$ is symmetric over the $C$ token classes, all $C$ token classes are statistically identical. This symmetry implies that the four $W$ matrices will have equal diagonal terms and equal off-diagonal terms. The off-diagonal terms can be set to zero as they contribute a constant offset; for ...

[15]

Training details. Denote the last token in the sequence, $x_N$, as $\mu$

Numerical validation. a. Training details. Denote the last token in the sequence, $x_N$, as $\mu$. The cross-entropy loss averaged over input sequences given transition matrix $T$ is $\mathcal{L} = -\left\langle\sum_{\mu,\tau} p_\mu T_{\tau\mu}\log\pi_\tau(x_{1:N})\right\rangle$, (VII12) where $T_{\tau\mu}$ is the probability that the next token is $\tau$ given that the current token is $\mu$, and $p_\mu$ is the stationary probability of token $\mu$ for transition matrix $T$. ...

[16]

The SA-transformer is trained using stochastic gradient descent (SGD) with learning rate 1 and a batch size of 256. b. Training results. The training loss, shown in Figure 5b, indicates that the SA-transformer reproduces the abrupt learning of the full network (compare with Figure 2b for $K = 1024$, where the transformer rapidly learns the 1-Gen solution and...

[17]

We show that $w_A \to 0$ and $w_B = w_C = w_D = 1/3$ in $G_1$, whereas the rest of the parameters remain at zero

The network first enters $G_1$ before entering $G_2$. We show that $w_A \to 0$ and $w_B = w_C = w_D = 1/3$ in $G_1$, whereas the rest of the parameters remain at zero. FIG. A4: Parameters of the attention circuits defined in equation VII13 at different iterations. The corresponding loss dynamics are given in Figure 5b

[18]

The other terms involving $w_A$, $w_B$, $w_D$ do not contribute to the solution after convergence

Next, we show that $G_2$ corresponds to $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$ in the SA-transformer. The other terms involving $w_A$, $w_B$, $w_D$ do not contribute to the solution after convergence

[19]

Third, we show that accurately capturing the training dynamics of $\beta, \delta$ requires a careful computation of expectations when expanding the loss in a Taylor series near the 1-Gen solution. While the 2-Gen solution involves a second-order term in the loss of the form $\beta\delta$ (which would imply a saddle-point at the origin $\beta, \delta = 0$), there are subtle first-order contribu...

[20]

At initialization, we have $A^{(\ell)}_{ji} = 1/i$ for $\ell = 1, 2$

Competition between $x_N$ and the 1-Gen solution. The network nearly implements the 1-Gen solution at initialization except for a contribution due to the first term involving $w_A$ in equation VII14. At initialization, we have $A^{(\ell)}_{ji} = 1/i$ for $\ell = 1, 2$. Plugging this into equation VII14, we observe that the terms involving $w_B, w_C, w_D$ compute the 1-point statistics. S...
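The link between uniform attention and 1-point statistics can be checked directly: with one-hot values and weights $1/i$ over the first $i$ positions, the attention output at position $i$ is the empirical state frequency up to that position. The sketch below illustrates only that arithmetic, with hypothetical function names; it is not the paper's model.

```python
def one_hot(s, C):
    """One-hot encoding of state s over C states."""
    return [1.0 if c == s else 0.0 for c in range(C)]

def uniform_causal_attention(seq, C):
    """Output at position i (1-based) is sum_{j <= i} A_ji x_j with A_ji = 1 / i."""
    outputs = []
    running = [0.0] * C
    for i, s in enumerate(seq, start=1):
        running = [r + x for r, x in zip(running, one_hot(s, C))]
        outputs.append([r / i for r in running])
    return outputs

seq = [0, 2, 2, 1, 2]
out = uniform_causal_attention(seq, C=3)
unigram_at_end = out[-1]  # fraction of positions occupied by each state
```

For this sequence the final output equals the state counts (1, 1, 3) divided by the sequence length 5.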

[21]

Specifically, $w_C = 1$, $\delta \to \infty$, $\beta \to \infty$ corresponds to the 2-Gen solution

The 2-Gen solution after convergence. Now, we show that 2-Gen can be implemented by setting all parameters except $w_C, \delta, \beta$ to zero. Specifically, $w_C = 1$, $\delta \to \infty$, $\beta \to \infty$ corresponds to the 2-Gen solution. Recall that, by definition, $\delta = P^{(1)}_{-1}$ and $\beta = \beta^{(2)}_3$. When all other parameters are set to zero, from equation VII14, we have $\hat\pi = \sum_{i\le N} A^{(2)}_{iN} x_i$, where ...

[22]

To do this, we expand the loss to first order in $\beta$ and $\delta$ around the unigram solution, $\mathcal{L}(\beta, \delta) = \mathcal{L}_{\text{1-Gen}} - c_\beta\beta - c_\delta\delta + \ldots$

Acquisition of the 2-Gen solution. We now examine the kinetics of acquisition of the 2-Gen solution $w_C = 1$, $\beta \to \infty$, $\delta \to \infty$. To do this, we expand the loss to first order in $\beta$ and $\delta$ around the unigram solution, $\mathcal{L}(\beta, \delta) = \mathcal{L}_{\text{1-Gen}} - c_\beta\beta - c_\delta\delta + \ldots$. (VIII9) Recall from Section VIII1 that the 1-Gen solution corresponds to $w_A = 0$, $w_B = w_C = w_D = 1/3$ and the rest of the para...
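As a worked step, the first-order expansion implies linear growth of the parameter under plain gradient descent. The learning rate $\eta$ and the iteration index $t$ are symbols introduced here for illustration and are not taken from the source; the result holds only while $c_\beta$ can be treated as constant, that is, before higher-order terms become relevant.

```latex
\beta_{t+1} \;=\; \beta_t - \eta\,\frac{\partial \mathcal{L}}{\partial \beta}
\;=\; \beta_t + \eta\, c_\beta
\qquad\Longrightarrow\qquad
\beta_t \;\approx\; \beta_0 + \eta\, c_\beta\, t .
```

The same manipulation applied to $\delta$ gives $\delta_t \approx \delta_0 + \eta\, c_\delta\, t$ at this order, with the caveat that any dependence of the coefficient on $\beta$ makes the growth of $\delta$ non-uniform in time.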

[23]

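Testing predictions. a. Loss landscape. From the analysis in the previous sections, the model can be expressed as $\hat\pi_\tau = w_A\delta_{\mu\tau} + w_B\sum_{i\le N} A^{(1)}_{iN}\delta_{\tau s_i} + w_C\sum_{i\le N} A^{(2)}_{iN}\delta_{\tau s_i} + w_D\sum_{i\le N}\sum_{j\le i} A^{(2)}_{iN}A^{(1)}_{ji}\delta_{\tau s_j}$, (VIII25) where $A^{(1)}_{ji} = \frac{(e^{\delta}-1)\,\delta_{j(i-1)} + 1}{i + e^{\delta} - 1}$ and $A^{(2)}_{ji} = \frac{\exp\left(\beta\sum_{k\le j} A^{(1)}_{kj}\delta_{s_i s_k}\right)}{\sum_{j'}\exp\left(\beta\sum_{k'\le j'} A^{(1)}_{k'j'}\delta_{s_i s_{k'}}\right)}$. (VIII26) Thus t...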
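Equations (VIII25) and (VIII26) are compact enough to evaluate directly. The sketch below is one reading of them and not the authors' implementation: positions are 1-based, $A_{ji}$ is taken as the weight that query position $i$ places on key position $j \le i$, the softmax in $A^{(2)}$ is assumed to run over causal keys $j' \le i$, and the formula for $A^{(1)}$ is applied literally, including at $i = 1$ where its row is not normalized.

```python
import math

def attn1(j, i, delta):
    """A^(1)_{ji}: weight from query position i to key position j <= i (1-based)."""
    boost = math.exp(delta) - 1.0
    return (boost * (1.0 if j == i - 1 else 0.0) + 1.0) / (i + boost)

def attn2_row(i, seq, beta, delta):
    """A^(2)_{ji} for j = 1..i: softmax over j of beta * sum_{k<=j} A^(1)_{kj} [s_i == s_k]."""
    s_i = seq[i - 1]
    scores = []
    for j in range(1, i + 1):
        match = sum(attn1(k, j, delta) for k in range(1, j + 1) if seq[k - 1] == s_i)
        scores.append(beta * match)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]  # entry j - 1 holds A^(2)_{ji}

def predict(seq, C, w, beta, delta):
    """Evaluate the four-term predictor for the state following seq."""
    w_A, w_B, w_C, w_D = w
    N = len(seq)
    a2 = attn2_row(N, seq, beta, delta)
    pi = [0.0] * C
    pi[seq[-1]] += w_A                            # copies the last state mu
    for i in range(1, N + 1):
        pi[seq[i - 1]] += w_B * attn1(i, N, delta)
        pi[seq[i - 1]] += w_C * a2[i - 1]
        for j in range(1, i + 1):
            pi[seq[j - 1]] += w_D * a2[i - 1] * attn1(j, i, delta)
    return pi

seq = [0, 1, 0, 1, 0, 2, 0]
pi_init = predict(seq, 3, (0.0, 1 / 3, 1 / 3, 1 / 3), beta=0.0, delta=0.0)
pi_2gen = predict(seq, 3, (0.0, 0.0, 1.0, 0.0), beta=20.0, delta=20.0)
```

With $w_C = 1$ and large $\beta, \delta$, the second layer attends to positions whose predecessor matches the last state, so `pi_2gen` approaches the empirical distribution of states that followed state 0 in the context: state 1 twice and state 2 once.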

[24]

If we set $F_1 = 0$, the dynamics change qualitatively. $F_1$ is non-zero due to subtle correlations between $s_{N-1}$ and $s_{N+1}$

Ablating the first-order contribution in $\delta$. In equation VIII24, the dynamics of $\delta$ are governed by two terms, of which only the first depends on $\beta$. If we set $F_1 = 0$, the dynamics change qualitatively. $F_1$ is non-zero due to subtle correlations between $s_{N-1}$ and $s_{N+1}$. We remove these correlations by resampling the token at position $N-1$ after generating the sequence, th...

[25]

From equation VIII24, $\beta$ grows at a constant rate until it reaches the nonlinear regime, after which it increases rapidly and saturates shortly thereafter

The time to transition from the 1-Gen to the 2-Gen solution. Finally, a key result of our theory is an estimate for the number of iterations required to transition from the 1-Gen to the 2-Gen solution, which we call $\tau_{\text{2-Gen}}$. From equation VIII24, $\beta$ grows at a constant rate until it reaches the nonlinear regime, after which it increases rapidly and saturates sh...

[26]

Consider the fluctuations of these inputs over sequences sampled from the same chain

Task Vector. In the 2-Mem circuit, MLP2 primarily reads from two inputs to produce the logit: MLP1 and Att2, the latter of which averages the outputs of MLP1 over the sequence. Consider the fluctuations of these inputs over sequences sampled from the same chain. MLP1 can only read the current state and the output of Att1, the latter of which is almost exclu...

[27]

We computed t-SNE for $\phi$ and observed task-specific clustering forming through training on data diversities up to $K = 128$

Representation Geometry. Since MLP2 infers the current task condition from $\phi$, it is desirable for instantiations of $\phi$ to be separable when the underlying sequences are from different tasks. We computed t-SNE for $\phi$ and observed task-specific clustering forming through training on data diversities up to $K = 128$. The task vectors for this computation are...

[28]

The transformer first performed a forward pass on a batch of sequences $S_A$ (Condition $S_A$)

Patching. We propose that MLP2 makes predictions using only the task information provided by $\phi$, and test this by analyzing the transformer behavior when the $\phi$ task information is incongruent with any other alternative source of task information. The transformer first performed a forward pass on a batch of sequences $S_A$ (Condition $S_A$). During the forward pass, t...

[29]

The traced circuit indicates that MLP1 reads these pairs and produces an output vector for each that is then aggregated across the sequence by Att2 to form the task vector $\phi$

Information Content. In $M_2$, the first layer attention weight concentrates on the previous token, $A^{(1)}_{ni} \approx \delta(n - i - 1)$, in which case Att1 may be seen as forming a representation of 2-point statistics in the residual stream $x^{(1)}_n = x_n + W^{(1)}_V x_{n-1} := f(x_n, x_{n-1})$. The traced circuit indicates that MLP1 reads these pairs and produces an output vector for each that is then aggregate...

[30]

Results derived using the same models used for Figure 7a

As in Figure 7a, each seed has been shifted along the horizontal axis by the value of $K_1^*$ determined for that seed by the $\phi^{(2)}_\beta$ threshold criterion. Results derived using the same models used for Figure 7a

[31]

The rate at which the model learns this can be increased by providing perfect task information to the model

Task Injection. To implement a memorizing predictor, the model must in part learn to differentiate between sequences from different tasks. The rate at which the model learns this can be increased by providing perfect task information to the model. To do this, we allowed the model to learn $D$-dimensional embeddings for each chain that were injected into MLP1 ...

[32]

One way to accomplish this is by reducing the rate at which the first attention layer learns to attend to the previous token

Gradient Reweighting. We sought to perturb the model by modifying the effective learning rate for the 2-Gen circuit independently of the 1-Mem circuit. One way to accomplish this is by reducing the rate at which the first attention layer learns to attend to the previous token. We implemented a modified attention head in Python with a fixed weight factor w∈[0,1...

[33]

Path Expansion in the 2-Mem Phase. The circuit-tracing results for $M_2$ in Figure A3 indicate that the full transformer computation can be reduced to the dominant information flow shown schematically in Figure 8(c). This reduced path can be written as $x^{(0)}_n = W_E x_n$, (XII1) $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n})$, (XII2) $x^{(1)}_n = x^{(0)}_n + \mathrm{MLP}^{(1)}(y^{(1)}_n)$, (XII3) $y^{(2)}_n$ ...

[34]

This allows us to replace $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) \longrightarrow y^{(1)}_n = x^{(0)}_n \oplus x^{(0)}_{n-1}$, (XII7) where $\oplus$ denotes concatenation into orthogonal subspaces

Minimal Network Construction. First, we assume that Att1 extracts the preceding state and maps it to a subspace orthogonal to that of the current state. This allows us to replace $y^{(1)}_n = x^{(0)}_n + \mathrm{Att}^{(1)}(x^{(0)}_{\le n}) \longrightarrow y^{(1)}_n = x^{(0)}_n \oplus x^{(0)}_{n-1}$, (XII7) where $\oplus$ denotes concatenation into orthogonal subspaces. Second, we assume that Att2 performs a uniform poolin...
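The two assumptions above can be illustrated with one-hot states. The sketch below is not the paper's minimal network: `pair_feature` is a hypothetical stand-in for MLP1 (the excerpt is cut off before MLP1 is specified), the first position is given a zero "previous" block by assumption, and the pooling is assumed to be a plain average over positions.

```python
def one_hot(s, C):
    """One-hot encoding of state s over C states."""
    return [1.0 if c == s else 0.0 for c in range(C)]

def concat_current_previous(seq, C):
    """y_n = x_n (+) x_{n-1}: current and previous one-hot states in separate blocks.

    Assumption: the first position has no predecessor, so its second block is zero.
    """
    ys = []
    for n, s in enumerate(seq):
        prev = one_hot(seq[n - 1], C) if n > 0 else [0.0] * C
        ys.append(one_hot(s, C) + prev)
    return ys

def pair_feature(y, C):
    """Hypothetical stand-in for MLP1: flattened outer product, index = prev * C + cur."""
    cur, prev = y[:C], y[C:]
    return [p * c for p in prev for c in cur]

def uniform_pool(vectors):
    """Uniform average over positions, standing in for the pooling attributed to Att2."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

C = 2
seq = [0, 1, 1, 0, 1]
ys = concat_current_previous(seq, C)
phi = uniform_pool([pair_feature(y, C) for y in ys])
```

With this choice of feature, the pooled vector is the table of empirical bigram frequencies for the sequence, which is one example of a sequence-level summary from which a task could be identified.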

[35]

This separation demonstrates that the minimal network successfully reproduces the 2-Mem predictor

Phase Diagram. Figures 8d and 8f show that the minimal model training loss can fall below the 2-Gen baseline $\mathcal{L}_{\text{2-Gen}}$ while the generalization loss remains significantly above it. This separation demonstrates that the minimal network successfully reproduces the 2-Mem predictor. As data diversity $K$ increases, Figure 8d exhibits a clear crossover behavior: for sm...

[36]

Dependence of $K_2^*$ on MLP Capacity. We denote the estimated critical data diversity for the minimal model by $\hat{K}_2^*$ to distinguish it from $K_2^*$ in the full model. We hypothesize that MLP1 is primarily responsible for extracting a task vector from the sequence, while MLP2 must memorize the collection of task-specific transition matrices and select the appropr...
