InfoFlow: A Framework for Multi-Layer Transformer Analysis
Pith reviewed 2026-05-20 13:22 UTC · model grok-4.3
The pith
Two-layer Transformers approximate certain retrieval tasks with O(ε^{-1}) parameters while any single-layer Transformer requires Ω(ε^{-k}) where k grows linearly with sequence length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For certain retrieval tasks, any single-layer Transformer requires at least Ω(ε^{-k}) parameters to achieve precision ε, where k grows linearly with sequence length T, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most O(ε^{-1}) parameters. Softmax attention efficiently retrieves only the token attaining the maximum attention score, and the cost of decoding coupled information scales with the size of the retrieved token set. The InfoFlow framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation.
What carries the argument
The InfoFlow framework, which tracks the set of accessible input positions at each token and layer and assigns explicit approximation rates to distinct modes of information propagation.
If this is right
- Retrieving the k-th largest attention score for k greater than 1 forces exponential parameter growth in single-layer models as sequence length grows.
- Combining information across multiple tokens incurs parameter cost proportional to the size of the retrieved set.
- The InfoFlow abstraction recovers known approximation bounds for both single- and multi-layer Transformers.
- Concrete predictions for approximation efficiency become available in regimes where direct theoretical analysis remains intractable.
Where Pith is reading between the lines
- Architectures with additional layers may achieve large efficiency gains on tasks that require selecting or combining information beyond the single highest-attention token.
- Tracking information sets layer by layer could serve as a practical tool for deciding how many layers are needed for a given task family.
- Synthetic retrieval benchmarks could be used to measure whether the parameter scaling of trained models matches the predicted separation between one and two layers.
Load-bearing premise
The retrieval tasks are defined so that attention efficiently retrieves only the single maximum-scoring token and the cost of combining information grows directly with the number of tokens retrieved.
What would settle it
A single-layer Transformer trained on one of the paper's retrieval tasks that reaches precision ε using only O(ε^{-1}) parameters would falsify the claimed lower bound.
Figures
read the original abstract
While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires least $\Omega (\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O (\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes a separation between single-layer and multi-layer Transformer approximation power on specific retrieval tasks. For k-th largest token retrieval (k linear in sequence length T), any single-layer Transformer requires Ω(ε^{-k}) parameters to reach precision ε, while a two-layer single-head Transformer achieves the same precision with O(ε^{-1}) parameters. The authors identify two mechanisms—softmax attention retrieves only the argmax token (incurring exponential cost for k≥2) and decoding coupled information scales with retrieved-set size—and introduce the InfoFlow framework, which tracks accessible input positions per token and layer while assigning explicit approximation rates to propagation modes. The framework recovers prior bounds, matches experimental observations, and generates predictions for intractable cases.
Significance. If the separation and lower-bound arguments hold, the result supplies a concrete explanation for the empirical advantage of depth in Transformers on retrieval-like tasks and supplies a reusable abstraction (InfoFlow) for analyzing multi-layer models. The framework’s ability to recover known bounds and remain consistent with trained-network experiments is a clear strength; the concrete predictions it yields in currently intractable regimes are also valuable.
major comments (1)
- [Abstract / Mechanisms] Abstract / Mechanisms paragraph: The Ω(ε^{-k}) lower bound for single-layer models is load-bearing for the claimed separation. The argument rests on the claim that softmax attention can only efficiently retrieve the argmax token and that any single-layer construction (arbitrary heads, values, and FFN) therefore incurs exponential parameter cost for k-th largest retrieval. It is not shown why a single-layer model cannot form a linear combination of several value vectors via the full attention distribution and then use the FFN to extract the k-th order statistic at parameter cost independent of k. A concrete exclusion of this encoding (or an explicit reduction showing that any such construction still requires Ω(ε^{-k}) parameters) is required to support the central claim.
minor comments (2)
- [Task definitions] Clarify the precise definition of the retrieval tasks (including how the target token set and precision metric are formalized) so that the weakest assumption identified in the review can be directly verified against the stated constructions.
- [InfoFlow framework] Ensure that the InfoFlow rate assignments are derived rather than post-hoc matched to observed mechanisms; explicit equations showing how the rates follow from the attention and FFN definitions would strengthen the framework section.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The major comment identifies a point where the lower-bound argument for single-layer models would benefit from greater explicitness, and we address it directly below. We have revised the manuscript to strengthen this part of the exposition while preserving the original claims.
read point-by-point responses
-
Referee: The Ω(ε^{-k}) lower bound for single-layer models is load-bearing for the claimed separation. The argument rests on the claim that softmax attention can only efficiently retrieve the argmax token and that any single-layer construction (arbitrary heads, values, and FFN) therefore incurs exponential parameter cost for k-th largest retrieval. It is not shown why a single-layer model cannot form a linear combination of several value vectors via the full attention distribution and then use the FFN to extract the k-th order statistic at parameter cost independent of k. A concrete exclusion of this encoding (or an explicit reduction showing that any such construction still requires Ω(ε^{-k}) parameters) is required to support the central claim.
Authors: We agree that the manuscript would be improved by an explicit argument ruling out the distributed-attention-plus-FFN construction. The current text relies on the concentration property of softmax to claim that only the argmax is retrieved efficiently, but does not directly address why a carefully balanced attention distribution followed by an FFN cannot recover the k-th largest at lower cost. We have added a new lemma (Lemma 3.4 in the revised version) and a short paragraph in Section 3 that provides the requested reduction: any attention distribution whose support covers the k largest tokens with sufficient mass to allow the FFN to isolate the k-th order statistic must have its logits calibrated to precision Ω(ε), which in turn requires the preceding linear projections to approximate a k-dependent ordering function. This forces the total parameter count to remain Ω(ε^{-k}) by a standard reduction to the cost of approximating order statistics with bounded-depth networks. The revision therefore supplies the concrete exclusion while leaving the remainder of the separation argument unchanged. revision: yes
Circularity Check
No significant circularity in derivation of single- vs multi-layer separation
full rationale
The paper identifies softmax max-retrieval and set-size decoding costs as structural mechanisms, then introduces InfoFlow to track accessible positions and assign explicit rates per propagation mode. These rates recover known bounds and enable predictions, but the central Ω(ε^{-k}) vs O(ε^{-1}) separation is presented as following from independent analysis of attention properties rather than reducing to a self-defined quantity, fitted parameter renamed as prediction, or self-citation chain. No equation or step equates a claimed bound to its own input by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Softmax attention retrieves only the maximum-scoring token efficiently.
- standard math Standard multi-head attention and feed-forward layers in the Transformer definition.
invented entities (1)
-
InfoFlow framework
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for k-th largest retrieval with k≥2. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Furthermore, by the fact thatY∩B(−b 1, L) =∅, this further reduces toF(W i,t) = mins∈J ∥wi(t)−(−b 1)∥2 2. By the construction of Ut and Z1,T , Z2,T , let j∗ = min t∈J, we have that F(W i,t) =∥w i(j∗)− (−b1)∥2 2 =∥z i(j∗)−(−b 1)∥2
-
[2]
Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε
From z1(j∗)̸=z 2(j∗), we have that |∥z1(j∗)−(−b 1)∥2 2 − ∥z2(j∗)− (−b1)∥2 2| ≥4ε. Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε. Then by Lemma 4 in Yu et al. [2026], which states that for two vectorsv 1, v2 ∈R n. If ∥v1 −v 2∥2 ≤Aand∥ ˆF(v 1)− ˆF(v 2)∥ ≥B, where ˆF:R n →R m is a two-layer feed-forward network satisfying with Tanh-activation and bou...
work page 2026
-
[3]
Denote by G: [−1,1] 6 →[−1,1] 6 the following function: G((x, y)) = (xT y, 1 2 ,0,0,0,0) , where (x, y)∈R 6, x, y∈R 3. Then there exists a two-layer Tanh-activated FFN F1 with at most O(1/ε1) parameters such that |F1((x, y))−G((x, y))| ≤ ε
-
[4]
(As we can rewrite xiyi = 1 4[(xi +y i)2 −(x i −y i)2], this approximation of inner product is transformed to one dimensional FFN approximation problem, therefore O(1/ε1) approximation rate can be achieved by FFNs.)
-
[5]
Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A
Write x2(t) = (a(t), 1 2 ,0,0,0,0)∈R 6. Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A. Thus we have that (WQ,1,2c2)T WK,1,1x2(t) = −a(t). 6.W V,1,2 =I 6, WO,2 = 0 0 I3 0 . With the scaling factor β >0 , we also have a large enough β such that the 4-th element of c′ 2 satisfies that |(c′ 2)4 −min 1≤t≤T a(t)| ≤ ε
-
[6]
Then the output feed-forward block mapsc ′ 2 to2 + 18(c ′ 2)4 withO(1)parameters
-
[7]
Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold
The total error is |M2(XT )−F(X T )|(39) ≤18[ max 1≤t≤T (|x(t)T v(t)− 1 9 min 1≤s≤T x(s)T x(t)|+|a(t)− 1 9 min 1≤s≤T x(s)T x(t)|)(40) +|(c′ 2)4 −min 1≤t≤T a(t)|](41) ≤ ε 3 + ε 3 + ε 3 (42) ≤ε(43) By this construction we have proven the second part of Theorem 1. Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold. A.2 F...
work page 2026
-
[8]
For T Ri, each leaf is associated with an ordered index set t1, t2 for 1≤t 1, t2 ≤ T , thus T Ri has T 2 leaves. And for each internal node of T Ri, its correspond- ing fin is defined as fin(X[{s1, s2}]) =x(s 1)T Aix(s2). Thus If i(XT , T Ri) = arg max{s1,s2} x(s1)T Aix(s2). Then we haveI F (XT )⊆ SD i=1 If i(XT , T Ri). With this upper bound we also have...
work page 2099
-
[9]
It is clear that|I F (XT )| ≤D, a.e., so we constructDTrees of Comparison
-
[10]
, D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1,
For j= 1, . . . , D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1, . . . , T . And fin(x) =f j(x) for all internal nodes of Tj. Thus If i(XT ,T j) = arg max1≤t≤T fj(x(t))
-
[11]
We then haveI F (XT ) =ST j=1 If i(XT ,T j)
-
[12]
These D Trees of Comparison each has T leaves, thus it upper bounds the Number of Comparison ofFbyD(T−1). Assuming that each fi(z) obtains maximum at different zi, and non-degeneracy condition as in Yu et al. [2026]. We then show that the Number of Comparison is lower bounded byD(T−D)
work page 2026
-
[13]
As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree
It is clear that β1 = 1. As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree. As there are at mostDTrees, We haveN ′ ≥T−D, thusβ ′ = 1
-
[14]
and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons
As |IF (XT )|=D, a.e. and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons
-
[15]
And Number of Comparison is calculated withPD j=1(|Ij| −1)
For each Tj, denote by Ij ⊆[T] the union of index sets on all its leaves. And Number of Comparison is calculated withPD j=1(|Ij| −1)
-
[16]
Assume that |Ij| ≤T−D for some j. Then suppose t1, . . . , tD /∈Ij. Then let x(ti) =z i for i= 1, . . . , D , we have that IF (XT ) ={t 1, . . . , tD}. However, [IF (XT )∩I f i(XT ,T j)]⊆ [IF (XT )∩I j] =∅. Then we have that |IF (XT )∩ T[ i=1 If i(XT ,T i)|(60) =|IF (XT )∩ [ i̸=j If i(XT ,T i)|(61) ≤| [ i̸=j If i(XT ,T i)|(62) ≤D−1(63) <|IF (XT )|(64) whi...
-
[17]
, D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D)
Then we have |Ij| ≥T−D+ 1 for all j= 1, . . . , D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D). We then successfully showed that the Number of Comparison of F(X T ) = F0(⊕D i=1 max1≤t≤T fi(x(t)))isDT+O(1). Triangle-Center TargetFor triangle-center target F(X T ) = min 1≤t1,t2,t3≤T ∥x(t1) +x(t 2) + x(t3)∥2 2, we show that it has β1 = 3 and β′ = 3, with Number of Co...
-
[18]
There areT 3 choices of such ordered set
We construct one Tree of Comparison, with each leaf being an ordered set(t1, t2, t3) that allows repetition. There areT 3 choices of such ordered set
-
[19]
For all internal nodes, they have the same fin(X[I]) =−∥ P t∈I x(t)∥2
-
[20]
Thus at the root, we haveI f i(XT ,T) = arg max I⊆[T],|I|=3 ∥P t∈I x(t)∥2 2 =I F (XT ). With the above construction, we have that the Order and Dimension of Comparison upper bounded by3, namelyβ ′ ≤β 1 ≤3. We then show the lower bound. Consider X T = [−1,1] T d ⊂R T d as part of the T d-dimensional Euclidean Space. Now for two given triple pairs (unordere...
-
[21]
By symmetry, we can find a connected component Γ⊆Γ ′′ that is separated into two pieces Γl,Γ r such thatµ(Γ) = 2µ(Γ l) = 2µ(Γr)>0, andγ=γ ′′ ∩Γalso haveν(γ)>0. Now for each pairs of unordered triple I ′, J′ that is not the same as I0, J0, let γ′ ={X T ∈ X T :g(X[I 1]) =g(X[J 1])} ∩Γ for non-degenerate g (namely γ′ have dimension T d−1 ). As {I ′, J′} ̸={I...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.