InfoFlow: A Framework for Multi-Layer Transformer Analysis

Haotian Jiang; Penghao Yu; Qianxiao Li; Zeyu Bao

arxiv: 2605.17930 · v1 · pith:VJGZWCI2new · submitted 2026-05-18 · 💻 cs.LG

InfoFlow: A Framework for Multi-Layer Transformer Analysis

Penghao Yu , Haotian Jiang , Zeyu Bao , Qianxiao Li This is my paper

Pith reviewed 2026-05-20 13:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-layer transformersapproximation boundsretrieval taskssoftmax attentioninformation flowparameter efficiencytransformer analysis

0 comments

The pith

Two-layer Transformers approximate certain retrieval tasks with O(ε^{-1}) parameters while any single-layer Transformer requires Ω(ε^{-k}) where k grows linearly with sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-layer Transformers have approximation capabilities that single-layer models fundamentally lack for certain retrieval tasks. Single-layer attention can efficiently access only the token with the highest score, forcing exponential growth in parameters when the task requires pulling the second- or k-th-best token as sequence length increases. A two-layer model with one head per layer sidesteps this by propagating a small set of candidate positions through the second layer, achieving the same precision with far fewer parameters. The authors introduce the InfoFlow framework to make this separation precise by tracking which input positions remain accessible at each layer and token while assigning explicit approximation rates to each propagation path. This abstraction reproduces known bounds, aligns with trained network behavior, and generates predictions where direct proofs are currently out of reach.

Core claim

For certain retrieval tasks, any single-layer Transformer requires at least Ω(ε^{-k}) parameters to achieve precision ε, where k grows linearly with sequence length T, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most O(ε^{-1}) parameters. Softmax attention efficiently retrieves only the token attaining the maximum attention score, and the cost of decoding coupled information scales with the size of the retrieved token set. The InfoFlow framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation.

What carries the argument

The InfoFlow framework, which tracks the set of accessible input positions at each token and layer and assigns explicit approximation rates to distinct modes of information propagation.

If this is right

Retrieving the k-th largest attention score for k greater than 1 forces exponential parameter growth in single-layer models as sequence length grows.
Combining information across multiple tokens incurs parameter cost proportional to the size of the retrieved set.
The InfoFlow abstraction recovers known approximation bounds for both single- and multi-layer Transformers.
Concrete predictions for approximation efficiency become available in regimes where direct theoretical analysis remains intractable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures with additional layers may achieve large efficiency gains on tasks that require selecting or combining information beyond the single highest-attention token.
Tracking information sets layer by layer could serve as a practical tool for deciding how many layers are needed for a given task family.
Synthetic retrieval benchmarks could be used to measure whether the parameter scaling of trained models matches the predicted separation between one and two layers.

Load-bearing premise

The retrieval tasks are defined so that attention efficiently retrieves only the single maximum-scoring token and the cost of combining information grows directly with the number of tokens retrieved.

What would settle it

A single-layer Transformer trained on one of the paper's retrieval tasks that reaches precision ε using only O(ε^{-1}) parameters would falsify the claimed lower bound.

Figures

Figures reproduced from arXiv: 2605.17930 by Haotian Jiang, Penghao Yu, Qianxiao Li, Zeyu Bao.

**Figure 2.** Figure 2: Two comparison trees for the target F(XT ) = max1≤t≤T x(t) + min1≤t≤T x(t) with D = 2, illustrated on the input x(1) = 0.1, x(2) = 0.3, x(3) = 0.4, x(4) = 0.2 with T = 4. Each leaf corresponds to one input position, and each internal node retains the index of the winning position after comparison. The root of each tree gives the active index set If i(XT , T ). 34 [PITH_FULL_IMAGE:figures/full_fig_p034_2.png] view at source ↗

**Figure 3.** Figure 3: Illustration of the three information propagation mechanisms in attention layers, shown [PITH_FULL_IMAGE:figures/full_fig_p035_3.png] view at source ↗

**Figure 4.** Figure 4: Mean CLS attention weight per input position, averaged over the validation set, for a [PITH_FULL_IMAGE:figures/full_fig_p035_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of InfoFlow on the target F(XT ) = min1≤s,t≤T x(t) ⊤x(s) with a two-layer one-head-per-layer Transformer. Each node shows the information set I(t, l) at position t and layer l. Blue arrows indicate argmax information flow selected by each attention head. Gray arrows indicate residual connections preserving the token’s own information. Red-shaded nodes at layer 0 indicate the active index set I… view at source ↗

read the original abstract

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires least $\Omega (\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O (\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for $k$-th largest retrieval with $k \geq 2$. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set. Motivated by these findings, we propose InfoFlow, a framework for multi-layer Transformers. The framework tracks an information set of accessible input positions at each token and layer, assigning an explicit approximation rate to each mode of information propagation. This abstraction recovers known approximation bounds, remains consistent with experimental observations on trained networks, and yields concrete predictions in settings where direct theoretical analysis is currently intractable. Our results provide a principled framework for reasoning about the approximation efficiency of multi-layer Transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a parameter separation for k-th largest retrieval between single and two-layer transformers using InfoFlow to track information flow, but the single-layer lower bound may not hold if FFNs can extract order statistics from full attention mixtures.

read the letter

The punchline is that this work claims single-layer transformers need Ω(ε^{-k}) parameters to approximate k-th largest retrieval for k linear in sequence length T, while a two-layer single-head model does the same job with O(ε^{-1}) parameters. They introduce InfoFlow to track accessible input positions layer by layer and assign explicit approximation rates to different propagation modes. This recovers earlier single-layer bounds and matches some observations from trained networks while offering predictions for harder cases. The mechanisms they highlight—softmax attention only cheaply retrieving the argmax token and decoding cost growing with the size of the retrieved set—provide a direct explanation for why depth helps on these tasks. The framework itself is a clean abstraction that goes beyond routine extensions of prior single-layer results. The soft spot is the single-layer lower bound. The stress-test point lands: the abstract does not explicitly rule out a construction where the full attention distribution mixes several value vectors and the FFN then pulls the k-th order statistic at parameter cost independent of k. If that encoding works, the Ω(ε^{-k}) claim weakens. Without the full derivations visible here it is hard to judge how tightly they close that possibility. The citation pattern is standard and builds on existing single-layer analyses without obvious gaps. This is for theorists working on transformer approximation and expressivity. Readers who want a new way to reason about multi-layer depth and long-context retrieval would find the framework useful even if some predictions still need checking. It deserves a serious referee because the separation, if it survives scrutiny on the lower bound, has direct implications for model design. I would send it for peer review but flag the need to verify the single-layer constructions against mixed-attention plus FFN alternatives.

Referee Report

1 major / 2 minor

Summary. The paper establishes a separation between single-layer and multi-layer Transformer approximation power on specific retrieval tasks. For k-th largest token retrieval (k linear in sequence length T), any single-layer Transformer requires Ω(ε^{-k}) parameters to reach precision ε, while a two-layer single-head Transformer achieves the same precision with O(ε^{-1}) parameters. The authors identify two mechanisms—softmax attention retrieves only the argmax token (incurring exponential cost for k≥2) and decoding coupled information scales with retrieved-set size—and introduce the InfoFlow framework, which tracks accessible input positions per token and layer while assigning explicit approximation rates to propagation modes. The framework recovers prior bounds, matches experimental observations, and generates predictions for intractable cases.

Significance. If the separation and lower-bound arguments hold, the result supplies a concrete explanation for the empirical advantage of depth in Transformers on retrieval-like tasks and supplies a reusable abstraction (InfoFlow) for analyzing multi-layer models. The framework’s ability to recover known bounds and remain consistent with trained-network experiments is a clear strength; the concrete predictions it yields in currently intractable regimes are also valuable.

major comments (1)

[Abstract / Mechanisms] Abstract / Mechanisms paragraph: The Ω(ε^{-k}) lower bound for single-layer models is load-bearing for the claimed separation. The argument rests on the claim that softmax attention can only efficiently retrieve the argmax token and that any single-layer construction (arbitrary heads, values, and FFN) therefore incurs exponential parameter cost for k-th largest retrieval. It is not shown why a single-layer model cannot form a linear combination of several value vectors via the full attention distribution and then use the FFN to extract the k-th order statistic at parameter cost independent of k. A concrete exclusion of this encoding (or an explicit reduction showing that any such construction still requires Ω(ε^{-k}) parameters) is required to support the central claim.

minor comments (2)

[Task definitions] Clarify the precise definition of the retrieval tasks (including how the target token set and precision metric are formalized) so that the weakest assumption identified in the review can be directly verified against the stated constructions.
[InfoFlow framework] Ensure that the InfoFlow rate assignments are derived rather than post-hoc matched to observed mechanisms; explicit equations showing how the rates follow from the attention and FFN definitions would strengthen the framework section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The major comment identifies a point where the lower-bound argument for single-layer models would benefit from greater explicitness, and we address it directly below. We have revised the manuscript to strengthen this part of the exposition while preserving the original claims.

read point-by-point responses

Referee: The Ω(ε^{-k}) lower bound for single-layer models is load-bearing for the claimed separation. The argument rests on the claim that softmax attention can only efficiently retrieve the argmax token and that any single-layer construction (arbitrary heads, values, and FFN) therefore incurs exponential parameter cost for k-th largest retrieval. It is not shown why a single-layer model cannot form a linear combination of several value vectors via the full attention distribution and then use the FFN to extract the k-th order statistic at parameter cost independent of k. A concrete exclusion of this encoding (or an explicit reduction showing that any such construction still requires Ω(ε^{-k}) parameters) is required to support the central claim.

Authors: We agree that the manuscript would be improved by an explicit argument ruling out the distributed-attention-plus-FFN construction. The current text relies on the concentration property of softmax to claim that only the argmax is retrieved efficiently, but does not directly address why a carefully balanced attention distribution followed by an FFN cannot recover the k-th largest at lower cost. We have added a new lemma (Lemma 3.4 in the revised version) and a short paragraph in Section 3 that provides the requested reduction: any attention distribution whose support covers the k largest tokens with sufficient mass to allow the FFN to isolate the k-th order statistic must have its logits calibrated to precision Ω(ε), which in turn requires the preceding linear projections to approximate a k-dependent ordering function. This forces the total parameter count to remain Ω(ε^{-k}) by a standard reduction to the cost of approximating order statistics with bounded-depth networks. The revision therefore supplies the concrete exclusion while leaving the remainder of the separation argument unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation of single- vs multi-layer separation

full rationale

The paper identifies softmax max-retrieval and set-size decoding costs as structural mechanisms, then introduces InfoFlow to track accessible positions and assign explicit rates per propagation mode. These rates recover known bounds and enable predictions, but the central Ω(ε^{-k}) vs O(ε^{-1}) separation is presented as following from independent analysis of attention properties rather than reducing to a self-defined quantity, fitted parameter renamed as prediction, or self-citation chain. No equation or step equates a claimed bound to its own input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The analysis rests on standard Transformer components and introduces one new abstraction; no free parameters or invented physical entities are introduced.

axioms (2)

domain assumption Softmax attention retrieves only the maximum-scoring token efficiently.
Invoked to explain the exponential cost for k-th largest retrieval with k ≥ 2.
standard math Standard multi-head attention and feed-forward layers in the Transformer definition.
Background architecture used throughout the approximation analysis.

invented entities (1)

InfoFlow framework no independent evidence
purpose: Tracks accessible input positions and assigns explicit approximation rates to information-propagation modes.
New abstraction introduced to unify bounds and generate predictions.

pith-pipeline@v0.9.0 · 5784 in / 1309 out tokens · 80449 ms · 2026-05-20T13:22:49.382459+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

softmax attention can only efficiently retrieve the token attaining the maximum attention score, incurring exponential-in-length parameter cost for k-th largest retrieval with k≥2. Moreover, the parameter cost of decoding coupled information scales with the size of the retrieved token set.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1]

By the construction of Ut and Z1,T , Z2,T , let j∗ = min t∈J, we have that F(W i,t) =∥w i(j∗)− (−b1)∥2 2 =∥z i(j∗)−(−b 1)∥2

Furthermore, by the fact thatY∩B(−b 1, L) =∅, this further reduces toF(W i,t) = mins∈J ∥wi(t)−(−b 1)∥2 2. By the construction of Ut and Z1,T , Z2,T , let j∗ = min t∈J, we have that F(W i,t) =∥w i(j∗)− (−b1)∥2 2 =∥z i(j∗)−(−b 1)∥2

work page
[2]

Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε

From z1(j∗)̸=z 2(j∗), we have that |∥z1(j∗)−(−b 1)∥2 2 − ∥z2(j∗)− (−b1)∥2 2| ≥4ε. Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε. Then by Lemma 4 in Yu et al. [2026], which states that for two vectorsv 1, v2 ∈R n. If ∥v1 −v 2∥2 ≤Aand∥ ˆF(v 1)− ˆF(v 2)∥ ≥B, where ˆF:R n →R m is a two-layer feed-forward network satisfying with Tanh-activation and bou...

work page 2026
[3]

Then there exists a two-layer Tanh-activated FFN F1 with at most O(1/ε1) parameters such that |F1((x, y))−G((x, y))| ≤ ε

Denote by G: [−1,1] 6 →[−1,1] 6 the following function: G((x, y)) = (xT y, 1 2 ,0,0,0,0) , where (x, y)∈R 6, x, y∈R 3. Then there exists a two-layer Tanh-activated FFN F1 with at most O(1/ε1) parameters such that |F1((x, y))−G((x, y))| ≤ ε

work page
[4]

(As we can rewrite xiyi = 1 4[(xi +y i)2 −(x i −y i)2], this approximation of inner product is transformed to one dimensional FFN approximation problem, therefore O(1/ε1) approximation rate can be achieved by FFNs.)

work page
[5]

Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A

Write x2(t) = (a(t), 1 2 ,0,0,0,0)∈R 6. Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A. Thus we have that (WQ,1,2c2)T WK,1,1x2(t) = −a(t). 6.W V,1,2 =I 6, WO,2 = 0 0 I3 0 . With the scaling factor β >0 , we also have a large enough β such that the 4-th element of c′ 2 satisfies that |(c′ 2)4 −min 1≤t≤T a(t)| ≤ ε

work page
[6]

Then the output feed-forward block mapsc ′ 2 to2 + 18(c ′ 2)4 withO(1)parameters

work page
[7]

Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold

The total error is |M2(XT )−F(X T )|(39) ≤18[ max 1≤t≤T (|x(t)T v(t)− 1 9 min 1≤s≤T x(s)T x(t)|+|a(t)− 1 9 min 1≤s≤T x(s)T x(t)|)(40) +|(c′ 2)4 −min 1≤t≤T a(t)|](41) ≤ ε 3 + ε 3 + ε 3 (42) ≤ε(43) By this construction we have proven the second part of Theorem 1. Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold. A.2 F...

work page 2026
[8]

And for each internal node of T Ri, its correspond- ing fin is defined as fin(X[{s1, s2}]) =x(s 1)T Aix(s2)

For T Ri, each leaf is associated with an ordered index set t1, t2 for 1≤t 1, t2 ≤ T , thus T Ri has T 2 leaves. And for each internal node of T Ri, its correspond- ing fin is defined as fin(X[{s1, s2}]) =x(s 1)T Aix(s2). Thus If i(XT , T Ri) = arg max{s1,s2} x(s1)T Aix(s2). Then we haveI F (XT )⊆ SD i=1 If i(XT , T Ri). With this upper bound we also have...

work page 2099
[9]

It is clear that|I F (XT )| ≤D, a.e., so we constructDTrees of Comparison

work page
[10]

, D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1,

For j= 1, . . . , D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1, . . . , T . And fin(x) =f j(x) for all internal nodes of Tj. Thus If i(XT ,T j) = arg max1≤t≤T fj(x(t))

work page
[11]

We then haveI F (XT ) =ST j=1 If i(XT ,T j)

work page
[12]

Assuming that each fi(z) obtains maximum at different zi, and non-degeneracy condition as in Yu et al

These D Trees of Comparison each has T leaves, thus it upper bounds the Number of Comparison ofFbyD(T−1). Assuming that each fi(z) obtains maximum at different zi, and non-degeneracy condition as in Yu et al. [2026]. We then show that the Number of Comparison is lower bounded byD(T−D)

work page 2026
[13]

As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree

It is clear that β1 = 1. As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree. As there are at mostDTrees, We haveN ′ ≥T−D, thusβ ′ = 1

work page
[14]

and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons

As |IF (XT )|=D, a.e. and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons

work page
[15]

And Number of Comparison is calculated withPD j=1(|Ij| −1)

For each Tj, denote by Ij ⊆[T] the union of index sets on all its leaves. And Number of Comparison is calculated withPD j=1(|Ij| −1)

work page
[16]

Then suppose t1,

Assume that |Ij| ≤T−D for some j. Then suppose t1, . . . , tD /∈Ij. Then let x(ti) =z i for i= 1, . . . , D , we have that IF (XT ) ={t 1, . . . , tD}. However, [IF (XT )∩I f i(XT ,T j)]⊆ [IF (XT )∩I j] =∅. Then we have that |IF (XT )∩ T[ i=1 If i(XT ,T i)|(60) =|IF (XT )∩ [ i̸=j If i(XT ,T i)|(61) ≤| [ i̸=j If i(XT ,T i)|(62) ≤D−1(63) <|IF (XT )|(64) whi...

work page
[17]

, D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D)

Then we have |Ij| ≥T−D+ 1 for all j= 1, . . . , D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D). We then successfully showed that the Number of Comparison of F(X T ) = F0(⊕D i=1 max1≤t≤T fi(x(t)))isDT+O(1). Triangle-Center TargetFor triangle-center target F(X T ) = min 1≤t1,t2,t3≤T ∥x(t1) +x(t 2) + x(t3)∥2 2, we show that it has β1 = 3 and β′ = 3, with Number of Co...

work page
[18]

There areT 3 choices of such ordered set

We construct one Tree of Comparison, with each leaf being an ordered set(t1, t2, t3) that allows repetition. There areT 3 choices of such ordered set

work page
[19]

For all internal nodes, they have the same fin(X[I]) =−∥ P t∈I x(t)∥2

work page
[20]

With the above construction, we have that the Order and Dimension of Comparison upper bounded by3, namelyβ ′ ≤β 1 ≤3

Thus at the root, we haveI f i(XT ,T) = arg max I⊆[T],|I|=3 ∥P t∈I x(t)∥2 2 =I F (XT ). With the above construction, we have that the Order and Dimension of Comparison upper bounded by3, namelyβ ′ ≤β 1 ≤3. We then show the lower bound. Consider X T = [−1,1] T d ⊂R T d as part of the T d-dimensional Euclidean Space. Now for two given triple pairs (unordere...

work page
[21]

left" or

By symmetry, we can find a connected component Γ⊆Γ ′′ that is separated into two pieces Γl,Γ r such thatµ(Γ) = 2µ(Γ l) = 2µ(Γr)>0, andγ=γ ′′ ∩Γalso haveν(γ)>0. Now for each pairs of unordered triple I ′, J′ that is not the same as I0, J0, let γ′ ={X T ∈ X T :g(X[I 1]) =g(X[J 1])} ∩Γ for non-degenerate g (namely γ′ have dimension T d−1 ). As {I ′, J′} ̸={I...

work page

[1] [1]

By the construction of Ut and Z1,T , Z2,T , let j∗ = min t∈J, we have that F(W i,t) =∥w i(j∗)− (−b1)∥2 2 =∥z i(j∗)−(−b 1)∥2

Furthermore, by the fact thatY∩B(−b 1, L) =∅, this further reduces toF(W i,t) = mins∈J ∥wi(t)−(−b 1)∥2 2. By the construction of Ut and Z1,T , Z2,T , let j∗ = min t∈J, we have that F(W i,t) =∥w i(j∗)− (−b1)∥2 2 =∥z i(j∗)−(−b 1)∥2

work page

[2] [2]

Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε

From z1(j∗)̸=z 2(j∗), we have that |∥z1(j∗)−(−b 1)∥2 2 − ∥z2(j∗)− (−b1)∥2 2| ≥4ε. Fromδ(M 1)≤ε, we have that|M 1(W1,T )−M 1(W1,T )| ≤2ε. Then by Lemma 4 in Yu et al. [2026], which states that for two vectorsv 1, v2 ∈R n. If ∥v1 −v 2∥2 ≤Aand∥ ˆF(v 1)− ˆF(v 2)∥ ≥B, where ˆF:R n →R m is a two-layer feed-forward network satisfying with Tanh-activation and bou...

work page 2026

[3] [3]

Then there exists a two-layer Tanh-activated FFN F1 with at most O(1/ε1) parameters such that |F1((x, y))−G((x, y))| ≤ ε

Denote by G: [−1,1] 6 →[−1,1] 6 the following function: G((x, y)) = (xT y, 1 2 ,0,0,0,0) , where (x, y)∈R 6, x, y∈R 3. Then there exists a two-layer Tanh-activated FFN F1 with at most O(1/ε1) parameters such that |F1((x, y))−G((x, y))| ≤ ε

work page

[4] [4]

(As we can rewrite xiyi = 1 4[(xi +y i)2 −(x i −y i)2], this approximation of inner product is transformed to one dimensional FFN approximation problem, therefore O(1/ε1) approximation rate can be achieved by FFNs.)

work page

[5] [5]

Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A

Write x2(t) = (a(t), 1 2 ,0,0,0,0)∈R 6. Let W T Q,1,2WK,1,2 =A∈R 6×6, where A2,1 = −2 is the only non-zero element of A. Thus we have that (WQ,1,2c2)T WK,1,1x2(t) = −a(t). 6.W V,1,2 =I 6, WO,2 = 0 0 I3 0 . With the scaling factor β >0 , we also have a large enough β such that the 4-th element of c′ 2 satisfies that |(c′ 2)4 −min 1≤t≤T a(t)| ≤ ε

work page

[6] [6]

Then the output feed-forward block mapsc ′ 2 to2 + 18(c ′ 2)4 withO(1)parameters

work page

[7] [7]

Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold

The total error is |M2(XT )−F(X T )|(39) ≤18[ max 1≤t≤T (|x(t)T v(t)− 1 9 min 1≤s≤T x(s)T x(t)|+|a(t)− 1 9 min 1≤s≤T x(s)T x(t)|)(40) +|(c′ 2)4 −min 1≤t≤T a(t)|](41) ≤ ε 3 + ε 3 + ε 3 (42) ≤ε(43) By this construction we have proven the second part of Theorem 1. Remark.If we chooseX= [0,1] 3, we can also choose a similar target for Theorem 1 to hold. A.2 F...

work page 2026

[8] [8]

And for each internal node of T Ri, its correspond- ing fin is defined as fin(X[{s1, s2}]) =x(s 1)T Aix(s2)

For T Ri, each leaf is associated with an ordered index set t1, t2 for 1≤t 1, t2 ≤ T , thus T Ri has T 2 leaves. And for each internal node of T Ri, its correspond- ing fin is defined as fin(X[{s1, s2}]) =x(s 1)T Aix(s2). Thus If i(XT , T Ri) = arg max{s1,s2} x(s1)T Aix(s2). Then we haveI F (XT )⊆ SD i=1 If i(XT , T Ri). With this upper bound we also have...

work page 2099

[9] [9]

It is clear that|I F (XT )| ≤D, a.e., so we constructDTrees of Comparison

work page

[10] [10]

, D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1,

For j= 1, . . . , D , Tj have exactly T leaves, each leaf being a single-element index set {t} for t= 1, . . . , T . And fin(x) =f j(x) for all internal nodes of Tj. Thus If i(XT ,T j) = arg max1≤t≤T fj(x(t))

work page

[11] [11]

We then haveI F (XT ) =ST j=1 If i(XT ,T j)

work page

[12] [12]

Assuming that each fi(z) obtains maximum at different zi, and non-degeneracy condition as in Yu et al

These D Trees of Comparison each has T leaves, thus it upper bounds the Number of Comparison ofFbyD(T−1). Assuming that each fi(z) obtains maximum at different zi, and non-degeneracy condition as in Yu et al. [2026]. We then show that the Number of Comparison is lower bounded byD(T−D)

work page 2026

[13] [13]

As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree

It is clear that β1 = 1. As all t’s can appear inIF (XT ), so {t} has to be some leaf of some Tree. As there are at mostDTrees, We haveN ′ ≥T−D, thusβ ′ = 1

work page

[14] [14]

and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons

As |IF (XT )|=D, a.e. and |If i(XT ,T j)|= 1 , thus we need exactly D Trees of Compar- isons

work page

[15] [15]

And Number of Comparison is calculated withPD j=1(|Ij| −1)

For each Tj, denote by Ij ⊆[T] the union of index sets on all its leaves. And Number of Comparison is calculated withPD j=1(|Ij| −1)

work page

[16] [16]

Then suppose t1,

Assume that |Ij| ≤T−D for some j. Then suppose t1, . . . , tD /∈Ij. Then let x(ti) =z i for i= 1, . . . , D , we have that IF (XT ) ={t 1, . . . , tD}. However, [IF (XT )∩I f i(XT ,T j)]⊆ [IF (XT )∩I j] =∅. Then we have that |IF (XT )∩ T[ i=1 If i(XT ,T i)|(60) =|IF (XT )∩ [ i̸=j If i(XT ,T i)|(61) ≤| [ i̸=j If i(XT ,T i)|(62) ≤D−1(63) <|IF (XT )|(64) whi...

work page

[17] [17]

, D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D)

Then we have |Ij| ≥T−D+ 1 for all j= 1, . . . , D , and thus N ′ ≥D(T−D+ 1)−D= D(T−D). We then successfully showed that the Number of Comparison of F(X T ) = F0(⊕D i=1 max1≤t≤T fi(x(t)))isDT+O(1). Triangle-Center TargetFor triangle-center target F(X T ) = min 1≤t1,t2,t3≤T ∥x(t1) +x(t 2) + x(t3)∥2 2, we show that it has β1 = 3 and β′ = 3, with Number of Co...

work page

[18] [18]

There areT 3 choices of such ordered set

We construct one Tree of Comparison, with each leaf being an ordered set(t1, t2, t3) that allows repetition. There areT 3 choices of such ordered set

work page

[19] [19]

For all internal nodes, they have the same fin(X[I]) =−∥ P t∈I x(t)∥2

work page

[20] [20]

With the above construction, we have that the Order and Dimension of Comparison upper bounded by3, namelyβ ′ ≤β 1 ≤3

Thus at the root, we haveI f i(XT ,T) = arg max I⊆[T],|I|=3 ∥P t∈I x(t)∥2 2 =I F (XT ). With the above construction, we have that the Order and Dimension of Comparison upper bounded by3, namelyβ ′ ≤β 1 ≤3. We then show the lower bound. Consider X T = [−1,1] T d ⊂R T d as part of the T d-dimensional Euclidean Space. Now for two given triple pairs (unordere...

work page

[21] [21]

left" or

By symmetry, we can find a connected component Γ⊆Γ ′′ that is separated into two pieces Γl,Γ r such thatµ(Γ) = 2µ(Γ l) = 2µ(Γr)>0, andγ=γ ′′ ∩Γalso haveν(γ)>0. Now for each pairs of unordered triple I ′, J′ that is not the same as I0, J0, let γ′ ={X T ∈ X T :g(X[I 1]) =g(X[J 1])} ∩Γ for non-degenerate g (namely γ′ have dimension T d−1 ). As {I ′, J′} ̸={I...

work page