Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

Anej Svete; Ashish Sabharwal; Ryan Cotterell; William Merrill

arxiv: 2605.30523 · v1 · pith:6H6HLXW2new · submitted 2026-05-28 · 💻 cs.LG · cs.AI· cs.CC· cs.CL· cs.FL

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

Anej Svete , William Merrill , Ryan Cotterell , Ashish Sabharwal This is my paper

Pith reviewed 2026-06-29 08:11 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CCcs.CLcs.FL

keywords padded transformersexpressivityAC^0TC^0circuit complexityattention mechanismsnumeric precisionmodel depth

0 comments

The pith

Padded transformers with constant precision equal L-uniform AC^0 circuits while growing precision reaches L-uniform TC^0 regardless of width.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that polynomially padded transformers match specific circuit classes depending on numeric precision and depth. Constant-precision models compute exactly the functions in L-uniform AC^0, while models with growing precision compute those in L-uniform TC^0. Adding log^d N loops raises the classes to FO-uniform AC^d or TC^d. These equivalences remain unchanged when switching between softmax and average hard attention or when varying width. The results identify precision and depth as the decisive factors and show that many other modeling choices do not alter the computational power.

Core claim

Polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC^0, while growing-precision ones achieve L-uniform TC^0 regardless of width. Furthermore, looping enables sequential processing analogous to circuits: log^d N-looped constant-precision transformers reach FO-uniform AC^d, and growing-precision ones reach FO-uniform TC^d. All results hold for both softmax and average hard attention transformers.

What carries the argument

Polynomially padded transformer model whose expressivity is governed by constant versus growing numeric precision and by L-uniform versus FO-uniform circuit families.

If this is right

Increasing width beyond logarithmic size adds no further expressivity.
Softmax attention and average hard attention produce identical expressivity results.
Looping the transformer log^d N times raises its power from AC^0/TC^0 to AC^d/TC^d.
Precision level, not width or attention variant, determines whether the model reaches AC^0 or TC^0.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Fixed-precision transformers in practice may remain confined to AC^0-level tasks even when given padded inputs.
Recurrent or looped transformer variants could be used to target higher circuit classes without changing precision.
Design efforts aimed at greater expressivity should prioritize scaling numeric precision over increasing width.

Load-bearing premise

Padded inputs with filler symbols supply enough polynomial space for adaptive parallel computation, and the chosen definitions of L-uniform and FO-uniform circuit families correctly model what the transformer can do.

What would settle it

Exhibit a specific language in L-uniform TC^0 that no polynomially padded growing-precision transformer (of any width) can compute, or exhibit one outside TC^0 that such a transformer can compute.

Figures

Figures reproduced from arXiv: 2605.30523 by Anej Svete, Ashish Sabharwal, Ryan Cotterell, William Merrill.

**Figure 1.** Figure 1: Polynomially padded transformer expressivity across depth (Ó), precision (Ó), uniformity (Ñ), and width (Ñ) regimes, ignoring parameterizations that do not satisfy the sufficient volume constraint (cf. Def. 2.3). In contrast to most existing results on transformer expressivity, these results are exact for padded transformers. The purple lines mark the AC{TC divide: Constant-precision transformers are limit… view at source ↗

read the original abstract

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Padded transformers match L-uniform AC0 or TC0 depending on constant vs growing precision, with the results holding across attention types and widths, though the padding uniformity step is the part that needs the closest look in the proofs.

read the letter

This paper pins down that padded transformers line up with L-uniform AC0 when precision stays constant and L-uniform TC0 when it grows, and the same pattern scales to FO-uniform ACd and TCd once you add log^d N loops. The main new pieces are the exact equivalences under these variations plus the claim that width and attention type (softmax or average hard) do not change the classes.

They do a clean job showing robustness. Earlier mappings left several modeling choices open, so checking that the equivalences survive changes in width, attention mechanism, and uniformity level is useful. The observation that extra width or precision past logarithmic does not increase power is also a clear negative result that narrows what actually matters.

The soft spot sits in the padding construction itself. The argument needs both directions: every L-uniform circuit family must be simulated by a padded transformer of the right type, and every such transformer must be simulated by an L-uniform circuit family. If the filler symbols or their positions introduce any non-uniform advice when the circuit is built, the uniformity claims break. The stress-test note flags exactly this point, and the abstract does not give enough of the derivations to see whether the encoding of the fillers preserves L-uniformity or FO-uniformity in both directions.

The work is aimed at people who track transformer expressivity through circuit complexity. Anyone already reading the Merrill et al. line of results will get a sharper map of which knobs actually move the needle. It deserves a serious referee because the robustness claims address real open questions in the subfield, even if the uniformity details will need careful checking.

Referee Report

2 major / 2 minor

Summary. The paper claims that polynomially padded L-uniform constant-precision transformers are equivalent to L-uniform AC^0, while growing-precision versions are equivalent to L-uniform TC^0 independent of width. It further establishes that log^d N-looped constant-precision transformers reach FO-uniform AC^d and growing-precision ones reach FO-uniform TC^d. These equivalences are asserted to hold for both softmax and average hard attention, with numeric precision and depth as the primary factors affecting expressivity rather than width or attention type. Padding is positioned as a gadget supplying polynomial space for adaptive parallel computation.

Significance. If the equivalences and uniformity preservations hold, the work provides exact characterizations of padded transformer expressivity that are robust to attention type and width, clarifying the theoretical connections to circuit classes. This strengthens prior results by isolating precision and depth as the decisive parameters and demonstrates the utility of the padding construction for adaptive computation under standard uniformity notions.

major comments (2)

[Proofs of the main equivalences (likely §4-5)] The bidirectional simulations between padded transformers and the claimed circuit classes must explicitly preserve L-uniformity (and FO-uniformity for the looped case). Any dependence on the encoding or positioning of filler symbols in the attention definitions could introduce non-uniform advice, undermining the uniformity statements in the main theorems.
[Definitions of precision and the TC^0 simulation] The distinction between constant and growing numeric precision is load-bearing for separating AC^0 from TC^0; the manuscript should verify that the growing-precision simulation does not implicitly rely on width or non-constant precision in the constant-precision direction, and vice versa.

minor comments (2)

[Preliminaries] Clarify the exact definition of 'average hard attention' and how it differs from standard hard attention in the circuit constructions.
[Section 2] Ensure all uniformity notions (L-uniform, FO-uniform) are defined with explicit reference to the circuit families and transformer parameters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on uniformity preservation and the role of precision. We address the major comments point by point below.

read point-by-point responses

Referee: [Proofs of the main equivalences (likely §4-5)] The bidirectional simulations between padded transformers and the claimed circuit classes must explicitly preserve L-uniformity (and FO-uniformity for the looped case). Any dependence on the encoding or positioning of filler symbols in the attention definitions could introduce non-uniform advice, undermining the uniformity statements in the main theorems.

Authors: The proofs in §§4–5 construct the simulations using only logspace-computable functions to generate both the circuit descriptions and the transformer parameters from the input length, which directly preserves L-uniformity (and FO-uniformity for the looped case). Polynomial padding length is itself L-uniform, and both softmax and average-hard attention are defined via relative-position comparisons that are computable by logspace machines without reference to any non-uniform encoding of the filler symbols. We will add a short clarifying paragraph after the main simulation theorems to make this uniformity argument explicit. revision: partial
Referee: [Definitions of precision and the TC^0 simulation] The distinction between constant and growing numeric precision is load-bearing for separating AC^0 from TC^0; the manuscript should verify that the growing-precision simulation does not implicitly rely on width or non-constant precision in the constant-precision direction, and vice versa.

Authors: Constant precision is defined as a fixed constant number of bits independent of N; growing precision is O(log N) bits. The AC^0 direction uses only the constant-precision model and never invokes threshold gates. The TC^0 direction uses the extra bits solely to realize majority gates and does not rely on super-logarithmic width. We will insert a short verification subsection (or appendix paragraph) that cross-checks both directions against these definitions. revision: yes

Circularity Check

0 steps flagged

No significant circularity; equivalences grounded in standard circuit definitions

full rationale

The paper derives equivalences between polynomially padded transformers (constant/growing precision, L-uniform/FO-uniform, softmax/hard attention) and circuit classes (AC^0, TC^0, AC^d, TC^d) by formalizing padding as adaptive space and simulating both directions of the correspondence. These steps rely on explicit definitions of uniformity, attention, and looping rather than any self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. No equation or claim reduces to its own input by construction; the results are presented as robust across modeling choices and independent of the authors' prior work for the core uniformity arguments. This is the expected non-finding for a self-contained theoretical equivalence proof.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only; relies on standard circuit complexity definitions with no new free parameters or invented entities apparent.

axioms (2)

standard math Standard definitions of AC^0, TC^0, L-uniform and FO-uniform circuit families
Equivalences are stated directly to these classes.
domain assumption Padded transformers provide polynomial space for adaptive parallel computation
Central modeling choice enabling the equivalences.

pith-pipeline@v0.9.1-grok · 5777 in / 1277 out tokens · 26132 ms · 2026-06-29T08:11:10.145151+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 5 canonical work pages · 1 internal anchor

[1]

doi: 10.4230/DagRep.15

ISSN 2192-5283. doi: 10.4230/DagRep.15. 7.22. URL https://drops.dagstuhl.de/entities/ document/10.4230/DagRep.15.7.22. Bhattamishra, S., Ahuja, K., and Goyal, N. On the Ability and Limitations of Transformers to Recognize Formal Languages. In Webber, B., Cohn, T., He, Y ., and Liu, Y . (eds.),Proceedings of the 2020 Conference on Empiri- cal Methods in Na...

work page doi:10.4230/dagrep.15 2020
[2]

URL https://www.sciencedirect.com/ science/article/pii/S0022000002000259

doi: https://doi.org/10.1016/S0022-0000(02) 00025-9. URL https://www.sciencedirect.com/ science/article/pii/S0022000002000259. Special Issue on Complexity 2001. Immerman, N.Descriptive Complexity. Graduate Texts in Computer Science. Springer, New York,

work page doi:10.1016/s0022-0000(02 2001
[3]

doi: 10.1007/ 978-1-4612-0539-5

ISBN 978-0-387-98600-5. doi: 10.1007/ 978-1-4612-0539-5. URL https://link.springer. com/book/10.1007/978-1-4612-0539-5. Jerad, S., Svete, A., Li, J., and Cotterell, R. Unique hard attention: A tale of two sides. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.),Proceed- ings of the 63rd Annual Meeting of the Association for Computational Li...

work page doi:10.1007/978-1-4612-0539-5 2025
[4]

Characterizing the Expressivity of Local Attention in Transformers

doi: 10.18653/v1/2025.acl-short.76. URL https: //aclanthology.org/2025.acl-short.76/. Jerad, S., Svete, A., Hao, S., Cotterell, R., and Merrill, W. Context-free recognition with transformers. In Forty-third International Conference on Machine Learn- ing, 2026. URL https://openreview.net/forum?id= JOyxs9ElI7. Kövér, B., Butoi, A., Svete, A., Hahn, M., and ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-short.76 2025
[5]

URL https://jmlr.org/papers/v25/23-0518. html. Weiss, G., Goldberg, Y ., and Yahav, E. On the practi- cal computational power of finite precision RNNs for language recognition. In Gurevych, I. and Miyao, Y . (eds.),Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 740–745, Melbourne, Aust...
[6]

algorithm

Association for Computational Linguistics. doi: 10.18653/v1/P18-2117. URL https://aclanthology. org/P18-2117/. Yang, A., Chiang, D., and Angluin, D. Masked hard- attention transformers recognize exactly the star-free lan- guages. InNeurIPS, 2024. URL https://openreview. net/forum?id=FBMsBdH0yz. Yang, A., Cadilhac, M., and Chiang, D. Knee-deep in c-RASP: A...

work page doi:10.18653/v1/p18-2117 2024
[7]

Computing the initial hidden states H p0q for the input string w“w 1 ¨ ¨ ¨w N and computing H pl1q using the first l1 layers of the transformer
[8]

, l2 Ttimes to the hidden statesH pl1q to obtainH pl1`Tpl 2´l1qq

Applying the transformer layersl 1 `1, . . . , l2 Ttimes to the hidden statesH pl1q to obtainH pl1`Tpl 2´l1qq
[9]

, L to the hidden states H pl1`Tpl 2´l1qq to obtain the final representations Hthat are passed to the output layer

Applying the transformer layers l2 `1, . . . , L to the hidden states H pl1`Tpl 2´l1qq to obtain the final representations Hthat are passed to the output layer. The dynamic computational depth of looped transformers lets them perform more complex reasoning tasks by iteratively refining their hidden states across multiple timesteps. Importantly, these reas...
[10]

yPR 2D where x

and the form we use in our constructions when convenient. Padded transformers additionally pad the input string with padding (pause) symbols (Pfau et al., 2024; Goyal et al., 2024; 18 Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don’t London & Kanade, 2025). The number of padding symbols can depend on the input ...

2024
[11]

If it is,wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream

At timestep t“1 (the first application of the looped block), each symbol wn1 P t0,1u checks if it is immediately preceded by the & symbol, which denotes the beginning of the pointer in the string. If it is,wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream. Here,e 1 is the first unit vector ofR rlogNs
[12]

B ˘pn‚q

At each subsequent timestep tP t2, . . . ,rlogNsu (i.e., the t-th application of the same single-layer block), each symbol wn1 checks if the entry et´1 has already been written to the designated space of the previous symbol’s residual stream. If it has, wn1 copies and shifts et´1 into et, and stores et and dt def “d t´1 `w n1et in designated parts of its ...

2025
[13]

Let AP t0,1u NˆN be G’s adjacency matrix

We again only highlight the differences in the construction. Let AP t0,1u NˆN be G’s adjacency matrix. The input to the transformer is a string over the alphabet Σ def “ t0,1,˝,&u , where ˝ is the padding symbol and & is a dedicatedseparatorsymbol used to delimit the binary encodings of the source and target nodes; the adjacency matrix A occupies N 2 posi...
[14]

Both retrievals are facilitated by binary PEs: Head 1 attends with query B˘pi¨N`kq and head 2 with query B˘pk¨N`jq , both against keys B˘pi1 ¨N`j 1q at input position pi1, j1q

Computing Cℓ in the padding positions.A padding position with coordinates pi, k, jq uses two attention heads to retrieve Bℓ´1pi, kq from the input position pi, kq and Bℓ´1pk, jq from the input position pk, jq. Both retrievals are facilitated by binary PEs: Head 1 attends with query B˘pi¨N`kq and head 2 with query B˘pk¨N`jq , both against keys B˘pi1 ¨N`j 1...
[15]

We compute this disjunction by detection (Lem

Computing Bℓ in the input positions.An input position with coordinates pi, jq accepts as Bℓpi, jq iff there exists k with Cℓpi, k, jq “1 , i.e., Bℓpi, jq “ ŽN k“1 Cℓpi, k, jq. We compute this disjunction by detection (Lem. B.6): One attention head at the input position pi, jq attends over the N padding positions pi, k, jq for kP rNs (selected via the PE b...
[16]

First, graph connectivity is known to beNL-complete underFOreductions (Immerman, 1999)

1999
[17]

C.1 shows that log-depth transformers with cubic padding can recognize the graph connectivity problem L, i.e.,LPL-uniformLPT 1 c,l

Second, Thm. C.1 shows that log-depth transformers with cubic padding can recognize the graph connectivity problem L, i.e.,LPL-uniformLPT 1 c,l
[18]

4.1 gives L-uniformLPT 0 c,l “L-uniformAC 0

Finally, Thm. 4.1 gives L-uniformLPT 0 c,l “L-uniformAC 0. Since L-uniformAC 0 ĚFO-uniformAC 0 “FO (Mix Barrington et al., 1990),L-uniformLPT 0 c,l can recognize languages inFOand can therefore computeFOreductions. ■ C.3.1. CIRCUITEVALUATION We now introduce the circuit evaluation problem and show its completeness under L reductions. This section adapts M...

1990
[19]

This stage requiresOplogNqlayers

Stage 1, implemented by T read N : Converting the input w˝xCy into internal representations that will allow the transformer to process it. This stage requiresOplogNqlayers
[20]

This stage requiresLlayers

Stage 2, implemented byT step N unrolled L times: Iteratively applying the circuit operations to the internal representations to compute the final output. This stage requiresLlayers. Stage 1 converts the binary encodings of the argument pointers stored in the input string xCy into binary encodings stored in the residual stream as per the L-uniform constru...

[1] [1]

doi: 10.4230/DagRep.15

ISSN 2192-5283. doi: 10.4230/DagRep.15. 7.22. URL https://drops.dagstuhl.de/entities/ document/10.4230/DagRep.15.7.22. Bhattamishra, S., Ahuja, K., and Goyal, N. On the Ability and Limitations of Transformers to Recognize Formal Languages. In Webber, B., Cohn, T., He, Y ., and Liu, Y . (eds.),Proceedings of the 2020 Conference on Empiri- cal Methods in Na...

work page doi:10.4230/dagrep.15 2020

[2] [2]

URL https://www.sciencedirect.com/ science/article/pii/S0022000002000259

doi: https://doi.org/10.1016/S0022-0000(02) 00025-9. URL https://www.sciencedirect.com/ science/article/pii/S0022000002000259. Special Issue on Complexity 2001. Immerman, N.Descriptive Complexity. Graduate Texts in Computer Science. Springer, New York,

work page doi:10.1016/s0022-0000(02 2001

[3] [3]

doi: 10.1007/ 978-1-4612-0539-5

ISBN 978-0-387-98600-5. doi: 10.1007/ 978-1-4612-0539-5. URL https://link.springer. com/book/10.1007/978-1-4612-0539-5. Jerad, S., Svete, A., Li, J., and Cotterell, R. Unique hard attention: A tale of two sides. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.),Proceed- ings of the 63rd Annual Meeting of the Association for Computational Li...

work page doi:10.1007/978-1-4612-0539-5 2025

[4] [4]

Characterizing the Expressivity of Local Attention in Transformers

doi: 10.18653/v1/2025.acl-short.76. URL https: //aclanthology.org/2025.acl-short.76/. Jerad, S., Svete, A., Hao, S., Cotterell, R., and Merrill, W. Context-free recognition with transformers. In Forty-third International Conference on Machine Learn- ing, 2026. URL https://openreview.net/forum?id= JOyxs9ElI7. Kövér, B., Butoi, A., Svete, A., Hahn, M., and ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-short.76 2025

[5] [5]

URL https://jmlr.org/papers/v25/23-0518. html. Weiss, G., Goldberg, Y ., and Yahav, E. On the practi- cal computational power of finite precision RNNs for language recognition. In Gurevych, I. and Miyao, Y . (eds.),Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 740–745, Melbourne, Aust...

[6] [6]

algorithm

Association for Computational Linguistics. doi: 10.18653/v1/P18-2117. URL https://aclanthology. org/P18-2117/. Yang, A., Chiang, D., and Angluin, D. Masked hard- attention transformers recognize exactly the star-free lan- guages. InNeurIPS, 2024. URL https://openreview. net/forum?id=FBMsBdH0yz. Yang, A., Cadilhac, M., and Chiang, D. Knee-deep in c-RASP: A...

work page doi:10.18653/v1/p18-2117 2024

[7] [7]

Computing the initial hidden states H p0q for the input string w“w 1 ¨ ¨ ¨w N and computing H pl1q using the first l1 layers of the transformer

[8] [8]

, l2 Ttimes to the hidden statesH pl1q to obtainH pl1`Tpl 2´l1qq

Applying the transformer layersl 1 `1, . . . , l2 Ttimes to the hidden statesH pl1q to obtainH pl1`Tpl 2´l1qq

[9] [9]

, L to the hidden states H pl1`Tpl 2´l1qq to obtain the final representations Hthat are passed to the output layer

Applying the transformer layers l2 `1, . . . , L to the hidden states H pl1`Tpl 2´l1qq to obtain the final representations Hthat are passed to the output layer. The dynamic computational depth of looped transformers lets them perform more complex reasoning tasks by iteratively refining their hidden states across multiple timesteps. Importantly, these reas...

[10] [10]

yPR 2D where x

and the form we use in our constructions when convenient. Padded transformers additionally pad the input string with padding (pause) symbols (Pfau et al., 2024; Goyal et al., 2024; 18 Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don’t London & Kanade, 2025). The number of padding symbols can depend on the input ...

2024

[11] [11]

If it is,wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream

At timestep t“1 (the first application of the looped block), each symbol wn1 P t0,1u checks if it is immediately preceded by the & symbol, which denotes the beginning of the pointer in the string. If it is,wn1 stores e1 and d1 def “w n1e1 in designated parts of its residual stream. Here,e 1 is the first unit vector ofR rlogNs

[12] [12]

B ˘pn‚q

At each subsequent timestep tP t2, . . . ,rlogNsu (i.e., the t-th application of the same single-layer block), each symbol wn1 checks if the entry et´1 has already been written to the designated space of the previous symbol’s residual stream. If it has, wn1 copies and shifts et´1 into et, and stores et and dt def “d t´1 `w n1et in designated parts of its ...

2025

[13] [13]

Let AP t0,1u NˆN be G’s adjacency matrix

We again only highlight the differences in the construction. Let AP t0,1u NˆN be G’s adjacency matrix. The input to the transformer is a string over the alphabet Σ def “ t0,1,˝,&u , where ˝ is the padding symbol and & is a dedicatedseparatorsymbol used to delimit the binary encodings of the source and target nodes; the adjacency matrix A occupies N 2 posi...

[14] [14]

Both retrievals are facilitated by binary PEs: Head 1 attends with query B˘pi¨N`kq and head 2 with query B˘pk¨N`jq , both against keys B˘pi1 ¨N`j 1q at input position pi1, j1q

Computing Cℓ in the padding positions.A padding position with coordinates pi, k, jq uses two attention heads to retrieve Bℓ´1pi, kq from the input position pi, kq and Bℓ´1pk, jq from the input position pk, jq. Both retrievals are facilitated by binary PEs: Head 1 attends with query B˘pi¨N`kq and head 2 with query B˘pk¨N`jq , both against keys B˘pi1 ¨N`j 1...

[15] [15]

We compute this disjunction by detection (Lem

Computing Bℓ in the input positions.An input position with coordinates pi, jq accepts as Bℓpi, jq iff there exists k with Cℓpi, k, jq “1 , i.e., Bℓpi, jq “ ŽN k“1 Cℓpi, k, jq. We compute this disjunction by detection (Lem. B.6): One attention head at the input position pi, jq attends over the N padding positions pi, k, jq for kP rNs (selected via the PE b...

[16] [16]

First, graph connectivity is known to beNL-complete underFOreductions (Immerman, 1999)

1999

[17] [17]

C.1 shows that log-depth transformers with cubic padding can recognize the graph connectivity problem L, i.e.,LPL-uniformLPT 1 c,l

Second, Thm. C.1 shows that log-depth transformers with cubic padding can recognize the graph connectivity problem L, i.e.,LPL-uniformLPT 1 c,l

[18] [18]

4.1 gives L-uniformLPT 0 c,l “L-uniformAC 0

Finally, Thm. 4.1 gives L-uniformLPT 0 c,l “L-uniformAC 0. Since L-uniformAC 0 ĚFO-uniformAC 0 “FO (Mix Barrington et al., 1990),L-uniformLPT 0 c,l can recognize languages inFOand can therefore computeFOreductions. ■ C.3.1. CIRCUITEVALUATION We now introduce the circuit evaluation problem and show its completeness under L reductions. This section adapts M...

1990

[19] [19]

This stage requiresOplogNqlayers

Stage 1, implemented by T read N : Converting the input w˝xCy into internal representations that will allow the transformer to process it. This stage requiresOplogNqlayers

[20] [20]

This stage requiresLlayers

Stage 2, implemented byT step N unrolled L times: Iteratively applying the circuit operations to the internal representations to compute the final output. This stage requiresLlayers. Stage 1 converts the binary encodings of the argument pointers stored in the input string xCy into binary encodings stored in the residual stream as per the L-uniform constru...