Efficiently Representing Algorithms With Chain-of-Thought Transformers

Anej Svete; Ashish Sabharwal; William Merrill; Yanhong Li

arxiv: 2606.19697 · v1 · pith:GTVYZKN4new · submitted 2026-06-18 · 💻 cs.LG · cs.AI· cs.CL

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Yanhong Li , Anej Svete , Ashish Sabharwal , William Merrill This is my paper

Pith reviewed 2026-06-26 18:39 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords chain-of-thoughttransformersWord RAMalgorithm simulationcomputational complexityreasoning modelsattention mechanisms

0 comments

The pith

Chain-of-thought transformers simulate Word RAM algorithms with only poly-logarithmic overhead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that chain-of-thought transformers can efficiently simulate algorithms written for the Word RAM model, which features random access to memory and constant-time operations on words of size log n. This allows them to perform tasks such as sorting n elements in O(n log n) steps or running Dijkstra's algorithm in O(E + V log V) steps, up to poly-log factors. The result is shown first for finite-precision transformers with poly-log width and rightmost unique hard attention, then extended to continuous CoT with vector reasoning and hybrid transformer-RNN architectures. This is significant because Word RAM is the standard model for algorithm design, and previous CoT simulations of Turing machines incurred quadratic overhead.

Core claim

CoT can efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in n. This overhead reduces to log-square when the Word RAM has a “flat” instruction set, and only logarithmic for multiplication-free flat instructions. The simulation holds for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, as well as for continuous CoT and hybrid RNN-transformer architectures with finite width and log-precision.

What carries the argument

The chain-of-thought (CoT) output sequence in transformers, which is used to simulate the memory accesses and operations of the Word RAM model through a series of reasoning steps.

If this is right

Sorting n items requires only O(n log n polylog n) CoT steps instead of quadratic in n.
Dijkstra's algorithm on graphs with V vertices and E edges runs in O((E + V log V) polylog V) steps.
Algorithms with flat instruction sets achieve log-squared overhead, and multiplication-free ones achieve logarithmic overhead.
CoT transformers can represent a wider range of practical algorithms efficiently compared to Turing machine simulations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of reasoning models may prioritize architectures supporting hard attention or hybrid RNN layers to achieve better algorithmic efficiency.
This could lead to new ways of training models to perform algorithmic tasks by directly targeting Word RAM simulations.
Future work might explore whether real-world language models exhibit similar efficiency when prompted with chain-of-thought for algorithmic problems.

Load-bearing premise

The transformer must use rightmost unique hard attention or the specified continuous or hybrid architectures to achieve the poly-log overhead.

What would settle it

A demonstration that some Word RAM algorithm, such as sorting, requires super-polylogarithmic steps in any chain-of-thought transformer of the described types.

Figures

Figures reproduced from arXiv: 2606.19697 by Anej Svete, Ashish Sabharwal, William Merrill, Yanhong Li.

**Figure 1.** Figure 1: Schematic of the CoT transcript simulating a Word RAM program. Within each slot, the model may emit TM-oracle scratch tokens—intermediate computations used to carry out the instruction—before committing the resulting memory updates as a list of address–value pairs. O(n log n) RAM steps on a TM therefore costs up to O(n 2 log n) steps, and on a single-tape TM sorting provably requires Ω(n 2/ log n) steps (H… view at source ↗

**Figure 2.** Figure 2: Six-step CCoT simulation of a RAM instruction; detailed caption in Appendix, Fig. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗

**Figure 3.** Figure 3: Executing a Word RAM program with discrete vs. continuous CoT under the unifiedmemory convention. The entire machine state—program counter, registers, and memory—lives in a single random-access memory mem, so each simulated step is fully described by the constantmany (K) address–value pairs it writes. Bottom-left (discrete CoT): each step emits K memory blocks mem # bits(a (s) k ) # bits(v (s) k ) follow… view at source ↗

**Figure 4.** Figure 4: Layer-block structure and CoT transcript view of one step in the log2 -width simulation. Layer blocks (center): each CoT step executes four constant-depth blocks. Block 1 reads bits(mem[0]) (the pc) via one indexed memory read in two attention layers (Prop. 3). Block 2 uses two more sequential attention layers: the first identifies the active instruction and reads its compile-time operand addresses; the se… view at source ↗

**Figure 5.** Figure 5: Phase structure and CoT transcript view of one RAM step in the CCoT (continuous CoT) simulation; counterpart to [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗

read the original abstract

The increasing popularity of \emph{reasoning} models -- language models that output a series of reasoning or thought tokens before producing an answer -- is justified, in part, by theoretical results showing that chain-of-thought (CoT) transformers can simulate Turing machines, and thus perform arbitrary computation. However, the Turing machine, while suitable for complexity-theoretic analysis, is not convenient, intuitive, or efficient for discussing algorithms. Algorithms are typically designed and analyzed at a higher level of abstraction, captured by the \emph{Word RAM} model with random-access memory and unit-cost operations on $\bigO(\log n)$-bit words. As a result, Word RAM algorithms can be substantially more efficient than their Turing machine counterparts, raising the question: \emph{Can CoT transformers efficiently simulate Word RAM algorithms?} For instance, can they sort $n$ items in $\bigO(n \log n)$ steps or run Dijkstra's algorithm in $\bigO(E + V \log V)$ steps? We answer affirmatively, up to poly-logarithmic overhead. We first establish this for finite-precision transformers with poly-logarithmic width and rightmost unique hard attention, then strengthen the result to two more practical settings with finite width and log-precision: \emph{continuous} CoT, where reasoning takes the form of vectors rather than tokens, and a \emph{hybrid} architecture in which transformer layers sit atop a recurrent (linear RNN) layer. In all three cases, we find that CoT \emph{can} efficiently simulate any Word RAM algorithm with only a poly-logarithmic overhead in $n$. This overhead reduces to log-square when the Word RAM has a ``flat'' instruction set, and only logarithmic for multiplication-free flat instructions -- in stark contrast to known CoT simulations of Turing machines, which require quadratic overhead over Word RAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoT can simulate Word RAM with poly-log overhead, tightening prior TM results to something closer to practical algorithm costs.

read the letter

The main point is that this paper shows chain-of-thought transformers can simulate arbitrary Word RAM algorithms with only poly-logarithmic overhead in n, rather than the quadratic cost of Turing-machine simulations. They give explicit constructions for three settings and tighten the bound further for flat instruction sets.

What stands out is the move from TM to Word RAM as the target model, plus the concrete overhead reductions: log-squared for flat sets and logarithmic when multiplication is dropped. The three architecture variants are laid out clearly in the abstract: finite-precision with poly-log width and rightmost unique hard attention, then continuous vector-based CoT, and a hybrid RNN-transformer. That progression and the contrast with existing TM work are the useful parts.

The soft spot is that the abstract states the existence of the simulations and bounds but does not show the attention constructions or precision arguments. Soundness therefore rests on whether the full derivations deliver those without extra hidden costs or assumptions on the model. The claims themselves look internally consistent and free of circularity.

This is for people working on the computational limits of transformers and CoT reasoning. Readers who care about algorithmic efficiency in these models will get direct value from the overhead numbers. It deserves peer review because the question is well-posed and the result, if the proofs check out, affects how we think about practical reasoning systems.

Referee Report

2 major / 1 minor

Summary. The paper claims that chain-of-thought (CoT) transformers can simulate arbitrary Word RAM algorithms with only polylogarithmic overhead in n (poly-log in general; log² for flat instruction sets; log for multiplication-free flat instructions). Explicit constructions are given for three settings: finite-precision transformers with poly-log width and rightmost unique hard attention; continuous CoT (vector reasoning) with finite width and log precision; and a hybrid architecture with transformer layers over a linear RNN. This improves on prior quadratic-overhead CoT simulations of Turing machines.

Significance. If the constructions and bounds hold, the result is significant: it supplies a realistic complexity-theoretic justification for CoT reasoning models by showing they can implement standard algorithms (sorting, Dijkstra, etc.) with near-optimal step counts under the Word RAM model. The explicit overhead reductions and the three architecture variants constitute a clear advance over TM-based analyses.

major comments (2)

[Abstract / finite-precision case] Abstract and the finite-precision construction: the central claim rests on the existence of explicit attention constructions and precision handling that achieve the stated polylog bounds; these derivations are not visible, leaving the soundness of the overhead claims unverifiable from the provided text. This is load-bearing for the main result.
[overhead-reduction claims] The reduction from general Word RAM to flat and multiplication-free instruction sets is stated to improve the overhead, but without the corresponding simulation details or the precise definition of the instruction-set restrictions, it is impossible to confirm that no hidden logarithmic factors are introduced.

minor comments (1)

Define 'rightmost unique hard attention' and 'flat instruction set' with a short formal paragraph or reference to prior work on the first occurrence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the significance of our results and for the constructive feedback. We address each major comment below with clarifications drawn from the manuscript and indicate where revisions will improve accessibility.

read point-by-point responses

Referee: [Abstract / finite-precision case] Abstract and the finite-precision construction: the central claim rests on the existence of explicit attention constructions and precision handling that achieve the stated polylog bounds; these derivations are not visible, leaving the soundness of the overhead claims unverifiable from the provided text. This is load-bearing for the main result.

Authors: The finite-precision construction with poly-log width and rightmost unique hard attention is given explicitly in Sections 3–4, including the precise attention masks, head configurations, and bit-precision bounds that yield the claimed polylogarithmic overhead. The derivations appear in the proofs of Theorems 3.1 and 4.2. To address verifiability concerns, we will add a concise high-level outline of the attention construction to the abstract and a one-paragraph roadmap at the start of Section 3. revision: partial
Referee: [overhead-reduction claims] The reduction from general Word RAM to flat and multiplication-free instruction sets is stated to improve the overhead, but without the corresponding simulation details or the precise definition of the instruction-set restrictions, it is impossible to confirm that no hidden logarithmic factors are introduced.

Authors: Section 5 formally defines the flat and multiplication-free instruction sets (Definitions 5.1–5.2) and provides the simulation reductions (Theorems 5.3 and 5.5). The proofs show that each reduction incurs at most an additional O(log n) factor, preserving the overall polylogarithmic bound with no hidden terms. We will expand Section 5 with a short pseudocode table summarizing the instruction-set mappings to make the absence of extra logarithmic factors immediately apparent. revision: partial

Circularity Check

0 steps flagged

No significant circularity; construction against external model

full rationale

The paper presents explicit theoretical constructions for CoT transformers (finite-precision with poly-log width and rightmost unique hard attention; continuous CoT; hybrid RNN-transformer) that simulate arbitrary Word RAM algorithms. The claimed polylog(n) overhead bounds (reducing to log² or log under flat/multiplication-free instruction sets) are derived from these constructions against the external Word RAM model, with no equations, parameters, or claims reducing by definition or self-citation to the target simulation bounds themselves. Architectural assumptions are stated upfront without circular reduction, and no load-bearing self-citations, fitted inputs renamed as predictions, or ansatz smuggling appear in the abstract or described claims. The derivation is self-contained as a new simulation result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about the Word RAM model and transformer attention mechanisms; no free parameters are fitted and no new entities are postulated.

axioms (2)

domain assumption Word RAM model permits unit-cost operations on O(log n)-bit words with random access memory.
Defines the target computational model whose algorithms are to be simulated.
domain assumption Rightmost unique hard attention and the continuous/hybrid variants admit the required memory and control operations.
Invoked to establish the initial simulation result and its extensions.

pith-pipeline@v0.9.1-grok · 5880 in / 1356 out tokens · 21388 ms · 2026-06-26T18:39:43.818343+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 6 canonical work pages

[1]

The Transformer Cookbook , url =

Andy Yang and Christopher Watson and Anton Xue and Satwik Bhattamishra and Jose Llarena and William Merrill and Emile Dos Santos Ferreira and Anej Svete and David Chiang , issn =. The Transformer Cookbook , url =. TMLR , note =
[2]

Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages , url =

Andy Yang and David Chiang and Dana Angluin , booktitle =. Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages , url =
[3]

On the Reasoning Abilities of Masked Diffusion Language Models , url =

Anej Svete and Ashish Sabharwal , booktitle =. On the Reasoning Abilities of Masked Diffusion Language Models , url =
[4]

ICML , year=

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't , author=. ICML , year=
[5]

Anej Svete and William Merrill and Ashish Sabharwal , title =
[6]

Gomez and Lukasz Kaiser and Illia Polosukhin , bibsource =

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , bibsource =. Attention is All you Need , url =. NeurIPS , pages =
[7]

Dagstuhl Reports , number =

Barcelo, Pablo and Chiang, David and Cybenko, George and Strobl, Lena and Yang, Andy , doi =. Dagstuhl Reports , number =
[8]

Wind and Tianyi Wu and Daniel Wuttke and Christian Zhou-Zheng , booktitle =

Bo Peng and Ruichong Zhang and Daniel Goldstein and Eric Alcaide and Xingjian Du and Haowen Hou and Jiaju Lin and Jiaxing Liu and Janna Lu and William Merrill and Guangyu Song and Kaifeng Tan and Saiteja Utpala and Nathan Wilce and Johan S. Wind and Tianyi Wu and Daniel Wuttke and Christian Zhou-Zheng , booktitle =
[9]

Ian , booktitle =

Brodnik, Andrej and Miltersen, Peter Bro and Munro, J. Ian , booktitle =. Trans-dichotomous algorithms without multiplication --- some upper and lower bounds , year =
[10]

arXiv , author =:2209.11895 , journal =

In-context Learning and Induction Heads , url =. arXiv , author =:2209.11895 , journal =

Pith/arXiv arXiv
[11]

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers , url =

Charles London and Varun Kanade , booktitle =. Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers , url =
[12]

arXiv , author =:2408.03314 , journal =

Scaling. arXiv , author =:2408.03314 , journal =

Pith/arXiv arXiv
[14]

DeepSeek-AI and : Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Boc...

work page doi:10.1038/s41586-025-09422-z
[15]

Forty-first International Conference on Machine Learning , title =

Ekin Aky. Forty-first International Conference on Machine Learning , title =
[16]

Hennie , doi =

F.C. Hennie , doi =. One-tape, off-line Turing machine computations , url =. Information and Control , number =
[17]

Sorting and Searching on the Word RAM , year =

Hagerup, Torben , booktitle =. Sorting and Searching on the Word RAM , year =
[18]

Chi and Quoc V

Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou , bibsource =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =. NeurIPS , timestamp =
[19]

Jeonghoon Kim and Byeongchan Lee and Cheonbok Park and Yeontaek Oh and Beomjun Kim and Taehwan Yoo and Seongjin Shin and Dongyoon Han and Jinwoo Shin and Kang Min Yoo , booktitle =. Peri-
[20]

Unique Hard Attention: A Tale of Two Sides , url =

Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan , booktitle =. Unique Hard Attention: A Tale of Two Sides , url =. doi:10.18653/v1/2025.acl-short.76 , month = jul, pages =

work page doi:10.18653/v1/2025.acl-short.76 2025
[21]

Characterizing the Expressivity of Fixed-Precision Transformer Language Models , url =

Jiaoda Li and Ryan Cotterell , booktitle =. Characterizing the Expressivity of Fixed-Precision Transformer Language Models , url =
[22]

Attention is Turing-Complete , url =

Jorge Pérez and Pablo Barceló and Javier Marinkovic , journal =. Attention is Turing-Complete , url =
[23]

, isbn =

Knuth, Donald E. , isbn =. The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms , url =
[24]

The Parallelism Tradeoff: Limitations of Log-Precision Transformers , volume =

Merrill, William and Sabharwal, Ashish , journal =. The Parallelism Tradeoff: Limitations of Log-Precision Transformers , volume =
[25]

Open Data Structures: An Introduction , volume =

Morin, Pat , publisher =. Open Data Structures: An Introduction , volume =
[26]

On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning , url =

Nowak, Franz and Svete, Anej and Butoi, Alexandra and Cotterell, Ryan , booktitle =. On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning , url =. doi:10.18653/v1/2024.acl-long.676 , month = aug, pages =

work page doi:10.18653/v1/2024.acl-long.676 2024
[27]

arXiv , author =:2412.16720 , journal =

Pith/arXiv arXiv
[28]

Linear Transformers Are Secretly Fast Weight Programmers , url =

Schlag, Imanol and Irie, Kazuki and Schmidhuber, J. Linear Transformers Are Secretly Fast Weight Programmers , url =. Proceedings of the 38th International Conference on Machine Learning , editor =
[29]

Finite automata, formal logic, and circuit complexity , year =

Straubing, Howard , isbn =. Finite automata, formal logic, and circuit complexity , year =
[30]

What Formal Languages Can Transformers Express? A Survey , url =

Strobl, Lena and Merrill, William and Weiss, Gail and Chiang, David and Angluin, Dana , doi =. What Formal Languages Can Transformers Express? A Survey , url =. TACL , pages =
[31]

Transformers Can Represent n -gram Language Models , url =

Svete, Anej and Cotterell, Ryan , booktitle =. Transformers Can Represent n -gram Language Models , url =. doi:10.18653/v1/2024.naacl-long.381 , month = jun, pages =

work page doi:10.18653/v1/2024.naacl-long.381 2024
[32]

arXiv , author =:2512.13961 , primaryclass =

Olmo 3 , url =. arXiv , author =:2512.13961 , primaryclass =

Pith/arXiv arXiv
[33]

Thorup, Mikkel , booktitle =. On
[34]

Mix Barrington

William Hesse and Eric Allender and David A. Uniform constant-depth threshold circuits for division and iterated multiplication , url =. Journal of Computer and System Sciences , note =. doi:https://doi.org/10.1016/S0022-0000(02)00025-9 , issn =

work page doi:10.1016/s0022-0000(02)00025-9
[35]

The Expressive Power of Transformers with Chain of Thought , url =

William Merrill and Ashish Sabharwal , bibsource =. The Expressive Power of Transformers with Chain of Thought , url =. ICLR , publisher =
[36]

A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers , url =

William Merrill and Ashish Sabharwal , booktitle =. A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers , url =
[37]

A Logic for Expressing Log-Precision Transformers , year =

William Merrill and Ashish Sabharwal , booktitle =. A Logic for Expressing Log-Precision Transformers , year =
[38]

arXiv , author =:2603.03612 , journal =

Why Are Linear RNNs More Parallelizable? , url =. arXiv , author =:2603.03612 , journal =

Pith/arXiv arXiv
[39]

Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =

Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon , booktitle =. Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =. doi:10.52202/079017-3668 , pages =

work page doi:10.52202/079017-3668
[40]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =

Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , booktitle =. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =

[1] [1]

The Transformer Cookbook , url =

Andy Yang and Christopher Watson and Anton Xue and Satwik Bhattamishra and Jose Llarena and William Merrill and Emile Dos Santos Ferreira and Anej Svete and David Chiang , issn =. The Transformer Cookbook , url =. TMLR , note =

[2] [2]

Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages , url =

Andy Yang and David Chiang and Dana Angluin , booktitle =. Masked Hard-Attention Transformers Recognize Exactly the Star-Free Languages , url =

[3] [3]

On the Reasoning Abilities of Masked Diffusion Language Models , url =

Anej Svete and Ashish Sabharwal , booktitle =. On the Reasoning Abilities of Masked Diffusion Language Models , url =

[4] [4]

ICML , year=

Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't , author=. ICML , year=

[5] [5]

Anej Svete and William Merrill and Ashish Sabharwal , title =

[6] [6]

Gomez and Lukasz Kaiser and Illia Polosukhin , bibsource =

Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , bibsource =. Attention is All you Need , url =. NeurIPS , pages =

[7] [7]

Dagstuhl Reports , number =

Barcelo, Pablo and Chiang, David and Cybenko, George and Strobl, Lena and Yang, Andy , doi =. Dagstuhl Reports , number =

[8] [8]

Wind and Tianyi Wu and Daniel Wuttke and Christian Zhou-Zheng , booktitle =

Bo Peng and Ruichong Zhang and Daniel Goldstein and Eric Alcaide and Xingjian Du and Haowen Hou and Jiaju Lin and Jiaxing Liu and Janna Lu and William Merrill and Guangyu Song and Kaifeng Tan and Saiteja Utpala and Nathan Wilce and Johan S. Wind and Tianyi Wu and Daniel Wuttke and Christian Zhou-Zheng , booktitle =

[9] [9]

Ian , booktitle =

Brodnik, Andrej and Miltersen, Peter Bro and Munro, J. Ian , booktitle =. Trans-dichotomous algorithms without multiplication --- some upper and lower bounds , year =

[10] [10]

arXiv , author =:2209.11895 , journal =

In-context Learning and Induction Heads , url =. arXiv , author =:2209.11895 , journal =

Pith/arXiv arXiv

[11] [11]

Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers , url =

Charles London and Varun Kanade , booktitle =. Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers , url =

[12] [12]

arXiv , author =:2408.03314 , journal =

Scaling. arXiv , author =:2408.03314 , journal =

Pith/arXiv arXiv

[13] [14]

DeepSeek-AI and : Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong and Bi, Xiao and Zhang, Xiaokang and Yu, Xingkai and Wu, Yu and Wu, Z. F. and Gou, Zhibin and Shao, Zhihong and Li, Zhuoshu and Gao, Ziyi and Liu, Aixin and Xue, Bing and Wang, Bingxuan and Wu, Boc...

work page doi:10.1038/s41586-025-09422-z

[14] [15]

Forty-first International Conference on Machine Learning , title =

Ekin Aky. Forty-first International Conference on Machine Learning , title =

[15] [16]

Hennie , doi =

F.C. Hennie , doi =. One-tape, off-line Turing machine computations , url =. Information and Control , number =

[16] [17]

Sorting and Searching on the Word RAM , year =

Hagerup, Torben , booktitle =. Sorting and Searching on the Word RAM , year =

[17] [18]

Chi and Quoc V

Jason Wei and Xuezhi Wang and Dale Schuurmans and Maarten Bosma and Brian Ichter and Fei Xia and Ed H. Chi and Quoc V. Le and Denny Zhou , bibsource =. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , url =. NeurIPS , timestamp =

[18] [19]

Jeonghoon Kim and Byeongchan Lee and Cheonbok Park and Yeontaek Oh and Beomjun Kim and Taehwan Yoo and Seongjin Shin and Dongyoon Han and Jinwoo Shin and Kang Min Yoo , booktitle =. Peri-

[19] [20]

Unique Hard Attention: A Tale of Two Sides , url =

Jerad, Selim and Svete, Anej and Li, Jiaoda and Cotterell, Ryan , booktitle =. Unique Hard Attention: A Tale of Two Sides , url =. doi:10.18653/v1/2025.acl-short.76 , month = jul, pages =

work page doi:10.18653/v1/2025.acl-short.76 2025

[20] [21]

Characterizing the Expressivity of Fixed-Precision Transformer Language Models , url =

Jiaoda Li and Ryan Cotterell , booktitle =. Characterizing the Expressivity of Fixed-Precision Transformer Language Models , url =

[21] [22]

Attention is Turing-Complete , url =

Jorge Pérez and Pablo Barceló and Javier Marinkovic , journal =. Attention is Turing-Complete , url =

[22] [23]

, isbn =

Knuth, Donald E. , isbn =. The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms , url =

[23] [24]

The Parallelism Tradeoff: Limitations of Log-Precision Transformers , volume =

Merrill, William and Sabharwal, Ashish , journal =. The Parallelism Tradeoff: Limitations of Log-Precision Transformers , volume =

[24] [25]

Open Data Structures: An Introduction , volume =

Morin, Pat , publisher =. Open Data Structures: An Introduction , volume =

[25] [26]

On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning , url =

Nowak, Franz and Svete, Anej and Butoi, Alexandra and Cotterell, Ryan , booktitle =. On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning , url =. doi:10.18653/v1/2024.acl-long.676 , month = aug, pages =

work page doi:10.18653/v1/2024.acl-long.676 2024

[26] [27]

arXiv , author =:2412.16720 , journal =

Pith/arXiv arXiv

[27] [28]

Linear Transformers Are Secretly Fast Weight Programmers , url =

Schlag, Imanol and Irie, Kazuki and Schmidhuber, J. Linear Transformers Are Secretly Fast Weight Programmers , url =. Proceedings of the 38th International Conference on Machine Learning , editor =

[28] [29]

Finite automata, formal logic, and circuit complexity , year =

Straubing, Howard , isbn =. Finite automata, formal logic, and circuit complexity , year =

[29] [30]

What Formal Languages Can Transformers Express? A Survey , url =

Strobl, Lena and Merrill, William and Weiss, Gail and Chiang, David and Angluin, Dana , doi =. What Formal Languages Can Transformers Express? A Survey , url =. TACL , pages =

[30] [31]

Transformers Can Represent n -gram Language Models , url =

Svete, Anej and Cotterell, Ryan , booktitle =. Transformers Can Represent n -gram Language Models , url =. doi:10.18653/v1/2024.naacl-long.381 , month = jun, pages =

work page doi:10.18653/v1/2024.naacl-long.381 2024

[31] [32]

arXiv , author =:2512.13961 , primaryclass =

Olmo 3 , url =. arXiv , author =:2512.13961 , primaryclass =

Pith/arXiv arXiv

[32] [33]

Thorup, Mikkel , booktitle =. On

[33] [34]

Mix Barrington

William Hesse and Eric Allender and David A. Uniform constant-depth threshold circuits for division and iterated multiplication , url =. Journal of Computer and System Sciences , note =. doi:https://doi.org/10.1016/S0022-0000(02)00025-9 , issn =

work page doi:10.1016/s0022-0000(02)00025-9

[34] [35]

The Expressive Power of Transformers with Chain of Thought , url =

William Merrill and Ashish Sabharwal , bibsource =. The Expressive Power of Transformers with Chain of Thought , url =. ICLR , publisher =

[35] [36]

A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers , url =

William Merrill and Ashish Sabharwal , booktitle =. A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers , url =

[36] [37]

A Logic for Expressing Log-Precision Transformers , year =

William Merrill and Ashish Sabharwal , booktitle =. A Logic for Expressing Log-Precision Transformers , year =

[37] [38]

arXiv , author =:2603.03612 , journal =

Why Are Linear RNNs More Parallelizable? , url =. arXiv , author =:2603.03612 , journal =

Pith/arXiv arXiv

[38] [39]

Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =

Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon , booktitle =. Parallelizing Linear Transformers with the Delta Rule over Sequence Length , url =. doi:10.52202/079017-3668 , pages =

work page doi:10.52202/079017-3668

[39] [40]

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =

Zhiyuan Li and Hong Liu and Denny Zhou and Tengyu Ma , booktitle =. Chain of Thought Empowers Transformers to Solve Inherently Serial Problems , url =