pith. sign in

arxiv: 2604.25800 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.CL

Barriers to Universal Reasoning With Transformers (And How to Overcome Them)

Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords chain-of-thoughttransformerslength generalizationturing machinesTC0signpost tokensvalue change encodingsreasoning
0
0 comments X

The pith

Under standard positional encodings and finite alphabets, chain-of-thought does not let Transformers generalize beyond TC^0, but growing vocabularies enable length-generalizable Turing machine simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding chain-of-thought traces boosts theoretical expressivity to Turing completeness yet fails to produce length-generalizable solutions for anything harder than TC^0 when the alphabet stays fixed and positional encodings are standard. The core barriers are reliable repeated copying of symbols across long sequences and retrieval of the most recent occurrence of a given symbol. By letting vocabulary size grow with problem size, the authors introduce unique signpost tokens for each tape position and encodings that record only value changes, allowing the current symbol to be recovered from counts. This yields a simulation whose chain-of-thought trace grows linearly with the simulated Turing machine runtime. A reader should care because the construction isolates two concrete obstacles that block practical universal reasoning and supplies an explicit workaround shown to improve empirical length generalization on difficult tasks.

Core claim

Under standard positional encodings and a finite alphabet, Transformers with chain-of-thought cannot solve problems beyond TC^0 because length-generalizable learnability does not inherit the theoretical expressivity gains. When vocabulary is permitted to grow with input size, unique signpost tokens per tape position together with value-change encodings produce a length-generalizable Turing-machine simulation in which the chain-of-thought trace length is linear in the simulated runtime up to a constant factor.

What carries the argument

Unique signpost tokens assigned to each tape position combined with value-change encodings that log only symbol updates, allowing recovery of the current tape symbol through simple counts rather than repeated copying or last-occurrence search.

If this is right

  • The theoretical Turing-completeness result for chain-of-thought does not translate into length-generalizable learnability under finite alphabets.
  • Repeated copying and last-occurrence retrieval constitute the two primary obstacles that must be circumvented for reliable scaling.
  • The signpost-plus-value-change method achieves Turing-machine simulation whose trace length scales linearly with simulated runtime.
  • Empirical length generalization improves on hard problems when signpost tokens and value-change encodings are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that can dynamically allocate new tokens as problem size increases may be necessary for universal reasoning even if fixed-vocabulary models improve in other ways.
  • The same signpost technique could be adapted to other sequence-to-sequence tasks that require tracking positions over variable-length inputs.
  • Testing the construction on concrete problems such as circuit evaluation or graph reachability at increasing lengths would give a direct measure of the linear scaling claim.

Load-bearing premise

The cited theoretical frameworks that bound Transformer length generalization to TC^0 continue to apply once chain-of-thought traces are introduced with a fixed finite alphabet.

What would settle it

Train a Transformer on short instances of a problem outside TC^0 using the signpost and value-change construction, then test whether it produces correct outputs on instances whose required chain-of-thought length is several times longer than any seen in training.

Figures

Figures reproduced from arXiv: 2604.25800 by Alexander Koller, Michael Hahn, Oliver Kraus, Yash Sarrof, Yuekun Yao.

Figure 1
Figure 1. Figure 1: Barriers to universal CoT and overcoming them: (A.1) Simulating TM compu view at source ↗
Figure 2
Figure 2. Figure 2: Autoregressive CoT Trace for PARITY. For the input 1101, which contains 3 ones, the CoT C-RASP program generates the trace E O E O, always starting with emitting an even E token and flipping till the count of even and odd tokens produced crosses the number of ones (here, nones + 1 = 4). It then emits ⟨SEP⟩, copies the final parity token O using the local C-RASP[Pos] predicate as the answer, and terminates … view at source ↗
Figure 3
Figure 3. Figure 3: A TM simulation step for a single tape in our construction (same logic extends view at source ↗
Figure 4
Figure 4. Figure 4: Length generalization from training length 50 to test length 100. We report results view at source ↗
Figure 5
Figure 5. Figure 5: We prompt open-source Transformer-based LLMs spanning different model view at source ↗
Figure 6
Figure 6. Figure 6: Evaluation curves for length 30 for all our tasks. We observe similar trends as seen view at source ↗
Figure 7
Figure 7. Figure 7: S5 becomes harder when the data contains more repetitive operations. We view at source ↗
Figure 8
Figure 8. Figure 8: Evaluation curves for training length 30 for the Binary Permutation task. Even with view at source ↗
Figure 9
Figure 9. Figure 9: Frequency of first-occurring errors across test lengths for our three evaluated CoT view at source ↗
Figure 10
Figure 10. Figure 10: Validation dynamics for the top 3 performing seeds (based on OOD validation view at source ↗
read the original abstract

Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that -- under standard positional encodings and a finite alphabet -- Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provide actionable guidance to improve length generalization on hard problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that under standard positional encodings and a finite alphabet, Transformers with Chain-of-Thought (CoT) are limited to TC^0 for length-generalizable learnability, so that the known Turing-completeness benefits of CoT do not transfer. It then presents a construction that grows the vocabulary with problem size, introduces signpost tokens for each tape position, and logs only value changes; this yields a length-generalizable Turing-machine simulation whose CoT trace is linear in the simulated runtime. The authors further report that these encodings give empirical guidance for improving length generalization on hard problems.

Significance. If the applicability of the cited length-generalization frameworks to the CoT regime is confirmed and the construction is shown to be learnable, the work would usefully separate expressivity from length-generalizable learnability for Transformers and supply a concrete, linear-overhead route to universal reasoning. The explicit identification of repeated-copying and last-occurrence barriers, together with the signpost/value-change mechanism that circumvents them, is a clear positive contribution.

major comments (2)
  1. [theoretical analysis of length generalization] The central negative claim—that standard Transformers + CoT remain inside TC^0 under length-generalizable learnability with finite alphabet—rests on the direct transfer of existing length-generalization frameworks to the autoregressive CoT setting. The manuscript does not supply an explicit argument or reduction showing that the variable-length CoT trace cannot encode auxiliary state that bypasses the frameworks’ barriers (e.g., repeated copying or last-occurrence retrieval). This applicability step is load-bearing for the TC^0 bound and must be spelled out.
  2. [empirical evaluation] The empirical claim that signpost tokens and value-change encodings improve length generalization is asserted without any reported metrics, ablation controls, baseline comparisons, or statistical details. Because the construction is the paper’s positive contribution, the absence of quantitative evidence leaves the practical utility of the proposed encodings unverified.
minor comments (2)
  1. [abstract] The abstract introduces “value change encodings” without a one-sentence definition; a brief parenthetical gloss on first use would improve readability.
  2. [construction section] Notation for the growing vocabulary and the linear CoT trace length should be introduced once in the main text and used consistently thereafter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our results. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: The central negative claim—that standard Transformers + CoT remain inside TC^0 under length-generalizable learnability with finite alphabet—rests on the direct transfer of existing length-generalization frameworks to the autoregressive CoT setting. The manuscript does not supply an explicit argument or reduction showing that the variable-length CoT trace cannot encode auxiliary state that bypasses the frameworks’ barriers (e.g., repeated copying or last-occurrence retrieval). This applicability step is load-bearing for the TC^0 bound and must be spelled out.

    Authors: We agree that the transfer of the length-generalization frameworks to the autoregressive CoT setting requires an explicit argument. In the revised manuscript we will add a dedicated subsection providing a formal reduction. This will show that, under finite alphabet and standard positional encodings, auxiliary state in the variable-length CoT trace cannot bypass the identified barriers (repeated copying and last-occurrence retrieval) while preserving length-generalizable learnability. The argument will leverage the fixed vocabulary and autoregressive constraints to demonstrate that any such encoding would violate the frameworks' assumptions on reliable generalization. revision: yes

  2. Referee: The empirical claim that signpost tokens and value-change encodings improve length generalization is asserted without any reported metrics, ablation controls, baseline comparisons, or statistical details. Because the construction is the paper’s positive contribution, the absence of quantitative evidence leaves the practical utility of the proposed encodings unverified.

    Authors: We acknowledge that the empirical section currently provides only high-level guidance without the requested quantitative details. In the revision we will expand this section to report concrete metrics on length-generalization performance, ablation studies comparing signpost tokens and value-change encodings against standard baselines, and appropriate statistical analysis. These additions will substantiate the practical utility of the encodings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on external frameworks and explicit construction

full rationale

The paper's negative result applies cited external length-generalization frameworks to CoT (no self-citation load-bearing or uniqueness imported from authors). The positive result is an explicit construction (signpost tokens + value-change logging) that does not reduce to fitted inputs, self-definitions, or prior ansatzes by the authors. No equations or steps equate a claimed prediction to its own inputs by construction. The derivation chain is self-contained against the external benchmarks it invokes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claims rest on the validity of external length-generalization frameworks and on the effectiveness of the newly introduced signpost and change-log mechanisms; no numerical parameters are fitted.

axioms (1)
  • domain assumption Recent theoretical frameworks for Transformer length generalization apply directly to the CoT setting with standard positional encodings and finite alphabet
    Invoked to derive the TC^0 limitation
invented entities (2)
  • signpost tokens no independent evidence
    purpose: Unique marker assigned to each tape position to circumvent repeated copying and last-occurrence retrieval
    New tokens introduced in the construction
  • value change encodings no independent evidence
    purpose: Record only changes so current tape symbol can be recovered by counting markers
    New encoding scheme introduced to enable recovery without full copying

pith-pipeline@v0.9.0 · 5503 in / 1394 out tokens · 62809 ms · 2026-05-07T16:31:58.644043+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin

    URLhttps://openreview.net/forum?id=duRRoGeoQT. Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax trans- formers are turing-complete. InThe Fourteenth International Conference on Learning Repre- sentations, 2026. URLhttps://openreview.net/forum?id=FdkPOHlChS. Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, V...

  2. [2]

    Entity Tracking in Language Models

    Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.213. URLhttps://aclanthology.org/2023.acl-long.213/. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022. Belinda Z. Li, Zifan Carl Gu...

  3. [3]

    all state symbolsq∈Q

  4. [4]

    the special symbols⟨SEP⟩,⟨EOS⟩, and a block separator $

  5. [5]

    the symbol KEEP to show that no operation is taking place on the cell

  6. [6]

    Thus the generated CoT is a string overΞ∪ C

    for each pair of distinct symbolsσ,τ∈Γ, a symbol WRITE(σ→τ). Thus the generated CoT is a string overΞ∪ C. Annotated input.For w=w 1 · · ·w n ∈Σ ∗, we use the annotated encodingAnnotate(w):= w1c1 w2c2 · · ·w ncn. Hence each input symbol is followed immediately by its signpost token. Notation.For each simulated time stepm≥0 • Letη m ∈N >0 be the position of...

  7. [7]

    Evaluate inner clause:NOT true→False

  8. [8]

    This initialization is followed by a sequence of state-changing operations consisting of pairwise ”swap” instructions that exchange the contents of two variables

    Evaluate root operator:False AND false→False Expected Output: False S5 PermutationIn this task, the model receives a natural language description of an initial state, which assigns five distinct objects to five variables (e.g., A = Apple, B = Banana). This initialization is followed by a sequence of state-changing operations consisting of pairwise ”swap” ...

  9. [9]

    Initialization: A = Apple, B = Banana, C = Cat, D = Dog, E = Hat

  10. [10]

    We ensure to generate no problem instance where all variables are initialized to the same object, so the task does not become trivial

    Operations (Sequence Length = 2): Step 1: Swap A and E Step 2: Swap B and C Expected Output (Final State): A = Hat, B = Cat, C = Banana, D = Dog, E = Apple Binary PermutationThis is the same task as S5 Permutation, but we restrict the object pool to only two objects (Cat, Dog) that are assigned to the five variables. We ensure to generate no problem insta...

  11. [11]

    Initialization: A = Cat, B = Dog, C = Cat, D = Dog, E = Dog

  12. [12]

    CoT Formats:We compare a standard dense trace against our proposed value-change trace

    Operations (Sequence Length = 2): Step 1: Swap A and E Step 2: Swap B and C Expected Output (Final State): A = Dog, B = Cat, C = Dog, D = Dog, E = Cat B.2 CoT Training Formats & Further Training Details B.2.1 Parity Data Generation:For both training and evaluation, binary sequences of lengths bounded by the respective experimental setups are randomly samp...

  13. [13]

    ⟨index⟩ ⟨boolean⟩

    Finally, we note that the signpost tokens (numbers in angle brackets) are implemented as atomic tokens in the whitespace tokenizer. Training Formats: S5 Permutation Naive CoT Prompt: init A Monkey B Cat C Hat D Dog E Book operation swap A B . swap A E . swap A B . end . ### trace Naive CoT Trace: swap A B write A Book B Cat C Hat D Dog E Monkey . swap A E...

  14. [14]

    Prompt Format: Variable Assignment with Value Change System Prompt:You are a very careful and precise assistant

    node q = 1 Query: print(node a, node m, node b) Return only the final answer in⟨output⟩ ⟨output⟩tags. Prompt Format: Variable Assignment with Value Change System Prompt:You are a very careful and precise assistant. You always follow the instructions and solve tasks yourself. You never generate code. You also give the answer directly whenever possible with...

  15. [15]

    node a = 0 # node a: 9→0

  16. [16]

    node c = 8 # node c: 2→8

  17. [17]

    node e = 4 # node e: 5→4

  18. [18]

    node m = 7 # node m: 3→7

  19. [19]

    node q = 1 # node q: 2→1 Query: print(node a, node m, node b) Return only the final answer in⟨output⟩ ⟨output⟩tags. B.6 Hyperparameter Settings & Model Selection This section reports the hyperparameter configurations, the top three seeds based on the OOD validation set, as well as graphs for the ID validation and OOD validation performance throughout trai...