Barriers to Universal Reasoning With Transformers (And How to Overcome Them)
Pith reviewed 2026-05-07 16:31 UTC · model grok-4.3
The pith
Under standard positional encodings and finite alphabets, chain-of-thought does not let Transformers generalize beyond TC^0, but growing vocabularies enable length-generalizable Turing machine simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under standard positional encodings and a finite alphabet, Transformers with chain-of-thought cannot solve problems beyond TC^0 because length-generalizable learnability does not inherit the theoretical expressivity gains. When vocabulary is permitted to grow with input size, unique signpost tokens per tape position together with value-change encodings produce a length-generalizable Turing-machine simulation in which the chain-of-thought trace length is linear in the simulated runtime up to a constant factor.
What carries the argument
Unique signpost tokens assigned to each tape position combined with value-change encodings that log only symbol updates, allowing recovery of the current tape symbol through simple counts rather than repeated copying or last-occurrence search.
If this is right
- The theoretical Turing-completeness result for chain-of-thought does not translate into length-generalizable learnability under finite alphabets.
- Repeated copying and last-occurrence retrieval constitute the two primary obstacles that must be circumvented for reliable scaling.
- The signpost-plus-value-change method achieves Turing-machine simulation whose trace length scales linearly with simulated runtime.
- Empirical length generalization improves on hard problems when signpost tokens and value-change encodings are used.
Where Pith is reading between the lines
- Architectures that can dynamically allocate new tokens as problem size increases may be necessary for universal reasoning even if fixed-vocabulary models improve in other ways.
- The same signpost technique could be adapted to other sequence-to-sequence tasks that require tracking positions over variable-length inputs.
- Testing the construction on concrete problems such as circuit evaluation or graph reachability at increasing lengths would give a direct measure of the linear scaling claim.
Load-bearing premise
The cited theoretical frameworks that bound Transformer length generalization to TC^0 continue to apply once chain-of-thought traces are introduced with a fixed finite alphabet.
What would settle it
Train a Transformer on short instances of a problem outside TC^0 using the signpost and value-change construction, then test whether it produces correct outputs on instances whose required chain-of-thought length is several times longer than any seen in training.
Figures
read the original abstract
Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than those seen during training is understudied. We use recent theoretical frameworks for Transformer length generalization and find that -- under standard positional encodings and a finite alphabet -- Transformers with CoT cannot solve problems beyond $TC^0$, i.e. the expressivity benefits do not hold under the stricter requirement of length-generalizable learnability. However, if we allow the vocabulary to grow with problem size, we attain a length-generalizable simulation of Turing machines where the CoT trace length is linear in the simulated runtime up to a constant. Our construction overcomes two core obstacles to reliable length generalization: repeated copying and last-occurrence retrieval. We assign each tape position a unique signpost token, and log only value changes to enable recovery of the current tape symbol through counts circumventing both barriers. Further, we empirically show that the use of such signpost tokens and value change encodings provide actionable guidance to improve length generalization on hard problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under standard positional encodings and a finite alphabet, Transformers with Chain-of-Thought (CoT) are limited to TC^0 for length-generalizable learnability, so that the known Turing-completeness benefits of CoT do not transfer. It then presents a construction that grows the vocabulary with problem size, introduces signpost tokens for each tape position, and logs only value changes; this yields a length-generalizable Turing-machine simulation whose CoT trace is linear in the simulated runtime. The authors further report that these encodings give empirical guidance for improving length generalization on hard problems.
Significance. If the applicability of the cited length-generalization frameworks to the CoT regime is confirmed and the construction is shown to be learnable, the work would usefully separate expressivity from length-generalizable learnability for Transformers and supply a concrete, linear-overhead route to universal reasoning. The explicit identification of repeated-copying and last-occurrence barriers, together with the signpost/value-change mechanism that circumvents them, is a clear positive contribution.
major comments (2)
- [theoretical analysis of length generalization] The central negative claim—that standard Transformers + CoT remain inside TC^0 under length-generalizable learnability with finite alphabet—rests on the direct transfer of existing length-generalization frameworks to the autoregressive CoT setting. The manuscript does not supply an explicit argument or reduction showing that the variable-length CoT trace cannot encode auxiliary state that bypasses the frameworks’ barriers (e.g., repeated copying or last-occurrence retrieval). This applicability step is load-bearing for the TC^0 bound and must be spelled out.
- [empirical evaluation] The empirical claim that signpost tokens and value-change encodings improve length generalization is asserted without any reported metrics, ablation controls, baseline comparisons, or statistical details. Because the construction is the paper’s positive contribution, the absence of quantitative evidence leaves the practical utility of the proposed encodings unverified.
minor comments (2)
- [abstract] The abstract introduces “value change encodings” without a one-sentence definition; a brief parenthetical gloss on first use would improve readability.
- [construction section] Notation for the growing vocabulary and the linear CoT trace length should be introduced once in the main text and used consistently thereafter.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our results. We address each major point below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: The central negative claim—that standard Transformers + CoT remain inside TC^0 under length-generalizable learnability with finite alphabet—rests on the direct transfer of existing length-generalization frameworks to the autoregressive CoT setting. The manuscript does not supply an explicit argument or reduction showing that the variable-length CoT trace cannot encode auxiliary state that bypasses the frameworks’ barriers (e.g., repeated copying or last-occurrence retrieval). This applicability step is load-bearing for the TC^0 bound and must be spelled out.
Authors: We agree that the transfer of the length-generalization frameworks to the autoregressive CoT setting requires an explicit argument. In the revised manuscript we will add a dedicated subsection providing a formal reduction. This will show that, under finite alphabet and standard positional encodings, auxiliary state in the variable-length CoT trace cannot bypass the identified barriers (repeated copying and last-occurrence retrieval) while preserving length-generalizable learnability. The argument will leverage the fixed vocabulary and autoregressive constraints to demonstrate that any such encoding would violate the frameworks' assumptions on reliable generalization. revision: yes
-
Referee: The empirical claim that signpost tokens and value-change encodings improve length generalization is asserted without any reported metrics, ablation controls, baseline comparisons, or statistical details. Because the construction is the paper’s positive contribution, the absence of quantitative evidence leaves the practical utility of the proposed encodings unverified.
Authors: We acknowledge that the empirical section currently provides only high-level guidance without the requested quantitative details. In the revision we will expand this section to report concrete metrics on length-generalization performance, ablation studies comparing signpost tokens and value-change encodings against standard baselines, and appropriate statistical analysis. These additions will substantiate the practical utility of the encodings. revision: yes
Circularity Check
No significant circularity; claims rest on external frameworks and explicit construction
full rationale
The paper's negative result applies cited external length-generalization frameworks to CoT (no self-citation load-bearing or uniqueness imported from authors). The positive result is an explicit construction (signpost tokens + value-change logging) that does not reduce to fitted inputs, self-definitions, or prior ansatzes by the authors. No equations or steps equate a claimed prediction to its own inputs by construction. The derivation chain is self-contained against the external benchmarks it invokes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Recent theoretical frameworks for Transformer length generalization apply directly to the CoT setting with standard positional encodings and finite alphabet
invented entities (2)
-
signpost tokens
no independent evidence
-
value change encodings
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin
URLhttps://openreview.net/forum?id=duRRoGeoQT. Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax trans- formers are turing-complete. InThe Fourteenth International Conference on Learning Repre- sentations, 2026. URLhttps://openreview.net/forum?id=FdkPOHlChS. Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, V...
work page 2026
-
[2]
Entity Tracking in Language Models
Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.213. URLhttps://aclanthology.org/2023.acl-long.213/. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213, 2022. Belinda Z. Li, Zifan Carl Gu...
-
[3]
all state symbolsq∈Q
-
[4]
the special symbols⟨SEP⟩,⟨EOS⟩, and a block separator $
-
[5]
the symbol KEEP to show that no operation is taking place on the cell
-
[6]
Thus the generated CoT is a string overΞ∪ C
for each pair of distinct symbolsσ,τ∈Γ, a symbol WRITE(σ→τ). Thus the generated CoT is a string overΞ∪ C. Annotated input.For w=w 1 · · ·w n ∈Σ ∗, we use the annotated encodingAnnotate(w):= w1c1 w2c2 · · ·w ncn. Hence each input symbol is followed immediately by its signpost token. Notation.For each simulated time stepm≥0 • Letη m ∈N >0 be the position of...
-
[7]
Evaluate inner clause:NOT true→False
-
[8]
Evaluate root operator:False AND false→False Expected Output: False S5 PermutationIn this task, the model receives a natural language description of an initial state, which assigns five distinct objects to five variables (e.g., A = Apple, B = Banana). This initialization is followed by a sequence of state-changing operations consisting of pairwise ”swap” ...
-
[9]
Initialization: A = Apple, B = Banana, C = Cat, D = Dog, E = Hat
-
[10]
Operations (Sequence Length = 2): Step 1: Swap A and E Step 2: Swap B and C Expected Output (Final State): A = Hat, B = Cat, C = Banana, D = Dog, E = Apple Binary PermutationThis is the same task as S5 Permutation, but we restrict the object pool to only two objects (Cat, Dog) that are assigned to the five variables. We ensure to generate no problem insta...
-
[11]
Initialization: A = Cat, B = Dog, C = Cat, D = Dog, E = Dog
-
[12]
CoT Formats:We compare a standard dense trace against our proposed value-change trace
Operations (Sequence Length = 2): Step 1: Swap A and E Step 2: Swap B and C Expected Output (Final State): A = Dog, B = Cat, C = Dog, D = Dog, E = Cat B.2 CoT Training Formats & Further Training Details B.2.1 Parity Data Generation:For both training and evaluation, binary sequences of lengths bounded by the respective experimental setups are randomly samp...
-
[13]
Finally, we note that the signpost tokens (numbers in angle brackets) are implemented as atomic tokens in the whitespace tokenizer. Training Formats: S5 Permutation Naive CoT Prompt: init A Monkey B Cat C Hat D Dog E Book operation swap A B . swap A E . swap A B . end . ### trace Naive CoT Trace: swap A B write A Book B Cat C Hat D Dog E Monkey . swap A E...
-
[14]
node q = 1 Query: print(node a, node m, node b) Return only the final answer in⟨output⟩ ⟨output⟩tags. Prompt Format: Variable Assignment with Value Change System Prompt:You are a very careful and precise assistant. You always follow the instructions and solve tasks yourself. You never generate code. You also give the answer directly whenever possible with...
-
[15]
node a = 0 # node a: 9→0
-
[16]
node c = 8 # node c: 2→8
-
[17]
node e = 4 # node e: 5→4
-
[18]
node m = 7 # node m: 3→7
-
[19]
node q = 1 # node q: 2→1 Query: print(node a, node m, node b) Return only the final answer in⟨output⟩ ⟨output⟩tags. B.6 Hyperparameter Settings & Model Selection This section reports the hyperparameter configurations, the top three seeds based on the OOD validation set, as well as graphs for the ID validation and OOD validation performance throughout trai...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.