pith. sign in

arxiv: 2605.18079 · v1 · pith:JMH56MYWnew · submitted 2026-05-18 · 💻 cs.LG · cs.CC· cs.CL

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

Pith reviewed 2026-05-20 11:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CCcs.CL
keywords transformer expressivitysoftmax attentionchain-of-thoughtTuring machine simulationlow precision roundingsummarized CoTcomputational universalityattention score separation
0
0 comments X

The pith

Standard softmax transformers with rounding simulate Turing machines via chain-of-thought when depth and width scale logarithmically with context length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that practical transformer decoders using softmax attention and rounded low-precision values can simulate Turing machines through chain-of-thought reasoning, provided depth and width grow only logarithmically with sequence length. This bridges a gap with prior work that depended on hardmax attention or high-precision arithmetic not found in deployed models. An intermediate construction first builds hardmax transformers with ternary activations and well-separated attention scores to carry out the simulation, then converts the result to an equivalent softmax version that preserves the required distinctions. The same conversion shows summarized chain-of-thought achieves the simulation with model size scaling logarithmically in space rather than time. A Sudoku experiment indicates these low-precision constructions align more closely with actual learnability than high-precision alternatives.

Core claim

Standard transformer decoders equipped with softmax attention and rounding of activations and attention weights can simulate Turing machines via chain-of-thought when depth and width are allowed to grow logarithmically with context length. An intermediate construction using hardmax attention, ternary activations, and well-separated attention scores establishes the simulation, after which a conversion technique produces an equivalent softmax model without requiring unrealistic parameter magnitudes or activation precision. The same approach shows that summarized chain-of-thought enables simulation with model size scaling logarithmically in a space bound rather than a time bound.

What carries the argument

The conversion from a hardmax transformer with ternary activations and well-separated attention scores to a softmax version with rounding, which preserves the ability to simulate Turing machine steps without destroying distinctions in attention scores or activations.

If this is right

  • Transformers with practical softmax attention and low precision can perform arbitrary computation via chain-of-thought if their size scales logarithmically with input length.
  • Summarized chain-of-thought allows more efficient simulation, with logarithmic scaling in space requirements.
  • Prior high-precision or hardmax results overestimate the resources needed for Turing machine simulation in real models.
  • Empirical performance on tasks like Sudoku aligns better with learnability under these low-precision constructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If true, this suggests that chain-of-thought prompting in current large language models may enable them to solve a broader class of algorithmic problems than previously thought possible under low-precision constraints.
  • The logarithmic scaling implies that very deep but narrow models could achieve universal computation, testable by training models with controlled depth increases on synthetic Turing machine tasks.
  • Summarized CoT might allow more memory-efficient reasoning in deployed systems, potentially reducing the effective context length needed for complex multi-step problems.

Load-bearing premise

The assumption that converting the hardmax construction to softmax with rounding preserves the necessary distinctions in attention scores and activation values throughout the simulation steps.

What would settle it

A concrete counterexample in which a small-depth, small-width softmax transformer with rounding fails to simulate a simple Turing machine computation on a chain-of-thought sequence that the corresponding hardmax version succeeds on.

Figures

Figures reproduced from arXiv: 2605.18079 by Moritz Br\"osamle, Stephan Eckstein.

Figure 1
Figure 1. Figure 1: SCoT generation schematic: a token segment can be finished by a summary, which is then used to prompt the next segment until an output block is generated. For summarized Chain-of-Thought, T may generate a sum￾mary between other delimiter tokens instead of the output, and can then continue generating from this summary: Definition 2.2 (Summarized Chain-of-Thought). Con￾sider the setting of Definition 2.1, wh… view at source ↗
Figure 2
Figure 2. Figure 2: Example CoT token sequence for a 1-tape Turing machine that replaces ab with cb, run on input aab using r = 2 head-position bits. Head positions are written with least significant bit first. Each grey token encodes one Turing machine step consisting of the new state, the written symbol and the head movement (where L/S/R denote left/stay/right). these statements in Appendix C, the coordinates {1, . . . , d}… view at source ↗
Figure 3
Figure 3. Figure 3: Left: Configuration of a 2-tape Turing machine. Right: Encoding of this configuration into a summary for SCoT. Hats over symbols indicate the head position for each tape. Theorem 4.1. Fix a hardmax transformer T from one of the hardmax constructions from the previous section for an input/time/space bound ˆl, i.e., ˆl = ˆn for Proposition 3.1, ˆl = tˆ for Theorem 3.2 and ˆl = ˆs for Theorem 3.3. Then there … view at source ↗
Figure 4
Figure 4. Figure 4: Coordinatewise denoising function f implemented by the additional MLPs. For inputs inside the three white blocks, adding this function onto the residual stream (i.e., x 7→ x + f(x)) maps values deviating from −1, 0, 1 by at most 1 4 back to −1, 0, 1. the attention and removes the error with the MLP, while the second one contains the MLP from the original layer and does not use the attention. The more gener… view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy of L = 6 CoT models with dimension 512 and 8 heads trained on puzzles requiring at most N CoT tokens and evaluated on puzzles of varying CoT length. this section, models are trained for 20 billion tokens (ex￾cluding padding) with mixed-precision bfloat16 activations unless stated otherwise. For evaluation, models generate deterministically (temperature 0) with CoT or SCoT and the generation is see… view at source ↗
Figure 7
Figure 7. Figure 7: Example excerpt from a deterministic solver trace in our tokenization. Tokens are separated by spaces, with coordinate pairs and digits as separate tokens. N # puzzles CoT tokens # SCoT segments SCoT tokens 2 10 1,003,062 549,886,480 1,216,586 597,636,422 2 11 1,366,621 1,081,910,259 2,275,332 1,278,330,673 2 12 1,744,525 2,215,629,831 4,514,399 2,812,386,367 2 13 2,288,481 5,501,694,458 10,966,828 7,382,3… view at source ↗
Figure 8
Figure 8. Figure 8: Accuracy vs. correct CoT length at N = 214 for the original L = 6 bfloat16 run, the same run with 50 billion training tokens, an L = 8 bfloat16 run with the standard 20 billion training tokens, and an L = 6 fp32 run with 20 billion training tokens. Training N Test Acc (%) Hardest 100 Acc (%) 2 11 99.34 69 2 12 99.96 92 2 13 99.97 91 2 14 99.95 94 [PITH_FULL_IMAGE:figures/full_fig_p061_8.png] view at source ↗
read the original abstract

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that standard softmax-attention transformer decoders with rounding of activations and attention weights can simulate Turing machines via Chain-of-Thought when depth and width grow only logarithmically with context length. It proceeds by first constructing hardmax transformers that use ternary activations and attention scores separated by a fixed gap, then converting those constructions to equivalent softmax versions; the same technique is applied to summarized CoT to obtain logarithmic space rather than time scaling. Empirical predictions are tested on a Sudoku reasoning task.

Significance. If the hardmax-to-softmax conversion and rounding preservation arguments hold, the result meaningfully reduces the distance between theoretical expressivity results and the models actually deployed in practice. The logarithmic scaling claims, the summarized-CoT analysis, and the provision of reproducible code are concrete strengths.

major comments (1)
  1. [section describing the conversion technique (immediately after the hardmax TM simulation)] The central simulation rests on the conversion from the hardmax construction (ternary activations, well-separated attention scores) to softmax-plus-rounding. The manuscript must supply an explicit lower bound on the separation gap expressed in terms of the rounding bit-width, together with a proof that this gap is preserved under the specific rounding operator so that (i) the rounded softmax still selects exactly the same position as hardmax and (ii) activations remain exactly ternary after rounding. Without such a bound, error accumulation over the logarithmic number of CoT steps cannot be ruled out.
minor comments (2)
  1. State the precise bit-width and rounding operator used throughout the constructions so that the separation-gap argument can be verified numerically.
  2. [Sudoku experiment section] The Sudoku experiment tests learnability alignment but does not exercise the full TM-simulation or conversion claims; this limitation should be stated explicitly.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for recognizing the significance of bridging theoretical expressivity results with practical low-precision softmax transformers. We address the major comment below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: The central simulation rests on the conversion from the hardmax construction (ternary activations, well-separated attention scores) to softmax-plus-rounding. The manuscript must supply an explicit lower bound on the separation gap expressed in terms of the rounding bit-width, together with a proof that this gap is preserved under the specific rounding operator so that (i) the rounded softmax still selects exactly the same position as hardmax and (ii) activations remain exactly ternary after rounding. Without such a bound, error accumulation over the logarithmic number of CoT steps cannot be ruled out.

    Authors: We agree that an explicit lower bound on the separation gap, expressed in terms of the rounding bit-width, together with a preservation proof, is required to rigorously exclude error accumulation across the O(log n) CoT steps. In the revised version we will insert a new lemma immediately after the hardmax TM simulation. The lemma will establish that an attention-score separation of at least 4 * 2^{-b} (where b is the rounding bit-width) is preserved under the paper's specific rounding operator: the rounded softmax selects the identical argmax position, and the activations remain exactly ternary. The proof bounds the softmax perturbation by the separation gap and shows that rounding error is strictly smaller than half the gap, so each simulation step remains exact. Because the depth is only logarithmic, the overall simulation is therefore preserved without accumulation. We view this addition as a clarification that strengthens the central technical claim. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit constructions of TM simulators via hardmax then softmax conversion

full rationale

The paper derives its claims through explicit mathematical constructions: first building hardmax transformers with ternary activations and separated attention scores to simulate Turing machines under CoT, then converting those to equivalent softmax versions with rounding while preserving distinctions. These steps are direct existence proofs with stated assumptions on separation gaps and rounding, not reductions to fitted parameters, self-definitional loops, or load-bearing self-citations. The summarized CoT analysis follows the same construction technique for space-efficient scaling. The Sudoku empirical test is presented separately as validation of learnability predictions and does not underpin the theoretical derivation. No ansatzes, renamings of known results, or uniqueness theorems imported from prior author work are used in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on domain assumptions about attention score separation and rounding behavior rather than new free parameters or invented entities.

axioms (1)
  • domain assumption Well-separated attention scores and ternary activations allow faithful simulation of Turing machine steps that survive conversion to softmax and rounding.
    Invoked when converting the hardmax intermediate construction to the final softmax version.

pith-pipeline@v0.9.0 · 5719 in / 1186 out tokens · 39332 ms · 2026-05-20T11:47:40.272456+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    Bansal, A., Schwarzschild, A., Borgnia, E., Emam, Z., Huang, F., Goldblum, M., and Goldstein, T

    URL https://arxiv.org/abs/1607.0 6450. Bansal, A., Schwarzschild, A., Borgnia, E., Emam, Z., Huang, F., Goldblum, M., and Goldstein, T. End-to- end algorithm synthesis with recurrent networks: logical extrapolation without overthinking. InProceedings of the 36th International Conference on Neural Informa- tion Processing Systems, NIPS ’22, Red Hook, NY , USA,

  2. [2]

    ISBN 9781713871088

    Curran Associates Inc. ISBN 9781713871088. Bertsch, A., Alon, U., Neubig, G., and Gormley, M. R. Unlimiformer: long-range transformers with unlimited length input. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc. Bhattamishra, S., Patel, A., and Goyal, N...

  3. [3]

    URL https: //aclanthology.org/2020.conll-1.37/

    doi: 10.18653/v1/2020.conll-1.37. URL https: //aclanthology.org/2020.conll-1.37/. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E....

  4. [4]

    URL http://papers.nips.cc/paper_f iles/paper/2022/hash/67d57c32e20fd0a 7a302cb81d36e40d5-Abstract-Conference. html. Dettmers, T., Lewis, M., Belkada, Y ., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. InAdvances in Neural Information Process- ing Systems, volume 35, pp. 30318–30332. Curran As- sociates, Inc., 2022....

  5. [7]

    doi: 10.18653/v1/2025.acl-short.76

    Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.acl-short.76. URL https: //aclanthology.org/2025.acl-short.76/. Jiang, H., Hahn, M., Zetzsche, G., and Lin, A. W. Softmax transformers are Turing-complete. InThe Fourteenth International Conference on Learning Representations,

  6. [8]

    URL https://openreview.net/forum ?id=FdkPOHlChS. Oral. Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgN KkHtvB. Li, Q. and Wang, Y . Constant bit-size t...

  7. [9]

    Lost in the Middle: How Language Models Use Long Contexts

    URL https://openreview.net/forum ?id=5pHfYe10iX. Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits.Trans- actions of the Association for Computational Linguistics, 10:843–856, 2022. doi: 10.1162/tacl a 00493. URL https://aclanthology.org/2022.tacl-1 .49/. Micikevicius, P., Narang, S., Alben, J., Diam...

  8. [10]

    Minsky, M

    URL https://openreview.net/forum ?id=r1gs9JgRZ. Minsky, M. L.Computation: Finite and Infinite Machines. Prentice-Hall Series in Automatic Computation. Prentice- Hall, Englewood Cliffs, NJ, 1967. ISBN 0131655639. Nowak, F., Svete, A., Butoi, A., and Cotterell, R. On the representational capacity of neural language models with chain-of-thought reasoning. In...

  9. [11]

    Siegelmann, H

    URL https://openreview.net/forum ?id=B1ckMDqlg. Siegelmann, H. T. and Sontag, E. D. On the computational power of neural nets. InProceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 440–449, New York, NY , USA, 1992. Association for Computing Machinery. ISBN 089791497X. doi: 10.1145/130385.130432. URL https://doi.org/ 1...

  10. [12]

    URL https://acla nthology.org/2026.tacl-1.8/

    doi: 10.1162/tacl.a.597. URL https://acla nthology.org/2026.tacl-1.8/. Yang, C., Srebro, N., McAllester, D., and Li, Z. PEN- CIL: Long thoughts with short memory.arXiv preprint arXiv:2503.14337, 2025. Ye, J., Gao, J., Gong, S., Zheng, L., Jiang, X., Li, Z., and Kong, L. Beyond autoregression: Discrete diffusion for complex reasoning and planning. InThe Th...

  11. [13]

    double- logarithmic in the context length

    Representing them would hence require O(logd) mantissa (and O(log logd) exponent) bits, i.e. double- logarithmic in the context length. Alternatively, one could get completely rid of biases without changing any of the theorems in the main part by adding d constant 1 dimensions (set at embedding) to the representations connected with weights 0 or ±1 to eac...

  12. [14]

    Hence |tracei| ≤ 3(|summaryi−1| −1) +r+ 2≤6|summary i−1|

    Then the trace has maximum length if the length cap is hit at a <p> token, in this case r+ 2 tokens (the full head position encoding) are added to tracei in addition to the length cap. Hence |tracei| ≤ 3(|summaryi−1| −1) +r+ 2≤6|summary i−1|. In both cases it holds that |tracei| ≤6|summary i−1| ≤6(s M(w) + 3) and thus |segi| ≤8(s M(w) + 3). For the second...

  13. [15]

    In each layer, the attention matricesW ℓ,h Q , W ℓ,h K , W ℓ,h V , W ℓ,h O have at most one non-zero entry per row which is1

  14. [16]

    they satisfy the precondition from Lemma D.9

    In each layer, the output projections(W ℓ,h O )H h=1 write to disjoint coordinates, i.e. they satisfy the precondition from Lemma D.9

  15. [17]

    In each layer, the MLP matricesW 1, W2 satisfy ∥W1∥∞ ≤d ,∥W 2∥∞ ≤d ff

  16. [18]

    The embeddings and unembeddings have entries in{−1,0,1}

  17. [19]

    the condition of Lemma D.8 is satisfied in each layer

    For each x∈ X , all heads and layers have tie-invariant values, i.e. the condition of Lemma D.8 is satisfied in each layer

  18. [20]

    For eachx∈ X, all activations have entries in{−1,0,1}

  19. [21]

    The output scores ⟨unembv, x(L) n−1⟩ have a unique maximum which is larger than other scores by at least1. Proof. Item 1holds because all queries, keys and values are concatenations of registers/flags and write their output to some register/flag (cf. Appendix C.1).Item 2holds because no two heads in the same layer write to the same output register/flag. I...

  20. [22]

    For the first summand this is precisely the condition c2 ≥log(96N) p dk while for the second one it is the conditionb att m ≥4. D.5.1. PROOF OFTHEOREM4.2 We obtain Theorem 4.2 from Theorem D.17 by choosing N as a uniform upper bound on the length of any context that can occur in the corresponding hardmax construction. Proof of Theorem 4.2. Fix one of the ...