The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought
Pith reviewed 2026-05-20 11:47 UTC · model grok-4.3
The pith
Standard softmax transformers with rounding simulate Turing machines via chain-of-thought when depth and width scale logarithmically with context length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Standard transformer decoders equipped with softmax attention and rounding of activations and attention weights can simulate Turing machines via chain-of-thought when depth and width are allowed to grow logarithmically with context length. An intermediate construction using hardmax attention, ternary activations, and well-separated attention scores establishes the simulation, after which a conversion technique produces an equivalent softmax model without requiring unrealistic parameter magnitudes or activation precision. The same approach shows that summarized chain-of-thought enables simulation with model size scaling logarithmically in a space bound rather than a time bound.
What carries the argument
The conversion from a hardmax transformer with ternary activations and well-separated attention scores to a softmax version with rounding, which preserves the ability to simulate Turing machine steps without destroying distinctions in attention scores or activations.
If this is right
- Transformers with practical softmax attention and low precision can perform arbitrary computation via chain-of-thought if their size scales logarithmically with input length.
- Summarized chain-of-thought allows more efficient simulation, with logarithmic scaling in space requirements.
- Prior high-precision or hardmax results overestimate the resources needed for Turing machine simulation in real models.
- Empirical performance on tasks like Sudoku aligns better with learnability under these low-precision constructions.
Where Pith is reading between the lines
- If true, this suggests that chain-of-thought prompting in current large language models may enable them to solve a broader class of algorithmic problems than previously thought possible under low-precision constraints.
- The logarithmic scaling implies that very deep but narrow models could achieve universal computation, testable by training models with controlled depth increases on synthetic Turing machine tasks.
- Summarized CoT might allow more memory-efficient reasoning in deployed systems, potentially reducing the effective context length needed for complex multi-step problems.
Load-bearing premise
The assumption that converting the hardmax construction to softmax with rounding preserves the necessary distinctions in attention scores and activation values throughout the simulation steps.
What would settle it
A concrete counterexample in which a small-depth, small-width softmax transformer with rounding fails to simulate a simple Turing machine computation on a chain-of-thought sequence that the corresponding hardmax version succeeds on.
Figures
read the original abstract
Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard softmax-attention transformer decoders with rounding of activations and attention weights can simulate Turing machines via Chain-of-Thought when depth and width grow only logarithmically with context length. It proceeds by first constructing hardmax transformers that use ternary activations and attention scores separated by a fixed gap, then converting those constructions to equivalent softmax versions; the same technique is applied to summarized CoT to obtain logarithmic space rather than time scaling. Empirical predictions are tested on a Sudoku reasoning task.
Significance. If the hardmax-to-softmax conversion and rounding preservation arguments hold, the result meaningfully reduces the distance between theoretical expressivity results and the models actually deployed in practice. The logarithmic scaling claims, the summarized-CoT analysis, and the provision of reproducible code are concrete strengths.
major comments (1)
- [section describing the conversion technique (immediately after the hardmax TM simulation)] The central simulation rests on the conversion from the hardmax construction (ternary activations, well-separated attention scores) to softmax-plus-rounding. The manuscript must supply an explicit lower bound on the separation gap expressed in terms of the rounding bit-width, together with a proof that this gap is preserved under the specific rounding operator so that (i) the rounded softmax still selects exactly the same position as hardmax and (ii) activations remain exactly ternary after rounding. Without such a bound, error accumulation over the logarithmic number of CoT steps cannot be ruled out.
minor comments (2)
- State the precise bit-width and rounding operator used throughout the constructions so that the separation-gap argument can be verified numerically.
- [Sudoku experiment section] The Sudoku experiment tests learnability alignment but does not exercise the full TM-simulation or conversion claims; this limitation should be stated explicitly.
Simulated Author's Rebuttal
We thank the referee for the careful review and for recognizing the significance of bridging theoretical expressivity results with practical low-precision softmax transformers. We address the major comment below and will revise the manuscript to incorporate the requested details.
read point-by-point responses
-
Referee: The central simulation rests on the conversion from the hardmax construction (ternary activations, well-separated attention scores) to softmax-plus-rounding. The manuscript must supply an explicit lower bound on the separation gap expressed in terms of the rounding bit-width, together with a proof that this gap is preserved under the specific rounding operator so that (i) the rounded softmax still selects exactly the same position as hardmax and (ii) activations remain exactly ternary after rounding. Without such a bound, error accumulation over the logarithmic number of CoT steps cannot be ruled out.
Authors: We agree that an explicit lower bound on the separation gap, expressed in terms of the rounding bit-width, together with a preservation proof, is required to rigorously exclude error accumulation across the O(log n) CoT steps. In the revised version we will insert a new lemma immediately after the hardmax TM simulation. The lemma will establish that an attention-score separation of at least 4 * 2^{-b} (where b is the rounding bit-width) is preserved under the paper's specific rounding operator: the rounded softmax selects the identical argmax position, and the activations remain exactly ternary. The proof bounds the softmax perturbation by the separation gap and shows that rounding error is strictly smaller than half the gap, so each simulation step remains exact. Because the depth is only logarithmic, the overall simulation is therefore preserved without accumulation. We view this addition as a clarification that strengthens the central technical claim. revision: yes
Circularity Check
No circularity: explicit constructions of TM simulators via hardmax then softmax conversion
full rationale
The paper derives its claims through explicit mathematical constructions: first building hardmax transformers with ternary activations and separated attention scores to simulate Turing machines under CoT, then converting those to equivalent softmax versions with rounding while preserving distinctions. These steps are direct existence proofs with stated assumptions on separation gaps and rounding, not reductions to fitted parameters, self-definitional loops, or load-bearing self-citations. The summarized CoT analysis follows the same construction technique for space-efficient scaling. The Sudoku empirical test is presented separately as validation of learnability predictions and does not underpin the theoretical derivation. No ansatzes, renamings of known results, or uniqueness theorems imported from prior author work are used in a circular manner.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Well-separated attention scores and ternary activations allow faithful simulation of Turing machine steps that survive conversion to softmax and rounding.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
construct hardmax transformers with ternary activations and well-separated attention scores... convert... by scaling... c WQ, c WK... β = s* − max{sj | sj < s*}
-
IndisputableMonolith/Foundation/ArithmeticFromLogicLogicNat unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
binary positional embeddings... binr(i) with entries in {-1,1}
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Bansal, A., Schwarzschild, A., Borgnia, E., Emam, Z., Huang, F., Goldblum, M., and Goldstein, T
URL https://arxiv.org/abs/1607.0 6450. Bansal, A., Schwarzschild, A., Borgnia, E., Emam, Z., Huang, F., Goldblum, M., and Goldstein, T. End-to- end algorithm synthesis with recurrent networks: logical extrapolation without overthinking. InProceedings of the 36th International Conference on Neural Informa- tion Processing Systems, NIPS ’22, Red Hook, NY , USA,
-
[2]
Curran Associates Inc. ISBN 9781713871088. Bertsch, A., Alon, U., Neubig, G., and Gormley, M. R. Unlimiformer: long-range transformers with unlimited length input. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc. Bhattamishra, S., Patel, A., and Goyal, N...
work page 2023
-
[3]
URL https: //aclanthology.org/2020.conll-1.37/
doi: 10.18653/v1/2020.conll-1.37. URL https: //aclanthology.org/2020.conll-1.37/. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-V oss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E....
-
[4]
URL http://papers.nips.cc/paper_f iles/paper/2022/hash/67d57c32e20fd0a 7a302cb81d36e40d5-Abstract-Conference. html. Dettmers, T., Lewis, M., Belkada, Y ., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. InAdvances in Neural Information Process- ing Systems, volume 35, pp. 30318–30332. Curran As- sociates, Inc., 2022....
work page 2022
-
[7]
doi: 10.18653/v1/2025.acl-short.76
Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.acl-short.76. URL https: //aclanthology.org/2025.acl-short.76/. Jiang, H., Hahn, M., Zetzsche, G., and Lin, A. W. Softmax transformers are Turing-complete. InThe Fourteenth International Conference on Learning Representations,
-
[8]
URL https://openreview.net/forum ?id=FdkPOHlChS. Oral. Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkgN KkHtvB. Li, Q. and Wang, Y . Constant bit-size t...
-
[9]
Lost in the Middle: How Language Models Use Long Contexts
URL https://openreview.net/forum ?id=5pHfYe10iX. Merrill, W., Sabharwal, A., and Smith, N. A. Saturated transformers are constant-depth threshold circuits.Trans- actions of the Association for Computational Linguistics, 10:843–856, 2022. doi: 10.1162/tacl a 00493. URL https://aclanthology.org/2022.tacl-1 .49/. Micikevicius, P., Narang, S., Alben, J., Diam...
work page internal anchor Pith review doi:10.1162/tacl 2022
-
[10]
URL https://openreview.net/forum ?id=r1gs9JgRZ. Minsky, M. L.Computation: Finite and Infinite Machines. Prentice-Hall Series in Automatic Computation. Prentice- Hall, Englewood Cliffs, NJ, 1967. ISBN 0131655639. Nowak, F., Svete, A., Butoi, A., and Cotterell, R. On the representational capacity of neural language models with chain-of-thought reasoning. In...
-
[11]
URL https://openreview.net/forum ?id=B1ckMDqlg. Siegelmann, H. T. and Sontag, E. D. On the computational power of neural nets. InProceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 440–449, New York, NY , USA, 1992. Association for Computing Machinery. ISBN 089791497X. doi: 10.1145/130385.130432. URL https://doi.org/ 1...
-
[12]
URL https://acla nthology.org/2026.tacl-1.8/
doi: 10.1162/tacl.a.597. URL https://acla nthology.org/2026.tacl-1.8/. Yang, C., Srebro, N., McAllester, D., and Li, Z. PEN- CIL: Long thoughts with short memory.arXiv preprint arXiv:2503.14337, 2025. Ye, J., Gao, J., Gong, S., Zheng, L., Jiang, X., Li, Z., and Kong, L. Beyond autoregression: Discrete diffusion for complex reasoning and planning. InThe Th...
-
[13]
double- logarithmic in the context length
Representing them would hence require O(logd) mantissa (and O(log logd) exponent) bits, i.e. double- logarithmic in the context length. Alternatively, one could get completely rid of biases without changing any of the theorems in the main part by adding d constant 1 dimensions (set at embedding) to the representations connected with weights 0 or ±1 to eac...
work page 2021
-
[14]
Hence |tracei| ≤ 3(|summaryi−1| −1) +r+ 2≤6|summary i−1|
Then the trace has maximum length if the length cap is hit at a <p> token, in this case r+ 2 tokens (the full head position encoding) are added to tracei in addition to the length cap. Hence |tracei| ≤ 3(|summaryi−1| −1) +r+ 2≤6|summary i−1|. In both cases it holds that |tracei| ≤6|summary i−1| ≤6(s M(w) + 3) and thus |segi| ≤8(s M(w) + 3). For the second...
work page 2026
-
[15]
In each layer, the attention matricesW ℓ,h Q , W ℓ,h K , W ℓ,h V , W ℓ,h O have at most one non-zero entry per row which is1
-
[16]
they satisfy the precondition from Lemma D.9
In each layer, the output projections(W ℓ,h O )H h=1 write to disjoint coordinates, i.e. they satisfy the precondition from Lemma D.9
-
[17]
In each layer, the MLP matricesW 1, W2 satisfy ∥W1∥∞ ≤d ,∥W 2∥∞ ≤d ff
-
[18]
The embeddings and unembeddings have entries in{−1,0,1}
-
[19]
the condition of Lemma D.8 is satisfied in each layer
For each x∈ X , all heads and layers have tie-invariant values, i.e. the condition of Lemma D.8 is satisfied in each layer
-
[20]
For eachx∈ X, all activations have entries in{−1,0,1}
-
[21]
The output scores ⟨unembv, x(L) n−1⟩ have a unique maximum which is larger than other scores by at least1. Proof. Item 1holds because all queries, keys and values are concatenations of registers/flags and write their output to some register/flag (cf. Appendix C.1).Item 2holds because no two heads in the same layer write to the same output register/flag. I...
-
[22]
For the first summand this is precisely the condition c2 ≥log(96N) p dk while for the second one it is the conditionb att m ≥4. D.5.1. PROOF OFTHEOREM4.2 We obtain Theorem 4.2 from Theorem D.17 by choosing N as a uniform upper bound on the length of any context that can occur in the corresponding hardmax construction. Proof of Theorem 4.2. Fix one of the ...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.