Parity, Sensitivity, and Transformers
Pith reviewed 2026-05-16 06:52 UTC · model grok-4.3
The pith
One-layer transformers cannot compute parity of binary sequences, but two layers suffice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The minimal number of layers a transformer needs to compute PARITY is two. A one-layer transformer cannot compute PARITY because the average sensitivity of its output function grows slower than that of PARITY. A four-layer transformer can compute PARITY using softmax attention, polynomially bounded length-independent positional encoding, no layer normalization, and compatible with both causal and non-causal masking.
What carries the argument
Average sensitivity of the final-layer output, which measures how much the prediction changes on average when a single input bit flips, combined with layered multi-head attention that can propagate parity information across positions.
If this is right
- Two layers are the exact minimum depth for parity in standard transformers.
- Four layers allow parity computation with practical components including softmax and length-independent encodings.
- Prior constructions that relied on hardmax or length-dependent encodings are no longer necessary.
- The sensitivity gap separates one-layer models from functions like parity that require high input sensitivity.
Where Pith is reading between the lines
- Sensitivity analysis may serve as a general tool for proving depth lower bounds on other Boolean functions in transformers.
- The four-layer construction could be directly implemented in existing libraries without custom attention variants.
- Similar sensitivity arguments might separate depth requirements in related architectures such as state-space models or recurrent networks.
Load-bearing premise
The analysis assumes that the transformer uses the standard multi-head attention formulation without extra non-standard components and that the sensitivity bound holds directly for the output under the chosen positional encoding.
What would settle it
An explicit one-layer transformer construction that correctly outputs the parity of every binary sequence of length n for arbitrarily large n would falsify the negative result.
read the original abstract
Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the cost of increasing the number of layers from two to four. Specifically, we show that PARITY can be computed by a four-layer transformer using softmax attention, length-independent and polynomially bounded positional encoding, no layernorm, and compatible with both causal and non-causal masking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that the minimal number of layers needed for a transformer to compute the PARITY function is two. It resolves the open question of whether a one-layer transformer suffices by proving that the average sensitivity of any one-layer transformer grows sublinearly in sequence length n, while PARITY has sensitivity exactly n. It further provides an explicit four-layer construction that computes PARITY using standard softmax attention, length-independent polynomially bounded positional encodings, no layernorm, and compatibility with both causal and non-causal masking.
Significance. If the sensitivity bounds and construction hold, the work supplies a clean negative result for one-layer models together with a practical positive construction that removes several non-standard assumptions (length-dependent encodings, hardmax, layernorm without regularization) present in earlier results. The sensitivity technique itself is a reusable tool for establishing transformer limitations on Boolean functions.
major comments (2)
- [one-layer sensitivity analysis] The central negative claim rests on showing that one-layer average sensitivity is o(n). The manuscript describes the argument at a high level; an explicit derivation of the sensitivity bound (including the precise dependence on the attention weights and positional encoding) is required to confirm that the o(n) growth holds uniformly for all parameter choices.
- [four-layer construction] The four-layer construction is stated to compute PARITY exactly. Because the positive result is the only concrete upper bound under the relaxed assumptions, the manuscript should include a self-contained verification (or pseudocode) that the final output equals the parity bit for both even and odd input lengths, including the effect of the chosen positional encoding on the attention scores.
minor comments (2)
- Clarify whether the length-independent positional encoding is required to be polynomial in n or merely bounded by a fixed polynomial; the current statement leaves the degree unspecified.
- Add a short comparison paragraph or table contrasting the new four-layer construction with the two-layer constructions in prior work, highlighting which assumptions are dropped.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments. We appreciate the recognition of the significance of our sensitivity bounds and the practical four-layer construction. We will revise the manuscript to address the points raised, as detailed below.
read point-by-point responses
-
Referee: [one-layer sensitivity analysis] The central negative claim rests on showing that one-layer average sensitivity is o(n). The manuscript describes the argument at a high level; an explicit derivation of the sensitivity bound (including the precise dependence on the attention weights and positional encoding) is required to confirm that the o(n) growth holds uniformly for all parameter choices.
Authors: We agree that providing an explicit derivation will enhance the clarity and verifiability of our negative result. In the revised manuscript, we will expand the sensitivity analysis section to include a complete, step-by-step derivation of the o(n) bound on average sensitivity for one-layer transformers. This will explicitly trace the dependence on attention weights, softmax normalization, and positional encodings, showing that the bound holds uniformly regardless of parameter values. revision: yes
-
Referee: [four-layer construction] The four-layer construction is stated to compute PARITY exactly. Because the positive result is the only concrete upper bound under the relaxed assumptions, the manuscript should include a self-contained verification (or pseudocode) that the final output equals the parity bit for both even and odd input lengths, including the effect of the chosen positional encoding on the attention scores.
Authors: We concur that a self-contained verification is necessary to fully substantiate the positive result. We will include in the revision either a detailed proof or pseudocode that demonstrates the construction computes the exact parity for all input lengths. This verification will analyze the attention scores under the chosen positional encodings and confirm the output for both even and odd parities, ensuring compatibility with the standard components used. revision: yes
Circularity Check
No significant circularity; derivation self-contained
full rationale
The paper's negative result proceeds from a direct comparison of average sensitivity growth rates between one-layer transformers (bounded via standard attention analysis) and the PARITY function (known to have sensitivity n). The positive result is an explicit four-layer construction under standard softmax attention, length-independent positional encodings, and no layernorm. Both rest on external definitions of transformer components and sensitivity; no parameter is fitted to data and then renamed as a prediction, no self-citation is load-bearing for the central claims, and no ansatz or uniqueness theorem is imported from the authors' prior work. The derivation chain is therefore independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Standard formulation of multi-head self-attention with softmax and positional encodings
- standard math Definition of average sensitivity for boolean functions
Forward citations
Cited by 2 Pith papers
-
Lower bounds for one-layer transformers that compute parity
One-layer self-attention with rational or ReLU post-processing cannot sign-represent parity unless heads times degree scales linearly with input length.
-
Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
Hybrid Gated DeltaNet-Attention decoders solve parity-conditioned retrieval with O(1) scratchpad while pure Gated DeltaNet cannot and pure Gated Attention needs polynomial length.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.