Parity, Sensitivity, and Transformers

Alexander Kozachinskiy; Przemys{\l}aw Wa{\l}\c{e}ga; Tomasz Steifer

arxiv: 2602.05896 · v2 · submitted 2026-02-05 · 💻 cs.LG · cs.AI

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy , Tomasz Steifer , Przemys{\l}aw Wa{\l}\c{e}ga This is my paper

Pith reviewed 2026-05-16 06:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords transformersparitysensitivityattention mechanismsneural network expressivityboolean functionssequence models

0 comments

The pith

One-layer transformers cannot compute parity of binary sequences, but two layers suffice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that transformers need at least two layers to compute the parity of a binary input sequence, which asks whether the count of 1s is even or odd. It resolves the open question of one-layer capability by proving that the average sensitivity of any one-layer transformer output grows slower than the linear sensitivity required for parity. The authors also give a new four-layer construction that achieves parity using only softmax attention, length-independent positional encodings, and no layer normalization, removing several restrictive assumptions from prior work.

Core claim

The minimal number of layers a transformer needs to compute PARITY is two. A one-layer transformer cannot compute PARITY because the average sensitivity of its output function grows slower than that of PARITY. A four-layer transformer can compute PARITY using softmax attention, polynomially bounded length-independent positional encoding, no layer normalization, and compatible with both causal and non-causal masking.

What carries the argument

Average sensitivity of the final-layer output, which measures how much the prediction changes on average when a single input bit flips, combined with layered multi-head attention that can propagate parity information across positions.

If this is right

Two layers are the exact minimum depth for parity in standard transformers.
Four layers allow parity computation with practical components including softmax and length-independent encodings.
Prior constructions that relied on hardmax or length-dependent encodings are no longer necessary.
The sensitivity gap separates one-layer models from functions like parity that require high input sensitivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sensitivity analysis may serve as a general tool for proving depth lower bounds on other Boolean functions in transformers.
The four-layer construction could be directly implemented in existing libraries without custom attention variants.
Similar sensitivity arguments might separate depth requirements in related architectures such as state-space models or recurrent networks.

Load-bearing premise

The analysis assumes that the transformer uses the standard multi-head attention formulation without extra non-standard components and that the sensitivity bound holds directly for the output under the chosen positional encoding.

What would settle it

An explicit one-layer transformer construction that correctly outputs the parity of every binary sequence of length n for arbitrarily large n would falsify the negative result.

read the original abstract

Understanding what neural architectures can and cannot compute is a central challenge in the theory of AI. One of the fundamental problems in this context is the PARITY task, which asks whether the number of 1s in a binary input sequence is even or odd. PARITY is one of the central tasks studied in the theory of computation, yet it remains surprisingly unclear under which conditions transformers can or cannot solve it. In this paper, we show that the minimal number of layers a transformer needs to compute PARITY is two. In particular, we solve the open problem asking whether a one-layer transformer can compute PARITY. We answer it negatively by showing that average sensitivity of a one-layer transformer grows slower than that of PARITY. Furthermore, we show a new construction for transformer that computes PARITY, which improves on the existing constructions by removing a number of impractical assumptions. In particular, the existing transformers for PARITY rely on such impractical assumptions as length-dependent positional encoding, hardmax, layernorm without a regularisation parameter, or incompatibility with causal masking. We show that these assumptions can be removed, at the cost of increasing the number of layers from two to four. Specifically, we show that PARITY can be computed by a four-layer transformer using softmax attention, length-independent and polynomially bounded positional encoding, no layernorm, and compatible with both causal and non-causal masking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

One-layer transformers cannot compute parity because their sensitivity grows too slowly, but a four-layer version works under standard softmax and length-independent encodings.

read the letter

The main thing to know is that this paper settles the open question on the minimal depth for transformers to compute parity exactly. One layer is not enough, shown by proving that the average sensitivity of any one-layer transformer is o(n) while parity has sensitivity n. They then supply a four-layer construction that achieves it with softmax attention, polynomially bounded length-independent positional encodings, no layernorm, and compatibility with causal masking. That removes several unrealistic assumptions from earlier constructions that needed hardmax or length-dependent encodings. The sensitivity argument is a standard technique applied cleanly here, and the positive result is an explicit architectural recipe rather than an existence proof. Both directions look internally consistent from the abstract and the stress-test note, with no evident circularity or dependence on non-standard components. The negative claim rests on growth-rate comparison rather than parameter fitting, which is a strength. The construction details are summarized at a high level, so the full paper would need checking for edge cases around how the layers combine information across positions, but nothing in the stated assumptions suggests a load-bearing problem. This work is for people studying the exact computational power of transformers and depth requirements in attention models. A reader focused on lower bounds or circuit-like analysis of neural nets will get direct value from the sensitivity bound and the cleaned-up construction. It deserves a serious referee because it resolves a concrete open problem with both directions grounded in reproducible techniques. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the minimal number of layers needed for a transformer to compute the PARITY function is two. It resolves the open question of whether a one-layer transformer suffices by proving that the average sensitivity of any one-layer transformer grows sublinearly in sequence length n, while PARITY has sensitivity exactly n. It further provides an explicit four-layer construction that computes PARITY using standard softmax attention, length-independent polynomially bounded positional encodings, no layernorm, and compatibility with both causal and non-causal masking.

Significance. If the sensitivity bounds and construction hold, the work supplies a clean negative result for one-layer models together with a practical positive construction that removes several non-standard assumptions (length-dependent encodings, hardmax, layernorm without regularization) present in earlier results. The sensitivity technique itself is a reusable tool for establishing transformer limitations on Boolean functions.

major comments (2)

[one-layer sensitivity analysis] The central negative claim rests on showing that one-layer average sensitivity is o(n). The manuscript describes the argument at a high level; an explicit derivation of the sensitivity bound (including the precise dependence on the attention weights and positional encoding) is required to confirm that the o(n) growth holds uniformly for all parameter choices.
[four-layer construction] The four-layer construction is stated to compute PARITY exactly. Because the positive result is the only concrete upper bound under the relaxed assumptions, the manuscript should include a self-contained verification (or pseudocode) that the final output equals the parity bit for both even and odd input lengths, including the effect of the chosen positional encoding on the attention scores.

minor comments (2)

Clarify whether the length-independent positional encoding is required to be polynomial in n or merely bounded by a fixed polynomial; the current statement leaves the degree unspecified.
Add a short comparison paragraph or table contrasting the new four-layer construction with the two-layer constructions in prior work, highlighting which assumptions are dropped.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable comments. We appreciate the recognition of the significance of our sensitivity bounds and the practical four-layer construction. We will revise the manuscript to address the points raised, as detailed below.

read point-by-point responses

Referee: [one-layer sensitivity analysis] The central negative claim rests on showing that one-layer average sensitivity is o(n). The manuscript describes the argument at a high level; an explicit derivation of the sensitivity bound (including the precise dependence on the attention weights and positional encoding) is required to confirm that the o(n) growth holds uniformly for all parameter choices.

Authors: We agree that providing an explicit derivation will enhance the clarity and verifiability of our negative result. In the revised manuscript, we will expand the sensitivity analysis section to include a complete, step-by-step derivation of the o(n) bound on average sensitivity for one-layer transformers. This will explicitly trace the dependence on attention weights, softmax normalization, and positional encodings, showing that the bound holds uniformly regardless of parameter values. revision: yes
Referee: [four-layer construction] The four-layer construction is stated to compute PARITY exactly. Because the positive result is the only concrete upper bound under the relaxed assumptions, the manuscript should include a self-contained verification (or pseudocode) that the final output equals the parity bit for both even and odd input lengths, including the effect of the chosen positional encoding on the attention scores.

Authors: We concur that a self-contained verification is necessary to fully substantiate the positive result. We will include in the revision either a detailed proof or pseudocode that demonstrates the construction computes the exact parity for all input lengths. This verification will analyze the attention scores under the chosen positional encodings and confirm the output for both even and odd parities, ensuring compatibility with the standard components used. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper's negative result proceeds from a direct comparison of average sensitivity growth rates between one-layer transformers (bounded via standard attention analysis) and the PARITY function (known to have sensitivity n). The positive result is an explicit four-layer construction under standard softmax attention, length-independent positional encodings, and no layernorm. Both rest on external definitions of transformer components and sensitivity; no parameter is fitted to data and then renamed as a prediction, no self-citation is load-bearing for the central claims, and no ansatz or uniqueness theorem is imported from the authors' prior work. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard mathematical definitions of transformer layers, attention mechanisms, and average sensitivity from prior literature; no new free parameters, invented entities, or ad-hoc axioms are introduced beyond the usual assumptions of the transformer model.

axioms (2)

standard math Standard formulation of multi-head self-attention with softmax and positional encodings
Invoked throughout the sensitivity analysis and construction sections.
standard math Definition of average sensitivity for boolean functions
Used to compare the growth rate of one-layer transformers against parity.

pith-pipeline@v0.9.0 · 5559 in / 1419 out tokens · 46518 ms · 2026-05-16T06:52:19.419579+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lower bounds for one-layer transformers that compute parity
cs.LG 2026-05 unverdicted novelty 7.0

One-layer self-attention with rational or ReLU post-processing cannot sign-represent parity unless heads times degree scales linearly with input length.
Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
cs.LG 2026-05 unverdicted novelty 6.0

Hybrid Gated DeltaNet-Attention decoders solve parity-conditioned retrieval with O(1) scratchpad while pure Gated DeltaNet cannot and pure Gated Attention needs polynomial length.