Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

Tomasz Steifer

arxiv: 2605.16640 · v1 · pith:J7VEOI34new · submitted 2026-05-15 · 💻 cs.LG

Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders

Tomasz Steifer This is my paper

Pith reviewed 2026-05-20 19:29 UTC · model grok-4.3

classification 💻 cs.LG

keywords hybrid architectureGated DeltaNetGated Attentionscratchpad lengthexpressivitychain-of-thoughtparity-conditioned retrievalrecurrent-attention decoder

0 comments

The pith

A hybrid of Gated DeltaNet and Gated Attention solves a retrieval task with constant scratchpad under constant precision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether combining recurrent Gated DeltaNet heads with Gated Attention heads provides an advantage in how much scratchpad or chain-of-thought steps a model needs for certain tasks. It defines a parity-conditioned retrieval task and proves that the hybrid can handle it with a fixed, constant amount of scratchpad. This matters because many language models use similar hybrid designs, and shorter scratchpads could mean more efficient reasoning. Pure DeltaNet versions cannot do it at all in this setting, while pure attention needs much longer scratchpads that grow with problem size.

Core claim

Under the constant-precision assumption, the hybrid architecture solves the parity-conditioned retrieval task using a constant scratchpad, which corresponds to O(1) chain-of-thought steps. No such constant-scratchpad solution exists for pure Gated DeltaNet models, and pure Gated Attention models require at least a polynomial-sized scratchpad for the same task.

What carries the argument

The Qwen-style hybrid combining recurrent Gated DeltaNet heads with Gated Attention heads, which integrates recurrent state updates for sequential processing and attention for selective retrieval to achieve efficient constant-resource computation on the defined task.

If this is right

Hybrid models can achieve constant chain-of-thought length for parity-conditioned retrieval where pure recurrent models have no solution.
Pure attention-based models need polynomial growth in scratchpad size for the task.
The hybrid provides a formal expressivity advantage in terms of required scratchpad length.
Such architectures may support more efficient inference in models that already use similar combinations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar hybrids might enable constant-resource solutions for other sequential reasoning tasks that challenge pure recurrent or attention models.
Designers of language models could prioritize hybrid layers to reduce the length of intermediate reasoning traces needed.
Further analysis could explore whether the constant scratchpad holds for variants of the task with different conditioning.
The result suggests exploring the minimal hybrid ratio of DeltaNet to attention heads for achieving these bounds.

Load-bearing premise

The model computations remain bounded by constant precision, without arbitrary increases in bit depth or numerical accuracy as the input size grows.

What would settle it

A demonstration that a pure Gated DeltaNet model can solve the parity-conditioned retrieval task with constant scratchpad length, or that the hybrid requires growing scratchpad size under the constant-precision setting.

read the original abstract

We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript defines a parity-conditioned retrieval task and claims that, under a constant-precision assumption, a hybrid decoder combining Gated DeltaNet and Gated Attention solves the task with constant scratchpad length (equivalently O(1) chain-of-thought steps). Pure Gated DeltaNet admits no such constant-scratchpad solution, while pure Gated Attention requires at least polynomial scratchpad length.

Significance. If the separation holds, the result supplies a formal expressivity justification for the hybrid recurrent-attention designs appearing in models such as Qwen3-Next, showing a concrete reduction in required scratchpad length for conditional retrieval. The constant-precision framing increases relevance to hardware-constrained settings and offers a useful metric (scratchpad length) for comparing recurrent and attention components.

major comments (2)

[Section 4] Section 4 (separation for pure Gated DeltaNet): The claim that the recurrent state cannot maintain both retrieval key and parity bit without state growth or extra tokens is load-bearing for the strict separation. Under a constant-precision model defined only via bounded-magnitude reals, a mod-2 accumulator in the DeltaNet update could encode parity while preserving constant state size for attention-free retrieval; the manuscript must supply an explicit impossibility argument ruling this out.
[Abstract and Section 3.1] Abstract and Section 3.1 (constant-precision assumption): The positive result for the hybrid and the negative result for pure DeltaNet both rest on this assumption, yet the manuscript does not specify whether it means bounded reals, fixed-bit arithmetic with rounding, or another discretization. This ambiguity directly affects whether the hybrid's routing of parity through the attention head is necessary or whether pure DeltaNet could achieve the same O(1) bound.

minor comments (2)

[Section 2.1] Section 2.1: Notation for the Gated DeltaNet recurrence could be aligned more closely with standard linear recurrent network literature to improve readability.
[Figure 1] Figure 1: The architecture diagram caption should explicitly label the information-flow path that allows the attention head to keep the DeltaNet state constant-size.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The comments highlight important points about the constant-precision assumption and the separation argument for pure Gated DeltaNet. We address each major comment below and indicate the revisions we will make.

read point-by-point responses

Referee: [Section 4] Section 4 (separation for pure Gated DeltaNet): The claim that the recurrent state cannot maintain both retrieval key and parity bit without state growth or extra tokens is load-bearing for the strict separation. Under a constant-precision model defined only via bounded-magnitude reals, a mod-2 accumulator in the DeltaNet update could encode parity while preserving constant state size for attention-free retrieval; the manuscript must supply an explicit impossibility argument ruling this out.

Authors: We agree that an explicit impossibility argument is necessary to rule out encodings such as a mod-2 accumulator under bounded-magnitude reals. In the current Section 4 we show that any fixed-dimensional state updated by the Gated DeltaNet recurrence cannot simultaneously preserve an exact retrieval key (requiring precise inner-product matching) and an independent parity bit without interference or precision loss across arbitrary input lengths. However, we acknowledge the argument could be stated more formally as a lemma. We will add an explicit proof that any attempt to encode parity via accumulation necessarily perturbs the key representation under the bounded-magnitude constraint, thereby requiring either state growth or additional scratchpad tokens. This revision will be incorporated in the next version. revision: yes
Referee: [Abstract and Section 3.1] Abstract and Section 3.1 (constant-precision assumption): The positive result for the hybrid and the negative result for pure DeltaNet both rest on this assumption, yet the manuscript does not specify whether it means bounded reals, fixed-bit arithmetic with rounding, or another discretization. This ambiguity directly affects whether the hybrid's routing of parity through the attention head is necessary or whether pure DeltaNet could achieve the same O(1) bound.

Authors: We thank the referee for pointing out this ambiguity. The constant-precision assumption in the manuscript is intended to mean that all state vectors and updates are restricted to real numbers whose magnitudes are bounded by a constant independent of sequence length, with no access to arbitrary-precision arithmetic or exact modular operations that would require growing bit-width. Under this model, a pure DeltaNet cannot route the parity bit separately from the retrieval key without either violating the bound or consuming additional tokens. We will revise the abstract and Section 3.1 to state this definition explicitly and add a short paragraph contrasting it with fixed-bit rounding models. This clarification will be included in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: expressivity results rest on newly defined task and explicit architectural comparisons

full rationale

The paper defines a parity-conditioned retrieval task and proves expressivity bounds for three architectures (hybrid Gated DeltaNet+Attention, pure Gated DeltaNet, pure Gated Attention) under a constant-precision regime. These bounds are established via direct construction for the hybrid case (constant scratchpad) and separation arguments for the pure cases (no O(1) solution for DeltaNet; polynomial for Attention). No step reduces a claimed prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified within the paper. The constant-precision assumption is stated explicitly as the setting for the result rather than being derived from the architectures themselves. The derivation chain is therefore self-contained against the defined task and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the constant-precision model of computation and the precise definition of the parity-conditioned retrieval task; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption constant-precision assumption
Invoked to establish that the hybrid solves the task with constant scratchpad; location: abstract statement of the result.

pith-pipeline@v0.9.0 · 5652 in / 1232 out tokens · 42996 ms · 2026-05-20T19:29:42.863618+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1]

Masked hard-attention transformers recognize exactly the star-free languages

Andy Yang, David Chiang, and Dana Angluin. Masked hard-attention transformers recognize exactly the star-free languages. InAdvances in Neural Information Processing Systems, 2024

work page 2024
[2]

Podolskii

Pablo Barceló, Alexander Kozachinskiy, Anthony Widjaja Lin, and Vladimir V . Podolskii. Logical languages accepted by transformer encoders with hard attention. InInternational Conference on Learning Representations, 2024

work page 2024
[3]

Riccardo Grazzi, Julien Siems, Jörg K. H. Franke, Arber Zela, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear RNNs through negative eigenvalues. InInternational Conference on Learning Representations, 2025

work page 2025
[4]

Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

Michael Hahn. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

work page 2020
[5]

Formal language recognition by hard atten- tion transformers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard atten- tion transformers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

work page 2022
[6]

Softmax Transform- ers are Turing-Complete

Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax Transform- ers are Turing-Complete. InInternational Conference on Learning Representations, 2026

work page 2026
[7]

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wał˛ ega. Parity, sensitivity, and transformers.arXiv preprint arXiv:2602.05896, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

William Merrill, Ashish Sabharwal, and Noah A. Smith. Saturated transformers are constant- depth threshold circuits.Transactions of the Association for Computational Linguistics, 10:843– 856, 2022

work page 2022
[9]

The parallelism tradeoff: Limitations of log-precision transformers.Transactions of the Association for Computational Linguistics, 11:531–545, 2023

William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers.Transactions of the Association for Computational Linguistics, 11:531–545, 2023

work page 2023
[10]

The expressive power of transformers with chain of thought

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. InInternational Conference on Learning Representations, 2024

work page 2024
[11]

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, and Ashish Sabharwal. Olmo Hybrid: From theory ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

PyTorch implementation of Qwen3-Next, modular_qwen3_next.py

Qwen Team and Hugging Face. PyTorch implementation of Qwen3-Next, modular_qwen3_next.py. Hugging Face Transformers, 2025. https://github. com/huggingface/transformers/blob/main/src/transformers/models/qwen3_ next/modular_qwen3_next.py

work page 2025
[13]

Siegelmann and Eduardo D

Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995

work page 1995
[14]

Siegelmann

Hava T. Siegelmann. Recurrent neural networks and finite automata.Computational Intelligence, 12(4):567–574, 1996

work page 1996
[15]

Deltaproduct: Im- proving state-tracking in linear rnns via householder products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. arXiv preprint arXiv:2502.10297, 2025

work page arXiv 2025
[16]

What formal lan- guages can transformers express? A survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal lan- guages can transformers express? A survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024

work page 2024
[17]

Lower bounds on the expressivity of recurrent neural language models

Anej Svete, Franz Nowak, Anisha Mohamed Sahabdeen, and Ryan Cotterell. Lower bounds on the expressivity of recurrent neural language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, pages 6820–6844, 2024. 10

work page 2024
[18]

Counting like transformers: Compiling temporal counting logic into softmax transformers.CoRR, abs/2404.04393, 2024

Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers.arXiv preprint arXiv:2404.04393, 2024

work page arXiv 2024
[19]

Saxe, and Michael Sipser

Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy.Mathematical Systems Theory, 17:13–27, 1984

work page 1984
[20]

Almost optimal lower bounds for small depth circuits

Johan Håstad. Almost optimal lower bounds for small depth circuits. InProceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, 1986

work page 1986
[21]

Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. InInternational Conference on Learning Representations, 2024

work page 2024
[22]

Qwen3-Next-80B-A3B-Instruct model card

Qwen Team. Qwen3-Next-80B-A3B-Instruct model card. Hugging Face, 2025. https: //huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct

work page 2025
[23]

the next token is a

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with delta rule. InInternational Conference on Learning Representations, 2025. 11 Appendix Appendix A: Constant precision Definition 7.2(Constant precision).Fix an integers≥2. Let Fs ={0} ∪ {±k2 −s : 1≤k≤2 2s −1}. Let Bs = 2s −2 −s be the largest positive element of Fs. T...

work page 2025

[1] [1]

Masked hard-attention transformers recognize exactly the star-free languages

Andy Yang, David Chiang, and Dana Angluin. Masked hard-attention transformers recognize exactly the star-free languages. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[2] [2]

Podolskii

Pablo Barceló, Alexander Kozachinskiy, Anthony Widjaja Lin, and Vladimir V . Podolskii. Logical languages accepted by transformer encoders with hard attention. InInternational Conference on Learning Representations, 2024

work page 2024

[3] [3]

Riccardo Grazzi, Julien Siems, Jörg K. H. Franke, Arber Zela, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear RNNs through negative eigenvalues. InInternational Conference on Learning Representations, 2025

work page 2025

[4] [4]

Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

Michael Hahn. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

work page 2020

[5] [5]

Formal language recognition by hard atten- tion transformers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard atten- tion transformers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022

work page 2022

[6] [6]

Softmax Transform- ers are Turing-Complete

Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax Transform- ers are Turing-Complete. InInternational Conference on Learning Representations, 2026

work page 2026

[7] [7]

Parity, Sensitivity, and Transformers

Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wał˛ ega. Parity, sensitivity, and transformers.arXiv preprint arXiv:2602.05896, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[8] [8]

William Merrill, Ashish Sabharwal, and Noah A. Smith. Saturated transformers are constant- depth threshold circuits.Transactions of the Association for Computational Linguistics, 10:843– 856, 2022

work page 2022

[9] [9]

The parallelism tradeoff: Limitations of log-precision transformers.Transactions of the Association for Computational Linguistics, 11:531–545, 2023

William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers.Transactions of the Association for Computational Linguistics, 11:531–545, 2023

work page 2023

[10] [10]

The expressive power of transformers with chain of thought

William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. InInternational Conference on Learning Representations, 2024

work page 2024

[11] [11]

Olmo Hybrid: From Theory to Practice and Back

William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, and Ashish Sabharwal. Olmo Hybrid: From theory ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

PyTorch implementation of Qwen3-Next, modular_qwen3_next.py

Qwen Team and Hugging Face. PyTorch implementation of Qwen3-Next, modular_qwen3_next.py. Hugging Face Transformers, 2025. https://github. com/huggingface/transformers/blob/main/src/transformers/models/qwen3_ next/modular_qwen3_next.py

work page 2025

[13] [13]

Siegelmann and Eduardo D

Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995

work page 1995

[14] [14]

Siegelmann

Hava T. Siegelmann. Recurrent neural networks and finite automata.Computational Intelligence, 12(4):567–574, 1996

work page 1996

[15] [15]

Deltaproduct: Im- proving state-tracking in linear rnns via householder products

Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. arXiv preprint arXiv:2502.10297, 2025

work page arXiv 2025

[16] [16]

What formal lan- guages can transformers express? A survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024

Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal lan- guages can transformers express? A survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024

work page 2024

[17] [17]

Lower bounds on the expressivity of recurrent neural language models

Anej Svete, Franz Nowak, Anisha Mohamed Sahabdeen, and Ryan Cotterell. Lower bounds on the expressivity of recurrent neural language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, pages 6820–6844, 2024. 10

work page 2024

[18] [18]

Counting like transformers: Compiling temporal counting logic into softmax transformers.CoRR, abs/2404.04393, 2024

Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers.arXiv preprint arXiv:2404.04393, 2024

work page arXiv 2024

[19] [19]

Saxe, and Michael Sipser

Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy.Mathematical Systems Theory, 17:13–27, 1984

work page 1984

[20] [20]

Almost optimal lower bounds for small depth circuits

Johan Håstad. Almost optimal lower bounds for small depth circuits. InProceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, 1986

work page 1986

[21] [21]

Chain of thought empowers transformers to solve inherently serial problems

Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. InInternational Conference on Learning Representations, 2024

work page 2024

[22] [22]

Qwen3-Next-80B-A3B-Instruct model card

Qwen Team. Qwen3-Next-80B-A3B-Instruct model card. Hugging Face, 2025. https: //huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct

work page 2025

[23] [23]

the next token is a

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with delta rule. InInternational Conference on Learning Representations, 2025. 11 Appendix Appendix A: Constant precision Definition 7.2(Constant precision).Fix an integers≥2. Let Fs ={0} ∪ {±k2 −s : 1≤k≤2 2s −1}. Let Bs = 2s −2 −s be the largest positive element of Fs. T...

work page 2025