Provably Shorter Scratchpads in Hybrid DeltaNet-Attention Decoders
Pith reviewed 2026-05-20 19:29 UTC · model grok-4.3
The pith
A hybrid of Gated DeltaNet and Gated Attention solves a retrieval task with constant scratchpad under constant precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the constant-precision assumption, the hybrid architecture solves the parity-conditioned retrieval task using a constant scratchpad, which corresponds to O(1) chain-of-thought steps. No such constant-scratchpad solution exists for pure Gated DeltaNet models, and pure Gated Attention models require at least a polynomial-sized scratchpad for the same task.
What carries the argument
The Qwen-style hybrid combining recurrent Gated DeltaNet heads with Gated Attention heads, which integrates recurrent state updates for sequential processing and attention for selective retrieval to achieve efficient constant-resource computation on the defined task.
If this is right
- Hybrid models can achieve constant chain-of-thought length for parity-conditioned retrieval where pure recurrent models have no solution.
- Pure attention-based models need polynomial growth in scratchpad size for the task.
- The hybrid provides a formal expressivity advantage in terms of required scratchpad length.
- Such architectures may support more efficient inference in models that already use similar combinations.
Where Pith is reading between the lines
- Similar hybrids might enable constant-resource solutions for other sequential reasoning tasks that challenge pure recurrent or attention models.
- Designers of language models could prioritize hybrid layers to reduce the length of intermediate reasoning traces needed.
- Further analysis could explore whether the constant scratchpad holds for variants of the task with different conditioning.
- The result suggests exploring the minimal hybrid ratio of DeltaNet to attention heads for achieving these bounds.
Load-bearing premise
The model computations remain bounded by constant precision, without arbitrary increases in bit depth or numerical accuracy as the input size grows.
What would settle it
A demonstration that a pure Gated DeltaNet model can solve the parity-conditioned retrieval task with constant scratchpad length, or that the hybrid requires growing scratchpad size under the constant-precision setting.
read the original abstract
We investigate the expressive power of hybrid recurrent-attention decoders, a class of architectures used in recent open-source language models such as Qwen3-Next and its successors. These models combine Gated Attention heads with recurrent Gated DeltaNet heads. Is there a formal advantage, in terms of model expressivity or efficiency, to such a hybrid architecture? We show that there is. We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad, or equivalently $O(1)$ chain-of-thought steps. In contrast, no similar solution exists for pure Gated DeltaNet models, while pure Gated Attention requires at least a polynomial scratchpad.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript defines a parity-conditioned retrieval task and claims that, under a constant-precision assumption, a hybrid decoder combining Gated DeltaNet and Gated Attention solves the task with constant scratchpad length (equivalently O(1) chain-of-thought steps). Pure Gated DeltaNet admits no such constant-scratchpad solution, while pure Gated Attention requires at least polynomial scratchpad length.
Significance. If the separation holds, the result supplies a formal expressivity justification for the hybrid recurrent-attention designs appearing in models such as Qwen3-Next, showing a concrete reduction in required scratchpad length for conditional retrieval. The constant-precision framing increases relevance to hardware-constrained settings and offers a useful metric (scratchpad length) for comparing recurrent and attention components.
major comments (2)
- [Section 4] Section 4 (separation for pure Gated DeltaNet): The claim that the recurrent state cannot maintain both retrieval key and parity bit without state growth or extra tokens is load-bearing for the strict separation. Under a constant-precision model defined only via bounded-magnitude reals, a mod-2 accumulator in the DeltaNet update could encode parity while preserving constant state size for attention-free retrieval; the manuscript must supply an explicit impossibility argument ruling this out.
- [Abstract and Section 3.1] Abstract and Section 3.1 (constant-precision assumption): The positive result for the hybrid and the negative result for pure DeltaNet both rest on this assumption, yet the manuscript does not specify whether it means bounded reals, fixed-bit arithmetic with rounding, or another discretization. This ambiguity directly affects whether the hybrid's routing of parity through the attention head is necessary or whether pure DeltaNet could achieve the same O(1) bound.
minor comments (2)
- [Section 2.1] Section 2.1: Notation for the Gated DeltaNet recurrence could be aligned more closely with standard linear recurrent network literature to improve readability.
- [Figure 1] Figure 1: The architecture diagram caption should explicitly label the information-flow path that allows the attention head to keep the DeltaNet state constant-size.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. The comments highlight important points about the constant-precision assumption and the separation argument for pure Gated DeltaNet. We address each major comment below and indicate the revisions we will make.
read point-by-point responses
-
Referee: [Section 4] Section 4 (separation for pure Gated DeltaNet): The claim that the recurrent state cannot maintain both retrieval key and parity bit without state growth or extra tokens is load-bearing for the strict separation. Under a constant-precision model defined only via bounded-magnitude reals, a mod-2 accumulator in the DeltaNet update could encode parity while preserving constant state size for attention-free retrieval; the manuscript must supply an explicit impossibility argument ruling this out.
Authors: We agree that an explicit impossibility argument is necessary to rule out encodings such as a mod-2 accumulator under bounded-magnitude reals. In the current Section 4 we show that any fixed-dimensional state updated by the Gated DeltaNet recurrence cannot simultaneously preserve an exact retrieval key (requiring precise inner-product matching) and an independent parity bit without interference or precision loss across arbitrary input lengths. However, we acknowledge the argument could be stated more formally as a lemma. We will add an explicit proof that any attempt to encode parity via accumulation necessarily perturbs the key representation under the bounded-magnitude constraint, thereby requiring either state growth or additional scratchpad tokens. This revision will be incorporated in the next version. revision: yes
-
Referee: [Abstract and Section 3.1] Abstract and Section 3.1 (constant-precision assumption): The positive result for the hybrid and the negative result for pure DeltaNet both rest on this assumption, yet the manuscript does not specify whether it means bounded reals, fixed-bit arithmetic with rounding, or another discretization. This ambiguity directly affects whether the hybrid's routing of parity through the attention head is necessary or whether pure DeltaNet could achieve the same O(1) bound.
Authors: We thank the referee for pointing out this ambiguity. The constant-precision assumption in the manuscript is intended to mean that all state vectors and updates are restricted to real numbers whose magnitudes are bounded by a constant independent of sequence length, with no access to arbitrary-precision arithmetic or exact modular operations that would require growing bit-width. Under this model, a pure DeltaNet cannot route the parity bit separately from the retrieval key without either violating the bound or consuming additional tokens. We will revise the abstract and Section 3.1 to state this definition explicitly and add a short paragraph contrasting it with fixed-bit rounding models. This clarification will be included in the revised manuscript. revision: yes
Circularity Check
No circularity: expressivity results rest on newly defined task and explicit architectural comparisons
full rationale
The paper defines a parity-conditioned retrieval task and proves expressivity bounds for three architectures (hybrid Gated DeltaNet+Attention, pure Gated DeltaNet, pure Gated Attention) under a constant-precision regime. These bounds are established via direct construction for the hybrid case (constant scratchpad) and separation arguments for the pure cases (no O(1) solution for DeltaNet; polynomial for Attention). No step reduces a claimed prediction to a fitted parameter, renames a known result, or relies on a load-bearing self-citation whose content is itself unverified within the paper. The constant-precision assumption is stated explicitly as the setting for the result rather than being derived from the architectures themselves. The derivation chain is therefore self-contained against the defined task and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption constant-precision assumption
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We define parity-conditioned retrieval task and show that under constant-precision assumption, a Qwen-style hybrid of Gated DeltaNet and Gated Attention solves this task with a constant scratchpad
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Masked hard-attention transformers recognize exactly the star-free languages
Andy Yang, David Chiang, and Dana Angluin. Masked hard-attention transformers recognize exactly the star-free languages. InAdvances in Neural Information Processing Systems, 2024
work page 2024
- [2]
-
[3]
Riccardo Grazzi, Julien Siems, Jörg K. H. Franke, Arber Zela, Frank Hutter, and Massimiliano Pontil. Unlocking state-tracking in linear RNNs through negative eigenvalues. InInternational Conference on Learning Representations, 2025
work page 2025
-
[4]
Michael Hahn. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020
work page 2020
-
[5]
Yiding Hao, Dana Angluin, and Robert Frank. Formal language recognition by hard atten- tion transformers: Perspectives from circuit complexity.Transactions of the Association for Computational Linguistics, 10:800–810, 2022
work page 2022
-
[6]
Softmax Transform- ers are Turing-Complete
Hongjian Jiang, Michael Hahn, Georg Zetzsche, and Anthony Widjaja Lin. Softmax Transform- ers are Turing-Complete. InInternational Conference on Learning Representations, 2026
work page 2026
-
[7]
Parity, Sensitivity, and Transformers
Alexander Kozachinskiy, Tomasz Steifer, and Przemysław Wał˛ ega. Parity, sensitivity, and transformers.arXiv preprint arXiv:2602.05896, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[8]
William Merrill, Ashish Sabharwal, and Noah A. Smith. Saturated transformers are constant- depth threshold circuits.Transactions of the Association for Computational Linguistics, 10:843– 856, 2022
work page 2022
-
[9]
William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers.Transactions of the Association for Computational Linguistics, 11:531–545, 2023
work page 2023
-
[10]
The expressive power of transformers with chain of thought
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. InInternational Conference on Learning Representations, 2024
work page 2024
-
[11]
Olmo Hybrid: From Theory to Practice and Back
William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, DJ Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi, and Ashish Sabharwal. Olmo Hybrid: From theory ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
PyTorch implementation of Qwen3-Next, modular_qwen3_next.py
Qwen Team and Hugging Face. PyTorch implementation of Qwen3-Next, modular_qwen3_next.py. Hugging Face Transformers, 2025. https://github. com/huggingface/transformers/blob/main/src/transformers/models/qwen3_ next/modular_qwen3_next.py
work page 2025
-
[13]
Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995
work page 1995
-
[14]
Hava T. Siegelmann. Recurrent neural networks and finite automata.Computational Intelligence, 12(4):567–574, 1996
work page 1996
-
[15]
Deltaproduct: Im- proving state-tracking in linear rnns via householder products
Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, and Riccardo Grazzi. DeltaProduct: Improving state-tracking in linear RNNs via Householder products. arXiv preprint arXiv:2502.10297, 2025
-
[16]
Lena Strobl, William Merrill, Gail Weiss, David Chiang, and Dana Angluin. What formal lan- guages can transformers express? A survey.Transactions of the Association for Computational Linguistics, 12:543–561, 2024
work page 2024
-
[17]
Lower bounds on the expressivity of recurrent neural language models
Anej Svete, Franz Nowak, Anisha Mohamed Sahabdeen, and Ryan Cotterell. Lower bounds on the expressivity of recurrent neural language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics, pages 6820–6844, 2024. 10
work page 2024
-
[18]
Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers.arXiv preprint arXiv:2404.04393, 2024
-
[19]
Merrick Furst, James B. Saxe, and Michael Sipser. Parity, circuits, and the polynomial-time hierarchy.Mathematical Systems Theory, 17:13–27, 1984
work page 1984
-
[20]
Almost optimal lower bounds for small depth circuits
Johan Håstad. Almost optimal lower bounds for small depth circuits. InProceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, 1986
work page 1986
-
[21]
Chain of thought empowers transformers to solve inherently serial problems
Zhiyuan Li, Hong Liu, Denny Zhou, and Tengyu Ma. Chain of thought empowers transformers to solve inherently serial problems. InInternational Conference on Learning Representations, 2024
work page 2024
-
[22]
Qwen3-Next-80B-A3B-Instruct model card
Qwen Team. Qwen3-Next-80B-A3B-Instruct model card. Hugging Face, 2025. https: //huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
work page 2025
-
[23]
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated Delta Networks: Improving Mamba2 with delta rule. InInternational Conference on Learning Representations, 2025. 11 Appendix Appendix A: Constant precision Definition 7.2(Constant precision).Fix an integers≥2. Let Fs ={0} ∪ {±k2 −s : 1≤k≤2 2s −1}. Let Bs = 2s −2 −s be the largest positive element of Fs. T...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.