pith. machine review for the scientific record.

arxiv: 2604.19021 · v2 · submitted 2026-04-21 · 💻 cs.LG

Recognition: unknown

FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:23 UTC · model grok-4.3

classification 💻 cs.LG
keywords linear attention · delta rule · gated delta networks · associative recall · long-context modeling · fine-grained adaptation · online updates

The pith

Making the delta-rule learning rate channel-wise and decoupling key-value scaling improves associative recall in gated delta networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that current gated delta networks are limited by using a single scalar learning rate for all dimensions in their online delta-rule updates. It proposes replacing this scalar with a per-channel vector, similar to how adaptive optimizers move beyond basic gradient descent, and adds independent scaling for keys and values to separately tune erasure and writing of information. This doubly fine-grained approach is tested on both synthetic associative recall tasks and real long-context benchmarks, where it outperforms the GDN and KDA baselines while keeping linear inference cost. A sympathetic reader would care because linear-time sequence models need stronger long-range memory to compete with quadratic attention without exploding compute. The core idea is that dimension-specific adaptation during the recurrent update itself unlocks better generalization on extended sequences.

Core claim

FG²-GDN replaces the scalar β_t in the delta update with a channel-wise vector to enable per-dimension adaptation, and the variant FG²-GDN+ further decouples the scaling factors applied to keys and values. This allows independent control over how strongly old information is erased versus how new information is written in the online gradient-descent-style update. Experiments demonstrate that these changes yield higher accuracy on associative recall and long-context understanding tasks relative to GDN and KDA baselines, at comparable computational cost.
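The paper's equations are not reproduced in this summary, so the following is an editorial reconstruction rather than the authors' notation: a common GDN-style delta-rule update with scalar β_t, followed by one plausible reading of where the channel-wise vector and the decoupled key/value factors could enter (the exact placement of the per-channel terms is an assumption).

```latex
% GDN-style scalar update of the fast-weight state S_t (one common formulation):
S_t = \alpha_t\, S_{t-1}\!\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top},
\qquad \beta_t \in \mathbb{R}.
% FG^2-GDN, assumed reading: \beta_t becomes a vector over key channels,
% applied elementwise to k_t in both the erase and write terms.
S_t = \alpha_t\, S_{t-1}\!\left(I - k_t\,(\boldsymbol{\beta}_t \odot k_t)^{\top}\right)
      + v_t\,(\boldsymbol{\beta}_t \odot k_t)^{\top},
\qquad \boldsymbol{\beta}_t \in \mathbb{R}^{d_k}.
% FG^2-GDN+, assumed reading: separate vectors control erasure and writing.
S_t = \alpha_t\, S_{t-1}\!\left(I - k_t\,(\boldsymbol{\beta}^{\mathrm{erase}}_t \odot k_t)^{\top}\right)
      + v_t\,(\boldsymbol{\beta}^{\mathrm{write}}_t \odot k_t)^{\top}.
```

Under this reading, setting β^erase = β^write recovers the channel-wise variant, and collapsing the vector to a single scalar recovers the GDN baseline, which is the sense in which the change parallels the move from SGD to per-coordinate adaptive optimizers.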

What carries the argument

The doubly fine-grained control consisting of a channel-wise learning-rate vector together with separate key and value scaling factors inside the gated delta network's delta-rule update.

If this is right

  • Linear-time models can achieve stronger long-range associative memory without increasing inference complexity.
  • The delta-rule update benefits from the same per-coordinate adaptation that improved optimization in training.
  • Decoupled key-value control separates the mechanisms of forgetting and writing, allowing more precise memory management.
  • Performance gains appear on both synthetic recall probes and real-world long-context tasks while preserving efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fine-grained update rules might transfer to other recurrent linear-attention variants beyond gated delta networks.
  • The analogy to adaptive optimizers suggests exploring momentum or second-moment estimates inside the delta update itself.
  • If per-channel rates help here, block-wise or input-dependent rates could be tested next to handle heterogeneous sequence content.

Load-bearing premise

That switching from a scalar to a channel-wise learning rate and decoupling key-value scaling will produce stable training and genuine generalization gains instead of overfitting to the tested benchmarks.

What would settle it

Training FG²-GDN on the same synthetic associative-recall tasks and finding that accuracy does not exceed the scalar-beta GDN baseline or that training diverges on standard long-context datasets.

Figures

Figures reproduced from arXiv: 2604.19021 by Jianchao Tan, Jiaqi Zhang, Pingwei Sun, Xue Wang, Xunliang Cai, Yerui Sun, Yifan Lu, Yuchen Xie, Yuxuan Hu.

Figure 1. Overview of the Linear-MLA hybrid architecture. The backbone follows standard pre-norm … [figure not reproduced here]
Figure 2. Prefill throughput (left) and latency (right) on NVIDIA H800-80G (BF16) with a fixed … [figure not reproduced here]
Original abstract

Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
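To make the mechanism concrete, here is a minimal NumPy sketch of a single recurrent step under the same assumed reading as the reconstruction above. The function name, shapes, scalar gate, and placement of the per-channel vectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def delta_step(S, k, v, alpha, beta_erase, beta_write):
    """One toy delta-rule step for a fast-weight state S of shape (d_v, d_k).

    S          : associative memory matrix, shape (d_v, d_k)
    k, v       : key (d_k,) and value (d_v,) at this timestep
    alpha      : decay gate in (0, 1); kept scalar here for simplicity
    beta_erase : per-key-channel erase rates, shape (d_k,)  (FG2-GDN+ reading)
    beta_write : per-key-channel write rates, shape (d_k,)  (FG2-GDN+ reading)

    With beta_erase == beta_write this reduces to the channel-wise FG2-GDN reading,
    and with both collapsed to one scalar beta it reduces to the GDN-style update
        S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    # Erase: S (I - k (beta_erase*k)^T) = S - (S k)(beta_erase*k)^T, in O(d_v * d_k).
    erased = S - np.outer(S @ k, beta_erase * k)
    # Write: add the new key-value association with per-channel strength.
    written = np.outer(v, beta_write * k)
    return alpha * erased + written

# Tiny usage example; random vectors stand in for learned projections.
rng = np.random.default_rng(0)
d_k, d_v = 8, 16
S = np.zeros((d_v, d_k))
for _ in range(4):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)                                     # unit-norm key
    v = rng.standard_normal(d_v)
    beta_e = 1.0 / (1.0 + np.exp(-rng.standard_normal(d_k)))   # rates in (0, 1)
    beta_w = 1.0 / (1.0 + np.exp(-rng.standard_normal(d_k)))
    S = delta_step(S, k, v, alpha=0.99, beta_erase=beta_e, beta_write=beta_w)
print(S.shape)  # (16, 8); retrieval at query time would be S @ q for q of shape (d_k,)
```

The step costs O(d_v · d_k) per token, which is the property the linear-time claim rests on; a real implementation would chunk the recurrence for parallel training rather than loop token by token as this sketch does.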

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FG²-GDN, which replaces the scalar β_t learning rate in the delta-rule update of Gated Delta Networks (GDN) with a channel-wise vector, and FG²-GDN+, which additionally decouples key and value scaling factors. Drawing an analogy to per-coordinate adaptive optimizers, the authors claim that these changes enable finer-grained control over memory erasure and writing, leading to improved associative recall and long-context performance on synthetic and real-world benchmarks while preserving linear-time inference efficiency comparable to GDN and KDA.

Significance. If the claimed gains prove robust, this would constitute a practical refinement to recurrent linear attention mechanisms by increasing the expressivity of the online delta update without altering its asymptotic complexity. The per-channel adaptation idea is a natural extension of prior head-wise gating work (e.g., KDA), but its value hinges on whether the added degrees of freedom yield genuine generalization rather than benchmark-specific fitting.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, error bars, ablation tables, or training curves are referenced in the abstract, and the experimental section provides insufficient detail on statistical significance, number of runs, or held-out generalization metrics. This directly undermines the central empirical claim that FG²-GDN and FG²-GDN+ outperform GDN/KDA on associative recall.
  2. [§3.2] §3.2 (Method, delta-rule formulation): Replacing scalar β_t with a channel-wise vector increases parameter count and alters the online gradient-descent interpretation of the update; the manuscript does not derive or prove that the modified rule remains stable under the recurrent dynamics or that it preserves the original contraction properties of the delta rule (a brief sketch of this point follows the list).
  3. [§4.2] §4.2 (Ablations): The contribution of the channel-wise β versus the decoupled K/V scaling in the + variant is not isolated; without component-wise ablations and controls for total parameter count, it is impossible to rule out that observed gains arise simply from extra capacity rather than the proposed fine-grained control.
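To make the stability concern in comment 2 concrete, here is a brief editorial sketch. It uses the un-gated delta rule with a unit-norm key and the same assumed key-channel placement of the per-channel vector as in the reconstruction above; neither detail is taken from the paper.

```latex
% Scalar case: the transition is symmetric, with eigenvalues 1 (multiplicity d_k - 1)
% and 1 - \beta_t, so for \|k_t\|_2 = 1 and \beta_t \in [0, 2] the update is
% non-expansive in the spectral norm:
\left\lVert I - \beta_t\, k_t k_t^{\top} \right\rVert_2 \le 1 .
% Channel-wise case (assumed placement): the rank-one transition
%   I - k_t\,(\boldsymbol{\beta}_t \odot k_t)^{\top}
% has eigenvalues 1 and 1 - \sum_i \beta_{t,i} k_{t,i}^2, but it is generally
% non-symmetric, so a bounded spectral radius no longer bounds the operator norm
% and the scalar non-expansiveness argument does not carry over automatically.
```

This does not show the modified rule is unstable; it only illustrates why the referee's request for an explicit stability argument is not answered by the scalar-case reasoning alone.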
minor comments (2)
  1. [§3.1] Notation for the channel-wise β vector is introduced without an explicit equation index; cross-referencing to the original GDN equations would improve clarity.
  2. [§4.3] The computational-efficiency claim is asserted but not supported by FLOPs or wall-clock measurements on the same hardware as the baselines.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, noting planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, error bars, ablation tables, or training curves are referenced in the abstract, and the experimental section provides insufficient detail on statistical significance, number of runs, or held-out generalization metrics. This directly undermines the central empirical claim that FG²-GDN and FG²-GDN+ outperform GDN/KDA on associative recall.

    Authors: We agree that the abstract and §4 lack quantitative results, error bars, ablation tables, training curves, and details on statistical significance or number of runs. This weakens the empirical claims. In the revised manuscript we will add key performance numbers and error bars to the abstract, include ablation tables and training curves in §4, and report the number of runs, statistical tests, and held-out metrics to support the performance improvements over GDN and KDA. revision: yes

  2. Referee: [§3.2] §3.2 (Method, delta-rule formulation): Replacing scalar β_t with a channel-wise vector increases parameter count and alters the online gradient-descent interpretation of the update; the manuscript does not derive or prove that the modified rule remains stable under the recurrent dynamics or that it preserves the original contraction properties of the delta rule.

    Authors: We acknowledge that the channel-wise β increases parameters and changes the standard delta-rule interpretation, and that no derivation or proof of stability or contraction properties is provided. We will expand §3.2 with a discussion of the adaptive-optimizer analogy and report empirical evidence of stable training from our experiments. A full theoretical proof of recurrent stability lies outside the current scope. revision: partial

  3. Referee: [§4.2] §4.2 (Ablations): The contribution of the channel-wise β versus the decoupled K/V scaling in the + variant is not isolated; without component-wise ablations and controls for total parameter count, it is impossible to rule out that observed gains arise simply from extra capacity rather than the proposed fine-grained control.

    Authors: We agree the ablations do not isolate channel-wise β from K/V decoupling or control for parameter count, leaving open the possibility that gains stem from added capacity. In revision we will add component-wise ablations (e.g., channel-wise β alone) and parameter-matched baselines to demonstrate that the improvements arise from the proposed fine-grained mechanisms. revision: yes

standing simulated objections (unresolved)
  • Derivation or proof that the modified delta rule with channel-wise β remains stable under recurrent dynamics and preserves the original contraction properties.

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical validation

full rationale

The paper introduces FG²-GDN as an architectural modification to Gated Delta Networks: replacing scalar β_t with a channel-wise vector (analogous to per-coordinate adaptive optimizers) and, in the + variant, decoupling key/value scaling. These changes are motivated by capacity arguments and evaluated via experiments on synthetic and real-world benchmarks showing improved associative recall and long-context performance. No derivation chain, first-principles result, or prediction is presented that reduces by construction to fitted inputs, self-defined quantities, or a load-bearing self-citation chain. The central claims rest on external empirical comparison rather than internal tautology. Full text inspection confirms the absence of any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the prior claim that the delta rule outperforms additive updates, plus the standard assumption that per-coordinate adaptation improves optimization; no new invented entities are introduced.

free parameters (1)
  • channel-wise beta vector
    Per-channel learning rates are introduced as trainable parameters and fitted during model training.
axioms (1)
  • domain assumption: Delta rule enables superior associative recall compared to additive updates
    Invoked as established by prior GDN and KDA results.

pith-pipeline@v0.9.0 · 5517 in / 1185 out tokens · 33877 ms · 2026-05-10T02:23:58.662601+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 13 canonical work pages · 10 internal anchors

  1. [1] Transformers are RNNs: Fast autoregressive transformers with linear attention
    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention,” in International Conference on Machine Learning. PMLR, 2020, pp. 5156–5165.

  2. [2] Transformer quality in linear time
    W. Hua, Z. Dai, H. Liu, and Q. Le, “Transformer quality in linear time,” in International Conference on Machine Learning. PMLR, 2022, pp. 9099–9117.

  3. [3] RWKV: Reinventing RNNs for the transformer era
    B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, L. Derczynski et al., “RWKV: Reinventing RNNs for the transformer era,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 14048–14077.

  4. [4] Gated Delta Networks: Improving Mamba2 with Delta Rule
    S. Yang, J. Kautz, and A. Hatamizadeh, “Gated delta networks: Improving Mamba2 with delta rule,” arXiv preprint arXiv:2412.06464, 2024.

  5. [5] Linear transformers are secretly fast weight programmers
    I. Schlag, K. Irie, and J. Schmidhuber, “Linear transformers are secretly fast weight programmers,” in International Conference on Machine Learning. PMLR, 2021, pp. 9355–9366.

  6. [6] Parallelizing linear transformers with the delta rule over sequence length
    S. Yang, B. Wang, Y. Zhang, Y. Shen, and Y. Kim, “Parallelizing linear transformers with the delta rule over sequence length,” Advances in Neural Information Processing Systems, vol. 37, pp. 115491–115522, 2024.

  7. [7] Kimi linear: An expressive, efficient attention architecture
    Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, W. Li, E. Lu, W. Liu, Y. Chen, W. Xu, L. Yu, Y. Wang, Y. Fan, L. Zhong, E. Yuan, D. Zhang, Y. Zhang, Y. T. Liu, H. Wang, S. Fang, W. He, S. Liu, Y. Li, J. Su, J. Qiu, B. Pang, J. Yan, Z. Jiang, W. Huang, B. Yin, J. You, C. Wei, Z. Wang, C. Hong, Y. Chen, G. Chen, Y. Wang, H...

  8. [8] Adaptive subgradient methods for online learning and stochastic optimization
    J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. 7, 2011.

  9. [9] Adam: A Method for Stochastic Optimization
    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

  10. [10] Efficiently Modeling Long Sequences with Structured State Spaces
    A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.

  11. [11] FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism
    S. Yang and Y. Zhang, “FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism,” Jan. 2024. [Online]. Available: https://github.com/fla-org/flash-linear-attention

  12. [12] Learning to control fast-weight memories: An alternative to dynamic recurrent networks
    J. Schmidhuber, “Learning to control fast-weight memories: An alternative to dynamic recurrent networks,” Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.

  13. [13] Going beyond linear transformers with recurrent fast weight programmers
    K. Irie, I. Schlag, R. Csordás, and J. Schmidhuber, “Going beyond linear transformers with recurrent fast weight programmers,” Advances in Neural Information Processing Systems, vol. 34, pp. 7703–7717, 2021.

  14. [14] RWKV-7 “Goose” with expressive dynamic state evolution
    B. Peng, D. Goldstein, Q. Anthony, A. Albalak, E. Alcaide et al., “RWKV-7: Goose with expressive dynamic state evolution,” arXiv preprint arXiv:2503.14456, 2025.

  15. [15] The WY representation for products of Householder matrices
    C. Bischof and C. Van Loan, “The WY representation for products of Householder matrices,” SIAM Journal on Scientific and Statistical Computing, vol. 8, no. 1, pp. s2–s13, 1987.

  16. [16] Accumulating Householder transformations, revisited
    T. Joffrain, T. M. Low, E. S. Quintana-Ortí, R. v. d. Geijn, and F. G. V. Zee, “Accumulating Householder transformations, revisited,” ACM Transactions on Mathematical Software (TOMS), vol. 32, no. 2, pp. 169–179, 2006.

  17. [17] Mamba: Linear-time sequence modeling with selective state spaces
    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in First Conference on Language Modeling, 2024.

  18. [18] DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
    DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wan...

  19. [19] The LAMBADA dataset: Word prediction requiring a broad discourse context
    D. Paperno, G. Kruszewski, A. Dufter, and M. Baroni, “The LAMBADA dataset: Word prediction requiring a broad discourse context,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1525–1534.

  20. [20] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? Try ARC, the AI2 reasoning challenge,” arXiv preprint arXiv:1803.05457, 2018.

  21. [21] BoolQ: Exploring the surprising difficulty of natural yes/no questions
    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “BoolQ: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of NAACL-HLT, 2019, pp. 2924–2936.

  22. [22] HellaSwag: Can a machine really finish your sentence?
    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “HellaSwag: Can a machine really finish your sentence?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800.

  23. [23] PIQA: Reasoning about physical intuition in natural language
    Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., “PIQA: Reasoning about physical intuition in natural language,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 7432–7439.

  24. [24] WinoGrande: An adversarial Winograd schema challenge at scale
    K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi, “WinoGrande: An adversarial Winograd schema challenge at scale,” Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021.

  25. [25] RULER: What's the Real Context Size of Your Long-Context Language Models?
    C.-P. Hsieh, S. Sun, S. Kriman, J. Ainslie, D. Aithal et al., “RULER: What’s the real context size of your long-context language models?” arXiv preprint arXiv:2404.06654, 2024.

  26. [26] Rethinking Attention with Performers
    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., “Rethinking attention with performers,” arXiv preprint arXiv:2009.14794, 2020.

  27. [27] Retentive Network: A Successor to Transformer for Large Language Models
    Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023.

  28. [28] Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
    T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,” arXiv preprint arXiv:2405.21060, 2024.

  29. [29] Gated linear attention transformers with hardware-efficient training
    S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim, “Gated linear attention transformers with hardware-efficient training,” Proceedings of the 41st International Conference on Machine Learning, 2024.

  30. [30] HGRN2: Gated linear RNNs with state expansion
    Z. Qin, S. Li, W. Sun, X. Sun, D. Li, W. Zhong et al., “HGRN2: Gated linear RNNs with state expansion,” arXiv preprint arXiv:2404.07904, 2024.

  31. [31] Adaptive switching circuits
    B. Widrow and M. E. Hoff, “Adaptive switching circuits,” in Neurocomputing: Foundations of Research, 1988, pp. 123–134.

  32. [32] Longhorn: State space models are amortized online learners
    B. Liu, R. Wang, L. Wu, Y. Feng, P. Stone, and Q. Liu, “Longhorn: State space models are amortized online learners,” arXiv preprint arXiv:2407.14207, 2024.

  33. [33] Jamba: A Hybrid Transformer-Mamba Language Model
    O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin et al., “Jamba: A hybrid transformer-mamba language model,” arXiv preprint arXiv:2403.19887, 2024.