FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
Pith reviewed 2026-05-10 02:23 UTC · model grok-4.3
The pith
Making the delta-rule learning rate channel-wise and decoupling key-value scaling improves associative recall in gated delta networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FG²-GDN replaces the scalar β_t in the delta update with a channel-wise vector to enable per-dimension adaptation, and the variant FG²-GDN+ further decouples the scaling factors applied to keys and values. This allows independent control over how strongly old information is erased versus how new information is written in the online gradient-descent-style update. Experiments demonstrate that these changes yield higher accuracy on associative recall and long-context understanding tasks relative to GDN and KDA baselines, at comparable computational cost.
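The abstract does not give the exact equations, so the update described above can only be sketched. Everything below — the state shape, the placement of the channel-wise factors, and the function name `fg2_gdn_plus_step` — is a reconstruction, not the paper's code:

```python
import numpy as np

def fg2_gdn_plus_step(S, k, v, alpha, beta_k, beta_v):
    """One hypothetical FG^2-GDN+ recurrence step (a sketch, not the paper's exact rule).

    S      : (d_v, d_k) fast-weight memory mapping keys to values
    k, v   : key (d_k,) and value (d_v,) at the current step
    alpha  : (d_k,) channel-wise decay gate (as in KDA)
    beta_k : (d_k,) channel-wise erasure strength (replaces the scalar beta)
    beta_v : (d_k,) channel-wise write strength (the '+' decoupling)
    """
    S = S * alpha                        # gated decay of the old memory
    S = S - np.outer(S @ k, beta_k * k)  # erase the old value bound to k, per channel
    S = S + np.outer(v, beta_v * k)      # write the new value, per channel
    return S
```

Setting `beta_k` equal to `beta_v` with a constant value and `alpha = 1` recovers the classic delta-rule update S(I − βkkᵀ) + βvkᵀ, i.e. the scalar-β baseline the paper compares against.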
What carries the argument
The doubly fine-grained control consisting of a channel-wise learning-rate vector together with separate key and value scaling factors inside the gated delta network's delta-rule update.
If this is right
- Linear-time models can achieve stronger long-range associative memory without increasing inference complexity.
- The delta-rule update benefits from the same per-coordinate adaptation that adaptive optimizers such as AdaGrad and Adam brought to training.
- Decoupled key-value control separates the mechanisms of forgetting and writing, allowing more precise memory management.
- Performance gains appear on both synthetic recall probes and real-world long-context tasks while preserving efficiency.
Where Pith is reading between the lines
- Similar fine-grained update rules might transfer to other recurrent linear-attention variants beyond gated delta networks.
- The analogy to adaptive optimizers suggests exploring momentum or second-moment estimates inside the delta update itself.
- If per-channel rates help here, block-wise or input-dependent rates could be tested next to handle heterogeneous sequence content.
Load-bearing premise
That switching from a scalar to a channel-wise learning rate and decoupling key-value scaling will produce stable training and genuine generalization gains instead of overfitting to the tested benchmarks.
What would settle it
A falsifying test: retraining FG²-GDN on the same synthetic associative-recall tasks and finding either that accuracy does not exceed the scalar-β GDN baseline, or that training diverges on standard long-context datasets.
original abstract
Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
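Reconstructed from the abstract's description (the exact index placement and the treatment of the decay gates are assumptions), the progression of the update reads:

```latex
% Delta rule with scalar learning rate (GDN; decay gate omitted for brevity):
S_t = S_{t-1}\bigl(I - \beta_t\, k_t k_t^\top\bigr) + \beta_t\, v_t k_t^\top,
\qquad \beta_t \in \mathbb{R}

% FG^2-GDN: channel-wise learning rate, \beta_t \in \mathbb{R}^{d_k}:
S_t = S_{t-1}\bigl(I - k_t (\beta_t \odot k_t)^\top\bigr)
      + v_t (\beta_t \odot k_t)^\top

% FG^2-GDN+: decoupled erasure (\beta_t^{k}) and write (\beta_t^{v}) strengths:
S_t = S_{t-1}\bigl(I - k_t (\beta_t^{k} \odot k_t)^\top\bigr)
      + v_t (\beta_t^{v} \odot k_t)^\top
```

In this reading, the channel-wise β plays the role of a per-coordinate step size, which is exactly the SGD-to-AdaGrad/Adam analogy the abstract draws.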
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FG²-GDN, which replaces the scalar β_t learning rate in the delta-rule update of Gated Delta Networks (GDN) with a channel-wise vector, and FG²-GDN+, which additionally decouples key and value scaling factors. Drawing an analogy to per-coordinate adaptive optimizers, the authors claim that these changes enable finer-grained control over memory erasure and writing, leading to improved associative recall and long-context performance on synthetic and real-world benchmarks while preserving linear-time inference efficiency comparable to GDN and KDA.
Significance. If the claimed gains prove robust, this would constitute a practical refinement to recurrent linear attention mechanisms by increasing the expressivity of the online delta update without altering its asymptotic complexity. The per-channel adaptation idea is a natural extension of prior head-wise gating work (e.g., KDA), but its value hinges on whether the added degrees of freedom yield genuine generalization rather than benchmark-specific fitting.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, error bars, ablation tables, or training curves are referenced in the abstract, and the experimental section provides insufficient detail on statistical significance, number of runs, or held-out generalization metrics. This directly undermines the central empirical claim that FG²-GDN and FG²-GDN+ outperform GDN/KDA on associative recall.
- [§3.2] §3.2 (Method, delta-rule formulation): Replacing scalar β_t with a channel-wise vector increases parameter count and alters the online gradient-descent interpretation of the update; the manuscript does not derive or prove that the modified rule remains stable under the recurrent dynamics or that it preserves the original contraction properties of the delta rule.
- [§4.2] §4.2 (Ablations): The contribution of the channel-wise β versus the decoupled K/V scaling in the + variant is not isolated; without component-wise ablations and controls for total parameter count, it is impossible to rule out that observed gains arise simply from extra capacity rather than the proposed fine-grained control.
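The stability concern in the second major comment can at least be probed numerically. The sketch below checks only the eigenvalues of a hypothesized channel-wise erase factor M = I − k(β⊙k)ᵀ, whose single nontrivial eigenvalue is 1 − Σᵢβᵢkᵢ²; this is a necessary condition, not the contraction proof the comment asks for, and the construction of M is an assumption about the paper's update:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                # unit-norm key (typical after key normalization)
beta = rng.uniform(0.0, 1.0, size=d)  # channel-wise rates, assumed bounded in [0, 1]

# Hypothesized erase factor of the channel-wise delta rule
M = np.eye(d) - np.outer(k, beta * k)

# Along k the eigenvalue is 1 - sum(beta * k**2); orthogonal to beta*k it is 1.
rho = np.max(np.abs(np.linalg.eigvals(M)))
s = np.sum(beta * k**2)
print(rho, s)  # rho stays at 1 while 0 <= s <= 2; s > 2 would make the erase step expansive
```

A spectral radius of 1 rules out blow-up under repeated identical keys, but it says nothing about transient growth of the non-normal product over a long sequence — which is exactly the gap the comment identifies.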
minor comments (2)
- [§3.1] Notation for the channel-wise β vector is introduced without an explicit equation index; cross-referencing to the original GDN equations would improve clarity.
- [§4.3] The computational-efficiency claim is asserted but not supported by FLOPs or wall-clock measurements on the same hardware as the baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, noting planned revisions where appropriate.
point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): No quantitative results, error bars, ablation tables, or training curves are referenced in the abstract, and the experimental section provides insufficient detail on statistical significance, number of runs, or held-out generalization metrics. This directly undermines the central empirical claim that FG²-GDN and FG²-GDN+ outperform GDN/KDA on associative recall.
Authors: We agree that the abstract and §4 lack quantitative results, error bars, ablation tables, training curves, and details on statistical significance or number of runs. This weakens the empirical claims. In the revised manuscript we will add key performance numbers and error bars to the abstract, include ablation tables and training curves in §4, and report the number of runs, statistical tests, and held-out metrics to support the performance improvements over GDN and KDA. revision: yes
-
Referee: [§3.2] §3.2 (Method, delta-rule formulation): Replacing scalar β_t with a channel-wise vector increases parameter count and alters the online gradient-descent interpretation of the update; the manuscript does not derive or prove that the modified rule remains stable under the recurrent dynamics or that it preserves the original contraction properties of the delta rule.
Authors: We acknowledge that the channel-wise β increases parameters and changes the standard delta-rule interpretation, and that no derivation or proof of stability or contraction properties is provided. We will expand §3.2 with a discussion of the adaptive-optimizer analogy and report empirical evidence of stable training from our experiments. A full theoretical proof of recurrent stability lies outside the current scope. revision: partial
-
Referee: [§4.2] §4.2 (Ablations): The contribution of the channel-wise β versus the decoupled K/V scaling in the + variant is not isolated; without component-wise ablations and controls for total parameter count, it is impossible to rule out that observed gains arise simply from extra capacity rather than the proposed fine-grained control.
Authors: We agree the ablations do not isolate channel-wise β from K/V decoupling or control for parameter count, leaving open the possibility that gains stem from added capacity. In revision we will add component-wise ablations (e.g., channel-wise β alone) and parameter-matched baselines to demonstrate that the improvements arise from the proposed fine-grained mechanisms. revision: yes
- Deferred to future work: a derivation or proof that the modified delta rule with channel-wise β remains stable under recurrent dynamics and preserves the original contraction properties.
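The capacity confound raised in the third major comment is easy to quantify. Assuming β is produced by a learned linear projection from the model dimension (the shapes below are illustrative choices, not values from the paper), the extra parameter budget per layer is:

```python
# Illustrative parameter cost of the beta projection per layer (all shapes are assumptions).
d_model, n_heads, d_k = 1024, 16, 64

scalar_beta      = d_model * n_heads        # one rate per head (GDN-style)
channelwise_beta = d_model * n_heads * d_k  # one rate per key channel (FG^2-GDN)
decoupled_beta   = 2 * channelwise_beta     # separate erase/write rates (FG^2-GDN+)

print(scalar_beta, channelwise_beta, decoupled_beta)
# 16384 1048576 2097152 -> roughly a one-million-parameter gap per layer
```

Small relative to a full model, but nonzero — which is why the parameter-matched baselines the referee requests are needed to attribute gains to the mechanism rather than to capacity.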
Circularity Check
No circularity: architectural proposal with empirical validation
full rationale
The paper introduces FG²-GDN as an architectural modification to Gated Delta Networks: replacing scalar β_t with a channel-wise vector (analogous to per-coordinate adaptive optimizers) and, in the + variant, decoupling key/value scaling. These changes are motivated by capacity arguments and evaluated via experiments on synthetic and real-world benchmarks showing improved associative recall and long-context performance. No derivation chain, first-principles result, or prediction is presented that reduces by construction to fitted inputs, self-defined quantities, or a load-bearing self-citation chain. The central claims rest on external empirical comparison rather than internal tautology. Full text inspection confirms the absence of any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- channel-wise β learning-rate vector
axioms (1)
- domain assumption: the delta rule enables superior associative recall compared to additive updates
Reference graph
Works this paper leans on
- [1] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, "Transformers are RNNs: Fast autoregressive transformers with linear attention," in International Conference on Machine Learning. PMLR, 2020, pp. 5156–5165.
- [2] W. Hua, Z. Dai, H. Liu, and Q. Le, "Transformer quality in linear time," in International Conference on Machine Learning. PMLR, 2022, pp. 9099–9117.
- [3] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, L. Derczynski et al., "RWKV: Reinventing RNNs for the transformer era," in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 14048–14077.
- [4] S. Yang, J. Kautz, and A. Hatamizadeh, "Gated delta networks: Improving Mamba2 with delta rule," arXiv preprint arXiv:2412.06464, 2024.
- [5] I. Schlag, K. Irie, and J. Schmidhuber, "Linear transformers are secretly fast weight programmers," in International Conference on Machine Learning. PMLR, 2021, pp. 9355–9366.
- [6] S. Yang, B. Wang, Y. Zhang, Y. Shen, and Y. Kim, "Parallelizing linear transformers with the delta rule over sequence length," Advances in Neural Information Processing Systems, vol. 37, pp. 115491–115522, 2024.
- [7] Y. Zhang, Z. Lin, X. Yao, J. Hu, F. Meng, C. Liu, X. Men, S. Yang, Z. Li, W. Li, E. Lu, W. Liu, Y. Chen, W. Xu, L. Yu, Y. Wang, Y. Fan, L. Zhong, E. Yuan, D. Zhang, Y. Zhang, Y. T. Liu, H. Wang, S. Fang, W. He, S. Liu, Y. Li, J. Su, J. Qiu, B. Pang, J. Yan, Z. Jiang, W. Huang, B. Yin, J. You, C. Wei, Z. Wang, C. Hong, Y. Chen, G. Chen, Y. Wang, H..., "Kimi linear: An expressive, efficient attention architecture," 2025.
- [8] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, 2011.
- [9] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
- [10] A. Gu, K. Goel, and C. Ré, "Efficiently modeling long sequences with structured state spaces," arXiv preprint arXiv:2111.00396, 2021.
- [11] S. Yang and Y. Zhang, "FLA: A Triton-based library for hardware-efficient implementations of linear attention mechanism," Jan. 2024. [Online]. Available: https://github.com/fla-org/flash-linear-attention
- [12] J. Schmidhuber, "Learning to control fast-weight memories: An alternative to dynamic recurrent networks," Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.
- [13] K. Irie, I. Schlag, R. Csordás, and J. Schmidhuber, "Going beyond linear transformers with recurrent fast weight programmers," Advances in Neural Information Processing Systems, vol. 34, pp. 7703–7717, 2021.
- [14] B. Peng, D. Goldstein, Q. Anthony, A. Albalak, E. Alcaide et al., "RWKV-7 'Goose' with expressive dynamic state evolution," arXiv preprint arXiv:2503.14456, 2025.
- [15] C. Bischof and C. Van Loan, "The WY representation for products of Householder matrices," SIAM Journal on Scientific and Statistical Computing, vol. 8, no. 1, pp. s2–s13, 1987.
- [16] T. Joffrain, T. M. Low, E. S. Quintana-Ortí, R. van de Geijn, and F. G. Van Zee, "Accumulating Householder transformations, revisited," ACM Transactions on Mathematical Software (TOMS), vol. 32, no. 2, pp. 169–179, 2006.
- [17] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," in First Conference on Language Modeling, 2024.
- [18] DeepSeek-AI, A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Guo, D. Yang, D. Chen, D. Ji, E. Li, F. Lin, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Yang, H. Zhang, H. Ding, H. Xin, H. Gao, H. Li, H. Qu, J. L. Cai, J. Liang, J. Guo, J. Ni, J. Li, J. Chen, J. Yuan, J. Qiu, J. Song, K. Dong, K. Gao, K. Guan, L. Wan..., "DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model," arXiv, 2024.
- [19] D. Paperno, G. Kruszewski, A. Dufter, and M. Baroni, "The LAMBADA dataset: Word prediction requiring a broad discourse context," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1525–1534.
- [20] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge," arXiv preprint arXiv:1803.05457, 2018.
- [21] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, "BoolQ: Exploring the surprising difficulty of natural yes/no questions," in Proceedings of NAACL-HLT, 2019, pp. 2924–2936.
- [22] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "HellaSwag: Can a machine really finish your sentence?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4791–4800.
- [23] Y. Bisk, R. Zellers, J. Gao, Y. Choi et al., "PIQA: Reasoning about physical intuition in natural language," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 7432–7439.
- [24] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi, "WinoGrande: An adversarial Winograd schema challenge at scale," Communications of the ACM, vol. 64, no. 9, pp. 99–106, 2021.
- [25] C.-P. Hsieh, S. Sun, S. Kriman, J. Ainslie, D. Aithal et al., "RULER: What's the real context size of your long-context language models?" arXiv preprint arXiv:2404.06654, 2024.
- [26] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., "Rethinking attention with Performers," arXiv preprint arXiv:2009.14794, 2020.
- [27] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei, "Retentive network: A successor to Transformer for large language models," arXiv preprint arXiv:2307.08621, 2023.
- [28] T. Dao and A. Gu, "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality," arXiv preprint arXiv:2405.21060, 2024.
- [29] S. Yang, B. Wang, Y. Shen, R. Panda, and Y. Kim, "Gated linear attention transformers with hardware-efficient training," in Proceedings of the 41st International Conference on Machine Learning, 2024.
- [30] Z. Qin, S. Li, W. Sun, X. Sun, D. Li, W. Zhong et al., "HGRN2: Gated linear RNNs with state expansion," arXiv preprint arXiv:2404.07904, 2024.
- [31] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in Neurocomputing: Foundations of Research, 1988, pp. 123–134.
- [32] B. Liu, R. Wang, L. Wu, Y. Feng, P. Stone, and Q. Liu, "Longhorn: State space models are amortized online learners," arXiv preprint arXiv:2407.14207, 2024.
- [33] O. Lieber, B. Lenz, H. Bata, G. Cohen, J. Osin et al., "Jamba: A hybrid Transformer-Mamba language model," arXiv preprint arXiv:2403.19887, 2024.