Recognition: no theorem link
Adaptive Memory Decay for Log-Linear Attention
Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3
The pith
Making memory decay in log-linear attention depend on the input token improves recall on long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing the fixed global decay parameter λ with per-token, per-level rates produced by a lightweight two-layer MLP yields consistent gains on associative recall, selective copying, and language modeling, with the largest improvements occurring in long-range settings where the baseline λ degrades or collapses.
What carries the argument
A two-layer MLP that reads each token and outputs an independent decay rate for every Fenwick-tree memory level, with a softplus activation that keeps each rate positive without introducing competition between levels.
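A minimal sketch of that module, assuming PyTorch; the abstract specifies only the two-layer MLP and the softplus on its outputs, so the hidden width, the ReLU, and the name AdaptiveDecay are illustrative placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveDecay(nn.Module):
    """Maps each token representation to one positive decay rate per Fenwick level."""
    def __init__(self, d_model: int, num_levels: int, d_hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, num_levels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations
        h = F.relu(self.fc1(x))
        # softplus keeps every rate positive; each level scales independently,
        # with no softmax-style competition across levels
        return F.softplus(self.fc2(h))          # (batch, seq_len, num_levels)

# lam = AdaptiveDecay(d_model=512, num_levels=10)(torch.randn(2, 1024, 512))
```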
If this is right
- Performance improves on tasks that require retaining information across many tokens.
- The model retains the baseline's log-linear compute cost and logarithmic hidden-state growth.
- Only a negligible number of extra parameters are added.
- Memory stability is maintained even when sequence length increases substantially.
Where Pith is reading between the lines
- The same input-conditioned decay idea could be inserted into other linear or state-space models that currently rely on a single learned decay scalar.
- Per-level independence may reduce the pressure to enlarge hidden-state dimension as sequence length grows.
- Testing the method on modalities beyond text, such as long audio or video streams, would show whether content-adaptive forgetting transfers.
- If the MLP is replaced by an even simpler linear layer, the overhead could drop further while preserving most of the benefit.
Load-bearing premise
A small two-layer network can learn stable, useful per-token per-level decay rates directly from the input without training instability.
What would settle it
Train both models on associative-recall tasks whose length exceeds the point where fixed-λ performance collapses; the claim fails if the adaptive version shows no accuracy gain, or lower accuracy, in that regime.
Figures
Original abstract
Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the baseline, with the largest gains in long-range memory settings where baseline λ degrades or collapses entirely.
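As a rough illustration of where those per-level rates would act, the sketch below (not the authors' kernel) scales each Fenwick-level memory summary for the current token by its own rate before the read-out; the outer-product memory shape and the direct use of λ as the level scale are assumptions, and only the per-token, per-level structure comes from the abstract.

```python
import torch

def read_with_adaptive_decay(query, level_states, lam_t):
    """query: (d,) current-token query; level_states: (L, d, d) per-level
    memories; lam_t: (L,) positive rates emitted by the MLP for this token."""
    # Each level's contribution is scaled by its own rate, so levels can be
    # emphasized or forgotten independently of one another.
    scaled = lam_t[:, None, None] * level_states      # (L, d, d)
    return torch.einsum('d,lde->e', query, scaled)    # combine levels into one read-out
```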
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces adaptive memory decay for log-linear attention, replacing the fixed decay parameter λ with input-dependent values computed per token and per Fenwick-tree level by a lightweight two-layer MLP followed by softplus activation. This preserves the original log-linear complexity and is evaluated on associative recall, selective copying, and language modeling, where it is claimed to outperform the fixed-λ baseline with the largest gains in long-range memory regimes.
Significance. If the empirical improvements hold under scrutiny, the work offers a practical way to add content-adaptive forgetting to hierarchical linear-time attention mechanisms without increasing asymptotic cost. The softplus choice for independent per-level scaling is a clean design that sidesteps inter-level competition, and the negligible parameter overhead is a clear strength.
major comments (2)
- [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.
- [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.
minor comments (1)
- [Abstract] The abstract states evaluation on language modeling but does not name the dataset, model scale, or context lengths used, which would help readers assess the scope of the long-range gains.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional quantification and analysis will strengthen the presentation, and we will incorporate revisions accordingly.
Point-by-point responses
-
Referee: [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.
Authors: We agree that the abstract does not quantify variance or reference statistical tests, which limits immediate verifiability. The full manuscript contains tables reporting performance on associative recall, selective copying, and language modeling, with consistent gains for adaptive decay (largest on long sequences). To address the concern directly, the revision will update the abstract to note the magnitude of improvements, add error bars from multiple seeds to figures, include ablation details on the MLP, and report basic significance tests where data permits. These additions should make it possible to verify that the gains are attributable to the proposed mechanism rather than training variance. revision: yes
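A small sketch of the kind of reporting proposed here, assuming per-seed accuracies are available for both models; the paired t-test and the use of SciPy are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy import stats

def compare_runs(adaptive_acc, baseline_acc):
    """adaptive_acc, baseline_acc: accuracy per seed (same seeds, same order)."""
    adaptive_acc = np.asarray(adaptive_acc)
    baseline_acc = np.asarray(baseline_acc)
    # mean +/- std across seeds gives the error bars for the figures
    summary = {
        "adaptive": (adaptive_acc.mean(), adaptive_acc.std(ddof=1)),
        "baseline": (baseline_acc.mean(), baseline_acc.std(ddof=1)),
    }
    # paired t-test: are the per-seed differences consistently above zero?
    t_stat, p_value = stats.ttest_rel(adaptive_acc, baseline_acc)
    return summary, t_stat, p_value
```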
-
Referee: [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.
Authors: The softplus activation guarantees positive λ values, and independent per-level scaling prevents the inter-level competition that softmax would introduce, which is the core design choice for stability. Empirical results on long-range tasks show no degradation that would indicate collapse or excessive forgetting. We acknowledge that explicit output statistics and gradient analysis are absent from the original submission. In revision we will add histograms and summary statistics of learned λ values across tokens and Fenwick levels from the trained models, plus a brief discussion of observed gradient norms during training. Full failure-mode experiments would require new runs and are noted as future work, but the added statistics will directly support the stability claim. revision: partial
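A minimal sketch of the λ diagnostics promised above, assuming a decay module like the one sketched earlier and a batch of token representations; the module name and shapes are placeholders.

```python
import torch

@torch.no_grad()
def lambda_statistics(decay_module, tokens):
    """tokens: (batch, seq_len, d_model); returns summary statistics of the rates."""
    lam = decay_module(tokens)                  # (batch, seq_len, num_levels)
    flat = lam.flatten(0, 1)                    # pool batch and positions
    return {
        "global_min": flat.min().item(),
        "global_max": flat.max().item(),
        "mean_per_level": flat.mean(dim=0).tolist(),
        "std_per_level": flat.std(dim=0).tolist(),
    }
```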
Circularity Check
No circularity: adaptive decay learned end-to-end via MLP, independent of evaluation metrics
Full rationale
The paper introduces input-dependent decay rates computed by a two-layer MLP with softplus, trained jointly with the model on standard tasks. This mechanism is not defined in terms of the performance outcomes it is later evaluated on, nor does any equation reduce a claimed prediction to a fitted parameter or self-citation by construction. The base Fenwick-tree structure is referenced as prior work, but the adaptation step adds an independent learned component whose outputs are not forced by the evaluation protocol. Empirical gains on associative recall and language modeling are measured externally and do not loop back to redefine the decay rates themselves.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Log-linear attention organizes memory across a Fenwick-tree hierarchy, with hidden-state size growing logarithmically in sequence length.
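This assumption can be made concrete with the standard Fenwick decomposition (a generic sketch, not code from the paper): the prefix up to position t is covered by at most log2(t) + 1 buckets, one per active level, so the state held at any point grows logarithmically with sequence length.

```python
def fenwick_levels(t: int) -> list[int]:
    """Levels of the Fenwick buckets that together cover the prefix [1..t]."""
    levels, pos = [], t
    while pos > 0:
        low = pos & -pos                  # size of the bucket ending at pos
        levels.append(low.bit_length() - 1)
        pos -= low
    return levels

# fenwick_levels(13) -> [0, 2, 3]; len(...) == popcount(t) <= log2(t) + 1
```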