pith. machine review for the scientific record.

arxiv: 2605.06946 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Adaptive Memory Decay for Log-Linear Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords log-linear attention · adaptive decay · Fenwick tree · memory hierarchy · sequence modeling · linear attention · associative recall · language modeling

The pith

Making memory decay in log-linear attention depend on the input token improves recall on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Log-linear attention stores context in a Fenwick tree hierarchy whose size grows only logarithmically with sequence length. The original design applies one fixed decay value λ across every token and every hierarchy level. The paper replaces that constant with the output of a small two-layer network that reads the current token and produces a separate decay rate for each level. A softplus keeps the rates positive and independent, avoiding the competition that a softmax would create. The resulting model keeps exactly the same log-linear cost yet records higher accuracy on recall and language-modeling benchmarks, with the biggest lift appearing when sequences become long enough for the fixed-λ baseline to lose distant information.
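
A minimal sketch of that decay head in PyTorch, for concreteness; the module name, hidden width, inner activation, and tensor shapes here are assumptions, not the authors' implementation.

    # Sketch: input-conditioned decay head (assumed shapes; not the authors' code).
    # Maps each token's hidden vector to one positive decay rate per head and per
    # Fenwick-tree level, via a two-layer MLP followed by softplus.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecayHead(nn.Module):
        def __init__(self, d_model: int, n_heads: int, n_levels: int, hidden: int = 64):
            super().__init__()
            self.n_heads, self.n_levels = n_heads, n_levels
            self.mlp = nn.Sequential(
                nn.Linear(d_model, hidden),
                nn.SiLU(),                              # inner activation: our assumption
                nn.Linear(hidden, n_heads * n_levels),
            )
            # Initialize so softplus(bias) = 1.0, i.e. each lambda starts near 1
            # (no forgetting), matching the lambda ~ 1.0 initialization in Figure 2.
            nn.init.zeros_(self.mlp[-1].weight)
            nn.init.constant_(self.mlp[-1].bias, 0.5413)  # softplus(0.5413) ~ 1.0

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_model) -> lambdas: (batch, seq, n_heads, n_levels)
            lam = F.softplus(self.mlp(x))    # positive rates, no cross-level coupling
            return lam.view(*x.shape[:2], self.n_heads, self.n_levels)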

Core claim

Replacing the fixed global decay parameter λ with per-token, per-level rates produced by a lightweight two-layer MLP yields consistent gains on associative recall, selective copying, and language modeling, with the largest improvements occurring in long-range settings where the baseline λ degrades or collapses.

What carries the argument

A two-layer MLP that reads each token and outputs independent decay rates for every Fenwick-tree memory level, activated by softplus to enforce positivity without inter-level competition.
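
A toy numeric check of that design point (illustrative values, not from the paper): under softmax, raising one level's pre-activation necessarily shrinks every other level's weight, while softplus leaves the other levels untouched.

    # Softplus vs. softmax across three hierarchy levels (toy values).
    import torch
    import torch.nn.functional as F

    base   = torch.tensor([0.5, 0.5, 0.5])
    bumped = torch.tensor([2.5, 0.5, 0.5])    # raise level 0's pre-activation only

    print(F.softmax(base, -1))    # [0.3333, 0.3333, 0.3333]
    print(F.softmax(bumped, -1))  # [0.7870, 0.1065, 0.1065] -> levels 1 and 2 shrink
    print(F.softplus(base))       # [0.9741, 0.9741, 0.9741]
    print(F.softplus(bumped))     # [2.5789, 0.9741, 0.9741] -> levels 1 and 2 unchanged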

If this is right

  • Performance improves on tasks that require retaining information across many tokens.
  • The model retains exactly the same log-linear time and space complexity.
  • Only a negligible number of extra parameters are added; a rough count follows this list.
  • Memory stability is maintained even when sequence length increases substantially.
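
A back-of-envelope version of that parameter count, under assumed dimensions (d_model = 512, 8 heads, a 64-unit hidden layer; illustrative values, not the paper's configuration):

    # Rough extra-parameter count for a two-layer decay head (assumed sizes).
    import math

    d_model, hidden, n_heads = 512, 64, 8
    seq_len  = 4096
    n_levels = math.ceil(math.log2(seq_len))    # 12 Fenwick levels at T = 4096

    head  = d_model * hidden + hidden           # first linear layer (weights + bias)
    head += hidden * n_heads * n_levels + n_heads * n_levels   # second layer
    block = 12 * d_model ** 2                   # rough size of one transformer block

    print(n_levels, head, head / block)         # 12, 39072 params, ~1.2% of one block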

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same input-conditioned decay idea could be inserted into other linear or state-space models that currently rely on a single learned decay scalar.
  • Per-level independence may reduce the pressure to enlarge hidden-state dimension as sequence length grows.
  • Testing the method on modalities beyond text, such as long audio or video streams, would show whether content-adaptive forgetting transfers.
  • If the MLP is replaced by an even simpler linear layer, the overhead could drop further while preserving most of the benefit.

Load-bearing premise

A small two-layer network can learn stable, useful per-token, per-level decay rates directly from the input, without destabilizing training.

What would settle it

Train both models on associative-recall tasks whose length exceeds the point where fixed-λ performance collapses; the claim fails if the adaptive version shows no accuracy gain, or loses accuracy, in that regime.
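
One way to set up that test, sketched as a toy recall-task generator; the task format here is our simplification, not the paper's exact MQAR setup.

    # Toy associative recall: key-value pairs, random filler, then one query key.
    import random

    def recall_example(n_pairs: int, seq_len: int, vocab: int = 256):
        keys = random.sample(range(vocab // 2), n_pairs)            # distinct keys
        vals = [random.randrange(vocab // 2, vocab) for _ in keys]  # values, upper half
        seq  = [tok for kv in zip(keys, vals) for tok in kv]        # k1 v1 k2 v2 ...
        seq += [random.randrange(vocab) for _ in range(seq_len - len(seq) - 1)]
        q = random.randrange(n_pairs)
        return seq + [keys[q]], vals[q]  # target: the value bound to the queried key

    # Grow seq_len past the point where the fixed-lambda model collapses,
    # then compare both models' accuracy on the same examples.
    x, y = recall_example(n_pairs=16, seq_len=2048)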

Figures

Figures reproduced from arXiv: 2605.06946 by Helen Zichen Li, Mengfan Zhang, Samet Ayhan, Yaxita Amin.

Figure 1: Fenwick-tree memory structure in log-linear attention. At each timestep t, the prefix is partitioned into hierarchical memory buckets M_t^(ℓ) at increasing temporal scales. The output o_t is a weighted sum across all active levels, controlled by λ_t^(ℓ).

Figure 2: Architecture of LambdaMLPSoftplus. The baseline projection d_t ∈ R^(H×L) (token t, across all heads H and levels L) is passed through two linear layers with softplus activation; initialization ensures λ ≈ 1.0 at the start of training.

Figure 3: Complexity stress test (left) and length generalization (right). Softplus maintains accuracy as kv increases.

Figure 4: Selective copying results. Top-left: best validation accuracy across sequence lengths.

Figure 5: Left: baseline λ weights show near-uniform values across all memory levels and token positions (range 0.688–0.698), confirming input-independence. Right: MLP-λ (softplus) learns sharp content-dependent patterns, with level 1 strongly activated at key-value and query tokens.

Figure 6: The baseline learns non-uniform preferences over the hierarchy, differing between Layer 0 and Layer 1, with larger weight often placed on a small subset of middle or deeper levels. These patterns are broadly consistent across kv=16 and kv=32, though the exact scale and preferred levels vary somewhat across seeds. Fixed λ can learn a meaningful global memory strategy per layer, but applies it identically across token positions.

Figure 7: Token-level MLP-softplus λ heatmaps on MQAR for kv=16 and kv=32 at sequence length 256 across two random seeds. Unlike the baseline, the strength of level preferences varies across token positions.

Figure 8: MLP-softplus learns sparse, layer-specific profiles over the Fenwick hierarchy. Layer 0 places most average weight on lower or mid-to-deep levels, while Layer 1 consistently weights the deepest level most heavily. The separation between layers is stable across seeds. Together with the token-level heatmaps, this shows MLP-softplus learns both a global preference over memory scales and token-dependent adjustments.

Figure 9: Token-level MLP-softmax λ heatmaps on MQAR for kv=16 and kv=32 at sequence length 256 across two random seeds. Softmax normalization often leads to sharper and more seed-sensitive level preferences than MLP-softplus.

Figure 10: Baseline-λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. Each row shows two random seeds for one sequence length. The baseline learns non-uniform layer-wise preferences over hierarchy levels, but these preferences are static and do not vary across token positions.

Figure 11: Token-level MLP-softplus λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. The MLP-softplus parameterization produces token-dependent memory-level weights, with visible changes near the copy target region marked by the red dashed line.

Figure 12: Token-level MLP-softmax λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. MLP-softmax also produces token-dependent memory-level weights, but its normalization across levels often leads to sharper and more seed-sensitive profiles.

Figure 13: Seed comparison of learned λ profiles on selective copying for tok=16 and sequence length 256. The baseline learns stable but static layer-wise profiles, while MLP-softplus and MLP-softmax produce more specialized profiles with stronger variation across seeds and layers.
Original abstract

Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the baseline, with the largest gains in long-range memory settings where baseline λ degrades or collapses entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces adaptive memory decay for log-linear attention, replacing the fixed decay parameter λ with input-dependent values computed per token and per Fenwick-tree level by a lightweight two-layer MLP followed by softplus activation. This preserves the original log-linear complexity and is evaluated on associative recall, selective copying, and language modeling, where it is claimed to outperform the fixed-λ baseline with the largest gains in long-range memory regimes.

Significance. If the empirical improvements hold under scrutiny, the work offers a practical way to add content-adaptive forgetting to hierarchical linear-time attention mechanisms without increasing asymptotic cost. The softplus choice for independent per-level scaling is a clean design that sidesteps inter-level competition, and the negligible parameter overhead is a clear strength.

major comments (2)
  1. [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.
  2. [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.
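
A one-line illustration of the compounding the report describes (toy arithmetic, an editorial addition): retention of a memory after k steps scales like λ^k, so nearby decay rates diverge by orders of magnitude at long range.

    # Effective retention after 1024 steps for nearby decay rates.
    for lam in (0.999, 0.99, 0.9):
        print(lam, lam ** 1024)   # ~0.36, ~3.4e-5, ~1.4e-47
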
minor comments (1)
  1. [Abstract] The abstract states evaluation on language modeling but does not name the dataset, model scale, or context lengths used, which would help readers assess the scope of the long-range gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional quantification and analysis will strengthen the presentation, and we will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.

    Authors: We agree that the abstract does not quantify variance or reference statistical tests, which limits immediate verifiability. The full manuscript contains tables reporting performance on associative recall, selective copying, and language modeling, with consistent gains for adaptive decay (largest on long sequences). To address the concern directly, the revision will update the abstract to note the magnitude of improvements, add error bars from multiple seeds to figures, include ablation details on the MLP, and report basic significance tests where data permits. This will let readers verify that the gains are attributable to the proposed mechanism rather than training variance.
    Revision: yes

  2. Referee: [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.

    Authors: The softplus activation guarantees positive λ values, and independent per-level scaling prevents the inter-level competition that softmax would introduce, which is the core design choice for stability. Empirical results on long-range tasks show no degradation that would indicate collapse or excessive forgetting. We acknowledge that explicit output statistics and gradient analysis are absent from the original submission. In revision we will add histograms and summary statistics of learned λ values across tokens and Fenwick levels from the trained models, plus a brief discussion of observed gradient norms during training. Full failure-mode experiments would require new runs and are noted as future work, but the added statistics will directly support the stability claim.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: adaptive decay learned end-to-end via MLP, independent of evaluation metrics

full rationale

The paper introduces input-dependent decay rates computed by a two-layer MLP with softplus, trained jointly with the model on standard tasks. This mechanism is not defined in terms of the performance outcomes it is later evaluated on, nor does any equation reduce a claimed prediction to a fitted parameter or self-citation by construction. The base Fenwick-tree structure is referenced as prior work, but the adaptation step adds an independent learned component whose outputs are not forced by the evaluation protocol. Empirical gains on associative recall and language modeling are measured externally and do not loop back to redefine the decay rates themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes the Fenwick-tree memory organization from prior log-linear attention work and that the MLP can be trained end-to-end to produce useful decay values; no new entities are postulated and no hand-tuned constants beyond standard training are introduced.

axioms (1)
  • domain assumption Log-linear attention organizes memory across a Fenwick tree hierarchy with logarithmic growth in hidden state size
    Invoked in the abstract as the base architecture whose fixed decay is being replaced; a minimal sketch of the bucket structure follows this list.
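
As a concrete reading of that structure (standard Fenwick-tree arithmetic, not code from the paper), the prefix [1, t] decomposes into one bucket per set bit of t, so at most ⌊log2 t⌋ + 1 buckets:

    # Fenwick decomposition of the prefix [1, t]: O(log t) buckets, one per set bit.
    def fenwick_buckets(t: int):
        buckets, hi = [], t
        while hi > 0:
            lo = hi - (hi & -hi) + 1      # lowest set bit of hi sets the bucket size
            buckets.append((lo, hi))
            hi = lo - 1
        return buckets

    print(fenwick_buckets(13))   # [(13, 13), (9, 12), (1, 8)]: 3 buckets for t = 13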

pith-pipeline@v0.9.0 · 5497 in / 1128 out tokens · 43499 ms · 2026-05-11T00:57:23.784888+00:00 · methodology

