pith. machine review for the scientific record.

arxiv: 2605.06946 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Adaptive Memory Decay for Log-Linear Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords log-linear attention · adaptive decay · Fenwick tree · memory hierarchy · sequence modeling · linear attention · associative recall · language modeling

The pith

Making memory decay in log-linear attention depend on the input token improves recall on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Log-linear attention stores context in a Fenwick tree hierarchy whose size grows only logarithmically with sequence length. The original design applies one fixed decay value λ across every token and every hierarchy level. The paper replaces that constant with the output of a small two-layer network that reads the current token and produces a separate decay rate for each level. A softplus keeps the rates positive and independent, avoiding the competition that a softmax would create. The resulting model keeps exactly the same log-linear cost yet records higher accuracy on recall and language-modeling benchmarks, with the biggest lift appearing when sequences become long enough for the fixed-λ baseline to lose distant information.
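
A minimal sketch of that decay head in PyTorch, for concreteness; the module name, hidden width, inner activation, and tensor shapes here are assumptions, not the authors' implementation.

    # Sketch: input-conditioned decay head (assumed shapes; not the authors' code).
    # Maps each token's hidden vector to one positive decay rate per head and per
    # Fenwick-tree level, via a two-layer MLP followed by softplus.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecayHead(nn.Module):
        def __init__(self, d_model: int, n_heads: int, n_levels: int, hidden: int = 64):
            super().__init__()
            self.n_heads, self.n_levels = n_heads, n_levels
            self.mlp = nn.Sequential(
                nn.Linear(d_model, hidden),
                nn.SiLU(),                              # inner activation: our assumption
                nn.Linear(hidden, n_heads * n_levels),
            )
            # Initialize so softplus(bias) = 1.0, i.e. each lambda starts near 1
            # (no forgetting), matching the lambda ~ 1.0 initialization in Figure 2.
            nn.init.zeros_(self.mlp[-1].weight)
            nn.init.constant_(self.mlp[-1].bias, 0.5413)  # softplus(0.5413) ~ 1.0

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, d_model) -> lambdas: (batch, seq, n_heads, n_levels)
            lam = F.softplus(self.mlp(x))    # positive rates, no cross-level coupling
            return lam.view(*x.shape[:2], self.n_heads, self.n_levels)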

Core claim

Replacing the fixed global decay parameter λ with per-token, per-level rates produced by a lightweight two-layer MLP yields consistent gains on associative recall, selective copying, and language modeling, with the largest improvements occurring in long-range settings where the baseline λ degrades or collapses.

What carries the argument

A two-layer MLP that reads each token and outputs independent decay rates for every Fenwick-tree memory level, activated by softplus to enforce positivity without inter-level competition.
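
A toy numeric check of that design point (illustrative values, not from the paper): under softmax, raising one level's pre-activation necessarily shrinks every other level's weight, while softplus leaves the other levels untouched.

    # Softplus vs. softmax across three hierarchy levels (toy values).
    import torch
    import torch.nn.functional as F

    base   = torch.tensor([0.5, 0.5, 0.5])
    bumped = torch.tensor([2.5, 0.5, 0.5])    # raise level 0's pre-activation only

    print(F.softmax(base, -1))    # [0.3333, 0.3333, 0.3333]
    print(F.softmax(bumped, -1))  # [0.7870, 0.1065, 0.1065] -> levels 1 and 2 shrink
    print(F.softplus(base))       # [0.9741, 0.9741, 0.9741]
    print(F.softplus(bumped))     # [2.5789, 0.9741, 0.9741] -> levels 1 and 2 unchanged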

If this is right

  • Performance improves on tasks that require retaining information across many tokens.
  • The model retains exactly the same log-linear time and space complexity.
  • Only a negligible number of extra parameters are added; a rough count follows this list.
  • Memory stability is maintained even when sequence length increases substantially.
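
A back-of-envelope version of that parameter count, under assumed dimensions (d_model = 512, 8 heads, a 64-unit hidden layer; illustrative values, not the paper's configuration):

    # Rough extra-parameter count for a two-layer decay head (assumed sizes).
    import math

    d_model, hidden, n_heads = 512, 64, 8
    seq_len  = 4096
    n_levels = math.ceil(math.log2(seq_len))    # 12 Fenwick levels at T = 4096

    head  = d_model * hidden + hidden           # first linear layer (weights + bias)
    head += hidden * n_heads * n_levels + n_heads * n_levels   # second layer
    block = 12 * d_model ** 2                   # rough size of one transformer block

    print(n_levels, head, head / block)         # 12, 39072 params, ~1.2% of one block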

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same input-conditioned decay idea could be inserted into other linear or state-space models that currently rely on a single learned decay scalar.
  • Per-level independence may reduce the pressure to enlarge hidden-state dimension as sequence length grows.
  • Testing the method on modalities beyond text, such as long audio or video streams, would show whether content-adaptive forgetting transfers.
  • If the MLP is replaced by an even simpler linear layer, the overhead could drop further while preserving most of the benefit.

Load-bearing premise

A small two-layer network can learn stable, useful per-token, per-level decay rates directly from the input, without destabilizing training.

What would settle it

Train both models on associative-recall tasks whose length exceeds the point where fixed-λ performance collapses; the claim fails if the adaptive version shows no accuracy gain, or loses accuracy, in that regime.
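
One way to set up that test, sketched as a toy recall-task generator; the task format here is our simplification, not the paper's exact MQAR setup.

    # Toy associative recall: key-value pairs, random filler, then one query key.
    import random

    def recall_example(n_pairs: int, seq_len: int, vocab: int = 256):
        keys = random.sample(range(vocab // 2), n_pairs)            # distinct keys
        vals = [random.randrange(vocab // 2, vocab) for _ in keys]  # values, upper half
        seq  = [tok for kv in zip(keys, vals) for tok in kv]        # k1 v1 k2 v2 ...
        seq += [random.randrange(vocab) for _ in range(seq_len - len(seq) - 1)]
        q = random.randrange(n_pairs)
        return seq + [keys[q]], vals[q]  # target: the value bound to the queried key

    # Grow seq_len past the point where the fixed-lambda model collapses,
    # then compare both models' accuracy on the same examples.
    x, y = recall_example(n_pairs=16, seq_len=2048)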

Figures

Figures reproduced from arXiv: 2605.06946 by Helen Zichen Li, Mengfan Zhang, Samet Ayhan, Yaxita Amin.

Figure 1: Fenwick-tree memory structure in log-linear attention. At each timestep t, the prefix is partitioned into hierarchical memory buckets M_t^(ℓ) at increasing temporal scales. The output o_t is a weighted sum across all active levels, controlled by λ_t^(ℓ).

Figure 2: Architecture of LambdaMLPSoftplus. The baseline projection d_t ∈ R^(H×L) (token t, across all heads H and levels L) is passed through two linear layers with softplus activation; initialization ensures λ ≈ 1.0 at the start of training.

Figure 3: Complexity stress test (left) and length generalization (right). Softplus maintains accuracy as kv increases.

Figure 4: Selective copying results. Top-left: best validation accuracy across sequence lengths.

Figure 5: Left: baseline λ weights show near-uniform values across all memory levels and token positions (range 0.688–0.698), confirming input-independence. Right: MLP-λ (softplus) learns sharp content-dependent patterns, with level 1 strongly activated at key-value and query tokens.

Figure 6: The baseline learns non-uniform preferences over the hierarchy, differing between Layer 0 and Layer 1, with larger weight often placed on a small subset of middle or deeper levels. These patterns are broadly consistent across kv=16 and kv=32, though the exact scale and preferred levels vary somewhat across seeds. Fixed λ can learn a meaningful global memory strategy per layer, but applies it identically across token positions.

Figure 7: Token-level MLP-softplus λ heatmaps on MQAR for kv=16 and kv=32 at sequence length 256 across two random seeds. Unlike the baseline, the strength of level preferences varies across token positions.

Figure 8: MLP-softplus learns sparse, layer-specific profiles over the Fenwick hierarchy. Layer 0 places most average weight on lower or mid-to-deep levels, while Layer 1 consistently weights the deepest level most heavily. The separation between layers is stable across seeds. Together with the token-level heatmaps, this shows MLP-softplus learns both a global preference over memory scales and token-dependent adjustments.

Figure 9: Token-level MLP-softmax λ heatmaps on MQAR for kv=16 and kv=32 at sequence length 256 across two random seeds. Softmax normalization often leads to sharper and more seed-sensitive level preferences than MLP-softplus.

Figure 10: Baseline-λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. Each row shows two random seeds for one sequence length. The baseline learns non-uniform layer-wise preferences over hierarchy levels, but these preferences are static and do not vary across token positions.

Figure 11: Token-level MLP-softplus λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. The MLP-softplus parameterization produces token-dependent memory-level weights, with visible changes near the copy target region marked by the red dashed line.

Figure 12: Token-level MLP-softmax λ heatmaps on selective copying for tok=16 across sequence lengths 256, 512, and 1024. MLP-softmax also produces token-dependent memory-level weights, but its normalization across levels often leads to sharper and more seed-sensitive profiles.

Figure 13: Seed comparison of learned λ profiles on selective copying for tok=16 and sequence length 256. The baseline learns stable but static layer-wise profiles, while MLP-softplus and MLP-softmax produce more specialized profiles with stronger variation across seeds and layers.
Original abstract

Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the baseline, with the largest gains in long-range memory settings where baseline λ degrades or collapses entirely.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces adaptive memory decay for log-linear attention, replacing the fixed decay parameter λ with input-dependent values computed per token and per Fenwick-tree level by a lightweight two-layer MLP followed by softplus activation. This preserves the original log-linear complexity and is evaluated on associative recall, selective copying, and language modeling, where it is claimed to outperform the fixed-λ baseline with the largest gains in long-range memory regimes.

Significance. If the empirical improvements hold under scrutiny, the work offers a practical way to add content-adaptive forgetting to hierarchical linear-time attention mechanisms without increasing asymptotic cost. The softplus choice for independent per-level scaling is a clean design that sidesteps inter-level competition, and the negligible parameter overhead is a clear strength.

major comments (2)
  1. [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.
  2. [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.
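
A one-line illustration of the compounding the report describes (toy arithmetic, an editorial addition): retention of a memory after k steps scales like λ^k, so nearby decay rates diverge by orders of magnitude at long range.

    # Effective retention after 1024 steps for nearby decay rates.
    for lam in (0.999, 0.99, 0.9):
        print(lam, lam ** 1024)   # ~0.36, ~3.4e-5, ~1.4e-47
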
minor comments (1)
  1. [Abstract] The abstract states evaluation on language modeling but does not name the dataset, model scale, or context lengths used, which would help readers assess the scope of the long-range gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing that additional quantification and analysis will strengthen the presentation, and we will incorporate revisions accordingly.

Point-by-point responses
  1. Referee: [Experiments / results presentation] The central empirical claim (consistent outperformance, especially on long-range tasks) rests on results whose magnitude, variance, and statistical reliability are not quantified in the abstract or summary; no tables, error bars, ablation details, or significance tests are referenced, which prevents verification that the gains are attributable to adaptive decay rather than training variance.

    Authors: We agree that the abstract does not quantify variance or reference statistical tests, which limits immediate verifiability. The full manuscript contains tables reporting performance on associative recall, selective copying, and language modeling, with consistent gains for adaptive decay (largest on long sequences). To address the concern directly, the revision will update the abstract to note the magnitude of improvements, add error bars from multiple seeds to figures, include ablation details on the MLP, and report basic significance tests where data permits. This will let readers verify that the gains are attributable to the proposed mechanism rather than training variance.
    Revision: yes

  2. Referee: [Method / §3] §3 (method) and the stability assumption: the two-layer MLP with only softplus is asserted to produce stable per-token, per-level λ values that avoid collapse or excessive forgetting when applied sequentially across the Fenwick hierarchy, yet no output statistics, gradient-norm analysis, or failure-mode experiments are supplied. This is load-bearing for the long-range recall claims, as even moderate variance in MLP outputs can compound into effective decay rates outside the useful range.

    Authors: The softplus activation guarantees positive λ values, and independent per-level scaling prevents the inter-level competition that softmax would introduce, which is the core design choice for stability. Empirical results on long-range tasks show no degradation that would indicate collapse or excessive forgetting. We acknowledge that explicit output statistics and gradient analysis are absent from the original submission. In revision we will add histograms and summary statistics of learned λ values across tokens and Fenwick levels from the trained models, plus a brief discussion of observed gradient norms during training. Full failure-mode experiments would require new runs and are noted as future work, but the added statistics will directly support the stability claim.
    Revision: partial

Circularity Check

0 steps flagged

No circularity: adaptive decay learned end-to-end via MLP, independent of evaluation metrics

full rationale

The paper introduces input-dependent decay rates computed by a two-layer MLP with softplus, trained jointly with the model on standard tasks. This mechanism is not defined in terms of the performance outcomes it is later evaluated on, nor does any equation reduce a claimed prediction to a fitted parameter or self-citation by construction. The base Fenwick-tree structure is referenced as prior work, but the adaptation step adds an independent learned component whose outputs are not forced by the evaluation protocol. Empirical gains on associative recall and language modeling are measured externally and do not loop back to redefine the decay rates themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes the Fenwick-tree memory organization from prior log-linear attention work and that the MLP can be trained end-to-end to produce useful decay values; no new entities are postulated and no hand-tuned constants beyond standard training are introduced.

axioms (1)
  • domain assumption Log-linear attention organizes memory across a Fenwick tree hierarchy with logarithmic growth in hidden state size
    Invoked in the abstract as the base architecture whose fixed decay is being replaced; a minimal sketch of the bucket structure follows this list.
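
As a concrete reading of that structure (standard Fenwick-tree arithmetic, not code from the paper), the prefix [1, t] decomposes into one bucket per set bit of t, so at most ⌊log2 t⌋ + 1 buckets:

    # Fenwick decomposition of the prefix [1, t]: O(log t) buckets, one per set bit.
    def fenwick_buckets(t: int):
        buckets, hi = [], t
        while hi > 0:
            lo = hi - (hi & -hi) + 1      # lowest set bit of hi sets the bucket size
            buckets.append((lo, hi))
            hi = lo - 1
        return buckets

    print(fenwick_buckets(13))   # [(13, 13), (9, 12), (1, 8)]: 3 buckets for t = 13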

pith-pipeline@v0.9.0 · 5497 in / 1128 out tokens · 43499 ms · 2026-05-11T00:57:23.784888+00:00 · methodology

