pith. sign in

Augmenting Self-attention with Persistent Memory

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it
abstract

Transformer networks have lead to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long term dependencies and are often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

citation-role summary

background 2

citation-polarity summary

verdicts

UNVERDICTED 6

roles

background 2

polarities

background 2

clear filters

representative citing papers

Titans: Learning to Memorize at Test Time

cs.LG · 2024-12-31 · unverdicted · novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

TIDE: Every Layer Knows the Token Beneath the Context

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Do Value Vectors in Deep Layers Need Context from the Residual Stream? cs.CL · 2026-06-01 · unverdicted · none · ref 102 · internal anchor

    Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.

  • TIDE: Every Layer Knows the Token Beneath the Context cs.CL · 2026-05-07 · unverdicted · none · ref 106

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.