A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
Augmenting Self-attention with Persistent Memory
6 Pith papers cite this work. Polarity classification is still indexing.
abstract
Transformer networks have lead to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long term dependencies and are often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 2polarities
background 2representative citing papers
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
PriorEye augments end-to-end driving models with a dual-memory architecture that stores and gates geospatial visual priors to improve performance and robustness to sensor corruption on NAVSIM-v2.
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
citing papers explorer
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
-
Deep sequence models tend to memorize geometrically; it is unclear why
Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
-
Titans: Learning to Memorize at Test Time
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
-
PriorEye: Geospatial Visual Priors for End-to-End Autonomous Driving
PriorEye augments end-to-end driving models with a dual-memory architecture that stores and gates geospatial visual priors to improve performance and robustness to sensor corruption on NAVSIM-v2.
-
TIDE: Every Layer Knows the Token Beneath the Context
TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.