Augmenting Self-attention with Persistent Memory
Pith reviewed 2026-05-25 10:58 UTC · model grok-4.3
The pith
Persistent memory vectors let transformers drop their feed-forward layers without losing performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By augmenting the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.
What carries the argument
Persistent memory vectors: fixed learned vectors that are concatenated with the input keys and values inside each self-attention layer and thereby supply the transformation previously performed by the feed-forward sub-layer.
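A minimal sketch of the mechanism described above, assuming single-head scaled dot-product attention for one query position, written in plain Python. The function name `attend_with_memory` and all argument names are hypothetical, and the 1/sqrt(d) scaling is the standard Transformer convention rather than something stated in this summary.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend_with_memory(query, ctx_keys, ctx_values, mem_keys, mem_values):
    """Attention output for one query over context plus persistent memory.

    ctx_keys / ctx_values come from the input sequence; mem_keys /
    mem_values stand in for the learned, input-independent persistent
    vectors (assumed here to be plain lists of floats).
    """
    # Concatenate persistent vectors with the context keys and values.
    keys = ctx_keys + mem_keys
    values = ctx_values + mem_values
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    # Weighted sum of all value vectors, context and memory alike.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

query = [1.0, 0.0]
out = attend_with_memory(
    query,
    ctx_keys=[[1.0, 0.0]], ctx_values=[[1.0, 1.0]],
    mem_keys=[[0.0, 1.0]], mem_values=[[0.0, 2.0]],
)
```

Because the memory entries enter the same softmax as the context entries, the output is a convex combination of context values and persistent values. With empty memory lists the function reduces to ordinary attention over the context.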
If this is right
- The resulting architecture contains only attention operations yet matches the original transformer on language modeling tasks.
- Both character-level and word-level benchmarks can be solved without dedicated feed-forward sub-layers.
- Self-attention with added memory is sufficient to capture the long-range dependencies that previously required the two-module design.
Where Pith is reading between the lines
- Uniform attention-only stacks may simplify hardware mapping or gradient flow compared with mixed attention-plus-MLP blocks.
- The same memory-augmentation trick could be tested in other attention-based sequence models that currently rely on position-wise feed-forward layers.
- If memory vectors can substitute for feed-forward transformations, future work could explore whether the number or placement of such vectors can be learned rather than fixed per layer.
Load-bearing premise
The persistent memory vectors can play a similar functional role to the feed-forward layer in transforming representations across layers.
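One way to make this premise concrete, offered as a sketch and not as a quotation of the paper's derivation: write a position-wise feed-forward sub-layer with hidden width $d_f$ as a sum over its hidden units. The symbols $W$, $U$, $b$, $c$, $w_i$, $u_i$ and $d_f$ are assumed notation, with $w_i$ the $i$-th row of $W$ and $u_i$ the $i$-th column of $U$.

```latex
\begin{align}
\mathrm{FF}(x) &= U\,\sigma(Wx + b) + c
               = \sum_{i=1}^{d_f} \sigma\!\left(w_i^\top x + b_i\right) u_i + c, \\
\mathrm{FF}_{\text{softmax}}(x) &= \sum_{i=1}^{d_f}
    \frac{\exp\!\left(w_i^\top x\right)}{\sum_{j=1}^{d_f} \exp\!\left(w_j^\top x\right)}\, u_i .
\end{align}
```

The second line is what results if the elementwise nonlinearity $\sigma$ is swapped for a softmax over the hidden units and the biases are dropped. It has the form of attention with query $x$, keys $w_i$ and values $u_i$, where the keys and values do not depend on the input sequence. The swap is a substitution, not an identity, so whether the two forms behave alike in practice is exactly the empirical question the premise rests on.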
What would settle it
Train the memory-augmented attention-only model on the same character and word language-modeling benchmarks; if its perplexity is materially worse than the baseline transformer that still contains feed-forward layers, the claim is false.
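The comparison described above can be sketched as a small check, assuming both models report mean negative log-likelihood per token in nats. Word-level results are conventionally given as perplexity and character-level results as bits per character. The function names and the `rel_tolerance` threshold are hypothetical, since this summary does not define what counts as "materially worse".

```python
import math

def perplexity(mean_nll_nats):
    # Perplexity is the exponential of the mean negative log-likelihood.
    return math.exp(mean_nll_nats)

def bits_per_char(mean_nll_nats):
    # Convert nats to bits by dividing by ln(2).
    return mean_nll_nats / math.log(2)

def materially_worse(candidate_ppl, baseline_ppl, rel_tolerance=0.02):
    # rel_tolerance is an assumed threshold, not a value from the paper.
    return candidate_ppl > baseline_ppl * (1.0 + rel_tolerance)
```

A single-run gap inside the tolerance would not refute the claim; a fair test would also match parameter counts and training budgets across the two models.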
Original abstract
Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes augmenting self-attention layers in Transformers with persistent memory vectors that substitute for the role of feed-forward layers, allowing their removal without degrading performance on character- and word-level language modeling benchmarks.
Significance. If the empirical results hold under controlled conditions, the work would be significant for simplifying Transformer architectures and clarifying the functional contribution of feed-forward layers versus attention. The approach introduces a new architectural primitive (persistent memory vectors) whose parameter count is explicitly listed as a free variable, and the evaluation on external benchmarks provides a falsifiable test of the central claim.
minor comments (1)
- Abstract: the statement that evaluation 'shows the benefits' is not accompanied by any quantitative numbers, baseline comparisons, or dataset names, making it impossible to assess the magnitude of the claimed result from the provided text alone.
Simulated Author's Rebuttal
We thank the referee for their review. The provided summary accurately captures the core contribution of the work. No specific major comments appear in the report, so we have no point-by-point responses at this time. We remain available to supply additional controlled experiments or clarifications that would help resolve the uncertainty in the recommendation.
Circularity Check
No significant circularity identified
full rationale
The paper defines a new architecture by augmenting self-attention layers with persistent memory vectors that are proposed to play a role similar to feed-forward layers, allowing their removal. This is presented as an architectural choice evaluated empirically on external character- and word-level language modeling benchmarks. No equations, derivations, or steps are visible in the abstract or described claims that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claim rests on performance comparisons rather than any internal loop where a prediction is forced by the inputs or prior self-work. This is the most common honest finding for an architecture paper with independent empirical validation.
Axiom & Free-Parameter Ledger
free parameters (1)
- number and dimension of persistent memory vectors
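To show how this free parameter trades off against the feed-forward layer it replaces, here is a rough parameter count. It assumes each attention head holds its own set of persistent key vectors and value vectors; that layout, the function names, and the example sizes are illustrative assumptions, not figures from the paper.

```python
def persistent_memory_params(num_vectors, head_dim, num_heads=1):
    # One key vector and one value vector per persistent slot, per head.
    return 2 * num_vectors * head_dim * num_heads

def feed_forward_params(model_dim, hidden_dim, with_bias=True):
    # Two weight matrices (model_dim x hidden_dim and hidden_dim x model_dim),
    # plus optional bias vectors for each of the two projections.
    weights = 2 * model_dim * hidden_dim
    biases = hidden_dim + model_dim if with_bias else 0
    return weights + biases

# Hypothetical sizes: 8 heads of width 64 (model width 512), hidden width 2048.
memory = persistent_memory_params(num_vectors=2048, head_dim=64, num_heads=8)
ffn = feed_forward_params(model_dim=512, hidden_dim=2048, with_bias=False)
```

Under these assumptions, setting the number of persistent vectors equal to the feed-forward hidden width gives the same weight count as the bias-free feed-forward layer, which is one natural way to hold model size fixed when comparing the two designs.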
invented entities (1)
- persistent memory vectors: no independent evidence
Forward citations
Cited by 4 Pith papers
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
  A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
- Deep sequence models tend to memorize geometrically; it is unclear why
  Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.
- Titans: Learning to Memorize at Test Time
  Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
- TIDE: Every Layer Knows the Token Beneath the Context
  TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.