Augmenting Self-attention with Persistent Memory

Armand Joulin; Edouard Grave; Guillaume Lample; Herve Jegou; Sainbayar Sukhbaatar

arXiv: 1907.01470 · v1 · pith:UVZI3KTGnew · submitted 2019-07-02 · cs.LG · cs.CL · stat.ML

Pith reviewed 2026-05-25 10:58 UTC · model grok-4.3

classification: cs.LG · cs.CL · stat.ML

keywords: transformer, self-attention, persistent memory, feed-forward layer, language modeling, sequence modeling

The pith

Persistent memory vectors let transformers drop their feed-forward layers without losing performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard transformers combine self-attention layers with separate feed-forward layers. The paper tests whether learned persistent memory vectors added directly inside self-attention can take over the work of those feed-forward layers. If they can, the entire model can be reduced to a stack of attention layers only. Experiments on character-level and word-level language modeling show that the simplified architecture reaches the same level of performance as the original transformer. A reader would care because the result questions whether feed-forward layers are essential or merely one convenient way to transform representations between attention steps.

Core claim

By augmenting the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

What carries the argument

Persistent memory vectors: fixed learned vectors that are concatenated with the input keys and values inside each self-attention layer and thereby supply the transformation previously performed by the feed-forward sub-layer.
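
The mechanism described above can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the authors' implementation: the function name `attend_with_memory`, the argument names, and the choice of a single softmax over context and memory together are assumptions made for the example, and causal masking, multiple heads, and positional terms are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attend_with_memory(queries, keys, values, mem_keys, mem_values):
    """Single-head attention over context plus persistent vectors.

    queries, keys, values: lists of d-dimensional vectors computed from the input.
    mem_keys, mem_values:  lists of N learned vectors that do NOT depend on the
                           input (hypothetical names for the persistent memory).
    """
    dim = len(queries[0])
    # Concatenate persistent vectors with the input-dependent keys and values.
    all_keys = keys + mem_keys
    all_values = values + mem_values
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(dim) for k in all_keys]
        weights = softmax(scores)  # assumption: one softmax over context + memory
        out = [sum(w * v[j] for w, v in zip(weights, all_values))
               for j in range(len(all_values[0]))]
        outputs.append(out)
    return outputs
```

Because the memory keys and values do not depend on the input, they behave like learned parameters that every position can read from, which is the sense in which they can stand in for a feed-forward sub-layer.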

If this is right

The resulting architecture contains only attention operations yet matches the original transformer on language modeling tasks.
Both character-level and word-level benchmarks can be solved without dedicated feed-forward sub-layers.
Self-attention with added memory is sufficient on its own: attention captures the long-range dependencies, and the memory vectors cover the role that previously required a second module.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Uniform attention-only stacks may simplify hardware mapping or gradient flow compared with mixed attention-plus-MLP blocks.
The same memory-augmentation trick could be tested in other attention-based sequence models that currently rely on position-wise feed-forward layers.
If memory vectors can substitute for feed-forward transformations, future work could explore whether the number or placement of such vectors can be learned rather than fixed per layer.

Load-bearing premise

The persistent memory vectors can play a similar functional role to the feed-forward layer in transforming representations across layers.

What would settle it

Train the memory-augmented attention-only model on the same character and word language-modeling benchmarks; if its perplexity is materially worse than the baseline transformer that still contains feed-forward layers, the claim is false.
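
The comparison above needs a concrete metric. Here is a minimal sketch, assuming per-token negative log-likelihoods measured in nats: it converts them to perplexity (the usual word-level metric) and bits per character (the usual character-level metric). The helper `materially_worse` and its 1% relative tolerance are hypothetical illustrations, not a threshold taken from the paper.

```python
import math

def perplexity(nll_nats):
    """Perplexity = exp(mean negative log-likelihood in nats)."""
    return math.exp(sum(nll_nats) / len(nll_nats))

def bits_per_char(nll_nats):
    """Mean negative log-likelihood converted from nats to bits."""
    return sum(nll_nats) / len(nll_nats) / math.log(2)

def materially_worse(candidate, baseline, rel_tol=0.01):
    """True if the candidate score (lower is better) exceeds the baseline
    by more than rel_tol. The tolerance is an illustrative assumption."""
    return candidate > baseline * (1.0 + rel_tol)
```

Both models would be scored on the same held-out split with the same tokenization, since perplexity and bits per character are only comparable under identical evaluation conditions.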

Figures

Figures reproduced from arXiv: 1907.01470 by Armand Joulin, Edouard Grave, Guillaume Lample, Herve Jegou, Sainbayar Sukhbaatar.

**Figure 2.** The performance of our large model on Text8 as we vary (left) the number of persistent vectors, or (right) the way persistent vectors integrate with self-attention.

Adjacent text in the source: "… importance of feedforward layers in transformer models. However, it maintains decent performances because it still has a lot of parameters (38M) in the W_q, W_k, W_v, W_o matrices. We also compare several different ways of integrating persistent vect…"

Original abstract

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long-term dependencies and is often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds a small set of persistent memory vectors to self-attention so the feed-forward layer can be removed with no reported drop on standard LM benchmarks.

The central claim is that a few learned persistent vectors attached to attention can stand in for the usual feed-forward sub-layer. That is the actual new piece; prior work had not framed memory vectors this way to excise the FFN entirely. The paper shows the resulting attention-only stack matches baseline numbers on character and word-level language modeling, which is the main positive result. The architecture is defined cleanly enough that the substitution is testable rather than circular. The main soft spot is that the abstract gives almost no detail on vector count, dimension, whether they are shared across layers, or how they are initialized and updated. Without those numbers and the corresponding ablations it is hard to judge whether the parameter budget is truly comparable or whether the equivalence holds only for the specific setups tested. The experiments are described at a high level, so any hidden differences in training schedule or regularization would be invisible. This is useful reading for people already working on Transformer variants who want a concrete simplification to try. It is not a foundational rethinking of attention, but the device is specific enough that a referee could check the implementation and numbers directly. I would send it to review.

Referee Report

0 major / 1 minor

Summary. The paper proposes augmenting self-attention layers in Transformers with persistent memory vectors that substitute for the role of feed-forward layers, allowing their removal without degrading performance on character- and word-level language modeling benchmarks.

Significance. If the empirical results hold under controlled conditions, the work would be significant for simplifying Transformer architectures and clarifying the functional contribution of feed-forward layers versus attention. The approach introduces a new architectural primitive (persistent memory vectors) whose parameter count is explicitly listed as a free variable, and the evaluation on external benchmarks provides a falsifiable test of the central claim.

minor comments (1)

Abstract: the statement that evaluation 'shows the benefits' is not accompanied by any quantitative numbers, baseline comparisons, or dataset names, making it impossible to assess the magnitude of the claimed result from the provided text alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review. The provided summary accurately captures the core contribution of the work. No specific major comments appear in the report, so we have no point-by-point responses at this time. We remain available to supply additional controlled experiments or clarifications that would help resolve the uncertainty in the recommendation.

Circularity Check

0 steps flagged

No significant circularity identified

The paper defines a new architecture by augmenting self-attention layers with persistent memory vectors that are proposed to play a role similar to feed-forward layers, allowing their removal. This is presented as an architectural choice evaluated empirically on external character- and word-level language modeling benchmarks. No equations, derivations, or steps are visible in the abstract or described claims that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The central claim rests on performance comparisons rather than any internal loop where a prediction is forced by the inputs or prior self-work. This is the most common honest finding for an architecture paper with independent empirical validation.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 1 invented entities

The central claim rests on the architectural choice that a fixed set of memory vectors can functionally replace the learned transformations performed by feed-forward layers; this choice is introduced without derivation from first principles.

free parameters (1)

number and dimension of persistent memory vectors
Hyperparameters chosen to match or exceed baseline performance; their values are not derived from the model equations.
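
To make the parameter-budget question concrete, here is a rough count under stated assumptions. The sizes `d_model = 512` and `d_ff = 2048` are illustrative and are not taken from the paper, and the persistent-vector count assumes N key vectors and N value vectors of width `d_model` per layer, summed over all heads.

```python
def ffn_params(d_model, d_ff):
    """Weights and biases of a two-layer position-wise feed-forward block."""
    return d_model * d_ff + d_ff + d_ff * d_model + d_model

def persistent_params(d_model, n_vectors):
    """N persistent key vectors plus N persistent value vectors per layer
    (assumption: widths summed over all heads equal d_model)."""
    return 2 * n_vectors * d_model

d_model, d_ff = 512, 2048  # hypothetical sizes for illustration only
print(ffn_params(d_model, d_ff))         # 2099712
print(persistent_params(d_model, d_ff))  # 2097152
```

Under these assumptions, setting the number of persistent vectors equal to the feed-forward hidden width gives nearly the same per-layer budget, which is why the vector count matters when judging whether a comparison is parameter-matched.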

invented entities (1)

persistent memory vectors (no independent evidence)
purpose: Augment self-attention computation and substitute for feed-forward layers
New component introduced by the paper; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5654 in / 1026 out tokens · 26228 ms · 2026-05-25T10:58:32.544759+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
cs.LG 2025-02 unverdicted novelty 7.0

A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

Deep sequence models tend to memorize geometrically; it is unclear why
cs.LG 2025-10 unverdicted novelty 6.0

Deep sequence models develop geometric memory in embeddings that encodes novel global relationships, transforming l-fold composition tasks into 1-step navigation via a natural spectral bias connected to Node2Vec.

Titans: Learning to Memorize at Test Time
cs.LG 2024-12 unverdicted novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

TIDE: Every Layer Knows the Token Beneath the Context
cs.CL 2026-05 unverdicted novelty 5.0

TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.

