Adaptive subgradient methods for online learning and stochastic optimization
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2019 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Augmenting Self-attention with Persistent Memory
  Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character-level and word-level language modeling benchmarks.
- Learning Effective Loss Functions Efficiently
  An anytime algorithm for learning loss functions that is asymptotically optimal in the worst case and experimentally faster than prior methods for hyperparameter tuning.