Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.
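As a rough illustration of the parameter count in the summary above, the sketch below computes queries as X + MLP(X) for a single token vector. The hidden width d/2, the GELU activation, and the zero-initialised output layer are assumptions made for this example and are not taken from the cited paper; the helper names are hypothetical. With hidden width d/2, the two weight matrices hold d² entries in total and the biases add 1.5d, which is one way to reach d² + O(d).

```python
import math
import random


def mlp_param_count(d: int, hidden: int) -> int:
    """Weights plus biases of a two-layer MLP mapping d -> hidden -> d."""
    return d * hidden + hidden + hidden * d + d


def init_mlp(d: int, hidden: int, seed: int = 0):
    """Random first layer; zero output layer (an assumption, so q == x at init)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)
    w1 = [[rng.gauss(0.0, scale) for _ in range(hidden)] for _ in range(d)]
    b1 = [0.0] * hidden
    w2 = [[0.0] * d for _ in range(hidden)]
    b2 = [0.0] * d
    return w1, b1, w2, b2


def gelu(z: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def nonlinear_query(x, params):
    """Return x + MLP(x) for one token vector x of length d."""
    w1, b1, w2, b2 = params
    hidden = [
        gelu(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    out = [
        sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]
    return [xi + oi for xi, oi in zip(x, out)]


d = 8
params = init_mlp(d, hidden=d // 2)
x = [0.1 * i for i in range(d)]
q = nonlinear_query(x, params)
added = mlp_param_count(d, d // 2)  # d*d weights + 1.5*d biases
```

For d = 8 the count is 64 weights plus 12 biases. Because the output layer starts at zero in this sketch, the projection is the identity at initialisation and the MLP only adds a learned correction as training proceeds.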
NanoGPT. https://github.com/karpathy/nanoGPT, 2023
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: CONDITIONAL (1)
Representative citing papers
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries