Training dynamics of in-context learning in linear attention
3 Pith papers cite this work.
fields: cs.LG (3)
years: 2026 (3)
citing papers
- Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
  Standard softmax-attention transformers can approximate the Gaussian kernel ridge regression predictor by implementing preconditioned Richardson iteration during their forward pass.
- Understanding LoRA as Knowledge Memory: An Empirical Analysis
  LoRA modules function as composable knowledge memories for LLMs with measurable storage capacity, internalization efficiency, and advantages in multi-module long-context reasoning.
- Learning to Adapt: In-Context Learning Beyond Stationarity
  Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.