A single-head softmax transformer with O(log(1/ε)) blocks and O(√(N/ε)) MLP width implements preconditioned Richardson iteration to achieve ε-accurate Gaussian KRR predictions on length-N prompts under bounded data.
Training dynamics of in-context learning in linear attention
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
representative citing papers
Exact RMT-derived formula for CoT generalization error in linear ICL reveals phase transition between exponential/polynomial improvement, saturation, and overthinking regimes depending on depth, pretraining, and context length.
Large-step GD in deep linear multi-pathway networks drives re-balancing of signals across pathways via edge-of-stability oscillations after early depth-driven symmetry breaking.
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
Proposes a causal-dynamical framework that constructs causal graphs from longitudinal patient data, simulates intervention effects, and selects personalized treatment focuses for mental health care.
citing papers explorer
- An Asymptotic Theory of Chain-of-Thought in In-Context Learning
Exact RMT-derived formula for CoT generalization error in linear ICL reveals phase transition between exponential/polynomial improvement, saturation, and overthinking regimes depending on depth, pretraining, and context length.