A single-head softmax transformer with O(log(1/ε)) blocks and O(√(N/ε)) MLP width implements preconditioned Richardson iteration to achieve ε-accurate Gaussian KRR predictions on length-N prompts under bounded data.
Training dynamics of in-context learning in linear attention
6 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (6)
Representative citing papers
Exact RMT-derived formula for CoT generalization error in linear ICL reveals phase transition between exponential/polynomial improvement, saturation, and overthinking regimes depending on depth, pretraining, and context length.
Large-step GD in deep linear multi-pathway networks drives re-balancing of signals across pathways via edge-of-stability oscillations after early depth-driven symmetry breaking.
Gated linear attention enables lower training and test errors in non-stationary in-context learning by adaptively modulating past inputs through a learnable recency bias under an autoregressive model of task evolution.
Proposes a causal-dynamical framework that constructs causal graphs from longitudinal patient data, simulates intervention effects, and selects personalized treatment focuses for mental health care.