WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
hub Canonical reference
In-context Learning and Induction Heads
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract "Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguin
co-cited works
representative citing papers
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
citing papers explorer
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Localizing Model Behavior with Path Patching
Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
-
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
-
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
-
From Mechanistic to Compositional Interpretability
Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.
-
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
-
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
-
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
-
Screening Is Enough
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
-
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.
-
The Bayesian Geometry of Transformer Attention
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
-
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.
-
Soft Head Selection for Injecting ICL-Derived Task Embeddings
SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.
-
A Markov Categorical Framework for Language Modeling
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
-
A ghost mechanism: An analytical model of abrupt learning in recurrent networks
The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Jamba: A Hybrid Transformer-Mamba Language Model
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
-
Steering Language Models With Activation Engineering
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
-
Towards Understanding Self-Pretraining for Sequence Classification
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
-
Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
A distributional alignment metric d_NTP and a linear regression method LTV for task vectors that improves accuracy by 9.2% over baselines on classification and regression tasks across multiple LLMs.
-
A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions
A Weibull diagnostic framework classifies transformer weight matrices into consistent functional classes via the shape parameter k and tracks training progress via the scale parameter lambda across multiple architectures.
-
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
-
Fusion-fission forecasts when AI will shift to undesirable behavior
A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
-
When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
-
In-Context Learning Operates as Concept Subspace Learning
In-context learning decomposes into concept-coordinate regression plus off-subspace leakage, with recoverable task information concentrating in a 68-73 dimensional task-aligned subspace of the residual stream that restores 78.8% of the accuracy gap in Llama-3-8B experiments.
-
Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
Interpretability Can Be Actionable
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
-
Architecture, Not Scale: Circuit Localization in Large Language Models
Grouped query attention produces more concentrated and stable circuits than multi-head attention across tasks and scales in Pythia and Qwen2.5 models, with a phase transition in factual recall circuits.
-
The Propagation Field: A Geometric Substrate Theory of Deep Learning
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting in continual learning.
-
Belief or Circuitry? Causal Evidence for In-Context Graph Learning
Causal evidence from representation analysis and interventions shows LLMs use both genuine structure inference and induction circuits in parallel for in-context graph learning.
-
Priming: Hybrid State Space Models From Pre-trained Transformers
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.
-
Large Vision-Language Models Get Lost in Attention
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
-
Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize
Transformers show a sharp, task-specific critical window for weight decay application that determines reasoning versus memorization, with middle placement optimal and boundaries as narrow as 100 steps.
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
-
Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
-
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing
TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.