super hub Canonical reference

In-context Learning and Induction Heads

Catherine Olsson, Neel Nanda, Nelson Elhage, Nicholas Joseph, Nova DasSarma, Tom Henighan · 2022 · cs.LG · arXiv 2209.11895

Canonical reference. 80% of citing Pith papers cite this work as background.

145 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 145 citing papers more from Catherine Olsson arXiv PDF

abstract

"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 dataset 1 other 1

citation-polarity summary

background 16 unclear 2 support 1 use dataset 1

claims ledger

abstract "Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguin

authors

Catherine Olsson Neel Nanda Nelson Elhage Nicholas Joseph Nova DasSarma Tom Henighan

co-cited works

representative citing papers

Efficiently Representing Algorithms With Chain-of-Thought Transformers

cs.LG · 2026-06-18 · conditional · novelty 8.0

CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on quadratic TM overhead.

Looped Transformers with Layer Normalization Provably Learn the Power Method

cs.LG · 2026-05-30 · unverdicted · novelty 8.0

Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

cs.AI · 2024-08-12 · unverdicted · novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Localizing Model Behavior with Path Patching

cs.LG · 2023-04-12 · unverdicted · novelty 8.0

Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

cs.LG · 2022-11-01 · conditional · novelty 8.0

GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

cs.AI · 2026-07-02 · unverdicted · novelty 7.0

RECONTEXT is a recursive evidence replay technique that improves long-context reasoning in LLMs by constructing and replaying a query-conditioned evidence pool before final generation.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.

Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

LOCOS scores attention heads via OV-circuit output projection onto answer-token unembedding directions and identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than prior literal-copy detectors.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit

cs.MA · 2026-06-29 · unverdicted · novelty 7.0

ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.

Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

cs.LG · 2026-06-24 · unverdicted · novelty 7.0

During pretraining, language models exhibit natural ungrokking where learned rules are forgotten based on their support frequency in the corpus, with asymmetric editability of rule survival.

DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

DREAM enables training of dense retrieval embeddings using autoregressive next-token prediction from LLMs by modulating attention with retriever scores.

Mind the Heads: Topological Representation Alignment for Multimodal LLMs

cs.CV · 2026-06-22 · conditional · novelty 7.0

HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

cs.CR · 2026-06-19 · unverdicted · novelty 7.0

Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

cs.CV · 2026-06-10 · conditional · novelty 7.0

Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.

Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

stat.ML · 2026-06-10 · unverdicted · novelty 7.0

Bayesian reduction of attention posterior on copy task predicts first-order phase transition for softmax attention and second-order followed by crossover for linear attention.

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

CoT probe-time gains arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.

citing papers explorer

Showing 6 of 6 citing papers after filters.

Large Vision-Language Models Get Lost in Attention cs.AI · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis cs.AI · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 45 · internal anchor
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Readable Minds: Emergent Theory-of-Mind-Like Behavior in LLM Poker Agents cs.AI · 2026-04-05 · conditional · none · ref 25 · internal anchor
Persistent memory is necessary and sufficient for LLM poker agents to reach ToM levels 3-5 and use strategic deception, while agents without memory stay at level 0.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory cs.AI · 2026-05-07 · unverdicted · none · ref 18 · internal anchor
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models cs.AI · 2026-04-11 · unreviewed · ref 68 · internal anchor

In-context Learning and Induction Heads

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer