CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on quadratic TM overhead.
super hub Canonical reference
In-context Learning and Induction Heads
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
"Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguing that induction heads may be the mechanistic source of general in-context learning in transformer models of any size. For small attention-only models, we present strong, causal evidence; for larger models with MLPs, we present correlational evidence.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract "Induction heads" are attention heads that implement a simple algorithm to complete token sequences like [A][B] ... [A] -> [B]. In this work, we present preliminary and indirect evidence for a hypothesis that induction heads might constitute the mechanism for the majority of all "in-context learning" in large transformer models (i.e. decreasing loss at increasing token indices). We find that induction heads develop at precisely the same point as a sudden sharp increase in in-context learning ability, visible as a bump in the training loss. We present six complementary lines of evidence, arguin
authors
co-cited works
representative citing papers
Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.
Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
RECONTEXT is a recursive evidence replay technique that improves long-context reasoning in LLMs by constructing and replaying a query-conditioned evidence pool before final generation.
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
LOCOS scores attention heads via OV-circuit output projection onto answer-token unembedding directions and identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than prior literal-copy detectors.
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.
SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
DREAM enables training of dense retrieval embeddings using autoregressive next-token prediction from LLMs by modulating attention with retriever scores.
HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.
Bayesian reduction of attention posterior on copy task predicts first-order phase transition for softmax attention and second-order followed by crossover for linear attention.
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
CoT probe-time gains arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
citing papers explorer
-
Efficiently Representing Algorithms With Chain-of-Thought Transformers
CoT transformers simulate any Word RAM algorithm with poly-logarithmic overhead in three architectures, improving on quadratic TM overhead.
-
Looped Transformers with Layer Normalization Provably Learn the Power Method
Looped linear transformers with LN provably converge via GD to implement the power method on principal component prediction.
-
Towards Verifiable Transformers: Solver-Checkable Circuit Explanations
Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.
-
WriteSAE: Sparse Autoencoders for Recurrent State
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
-
Slot Machines: How LLMs Keep Track of Multiple Entities
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
-
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
-
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Localizing Model Behavior with Path Patching
Path patching provides a method to express and quantitatively test hypotheses that neural network behaviors are localized to sets of paths.
-
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.
-
ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning
RECONTEXT is a recursive evidence replay technique that improves long-context reasoning in LLMs by constructing and replaying a query-conditioned evidence pool before final generation.
-
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
-
Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads
LOCOS scores attention heads via OV-circuit output projection onto answer-token unembedding directions and identifies non-literal retrieval heads whose ablation collapses performance on non-literal benchmarks more than prior literal-copy detectors.
-
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
-
ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit
ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.
-
Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies
SMDA fits ridge regression on SAE features to distill symbolic policies then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
-
DREAM: Dense Retrieval Embeddings via Autoregressive Modeling
DREAM enables training of dense retrieval embeddings using autoregressive next-token prediction from LLMs by modulating attention with retriever scores.
-
Mind the Heads: Topological Representation Alignment for Multimodal LLMs
HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.
-
Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents
Relinking is a new compression-boundary attack on LLM agents where summarization of split benign fragments produces malicious instructions, shown via Relink tool at 86.9% success rate and mitigated by KBRA defense to 0%.
-
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Reroute turns irreversible visual-token pruning into recoverable routing that reuses existing attention scores, improving grounding performance under aggressive reduction on LLaVA-1.5 and Qwen while preserving TFLOPs and KV-cache budgets.
-
Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence
Bayesian reduction of attention posterior on copy task predicts first-order phase transition for softmax attention and second-order followed by crossover for linear attention.
-
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations
STRIDE formulates TDA as sparse recovery using steering operators that mimic subset training effects in activation space, claiming SOTA LLM pre-training attribution at 13x prior speed.
-
What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
CoT probe-time gains arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
-
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
-
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
-
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Self-attention acts as a covariance readout that unifies in-context learning via population gradient descent and repetitive generation via asymptotic Markov behavior.
-
From Mechanistic to Compositional Interpretability
The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.
-
Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions
Performance collapse in layer-pruned LLMs stems from disrupting the Silent Phase of decision-making, which blocks the transition to correct predictions, while the later Decisive Phase is robust to pruning.
-
Elicitation Matters: How Prompts and Query Protocols Shape LLM Surrogates under Sparse Observations
LLM surrogate beliefs under sparse observations depend on prompts and query protocols, with structural prompts as priors, pointwise vs joint querying producing different beliefs, and sequential evidence causing non-monotonic updates that affect acquisition and regret.
-
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.
-
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.
-
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
-
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
HeadRank lifts preference optimization into attention space via entropy-regularized head selection and distribution regularizers to sharpen discriminability for efficient listwise reranking.
-
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
-
Screening Is Enough
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
-
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.
-
The Bayesian Geometry of Transformer Attention
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
-
Soft Head Selection for Injecting ICL-Derived Task Embeddings
SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.
-
A Markov Categorical Framework for Language Modeling
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
-
A ghost mechanism: An analytical model of abrupt learning in recurrent networks
The ghost mechanism derives a 1D canonical model of abrupt learning in RNNs from ghost points of saddle-node bifurcations, predicting an inverse-power-law critical learning rate and gradient-based failure modes.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
Jamba: A Hybrid Transformer-Mamba Language Model
Jamba presents a hybrid Transformer-Mamba MoE architecture for LLMs that delivers state-of-the-art benchmark performance and strong results up to 256K token contexts while fitting in one 80GB GPU with high throughput.
-
Steering Language Models With Activation Engineering
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
-
Do Models Read What They Write? Causal Registers in Scratchpad Reasoning
State-writing models causally use edited scratchpad states in a controlled task at 80-91% accuracy on held-out examples, unlike final-answer-only and pretrained controls.
-
Developmental Trajectories of Situation Modeling and Mentalizing in Transformer Language Models
Larger LLMs acquire basic situation modeling before mentalizing on false-belief tasks, with performance depending on size, training volume, and post-training, yet remaining sensitive to non-factive verbs and agent knowledge states.
-
Generating Special Triangulations with Transformers
Transformers generate new FRSTs of 4D reflexive polytopes across size ranges and self-improve by retraining on their own outputs.