Progress measures for grokking via mechanistic interpretability

Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt · 2023 · cs.LG · arXiv 2301.05217

Canonical reference. 100% of citing Pith papers cite this work as background.

37 Pith papers citing it

Background 100% of classified citations

abstract

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
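
The "addition as rotation about a circle" idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the trained network or the authors' code; it only reproduces the arithmetic the abstract describes. The function name `modular_add_via_rotation`, the modulus `p=113`, and the frequencies in `freqs` are illustrative assumptions, and any nonzero frequencies work when `p` is prime.

```python
import numpy as np

def modular_add_via_rotation(a, b, p=113, freqs=(1, 2, 5)):
    """Compute (a + b) mod p using only sines, cosines, and trig identities.

    Hypothetical illustration: p and freqs are assumed values, not taken
    from the page above.
    """
    # Angular frequencies w_k = 2*pi*k/p, one per chosen Fourier frequency k.
    w = 2.0 * np.pi * np.asarray(freqs, dtype=float) / p

    # Represent each input as points on circles: (cos(w a), sin(w a)).
    cos_a, sin_a = np.cos(w * a), np.sin(w * a)
    cos_b, sin_b = np.cos(w * b), np.sin(w * b)

    # Angle-addition identities turn a + b into a composition of rotations.
    cos_ab = cos_a * cos_b - sin_a * sin_b   # cos(w (a + b))
    sin_ab = sin_a * cos_b + cos_a * sin_b   # sin(w (a + b))

    # Score every candidate c with sum_k cos(w_k (a + b - c)),
    # expanded via cos(x - y) = cos x cos y + sin x sin y.
    c = np.arange(p)
    wc = np.outer(w, c)                      # shape (num_freqs, p)
    logits = cos_ab @ np.cos(wc) + sin_ab @ np.sin(wc)

    # Each cosine peaks only where a + b - c is a multiple of p,
    # so the largest score sits at c = (a + b) mod p.
    return int(np.argmax(logits))
```

For example, `modular_add_via_rotation(100, 50)` returns 37, since 150 mod 113 = 37. Summing over several frequencies widens the gap between the correct candidate and its neighbours, which is one reason a handful of frequencies is more robust than a single one.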

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

cs.LG · 2026-05-09 · unverdicted · novelty 8.0

In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to generative phenomena including double descent and out-of-equilibrium biases.

Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

CEDAR learns an invertible rotation of vision-language embeddings to concentrate semantics into sparse, axis-aligned coordinates for improved interpretability.

Markovian Circuit Tracing for Transformer State Dynamic

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

cs.AI · 2026-05-13 · conditional · novelty 7.0

The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.

First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

A first-passage time model produces the law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star) that predicts grokking delays with 17.7% MAPE on held-out AdamW runs after calibrating two parameters on one cell.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

cs.LG · 2026-05-05 · accept · novelty 7.0 · 2 refs

Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.

ILDR: Geometric Early Detection of Grokking

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

Dimensional Criticality at Grokking Across MLPs and Transformers

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

cs.LG · 2026-03-30 · unverdicted · novelty 7.0

The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

cs.CL · 2026-01-27 · unverdicted · novelty 7.0

Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.

The Bayesian Geometry of Transformer Attention

cs.LG · 2025-12-27 · unverdicted · novelty 7.0

Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

cs.LG · 2025-04-29 · accept · novelty 7.0

One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

Mechanisms of Misgeneralization in Physical Sequence Modeling

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a kernel…

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.

Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

cs.LG · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.

LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

cs.CL · 2026-04-07 · unverdicted · novelty 6.0

LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.

citing papers explorer

Showing 37 of 37 citing papers.

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models cs.LG · 2026-05-09 · unverdicted · none · ref 80 · internal anchor
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to generative phenomena including double descent and out-of-equilibrium biases.
Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models cs.CV · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
CEDAR learns an invertible rotation of vision-language embeddings to concentrate semantics into sparse, axis-aligned coordinates for improved interpretability.
Markovian Circuit Tracing for Transformer State Dynamic cs.LG · 2026-05-20 · unverdicted · none · ref 44 · internal anchor
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability cs.LG · 2026-05-14 · unverdicted · none · ref 20 · internal anchor
Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers cs.AI · 2026-05-13 · conditional · none · ref 11 · internal anchor
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation cs.LG · 2026-05-13 · unverdicted · none · ref 3 · internal anchor
A first-passage time model produces the law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star) that predicts grokking delays with 17.7% MAPE on held-out AdamW runs after calibrating two parameters on one cell.
Interpreting Reinforcement Learning Agents with Susceptibilities cs.LG · 2026-05-08 · unverdicted · none · ref 120 · internal anchor
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It cs.LG · 2026-05-05 · accept · none · ref 7 · 2 links · internal anchor
Transformers store count information internally but cannot read it out as digits due to near-orthogonal alignment with output-head rows; updating digit rows or applying LoRA to attention layers improves constrained and unconstrained counting respectively.
ILDR: Geometric Early Detection of Grokking cs.LG · 2026-04-22 · unverdicted · none · ref 5 · internal anchor
ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
Grokking of Diffusion Models: Case Study on Modular Addition cs.LG · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Dimensional Criticality at Grokking Across MLPs and Transformers cs.LG · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior cs.LG · 2026-03-30 · unverdicted · none · ref 18 · internal anchor
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a strong inductive bias that can raise accuracy from failure to 99.8%.
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability cs.CL · 2026-01-27 · unverdicted · none · ref 14 · internal anchor
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.
The Bayesian Geometry of Transformer Attention cs.LG · 2025-12-27 · unverdicted · none · ref 13 · internal anchor
Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning cs.AI · 2025-11-04 · unverdicted · none · ref 15 · internal anchor
DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking cs.LG · 2025-10-06 · unverdicted · none · ref 7 · internal anchor
EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example cs.LG · 2025-04-29 · accept · none · ref 38 · internal anchor
One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.
Mechanisms of Misgeneralization in Physical Sequence Modeling cs.LG · 2026-05-19 · unverdicted · none · ref 61 · internal anchor
Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a kernel…
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 109 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory cs.LG · 2026-05-12 · unverdicted · none · ref 17 · 2 links · internal anchor
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation cs.LG · 2026-05-12 · unverdicted · none · ref 65 · internal anchor
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization stat.ML · 2026-05-07 · unverdicted · none · ref 31 · internal anchor
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams cs.LG · 2026-04-20 · unverdicted · none · ref 8 · 2 links · internal anchor
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out harm benchmarks.
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces cs.CL · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.
Under Pressure: Emotional Framing Induces Measurable Behavioral Shifts and Structured Internal Geometry in Small Language Models cs.CL · 2026-04-06 · unverdicted · none · ref 10 · internal anchor
Emotional framings induce distinct behavioral shifts and form a structured geometry in the final-layer activations of small language models, with pressure linked to shortcuts and calm to honesty.
Grokking as Dimensional Phase Transition in Neural Networks cs.LG · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.
PhiNet: Speaker Verification with Phonetic Interpretability eess.AS · 2026-04-02 · unverdicted · none · ref 58 · internal anchor
PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
Feature Identification via the Empirical NTK cs.LG · 2025-10-01 · unverdicted · none · ref 8 · internal anchor
Eigenanalysis of the empirical NTK surfaces feature directions that align with Fourier features in modular addition networks and grammatical features in Gemma-3-270M, outperforming PCA baselines on activations.
FoNE: Precise Single-Token Number Embeddings via Fourier Features cs.CL · 2025-02-13 · unverdicted · none · ref 29 · internal anchor
FoNE encodes numbers as single tokens via Fourier features and outperforms subword and digit-wise embeddings on addition, subtraction, and multiplication with far less data.
Steered Generation via Gradient-Based Optimization on Sparse Query Features cs.LG · 2026-05-21 · unverdicted · none · ref 31 · internal anchor
Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds cs.LG · 2026-05-10 · unverdicted · none · ref 17 · internal anchor
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
Emergent Semantic Role Understanding in Language Models cs.AI · 2026-05-09 · unverdicted · none · ref 46 · internal anchor
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned models, and representations shifting toward more distributed forms at larger scales.
Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry cs.LG · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.
Learning the symmetric group: large from small cs.LG · 2025-02-18 · unverdicted · none · ref 13 · internal anchor
A transformer trained on S10 permutation prediction from transpositions generalizes to S25 with near-100% accuracy using identity augmentation and partitioned windows.
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance cs.AI · 2026-05-02 · unverdicted · none · ref 25 · internal anchor
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking cs.LG · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 170 · internal anchor
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
