pith. sign in

hub Canonical reference

Progress measures for grokking via mechanistic interpretability

Canonical reference. 100% of citing Pith papers cite this work as background.

37 Pith papers citing it
Background 100% of classified citations
abstract

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous \textit{progress measures} that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of ``grokking'' exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

hub tools

citation-role summary

background 5

citation-polarity summary

years

2026 30 2025 7

roles

background 5

polarities

background 5

representative citing papers

Markovian Circuit Tracing for Transformer State Dynamic

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

ILDR: Geometric Early Detection of Grokking

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.

Grokking of Diffusion Models: Case Study on Modular Addition

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

Dimensional Criticality at Grokking Across MLPs and Transformers

cs.LG · 2026-04-06 · unverdicted · novelty 7.0

Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.

The Bayesian Geometry of Transformer Attention

cs.LG · 2025-12-27 · unverdicted · novelty 7.0

Small transformers reproduce known Bayesian posteriors with 10^{-3} to 10^{-4} bit accuracy in verifiable wind-tunnel tasks via residual belief states, FFN updates, and attention routing, while MLPs do not.

Mechanisms of Misgeneralization in Physical Sequence Modeling

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a内核

citing papers explorer

Showing 37 of 37 citing papers.