A Basin-Selection Perspective on Grokking via Singular Learning Theory

Ben Cullen; Jiayi Li; Riya Danait; Sergio Estan-Ruiz

arXiv: 2603.01192 · v3 · submitted 2026-03-01 · stat.ML · cs.LG

Pith reviewed 2026-05-15 18:07 UTC · model grok-4.3

keywords: grokking, singular learning theory, local learning coefficient, basin selection, generalization, loss landscape, quadratic networks

The pith

Grokking is the optimization-driven shift from a higher-LLC memorizing basin to a lower-LLC generalizing basin favored by the posterior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames grokking as a competition between solution basins in the loss landscape, where singular learning theory supplies a geometric ranking via the local learning coefficient. Lower LLC values mark basins that concentrate more posterior mass and incur lower expected generalization error. The authors derive closed-form LLC expressions for shallow quadratic networks in both lazy and feature-learning regimes, then show that estimated LLC trajectories during gradient descent track the abrupt onset of generalization. This supplies a Bayesian explanation for why extended training can suddenly unlock generalization without any change in training loss.

Core claim

In the singular learning theory account of grokking, the local learning coefficient ranks near-zero-loss basins by their contribution to the posterior; lower-LLC basins are statistically preferred and generalize better. Grokking therefore corresponds to a dynamical transition, under gradient descent, from a higher-LLC memorizing basin to a lower-LLC generalizing basin that eventually dominates the posterior. Analytic LLC formulas confirm the ranking for quadratic networks, and data-driven LLC estimates during training serve as an observable probe of when this transition occurs.

What carries the argument

The local learning coefficient (LLC) from singular learning theory, which measures local degeneracy of the loss surface around a minimizer and thereby determines how much posterior mass that basin attracts.
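
As a reminder of what "local degeneracy" means here, the standard SLT characterization of the LLC is a volume-scaling law. The notation below (prior density \varphi, neighborhood B(w^*), multiplicity m, constant c) is an assumption for illustration and is not taken from the paper.

```latex
V(\epsilon)
  = \int_{\{\, w \in B(w^*) \,:\, L(w) - L(w^*) < \epsilon \,\}} \varphi(w)\, dw
  \;\sim\; c\, \epsilon^{\lambda(w^*)} \left(-\log \epsilon\right)^{m(w^*) - 1}
  \qquad \text{as } \epsilon \to 0 .
```

For a non-degenerate minimum in d dimensions this gives \lambda(w^*) = d/2, and degenerate minima have smaller values. A smaller exponent means the near-optimal volume shrinks more slowly as \epsilon \to 0, which is why a lower-LLC basin retains more posterior mass.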

If this is right

LLC trajectories computed from training data can serve as an online indicator for the onset of generalization.
Closed-form LLC expressions allow exact comparison of statistical preference between lazy-training and feature-learning solutions without running dynamics.
The timing of grokking is controlled by the speed of optimization dynamics rather than by the relative LLC values themselves.
The same basin-ranking logic applies to any setting where multiple near-interpolating solutions compete under Bayesian inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If LLC can be estimated reliably at scale, it offers a way to monitor generalization progress in settings where test data are unavailable or expensive.
Optimizers could be redesigned to bias trajectories toward low-LLC regions earlier, potentially shortening the memorization phase.
The framework suggests that other abrupt transitions, such as those seen in double descent or in-context learning, may also be basin-selection events governed by LLC differences.

Load-bearing premise

Lower-LLC basins necessarily carry higher posterior mass concentration and lower expected generalization error than higher-LLC basins with comparable training loss.
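
The premise can be made concrete with the leading-order SLT free energy of a basin, F ≈ n·L + λ·log n, which gives the posterior odds between two basins. The sketch below is illustrative only: the function name and the numerical values are hypothetical, and the log log n and O(1) terms (which include basin-volume prefactors) are deliberately dropped.

```python
import math

def log_posterior_odds(n, loss_a, llc_a, loss_b, llc_b):
    """Leading-order log odds of basin A over basin B.

    Uses the local free energy approximation F ~ n * loss + llc * log(n).
    Terms of order log log n and O(1) are ignored (an assumption).
    """
    free_energy_a = n * loss_a + llc_a * math.log(n)
    free_energy_b = n * loss_b + llc_b * math.log(n)
    # Posterior mass is proportional to exp(-F), so the log odds are F_b - F_a.
    return free_energy_b - free_energy_a

# Hypothetical values, not taken from the paper: two zero-loss basins.
n = 1000
odds_equal_loss = log_posterior_odds(n, 0.0, 2.0, 0.0, 5.0)

# Same basins, but the lower-LLC basin now has a slightly higher loss.
odds_with_loss_gap = log_posterior_odds(n, 0.05, 2.0, 0.0, 5.0)
```

At equal training loss the lower-LLC basin is favored by a factor of roughly n raised to the LLC gap. The second call shows how a small loss gap, multiplied by n, can outweigh the LLC term, which is one reason the premise is stated only for basins with comparable training loss.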

What would settle it

An explicit pair of near-zero-loss solutions in the same quadratic network where the basin with the lower LLC shows higher test error or lower posterior probability than the higher-LLC basin would falsify the ranking.

Original abstract

Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape. The key measure is the local learning coefficient (LLC) which quantifies the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging SLT, we develop a basin-selection perspective on grokking in quadratic networks: LLC ranks competing near-zero-loss basins by statistical preference, while the training-time transition between them is governed by optimisation dynamics. In this view, grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin that dominates the posterior. To support this, we derive analytic formulas for the LLC in shallow quadratic networks under both lazy and feature learning regimes. Empirically, we demonstrate that LLC trajectories estimated from training data track the onset of generalisation and provide an informative probe of the optimisation path.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Analytic LLC formulas for quadratic nets frame grokking as a basin switch, but the finite-time SLT link needs direct checks.

This paper derives closed-form local learning coefficients for shallow quadratic networks in both lazy and feature-learning regimes, then frames grokking as the training dynamics moving from a higher-LLC memorizing basin to a lower-LLC generalizing one that dominates the posterior. The analytic expressions are the clearest new element; they come from the loss geometry rather than being adjusted to fit the observed generalization jump, so the argument stays non-circular on its own terms. The experiments show LLC trajectories estimated from training data aligning with the onset of generalization, which gives a concrete probe of the optimization path. That part is useful and worth having on record. The soft spot is the direct application of the SLT asymptotic (lower LLC implies higher posterior mass and lower expected error) to finite training steps in these small models. Basin volumes and the speed of optimization can introduce corrections of similar magnitude at the transition point, and the paper does not yet show that the LLC difference dominates those terms. The stress-test note correctly flags this regime question. Readers already working with SLT or mechanistic accounts of grokking will find the derivations and the LLC probe worth looking at. The work has enough specific math and empirical alignment to deserve a serious referee, even if the finite-n validity will need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a basin-selection interpretation of grokking in shallow quadratic networks using Singular Learning Theory (SLT). It argues that grokking corresponds to a transition from a higher-LLC memorizing basin to a lower-LLC generalizing basin that dominates the posterior, with analytic LLC expressions derived for both lazy and feature-learning regimes and empirical LLC trajectories shown to track the onset of generalization.

Significance. If the analytic LLC derivations are correct and the finite-time behavior aligns with SLT predictions, the work supplies a geometric Bayesian account of why certain near-zero-loss solutions are statistically preferred, potentially explaining the delayed generalization in grokking. The closed-form LLC expressions for quadratic networks constitute a concrete technical contribution that could be tested in other regimes.

major comments (2)

[Abstract and §4] (analytic LLC derivations): The central claim that lower-LLC basins dominate the posterior and exhibit lower expected generalization error rests on the standard SLT asymptotic (gen. error ~ λ/n). For the finite training trajectories studied in shallow quadratic networks, it is unclear whether the LLC ranking already selects the generalizing basin at the observed grokking step or whether basin-volume and optimization corrections remain comparable in magnitude; an explicit regime check or finite-n correction term is needed to support the ranking interpretation.
[§5] (empirical LLC trajectories): The reported LLC estimates are said to track generalization onset, yet the manuscript does not detail how the local neighborhood for LLC estimation is chosen or whether the estimates remain stable under different sampling radii; without this, it is difficult to rule out that the observed drop simply reflects the optimization path rather than a genuine posterior-mass shift.
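
The neighborhood-sensitivity concern in the second comment can be illustrated on a one-dimensional toy problem. The sketch below uses one commonly used form of the LLC estimator, n·β·(E[L(w)] − L(w*)) with β = 1/log n, and evaluates the expectation by grid quadrature instead of sampling. Everything in it is an assumption for illustration: the page does not describe the paper's estimator, and the localization strength `gamma` is only a stand-in for a "sampling radius" (larger `gamma` means a tighter neighborhood).

```python
import numpy as np

def llc_estimate(loss, w_star, n, gamma=0.0, half_width=3.0, num=200001):
    """Evaluate n * beta * (E[L(w)] - L(w_star)) with beta = 1 / log(n).

    The expectation is taken under the localized tempered posterior
    p(w) proportional to exp(-n*beta*L(w) - 0.5*gamma*(w - w_star)**2),
    computed by quadrature on a uniform 1-D grid (no sampling noise).
    """
    beta = 1.0 / np.log(n)
    w = np.linspace(w_star - half_width, w_star + half_width, num)
    excess = loss(w) - loss(w_star)
    log_p = -n * beta * excess - 0.5 * gamma * (w - w_star) ** 2
    p = np.exp(log_p - log_p.max())
    p /= p.sum()  # uniform grid, so the spacing cancels on normalization
    return n * beta * np.sum(p * excess)

n = 10_000
quadratic = lambda w: w ** 2   # non-degenerate minimum, LLC = 1/2
quartic = lambda w: w ** 4     # degenerate minimum, LLC = 1/4

llc_quadratic = llc_estimate(quadratic, 0.0, n)
llc_quartic = llc_estimate(quartic, 0.0, n)

# Sensitivity to localization strength (hypothetical range of values).
sweep = [llc_estimate(quadratic, 0.0, n, gamma=g) for g in (0.0, 10.0, 100.0, 1000.0)]
```

Without localization the toy estimates recover the known values 1/2 and 1/4. The sweep shows the estimate shrinking as localization tightens, so an estimate reported at a single neighborhood size cannot by itself separate a genuine change in degeneracy from an artifact of the neighborhood choice.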

minor comments (2)

[§2] Notation for the local learning coefficient should be introduced with an explicit reference to the SLT definition (e.g., the integral form) on first use to aid readers unfamiliar with the measure.
[Figures] Figure captions would benefit from stating the network depth, width, and exact loss function used in each panel so that the quadratic-network results can be reproduced without consulting the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. We address the two major comments point by point below, indicating the revisions we will incorporate.

Referee: [Abstract and §4] (analytic LLC derivations): The central claim that lower-LLC basins dominate the posterior and exhibit lower expected generalization error rests on the standard SLT asymptotic (gen. error ~ λ/n). For the finite training trajectories studied in shallow quadratic networks, it is unclear whether the LLC ranking already selects the generalizing basin at the observed grokking step or whether basin-volume and optimization corrections remain comparable in magnitude; an explicit regime check or finite-n correction term is needed to support the ranking interpretation.

Authors: The analytic LLC expressions we derive are exact for the quadratic network class, and the SLT asymptotic supplies the leading term that governs posterior dominance for large n. In the grokking regime the LLC gap between the memorizing and generalizing basins is O(1), while volume corrections enter at higher order; our experiments operate at n large enough for the LLC ranking to determine the transition. We will add an explicit regime check to §4 that compares the LLC-based posterior weight approximation against direct evidence estimates on small-scale instances, confirming that the ranking selects the generalizing basin at the observed grokking step.
Revision: partial

Referee: [§5] (empirical LLC trajectories): The reported LLC estimates are said to track generalization onset, yet the manuscript does not detail how the local neighborhood for LLC estimation is chosen or whether the estimates remain stable under different sampling radii; without this, it is difficult to rule out that the observed drop simply reflects the optimization path rather than a genuine posterior-mass shift.

Authors: We agree that the LLC estimation procedure requires explicit documentation. In the revised §5 we will specify the exact sampling radius and number of local samples used for each LLC estimate, and we will add a sensitivity analysis showing that the drop in LLC at the grokking transition remains stable and occurs at the same training step across a range of radii (0.01–0.2). These checks indicate that the observed change reflects a genuine shift in posterior mass toward the lower-LLC basin.
Revision: yes
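
The kind of regime check described in the first response, comparing an LLC-based approximation against a direct evidence estimate, can be sketched on a one-dimensional toy problem where the evidence integral is computable by quadrature. This is an illustration under stated assumptions, not the authors' procedure: the function names, the uniform prior on [-1, 1], and the toy losses are all hypothetical.

```python
import numpy as np

def free_energy(loss, n, half_width=1.0, num=400001):
    """Return -log of the evidence: the integral of exp(-n * loss(w))
    against a uniform prior density on [-half_width, half_width]."""
    w = np.linspace(-half_width, half_width, num)
    dw = w[1] - w[0]
    integrand = np.exp(-n * loss(w)) / (2.0 * half_width)
    return -np.log(np.sum(integrand) * dw)

def fitted_llc(loss, ns):
    """Slope of the free energy against log(n).

    SLT predicts F_n ~ n * L(w*) + lambda * log(n); both toy losses are
    zero at the minimum, so the fitted slope should match lambda.
    """
    f = np.array([free_energy(loss, n) for n in ns])
    slope, _intercept = np.polyfit(np.log(ns), f, 1)
    return slope

ns = np.array([1e3, 1e4, 1e5, 1e6])
slope_quadratic = fitted_llc(lambda w: w ** 2, ns)  # expected LLC 1/2
slope_quartic = fitted_llc(lambda w: w ** 4, ns)    # expected LLC 1/4
```

In these toy cases the free energy is almost exactly linear in log n, so the fitted slope matches the LLC closely. In a real network the lower-order terms need not be negligible at the n of interest, which is the point the referee raises.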

Circularity Check

0 steps flagged

No significant circularity: LLC formulas derived from geometry; SLT links applied as external theory

The paper derives closed-form LLC expressions directly from the Hessian and degeneracy structure of shallow quadratic networks in lazy and feature regimes. These derivations stand on the explicit loss landscape geometry rather than on any fit to generalization error or grokking timing. The mapping from lower LLC to higher posterior mass and lower expected generalization error is imported from standard SLT asymptotics (Watanabe) and is not re-derived or fitted inside the paper; the basin-selection interpretation therefore rests on that external relation rather than on a self-referential loop. No self-citation chain, ansatz smuggling, or renaming of known results is used to close the central claim. The empirical tracking of LLC trajectories is presented as a probe, not as a proof that LLC must dominate at the observed transition. The derivation chain remains self-contained against the paper's own equations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of Singular Learning Theory to neural-network loss surfaces and on the interpretation that LLC ranks basins by posterior mass and generalization error.

axioms (1)

Domain assumption: Singular Learning Theory provides a valid local measure (LLC) of degeneracy that correlates with posterior concentration and generalization error.
Invoked throughout the abstract to link LLC to statistical preference of basins.

pith-pipeline@v0.9.0 · 5508 in / 1273 out tokens · 85200 ms · 2026-05-15T18:07:43.950693+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error... grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Passage: "We derive closed-form LLC expressions for quadratic networks (Sections 4 and 5)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.