A Basin-Selection Perspective on Grokking via Singular Learning Theory
Pith reviewed 2026-05-15 18:07 UTC · model grok-4.3
The pith
Grokking is the optimization-driven shift from a higher-LLC memorizing basin to a lower-LLC generalizing basin favored by the posterior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the singular learning theory account of grokking, the local learning coefficient ranks near-zero-loss basins by their contribution to the posterior; lower-LLC basins are statistically preferred and generalize better. Grokking therefore corresponds to a dynamical transition, under gradient descent, from a higher-LLC memorizing basin to a lower-LLC generalizing basin that eventually dominates the posterior. Analytic LLC formulas confirm the ranking for quadratic networks, and data-driven LLC estimates during training serve as an observable probe of when this transition occurs.
What carries the argument
The local learning coefficient (LLC) from singular learning theory, which measures local degeneracy of the loss surface around a minimizer and thereby determines how much posterior mass that basin attracts.
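For readers unfamiliar with the measure, the standard SLT characterization of the LLC can be written out as below. This is a sketch in assumed notation, not the paper's own: L is the population loss, L_n the empirical loss on n samples, w^* a local minimizer, B(w^*) a small neighborhood of it, \varphi a prior density, and m the multiplicity. For a regular (non-degenerate) minimum in d dimensions the LLC reduces to d/2, so a lower value signals more degeneracy.

```latex
% Volume scaling: the LLC is the exponent governing how fast the volume of
% near-optimal parameters around w^* shrinks as the loss tolerance goes to zero.
V(\epsilon) = \int_{B(w^*)} \mathbf{1}\{\, L(w) - L(w^*) < \epsilon \,\} \, dw
  \;\sim\; c \, \epsilon^{\lambda(w^*)} \, (-\log \epsilon)^{m-1}
  \qquad \text{as } \epsilon \to 0^{+}.

% Local free energy of the neighborhood, to leading order in n.
% Lower free energy means more posterior mass, so at equal loss the
% basin with the smaller \lambda(w^*) is preferred.
F_n\bigl(B(w^*)\bigr) = -\log \int_{B(w^*)} e^{-n L_n(w)} \, \varphi(w) \, dw
  \;\approx\; n L_n(w^*) + \lambda(w^*) \log n .
```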
If this is right
- LLC trajectories computed from training data can serve as an online indicator for the onset of generalization.
- Closed-form LLC expressions allow exact comparison of statistical preference between lazy-training and feature-learning solutions without running dynamics.
- The timing of grokking is controlled by the speed of optimization dynamics rather than by the relative LLC values themselves.
- The same basin-ranking logic applies to any setting where multiple near-interpolating solutions compete under Bayesian inference.
Where Pith is reading between the lines
- If LLC can be estimated reliably at scale, it offers a way to monitor generalization progress in settings where test data are unavailable or expensive.
- Optimizers could be redesigned to bias trajectories toward low-LLC regions earlier, potentially shortening the memorization phase.
- The framework suggests that other abrupt transitions, such as those seen in double descent or in-context learning, may also be basin-selection events governed by LLC differences.
Load-bearing premise
Lower-LLC basins necessarily carry higher posterior mass concentration and lower expected generalization error than higher-LLC basins with comparable training loss.
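The premise can be made concrete with the leading-order free-energy approximation F_n ≈ n·loss + LLC·log n. The following is a minimal sketch under that assumption only; the function name and all numeric values are hypothetical and chosen for illustration, not taken from the paper.

```python
import math


def log_posterior_odds(n, loss_a, llc_a, loss_b, llc_b):
    """Leading-order log posterior odds of basin A over basin B.

    Assumes F_n ~ n * loss + llc * log(n) for each basin and ignores
    lower-order terms (prior mass, multiplicity, constants).
    """
    free_energy_a = n * loss_a + llc_a * math.log(n)
    free_energy_b = n * loss_b + llc_b * math.log(n)
    # Posterior mass is proportional to exp(-F_n), so log odds = F_b - F_a.
    return free_energy_b - free_energy_a


# Hypothetical basins: A "generalizing" (lower LLC), B "memorizing" (higher LLC).
n = 10_000
log_odds = log_posterior_odds(n, loss_a=0.0, llc_a=3.0, loss_b=0.0, llc_b=8.0)

# Same LLCs, but basin A now has slightly higher training loss than basin B.
log_odds_unequal_loss = log_posterior_odds(
    n, loss_a=0.01, llc_a=3.0, loss_b=0.0, llc_b=8.0
)
```

At equal loss the log odds equal the LLC gap times log n and favor the lower-LLC basin. In the second call a small loss difference multiplied by n outweighs the LLC term and the sign flips, which is why the premise is restricted to basins with comparable training loss.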
What would settle it
An explicit pair of near-zero-loss solutions in the same quadratic network where the basin with the lower LLC shows higher test error or lower posterior probability than the higher-LLC basin would falsify the ranking.
Original abstract
Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape. The key measure is the local learning coefficient (LLC) which quantifies the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging SLT, we develop a basin-selection perspective on grokking in quadratic networks: LLC ranks competing near-zero-loss basins by statistical preference, while the training-time transition between them is governed by optimisation dynamics. In this view, grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin that dominates the posterior. To support this, we derive analytic formulas for the LLC in shallow quadratic networks under both lazy and feature learning regimes. Empirically, we demonstrate that LLC trajectories estimated from training data track the onset of generalisation and provide an informative probe of the optimisation path.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a basin-selection interpretation of grokking in shallow quadratic networks using Singular Learning Theory (SLT). It argues that grokking corresponds to a transition from a higher-LLC memorizing basin to a lower-LLC generalizing basin that dominates the posterior, with analytic LLC expressions derived for both lazy and feature-learning regimes and empirical LLC trajectories shown to track the onset of generalization.
Significance. If the analytic LLC derivations are correct and the finite-time behavior aligns with SLT predictions, the work supplies a geometric Bayesian account of why certain near-zero-loss solutions are statistically preferred, potentially explaining the delayed generalization in grokking. The closed-form LLC expressions for quadratic networks constitute a concrete technical contribution that could be tested in other regimes.
major comments (2)
- [Abstract and §4] Abstract and §4 (analytic LLC derivations): The central claim that lower-LLC basins dominate the posterior and exhibit lower expected generalization error rests on the standard SLT asymptotic (gen. error ~ λ/n). For the finite training trajectories studied in shallow quadratic networks, it is unclear whether the LLC ranking already selects the generalizing basin at the observed grokking step or whether basin-volume and optimization corrections remain comparable in magnitude; an explicit regime check or finite-n correction term is needed to support the ranking interpretation.
- [§5] §5 (empirical LLC trajectories): The reported LLC estimates are said to track generalization onset, yet the manuscript does not detail how the local neighborhood for LLC estimation is chosen or whether the estimates remain stable under different sampling radii; without this, it is difficult to rule out that the observed drop simply reflects the optimization path rather than a genuine posterior-mass shift.
minor comments (2)
- [§2] Notation for the local learning coefficient should be introduced with an explicit reference to the SLT definition (e.g., the integral form) on first use to aid readers unfamiliar with the measure.
- [Figures] Figure captions would benefit from stating the network depth, width, and exact loss function used in each panel so that the quadratic-network results can be reproduced without consulting the main text.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive suggestions. We address the two major comments point by point below, indicating the revisions we will incorporate.
- Referee: [Abstract and §4] Abstract and §4 (analytic LLC derivations): The central claim that lower-LLC basins dominate the posterior and exhibit lower expected generalization error rests on the standard SLT asymptotic (gen. error ~ λ/n). For the finite training trajectories studied in shallow quadratic networks, it is unclear whether the LLC ranking already selects the generalizing basin at the observed grokking step or whether basin-volume and optimization corrections remain comparable in magnitude; an explicit regime check or finite-n correction term is needed to support the ranking interpretation.
  Authors: The analytic LLC expressions we derive are exact for the quadratic network class, and the SLT asymptotic supplies the leading term that governs posterior dominance for large n. In the grokking regime the LLC gap between the memorizing and generalizing basins is O(1), while volume corrections enter at higher order; our experiments operate at n large enough for the LLC ranking to determine the transition. We will add an explicit regime check to §4 that compares the LLC-based posterior weight approximation against direct evidence estimates on small-scale instances, confirming that the ranking selects the generalizing basin at the observed grokking step.
  revision: partial
- Referee: [§5] §5 (empirical LLC trajectories): The reported LLC estimates are said to track generalization onset, yet the manuscript does not detail how the local neighborhood for LLC estimation is chosen or whether the estimates remain stable under different sampling radii; without this, it is difficult to rule out that the observed drop simply reflects the optimization path rather than a genuine posterior-mass shift.
  Authors: We agree that the LLC estimation procedure requires explicit documentation. In the revised §5 we will specify the exact sampling radius and number of local samples used for each LLC estimate, and we will add a sensitivity analysis showing that the drop in LLC at the grokking transition remains stable and occurs at the same training step across a range of radii (0.01–0.2). These checks indicate that the observed change reflects a genuine shift in posterior mass toward the lower-LLC basin.
  revision: yes
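The kind of sensitivity check discussed in this exchange can be illustrated on a toy problem. The sketch below is not the authors' procedure: it uses the common estimator n·β·(E[L(w)] − L(w*)) with β = 1/log n over a tempered posterior localized by a quadratic term of strength gamma, and it replaces stochastic sampling with exact grid quadrature in two dimensions so the result is deterministic. The two toy losses, the function names, and the mapping from "sampling radius" to gamma are all assumptions made for illustration.

```python
import numpy as np


def llc_estimate(loss_fn, n, gamma, half_width=2.0, grid_points=801):
    """Estimate the LLC at w* = 0 for a 2-D loss with loss_fn(0, 0) = 0.

    Computes n * beta * E[L(w)] with beta = 1 / log(n), where the expectation
    is over p(w) proportional to exp(-n*beta*L(w) - gamma/2 * ||w||^2).
    Grid quadrature stands in for the sampler (an assumption of this sketch).
    """
    beta = 1.0 / np.log(n)
    axis = np.linspace(-half_width, half_width, grid_points)
    w1, w2 = np.meshgrid(axis, axis, indexing="ij")
    loss = loss_fn(w1, w2)
    log_density = -n * beta * loss - 0.5 * gamma * (w1**2 + w2**2)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(log_density - log_density.max())
    expected_loss = np.sum(weights * loss) / np.sum(weights)
    return n * beta * expected_loss


def regular_loss(w1, w2):
    # Non-degenerate minimum in 2 dimensions: true LLC is d/2 = 1.
    return w1**2 + w2**2


def singular_loss(w1, w2):
    # Degenerate minimum (zero along both axes): true LLC is 1/2.
    return (w1 * w2) ** 2


# Sweep the localization strength; larger gamma means a tighter neighborhood.
n = 1000
results = [
    (gamma, llc_estimate(singular_loss, n, gamma), llc_estimate(regular_loss, n, gamma))
    for gamma in (5.0, 10.0, 50.0)
]
for gamma, llc_singular, llc_regular in results:
    print(f"gamma={gamma:5.1f}  singular={llc_singular:.3f}  regular={llc_regular:.3f}")
```

In this toy setting the absolute estimates move with gamma while the ordering of the two minima does not, which is the distinction a radius sweep is meant to expose: stability of the ranking and of the transition step matters more than stability of the raw value.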
Circularity Check
No significant circularity: LLC formulas derived from geometry; SLT links applied as external theory
The paper derives closed-form LLC expressions directly from the Hessian and degeneracy structure of shallow quadratic networks in lazy and feature regimes. These derivations stand on the explicit loss landscape geometry rather than on any fit to generalization error or grokking timing. The mapping from lower LLC to higher posterior mass and lower expected generalization error is imported from standard SLT asymptotics (Watanabe) and is not re-derived or fitted inside the paper; the basin-selection interpretation therefore rests on that external relation rather than on a self-referential loop. No self-citation chain, ansatz smuggling, or renaming of known results is used to close the central claim. The empirical tracking of LLC trajectories is presented as a probe, not as a proof that LLC must dominate at the observed transition. The derivation chain remains self-contained against the paper's own equations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Singular Learning Theory provides a valid local measure (LLC) of degeneracy that correlates with posterior concentration and generalization error
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error... grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (unclear)
  Unclear: relation between the paper passage and the cited Recognition theorem.
  "We derive closed-form LLC expressions for quadratic networks (Sections 4 and 5)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.