A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning
Pith reviewed 2026-05-10 14:37 UTC · model grok-4.3
The pith
Epistemic uncertainty collapses sharply at the onset of grokking in in-context learning transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In transformers trained on in-context modular arithmetic, epistemic uncertainty estimated via approximate Bayesian techniques collapses sharply at the transition to generalization. This collapse provides a label-free diagnostic for when grokking occurs. The accompanying Bayesian linear model establishes that both the delayed generalization and the uncertainty peak are produced asymptotically by the same spectral mechanism.
What carries the argument
The sharp collapse of epistemic uncertainty estimated from the approximate Bayesian posterior over transformer weights, whose timing is governed by the spectral mechanism identified in the simplified linear model.
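To make "epistemic uncertainty estimated from the approximate posterior" concrete, here is a minimal sketch of the standard entropy-based decomposition. It assumes class probabilities are available from several posterior weight samples. The function names and the array layout are illustrative assumptions, and the paper may use a different estimator.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats along `axis` (eps guards against log(0))."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    probs: array of shape (S, N, C) holding class probabilities from
           S posterior weight samples, for N inputs and C classes
           (this layout is an assumption of the sketch).
    Returns (total, aleatoric, epistemic), each of shape (N,).
    """
    mean_probs = probs.mean(axis=0)          # posterior-averaged prediction
    total = entropy(mean_probs)              # entropy of the averaged prediction
    aleatoric = entropy(probs).mean(axis=0)  # average entropy per weight sample
    epistemic = total - aleatoric            # mutual information between y and w
    return total, aleatoric, epistemic
```

When all weight samples agree, the epistemic term is close to zero even if each prediction is itself uncertain. It grows only when the samples disagree with one another.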
If this is right
- Uncertainty estimates can monitor the emergence of generalization without access to held-out labels.
- Increasing task diversity or context length shifts both the grokking time and the location of the uncertainty peak in a predictable way.
- Context noise modulates the height and timing of the uncertainty peak before collapse.
- Asymptotically, the spectral properties of the data determine when generalization occurs and when uncertainty falls.
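The first implication above, monitoring generalization without held-out labels, could be prototyped along the following lines. This is a hypothetical detector, not the paper's procedure: the smoothing window, the `drop_frac` threshold, and the function name are all assumptions chosen for illustration.

```python
import numpy as np

def find_collapse_step(steps, uncertainty, window=5, drop_frac=0.5):
    """Return the first training step at which the smoothed uncertainty
    falls below `drop_frac` times its running peak, or None if it never does.

    steps:       1-D sequence of training steps at which uncertainty was logged
    uncertainty: 1-D sequence of epistemic-uncertainty values (same length)
    """
    u = np.asarray(uncertainty, dtype=float)
    kernel = np.ones(window) / window
    # Moving average; entry i covers u[i : i + window].
    smooth = np.convolve(u, kernel, mode="valid")
    running_peak = np.maximum.accumulate(smooth)
    below = np.nonzero(smooth < drop_frac * running_peak)[0]
    if below.size == 0:
        return None
    # Report the step at the right edge of the first qualifying window.
    return steps[below[0] + window - 1]
```

Comparing against the running peak rather than a fixed threshold keeps the detector insensitive to the absolute scale of the uncertainty estimate, which differs between posterior approximations.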
Where Pith is reading between the lines
- The same uncertainty-collapse diagnostic could be tested on other in-context tasks beyond modular arithmetic.
- Deployed models might use running uncertainty estimates to detect when generalization has begun without extra labeled data.
- The spectral link suggests similar phase-transition behavior could appear in other regimes where delayed generalization is observed.
Load-bearing premise
Approximate Bayesian posterior estimates on the transformer weights accurately reflect the epistemic uncertainty relevant to generalization, and the simplified linear model captures the essential spectral dynamics inside the actual transformer.
What would settle it
Running the same modular arithmetic in-context tasks while using an exact posterior or an alternative uncertainty estimator and observing no sharp uncertainty drop at the measured grokking point, or finding that the linear model's predicted grokking time deviates from the transformer's observed transition.
Original abstract
In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
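The abstract describes tasks in which a latent linear function must be inferred from in-context examples but does not give the exact construction. The sketch below assumes one plausible form, y = (a·x1 + b·x2) mod p with a fresh (a, b) per sequence. The modulus, context length, and function name are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def sample_icl_sequence(p=29, n_context=8, rng=None):
    """Sample one in-context modular-arithmetic sequence.

    Assumed task form (not confirmed by the source):
        y = (a * x1 + b * x2) mod p
    where (a, b) is the latent task, drawn once per sequence.
    Returns (context, query, target, (a, b)).
    """
    rng = np.random.default_rng() if rng is None else rng
    a, b = rng.integers(0, p, size=2)                # latent task parameters
    xs = rng.integers(0, p, size=(n_context + 1, 2)) # inputs, last row is the query
    ys = (a * xs[:, 0] + b * xs[:, 1]) % p
    context = np.column_stack([xs[:-1], ys[:-1]])    # rows of (x1, x2, y)
    query, target = xs[-1], ys[-1]
    return context, query, target, (a, b)
```

Under this assumed setup, task diversity would correspond to how many distinct (a, b) pairs appear during training, and context length to `n_context`.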
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines in-context learning in transformers on modular arithmetic tasks from a Bayesian viewpoint, tracking how epistemic uncertainty evolves during training. It reports that epistemic uncertainty collapses sharply at the grokking transition, positioning uncertainty as a label-free diagnostic of generalization. Theoretical support is provided via a simplified Bayesian linear model whose asymptotic spectral analysis links both delayed generalization and uncertainty peaks to the same underlying mechanism in the effective kernel or Hessian.
Significance. If the empirical collapse and the spectral linkage hold under scrutiny, the work offers a practical monitoring tool for generalization in ICL settings and a mechanistic explanation tying grokking to uncertainty dynamics. The combination of approximate posterior estimation on real transformers with an exactly solvable linear model is a strength, as is the focus on falsifiable predictions about uncertainty behavior under varying task diversity, context length, and noise.
major comments (3)
- [§4] §4 (experimental results on uncertainty evolution): the reported sharp collapse of epistemic uncertainty at grokking is presented as robust, yet no quantitative diagnostics are given for the quality of the approximate posterior (e.g., effective sample size, mode coverage, or comparison against the Hessian spectrum of the trained transformer). Without such checks, it remains possible that the observed collapse reflects approximation artifacts rather than the epistemic uncertainty relevant to generalization.
- [§5] §5 (simplified Bayesian linear model): the claim that both grokking time and uncertainty peaks arise from the same spectral mechanism is derived exactly in the linear case, but the manuscript does not verify that the transformer's learned kernel or Hessian exhibits the same slow eigenvalues or phase-transition behavior. The link therefore risks being partly shaped by the modeling choices rather than derived from first principles of the transformer.
- [§3.1–3.2] §3.1–3.2 (task construction and posterior approximation): the modular arithmetic tasks and the specific approximate Bayesian method (Laplace, variational, etc.) are central to the empirical claim, yet the paper provides insufficient detail on how the posterior approximation is validated against the non-convex, overparameterized loss landscape of the transformer.
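The "slow eigenvalues" raised in the §5 comment can be illustrated with a textbook fact: under gradient flow on a quadratic loss, the error along each eigendirection decays as exp(-λ_i t). The sketch below is a generic illustration under that assumption and is not the paper's derivation. The function names are hypothetical.

```python
import numpy as np

def mode_errors(eigvals, init_err, t):
    """Per-mode error under gradient flow on a quadratic loss:
    e_i(t) = e_i(0) * exp(-lambda_i * t)."""
    return init_err * np.exp(-eigvals * t)

def time_to_tolerance(eigvals, init_err, tol):
    """Earliest time at which every mode's |error| is at most `tol`.

    Assumes strictly positive eigenvalues and nonzero initial errors.
    """
    times = np.log(np.abs(init_err) / tol) / eigvals
    # Modes that start below tolerance need no time at all.
    return float(np.max(np.maximum(times, 0.0)))
```

With eigenvalues of 1.0 and 0.01 and equal initial errors, the fast mode is fitted almost immediately while the slow mode sets the overall convergence time. This is the kind of gap that produces a long plateau followed by a late transition.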
minor comments (3)
- [§2] Notation for the predictive uncertainty decomposition (epistemic vs. aleatoric) is introduced without an explicit equation reference in the main text, making it harder to connect to the later spectral analysis.
- [Figures 2–4] The uncertainty-trajectory figures should show error bars or multiple random seeds, described in the captions, to convey variability across runs.
- [Introduction] A few citations to prior work on grokking in transformers and on Bayesian uncertainty in deep networks are missing or incomplete.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.
- Referee: [§4] §4 (experimental results on uncertainty evolution): the reported sharp collapse of epistemic uncertainty at grokking is presented as robust, yet no quantitative diagnostics are given for the quality of the approximate posterior (e.g., effective sample size, mode coverage, or comparison against the Hessian spectrum of the trained transformer). Without such checks, it remains possible that the observed collapse reflects approximation artifacts rather than the epistemic uncertainty relevant to generalization.
  Authors: We agree that additional quantitative checks would increase confidence that the observed collapse reflects genuine epistemic uncertainty rather than approximation artifacts. In the revised manuscript we have added an appendix with diagnostics for the Laplace approximation: effective sample size estimates, mode coverage assessed via multiple random restarts, and direct comparison of the approximate posterior covariance against the Hessian spectrum computed at pre- and post-grokking checkpoints. These new results show that posterior quality remains stable or improves across the transition, supporting that the collapse is not an artifact.
  Revision: yes
- Referee: [§5] §5 (simplified Bayesian linear model): the claim that both grokking time and uncertainty peaks arise from the same spectral mechanism is derived exactly in the linear case, but the manuscript does not verify that the transformer's learned kernel or Hessian exhibits the same slow eigenvalues or phase-transition behavior. The link therefore risks being partly shaped by the modeling choices rather than derived from first principles of the transformer.
  Authors: The linear model is presented as an exactly solvable proxy that isolates the shared spectral mechanism; we do not claim it is identical to the transformer. To address the concern we have added new experiments in the revision that compute the eigenvalue spectrum of the empirical neural tangent kernel (and Hessian) of the trained transformer at successive epochs. These spectra exhibit a comparable accumulation of slow modes immediately prior to grokking, qualitatively consistent with the linear model's phase-transition prediction, although the precise scaling differs due to nonlinearity. We have clarified the scope of the theoretical link in the text.
  Revision: partial
- Referee: [§3.1–3.2] §3.1–3.2 (task construction and posterior approximation): the modular arithmetic tasks and the specific approximate Bayesian method (Laplace, variational, etc.) are central to the empirical claim, yet the paper provides insufficient detail on how the posterior approximation is validated against the non-convex, overparameterized loss landscape of the transformer.
  Authors: We have substantially expanded Sections 3.1 and 3.2. The revised text now includes the precise modular-arithmetic data-generation procedure, the exact Laplace approximation implementation (including the Hessian approximation technique), and validation steps consisting of posterior predictive checks on held-out contexts together with consistency comparisons against short-run MCMC on reduced-scale models. We also added a short discussion of known limitations of the approximation in overparameterized non-convex landscapes and relevant citations.
  Revision: yes
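The second response mentions computing the eigenvalue spectrum of an empirical neural tangent kernel. As a rough sketch of what that computation involves, the example below builds K = J Jᵀ for a toy two-layer network f(x) = v · tanh(W x) with a hand-written Jacobian. The toy architecture and the function name are assumptions for illustration. A real transformer would need automatic differentiation to obtain J.

```python
import numpy as np

def ntk_spectrum(X, W, v):
    """Eigenvalues (descending) of the empirical NTK K = J J^T
    for the toy model f(x) = v . tanh(W x).

    X: (N, D) inputs, W: (H, D) first-layer weights, v: (H,) output weights.
    """
    H = np.tanh(X @ W.T)                        # hidden activations, (N, H)
    dH = 1.0 - H**2                             # derivative of tanh
    J_v = H                                     # df/dv, (N, H)
    J_W = (dH * v)[:, :, None] * X[:, None, :]  # df/dW, (N, H, D)
    # Stack all parameter gradients into one Jacobian of shape (N, H + H*D).
    J = np.concatenate([J_v, J_W.reshape(len(X), -1)], axis=1)
    K = J @ J.T                                 # empirical NTK, (N, N)
    return np.linalg.eigvalsh(K)[::-1]          # eigvalsh returns ascending order
```

Evaluating this at successive checkpoints and tracking how many eigenvalues sit near zero would give a crude view of the "accumulation of slow modes" described in the response.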
Circularity Check
No significant circularity detected
The paper reports empirical results from approximate Bayesian posterior estimates on trained transformers, observing that epistemic uncertainty collapses at grokking, and then supplies an independent simplified Bayesian linear model whose asymptotic spectral analysis links delayed generalization to uncertainty peaks. No equations, self-citations, or model choices in the provided text reduce the claimed spectral mechanism or the diagnostic utility of uncertainty to a fitted parameter, a renamed observation, or a self-referential definition. The simplified model is presented as theoretical support rather than a tautological restatement of the transformer experiments, satisfying the criteria for a self-contained derivation.
Reference graph
Works this paper leans on
- Fisher Information Matrix (FIM): Replaces the exact Hessian to guarantee positive semi-definiteness.
- Last-Layer Laplace (LLLA): Restricts Bayesian inference to the final linear layer while keeping preceding layers fixed at $w_{\mathrm{MAP}}$, drastically reducing dimensionality.
- Kronecker-Factored Approximate Curvature (KFAC): Approximates the Fisher matrix as a Kronecker product $F \approx A \otimes B$ (input and pre-activation gradient covariances) for efficient inversion.

The predictive distribution for a new data point $x^*$ is computed by marginalizing over this approximate posterior:

$$p(y \mid x^*, \mathcal{D}) \approx \int p\big(y \mid f_w(x^*)\big)\, \mathcal{N}(w;\, w_{\mathrm{MAP}}, \Sigma)\, dw \tag{19}$$

In practice, thi...

$$\ldots \left. \Phi\!\left(\frac{\sqrt{\epsilon} - m_i(t)}{\sqrt{v_i(t)}}\right) - \Phi\!\left(\frac{-\sqrt{\epsilon} - m_i(t)}{\sqrt{v_i(t)}}\right) \right], \tag{86}$$

$$A_{\mathrm{gen}}(t) = \mathbb{E}_{p(x)} \ldots$$
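The integral in (19) generally has no closed form for a softmax output. One common workaround is Monte Carlo sampling from the Gaussian posterior. The excerpt is truncated before it says what the paper does, so treating it as Monte Carlo is an assumption of this sketch. The example is restricted to a last-layer posterior, with hypothetical names and shapes.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def laplace_predictive(phi, w_map, cov, n_samples=256, rng=None):
    """Monte Carlo estimate of p(y | x*, D) under a last-layer Laplace posterior.

    phi:   (D,) features of x* from the frozen network body
    w_map: (C, D) MAP weights of the final linear layer
    cov:   (C*D, C*D) posterior covariance over the flattened weights
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw weight samples w ~ N(w_MAP, Sigma).
    samples = rng.multivariate_normal(w_map.ravel(), cov, size=n_samples)
    # Reshape to (S, C, D) and apply each sampled layer to the features.
    logits = samples.reshape(n_samples, *w_map.shape) @ phi  # (S, C)
    # Average the per-sample class probabilities.
    return softmax(logits).mean(axis=0)
```

As the covariance shrinks toward zero the estimate approaches the plain MAP prediction, while a broad posterior pulls the averaged probabilities away from a confident MAP prediction.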