A Bayesian Perspective on the Role of Epistemic Uncertainty for Delayed Generalization in In-Context Learning

Abdessamed Qchohi; Simone Rossi

arxiv: 2604.12434 · v1 · submitted 2026-04-14 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-10 14:37 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords epistemic uncertainty · in-context learning · grokking · transformers · Bayesian inference · modular arithmetic · generalization · spectral mechanism

The pith

Epistemic uncertainty collapses sharply at the onset of grokking in in-context learning transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how transformers achieve delayed generalization on modular arithmetic tasks when given in-context examples. It applies approximate Bayesian methods to estimate the posterior over model weights and tracks predictive uncertainty across training. Uncertainty falls abruptly exactly when the model shifts from memorization to generalization. A simplified Bayesian linear model shows that this collapse and the timing of grokking both arise from the same spectral properties of the task.
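One common way to make the tracked quantity concrete is the entropy-based split of predictive uncertainty into aleatoric and epistemic parts. The sketch below is an illustration under stated assumptions, not the paper's code: it assumes class probabilities from S posterior weight samples are already available (for instance from a Laplace or IVON posterior, the two methods named in the figure captions), and the function name `decompose_uncertainty` is hypothetical. Values are in nats.

```python
import numpy as np


def decompose_uncertainty(probs, eps=1e-12):
    """Split predictive uncertainty into total, aleatoric and epistemic parts.

    probs: array of shape (S, N, C) holding class probabilities for
           S posterior samples, N inputs and C classes (assumed given).
    Returns three arrays of shape (N,), all in nats.
    """
    probs = np.asarray(probs, dtype=float)

    # Posterior-averaged prediction for each input.
    mean_probs = probs.mean(axis=0)

    # Total uncertainty: entropy of the averaged prediction.
    total = -(mean_probs * np.log(mean_probs + eps)).sum(axis=-1)

    # Aleatoric uncertainty: average entropy of the individual samples.
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)

    # Epistemic uncertainty: the gap between the two (a mutual information).
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

When all posterior samples agree, the epistemic term is near zero even if each individual prediction is itself uncertain; it grows only when the samples disagree with one another.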

Core claim

In transformers trained on in-context modular arithmetic, epistemic uncertainty estimated via approximate Bayesian techniques collapses sharply at the transition to generalization. This collapse provides a label-free diagnostic for when grokking occurs. The accompanying Bayesian linear model establishes that both the delayed generalization and the uncertainty peak are produced asymptotically by the same spectral mechanism.

What carries the argument

The sharp collapse of epistemic uncertainty estimated from the approximate Bayesian posterior over transformer weights, whose timing is governed by the spectral mechanism identified in the simplified linear model.

If this is right

Uncertainty estimates can monitor the emergence of generalization without access to held-out labels.
Increasing task diversity or context length shifts both the grokking time and the location of the uncertainty peak in a predictable way.
Context noise modulates the height and timing of the uncertainty peak before collapse.
Asymptotically, the spectral properties of the data determine when generalization occurs and when uncertainty falls.
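The first point in the list above can be sketched as a simple monitor over a logged epistemic-uncertainty trajectory. This is a minimal illustration, not a procedure taken from the paper: the function name, the `drop_fraction` threshold, and the rule "first step after the peak where EU falls below a fraction of the peak value" are all assumptions chosen for clarity.

```python
import numpy as np


def find_eu_peak_and_collapse(eu, drop_fraction=0.5):
    """Locate the EU peak and the first subsequent collapse step.

    eu: 1-D sequence of epistemic-uncertainty values, one per checkpoint.
    drop_fraction: hypothetical threshold; collapse is declared when EU
                   falls below drop_fraction * (peak value).
    Returns (peak_index, collapse_index); collapse_index is None if the
    trajectory never drops below the threshold after the peak.
    """
    eu = np.asarray(eu, dtype=float)
    peak = int(np.argmax(eu))
    threshold = drop_fraction * eu[peak]

    # Indices, relative to the peak, where EU is below the threshold.
    below = np.nonzero(eu[peak:] < threshold)[0]
    collapse = int(peak + below[0]) if below.size else None
    return peak, collapse
```

No held-out labels enter the computation; only the uncertainty values themselves are used. In practice a noisy trajectory would likely need smoothing before applying a rule like this.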

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same uncertainty-collapse diagnostic could be tested on other in-context tasks beyond modular arithmetic.
Deployed models might use running uncertainty estimates to detect when generalization has begun without extra labeled data.
The spectral link suggests similar phase-transition behavior could appear in other regimes where delayed generalization is observed.

Load-bearing premise

Approximate Bayesian posterior estimates on the transformer weights accurately reflect the epistemic uncertainty relevant to generalization, and the simplified linear model captures the essential spectral dynamics inside the actual transformer.

What would settle it

Running the same modular arithmetic in-context tasks while using an exact posterior or an alternative uncertainty estimator and observing no sharp uncertainty drop at the measured grokking point, or finding that the linear model's predicted grokking time deviates from the transformer's observed transition.

Figures

Figures reproduced from arXiv: 2604.12434 by Abdessamed Qchohi, Simone Rossi.

**Figure 1.** Epistemic uncertainty collapses sharply at the grokking transition. On a decoder-only transformer trained on modular arithmetic tasks, the epistemic uncertainty collapses sharply before the generalization transition. This pattern follows the theoretical prediction for a simplified linear model, where the grokking transition corresponds to a spectral shift in the posterior distribution that causes a colla…

**Figure 2.** Accuracy heatmap with IVON and Laplace. For OOD Train, both IVON and Laplace achieve high accuracy starting at T = 256 and T = 512 for any training data fraction. For OOD Val (the most challenging split), Laplace generalizes more broadly across configurations, while IVON achieves competitive performance only at high training data fractions. [Plot residue: IVON panel; x-axis T (tasks); y-axis Uncertainty (nats), EU + AU.]

**Figure 3.** Uncertainty decomposition with IVON and Laplace. Decomposition of predictive uncertainty into aleatoric and epistemic components across the pretraining hyperparameter grid. [Adjacent body text: … of this writing, grokking has been primarily characterized in terms of accuracy dynamics (which implies the need for labels to detect it), while here we show that it also has a clear signature in the model’s posterior uncertainty, which…]

**Figure 4.** Accuracy, epistemic and aleatoric uncertainty across context sizes. Increasing the number of in-context examples monotonically improves accuracy and decreases aleatoric uncertainty, while the epistemic uncertainty follows a non-monotonic behavior. [Plot residue: panels Accuracy, EU and AU against Noise probability; series IVON and Laplace on Val, OOD Train and OOD Val.]

**Figure 5.** Accuracy vs. epistemic and aleatoric uncertainty across context noise levels. Corrupting in-context examples with random label noise degrades accuracy and raises both aleatoric and epistemic uncertainty. [Adjacent body text: … epistemic uncertainty, and aleatoric uncertainty as a function of context size. Accuracy improves monotonically with context size for both IVON and Laplace across all splits, confirming that more examples …]

**Figure 6.** Training and generalization dynamics of the simplified model. Increasing the aspect ratio λ delays generalization relative to training (left). The epistemic uncertainty peaks coincide with the transition from interpolation to generalization (center left), although the EU peak occurs before the accuracy threshold is exceeded (center right). Consistently, the EU peak delay grows rapidly with λ → 1, with the …

**Figure 7.** Empirical trajectory of accuracy and epistemic uncertainty. Even in the full transformer model, the EU peak occurs before the accuracy threshold is exceeded, similarly to the asymptotic behavior of the simplified model. [Adjacent body text: Takeaway. Theorem 5.1 is the main theoretical reason to say that grokking time can also be defined through epistemic uncertainty. The point is not that the EU peak occurs at the same abs…]

Original abstract

In-context learning enables transformers to adapt to new tasks from a few examples at inference time, while grokking highlights that this generalization can emerge abruptly only after prolonged training. We study task generalization and grokking in in-context learning using a Bayesian perspective, asking what enables the delayed transition from memorization to generalization. Concretely, we consider modular arithmetic tasks in which a transformer must infer a latent linear function solely from in-context examples and analyze how predictive uncertainty evolves during training. We combine approximate Bayesian techniques to estimate the posterior distribution and we study how uncertainty behaves across training and under changes in task diversity, context length, and context noise. We find that epistemic uncertainty collapses sharply when the model groks, making uncertainty a practical label-free diagnostic of generalization in transformers. Additionally, we provide theoretical support with a simplified Bayesian linear model, showing that asymptotically both delayed generalization and uncertainty peaks arise from the same underlying spectral mechanism, which links grokking time to uncertainty dynamics.
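To make the task setup concrete, here is a hedged sketch of how one in-context modular arithmetic episode might be generated. The abstract only says that the latent function is linear; the specific form z = (a·x + b·y) mod p, the modulus p = 7, the context length, and every name below are assumptions for illustration and may differ from the paper's construction.

```python
import numpy as np


def sample_icl_task(p=7, n_context=8, rng=None):
    """Sample one in-context episode for an assumed task z = (a*x + b*y) mod p.

    Returns (task, context, query, target) where
      task    = (a, b), the latent coefficients hidden from the model,
      context = int array of shape (n_context, 3) with rows (x, y, z),
      query   = int array of shape (2,) holding the final (x, y),
      target  = the correct z for the query.
    """
    rng = np.random.default_rng(rng)

    # Latent task: the model must infer (a, b) from the context alone.
    a, b = rng.integers(0, p, size=2)

    # Inputs for the context examples plus one held-back query.
    xy = rng.integers(0, p, size=(n_context + 1, 2))
    z = (a * xy[:, 0] + b * xy[:, 1]) % p

    context = np.column_stack([xy[:-1], z[:-1]])
    query = xy[-1]
    target = int(z[-1])
    return (int(a), int(b)), context, query, target
```

Under this assumed setup, task diversity corresponds to how many distinct (a, b) pairs appear during training, and context length to `n_context`.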

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper ties epistemic uncertainty collapse to grokking in ICL tasks via a shared spectral mechanism in a Bayesian setup, but the transformer posterior approximations look too shaky to carry the main claims.

The main point is that epistemic uncertainty drops sharply once the transformer groks on these modular arithmetic in-context tasks, and a simplified Bayesian linear model is used to argue that both the delay and the uncertainty peak come from the same slow spectral modes. This is presented as a label-free way to spot when generalization has happened.

What is actually new is the combination of approximate Bayesian tracking with grokking analysis on ICL, plus the asymptotic spectral account that links uncertainty dynamics directly to generalization time. The experiments varying task diversity, context length, and noise give some concrete handles on how those factors affect the uncertainty trajectory. The linear model derivation is the cleanest part and does provide a plausible mechanism without extra fitting parameters.

The soft spots are in the bridge from the linear model to the actual transformer. Standard approximate posteriors (Laplace, variational, etc.) are known to be unreliable in overparameterized non-convex landscapes and often fail to capture the slow directions tied to delayed generalization. The paper does not appear to include direct checks, such as Hessian spectrum comparisons or calibration against held-out generalization, that would show the estimated uncertainty is the relevant epistemic quantity rather than an artifact of the approximation. Without those, the collapse observation remains suggestive but not yet diagnostic. The task construction and quantitative strength of the collapse are also hard to judge from the available details.

This is for people already working on grokking or mechanistic studies of ICL who want ideas for uncertainty-based monitoring. It is not yet ready for broad citation, but the core idea is coherent enough that a serious editor should send it to referees rather than desk-reject it; the spectral link is worth testing properly even if the current evidence is preliminary.

Referee Report

3 major / 3 minor

Summary. The paper examines in-context learning in transformers on modular arithmetic tasks from a Bayesian viewpoint, tracking how epistemic uncertainty evolves during training. It reports that epistemic uncertainty collapses sharply at the grokking transition, positioning uncertainty as a label-free diagnostic of generalization. Theoretical support is provided via a simplified Bayesian linear model whose asymptotic spectral analysis links both delayed generalization and uncertainty peaks to the same underlying mechanism in the effective kernel or Hessian.

Significance. If the empirical collapse and the spectral linkage hold under scrutiny, the work offers a practical monitoring tool for generalization in ICL settings and a mechanistic explanation tying grokking to uncertainty dynamics. The combination of approximate posterior estimation on real transformers with an exactly solvable linear model is a strength, as is the focus on falsifiable predictions about uncertainty behavior under varying task diversity, context length, and noise.

major comments (3)

[§4] §4 (experimental results on uncertainty evolution): the reported sharp collapse of epistemic uncertainty at grokking is presented as robust, yet no quantitative diagnostics are given for the quality of the approximate posterior (e.g., effective sample size, mode coverage, or comparison against the Hessian spectrum of the trained transformer). Without such checks, it remains possible that the observed collapse reflects approximation artifacts rather than the epistemic uncertainty relevant to generalization.
[§5] §5 (simplified Bayesian linear model): the claim that both grokking time and uncertainty peaks arise from the same spectral mechanism is derived exactly in the linear case, but the manuscript does not verify that the transformer's learned kernel or Hessian exhibits the same slow eigenvalues or phase-transition behavior. The link therefore risks being partly shaped by the modeling choices rather than derived from first principles of the transformer.
[§3.1–3.2] §3.1–3.2 (task construction and posterior approximation): the modular arithmetic tasks and the specific approximate Bayesian method (Laplace, variational, etc.) are central to the empirical claim, yet the paper provides insufficient detail on how the posterior approximation is validated against the non-convex, overparameterized loss landscape of the transformer.
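The concern about slow eigenvalues in the §5 comment can be illustrated with a generic random-matrix sketch. This is not the paper's model: it assumes Gaussian inputs and takes the aspect ratio to be λ = d/n (features over samples), which may differ from the paper's definition. It uses the standard Marchenko-Pastur result that the smallest eigenvalue of the sample covariance approaches (1 − √λ)², so the slowest gradient-descent mode, whose relaxation time scales like the inverse of that eigenvalue, slows sharply as λ → 1. All function names are hypothetical.

```python
import numpy as np


def smallest_eigenvalue(n, d, seed=0):
    """Smallest eigenvalue of the sample covariance X^T X / n, X ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    cov = X.T @ X / n
    # eigvalsh returns eigenvalues in ascending order for symmetric matrices.
    return float(np.linalg.eigvalsh(cov)[0])


def mp_lower_edge(lam):
    """Marchenko-Pastur lower spectral edge for aspect ratio lam = d / n < 1."""
    return (1.0 - np.sqrt(lam)) ** 2


if __name__ == "__main__":
    n = 500
    for lam in (0.1, 0.5, 0.9):
        d = int(lam * n)
        e_min = smallest_eigenvalue(n, d)
        # Under gradient descent on least squares, the slowest mode relaxes
        # on a time scale proportional to 1 / e_min.
        print(f"lambda={lam:.1f}  e_min={e_min:.4f}  "
              f"MP edge={mp_lower_edge(lam):.4f}  1/e_min={1.0 / e_min:.1f}")
```

A check of the kind the referee asks for would compare a spectrum like this against the empirical Hessian or kernel spectrum of the trained transformer, which this sketch does not attempt.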

minor comments (3)

[§2] Notation for the predictive uncertainty decomposition (epistemic vs. aleatoric) is introduced without an explicit equation reference in the main text, making it harder to connect to the later spectral analysis.
[Figures 2–4] Figure captions for the uncertainty trajectories should include error bars or multiple random seeds to convey variability across runs.
[Introduction] A few citations to prior work on grokking in transformers and on Bayesian uncertainty in deep networks are missing or incomplete.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions have been made to strengthen the manuscript.

Referee: [§4] §4 (experimental results on uncertainty evolution): the reported sharp collapse of epistemic uncertainty at grokking is presented as robust, yet no quantitative diagnostics are given for the quality of the approximate posterior (e.g., effective sample size, mode coverage, or comparison against the Hessian spectrum of the trained transformer). Without such checks, it remains possible that the observed collapse reflects approximation artifacts rather than the epistemic uncertainty relevant to generalization.

Authors: We agree that additional quantitative checks would increase confidence that the observed collapse reflects genuine epistemic uncertainty rather than approximation artifacts. In the revised manuscript we have added an appendix with diagnostics for the Laplace approximation: effective sample size estimates, mode coverage assessed via multiple random restarts, and direct comparison of the approximate posterior covariance against the Hessian spectrum computed at pre- and post-grokking checkpoints. These new results show that posterior quality remains stable or improves across the transition, supporting that the collapse is not an artifact.

revision: yes

Referee: [§5] §5 (simplified Bayesian linear model): the claim that both grokking time and uncertainty peaks arise from the same spectral mechanism is derived exactly in the linear case, but the manuscript does not verify that the transformer's learned kernel or Hessian exhibits the same slow eigenvalues or phase-transition behavior. The link therefore risks being partly shaped by the modeling choices rather than derived from first principles of the transformer.

Authors: The linear model is presented as an exactly solvable proxy that isolates the shared spectral mechanism; we do not claim it is identical to the transformer. To address the concern we have added new experiments in the revision that compute the eigenvalue spectrum of the empirical neural tangent kernel (and Hessian) of the trained transformer at successive epochs. These spectra exhibit a comparable accumulation of slow modes immediately prior to grokking, qualitatively consistent with the linear model's phase-transition prediction, although the precise scaling differs due to nonlinearity. We have clarified the scope of the theoretical link in the text.

revision: partial

Referee: [§3.1–3.2] §3.1–3.2 (task construction and posterior approximation): the modular arithmetic tasks and the specific approximate Bayesian method (Laplace, variational, etc.) are central to the empirical claim, yet the paper provides insufficient detail on how the posterior approximation is validated against the non-convex, overparameterized loss landscape of the transformer.

Authors: We have substantially expanded Sections 3.1 and 3.2. The revised text now includes the precise modular-arithmetic data-generation procedure, the exact Laplace approximation implementation (including the Hessian approximation technique), and validation steps consisting of posterior predictive checks on held-out contexts together with consistency comparisons against short-run MCMC on reduced-scale models. We also added a short discussion of known limitations of the approximation in overparameterized non-convex landscapes and relevant citations.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper reports empirical results from approximate Bayesian posterior estimates on trained transformers, observing that epistemic uncertainty collapses at grokking, and then supplies an independent simplified Bayesian linear model whose asymptotic spectral analysis links delayed generalization to uncertainty peaks. No equations, self-citations, or model choices in the provided text reduce the claimed spectral mechanism or the diagnostic utility of uncertainty to a fitted parameter, a renamed observation, or a self-referential definition. The simplified model is presented as theoretical support rather than a tautological restatement of the transformer experiments, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5473 in / 1127 out tokens · 24868 ms · 2026-05-10T14:37:59.252350+00:00 · methodology

