A Sharper Picture of Generalization in Transformers

Paul Lintilhac; Sair Shaikh

arxiv: 2605.20988 · v2 · pith:FQLV235Rnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

keywords: transformers, generalization, PAC-Bayes, Fourier spectra, boolean functions, flat minima, sharpness, sparse spectra

The pith

Transformers can implement any boolean function with sparsity at most the context length using flat minima, which then yield non-vacuous PAC-Bayes generalization bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates generalization in transformers on boolean domains through the lens of Fourier spectra rather than Rademacher complexity. It establishes that sparse, low-degree spectra permit the construction of low-sharpness flat minima capable of realizing the target functions. By proving the existence of such minima for any boolean function whose sparsity does not exceed the context length and then applying PAC-Bayes bounds to an idealized learner that uses them, the authors obtain generalization guarantees that remain non-vacuous. Empirical evaluations and mechanistic interpretability experiments are used to check whether real trained transformers behave in ways consistent with these constructions.

Core claim

We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound.
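
To make the terms "sparsity" and "degree" concrete, the following is a minimal sketch that computes the exact Fourier spectrum of a small boolean function by brute force. It assumes the usual ±1 encoding of inputs and reads sparsity as the number of nonzero Fourier coefficients; the function names `fourier_spectrum` and `sparsity_and_degree` are illustrative and do not come from the paper.

```python
import itertools
import numpy as np

def fourier_spectrum(f, n):
    """Exact Fourier coefficients of f: {-1,+1}^n -> R, keyed by index subset.

    Brute force over all 2^n inputs, so only suitable for small n.
    """
    points = list(itertools.product([-1, 1], repeat=n))
    values = np.array([f(x) for x in points], dtype=float)
    spectrum = {}
    for r in range(n + 1):
        for subset in itertools.combinations(range(n), r):
            # Character chi_S(x) = product of x_i for i in S (1 for the empty set).
            chi = np.array([np.prod([x[i] for i in subset]) for x in points])
            spectrum[subset] = float(np.mean(values * chi))
    return spectrum

def sparsity_and_degree(spectrum, tol=1e-9):
    """Number of nonzero coefficients and the size of the largest nonzero subset."""
    support = [s for s, c in spectrum.items() if abs(c) > tol]
    return len(support), max((len(s) for s in support), default=0)

# A parity on two of four bits: one nonzero coefficient, degree 2.
parity = fourier_spectrum(lambda x: x[0] * x[2], 4)
print(sparsity_and_degree(parity))

# Majority of three bits: four nonzero coefficients, degree 3.
majority = fourier_spectrum(lambda x: 1 if sum(x) > 0 else -1, 3)
print(sparsity_and_degree(majority))
```

Under this reading, the claim covers functions whose support size (the first number returned) is at most the context length.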

What carries the argument

Existence of flat (low-sharpness) minima that exactly implement any sparse boolean function whose sparsity is bounded by the context length; these minima serve as the basis for applying PAC-Bayes theory to obtain non-vacuous bounds.

If this is right

Any boolean function whose sparsity is at most the context length admits a flat-minimum realization inside the transformer parameter space.
PAC-Bayes bounds applied to an idealized learner that selects among such low-sharpness solutions produce non-vacuous generalization guarantees.
Empirical predictions derived from the low-sharpness construction are testable on concrete transformer training runs.
Mechanistic interpretability analysis can reveal whether real transformers exhibit weight configurations consistent with the flat-minima construction.
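
For orientation, one standard textbook form of a PAC-Bayes bound is reproduced below. It is not taken from the paper, whose own bound is not shown on this page; the symbols m (sample size), δ (failure probability), P (prior), Q (posterior), L (true risk), and L̂ (empirical risk) are notation assumed here.

```latex
% Standard McAllester-style PAC-Bayes bound with Maurer's constant.
% Assumptions: loss in [0,1], i.i.d. sample of size m >= 8,
% prior P chosen before seeing the data.
% With probability at least 1 - \delta over the sample, for all posteriors Q:
\[
  L(Q) \;\le\; \hat{L}(Q)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{m}}{\delta}}{2m}}
\]
```

For a loss bounded in [0,1], such a bound is non-vacuous only when the right-hand side is below 1, which requires the KL term to be small relative to m. This is where a low-sharpness solution helps: a flat minimum tolerates a wide posterior, which tends to keep the KL term small.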

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the flat-minima construction holds, transformers may implicitly select low-sharpness solutions when the data distribution favors sparse low-degree functions.
Similar existence arguments could be attempted for other architectures that admit a notion of sharpness and can represent boolean functions exactly.
Increasing context length would immediately extend the class of functions for which non-vacuous bounds are guaranteed under the same argument.

Load-bearing premise

The existence of flat minima implementing any boolean function of sparsity no greater than the context length.

What would settle it

Training or optimization runs that fail to locate any flat minimum realizing a specific sparse boolean function whose sparsity is at most the context length, or measurements showing that the resulting PAC-Bayes bound is still vacuous despite using the idealized low-sharpness learner.

Figures

Figures reproduced from arXiv: 2605.20988 by Paul Lintilhac, Sair Shaikh.

**Figure 2.** Sharpness and Frobenius norm comparisons between the exact construction and the learned …

**Figure 3.** This plot shows the empirical 90th-percentile perturbation of the sharpness (the trace of the loss Hessian), and how it depends on the degree D_f and sparsity ω of the target function it expresses, as well as the sequence length T. Each line is a different magnitude of the perturbation, and naturally the size of the perturbation always increases as σ increases. While qualitatively similar to our analytic bo…

**Figure 5.** Left: A plot showing the combined attention matrix W for a function learned with an architecture …

**Figure 4.** This plot shows our analytic bound on the perturbation to the sharpness over the same grid of …

**Figure 6.** This plot shows a comparison of the bound with analytic worst-case bound on …

**Figure 7.** A diagram showing the high-level dependency graph of our 1.5-layer transformer construction.

**Figure 8.** Left: A plot showing the upper bound on the maximum degree of the target function obtained …

Original abstract

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
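
The abstract does not say which property-testing procedure is used. As a loose illustration of the basic primitive such procedures usually build on, the sketch below estimates a single Fourier coefficient from uniformly random samples and uses Hoeffding's inequality to size the sample. The helper names `estimate_coefficient` and `samples_needed` are hypothetical, and this is not the paper's estimator.

```python
import math
import numpy as np

def estimate_coefficient(f, n, subset, num_samples, rng):
    """Monte Carlo estimate of the Fourier coefficient of f on `subset`.

    Draws inputs uniformly from {-1,+1}^n and averages f(x) * chi_S(x).
    """
    x = rng.choice([-1, 1], size=(num_samples, n))
    idx = np.asarray(list(subset), dtype=int)
    chi = np.prod(x[:, idx], axis=1)  # character chi_S(x) for each sample
    fx = np.array([f(row) for row in x], dtype=float)
    return float(np.mean(fx * chi))

def samples_needed(eps, delta):
    """Hoeffding sample size for error <= eps with probability >= 1 - delta.

    Assumes f takes values in [-1, 1], so each term f(x) * chi_S(x) has range 2.
    """
    return math.ceil(2.0 * math.log(2.0 / delta) / eps**2)

rng = np.random.default_rng(0)
target = lambda x: x[0] * x[2]  # toy target: parity on bits 0 and 2
print(estimate_coefficient(target, 6, (0, 2), samples_needed(0.1, 0.05), rng))
```

The sample size depends only on the accuracy and confidence, not on the input dimension n, which is what makes estimates of this kind cheap for a fixed set of candidate coefficients.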

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims to construct flat minima in transformers for any sparse boolean function up to context length and then derives a non-vacuous PAC-Bayes bound from an idealized low-sharpness learner.

The key point to know is that the authors construct flat minima for transformers that exactly implement any boolean function with Fourier sparsity at most the context length, then use an idealized low-sharpness learner to get a non-vacuous PAC-Bayes generalization bound. This differs from the Rademacher complexity route in the papers they cite. They support this with empirical evaluation of the predictions and a mechanistic interpretability study to argue that real transformers behave in ways consistent with the construction.

What the paper does well is link spectral properties of the target to the existence of good minima and then apply PAC-Bayes in a way that aims for non-vacuous results. The interpretability work is a positive step toward making the theory more believable.

The soft spots are in the central construction. The bound relies on being able to control sharpness while implementing the function. If the construction only achieves exact implementation without a separate bound on the Hessian or sharpness that stays small, then the PAC-Bayes step won't deliver what is claimed. The stress-test note is right to flag this as the pivotal unverified step. Since the abstract gives no derivations, the full paper needs to show the explicit construction clearly. The setting is limited to boolean domains and sparse low-degree functions, which keeps things tractable but means the result applies to a specific class of problems rather than general transformer behavior.

This paper is for researchers interested in theoretical explanations of generalization in transformers, particularly those exploring PAC-Bayes or spectral methods. A reader working on logical reasoning tasks or flat minima could get value from the ideas if the math checks out. I would bring it to a reading group to go through the construction in detail. I would not cite it in the next year unless the proofs are solid. It deserves serious peer review because the angle is new and the empirical grounding is there, even if revisions on the main result are likely.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that sparse Fourier spectra concentrated on low-degree components enable low-sharpness constructions in transformers that implement any boolean function whose sparsity is at most the context length. The authors establish the existence of such flat minima, then apply a PAC-Bayes bound to an idealized low-sharpness learner to obtain a non-vacuous generalization bound. This approach is positioned as an alternative to prior Rademacher-complexity analyses, with supporting empirical evaluations of the theoretical predictions and a mechanistic interpretability study examining the construction's realism in trained models.

Significance. If the existence result is shown with explicit, function-independent control of sharpness and the resulting PAC-Bayes bound is rigorously non-vacuous, the work would supply a useful theoretical account of why transformers can generalize on sparse boolean tasks. It would highlight the interplay between Fourier sparsity, flat minima, and generalization, offering a concrete alternative to complexity-based bounds and potentially explaining empirical observations of flat minima. The empirical validation and interpretability analysis add practical relevance.

major comments (2)

[§3.2, Theorem 1] The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.
[§4, Eq. (8)] The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.
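
The two sharpness measures named in the first comment (Hessian trace and largest Hessian eigenvalue) can be illustrated on a toy loss. The sketch below uses a central-difference Hessian in NumPy; the quadratic loss and the helper names are assumptions for illustration and have nothing to do with the paper's transformer construction.

```python
import numpy as np

def numerical_hessian(loss, theta, eps=1e-4):
    """Central-difference Hessian of a scalar loss at theta (small dimensions only)."""
    d = theta.size
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d)
            ej = np.zeros(d)
            ei[i] = eps
            ej[j] = eps
            hess[i, j] = (
                loss(theta + ei + ej) - loss(theta + ei - ej)
                - loss(theta - ei + ej) + loss(theta - ei - ej)
            ) / (4.0 * eps**2)
    # Symmetrize to remove small asymmetries caused by rounding.
    return 0.5 * (hess + hess.T)

def sharpness_measures(loss, theta):
    """Return (Hessian trace, largest Hessian eigenvalue) at theta."""
    hess = numerical_hessian(loss, theta)
    return float(np.trace(hess)), float(np.linalg.eigvalsh(hess)[-1])

# Toy quadratic with Hessian diag(1, 4): trace 5, largest eigenvalue 4.
toy_loss = lambda t: 0.5 * (t[0] ** 2 + 4.0 * t[1] ** 2)
print(sharpness_measures(toy_loss, np.zeros(2)))
```

The two measures can disagree about which of two minima is flatter, which is one reason a bound should state which measure it controls.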

minor comments (2)

[Figure 3] The caption of Figure 3 should specify the exact sharpness metric plotted and the number of random seeds used for the error bars.
Notation for the Fourier support size S and context length n should be introduced consistently in the introduction rather than first appearing in the theoretical section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about uniformity in our constructions and the PAC-Bayes application. We respond to each major comment below.

Referee: [§3.2, Theorem 1] The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.

Authors: We agree that an explicit function-independent bound is necessary to ensure the idealized learner is well-defined uniformly. In the construction underlying Theorem 1, the transformer parameters are set by routing each sparse low-degree Fourier term through dedicated attention heads and MLP channels using a fixed template whose scaling depends only on context length and sparsity level; the resulting Hessian trace (or largest eigenvalue) is then bounded by a quantity polynomial in the context length and linear in the sparsity level, with no dependence on the numerical values of the Fourier coefficients. We will revise the theorem statement and proof to state this bound explicitly.

Revision: yes

Referee: [§4, Eq. (8)] The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.

Authors: The prior is a fixed, data-independent Gaussian over parameter space whose variance is chosen once to cover the entire family of constructed minima for all sparse spectra up to context length. The posterior is a Gaussian centered at the flat minimum whose covariance is inversely proportional to the (uniformly bounded) sharpness; the resulting KL term is therefore bounded by a function of context length and sparsity alone. Consequently the PAC-Bayes bound remains non-vacuous whenever the constructed model achieves zero empirical risk, which it does by design. We will add a clarifying paragraph after Equation (8) that makes the data-independence and uniform KL control explicit.

Revision: partial
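
The Gaussian prior and posterior described in this response can be made concrete with a small numerical sketch. It assumes isotropic Gaussians (a simplification; the response describes a covariance tied to sharpness) and plugs the closed-form KL divergence into a standard McAllester-style PAC-Bayes bound, which may differ from the paper's Eq. (8). All names and numbers below are illustrative.

```python
import math
import numpy as np

def kl_isotropic_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ) in closed form."""
    d = mu_q.size
    ratio = (sigma_q / sigma_p) ** 2
    dist_sq = float(np.sum((mu_q - mu_p) ** 2))
    return 0.5 * (d * ratio + dist_sq / sigma_p**2 - d - d * math.log(ratio))

def pac_bayes_bound(empirical_risk, kl, m, delta):
    """Standard PAC-Bayes bound for losses in [0, 1] (assumed form, m >= 8)."""
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(m) / delta)) / (2.0 * m))
    return empirical_risk + slack

# Hypothetical numbers: 50 parameters, posterior centered 0.5 away per coordinate.
mu_p = np.zeros(50)
mu_q = np.full(50, 0.5)
kl = kl_isotropic_gaussians(mu_q, sigma_q=0.5, mu_p=mu_p, sigma_p=1.0)
bound = pac_bayes_bound(empirical_risk=0.0, kl=kl, m=5000, delta=0.05)
print(kl, bound, "non-vacuous" if bound < 1.0 else "vacuous")
```

A wider posterior (larger `sigma_q`, which a flatter minimum can tolerate without raising empirical risk) shrinks the log-ratio term in the KL divergence, which is the mechanism the response relies on.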

Circularity Check

0 steps flagged

No circularity: existence construction and PAC-Bayes application are independent of the target bound

The paper's chain proceeds by first establishing (via construction or proof) the existence of flat minima in transformer parameter space that realize any boolean function whose Fourier support size is at most the context length, then feeding that idealized low-sharpness learner into a standard PAC-Bayes bound to obtain a non-vacuous generalization guarantee. No step reduces the claimed existence or the resulting bound to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the abstract and skeptic summary both treat the existence result as a separate, load-bearing mathematical claim rather than a tautology or data-dependent fit. The derivation is therefore self-contained against external benchmarks once the existence statement is accepted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central construction rests on one key unproven existence statement extracted from the abstract; no free parameters or invented entities are described.

axioms (1)

[ad hoc to paper] Existence of flat minima implementing any boolean function of sparsity no greater than the context length.
This existence statement is required to define the idealized low-sharpness learner to which the PAC-Bayes bound is applied.

