A Sharper Picture of Generalization in Transformers
Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3
The pith
Transformers can implement any boolean function with sparsity at most the context length using flat minima, which then yield non-vacuous PAC-Bayes generalization bounds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound.
What carries the argument
Existence of flat (low-sharpness) minima that exactly implement any sparse boolean function whose sparsity is bounded by the context length; these minima serve as the basis for applying PAC-Bayes theory to obtain non-vacuous bounds.
If this is right
- Any boolean function whose sparsity is at most the context length admits a flat-minimum realization inside the transformer parameter space.
- PAC-Bayes bounds applied to an idealized learner that selects among such low-sharpness solutions produce non-vacuous generalization guarantees.
- Empirical predictions derived from the low-sharpness construction are testable on concrete transformer training runs.
- Mechanistic interpretability analysis can reveal whether real transformers exhibit weight configurations consistent with the flat-minima construction.
Where Pith is reading between the lines
- If the flat-minima construction holds, transformers may implicitly select low-sharpness solutions when the data distribution favors sparse low-degree functions.
- Similar existence arguments could be attempted for other architectures that admit a notion of sharpness and can represent boolean functions exactly.
- Increasing context length would immediately extend the class of functions for which non-vacuous bounds are guaranteed under the same argument.
Load-bearing premise
The existence of flat minima implementing any boolean function of sparsity no greater than the context length.
What would settle it
Training or optimization runs that fail to locate any flat minimum realizing a specific sparse boolean function whose sparsity is at most the context length, or measurements showing that the resulting PAC-Bayes bound is still vacuous despite using the idealized low-sharpness learner.
Figures
read the original abstract
We study transformers' generalization behavior on boolean domains from the perspective of the Fourier Spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger and Tewari, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that sparse Fourier spectra concentrated on low-degree components enable low-sharpness constructions in transformers that implement any boolean function whose sparsity is at most the context length. The authors establish the existence of such flat minima, then apply a PAC-Bayes bound to an idealized low-sharpness learner to obtain a non-vacuous generalization bound. This approach is positioned as an alternative to prior Rademacher-complexity analyses, with supporting empirical evaluations of the theoretical predictions and a mechanistic interpretability study examining the construction's realism in trained models.
Significance. If the existence result is shown with explicit, function-independent control of sharpness and the resulting PAC-Bayes bound is rigorously non-vacuous, the work would supply a useful theoretical account of why transformers can generalize on sparse boolean tasks. It would highlight the interplay between Fourier sparsity, flat minima, and generalization, offering a concrete alternative to complexity-based bounds and potentially explaining empirical observations of flat minima. The empirical validation and interpretability analysis add practical relevance.
major comments (2)
- [§3.2, Theorem 1] §3.2, Theorem 1: The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.
- [§4, Eq. (8)] §4, Eq. (8): The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.
minor comments (2)
- [Figure 3] The caption of Figure 3 should specify the exact sharpness metric plotted and the number of random seeds used for the error bars.
- Notation for the Fourier support size S and context length n should be introduced consistently in the introduction rather than first appearing in the theoretical section.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points about uniformity in our constructions and the PAC-Bayes application. We respond to each major comment below.
read point-by-point responses
-
Referee: [§3.2, Theorem 1] The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.
Authors: We agree that an explicit function-independent bound is necessary to ensure the idealized learner is well-defined uniformly. In the construction underlying Theorem 1, the transformer parameters are set by routing each sparse low-degree Fourier term through dedicated attention heads and MLP channels using a fixed template whose scaling depends only on context length and sparsity level; the resulting Hessian trace (or largest eigenvalue) is then bounded by a quantity polynomial in the context length and linear in the sparsity level, with no dependence on the numerical values of the Fourier coefficients. We will revise the theorem statement and proof to state this bound explicitly. revision: yes
-
Referee: [§4, Eq. (8)] The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.
Authors: The prior is a fixed, data-independent Gaussian over parameter space whose variance is chosen once to cover the entire family of constructed minima for all sparse spectra up to context length. The posterior is a Gaussian centered at the flat minimum whose covariance is inversely proportional to the (uniformly bounded) sharpness; the resulting KL term is therefore bounded by a function of context length and sparsity alone. Consequently the PAC-Bayes bound remains non-vacuous whenever the constructed model achieves zero empirical risk, which it does by design. We will add a clarifying paragraph after Equation (8) that makes the data-independence and uniform KL control explicit. revision: partial
Circularity Check
No circularity: existence construction and PAC-Bayes application are independent of the target bound
full rationale
The paper's chain proceeds by first establishing (via construction or proof) the existence of flat minima in transformer parameter space that realize any boolean function whose Fourier support size is at most the context length, then feeding that idealized low-sharpness learner into a standard PAC-Bayes bound to obtain a non-vacuous generalization guarantee. No step reduces the claimed existence or the resulting bound to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the abstract and skeptic summary both treat the existence result as a separate, load-bearing mathematical claim rather than a tautology or data-dependent fit. The derivation is therefore self-contained against external benchmarks once the existence statement is accepted.
Axiom & Free-Parameter Ledger
axioms (1)
- ad hoc to paper Existence of flat minima implementing any boolean function of sparsity no greater than the context length
Reference graph
Works this paper leans on
-
[1]
Generalization on the unseen, logic reasoning and degree curriculum
Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the unseen, logic reasoning and degree curriculum. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.),International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research,...
work page 2023
-
[2]
How far can transformers reason? the globalitybarrierandinductivescratchpad
Abbé, E., Bengio, S., Lotfi, A., Sandon, C., and Saremi, O. How far can transformers reason? the globalitybarrierandinductivescratchpad. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=FoGwiFXzuN
work page 2024
-
[3]
Testing low-degree polynomials over GF(2)
Alon, N., Kaufman, T., Krivelevich, M., Litsyn, S., and Ron, D. Testing low-degree polynomials over GF(2). InApproximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 188–199. Springer, 2003
work page 2003
-
[4]
A user-friendly introduction to pac-bayes bounds
Alquier, P. A user-friendly introduction to pac-bayes bounds. 2023
work page 2023
-
[5]
Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022
work page 2022
-
[6]
On the ability and limitations of transformers to recognize formal languages
Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.),Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 7096–7116. Association for Computational Linguisti...
-
[7]
Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions
Bhattamishra, S., Patel, A., Kanade, V., and Blunsom, P. Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions. InRogers, A., Boyd-Graber, J.L., andOkazaki, N.(eds.),Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5767...
-
[8]
Cabannes, V., Arnal, C., Bouaziz, W., Yang, X. A., Charton, F., and Kempe, J. Iteration head: A mechanistic study of chain-of-thought. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=QBCxWpOt5w
work page 2024
-
[9]
Chiang, D. and Cholak, P. Overcoming a theoretical limitation of self-attention. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.),Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 7654–7664. Association for Computational Linguistics, 2022....
-
[10]
Deora, P., Ghaderi, R., Taheri, H., and Thrampoulidis, C. On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023
work page 2023
-
[11]
L., Goel, S., Kakade, S., and Zhang, C
Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. InInternational Conference on Machine Learning, pp. 5793–5831. PMLR, 2022
work page 2022
-
[12]
A., Shpilka, A., and Wimmer, K
Gopalan, P., O’Donnell, R., Servedio, R. A., Shpilka, A., and Wimmer, K. Testing fourier dimensionality and sparsity.SIAM Journal on Computing, 40(4):1075–1100, 2011. doi: 10.1137/100785429. URL https://doi.org/10.1137/100785429
-
[13]
Hahn, M. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020
work page 2020
-
[14]
Hahn, M. and Rofin, M. Why are sensitive functions hard for transformers? In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14973–15008, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-l...
-
[15]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima.CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[16]
Kim, J. and Suzuki, T. Transformers provably solve parity efficiently with chain of thought. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024. URLhttps://openreview.net/ forum?id=E7HwPhfX1B
work page 2024
-
[17]
URLhttps://doi.org/10.1214/aos/ 1015957395
Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection.The Annals of Statistics, 28(5):1302–1338, 2000. doi: 10.1214/aos/1015957395
-
[18]
T., Goel, S., Krishnamurthy, A., and Zhang, C
Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata,
- [19]
-
[20]
Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023
work page 2023
-
[21]
Exploring Generalization in Deep Learning
Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017. URLhttp://arxiv.org/abs/1706.08947
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[22]
Oikonomou, D. and Loizou, N. Sharpness-aware minimization: General analysis and improved rates. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=8rvqpiTTFv
work page 2025
-
[23]
Representational strengths and limitations of transformers
Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. CoRR, abs/2306.02896, 2023. doi: 10.48550/ARXIV.2306.02896. URLhttps://doi.org/10.48550/ arXiv.2306.02896
-
[24]
Transformers, parallel computation, and logarithmic depth
Sanford, C., Hsu, D., and Telgarsky, M. Transformers, parallel computation, and logarithmic depth. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/ forum?id=QCZabhKQhB
work page 2024
-
[25]
Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023
Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023. doi: 10.48550/ARXIV.2311.00208. URLhttps://doi.org/10.48550/arXiv.2311.00208
-
[26]
Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for trans- formers. InInternational Conference on Artificial Intelligence and Statistics, pp. 1405–1413. PMLR, 2024
work page 2024
-
[27]
Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. InInternational Conference on Machine Learning, pp. 11080–11090. PMLR, 2021
work page 2021
-
[28]
Yi, K. and Zhang, Q. Nearly optimal sparse Fourier transform in the general case. InProceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 590–600. ACM, 2019. 16 Contents 1 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Experiments . . . . . . . . . . . . . . ...
work page 2019
-
[29]
(see also 19 in the supplementary material), with probability1−δ,such an R.V. is upper bounded as ∥ϵt,:∥≤1√ d √ d+ 2 √ dlog( 1 δ) + 2log(1 δ) To upper bound the maximum ofTsuch variables, we apply a union bound: max t∈T ∥ϵt,:∥≤1 d ( d+ 2 √ dlog(T δ) + 2log(T δ) ) Note that per our definition ofd, d>8log(T) ϵ2p .Therefore, with probability at least1−δ, max...
work page 1998
-
[30]
provide a bound intended for losses that are possibly unbounded, but have a tail distribution that is sub-gaussian. Mathematically, we assume that for some positive constantΣ,our loss function satisfies: E[et[l(Θ,x,f)−E[l(Θ,x,f)]]]≤e Σ2t2 2 . Then with probability at least1−δ,the following bound holds: EΘ∼Q[L(fΘ )]≤EΘ∼Q[ˆL(fΘ )] +λΣ2 2m + KL(Q∥P) + ln1 δ ...
work page 2022
-
[31]
+ (8d + 6)(Df + 1) )) block matrix. To bound the maximum eigenvalue of the Hessian, we want to find max||v||=1⟨v,∇2 ΘT(X,Θ)v⟩,where v=concat(s,p,q,r,η,γ,ν,), =concat(s 1,...,sd+1,p 1,...,pd+1,q 1,...,qd+1,r 1,...rd+1,η1,...,ηd+1,γ1,ν1,...,ν4(Df+1)), wheres i∈R,pi,qi,ri,νi∈Rd+1,ηi,γ1∈R4(Df+1).Written out explicitly, we seek to upper bound ∇2 ΘT(X,Θ) =...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.