A Sharper Picture of Generalization in Transformers

Paul Lintilhac; Sair Shaikh

arxiv: 2605.20988 · v1 · pith:FQLV235Rnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

A Sharper Picture of Generalization in Transformers

Paul Lintilhac , Sair Shaikh This is my paper

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords transformersgeneralizationPAC-BayesFourier spectraboolean functionsflat minimasharpnesssparse spectra

0 comments

The pith

Transformers can implement any boolean function with sparsity at most the context length using flat minima, which then yield non-vacuous PAC-Bayes generalization bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates generalization in transformers on boolean domains through the lens of Fourier spectra rather than Rademacher complexity. It establishes that sparse, low-degree spectra permit the construction of low-sharpness flat minima capable of realizing the target functions. By proving the existence of such minima for any boolean function whose sparsity does not exceed the context length and then applying PAC-Bayes bounds to an idealized learner that uses them, the authors obtain generalization guarantees that remain non-vacuous. Empirical evaluations and mechanistic interpretability experiments are used to check whether real trained transformers behave in ways consistent with these constructions.

Core claim

We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound.

What carries the argument

Existence of flat (low-sharpness) minima that exactly implement any sparse boolean function whose sparsity is bounded by the context length; these minima serve as the basis for applying PAC-Bayes theory to obtain non-vacuous bounds.

If this is right

Any boolean function whose sparsity is at most the context length admits a flat-minimum realization inside the transformer parameter space.
PAC-Bayes bounds applied to an idealized learner that selects among such low-sharpness solutions produce non-vacuous generalization guarantees.
Empirical predictions derived from the low-sharpness construction are testable on concrete transformer training runs.
Mechanistic interpretability analysis can reveal whether real transformers exhibit weight configurations consistent with the flat-minima construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the flat-minima construction holds, transformers may implicitly select low-sharpness solutions when the data distribution favors sparse low-degree functions.
Similar existence arguments could be attempted for other architectures that admit a notion of sharpness and can represent boolean functions exactly.
Increasing context length would immediately extend the class of functions for which non-vacuous bounds are guaranteed under the same argument.

Load-bearing premise

The existence of flat minima implementing any boolean function of sparsity no greater than the context length.

What would settle it

Training or optimization runs that fail to locate any flat minimum realizing a specific sparse boolean function whose sparsity is at most the context length, or measurements showing that the resulting PAC-Bayes bound is still vacuous despite using the idealized low-sharpness learner.

Figures

Figures reproduced from arXiv: 2605.20988 by Paul Lintilhac, Sair Shaikh.

**Figure 2.** Figure 2: Sharpness and Frobenius norm comparisons between the exact construction and the learned [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: This plot shows the empirical 90th−percentile perturbation of the sharpness (the trace of the loss Hessian), and how it depends on the degree Df and sparsity ω of the target function it expresses, as well as the sequence length T. each line is a different magnitude of the perturbation, and naturally the size of the perturbation always increases as σ increases. While qualitatively similar to our analytic bo… view at source ↗

**Figure 5.** Figure 5: Left: A plot showing the combined attention matrix W for a function learned with an architecture [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 4.** Figure 4: This plot shows our analytic bound on the perturbation to the sharpness over the same grid of [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 6.** Figure 6: This plot shows a comparison of the bound with analytic worst-case bound on [PITH_FULL_IMAGE:figures/full_fig_p031_6.png] view at source ↗

**Figure 7.** Figure 7: A diagram showing the high-level dependency graph of our 1.5-layer transformer construction. [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗

**Figure 8.** Figure 8: Left: A plot showing the upper bound on the maximum degree of the target function obtained [PITH_FULL_IMAGE:figures/full_fig_p046_8.png] view at source ↗

read the original abstract

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier Spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger and Tewari, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims to construct flat minima in transformers for any sparse boolean function up to context length and then derives a non-vacuous PAC-Bayes bound from an idealized low-sharpness learner.

read the letter

The key point to know is that the authors construct flat minima for transformers that exactly implement any boolean function with Fourier sparsity at most the context length, then use an idealized low-sharpness learner to get a non-vacuous PAC-Bayes generalization bound. This differs from the Rademacher complexity route in the papers they cite. They support this with empirical evaluation of the predictions and a mechanistic interpretability study to argue that real transformers behave in ways consistent with the construction. What the paper does well is link spectral properties of the target to the existence of good minima and then apply PAC-Bayes in a way that aims for non-vacuous results. The interpretability work is a positive step toward making the theory more believable. The soft spots are in the central construction. The bound relies on being able to control sharpness while implementing the function. If the construction only achieves exact implementation without a separate bound on the Hessian or sharpness that stays small, then the PAC-Bayes step won't deliver what is claimed. The stress-test note is right to flag this as the pivotal unverified step. Since the abstract gives no derivations, the full paper needs to show the explicit construction clearly. The setting is limited to boolean domains and sparse low-degree functions, which keeps things tractable but means the result applies to a specific class of problems rather than general transformer behavior. This paper is for researchers interested in theoretical explanations of generalization in transformers, particularly those exploring PAC-Bayes or spectral methods. A reader working on logical reasoning tasks or flat minima could get value from the ideas if the math checks out. I would bring it to a reading group to go through the construction in detail. I would not cite it in the next year unless the proofs are solid. It deserves serious peer review because the angle is new and the empirical grounding is there, even if revisions on the main result are likely.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that sparse Fourier spectra concentrated on low-degree components enable low-sharpness constructions in transformers that implement any boolean function whose sparsity is at most the context length. The authors establish the existence of such flat minima, then apply a PAC-Bayes bound to an idealized low-sharpness learner to obtain a non-vacuous generalization bound. This approach is positioned as an alternative to prior Rademacher-complexity analyses, with supporting empirical evaluations of the theoretical predictions and a mechanistic interpretability study examining the construction's realism in trained models.

Significance. If the existence result is shown with explicit, function-independent control of sharpness and the resulting PAC-Bayes bound is rigorously non-vacuous, the work would supply a useful theoretical account of why transformers can generalize on sparse boolean tasks. It would highlight the interplay between Fourier sparsity, flat minima, and generalization, offering a concrete alternative to complexity-based bounds and potentially explaining empirical observations of flat minima. The empirical validation and interpretability analysis add practical relevance.

major comments (2)

[§3.2, Theorem 1] §3.2, Theorem 1: The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.
[§4, Eq. (8)] §4, Eq. (8): The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.

minor comments (2)

[Figure 3] The caption of Figure 3 should specify the exact sharpness metric plotted and the number of random seeds used for the error bars.
Notation for the Fourier support size S and context length n should be introduced consistently in the introduction rather than first appearing in the theoretical section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about uniformity in our constructions and the PAC-Bayes application. We respond to each major comment below.

read point-by-point responses

Referee: [§3.2, Theorem 1] The existence construction for flat minima must explicitly bound the sharpness measure (e.g., Hessian trace or largest eigenvalue) by a quantity depending only on context length and sparsity level, independent of the particular boolean function or its Fourier coefficients. If sharpness scales with the target function, the idealized low-sharpness learner cannot be defined uniformly and the subsequent PAC-Bayes bound collapses to vacuous for some sparse functions.

Authors: We agree that an explicit function-independent bound is necessary to ensure the idealized learner is well-defined uniformly. In the construction underlying Theorem 1, the transformer parameters are set by routing each sparse low-degree Fourier term through dedicated attention heads and MLP channels using a fixed template whose scaling depends only on context length and sparsity level; the resulting Hessian trace (or largest eigenvalue) is then bounded by a quantity polynomial in the context length and linear in the sparsity level, with no dependence on the numerical values of the Fourier coefficients. We will revise the theorem statement and proof to state this bound explicitly. revision: yes
Referee: [§4, Eq. (8)] The PAC-Bayes application to the idealized learner requires a prior that is independent of the data yet yields a controlled KL term; it is unclear whether the bound remains non-vacuous when the posterior is centered at the constructed flat minimum for arbitrary sparse spectra up to the context length.

Authors: The prior is a fixed, data-independent Gaussian over parameter space whose variance is chosen once to cover the entire family of constructed minima for all sparse spectra up to context length. The posterior is a Gaussian centered at the flat minimum whose covariance is inversely proportional to the (uniformly bounded) sharpness; the resulting KL term is therefore bounded by a function of context length and sparsity alone. Consequently the PAC-Bayes bound remains non-vacuous whenever the constructed model achieves zero empirical risk, which it does by design. We will add a clarifying paragraph after Equation (8) that makes the data-independence and uniform KL control explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: existence construction and PAC-Bayes application are independent of the target bound

full rationale

The paper's chain proceeds by first establishing (via construction or proof) the existence of flat minima in transformer parameter space that realize any boolean function whose Fourier support size is at most the context length, then feeding that idealized low-sharpness learner into a standard PAC-Bayes bound to obtain a non-vacuous generalization guarantee. No step reduces the claimed existence or the resulting bound to a fitted parameter, a self-citation chain, or a redefinition of the target quantity; the abstract and skeptic summary both treat the existence result as a separate, load-bearing mathematical claim rather than a tautology or data-dependent fit. The derivation is therefore self-contained against external benchmarks once the existence statement is accepted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central construction rests on one key unproven existence statement extracted from the abstract; no free parameters or invented entities are described.

axioms (1)

ad hoc to paper Existence of flat minima implementing any boolean function of sparsity no greater than the context length
This existence statement is required to define the idealized low-sharpness learner to which the PAC-Bayes bound is applied.

pith-pipeline@v0.9.0 · 5661 in / 1499 out tokens · 45560 ms · 2026-05-21T05:57:41.515603+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

[1]

Generalization on the unseen, logic reasoning and degree curriculum

Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the unseen, logic reasoning and degree curriculum. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.),International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research,...

work page 2023
[2]

How far can transformers reason? the globalitybarrierandinductivescratchpad

Abbé, E., Bengio, S., Lotfi, A., Sandon, C., and Saremi, O. How far can transformers reason? the globalitybarrierandinductivescratchpad. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=FoGwiFXzuN

work page 2024
[3]

Testing low-degree polynomials over GF(2)

Alon, N., Kaufman, T., Krivelevich, M., Litsyn, S., and Ron, D. Testing low-degree polynomials over GF(2). InApproximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 188–199. Springer, 2003

work page 2003
[4]

A user-friendly introduction to pac-bayes bounds

Alquier, P. A user-friendly introduction to pac-bayes bounds. 2023

work page 2023
[5]

Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

work page 2022
[6]

On the ability and limitations of transformers to recognize formal languages

Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.),Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 7096–7116. Association for Computational Linguisti...

work page doi:10.18653/v1/2020.emnlp-main.576 2020
[7]

Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions

Bhattamishra, S., Patel, A., Kanade, V., and Blunsom, P. Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions. InRogers, A., Boyd-Graber, J.L., andOkazaki, N.(eds.),Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5767...

work page doi:10.18653/v1/2023.acl-long.317 2023
[8]

A., Charton, F., and Kempe, J

Cabannes, V., Arnal, C., Bouaziz, W., Yang, X. A., Charton, F., and Kempe, J. Iteration head: A mechanistic study of chain-of-thought. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=QBCxWpOt5w

work page 2024
[9]

and Cholak, P

Chiang, D. and Cholak, P. Overcoming a theoretical limitation of self-attention. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.),Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 7654–7664. Association for Computational Linguistics, 2022....

work page doi:10.18653/v1/2022.acl-long.527 2022
[10]

On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

Deora, P., Ghaderi, R., Taheri, H., and Thrampoulidis, C. On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

work page 2023
[11]

L., Goel, S., Kakade, S., and Zhang, C

Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. InInternational Conference on Machine Learning, pp. 5793–5831. PMLR, 2022

work page 2022
[12]

A., Shpilka, A., and Wimmer, K

Gopalan, P., O’Donnell, R., Servedio, R. A., Shpilka, A., and Wimmer, K. Testing fourier dimensionality and sparsity.SIAM Journal on Computing, 40(4):1075–1100, 2011. doi: 10.1137/100785429. URL https://doi.org/10.1137/100785429

work page doi:10.1137/100785429 2011
[13]

Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

Hahn, M. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

work page 2020
[14]

and Rofin, M

Hahn, M. and Rofin, M. Why are sensitive functions hard for transformers? In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14973–15008, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-l...

work page doi:10.18653/v1/2024.acl-long.800 2024
[15]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima.CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836

work page internal anchor Pith review Pith/arXiv arXiv 2016
[16]

and Suzuki, T

Kim, J. and Suzuki, T. Transformers provably solve parity efficiently with chain of thought. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024. URLhttps://openreview.net/ forum?id=E7HwPhfX1B

work page 2024
[17]

URLhttps://doi.org/10.1214/aos/ 1015957395

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection.The Annals of Statistics, 28(5):1302–1338, 2000. doi: 10.1214/aos/1015957395

work page doi:10.1214/aos/1015957395 2000
[18]

T., Goel, S., Krishnamurthy, A., and Zhang, C

Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata,

work page
[19]

URLhttps://arxiv.org/abs/2210.10749

work page arXiv
[20]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

work page 2023
[21]

Exploring Generalization in Deep Learning

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017. URLhttp://arxiv.org/abs/1706.08947

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

and Loizou, N

Oikonomou, D. and Loizou, N. Sharpness-aware minimization: General analysis and improved rates. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=8rvqpiTTFv

work page 2025
[23]

Representational strengths and limitations of transformers

Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. CoRR, abs/2306.02896, 2023. doi: 10.48550/ARXIV.2306.02896. URLhttps://doi.org/10.48550/ arXiv.2306.02896

work page doi:10.48550/arxiv.2306.02896 2023
[24]

Transformers, parallel computation, and logarithmic depth

Sanford, C., Hsu, D., and Telgarsky, M. Transformers, parallel computation, and logarithmic depth. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/ forum?id=QCZabhKQhB

work page 2024
[25]

Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023

Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023. doi: 10.48550/ARXIV.2311.00208. URLhttps://doi.org/10.48550/arXiv.2311.00208

work page doi:10.48550/arxiv.2311.00208 2023
[26]

and Tewari, A

Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for trans- formers. InInternational Conference on Artificial Intelligence and Statistics, pp. 1405–1413. PMLR, 2024

work page 2024
[27]

Thinking like transformers

Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. InInternational Conference on Machine Learning, pp. 11080–11090. PMLR, 2021

work page 2021
[28]

half-layer

Yi, K. and Zhang, Q. Nearly optimal sparse Fourier transform in the general case. InProceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 590–600. ACM, 2019. 16 Contents 1 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Experiments . . . . . . . . . . . . . . ...

work page 2019
[29]

remainder perturbations

(see also 19 in the supplementary material), with probability1−δ,such an R.V. is upper bounded as ∥ϵt,:∥≤1√ d √ d+ 2 √ dlog( 1 δ) + 2log(1 δ) To upper bound the maximum ofTsuch variables, we apply a union bound: max t∈T ∥ϵt,:∥≤1 d ( d+ 2 √ dlog(T δ) + 2log(T δ) ) Note that per our definition ofd, d>8log(T) ϵ2p .Therefore, with probability at least1−δ, max...

work page 1998
[30]

intra-layer derivative

provide a bound intended for losses that are possibly unbounded, but have a tail distribution that is sub-gaussian. Mathematically, we assume that for some positive constantΣ,our loss function satisfies: E[et[l(Θ,x,f)−E[l(Θ,x,f)]]]≤e Σ2t2 2 . Then with probability at least1−δ,the following bound holds: EΘ∼Q[L(fΘ )]≤EΘ∼Q[ˆL(fΘ )] +λΣ2 2m + KL(Q∥P) + ln1 δ ...

work page 2022
[31]

remainder perturbations

+ (8d + 6)(Df + 1) )) block matrix. To bound the maximum eigenvalue of the Hessian, we want to find max||v||=1⟨v,∇2 ΘT(X,Θ)v⟩,where v=concat(s,p,q,r,η,γ,ν,), =concat(s 1,...,sd+1,p 1,...,pd+1,q 1,...,qd+1,r 1,...rd+1,η1,...,ηd+1,γ1,ν1,...,ν4(Df+1)), wheres i∈R,pi,qi,ri,νi∈Rd+1,ηi,γ1∈R4(Df+1).Written out explicitly, we seek to upper bound ∇2 ΘT(X,Θ)  =...

work page

[1] [1]

Generalization on the unseen, logic reasoning and degree curriculum

Abbe, E., Bengio, S., Lotfi, A., and Rizk, K. Generalization on the unseen, logic reasoning and degree curriculum. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.),International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research,...

work page 2023

[2] [2]

How far can transformers reason? the globalitybarrierandinductivescratchpad

Abbé, E., Bengio, S., Lotfi, A., Sandon, C., and Saremi, O. How far can transformers reason? the globalitybarrierandinductivescratchpad. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=FoGwiFXzuN

work page 2024

[3] [3]

Testing low-degree polynomials over GF(2)

Alon, N., Kaufman, T., Krivelevich, M., Litsyn, S., and Ron, D. Testing low-degree polynomials over GF(2). InApproximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 188–199. Springer, 2003

work page 2003

[4] [4]

A user-friendly introduction to pac-bayes bounds

Alquier, P. A user-friendly introduction to pac-bayes bounds. 2023

work page 2023

[5] [5]

Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

Anil, C., Wu, Y., Andreassen, A., Lewkowycz, A., Misra, V., Ramasesh, V., Slone, A., Gur-Ari, G., Dyer, E., and Neyshabur, B. Exploring length generalization in large language models.Advances in Neural Information Processing Systems, 35:38546–38556, 2022

work page 2022

[6] [6]

On the ability and limitations of transformers to recognize formal languages

Bhattamishra, S., Ahuja, K., and Goyal, N. On the ability and limitations of transformers to recognize formal languages. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.),Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pp. 7096–7116. Association for Computational Linguisti...

work page doi:10.18653/v1/2020.emnlp-main.576 2020

[7] [7]

Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions

Bhattamishra, S., Patel, A., Kanade, V., and Blunsom, P. Simplicity bias in transformers and their ability tolearnsparsebooleanfunctions. InRogers, A., Boyd-Graber, J.L., andOkazaki, N.(eds.),Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pp. 5767...

work page doi:10.18653/v1/2023.acl-long.317 2023

[8] [8]

A., Charton, F., and Kempe, J

Cabannes, V., Arnal, C., Bouaziz, W., Yang, X. A., Charton, F., and Kempe, J. Iteration head: A mechanistic study of chain-of-thought. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URLhttps://openreview.net/forum?id=QBCxWpOt5w

work page 2024

[9] [9]

and Cholak, P

Chiang, D. and Cholak, P. Overcoming a theoretical limitation of self-attention. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.),Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pp. 7654–7664. Association for Computational Linguistics, 2022....

work page doi:10.18653/v1/2022.acl-long.527 2022

[10] [10]

On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

Deora, P., Ghaderi, R., Taheri, H., and Thrampoulidis, C. On the optimization and generalization of multi-head attention.Transactions on Machine Learning Research, 2023

work page 2023

[11] [11]

L., Goel, S., Kakade, S., and Zhang, C

Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. InInternational Conference on Machine Learning, pp. 5793–5831. PMLR, 2022

work page 2022

[12] [12]

A., Shpilka, A., and Wimmer, K

Gopalan, P., O’Donnell, R., Servedio, R. A., Shpilka, A., and Wimmer, K. Testing fourier dimensionality and sparsity.SIAM Journal on Computing, 40(4):1075–1100, 2011. doi: 10.1137/100785429. URL https://doi.org/10.1137/100785429

work page doi:10.1137/100785429 2011

[13] [13]

Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

Hahn, M. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171, 2020

work page 2020

[14] [14]

and Rofin, M

Hahn, M. and Rofin, M. Why are sensitive functions hard for transformers? In Ku, L.-W., Martins, A., and Srikumar, V. (eds.),Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14973–15008, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-l...

work page doi:10.18653/v1/2024.acl-long.800 2024

[15] [15]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima.CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836

work page internal anchor Pith review Pith/arXiv arXiv 2016

[16] [16]

and Suzuki, T

Kim, J. and Suzuki, T. Transformers provably solve parity efficiently with chain of thought. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024. URLhttps://openreview.net/ forum?id=E7HwPhfX1B

work page 2024

[17] [17]

URLhttps://doi.org/10.1214/aos/ 1015957395

Laurent, B. and Massart, P. Adaptive estimation of a quadratic functional by model selection.The Annals of Statistics, 28(5):1302–1338, 2000. doi: 10.1214/aos/1015957395

work page doi:10.1214/aos/1015957395 2000

[18] [18]

T., Goel, S., Krishnamurthy, A., and Zhang, C

Liu, B., Ash, J. T., Goel, S., Krishnamurthy, A., and Zhang, C. Transformers learn shortcuts to automata,

work page

[19] [19]

URLhttps://arxiv.org/abs/2210.10749

work page arXiv

[20] [20]

and Sabharwal, A

Merrill, W. and Sabharwal, A. The parallelism tradeoff: Limitations of log-precision transformers. Transactions of the Association for Computational Linguistics, 11:531–545, 2023

work page 2023

[21] [21]

Exploring Generalization in Deep Learning

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. CoRR, abs/1706.08947, 2017. URLhttp://arxiv.org/abs/1706.08947

work page internal anchor Pith review Pith/arXiv arXiv 2017

[22] [22]

and Loizou, N

Oikonomou, D. and Loizou, N. Sharpness-aware minimization: General analysis and improved rates. In The Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview. net/forum?id=8rvqpiTTFv

work page 2025

[23] [23]

Representational strengths and limitations of transformers

Sanford, C., Hsu, D., and Telgarsky, M. Representational strengths and limitations of transformers. CoRR, abs/2306.02896, 2023. doi: 10.48550/ARXIV.2306.02896. URLhttps://doi.org/10.48550/ arXiv.2306.02896

work page doi:10.48550/arxiv.2306.02896 2023

[24] [24]

Transformers, parallel computation, and logarithmic depth

Sanford, C., Hsu, D., and Telgarsky, M. Transformers, parallel computation, and logarithmic depth. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/ forum?id=QCZabhKQhB

work page 2024

[25] [25]

Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023

Strobl, L., Merrill, W., Weiss, G., Chiang, D., and Angluin, D. Transformers as recognizers of formal languages: A survey on expressivity.CoRR, abs/2311.00208, 2023. doi: 10.48550/ARXIV.2311.00208. URLhttps://doi.org/10.48550/arXiv.2311.00208

work page doi:10.48550/arxiv.2311.00208 2023

[26] [26]

and Tewari, A

Trauger, J. and Tewari, A. Sequence length independent norm-based generalization bounds for trans- formers. InInternational Conference on Artificial Intelligence and Statistics, pp. 1405–1413. PMLR, 2024

work page 2024

[27] [27]

Thinking like transformers

Weiss, G., Goldberg, Y., and Yahav, E. Thinking like transformers. InInternational Conference on Machine Learning, pp. 11080–11090. PMLR, 2021

work page 2021

[28] [28]

half-layer

Yi, K. and Zhang, Q. Nearly optimal sparse Fourier transform in the general case. InProceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 590–600. ACM, 2019. 16 Contents 1 Theoretical Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Experiments . . . . . . . . . . . . . . ...

work page 2019

[29] [29]

remainder perturbations

(see also 19 in the supplementary material), with probability1−δ,such an R.V. is upper bounded as ∥ϵt,:∥≤1√ d √ d+ 2 √ dlog( 1 δ) + 2log(1 δ) To upper bound the maximum ofTsuch variables, we apply a union bound: max t∈T ∥ϵt,:∥≤1 d ( d+ 2 √ dlog(T δ) + 2log(T δ) ) Note that per our definition ofd, d>8log(T) ϵ2p .Therefore, with probability at least1−δ, max...

work page 1998

[30] [30]

intra-layer derivative

provide a bound intended for losses that are possibly unbounded, but have a tail distribution that is sub-gaussian. Mathematically, we assume that for some positive constantΣ,our loss function satisfies: E[et[l(Θ,x,f)−E[l(Θ,x,f)]]]≤e Σ2t2 2 . Then with probability at least1−δ,the following bound holds: EΘ∼Q[L(fΘ )]≤EΘ∼Q[ˆL(fΘ )] +λΣ2 2m + KL(Q∥P) + ln1 δ ...

work page 2022

[31] [31]

remainder perturbations

+ (8d + 6)(Df + 1) )) block matrix. To bound the maximum eigenvalue of the Hessian, we want to find max||v||=1⟨v,∇2 ΘT(X,Θ)v⟩,where v=concat(s,p,q,r,η,γ,ν,), =concat(s 1,...,sd+1,p 1,...,pd+1,q 1,...,qd+1,r 1,...rd+1,η1,...,ηd+1,γ1,ν1,...,ν4(Df+1)), wheres i∈R,pi,qi,ri,νi∈Rd+1,ηi,γ1∈R4(Df+1).Written out explicitly, we seek to upper bound ∇2 ΘT(X,Θ)  =...

work page