Recognition: 2 theorem links
· Lean Theorem · Approximation Error Upper and Lower Bounds for Hölder Class with Transformers
Pith reviewed 2026-05-11 01:55 UTC · model grok-4.3
The pith
Transformers can approximate any bounded Hölder function to accuracy ε with O(ε^{-d_0/α}) blocks, but require at least Ω(ε^{-d_0/(4α)}) blocks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove that a Transformer network composed of at most O(ε^{-d0/α}) blocks can approximate any bounded Hölder function with d0-dimensional input and smoothness α∈(0,1] under any accuracy ε>0. Leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least Ω(ε^{-d0/(4α)}) blocks to achieve the ε approximation accuracy. We extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.
What carries the argument
Standard Transformer architecture with softmax attention, ReLU activations, and residual connections, used to approximate functions in the bounded Hölder class.
Load-bearing premise
The target functions are bounded and Hölder continuous with smoothness α in (0,1], and the lower bound relies on a known upper bound for the VC-dimension of the Transformer hypothesis class.
What would settle it
For small ε, d_0, and α, construct a worst-case bounded Hölder function and check whether any Transformer with fewer than c·ε^{-d_0/(4α)} blocks can achieve approximation error below ε.
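To make the scale of such a check concrete, here is a minimal sketch (Python; ours, not the paper's) that evaluates both asymptotic block counts. The constants c_upper and c_lower are hypothetical placeholders set to 1, since the paper only states the rates up to unspecified constants.

```python
# Illustrative only: the paper's bounds are asymptotic (O / Omega) with
# unspecified constants; c_upper and c_lower below are placeholders.
def upper_bound_blocks(eps: float, d0: int, alpha: float, c_upper: float = 1.0) -> float:
    """Blocks sufficient by the claimed upper bound, O(eps^{-d0/alpha})."""
    return c_upper * eps ** (-d0 / alpha)

def lower_bound_blocks(eps: float, d0: int, alpha: float, c_lower: float = 1.0) -> float:
    """Blocks claimed necessary by the lower bound, Omega(eps^{-d0/(4*alpha)})."""
    return c_lower * eps ** (-d0 / (4 * alpha))

if __name__ == "__main__":
    for eps in (1e-1, 1e-2, 1e-3):
        lo = lower_bound_blocks(eps, d0=2, alpha=1.0)
        hi = upper_bound_blocks(eps, d0=2, alpha=1.0)
        print(f"eps={eps:g}: at least ~{lo:.0f} blocks needed, at most ~{hi:.0f} suffice")
```

The gap between the two exponents (d_0/α versus d_0/(4α)) is visible immediately: the sufficient count grows as the fourth power of the claimed necessary count.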
read the original abstract
We explore the expressive power of Transformers by establishing precise approximation error upper and lower bounds for Hölder class. Specifically, a new approximation upper bound is derived for the standard Transformer architecture equipped with Softmax operators, ReLU activation functions, and residual connections. We prove that a Transformer network composed of at most $\mathcal{O}(\varepsilon^{-{d_{0}}/{\alpha}})$ blocks can approximate any bounded Hölder function with $d_{0}$-dimensional input and smoothness $\alpha\in(0,1]$ under any accuracy $\varepsilon>0$. In the case of approximation lower bounds, leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least $\Omega(\varepsilon^{-{d_{0}}/({4\alpha})})$ blocks to achieve the $\varepsilon$ approximation accuracy. As a final step, we extend the derived results for standard Transformers to a general regression task and establish the corresponding excess risk rates demonstrating Transformers' empirical effectiveness in real-world settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives approximation upper and lower bounds for the Hölder class using standard Transformer networks (softmax attention, ReLU activations, residual connections). It proves that O(ε^{-d_0/α}) blocks suffice to ε-approximate any bounded Hölder function on d_0-dimensional inputs with smoothness α ∈ (0,1]. For the lower bound, it combines the metric entropy of the Hölder class with a VC-dimension upper bound from prior literature to claim that Ω(ε^{-d_0/(4α)}) blocks are necessary. The results are extended to a regression setting yielding corresponding excess risk rates.
Significance. If the bounds hold, this provides the first rigorous lower bound on the number of Transformer blocks needed for Hölder approximation, which is significant for understanding depth requirements in attention-based models. The explicit upper-bound construction and the regression extension strengthen the theoretical analysis and link to empirical performance.
major comments (2)
- [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.
- [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.
minor comments (3)
- [Abstract] Abstract: 'demand for at least' is grammatically awkward; replace with 'require at least'.
- [Throughout] Notation: Ensure consistent subscript formatting (d_0 vs d0) and math-mode usage for all parameters throughout the text and equations.
- [References] References: When citing the prior VC-dimension result, include the specific theorem or proposition number used to obtain the polynomial degree.
Simulated Author's Rebuttal
We are grateful to the referee for the insightful comments that help improve the clarity and rigor of our work. We address the major comments point-by-point below, indicating the planned revisions.
read point-by-point responses
-
Referee: [Lower bound section] Lower bound argument: The Ω(ε^{-d_0/(4α)}) claim is obtained by combining Hölder metric entropy (order ε^{-d_0/α}) with an external upper bound on VC-dimension of k-block Transformers that is assumed to grow at most as O(k^4). No verification is supplied that the specific architecture (softmax attention + residual connections) satisfies the hypotheses of the cited VC result; softmax is not piecewise-polynomial and residuals alter the compositional structure, so the polynomial degree and thus the 1/4 factor in the exponent may not hold.
Authors: We thank the referee for highlighting this subtlety. The cited VC-dimension result is applied to our architecture under the assumption that the overall function class remains polynomially bounded in the number of blocks; however, we agree that direct verification for softmax attention and residual connections is not provided. In the revision we will add an appendix lemma that bounds the VC dimension of the specific Transformer class by O(k^4) via a piecewise-polynomial approximation of the softmax (with error controlled by the boundedness assumption) and a direct accounting of residual additions as linear operations. If the resulting degree changes the exponent, we will state the adjusted lower bound explicitly. This addresses the concern without altering the main claim structure. revision: partial
-
Referee: [Upper bound section] Upper bound construction: While an explicit construction is claimed to achieve the O(ε^{-d_0/α}) rate, the manuscript should explicitly state the norm in which approximation holds (e.g., uniform on the compact domain) and confirm that the boundedness assumption on the target function is used without circularity in the residual-connection analysis.
Authors: We agree these clarifications strengthen the presentation. The upper bound holds in the uniform norm over the compact domain [0,1]^{d_0}, which we will state explicitly in the theorem statement and proof. The boundedness assumption (|f| ≤ M) is an a priori property of the target function and is used only to derive uniform bounds on intermediate activations via induction on the residual blocks; the induction hypothesis controls the error reduction at each step independently of ε, so no circularity arises. A short remark will be added after the construction to make this explicit. revision: yes
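To illustrate the induction the authors describe, here is a generic sketch under our own assumptions, not the paper's exact argument: if each residual block adds a sup-norm-bounded perturbation, intermediate activations stay uniformly bounded without invoking the final approximation guarantee.

```latex
% Generic residual-boundedness induction (illustrative assumptions, not the paper's proof).
% Suppose block l computes x_{l+1} = x_l + g_l(x_l) with \|g_l\|_{\infty} \le \delta_l
% on the domain. Then
\[
\|x_{l+1}\|_{\infty} \;\le\; \|x_l\|_{\infty} + \delta_l
\qquad\Longrightarrow\qquad
\|x_L\|_{\infty} \;\le\; \|x_0\|_{\infty} + \sum_{l=0}^{L-1} \delta_l ,
\]
% so uniform bounds on intermediate activations follow from the input range and the
% per-block budgets \delta_l, not from the approximation error being targeted.
```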
Circularity Check
No significant circularity; upper bound from explicit construction, lower bound from external VC citation
full rationale
The paper's upper bound is obtained by an explicit construction of a Transformer with O(ε^{-d_0/α}) blocks that approximates bounded Hölder functions (standard residual+softmax+ReLU architecture). The lower bound combines the known metric entropy of the Hölder class (order ε^{-d_0/α}) with a VC-dimension upper bound taken from prior literature, yielding the Ω(ε^{-d_0/(4α)}) rate. No equation or claim reduces a derived rate to a quantity defined by the same rate, no parameter is fitted and then renamed as a prediction, and the cited VC result is external rather than a self-citation chain. The derivation is therefore self-contained, resting on standard tools from approximation theory and statistical learning.
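For intuition, here is a heuristic sketch (our gloss, not the paper's proof) of how a metric-entropy bound combined with a polynomial VC-dimension bound produces the stated exponent; the assumption VCdim(F_k) ≤ Ck^4 for the k-block class F_k is exactly the external ingredient flagged in the referee report.

```latex
% Heuristic counting argument (assumption: \mathrm{VCdim}(\mathcal{F}_k) \le C k^{4}).
% If every f in the Hölder ball \mathcal{H}^{\alpha} is \varepsilon-approximated in
% sup norm by some element of the k-block class \mathcal{F}_k, then an
% \varepsilon-cover of \mathcal{F}_k yields a 2\varepsilon-cover of \mathcal{H}^{\alpha}:
\[
\varepsilon^{-d_0/\alpha}
  \;\lesssim\; \log N\!\big(2\varepsilon,\, \mathcal{H}^{\alpha},\, \|\cdot\|_{\infty}\big)
  \;\le\; \log N\!\big(\varepsilon,\, \mathcal{F}_k,\, \|\cdot\|_{\infty}\big)
  \;\lesssim\; \mathrm{VCdim}(\mathcal{F}_k)\,\log\tfrac{1}{\varepsilon}
  \;\lesssim\; k^{4}\,\log\tfrac{1}{\varepsilon}
\;\;\Longrightarrow\;\;
k \;\gtrsim\; \Big(\varepsilon^{-d_0/\alpha}\big/\log\tfrac{1}{\varepsilon}\Big)^{1/4}
  \;=\; \widetilde{\Omega}\big(\varepsilon^{-d_0/(4\alpha)}\big).
\]
```

The 1/4 in the exponent comes directly from the assumed k^4 growth of the VC dimension, and the log factor is presumably absorbed into the Ω(·) of the paper's statement.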
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Standard definition and properties of the Hölder class on bounded domains
- standard math: Existence of an upper bound on the VC-dimension of the Transformer hypothesis class
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
We prove that a Transformer network composed of at most O(ε^{-d0/α}) blocks can approximate any bounded Hölder function... leveraging the VC-dimension upper bound, we are the first to rigorously prove that Transformers demand for at least Ω(ε^{-d0/(4α)}) blocks
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
Theorem 5.2 (Anthony & Bartlett, 2009, Theorem 8.14) ... VCdim(M) ≤ t²ω(ω+19 log₂(9ω))
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page · Pith review · arXiv · 2023
-
[2]
Anthony, M. and Bartlett, P. L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009
work page 2009
-
[3]
Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3): 930--945, 1993
work page 1993
-
[4]
Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks
Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63): 1--17, 2019
work page 2019
-
[5]
Baum, E. B. On the capabilities of multilayer perceptrons. Journal of Complexity, 4(3): 193--215, 1988
work page 1988
-
[6]
Approximation by superpositions of a sigmoidal function
Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4): 303--314, 1989
work page 1989
-
[7]
An image is worth 16x16 words: Transformers for image recognition at scale
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=YicbFdNTTy
work page 2021
-
[8]
Inductive biases and variable creation in self-attention mechanisms
Edelman, B. L., Goel, S., Kakade, S., and Zhang, C. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp. 5793--5831. PMLR, 2022
work page 2022
-
[9]
Ghojogh, B. and Ghodsi, A. Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey. Working paper or preprint, December 2020. URL https://hal.science/hal-04637647
work page 2020
-
[10]
Gurevych, I., Kohler, M., and Şahin, G. G. On the rate of convergence of a classifier based on a transformer encoder. IEEE Transactions on Information Theory, 68(12): 8139--8155, 2022
work page 2022
-
[11]
Multilayer feedforward networks are universal approximators
Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5): 359--366, 1989
work page 1989
-
[12]
On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs)
Hu, J. Y.-C., Wu, W., Li, Z., Pi, S., Song, Z., and Liu, H. On statistical rates and provably efficient criteria of latent diffusion transformers (DiTs). In Advances in Neural Information Processing Systems, volume 37, pp. 31562--31628, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/37f6bed18b9b404f53dcaec4607c4fb7-Paper-Conference.pdf
work page 2024
-
[13]
Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency
Hu, J. Y.-C., Wang, W.-P., Gilani, A., Li, C., Song, Z., and Liu, H. Fundamental limits of prompt tuning transformers: Universality, capacity and efficiency. In The Thirteenth International Conference on Learning Representations, 2025a. URL https://openreview.net/forum?id=jDpdQPMosW
work page 2025
-
[14]
On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality
Hu, J. Y.-C., Wu, W., Lee, Y.-C., Huang, Y.-C., Chen, M., and Liu, H. On statistical rates of conditional diffusion transformers: Approximation, estimation and minimax optimality. In The Thirteenth International Conference on Learning Representations, 2025b. URL https://openreview.net/forum?id=c54apoozCS
work page 2025
-
[15]
Learning capability and storage capacity of two-hidden-layer feedforward networks
Huang, G.-B. Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Transactions on Neural Networks, 14(2): 274--281, 2003
work page 2003
-
[16]
Convergence analysis of flow matching in latent space with transformers
Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Convergence analysis of flow matching in latent space with transformers. arXiv preprint arXiv:2404.02538, 2024
-
[17]
Approximation bounds for transformer networks with application to regression
Jiao, Y., Lai, Y., Sun, D., Wang, Y., and Yan, B. Approximation bounds for transformer networks with application to regression. arXiv preprint arXiv:2504.12175, 2025
-
[18]
Jiao, Y., Lai, Y., Wang, Y., and Yan, B. Transformers can overcome the curse of dimensionality: A theoretical study from an approximation perspective. Journal of Machine Learning Research, 27(50): 1--34, 2026
work page 2026
-
[19]
Kajitsuka, T. and Sato, I. Are transformers with one layer self-attention using low-rank weight matrices universal approximators? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=nJnky5K944
work page 2024
-
[20]
Kajitsuka, T. and Sato, I. On the optimal memorization capacity of transformers. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=UGVYezlLcZ
work page 2025
-
[21]
Provable memorization capacity of transformers
Kim, J., Kim, M., and Mozafari, B. Provable memorization capacity of transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. URL https://openreview.net/forum?id=8JCg5xJCTPR
work page 2023
-
[22]
Universal regular conditional distributions via probabilistic transformers
Kratsios, A. Universal regular conditional distributions via probabilistic transformers. Constructive Approximation, 57(3): 1145--1212, 2023
work page 2023
-
[23]
Universal approximation under constraints is possible with transformers
Kratsios, A., Zamanlooy, B., Liu, T., and Dokmanic, I. Universal approximation under constraints is possible with transformers. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. URL https://openreview.net/forum?id=JGO8CvG5S9
work page 2022
-
[24]
Memorization capacity of multi-head attention in transformers
Mahdavi, S., Liao, R., and Thrampoulidis, C. Memorization capacity of multi-head attention in transformers. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. URL https://openreview.net/forum?id=MrR3rMxqqv
work page 2024
-
[25]
Nguyen, Q. and Hein, M. Optimization landscape and expressivity of deep CNNs. In International Conference on Machine Learning, pp. 3730--3739. PMLR, 2018
work page 2018
-
[26]
Provable memorization via deep neural networks using sub-linear parameters
Park, S., Lee, J., Yun, C., and Shin, J. Provable memorization via deep neural networks using sub-linear parameters. In Conference on Learning Theory, pp. 3627--3661. PMLR, 2021
work page 2021
-
[27]
Prompting a pretrained transformer can be a universal approximator
Petrov, A., Torr, P., and Bibi, A. Prompting a pretrained transformer can be a universal approximator. In Forty-First International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. URL https://openreview.net/forum?id=3mQ6ZKTSQl
work page 2024
-
[28]
Nonparametric regression using deep neural networks with ReLU activation function
Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875--1897, 2020. doi:10.1214/19-AOS1875. URL https://doi.org/10.1214/19-AOS1875
-
[29]
Optimal approximation rate of relu networks in terms of width and depth
Shen, Z., Yang, H., and Zhang, S. Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157: 101--135, 2022
work page 2022
-
[30]
Siegel, J. W. and Xu, J. Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks. Foundations of Computational Mathematics, 24(2): 481--537, 2024
work page 2024
-
[31]
Takakura, S. and Suzuki, T. Approximation and estimation ability of transformers for sequence-to-sequence functions with infinite dimensional input. In International Conference on Machine Learning, pp. 33416--33447. PMLR, 2023
work page 2023
-
[32]
Gemini: A Family of Highly Capable Multimodal Models
Team, G., Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
work page · Pith review · arXiv · 2023
-
[33]
On the optimal memorization power of ReLU neural networks
Vardi, G., Yehudai, G., and Shamir, O. On the optimal memorization power of ReLU neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=MkTPtnjeYTV
work page 2022
-
[34]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
work page 2017
-
[35]
Xu, K. and Sato, I. On expressive power of looped transformers: Theoretical analysis and enhancement via timestep encoding. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=H4BuhRezCV
work page 2025
-
[36]
On the rates of convergence for learning with convolutional neural networks
Yang, Y., Feng, H., and Zhou, D.-X. On the rates of convergence for learning with convolutional neural networks. SIAM Journal on Mathematics of Data Science, 7(4): 1755--1772, 2025
work page 2025
-
[37]
Error bounds for approximations with deep ReLU networks
Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94: 103--114, 2017. ISSN 0893-6080. doi:https://doi.org/10.1016/j.neunet.2017.07.002. URL https://www.sciencedirect.com/science/article/pii/S0893608017301545
-
[38]
Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33: 13005--13015, 2020
work page 2020
-
[39]
Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems, volume 34, pp. 28877--28888, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f1c1592588411002af340cbaedd6fc33-Paper.pdf
work page 2021
-
[40]
Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., and Kumar, S. Are transformers universal approximators of sequence-to-sequence functions? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ByxRM0Ntvr
work page 2020
-
[41]
Theory of deep convolutional neural networks: Downsampling
Zhou, D.-X. Theory of deep convolutional neural networks: Downsampling. Neural Networks, 124: 319--327, 2020
work page 2020