Overfitting has a limitation: a model-independent generalization gap bound based on R\'enyi entropy

Atsushi Suzuki; Jing Wang

arxiv: 2506.00182 · v3 · pith:ZBXQOUSOnew · submitted 2025-05-30 · 📊 stat.ML · cs.IT· cs.LG· math.IT· math.ST· stat.TH

Overfitting has a limitation: a model-independent generalization gap bound based on R\'enyi entropy

Atsushi Suzuki , Jing Wang This is my paper

Pith reviewed 2026-05-22 02:17 UTC · model grok-4.3

classification 📊 stat.ML cs.ITcs.LGmath.ITmath.STstat.TH

keywords generalization boundRényi entropymodel-independent boundhistogram-based algorithmsoverfittingno-free-lunch theoremempirical risk minimization

0 comments

The pith

Algorithms whose outputs depend only on the data histogram have a generalization gap upper-bounded solely by the Rényi entropy of the data-generating distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a model-independent upper bound on the generalization gap that applies to algorithms whose outputs are fully determined by the histogram of the training data, such as empirical risk minimization and gradient-based methods. This bound depends only on the Rényi entropy of the true data-generating distribution rather than on model size or architecture. A sympathetic reader would care because the result indicates that arbitrarily large models can still generalize well when the number of samples is sufficient relative to this entropy, and it directly attributes the harm of adding random noise to the resulting increase in Rényi entropy. The work also adapts the no-free-lunch theorem into a data-distribution-dependent form to show that data volume on the order of the Rényi entropy is necessary for successful learning.

Core claim

For any algorithm whose output is completely determined by the empirical histogram of the training data, the generalization gap admits an upper bound that is a function exclusively of the Rényi entropy of the underlying data-generating distribution. This bound is independent of the hypothesis class or model complexity, so that sufficiently many samples relative to the entropy suffice to keep the gap small even when the model is arbitrarily large. The same entropy quantity explains why performance collapses under added noise, since noise raises the entropy, and a distribution-dependent no-free-lunch result shows that data volume comparable to the entropy is required for learning to be better.

What carries the argument

The Rényi entropy of the data-generating distribution, which alone determines the upper bound on the generalization gap once the algorithm is restricted to histogram-determined outputs.

If this is right

Extremely large models can maintain small generalization gaps provided the sample size is large enough relative to the Rényi entropy.
Adding random noise to training data increases the Rényi entropy and thereby enlarges the possible generalization gap.
The amount of data needed for successful learning scales with the Rényi entropy of the data distribution.
The bound applies directly to empirical risk minimization and to gradient-based training methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Rényi entropy could serve as a practical measure of dataset complexity that predicts the data volume needed before scaling up models further improves generalization.
Estimating Rényi entropy from samples might allow generalization assessment without training or evaluating large models.
The approach opens a route to information-theoretic analyses that replace classical complexity measures such as VC dimension.

Load-bearing premise

The assumption that the algorithm's output is completely determined by the empirical histogram of the training data.

What would settle it

An explicit histogram-determined algorithm and data distribution whose measured generalization gap exceeds the Rényi-entropy-derived upper bound for that distribution.

read the original abstract

Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization gap, which is the impact of overfitting. Understanding generalization gap behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization gap applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the R\'enyi entropy of the data-generating distribution, suggesting that a small generalization gap can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degrade is attributed to the consequent increase in the data distribution's R\'enyi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the R\'enyi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to derive a model-independent upper bound on the generalization gap for algorithms whose outputs depend only on the empirical data histogram (e.g., ERM or gradient-based methods). The bound depends solely on the Rényi entropy of the true data-generating distribution rather than model complexity, and is used to explain why sufficient data volume relative to this entropy permits good generalization even for arbitrarily large models, why injecting noise degrades performance (by increasing Rényi entropy), and to obtain a data-distribution-dependent version of the no-free-lunch theorem.

Significance. If the central derivation is valid and the histogram-determinism premise is properly scoped, the result would offer a genuinely architecture-independent information-theoretic explanation for generalization in large models, directly tying the gap to a property of the data distribution alone. This would be a notable contribution to generalization theory in statistical machine learning, particularly if the bound is shown to be tight via the adapted no-free-lunch argument.

major comments (1)

[Applicability to ERM and gradient-based methods] Applicability section (referenced in abstract as covering ERM and gradient-based methods): The central claim that the bound is model-independent rests on the premise that algorithm outputs are a deterministic function of the data histogram alone. However, standard SGD training of overparameterized networks is stochastic due to random initialization, mini-batch shuffling, and learning-rate schedules; identical histograms routinely yield different converged models and different test risks. This injects an architecture- and optimizer-dependent component into the generalization gap that cannot be bounded solely by the Rényi entropy of the data distribution, undermining the model-independence assertion unless the section explicitly restricts the claim to deterministic algorithms or provides a separate argument for the stochastic case.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive and insightful feedback on our manuscript. We address the major comment point by point below, indicating the revisions we will make to improve the clarity and accuracy of our claims.

read point-by-point responses

Referee: Applicability section (referenced in abstract as covering ERM and gradient-based methods): The central claim that the bound is model-independent rests on the premise that algorithm outputs are a deterministic function of the data histogram alone. However, standard SGD training of overparameterized networks is stochastic due to random initialization, mini-batch shuffling, and learning-rate schedules; identical histograms routinely yield different converged models and different test risks. This injects an architecture- and optimizer-dependent component into the generalization gap that cannot be bounded solely by the Rényi entropy of the data distribution, undermining the model-independence assertion unless the section explicitly restricts the claim to deterministic algorithms or provides a separate argument for the stochastic case.

Authors: We thank the referee for highlighting this important scoping issue. Our central derivation assumes that the algorithm output is a deterministic function of the empirical histogram, which is satisfied exactly by ERM under a fixed tie-breaking rule. While the abstract and applicability section mention gradient-based methods because the empirical risk depends only on the histogram, we acknowledge that practical SGD introduces additional stochasticity from initialization, shuffling, and schedules that is not captured by the histogram alone and can depend on architecture and optimizer details. We agree this requires explicit clarification to preserve the model-independence claim. We will revise the applicability section and abstract to restrict the primary result to deterministic histogram-determined algorithms. We will also add a short discussion noting that the bound may extend in expectation to certain stochastic settings, though a full analysis of optimizer-induced variance lies beyond the current scope and is noted as future work. These changes will accurately scope the contribution without affecting the core theoretical result for the deterministic case. revision: yes

Circularity Check

0 steps flagged

No circularity: bound expressed via external distribution entropy under stated assumption

full rationale

The paper states an explicit modeling assumption that algorithm outputs are a deterministic function of the empirical histogram alone, then derives an upper bound on the generalization gap that depends only on the Rényi entropy of the true data-generating distribution. This entropy is an external, unfitted property of the underlying distribution rather than any quantity computed from model outputs, training loss, or fitted parameters; the derivation therefore does not reduce to its inputs by construction. No self-citations appear as load-bearing premises for the central bound, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled in via citation. The adaptation of the no-free-lunch theorem is presented as a data-distribution-dependent variant whose tightness follows directly from the entropy term, without circular redefinition. The result is therefore self-contained against external benchmarks once the histogram-determinism premise is granted.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the domain assumption that algorithm outputs are histogram-determined and on standard properties of Rényi entropy; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Algorithm outputs are determined solely by the histogram of the training data
This restricts the result to ERM and gradient-based methods and is invoked to achieve model independence.
standard math Rényi entropy is well-defined for the data-generating distribution
Standard information-theoretic definition used as the sole quantity in the bound.

pith-pipeline@v0.9.0 · 5769 in / 1426 out tokens · 53093 ms · 2026-05-22T02:17:28.954672+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 14 internal anchors

[1]

Language models are unsupervised multitask learners,

A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019
[2]

Language models are few$shot learners,

T. Brown et al., “Language models are few$shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901
[3]

GPT-4 Technical Report

J. Achiam et al., “GPT$4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

GPT-4o System Card

A. Hurst et al., “Gpt$4o system card,” arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

The Llama 3 Herd of Models

A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku.” [Online]. Available: https://www$cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page
[10]

Qwen Technical Report

J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Qwen2 Technical Report

A. Yang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Qwen2.5 Technical Report

A. Yang et al., “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

DeepSeek-V3 Technical Report

A. Liu et al., “Deepseek$v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo et al., “DeepSeek$R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,

X. Sun et al. , “Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,” arXiv preprint arXiv:2411.02265, 2024

work page arXiv 2024
[16]

PaLM$E: An Embodied Multimodal Language Model,

D. Driess et al., “PaLM$E: An Embodied Multimodal Language Model,” in International Conference on Machine Learning, 2023, pp. 8469–8488

work page 2023
[17]

Rademacher processes and bounding the risk of function learning,

V. Koltchinskii and D. Panchenko, “Rademacher processes and bounding the risk of function learning,” High dimensional probability II. Springer, pp. 443–457, 2000

work page 2000
[18]

Empirical margin distributions and bounding the generalization error of combined classifiers,

V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of statistics, vol. 30, no. 1, pp. 1–50, 2002

work page 2002
[19]

Rademacher and gaussian complexities: Risk bounds and structural results,

P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002. 10

work page 2002
[20]

Norm$based capacity control in neural networks,

B. Neyshabur, R. Tomioka, and N. Srebro, “Norm$based capacity control in neural networks,” in Proceed- ings of The 28th Conference on Learning Theory, in Proceedings of Machine Learning Research, vol. 40. PMLR, 2015, pp. 1376–1401

work page 2015
[21]

Spectrally$normalized margin bounds for neural networks,

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally$normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6240–6250

work page 2017
[22]

Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,

C. Wei and T. Ma, “Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,” in Advances in Neural Information Processing Systems 32, 2019, pp. 9603–9613

work page 2019
[23]

Size$independent sample complexity of neural networks,

N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” in Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet, Eds., in Proceedings of Machine Learning Research, vol. 75. PMLR, 2018, pp. 297–299

work page 2018
[24]

Size$independent sample complexity of neural networks,

N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” Information and Inference: A Journal of the IMA, vol. 9, no. 2, pp. 473–504, 2020

work page 2020
[25]

On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond

X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao, “On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond,” arXiv preprint arXiv:1806.05159, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Nearly$tight VC$dimension bounds for piecewise linear neural networks,

N. Harvey, C. Liaw, and A. Mehrabian, “Nearly$tight VC$dimension bounds for piecewise linear neural networks,” in Proceedings of the 2017 Conference on Learning Theory, S. Kale and O. Shamir, Eds., in Proceedings of Machine Learning Research, vol. 65. PMLR, 2017, pp. 1064–1068

work page 2017
[27]

Generalization bounds for neural networks via approximate description length,

A. Daniely and E. Granot, “Generalization bounds for neural networks via approximate description length,” in Advances in Neural Information Processing Systems 32, 2019, pp. 11700–11710

work page 2019
[28]

Stronger generalization bounds for deep nets via a compression approach,

S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets via a compression approach,” in International conference on machine learning, 2018, pp. 254–263

work page 2018
[29]

Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error,

T. Suzuki et al. , “Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error,” arXiv preprint arXiv:1808.08558, 2018

work page arXiv 2018
[30]

Compression based bound for non$compressed network: unified generalization error analysis of large compressible deep neural network,

T. Suzuki, H. Abe, and T. Nishimura, “Compression based bound for non$compressed network: unified generalization error analysis of large compressible deep neural network,” in International Conference on Learning Representations, 2020

work page 2020
[31]

Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,

W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz, “Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,” in International Conference on Learning Representations, 2019

work page 2019
[32]

PAC$Bayes compression bounds so tight that they can explain generalization,

S. Lotfi, M. Finzi, S. Kapoor, A. Potapczynski, M. Goldblum, and A. G. Wilson, “PAC$Bayes compression bounds so tight that they can explain generalization,” Advances in Neural Information Processing Systems, vol. 35, pp. 31459–31473, 2022

work page 2022
[33]

A New Look at the Statistical Model Identification,

H. Akaike, “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, vol. 19, pp. 716–723, 1974

work page 1974
[34]

Local rademacher complexities,

P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005

work page 2005
[35]

Local Rademacher complexities and oracle inequalities in risk minimization,

V. Koltchinskii, “Local Rademacher complexities and oracle inequalities in risk minimization,” The Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006

work page 2006
[36]

Fast generalization error bound of deep learning from a kernel perspective,

T. Suzuki, “Fast generalization error bound of deep learning from a kernel perspective,” in International conference on artificial intelligence and statistics, 2018, pp. 1397–1406

work page 2018
[37]

Fast generalization error bound of deep learning without scale invariance of activation functions,

Y. Terada and R. Hirose, “Fast generalization error bound of deep learning without scale invariance of activation functions,” Neural Networks, vol. 129, pp. 344–358, 2020

work page 2020
[38]

Generalization bounds of stochastic gradient descent for wide and deep neural networks,

Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” Advances in neural information processing systems, vol. 32, 2019

work page 2019
[39]

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation,

A. Jentzen and T. Welti, “Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation,” Applied Mathematics and Computation, vol. 455, p. 127907, 2023

work page 2023
[40]

Understanding deep learning requires rethinking generalization,

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017

work page 2017
[41]

Inductive biases and variable creation in self$attention mechanisms,

B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self$attention mechanisms,” in International Conference on Machine Learning, 2022, pp. 5793–5831. 11

work page 2022
[42]

On the Rate of Convergence of a Classifier Based on a Transformer Encoder,

I. Gurevych, M. Kohler, and G. G. Şahin, “On the Rate of Convergence of a Classifier Based on a Transformer Encoder,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8139–8155, 2022

work page 2022
[43]

Approximation and estimation ability of transformers for sequence$to$ sequence functions with infinite dimensional input,

S. Takakura and T. Suzuki, “Approximation and estimation ability of transformers for sequence$to$ sequence functions with infinite dimensional input,” in International Conference on Machine Learning, 2023, pp. 33416–33447

work page 2023
[44]

Transformers are minimax optimal nonparametric in-context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in$context learners,” arXiv preprint arXiv:2408.12186, 2024

work page arXiv 2024
[45]

Csiszar and J

I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. USA: Academic Press, Inc., 1982

work page 1982
[46]

PAC$Bayesian model averaging,

D. A. McAllester, “PAC$Bayesian model averaging,” in Proceedings of the twelfth annual conference on Computational learning theory, 1999, pp. 164–170

work page 1999
[47]

Dimension independent generalization error by stochastic gradient descent,

X. Chen, Q. Liu, and X. T. Tong, “Dimension independent generalization error by stochastic gradient descent,” arXiv preprint arXiv:2003.11196, 2020

work page arXiv 2003
[48]

Sgd implicitly regularizes generalization error,

D. A. Roberts, “Sgd implicitly regularizes generalization error,” arXiv preprint arXiv:2104.04874, 2021

work page arXiv 2021
[49]

Generalization error bounds via Rényi$, f$divergences and maximal leakage,

A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi$, f$divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021

work page 2021
[50]

Information$theoretic analysis of generalization capability of learning algo$ rithms,

A. Xu and M. Raginsky, “Information$theoretic analysis of generalization capability of learning algo$ rithms,” Advances in neural information processing systems, vol. 30, 2017

work page 2017
[51]

Generalization error bounds for noisy, iterative algorithms,

A. Pensia, V. Jog, and P.$L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in 2018 IEEE International Symposium on Information Theory (ISIT), 2018, pp. 546–550

work page 2018
[52]

Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,

K. E. Nikolakakis, F. Haddadpour, A. Karbasi, and D. S. Kalogerias, “Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,” arXiv preprint arXiv:2204.12446, 2022

work page arXiv 2022
[53]

A method for solving the convex programming problem with convergence rate O (1/k2),

Y. Nesterov, “A method for solving the convex programming problem with convergence rate O (1/k2),” in Dokl akad nauk Sssr, 1983, p. 543

work page 1983
[54]

A new approach to variable metric algorithms,

R. Fletcher, “A new approach to variable metric algorithms,” The computer journal , vol. 13, no. 3, pp. 317–322, 1970

work page 1970
[55]

A family of variable$metric methods derived by variational means,

D. Goldfarb, “A family of variable$metric methods derived by variational means,” Mathematics of computation, vol. 24, no. 109, pp. 23–26, 1970

work page 1970
[56]

Conditioning of quasi$Newton methods for function minimization,

D. F. Shanno, “Conditioning of quasi$Newton methods for function minimization,” Mathematics of computation, vol. 24, no. 111, pp. 647–656, 1970

work page 1970
[57]

Adam: A Method for Stochastic Optimization

D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[58]

G. K. Zipf, Human behavior and the principle of least effort: An introduction to human ecology . Cambridge, MA: Addison$Wesley Press, 1949

work page 1949
[59]

Critical behavior in physics and probabilistic formal languages,

H. W. Lin and M. E. Tegmark, “Critical behavior in physics and probabilistic formal languages,” Entropy, vol. 19, no. 7, p. 299, Jun. 2017

work page 2017
[60]

Long$range correlations between letters and sentences in texts,

W. Ebeling and A. Neiman, “Long$range correlations between letters and sentences in texts,” Physica A: Statistical Mechanics and its Applications, vol. 215, no. 3, pp. 233–241, 1995

work page 1995
[61]

Entropy and Long$Range correlations in literary english,

W. Ebeling and T. Pöschel, “Entropy and Long$Range correlations in literary english,” Europhysics Letters, vol. 26, no. 4, p. 241, 1994

work page 1994
[62]

Mutual information functions of natural language texts,

W. Li, “Mutual information functions of natural language texts,” 1989

work page 1989
[63]

Parallels in the sequential organization of birdsong and human speech,

T. Sainburg, B. Theilman, M. Thielk, and T. Q. Gentner, “Parallels in the sequential organization of birdsong and human speech,” Nature Communications, vol. 10, no. 1, p. 3636, 2019

work page 2019
[64]

Do neural nets learn statistical laws behind natural language?,

S. Takahashi and K. Tanaka$Ishii, “Do neural nets learn statistical laws behind natural language?,” PLoS ONE, vol. 12, no. 12, p. e189326, 2017

work page 2017
[65]

Evaluating computational language models with scaling properties of natural language,

S. Takahashi and K. Tanaka$Ishii, “Evaluating computational language models with scaling properties of natural language,” Computational Linguistics, vol. 45, no. 3, pp. 481–513, 2019

work page 2019
[66]

Long$range memory in literary texts: On the universal clustering of the rare words,

K. Tanaka$Ishii and A. Bunde, “Long$range memory in literary texts: On the universal clustering of the rare words,” PLoS ONE, vol. 11, no. 11, p. e164658, 2016. 12

work page 2016
[67]

Shalev$Shwartz and S

S. Shalev$Shwartz and S. Ben$David, Understanding machine learning: From theory to algorithms . Cambridge university press, 2014

work page 2014
[68]

Hallucinations are inevitable but can be made statistically negligible

A. Suzuki, Y. He, F. Tian, and Z. Wang, “Hallucinations are inevitable but can be made statistically negli$ gible. The "innate" inevitability of hallucinations cannot explain practical LLM issues,” arXiv preprint arXiv:2502.12187, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

Position: the no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning,

M. Goldblum, M. Finzi, K. Rowan, and A. G. Wilson, “Position: the no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning,” in Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024

work page 2024
[70]

Deep Learning is Not So Mysterious or Different,

A. G. Wilson, “Deep Learning is Not So Mysterious or Different,” arXiv preprint arXiv:2503.02113, 2025

work page arXiv 2025
[71]

Minimax estimation of functionals of discrete distributions,

J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015

work page 2015
[72]

Reconciling modern machine$learning practice and the classical bias–variance trade$off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine$learning practice and the classical bias–variance trade$off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019

work page 2019
[73]

Deep double descent: Where bigger models and more data hurt,

P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2021, no. 12, p. 124003, 2021

work page 2021
[74]

A remark on Stirling's formula,

H. Robbins, “A remark on Stirling's formula,” The American mathematical monthly , vol. 62, no. 1, pp. 26–29, 1955. A Limitations, discussions, and future work A.1 Rényi Entropy May Diverge As stated in Remark 16, if Rényi entropy diverges, Theorem 12 and Theorem 17 give vacuous upper bounds. However, as also stated in Remark 16, such cases are pathologic...

work page 1955
[75]

ln 𝑛− 𝑛+ 1 2ln(2𝜋) + 𝜃𝑛. (13) Thus, for (𝑛1,𝑛2,…,𝑛𝑘) ∈ ℕ𝑘 >0, (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑖=1𝑛 𝑛𝑖 +1 2 𝑖 · exp(𝐴) < ( 𝑛 𝑛1,…,𝑛𝑘 ) < (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑗=1𝑛 𝑛𝑗+1 2 𝑗 · exp(𝐵), (14) where 𝐴 = 1 12𝑛+1 − ∑𝑘 𝑗=1 1 12𝑛𝑗 , 𝐵 = 1 12𝑛 − ∑𝑘 𝑗=1 1 12𝑛𝑗+1. Lemma 30. (Multinomial Coefficient Estimation) Let 𝒏∈ ℕ𝑘 >0, and let 𝑛= ‖𝒏‖1 ≔ ∑𝑘−1 𝑗=0 𝑛𝑗. There exists 𝜃𝑛,𝒏 suc...

work page
[76]

ln 𝑛− ∑ 𝑘 𝑗=1 ( 𝑛𝑘 + 1

work page
[77]

□ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31

ln 𝑛𝑘 − 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛− ∑ 𝑘 𝑗=1 𝜃𝑛𝑘 = 𝑛( − ∑ 𝑘 𝑗=1 𝑛𝑘 𝑛 ln 𝑛𝑘 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏 = 𝑛𝐻1( 𝒏 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏, (16) where 𝜃𝑛,𝒏 ≔ 𝜃𝑛− ∑𝑘 𝑖=1𝜃𝑛𝑘 satisfies |𝜃𝑛,𝒏| ≤ 𝑘 12𝑛. □ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31. (Upper Bound on t...

work page
[78]

Using Lemma 42 with 𝜌 = 1− 𝛼 and 𝑏 = 𝑦 exp((1−𝛼)𝐻𝛼(Q)), 𝑛> ( 2[ln exp((1−𝛼)𝐻𝛼(Q)) 𝑦(1−𝛼) ] + 𝑦(1−𝛼) exp((1− 𝛼)𝐻𝛼(Q))) 1 1−𝛼 is a sufficient condition. [ln exp((1− 𝛼)𝐻𝛼(Q)) 𝑦(1− 𝛼) ] + = [(1− 𝛼)𝐻𝛼(Q)ln 1 𝑦(1− 𝛼)] + ≤ [(1− 𝛼)𝐻𝛼(Q)]+[ln 1 𝑦(1− 𝛼)] + = (1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1− 𝛼)] + (48) From this, 𝑛> ( ((( 2(1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1−𝛼)] + 𝑦(1− 𝛼) exp((1− 𝛼)𝐻𝛼(Q)) ) ...

work page
[79]

From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3

This immediately follows from the assumption 𝑛> ( 36ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 exp(𝐻𝛼(Q)) = ( 36exp((1− 𝛼)𝐻𝛼(Q))ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 (51) that (𝐴1,2 ≔) exp((1−𝛼)𝐻𝛼(Q))𝑛𝛼 ln 2𝜋 𝛿2 2𝑛 < 𝜀2 12. From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3 . Decompose 𝐴2 = 𝐴2,1 + 𝐴2,2 where 𝐴2,1 ≔ 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 , 𝐴2,2 ≔ ln 2𝜋 𝛿2 √ 𝑛 2 ln 2 𝛿3 2𝑛 It is sufficient to...

work page
[80]

This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12

Using Lemma 42 with 𝑏 = 𝜀2 9√ 2ln 2 𝛿3 ,𝜌 = 1 2, from the assumption 𝑛> ( ((( 18√ ln 2 𝛿3 𝜀2 [ln 9√ 2ln 2 𝛿3 𝜀2 ] +) ))) 2 , we can say ln 𝑛√𝑛 < 𝜀2 9√ 2ln 2 𝛿3 . This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12. Let’s show 𝐴2,2 < 𝜀2

work page

Showing first 80 references.

[1] [1]

Language models are unsupervised multitask learners,

A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019

[2] [2]

Language models are few$shot learners,

T. Brown et al., “Language models are few$shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020

work page 1901

[3] [3]

GPT-4 Technical Report

J. Achiam et al., “GPT$4 technical report,” arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

GPT-4o System Card

A. Hurst et al., “Gpt$4o system card,” arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

The Llama 3 Herd of Models

A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku.” [Online]. Available: https://www$cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page

[10] [10]

Qwen Technical Report

J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

Qwen2 Technical Report

A. Yang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Qwen2.5 Technical Report

A. Yang et al., “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

DeepSeek-V3 Technical Report

A. Liu et al., “Deepseek$v3 technical report,” arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo et al., “DeepSeek$R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,

X. Sun et al. , “Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,” arXiv preprint arXiv:2411.02265, 2024

work page arXiv 2024

[16] [16]

PaLM$E: An Embodied Multimodal Language Model,

D. Driess et al., “PaLM$E: An Embodied Multimodal Language Model,” in International Conference on Machine Learning, 2023, pp. 8469–8488

work page 2023

[17] [17]

Rademacher processes and bounding the risk of function learning,

V. Koltchinskii and D. Panchenko, “Rademacher processes and bounding the risk of function learning,” High dimensional probability II. Springer, pp. 443–457, 2000

work page 2000

[18] [18]

Empirical margin distributions and bounding the generalization error of combined classifiers,

V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of statistics, vol. 30, no. 1, pp. 1–50, 2002

work page 2002

[19] [19]

Rademacher and gaussian complexities: Risk bounds and structural results,

P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002. 10

work page 2002

[20] [20]

Norm$based capacity control in neural networks,

B. Neyshabur, R. Tomioka, and N. Srebro, “Norm$based capacity control in neural networks,” in Proceed- ings of The 28th Conference on Learning Theory, in Proceedings of Machine Learning Research, vol. 40. PMLR, 2015, pp. 1376–1401

work page 2015

[21] [21]

Spectrally$normalized margin bounds for neural networks,

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally$normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6240–6250

work page 2017

[22] [22]

Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,

C. Wei and T. Ma, “Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,” in Advances in Neural Information Processing Systems 32, 2019, pp. 9603–9613

work page 2019

[23] [23]

Size$independent sample complexity of neural networks,

N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” in Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet, Eds., in Proceedings of Machine Learning Research, vol. 75. PMLR, 2018, pp. 297–299

work page 2018

[24] [24]

Size$independent sample complexity of neural networks,

N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” Information and Inference: A Journal of the IMA, vol. 9, no. 2, pp. 473–504, 2020

work page 2020

[25] [25]

On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond

X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao, “On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond,” arXiv preprint arXiv:1806.05159, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[26] [26]

Nearly$tight VC$dimension bounds for piecewise linear neural networks,

N. Harvey, C. Liaw, and A. Mehrabian, “Nearly$tight VC$dimension bounds for piecewise linear neural networks,” in Proceedings of the 2017 Conference on Learning Theory, S. Kale and O. Shamir, Eds., in Proceedings of Machine Learning Research, vol. 65. PMLR, 2017, pp. 1064–1068

work page 2017

[27] [27]

Generalization bounds for neural networks via approximate description length,

A. Daniely and E. Granot, “Generalization bounds for neural networks via approximate description length,” in Advances in Neural Information Processing Systems 32, 2019, pp. 11700–11710

work page 2019

[28] [28]

Stronger generalization bounds for deep nets via a compression approach,

S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets via a compression approach,” in International conference on machine learning, 2018, pp. 254–263

work page 2018

[29] [29]

Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error,

T. Suzuki et al. , “Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error,” arXiv preprint arXiv:1808.08558, 2018

work page arXiv 2018

[30] [30]

Compression based bound for non$compressed network: unified generalization error analysis of large compressible deep neural network,

T. Suzuki, H. Abe, and T. Nishimura, “Compression based bound for non$compressed network: unified generalization error analysis of large compressible deep neural network,” in International Conference on Learning Representations, 2020

work page 2020

[31] [31]

Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,

W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz, “Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,” in International Conference on Learning Representations, 2019

work page 2019

[32] [32]

PAC$Bayes compression bounds so tight that they can explain generalization,

S. Lotfi, M. Finzi, S. Kapoor, A. Potapczynski, M. Goldblum, and A. G. Wilson, “PAC$Bayes compression bounds so tight that they can explain generalization,” Advances in Neural Information Processing Systems, vol. 35, pp. 31459–31473, 2022

work page 2022

[33] [33]

A New Look at the Statistical Model Identification,

H. Akaike, “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, vol. 19, pp. 716–723, 1974

work page 1974

[34] [34]

Local rademacher complexities,

P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005

work page 2005

[35] [35]

Local Rademacher complexities and oracle inequalities in risk minimization,

V. Koltchinskii, “Local Rademacher complexities and oracle inequalities in risk minimization,” The Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006

work page 2006

[36] [36]

Fast generalization error bound of deep learning from a kernel perspective,

T. Suzuki, “Fast generalization error bound of deep learning from a kernel perspective,” in International conference on artificial intelligence and statistics, 2018, pp. 1397–1406

work page 2018

[37] [37]

Fast generalization error bound of deep learning without scale invariance of activation functions,

Y. Terada and R. Hirose, “Fast generalization error bound of deep learning without scale invariance of activation functions,” Neural Networks, vol. 129, pp. 344–358, 2020

work page 2020

[38] [38]

Generalization bounds of stochastic gradient descent for wide and deep neural networks,

Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” Advances in neural information processing systems, vol. 32, 2019

work page 2019

[39] [39]

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation,

A. Jentzen and T. Welti, “Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation,” Applied Mathematics and Computation, vol. 455, p. 127907, 2023

work page 2023

[40] [40]

Understanding deep learning requires rethinking generalization,

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017

work page 2017

[41] [41]

Inductive biases and variable creation in self$attention mechanisms,

B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self$attention mechanisms,” in International Conference on Machine Learning, 2022, pp. 5793–5831. 11

work page 2022

[42] [42]

On the Rate of Convergence of a Classifier Based on a Transformer Encoder,

I. Gurevych, M. Kohler, and G. G. Şahin, “On the Rate of Convergence of a Classifier Based on a Transformer Encoder,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8139–8155, 2022

work page 2022

[43] [43]

Approximation and estimation ability of transformers for sequence$to$ sequence functions with infinite dimensional input,

S. Takakura and T. Suzuki, “Approximation and estimation ability of transformers for sequence$to$ sequence functions with infinite dimensional input,” in International Conference on Machine Learning, 2023, pp. 33416–33447

work page 2023

[44] [44]

Transformers are minimax optimal nonparametric in-context learners,

J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in$context learners,” arXiv preprint arXiv:2408.12186, 2024

work page arXiv 2024

[45] [45]

Csiszar and J

I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. USA: Academic Press, Inc., 1982

work page 1982

[46] [46]

PAC$Bayesian model averaging,

D. A. McAllester, “PAC$Bayesian model averaging,” in Proceedings of the twelfth annual conference on Computational learning theory, 1999, pp. 164–170

work page 1999

[47] [47]

Dimension independent generalization error by stochastic gradient descent,

X. Chen, Q. Liu, and X. T. Tong, “Dimension independent generalization error by stochastic gradient descent,” arXiv preprint arXiv:2003.11196, 2020

work page arXiv 2003

[48] [48]

Sgd implicitly regularizes generalization error,

D. A. Roberts, “Sgd implicitly regularizes generalization error,” arXiv preprint arXiv:2104.04874, 2021

work page arXiv 2021

[49] [49]

Generalization error bounds via Rényi$, f$divergences and maximal leakage,

A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi$, f$divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021

work page 2021

[50] [50]

Information$theoretic analysis of generalization capability of learning algo$ rithms,

A. Xu and M. Raginsky, “Information$theoretic analysis of generalization capability of learning algo$ rithms,” Advances in neural information processing systems, vol. 30, 2017

work page 2017

[51] [51]

Generalization error bounds for noisy, iterative algorithms,

A. Pensia, V. Jog, and P.$L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in 2018 IEEE International Symposium on Information Theory (ISIT), 2018, pp. 546–550

work page 2018

[52] [52]

Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,

K. E. Nikolakakis, F. Haddadpour, A. Karbasi, and D. S. Kalogerias, “Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,” arXiv preprint arXiv:2204.12446, 2022

work page arXiv 2022

[53] [53]

A method for solving the convex programming problem with convergence rate O (1/k2),

Y. Nesterov, “A method for solving the convex programming problem with convergence rate O (1/k2),” in Dokl akad nauk Sssr, 1983, p. 543

work page 1983

[54] [54]

A new approach to variable metric algorithms,

R. Fletcher, “A new approach to variable metric algorithms,” The computer journal , vol. 13, no. 3, pp. 317–322, 1970

work page 1970

[55] [55]

A family of variable$metric methods derived by variational means,

D. Goldfarb, “A family of variable$metric methods derived by variational means,” Mathematics of computation, vol. 24, no. 109, pp. 23–26, 1970

work page 1970

[56] [56]

Conditioning of quasi$Newton methods for function minimization,

D. F. Shanno, “Conditioning of quasi$Newton methods for function minimization,” Mathematics of computation, vol. 24, no. 111, pp. 647–656, 1970

work page 1970

[57] [57]

Adam: A Method for Stochastic Optimization

D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[58] [58]

G. K. Zipf, Human behavior and the principle of least effort: An introduction to human ecology . Cambridge, MA: Addison$Wesley Press, 1949

work page 1949

[59] [59]

Critical behavior in physics and probabilistic formal languages,

H. W. Lin and M. E. Tegmark, “Critical behavior in physics and probabilistic formal languages,” Entropy, vol. 19, no. 7, p. 299, Jun. 2017

work page 2017

[60] [60]

Long$range correlations between letters and sentences in texts,

W. Ebeling and A. Neiman, “Long$range correlations between letters and sentences in texts,” Physica A: Statistical Mechanics and its Applications, vol. 215, no. 3, pp. 233–241, 1995

work page 1995

[61] [61]

Entropy and Long$Range correlations in literary english,

W. Ebeling and T. Pöschel, “Entropy and Long$Range correlations in literary english,” Europhysics Letters, vol. 26, no. 4, p. 241, 1994

work page 1994

[62] [62]

Mutual information functions of natural language texts,

W. Li, “Mutual information functions of natural language texts,” 1989

work page 1989

[63] [63]

Parallels in the sequential organization of birdsong and human speech,

T. Sainburg, B. Theilman, M. Thielk, and T. Q. Gentner, “Parallels in the sequential organization of birdsong and human speech,” Nature Communications, vol. 10, no. 1, p. 3636, 2019

work page 2019

[64] [64]

Do neural nets learn statistical laws behind natural language?,

S. Takahashi and K. Tanaka$Ishii, “Do neural nets learn statistical laws behind natural language?,” PLoS ONE, vol. 12, no. 12, p. e189326, 2017

work page 2017

[65] [65]

Evaluating computational language models with scaling properties of natural language,

S. Takahashi and K. Tanaka$Ishii, “Evaluating computational language models with scaling properties of natural language,” Computational Linguistics, vol. 45, no. 3, pp. 481–513, 2019

work page 2019

[66] [66]

Long$range memory in literary texts: On the universal clustering of the rare words,

K. Tanaka$Ishii and A. Bunde, “Long$range memory in literary texts: On the universal clustering of the rare words,” PLoS ONE, vol. 11, no. 11, p. e164658, 2016. 12

work page 2016

[67] [67]

Shalev$Shwartz and S

S. Shalev$Shwartz and S. Ben$David, Understanding machine learning: From theory to algorithms . Cambridge university press, 2014

work page 2014

[68] [68]

Hallucinations are inevitable but can be made statistically negligible

A. Suzuki, Y. He, F. Tian, and Z. Wang, “Hallucinations are inevitable but can be made statistically negli$ gible. The "innate" inevitability of hallucinations cannot explain practical LLM issues,” arXiv preprint arXiv:2502.12187, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[69] [69]

Position: the no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning,

M. Goldblum, M. Finzi, K. Rowan, and A. G. Wilson, “Position: the no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning,” in Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024

work page 2024

[70] [70]

Deep Learning is Not So Mysterious or Different,

A. G. Wilson, “Deep Learning is Not So Mysterious or Different,” arXiv preprint arXiv:2503.02113, 2025

work page arXiv 2025

[71] [71]

Minimax estimation of functionals of discrete distributions,

J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015

work page 2015

[72] [72]

Reconciling modern machine$learning practice and the classical bias–variance trade$off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine$learning practice and the classical bias–variance trade$off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019

work page 2019

[73] [73]

Deep double descent: Where bigger models and more data hurt,

P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2021, no. 12, p. 124003, 2021

work page 2021

[74] [74]

A remark on Stirling's formula,

H. Robbins, “A remark on Stirling's formula,” The American mathematical monthly , vol. 62, no. 1, pp. 26–29, 1955. A Limitations, discussions, and future work A.1 Rényi Entropy May Diverge As stated in Remark 16, if Rényi entropy diverges, Theorem 12 and Theorem 17 give vacuous upper bounds. However, as also stated in Remark 16, such cases are pathologic...

work page 1955

[75] [75]

ln 𝑛− 𝑛+ 1 2ln(2𝜋) + 𝜃𝑛. (13) Thus, for (𝑛1,𝑛2,…,𝑛𝑘) ∈ ℕ𝑘 >0, (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑖=1𝑛 𝑛𝑖 +1 2 𝑖 · exp(𝐴) < ( 𝑛 𝑛1,…,𝑛𝑘 ) < (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑗=1𝑛 𝑛𝑗+1 2 𝑗 · exp(𝐵), (14) where 𝐴 = 1 12𝑛+1 − ∑𝑘 𝑗=1 1 12𝑛𝑗 , 𝐵 = 1 12𝑛 − ∑𝑘 𝑗=1 1 12𝑛𝑗+1. Lemma 30. (Multinomial Coefficient Estimation) Let 𝒏∈ ℕ𝑘 >0, and let 𝑛= ‖𝒏‖1 ≔ ∑𝑘−1 𝑗=0 𝑛𝑗. There exists 𝜃𝑛,𝒏 suc...

work page

[76] [76]

ln 𝑛− ∑ 𝑘 𝑗=1 ( 𝑛𝑘 + 1

work page

[77] [77]

□ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31

ln 𝑛𝑘 − 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛− ∑ 𝑘 𝑗=1 𝜃𝑛𝑘 = 𝑛( − ∑ 𝑘 𝑗=1 𝑛𝑘 𝑛 ln 𝑛𝑘 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏 = 𝑛𝐻1( 𝒏 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏, (16) where 𝜃𝑛,𝒏 ≔ 𝜃𝑛− ∑𝑘 𝑖=1𝜃𝑛𝑘 satisfies |𝜃𝑛,𝒏| ≤ 𝑘 12𝑛. □ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31. (Upper Bound on t...

work page

[78] [78]

Using Lemma 42 with 𝜌 = 1− 𝛼 and 𝑏 = 𝑦 exp((1−𝛼)𝐻𝛼(Q)), 𝑛> ( 2[ln exp((1−𝛼)𝐻𝛼(Q)) 𝑦(1−𝛼) ] + 𝑦(1−𝛼) exp((1− 𝛼)𝐻𝛼(Q))) 1 1−𝛼 is a sufficient condition. [ln exp((1− 𝛼)𝐻𝛼(Q)) 𝑦(1− 𝛼) ] + = [(1− 𝛼)𝐻𝛼(Q)ln 1 𝑦(1− 𝛼)] + ≤ [(1− 𝛼)𝐻𝛼(Q)]+[ln 1 𝑦(1− 𝛼)] + = (1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1− 𝛼)] + (48) From this, 𝑛> ( ((( 2(1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1−𝛼)] + 𝑦(1− 𝛼) exp((1− 𝛼)𝐻𝛼(Q)) ) ...

work page

[79] [79]

From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3

This immediately follows from the assumption 𝑛> ( 36ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 exp(𝐻𝛼(Q)) = ( 36exp((1− 𝛼)𝐻𝛼(Q))ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 (51) that (𝐴1,2 ≔) exp((1−𝛼)𝐻𝛼(Q))𝑛𝛼 ln 2𝜋 𝛿2 2𝑛 < 𝜀2 12. From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3 . Decompose 𝐴2 = 𝐴2,1 + 𝐴2,2 where 𝐴2,1 ≔ 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 , 𝐴2,2 ≔ ln 2𝜋 𝛿2 √ 𝑛 2 ln 2 𝛿3 2𝑛 It is sufficient to...

work page

[80] [80]

This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12

Using Lemma 42 with 𝑏 = 𝜀2 9√ 2ln 2 𝛿3 ,𝜌 = 1 2, from the assumption 𝑛> ( ((( 18√ ln 2 𝛿3 𝜀2 [ln 9√ 2ln 2 𝛿3 𝜀2 ] +) ))) 2 , we can say ln 𝑛√𝑛 < 𝜀2 9√ 2ln 2 𝛿3 . This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12. Let’s show 𝐴2,2 < 𝜀2

work page