Overfitting has a limitation: a model-independent generalization gap bound based on R\'enyi entropy
Pith reviewed 2026-05-22 02:17 UTC · model grok-4.3
The pith
Algorithms whose outputs depend only on the data histogram have a generalization gap upper-bounded solely by the Rényi entropy of the data-generating distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any algorithm whose output is completely determined by the empirical histogram of the training data, the generalization gap admits an upper bound that is a function exclusively of the Rényi entropy of the underlying data-generating distribution. This bound is independent of the hypothesis class or model complexity, so that sufficiently many samples relative to the entropy suffice to keep the gap small even when the model is arbitrarily large. The same entropy quantity explains why performance collapses under added noise, since noise raises the entropy, and a distribution-dependent no-free-lunch result shows that data volume comparable to the entropy is required for learning to be better.
What carries the argument
The Rényi entropy of the data-generating distribution, which alone determines the upper bound on the generalization gap once the algorithm is restricted to histogram-determined outputs.
If this is right
- Extremely large models can maintain small generalization gaps provided the sample size is large enough relative to the Rényi entropy.
- Adding random noise to training data increases the Rényi entropy and thereby enlarges the possible generalization gap.
- The amount of data needed for successful learning scales with the Rényi entropy of the data distribution.
- The bound applies directly to empirical risk minimization and to gradient-based training methods.
Where Pith is reading between the lines
- Rényi entropy could serve as a practical measure of dataset complexity that predicts the data volume needed before scaling up models further improves generalization.
- Estimating Rényi entropy from samples might allow generalization assessment without training or evaluating large models.
- The approach opens a route to information-theoretic analyses that replace classical complexity measures such as VC dimension.
Load-bearing premise
The assumption that the algorithm's output is completely determined by the empirical histogram of the training data.
What would settle it
An explicit histogram-determined algorithm and data distribution whose measured generalization gap exceeds the Rényi-entropy-derived upper bound for that distribution.
read the original abstract
Will further scaling up of machine learning models continue to bring success? A significant challenge in answering this question lies in understanding generalization gap, which is the impact of overfitting. Understanding generalization gap behavior of increasingly large-scale machine learning models remains a significant area of investigation, as conventional analyses often link error bounds to model complexity, failing to fully explain the success of extremely large architectures. This research introduces a novel perspective by establishing a model-independent upper bound for generalization gap applicable to algorithms whose outputs are determined solely by the data's histogram, such as empirical risk minimization or gradient-based methods. Crucially, this bound is shown to depend only on the R\'enyi entropy of the data-generating distribution, suggesting that a small generalization gap can be maintained even with arbitrarily large models, provided the data quantity is sufficient relative to this entropy. This framework offers a direct explanation for the phenomenon where generalization performance degrades significantly upon injecting random noise into data, where the performance degrade is attributed to the consequent increase in the data distribution's R\'enyi entropy. Furthermore, we adapt the no-free-lunch theorem to be data-distribution-dependent, demonstrating that an amount of data corresponding to the R\'enyi entropy is indeed essential for successful learning, thereby highlighting the tightness of our proposed generalization bound.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to derive a model-independent upper bound on the generalization gap for algorithms whose outputs depend only on the empirical data histogram (e.g., ERM or gradient-based methods). The bound depends solely on the Rényi entropy of the true data-generating distribution rather than model complexity, and is used to explain why sufficient data volume relative to this entropy permits good generalization even for arbitrarily large models, why injecting noise degrades performance (by increasing Rényi entropy), and to obtain a data-distribution-dependent version of the no-free-lunch theorem.
Significance. If the central derivation is valid and the histogram-determinism premise is properly scoped, the result would offer a genuinely architecture-independent information-theoretic explanation for generalization in large models, directly tying the gap to a property of the data distribution alone. This would be a notable contribution to generalization theory in statistical machine learning, particularly if the bound is shown to be tight via the adapted no-free-lunch argument.
major comments (1)
- [Applicability to ERM and gradient-based methods] Applicability section (referenced in abstract as covering ERM and gradient-based methods): The central claim that the bound is model-independent rests on the premise that algorithm outputs are a deterministic function of the data histogram alone. However, standard SGD training of overparameterized networks is stochastic due to random initialization, mini-batch shuffling, and learning-rate schedules; identical histograms routinely yield different converged models and different test risks. This injects an architecture- and optimizer-dependent component into the generalization gap that cannot be bounded solely by the Rényi entropy of the data distribution, undermining the model-independence assertion unless the section explicitly restricts the claim to deterministic algorithms or provides a separate argument for the stochastic case.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful feedback on our manuscript. We address the major comment point by point below, indicating the revisions we will make to improve the clarity and accuracy of our claims.
read point-by-point responses
-
Referee: Applicability section (referenced in abstract as covering ERM and gradient-based methods): The central claim that the bound is model-independent rests on the premise that algorithm outputs are a deterministic function of the data histogram alone. However, standard SGD training of overparameterized networks is stochastic due to random initialization, mini-batch shuffling, and learning-rate schedules; identical histograms routinely yield different converged models and different test risks. This injects an architecture- and optimizer-dependent component into the generalization gap that cannot be bounded solely by the Rényi entropy of the data distribution, undermining the model-independence assertion unless the section explicitly restricts the claim to deterministic algorithms or provides a separate argument for the stochastic case.
Authors: We thank the referee for highlighting this important scoping issue. Our central derivation assumes that the algorithm output is a deterministic function of the empirical histogram, which is satisfied exactly by ERM under a fixed tie-breaking rule. While the abstract and applicability section mention gradient-based methods because the empirical risk depends only on the histogram, we acknowledge that practical SGD introduces additional stochasticity from initialization, shuffling, and schedules that is not captured by the histogram alone and can depend on architecture and optimizer details. We agree this requires explicit clarification to preserve the model-independence claim. We will revise the applicability section and abstract to restrict the primary result to deterministic histogram-determined algorithms. We will also add a short discussion noting that the bound may extend in expectation to certain stochastic settings, though a full analysis of optimizer-induced variance lies beyond the current scope and is noted as future work. These changes will accurately scope the contribution without affecting the core theoretical result for the deterministic case. revision: yes
Circularity Check
No circularity: bound expressed via external distribution entropy under stated assumption
full rationale
The paper states an explicit modeling assumption that algorithm outputs are a deterministic function of the empirical histogram alone, then derives an upper bound on the generalization gap that depends only on the Rényi entropy of the true data-generating distribution. This entropy is an external, unfitted property of the underlying distribution rather than any quantity computed from model outputs, training loss, or fitted parameters; the derivation therefore does not reduce to its inputs by construction. No self-citations appear as load-bearing premises for the central bound, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled in via citation. The adaptation of the no-free-lunch theorem is presented as a data-distribution-dependent variant whose tightness follows directly from the entropy term, without circular redefinition. The result is therefore self-contained against external benchmarks once the histogram-determinism premise is granted.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Algorithm outputs are determined solely by the histogram of the training data
- standard math Rényi entropy is well-defined for the data-generating distribution
Reference graph
Works this paper leans on
-
[1]
Language models are unsupervised multitask learners,
A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
work page 2019
-
[2]
Language models are few$shot learners,
T. Brown et al., “Language models are few$shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020
work page 1901
-
[3]
J. Achiam et al., “GPT$4 technical report,” arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
A. Hurst et al., “Gpt$4o system card,” arXiv preprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team et al. , “Gemini: a family of highly capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Gemini Team et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron et al. , “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
A. Dubey et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
The Claude 3 Model Family: Opus, Sonnet, Haiku
Anthropic, “The Claude 3 Model Family: Opus, Sonnet, Haiku.” [Online]. Available: https://www$cdn. anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
-
[10]
J. Bai et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
A. Yang et al., “Qwen2 Technical Report,” arXiv preprint arXiv:2407.10671, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
A. Yang et al., “Qwen2.5 Technical Report,” arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
A. Liu et al., “Deepseek$v3 technical report,” arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo et al., “DeepSeek$R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,
X. Sun et al. , “Hunyuan$large: An open$source moe model with 52 billion activated parameters by tencent,” arXiv preprint arXiv:2411.02265, 2024
-
[16]
PaLM$E: An Embodied Multimodal Language Model,
D. Driess et al., “PaLM$E: An Embodied Multimodal Language Model,” in International Conference on Machine Learning, 2023, pp. 8469–8488
work page 2023
-
[17]
Rademacher processes and bounding the risk of function learning,
V. Koltchinskii and D. Panchenko, “Rademacher processes and bounding the risk of function learning,” High dimensional probability II. Springer, pp. 443–457, 2000
work page 2000
-
[18]
Empirical margin distributions and bounding the generalization error of combined classifiers,
V. Koltchinskii and D. Panchenko, “Empirical margin distributions and bounding the generalization error of combined classifiers,” Annals of statistics, vol. 30, no. 1, pp. 1–50, 2002
work page 2002
-
[19]
Rademacher and gaussian complexities: Risk bounds and structural results,
P. L. Bartlett and S. Mendelson, “Rademacher and gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002. 10
work page 2002
-
[20]
Norm$based capacity control in neural networks,
B. Neyshabur, R. Tomioka, and N. Srebro, “Norm$based capacity control in neural networks,” in Proceed- ings of The 28th Conference on Learning Theory, in Proceedings of Machine Learning Research, vol. 40. PMLR, 2015, pp. 1376–1401
work page 2015
-
[21]
Spectrally$normalized margin bounds for neural networks,
P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally$normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6240–6250
work page 2017
-
[22]
Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,
C. Wei and T. Ma, “Data$dependent sample complexity of deep neural networks via Lipschitz augmen$ tation,” in Advances in Neural Information Processing Systems 32, 2019, pp. 9603–9613
work page 2019
-
[23]
Size$independent sample complexity of neural networks,
N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” in Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet, Eds., in Proceedings of Machine Learning Research, vol. 75. PMLR, 2018, pp. 297–299
work page 2018
-
[24]
Size$independent sample complexity of neural networks,
N. Golowich, A. Rakhlin, and O. Shamir, “Size$independent sample complexity of neural networks,” Information and Inference: A Journal of the IMA, vol. 9, no. 2, pp. 473–504, 2020
work page 2020
-
[25]
On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond
X. Li, J. Lu, Z. Wang, J. Haupt, and T. Zhao, “On tighter generalization bound for deep neural networks: CNNs, ResNets, and beyond,” arXiv preprint arXiv:1806.05159, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[26]
Nearly$tight VC$dimension bounds for piecewise linear neural networks,
N. Harvey, C. Liaw, and A. Mehrabian, “Nearly$tight VC$dimension bounds for piecewise linear neural networks,” in Proceedings of the 2017 Conference on Learning Theory, S. Kale and O. Shamir, Eds., in Proceedings of Machine Learning Research, vol. 65. PMLR, 2017, pp. 1064–1068
work page 2017
-
[27]
Generalization bounds for neural networks via approximate description length,
A. Daniely and E. Granot, “Generalization bounds for neural networks via approximate description length,” in Advances in Neural Information Processing Systems 32, 2019, pp. 11700–11710
work page 2019
-
[28]
Stronger generalization bounds for deep nets via a compression approach,
S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets via a compression approach,” in International conference on machine learning, 2018, pp. 254–263
work page 2018
-
[29]
T. Suzuki et al. , “Spectral pruning: Compressing deep neural networks via spectral analysis and its generalization error,” arXiv preprint arXiv:1808.08558, 2018
-
[30]
T. Suzuki, H. Abe, and T. Nishimura, “Compression based bound for non$compressed network: unified generalization error analysis of large compressible deep neural network,” in International Conference on Learning Representations, 2020
work page 2020
-
[31]
Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,
W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz, “Non$vacuous Generalization Bounds at the ImageNet Scale: a PAC$Bayesian Compression Approach,” in International Conference on Learning Representations, 2019
work page 2019
-
[32]
PAC$Bayes compression bounds so tight that they can explain generalization,
S. Lotfi, M. Finzi, S. Kapoor, A. Potapczynski, M. Goldblum, and A. G. Wilson, “PAC$Bayes compression bounds so tight that they can explain generalization,” Advances in Neural Information Processing Systems, vol. 35, pp. 31459–31473, 2022
work page 2022
-
[33]
A New Look at the Statistical Model Identification,
H. Akaike, “A New Look at the Statistical Model Identification,” IEEE Transactions on Automatic Control, vol. 19, pp. 716–723, 1974
work page 1974
-
[34]
Local rademacher complexities,
P. L. Bartlett, O. Bousquet, and S. Mendelson, “Local rademacher complexities,” The Annals of Statistics, vol. 33, no. 4, pp. 1497–1537, 2005
work page 2005
-
[35]
Local Rademacher complexities and oracle inequalities in risk minimization,
V. Koltchinskii, “Local Rademacher complexities and oracle inequalities in risk minimization,” The Annals of Statistics, vol. 34, no. 6, pp. 2593–2656, 2006
work page 2006
-
[36]
Fast generalization error bound of deep learning from a kernel perspective,
T. Suzuki, “Fast generalization error bound of deep learning from a kernel perspective,” in International conference on artificial intelligence and statistics, 2018, pp. 1397–1406
work page 2018
-
[37]
Fast generalization error bound of deep learning without scale invariance of activation functions,
Y. Terada and R. Hirose, “Fast generalization error bound of deep learning without scale invariance of activation functions,” Neural Networks, vol. 129, pp. 344–358, 2020
work page 2020
-
[38]
Generalization bounds of stochastic gradient descent for wide and deep neural networks,
Y. Cao and Q. Gu, “Generalization bounds of stochastic gradient descent for wide and deep neural networks,” Advances in neural information processing systems, vol. 32, 2019
work page 2019
-
[39]
A. Jentzen and T. Welti, “Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation,” Applied Mathematics and Computation, vol. 455, p. 127907, 2023
work page 2023
-
[40]
Understanding deep learning requires rethinking generalization,
C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in International Conference on Learning Representations, 2017
work page 2017
-
[41]
Inductive biases and variable creation in self$attention mechanisms,
B. L. Edelman, S. Goel, S. Kakade, and C. Zhang, “Inductive biases and variable creation in self$attention mechanisms,” in International Conference on Machine Learning, 2022, pp. 5793–5831. 11
work page 2022
-
[42]
On the Rate of Convergence of a Classifier Based on a Transformer Encoder,
I. Gurevych, M. Kohler, and G. G. Şahin, “On the Rate of Convergence of a Classifier Based on a Transformer Encoder,” IEEE Transactions on Information Theory, vol. 68, no. 12, pp. 8139–8155, 2022
work page 2022
-
[43]
S. Takakura and T. Suzuki, “Approximation and estimation ability of transformers for sequence$to$ sequence functions with infinite dimensional input,” in International Conference on Machine Learning, 2023, pp. 33416–33447
work page 2023
-
[44]
Transformers are minimax optimal nonparametric in-context learners,
J. Kim, T. Nakamaki, and T. Suzuki, “Transformers are minimax optimal nonparametric in$context learners,” arXiv preprint arXiv:2408.12186, 2024
-
[45]
I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. USA: Academic Press, Inc., 1982
work page 1982
-
[46]
D. A. McAllester, “PAC$Bayesian model averaging,” in Proceedings of the twelfth annual conference on Computational learning theory, 1999, pp. 164–170
work page 1999
-
[47]
Dimension independent generalization error by stochastic gradient descent,
X. Chen, Q. Liu, and X. T. Tong, “Dimension independent generalization error by stochastic gradient descent,” arXiv preprint arXiv:2003.11196, 2020
-
[48]
Sgd implicitly regularizes generalization error,
D. A. Roberts, “Sgd implicitly regularizes generalization error,” arXiv preprint arXiv:2104.04874, 2021
-
[49]
Generalization error bounds via Rényi$, f$divergences and maximal leakage,
A. R. Esposito, M. Gastpar, and I. Issa, “Generalization error bounds via Rényi$, f$divergences and maximal leakage,” IEEE Transactions on Information Theory, vol. 67, no. 8, pp. 4986–5004, 2021
work page 2021
-
[50]
Information$theoretic analysis of generalization capability of learning algo$ rithms,
A. Xu and M. Raginsky, “Information$theoretic analysis of generalization capability of learning algo$ rithms,” Advances in neural information processing systems, vol. 30, 2017
work page 2017
-
[51]
Generalization error bounds for noisy, iterative algorithms,
A. Pensia, V. Jog, and P.$L. Loh, “Generalization error bounds for noisy, iterative algorithms,” in 2018 IEEE International Symposium on Information Theory (ISIT), 2018, pp. 546–550
work page 2018
-
[52]
Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,
K. E. Nikolakakis, F. Haddadpour, A. Karbasi, and D. S. Kalogerias, “Beyond lipschitz: Sharp general$ ization and excess risk bounds for full$batch gd,” arXiv preprint arXiv:2204.12446, 2022
-
[53]
A method for solving the convex programming problem with convergence rate O (1/k2),
Y. Nesterov, “A method for solving the convex programming problem with convergence rate O (1/k2),” in Dokl akad nauk Sssr, 1983, p. 543
work page 1983
-
[54]
A new approach to variable metric algorithms,
R. Fletcher, “A new approach to variable metric algorithms,” The computer journal , vol. 13, no. 3, pp. 317–322, 1970
work page 1970
-
[55]
A family of variable$metric methods derived by variational means,
D. Goldfarb, “A family of variable$metric methods derived by variational means,” Mathematics of computation, vol. 24, no. 109, pp. 23–26, 1970
work page 1970
-
[56]
Conditioning of quasi$Newton methods for function minimization,
D. F. Shanno, “Conditioning of quasi$Newton methods for function minimization,” Mathematics of computation, vol. 24, no. 111, pp. 647–656, 1970
work page 1970
-
[57]
Adam: A Method for Stochastic Optimization
D. P. Kingma, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[58]
G. K. Zipf, Human behavior and the principle of least effort: An introduction to human ecology . Cambridge, MA: Addison$Wesley Press, 1949
work page 1949
-
[59]
Critical behavior in physics and probabilistic formal languages,
H. W. Lin and M. E. Tegmark, “Critical behavior in physics and probabilistic formal languages,” Entropy, vol. 19, no. 7, p. 299, Jun. 2017
work page 2017
-
[60]
Long$range correlations between letters and sentences in texts,
W. Ebeling and A. Neiman, “Long$range correlations between letters and sentences in texts,” Physica A: Statistical Mechanics and its Applications, vol. 215, no. 3, pp. 233–241, 1995
work page 1995
-
[61]
Entropy and Long$Range correlations in literary english,
W. Ebeling and T. Pöschel, “Entropy and Long$Range correlations in literary english,” Europhysics Letters, vol. 26, no. 4, p. 241, 1994
work page 1994
-
[62]
Mutual information functions of natural language texts,
W. Li, “Mutual information functions of natural language texts,” 1989
work page 1989
-
[63]
Parallels in the sequential organization of birdsong and human speech,
T. Sainburg, B. Theilman, M. Thielk, and T. Q. Gentner, “Parallels in the sequential organization of birdsong and human speech,” Nature Communications, vol. 10, no. 1, p. 3636, 2019
work page 2019
-
[64]
Do neural nets learn statistical laws behind natural language?,
S. Takahashi and K. Tanaka$Ishii, “Do neural nets learn statistical laws behind natural language?,” PLoS ONE, vol. 12, no. 12, p. e189326, 2017
work page 2017
-
[65]
Evaluating computational language models with scaling properties of natural language,
S. Takahashi and K. Tanaka$Ishii, “Evaluating computational language models with scaling properties of natural language,” Computational Linguistics, vol. 45, no. 3, pp. 481–513, 2019
work page 2019
-
[66]
Long$range memory in literary texts: On the universal clustering of the rare words,
K. Tanaka$Ishii and A. Bunde, “Long$range memory in literary texts: On the universal clustering of the rare words,” PLoS ONE, vol. 11, no. 11, p. e164658, 2016. 12
work page 2016
-
[67]
S. Shalev$Shwartz and S. Ben$David, Understanding machine learning: From theory to algorithms . Cambridge university press, 2014
work page 2014
-
[68]
Hallucinations are inevitable but can be made statistically negligible
A. Suzuki, Y. He, F. Tian, and Z. Wang, “Hallucinations are inevitable but can be made statistically negli$ gible. The "innate" inevitability of hallucinations cannot explain practical LLM issues,” arXiv preprint arXiv:2502.12187, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[69]
M. Goldblum, M. Finzi, K. Rowan, and A. G. Wilson, “Position: the no free lunch theorem, Kolmogorov complexity, and the role of inductive biases in machine learning,” in Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024
work page 2024
-
[70]
Deep Learning is Not So Mysterious or Different,
A. G. Wilson, “Deep Learning is Not So Mysterious or Different,” arXiv preprint arXiv:2503.02113, 2025
-
[71]
Minimax estimation of functionals of discrete distributions,
J. Jiao, K. Venkat, Y. Han, and T. Weissman, “Minimax estimation of functionals of discrete distributions,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015
work page 2015
-
[72]
Reconciling modern machine$learning practice and the classical bias–variance trade$off,
M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine$learning practice and the classical bias–variance trade$off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019
work page 2019
-
[73]
Deep double descent: Where bigger models and more data hurt,
P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2021, no. 12, p. 124003, 2021
work page 2021
-
[74]
A remark on Stirling's formula,
H. Robbins, “A remark on Stirling's formula,” The American mathematical monthly , vol. 62, no. 1, pp. 26–29, 1955. A Limitations, discussions, and future work A.1 Rényi Entropy May Diverge As stated in Remark 16, if Rényi entropy diverges, Theorem 12 and Theorem 17 give vacuous upper bounds. However, as also stated in Remark 16, such cases are pathologic...
work page 1955
-
[75]
ln 𝑛− 𝑛+ 1 2ln(2𝜋) + 𝜃𝑛. (13) Thus, for (𝑛1,𝑛2,…,𝑛𝑘) ∈ ℕ𝑘 >0, (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑖=1𝑛 𝑛𝑖 +1 2 𝑖 · exp(𝐴) < ( 𝑛 𝑛1,…,𝑛𝑘 ) < (2𝜋) 1−𝑘 2 · 𝑛𝑛+1 2 ∏𝑘 𝑗=1𝑛 𝑛𝑗+1 2 𝑗 · exp(𝐵), (14) where 𝐴 = 1 12𝑛+1 − ∑𝑘 𝑗=1 1 12𝑛𝑗 , 𝐵 = 1 12𝑛 − ∑𝑘 𝑗=1 1 12𝑛𝑗+1. Lemma 30. (Multinomial Coefficient Estimation) Let 𝒏∈ ℕ𝑘 >0, and let 𝑛= ‖𝒏‖1 ≔ ∑𝑘−1 𝑗=0 𝑛𝑗. There exists 𝜃𝑛,𝒏 suc...
-
[76]
ln 𝑛− ∑ 𝑘 𝑗=1 ( 𝑛𝑘 + 1
-
[77]
□ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31
ln 𝑛𝑘 − 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛− ∑ 𝑘 𝑗=1 𝜃𝑛𝑘 = 𝑛( − ∑ 𝑘 𝑗=1 𝑛𝑘 𝑛 ln 𝑛𝑘 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏 = 𝑛𝐻1( 𝒏 𝑛) + 1 2( ln 𝑛− ∑ 𝑘 𝑗=1 ln 𝑛𝑘) + 1 2(𝑘− 1)ln 2𝜋+ 𝜃𝑛,𝒏, (16) where 𝜃𝑛,𝒏 ≔ 𝜃𝑛− ∑𝑘 𝑖=1𝜃𝑛𝑘 satisfies |𝜃𝑛,𝒏| ≤ 𝑘 12𝑛. □ B.4 Upper Bound on the Self-Entropy of a Multinomial Distribution using KL Divergence Lemma 31. (Upper Bound on t...
-
[78]
Using Lemma 42 with 𝜌 = 1− 𝛼 and 𝑏 = 𝑦 exp((1−𝛼)𝐻𝛼(Q)), 𝑛> ( 2[ln exp((1−𝛼)𝐻𝛼(Q)) 𝑦(1−𝛼) ] + 𝑦(1−𝛼) exp((1− 𝛼)𝐻𝛼(Q))) 1 1−𝛼 is a sufficient condition. [ln exp((1− 𝛼)𝐻𝛼(Q)) 𝑦(1− 𝛼) ] + = [(1− 𝛼)𝐻𝛼(Q)ln 1 𝑦(1− 𝛼)] + ≤ [(1− 𝛼)𝐻𝛼(Q)]+[ln 1 𝑦(1− 𝛼)] + = (1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1− 𝛼)] + (48) From this, 𝑛> ( ((( 2(1− 𝛼)𝐻𝛼(Q)[ln 1 𝑦(1−𝛼)] + 𝑦(1− 𝛼) exp((1− 𝛼)𝐻𝛼(Q)) ) ...
-
[79]
From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3
This immediately follows from the assumption 𝑛> ( 36ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 exp(𝐻𝛼(Q)) = ( 36exp((1− 𝛼)𝐻𝛼(Q))ln 2𝜋 𝛿2 𝜀2 ) 1 1−𝛼 (51) that (𝐴1,2 ≔) exp((1−𝛼)𝐻𝛼(Q))𝑛𝛼 ln 2𝜋 𝛿2 2𝑛 < 𝜀2 12. From the above, 𝐴1 = 𝐴1,1 + 𝐴1,2 < 𝜀2 4 + 𝜀2 12 = 𝜀2 3 . Decompose 𝐴2 = 𝐴2,1 + 𝐴2,2 where 𝐴2,1 ≔ 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 , 𝐴2,2 ≔ ln 2𝜋 𝛿2 √ 𝑛 2 ln 2 𝛿3 2𝑛 It is sufficient to...
-
[80]
This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12
Using Lemma 42 with 𝑏 = 𝜀2 9√ 2ln 2 𝛿3 ,𝜌 = 1 2, from the assumption 𝑛> ( ((( 18√ ln 2 𝛿3 𝜀2 [ln 9√ 2ln 2 𝛿3 𝜀2 ] +) ))) 2 , we can say ln 𝑛√𝑛 < 𝜀2 9√ 2ln 2 𝛿3 . This means (𝐴2,1 ≔) 3 ln 𝑛√ 𝑛 2 ln 2 𝛿3 2𝑛 < 𝜀2 12. Let’s show 𝐴2,2 < 𝜀2
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.