arXiv: 2508.17412 · v4 · submitted 2025-08-24 · 💻 cs.LG · cs.AI · stat.ML

A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization

Pith reviewed 2026-05-18 20:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords ridge regression · negative regularization · over-shrinkage · weak eigendirections · anti-shrinkage · small data regression · effective degrees of freedom

The pith

Negative regularization corrects over-shrinkage in ridge estimators by boosting complexity along weak eigendirections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a negative-capable ridge family that allows feasible negative regularization values when the estimator remains well-posed. It shows that negative regularization functions as controlled anti-shrinkage, raising effective complexity most strongly in weak eigendirections where predictive signal often concentrates in small-data settings. The work formalizes weak-spectrum underfitting and derives a sign-switch result that appears once baseline shrinkage is conservative. Automatic selection criteria over the full family are examined, and experiments confirm feasibility along with spectral complexity gains and recovery of useful negative adjustments.

Core claim

Negative regularization acts as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections, with a sign-switch result under conservative baseline shrinkage, thereby correcting over-shrinkage in small-data regression problems where signal resides in restricted representations.

What carries the argument

The negative-capable ridge family, which extends conventional ridge estimators to permit a feasible negative region while preserving well-posedness and enables targeted increases in effective degrees of freedom along weak spectral directions.
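
A minimal NumPy sketch of such a family, assuming the standard closed form (X^T X + reg·I)^{-1} X^T y with an unnormalized Gram matrix. The function name `negative_capable_ridge` and the toy data are hypothetical, and the paper's own parameterization may differ. The sketch only illustrates that a negative `reg` inside the feasible region still gives a unique solution.

```python
import numpy as np

def negative_capable_ridge(X, y, reg):
    """Solve (X^T X + reg * I) beta = X^T y, allowing reg < 0 when feasible."""
    gram = X.T @ X
    lam_min = np.linalg.eigvalsh(gram)[0]  # eigvalsh returns eigenvalues in ascending order
    if reg <= -lam_min:
        # Outside the feasible region: the regularized Gram matrix is not positive definite.
        raise ValueError(f"reg={reg} is infeasible; it must exceed {-lam_min}")
    return np.linalg.solve(gram + reg * np.eye(X.shape[1]), X.T @ y)

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=30)
lam_min = np.linalg.eigvalsh(X.T @ X)[0]

beta_pos = negative_capable_ridge(X, y, 0.5 * lam_min)   # conventional shrinkage
beta_ols = negative_capable_ridge(X, y, 0.0)             # no regularization
beta_neg = negative_capable_ridge(X, y, -0.5 * lam_min)  # anti-shrinkage
```

Under this closed form the coefficient norm decreases monotonically in `reg` over the feasible region, so `beta_neg` has the largest norm of the three and `beta_pos` the smallest.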

If this is right

  • Weak-spectrum underfitting becomes a formal and addressable phenomenon once negative regularization is admitted.
  • A sign-switch occurs in the regularization path once baseline shrinkage is set conservatively.
  • Criterion-based selection over the negative-capable family recovers effective negative adjustments in the predicted regimes.
  • Synthetic and semi-synthetic data verify feasibility, spectral complexity increase, and sign-switch behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anti-shrinkage logic could be tested in other linear estimators that admit tunable penalty signs.
  • High-dimensional problems with known low-signal directions may benefit from explicit negative adjustments.
  • Real datasets containing weak but predictive features offer a direct test of whether automatic selection prefers the negative region.

Load-bearing premise

The estimator remains well-posed whenever negative regularization is applied.

What would settle it

An experiment measuring effective degrees of freedom under negative regularization values that fails to show stronger increases along the smallest eigenvalues would refute the anti-shrinkage mechanism.
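
One way such a check could look, as a sketch: compute the per-direction contribution λ_i/(λ_i + reg) to the effective degrees of freedom and compare it with the unregularized value of 1. The helper name `directional_dof`, the spectrum, and the value of `reg` are made-up illustration values, not taken from the paper.

```python
import numpy as np

def directional_dof(eigvals, reg):
    """Per-direction contribution lambda_i / (lambda_i + reg) to effective degrees of freedom."""
    eigvals = np.asarray(eigvals, dtype=float)
    if reg <= -eigvals.min():
        raise ValueError("reg must exceed -min(eigvals) to stay in the feasible region")
    return eigvals / (eigvals + reg)

# Hypothetical spectrum, ordered from strong to weak directions.
eigvals = np.array([10.0, 1.0, 0.1])
reg = -0.05  # feasible because -0.05 > -0.1

# Increase in each direction relative to the unregularized fit (reg = 0).
gain = directional_dof(eigvals, reg) - directional_dof(eigvals, 0.0)
total_dof = directional_dof(eigvals, reg).sum()
```

If the mechanism holds, `gain` should be positive everywhere and grow as the eigenvalue shrinks; a measured pattern that does not do so would be the refuting outcome described above.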

Original abstract

Conventional regularization is designed to control variance, but in small-data regression it can also aggravate underfitting when predictive signal is concentrated in weak directions of a restricted representation. We study a negative-capable ridge family that permits a feasible negative region whenever the estimator remains well posed, and show that negative regularization acts there as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections. Building on this mechanism, we formalize weak-spectrum underfitting, derive a sign-switch result under conservative baseline shrinkage, and study criterion-based automatic selection over the full negative-capable family. Synthetic and semi-synthetic experiments support the theory by verifying feasibility, spectral complexity increase, sign-switch behavior, and effective recovery of negative adjustments in the predicted regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces a negative-capable ridge regression family that permits negative regularization values in the feasible region where the estimator remains well-posed (i.e., when the regularization parameter exceeds the negative of the smallest eigenvalue of the Gram matrix). It shows that negative regularization functions as controlled anti-shrinkage, preferentially increasing effective complexity along weak eigendirections, formalizes weak-spectrum underfitting, derives a sign-switch result relative to a conservative positive baseline, and demonstrates via synthetic and semi-synthetic experiments that criterion-based selection can recover appropriate negative adjustments.

Significance. If the central algebraic claims hold, the work offers a principled mechanism to mitigate over-shrinkage in small-data ridge regression without arbitrary increases in model complexity. A clear strength is the direct derivation of the sign-switch and spectral effects from the standard ridge spectral decomposition, which requires no additional assumptions beyond the stated feasible negative region. The explicit restriction to well-posed cases and the experimental checks on feasibility, complexity increase, and sign-switch behavior provide concrete support. This could be relevant for adaptive regularization in high-dimensional or low-sample regimes.

major comments (1)
  1. [Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.
minor comments (3)
  1. [Abstract] The abstract paragraph on the negative-capable family is information-dense; splitting the description of the feasible region and the anti-shrinkage mechanism into separate sentences would improve readability.
  2. [Experiments] Synthetic experiment details on data generation (e.g., eigenvalue distributions and noise levels) are referenced but could be expanded with explicit parameter values or pseudocode to strengthen reproducibility.
  3. [Theoretical development] Notation for the effective complexity measure along eigendirections should be defined at first use with a brief reminder of its relation to the shrinkage factor.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below and have incorporated a clarification into the revised manuscript.

point-by-point responses
  1. Referee: [Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.

    Authors: We appreciate the referee drawing attention to this aspect of the sign-switch derivation. The conservative baseline is selected independently of the data spectrum: it is a fixed positive regularization value obtained either by cross-validation over a positive grid or by a small preset positive constant, without reference to the eigenvalues of the Gram matrix beyond the requirement that the baseline remain positive. This choice is made prior to considering negative adjustments and does not introduce spectrum-dependent sensitivity into the algebraic comparison. The sign-switch result then follows directly from the spectral decomposition once this fixed positive baseline is subtracted. To eliminate any ambiguity, we have added an explicit paragraph in the derivation section stating the baseline selection rule, confirming its independence from the spectrum, and noting that the predicted regimes for beneficial negative regularization are therefore insensitive to the precise positive value chosen within the conservative range.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims are algebraic identities from standard ridge spectral decomposition

full rationale

The paper's core results (anti-shrinkage via negative regularization, sign-switch under a conservative baseline, weak-spectrum underfitting) follow directly from the eigendecomposition of the closed-form ridge estimator. The per-direction shrinkage factor λ_i/(λ_i + reg), where λ_i is the i-th eigenvalue of the Gram matrix and reg is the regularization parameter, is an identity of the estimator definition, and its differential behavior for small vs. large λ_i when reg is negative (but > -min(λ_i)) is a direct algebraic comparison, not a fit or redefinition. The feasible negative region is explicitly restricted to where the regularized Gram matrix stays positive definite, preserving well-posedness without introducing circularity. No self-citations, ansatzes, or renamed empirical patterns are invoked as load-bearing steps in the provided derivation chain. The argument is self-contained under standard linear-algebra assumptions for ridge regression.
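
The algebra referred to here can be written out as follows. This is a sketch in assumed notation (thin SVD of X, eigenvalues λ_i = σ_i² > 0, regularization parameter written as reg); the paper's own symbols are not visible from the abstract and may differ.

```latex
% Assumed notation: X = U \Sigma V^\top (thin SVD), \lambda_i = \sigma_i^2 > 0,
% u_i the i-th column of U, and "reg" the (possibly negative) regularization parameter.
\begin{align*}
\hat{y}(\mathrm{reg})
  &= X\,(X^\top X + \mathrm{reg}\, I)^{-1} X^\top y
   = \sum_{i} s_i(\mathrm{reg})\, u_i u_i^\top y,
  \qquad s_i(\mathrm{reg}) = \frac{\lambda_i}{\lambda_i + \mathrm{reg}}, \\
\mathrm{df}(\mathrm{reg})
  &= \operatorname{tr}\!\big[X\,(X^\top X + \mathrm{reg}\, I)^{-1} X^\top\big]
   = \sum_{i} s_i(\mathrm{reg}), \\
s_i(\mathrm{reg}) - 1
  &= \frac{-\mathrm{reg}}{\lambda_i + \mathrm{reg}}
   > 0 \quad \text{for } -\min_j \lambda_j < \mathrm{reg} < 0 .
\end{align*}
```

For a fixed negative reg in the feasible region, the excess -reg/(λ_i + reg) is decreasing in λ_i, so it is largest in the weakest eigendirection.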

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Information is limited to the abstract; the central claim rests on the well-posedness condition for negative regularization and the assumption that signal concentration occurs in weak eigendirections.

free parameters (1)
  • regularization parameter (lambda)
    The value of lambda, including its sign, is selected via criterion-based automatic selection over the negative-capable family.
axioms (1)
  • domain assumption: The estimator remains well-posed for negative regularization values
    Abstract states that negative regularization is feasible whenever the estimator remains well posed.
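
The abstract does not say which selection criterion is used. As one plausible instance, the sketch below applies generalized cross-validation (GCV) over a grid that includes feasible negative values. The function name `gcv_select`, the grid, and the synthetic data are assumptions for illustration only, not the paper's procedure.

```python
import numpy as np

def gcv_select(X, y, grid):
    """Return (reg, score) minimizing GCV over the feasible part of `grid`."""
    n = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    lam = s ** 2                      # eigenvalues of the Gram matrix X^T X
    uty = U.T @ y
    best_reg, best_score = None, np.inf
    for reg in grid:
        if reg <= -lam.min():
            continue                  # outside the feasible region
        shrink = lam / (lam + reg)    # per-direction shrinkage factors
        df = shrink.sum()             # effective degrees of freedom (trace of the hat matrix)
        if df >= n:
            continue                  # GCV denominator is not meaningful here
        resid = y - U @ (shrink * uty)
        score = (resid @ resid / n) / (1.0 - df / n) ** 2
        if score < best_score:
            best_reg, best_score = reg, score
    return best_reg, best_score

# Hypothetical toy data; the paper's experimental setup is not reproduced here.
rng = np.random.default_rng(0)
n, p = 40, 6
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)
lam_min = np.linalg.eigvalsh(X.T @ X)[0]
grid = np.linspace(-0.9 * lam_min, 2.0 * lam_min, 60)
best_reg, best_score = gcv_select(X, y, grid)
```

Whether the selected value is negative depends on the data. The sketch only shows that the sign of the parameter can be left to the criterion once the grid is restricted to the feasible region.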

