A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization

Dongseok Kim; Gisung Oh

arXiv: 2508.17412 · v4 · submitted 2025-08-24 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-18 20:58 UTC · model grok-4.3

classification cs.LG · cs.AI · stat.ML

keywords ridge regression · negative regularization · over-shrinkage · weak eigendirections · anti-shrinkage · small data regression · effective degrees of freedom

The pith

Negative regularization corrects over-shrinkage in ridge estimators by boosting complexity along weak eigendirections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies a negative-capable ridge family that allows feasible negative regularization values when the estimator remains well-posed. It shows that negative regularization functions as controlled anti-shrinkage, raising effective complexity most strongly in weak eigendirections where predictive signal often concentrates in small-data settings. The work formalizes weak-spectrum underfitting and derives a sign-switch result that appears once baseline shrinkage is conservative. Automatic selection criteria over the full family are examined, and experiments confirm feasibility along with spectral complexity gains and recovery of useful negative adjustments.

Core claim

Negative regularization acts as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections, with a sign-switch result under conservative baseline shrinkage, thereby correcting over-shrinkage in small-data regression problems where signal resides in restricted representations.

What carries the argument

The negative-capable ridge family, which extends conventional ridge estimators to permit a feasible negative region while preserving well-posedness and enables targeted increases in effective degrees of freedom along weak spectral directions.
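The estimator family described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the unnormalized Gram matrix X^T X and takes feasibility to mean that the regularization value exceeds the negative of the smallest Gram eigenvalue, which is the condition given in the referee summary on this page. The function name `negative_capable_ridge` is hypothetical.

```python
import numpy as np


def negative_capable_ridge(X, y, reg):
    """Solve (X^T X + reg * I) beta = X^T y, allowing reg < 0 when feasible.

    Assumption: feasibility means reg > -lambda_min(X^T X), so the
    regularized Gram matrix stays positive definite.
    """
    gram = X.T @ X
    lam_min = np.linalg.eigvalsh(gram)[0]  # eigvalsh returns ascending eigenvalues
    if reg <= -lam_min:
        raise ValueError(
            f"reg={reg} is infeasible: it must exceed -lambda_min={-lam_min:.4g}"
        )
    p = gram.shape[0]
    return np.linalg.solve(gram + reg * np.eye(p), X.T @ y)
```

With `reg = 0` and a full-column-rank design this reduces to ordinary least squares. A feasible negative `reg` inflates the coefficient norm instead of shrinking it, which is the anti-shrinkage behavior the paper describes.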

If this is right

Weak-spectrum underfitting becomes a formal and addressable phenomenon once negative regularization is admitted.
A sign-switch occurs in the regularization path once baseline shrinkage is set conservatively.
Criterion-based selection over the negative-capable family recovers effective negative adjustments in the predicted regimes.
Synthetic and semi-synthetic data verify feasibility, spectral complexity increase, and sign-switch behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anti-shrinkage logic could be tested in other linear estimators that admit tunable penalty signs.
High-dimensional problems with known low-signal directions may benefit from explicit negative adjustments.
Real datasets containing weak but predictive features offer a direct test of whether automatic selection prefers the negative region.

Load-bearing premise

The estimator remains well-posed whenever negative regularization is applied.

What would settle it

An experiment measuring effective degrees of freedom under negative regularization values that fails to show stronger increases along the smallest eigenvalues would refute the anti-shrinkage mechanism.
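A small numerical version of this check is sketched below. It assumes that per-direction effective degrees of freedom are measured by the standard ridge shrinkage factor λ_i/(λ_i + reg) quoted elsewhere on this page. The eigenvalues and the regularization value are made-up illustration numbers, not values from the paper, and `directional_df` is a hypothetical helper name.

```python
import numpy as np


def directional_df(eigvals, reg):
    """Per-direction effective degrees of freedom lambda_i / (lambda_i + reg)."""
    eigvals = np.asarray(eigvals, dtype=float)
    if reg <= -eigvals.min():
        raise ValueError("reg must exceed the negative of the smallest eigenvalue")
    return eigvals / (eigvals + reg)


# Illustrative spectrum (not from the paper): strong, medium, weak direction.
eigvals = np.array([100.0, 10.0, 1.0])
unregularized = directional_df(eigvals, 0.0)  # every factor equals 1
anti_shrunk = directional_df(eigvals, -0.5)   # feasible because -0.5 > -1.0
gain = anti_shrunk - unregularized            # complexity added per direction
print(gain)
```

In this toy spectrum the gain is roughly 0.005 on the strongest direction, 0.053 on the middle one, and 1.0 on the weakest. A measured gain that failed to grow toward the smallest eigenvalue for feasible negative values would be the refuting pattern described above.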

Original abstract

Conventional regularization is designed to control variance, but in small-data regression it can also aggravate underfitting when predictive signal is concentrated in weak directions of a restricted representation. We study a negative-capable ridge family that permits a feasible negative region whenever the estimator remains well posed, and show that negative regularization acts there as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections. Building on this mechanism, we formalize weak-spectrum underfitting, derive a sign-switch result under conservative baseline shrinkage, and study criterion-based automatic selection over the full negative-capable family. Synthetic and semi-synthetic experiments support the theory by verifying feasibility, spectral complexity increase, sign-switch behavior, and effective recovery of negative adjustments in the predicted regimes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Negative regularization in ridge can fix over-shrinkage along weak directions in small samples, and the algebra follows directly from the standard spectral form without extra assumptions.

Colleague, the main point is that this paper shows how a feasible negative adjustment to the ridge parameter can act as targeted anti-shrinkage. It boosts effective complexity most in the weak eigendirections, where small-data underfitting often hides. The authors formalize this as weak-spectrum underfitting and derive a sign-switch result by comparing a conservative positive baseline to a small negative shift, all within the region where the regularized Gram matrix stays positive definite. That core step is just algebra on the shrinkage factors λ_i/(λ_i + reg), so it holds up cleanly once the feasibility condition is accepted. The experiments then check that the negative values are recoverable by their selection criterion and that spectral complexity moves in the predicted direction on synthetic and semi-synthetic data.

What the work does well is keep the argument self-contained and avoid inventing new entities: the negative-capable family is a direct extension of ordinary ridge, and the automatic selection is criterion-based rather than ad hoc. The feasibility restriction is stated up front, which prevents the usual blow-up objections. On the softer side, the empirical gains are demonstrated in controlled regimes, but the paper gives limited detail on how large the usable negative region typically is across real datasets or how sensitive the selector is to eigenvalue spread. A few more checks on stability when the negative step is applied would strengthen the practical takeaway.

This is for people who already work with ridge or spectral regularization in limited-sample settings and want a principled tweak rather than a whole new method. The thinking is clear and the claims track the derivations, so the paper deserves a serious referee even if the real-world lift turns out modest in some domains. I would send it to review after the authors expand the sensitivity analysis a bit.

Referee Report

1 major / 3 minor

Summary. The paper introduces a negative-capable ridge regression family that permits negative regularization values in the feasible region where the estimator remains well-posed (i.e., when the regularization parameter exceeds the negative of the smallest eigenvalue of the Gram matrix). It shows that negative regularization functions as controlled anti-shrinkage, preferentially increasing effective complexity along weak eigendirections, formalizes weak-spectrum underfitting, derives a sign-switch result relative to a conservative positive baseline, and demonstrates via synthetic and semi-synthetic experiments that criterion-based selection can recover appropriate negative adjustments.

Significance. If the central algebraic claims hold, the work offers a principled mechanism to mitigate over-shrinkage in small-data ridge regression without arbitrary increases in model complexity. A clear strength is the direct derivation of the sign-switch and spectral effects from the standard ridge spectral decomposition, which requires no additional assumptions beyond the stated feasible negative region. The explicit restriction to well-posed cases and the experimental checks on feasibility, complexity increase, and sign-switch behavior provide concrete support. This could be relevant for adaptive regularization in high-dimensional or low-sample regimes.

major comments (1)

[Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.

minor comments (3)

[Abstract] The abstract paragraph on the negative-capable family is information-dense; splitting the description of the feasible region and the anti-shrinkage mechanism into separate sentences would improve readability.
[Experiments] Synthetic experiment details on data generation (e.g., eigenvalue distributions and noise levels) are referenced but could be expanded with explicit parameter values or pseudocode to strengthen reproducibility.
[Theoretical development] Notation for the effective complexity measure along eigendirections should be defined at first use with a brief reminder of its relation to the shrinkage factor.
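As an illustration of what the reproducibility pseudocode requested in the [Experiments] comment might look like, here is a hypothetical generator. It is not the paper's actual protocol: every parameter value (sample size, dimension, the geometric singular-value decay, the noise level, the number of weak directions carrying signal) is an assumption chosen for the example. It builds a design with a prescribed spectrum and places the true coefficients on the weakest eigendirections.

```python
import numpy as np


def make_weak_signal_data(n=40, p=8, n_weak=3, noise_sd=0.5, seed=0):
    """Toy regression data whose signal lives in the weakest eigendirections.

    All defaults are illustrative assumptions, not settings from the paper.
    Returns X, y, the true coefficients beta, and the eigenvalues of X^T X.
    """
    rng = np.random.default_rng(seed)
    # Orthonormal factors for X = U diag(s) V^T.
    U, _ = np.linalg.qr(rng.standard_normal((n, p)))
    V, _ = np.linalg.qr(rng.standard_normal((p, p)))
    s = np.geomspace(10.0, 0.5, p)      # singular values, strong to weak
    X = (U * s) @ V.T                   # scales column j of U by s[j]

    theta = np.zeros(p)                 # coefficients in the eigenbasis
    theta[-n_weak:] = 2.0               # signal only on the weakest directions
    beta = V @ theta

    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y, beta, s**2
```

Writing the spectrum and the signal placement as explicit arguments makes it easy to report exact values alongside each experiment, which is what the comment asks for.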

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below and have incorporated a clarification into the revised manuscript.

Referee: [Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.

Authors: We appreciate the referee drawing attention to this aspect of the sign-switch derivation. The conservative baseline is selected independently of the data spectrum: it is a fixed positive regularization value obtained either by cross-validation over a positive grid or by a small preset positive constant, without reference to the eigenvalues of the Gram matrix beyond the requirement that the baseline remain positive. This choice is made prior to considering negative adjustments and does not introduce spectrum-dependent sensitivity into the algebraic comparison. The sign-switch result then follows directly from the spectral decomposition once this fixed positive baseline is subtracted. To eliminate any ambiguity, we have added an explicit paragraph in the derivation section stating the baseline selection rule, confirming its independence from the spectrum, and noting that the predicted regimes for beneficial negative regularization are therefore insensitive to the precise positive value chosen within the conservative range.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claims are algebraic identities from standard ridge spectral decomposition

The paper's core results (anti-shrinkage via negative regularization, sign-switch under baseline, weak-spectrum underfitting) follow directly from the eigendecomposition of the closed-form ridge estimator: the per-direction shrinkage factor λ_i/(λ_i + reg) is an identity of the estimator definition, and its differential behavior for small versus large λ_i when reg is negative (but greater than -min_i λ_i) is a direct algebraic comparison, not a fit or redefinition. The feasible negative region is explicitly restricted to where the regularized Gram matrix stays positive definite, preserving well-posedness without introducing circularity. No self-citations, ansatzes, or renamed empirical patterns are invoked as load-bearing steps in the provided derivation chain. The argument is self-contained under standard linear-algebra assumptions for ridge regression.
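The algebraic comparison can be written out explicitly. The block below is a worked check that uses only the shrinkage factor quoted above; the symbol α (standing for the value written "reg") and the notation d_i are introduced here for convenience and may differ from the paper's own notation.

```latex
% Assumptions: \lambda_i > 0 are the Gram eigenvalues, \lambda_{\min} the smallest,
% and \alpha is the regularization value ("reg" in the text).
\[
  d_i(\alpha) = \frac{\lambda_i}{\lambda_i + \alpha},
  \qquad \alpha > -\lambda_{\min}.
\]
% Lowering \alpha raises the factor in every direction:
\[
  \frac{\partial d_i}{\partial \alpha}
  = -\frac{\lambda_i}{(\lambda_i + \alpha)^2} < 0,
  \qquad\text{and}\qquad
  d_i(\alpha) > 1 \iff \alpha < 0 .
\]
% The size of that response depends on the eigenvalue. For feasible \alpha \le 0,
\[
  \frac{\partial}{\partial \lambda_i}
  \left[ \frac{\lambda_i}{(\lambda_i + \alpha)^2} \right]
  = \frac{\alpha - \lambda_i}{(\lambda_i + \alpha)^3} < 0 ,
\]
% so the sensitivity |\partial d_i / \partial \alpha| is largest for the
% smallest eigenvalue.
% Example: \alpha = -0.5 gives d_i = 2 at \lambda_i = 1
% but only d_i = 100/99.5 \approx 1.005 at \lambda_i = 100.
```

For positive α the same derivative changes sign at λ_i = α, so the monotone "weakest direction responds most" ordering is specific to the non-positive part of the feasible region.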

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Information is limited to the abstract; the central claim rests on the well-posedness condition for negative regularization and the assumption that signal concentration occurs in weak eigendirections.

free parameters (1)

regularization parameter (lambda)
The value of lambda, including its sign, is selected via criterion-based automatic selection over the negative-capable family.
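One way such a selection could be wired up is sketched below. Generalized cross-validation (GCV) is used purely as a stand-in, since this page does not say which criterion the paper adopts. The grid bounds, the `margin` that keeps candidates away from the feasibility boundary, and the name `gcv_select` are all assumptions, and the design matrix is assumed to have full column rank.

```python
import numpy as np


def gcv_select(X, y, n_grid=61, margin=0.05, reg_max=10.0):
    """Pick reg by GCV over a grid that includes feasible negative values.

    Assumes X has full column rank, so the smallest Gram eigenvalue is positive.
    """
    n = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    lam = s**2                                 # eigenvalues of X^T X
    reg_low = -(1.0 - margin) * lam.min()      # stay strictly inside feasibility
    grid = np.linspace(reg_low, reg_max, n_grid)

    uty = U.T @ y
    # Squared norm of the part of y outside the column space of X.
    rss_outside = max(float(y @ y - uty @ uty), 0.0)

    best_reg, best_score = None, np.inf
    for reg in grid:
        shrink = lam / (lam + reg)             # per-direction shrinkage factors
        df = shrink.sum()                      # effective degrees of freedom
        if df >= n:                            # GCV denominator degenerates
            continue
        rss = rss_outside + np.sum(((1.0 - shrink) * uty) ** 2)
        score = (rss / n) / (1.0 - df / n) ** 2
        if score < best_score:
            best_reg, best_score = float(reg), float(score)
    return best_reg, best_score
```

Whether the selected value comes out negative depends on the data. The sketch only ensures that negative candidates are considered and that infeasible ones are never evaluated.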

axioms (1)

domain assumption: The estimator remains well-posed for negative regularization values
Abstract states that negative regularization is feasible whenever the estimator remains well posed.

pith-pipeline@v0.9.0 · 5654 in / 1156 out tokens · 32836 ms · 2026-05-18T20:58:09.409348+00:00 · methodology


work page 2019
[71]

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning.IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018. 22 Kim et al

work page 1979
[72]

mixup: Beyond empirical risk minimization

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. InInternational Conference on Learning Representations, 2018

work page 2018
[73]

On mixup training: Improved calibration and predictive uncertainty for deep neural networks.Advances in neural information processing systems, 32, 2019

Sunil Thulasidasan, Gopinath Chennupati, Jeff A Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks.Advances in neural information processing systems, 32, 2019

work page 2019
[74]

On the difficulty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. Pmlr, 2013

work page 2013
[75]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InInternational Conference on Learning Representations, 2018

work page 2018
[76]

Theoretically principled trade-off between robustness and accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. InInternational conference on machine learning, pages 7472–7482. PMLR, 2019

work page 2019
[77]

Parseval networks: Improving robustness to adversarial examples

Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. InInternational conference on machine learning, pages 854–863. PMLR, 2017

work page 2017
[78]

Certified adversarial robustness via randomized smoothing

Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pages 1310–1320. PMLR, 2019

work page 2019
[79]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

work page 2018
[80]

Wide neural networks of any depth evolve as linear models under gradient descent.Advances in neural information processing systems, 32, 2019

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent.Advances in neural information processing systems, 32, 2019

work page 2019

