A Ridge Too Far: Correcting Over-Shrinkage via Negative Regularization
Pith reviewed 2026-05-18 20:58 UTC · model grok-4.3
The pith
Negative regularization corrects over-shrinkage in ridge estimators by boosting complexity along weak eigendirections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Negative regularization acts as controlled anti-shrinkage: it increases effective complexity most strongly along weak eigendirections. Combined with a sign-switch result under conservative baseline shrinkage, this corrects over-shrinkage in small-data regression problems where the predictive signal is concentrated in weak directions of a restricted representation.
What carries the argument
The negative-capable ridge family, which extends conventional ridge estimators to permit a feasible negative region while preserving well-posedness and enables targeted increases in effective degrees of freedom along weak spectral directions.
If this is right
- Weak-spectrum underfitting becomes a formal and addressable phenomenon once negative regularization is admitted.
- A sign-switch occurs in the regularization path once baseline shrinkage is set conservatively.
- Criterion-based selection over the negative-capable family recovers effective negative adjustments in the predicted regimes.
- Synthetic and semi-synthetic data verify feasibility, spectral complexity increase, and sign-switch behavior.
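The criterion-based selection mentioned in the list above can be sketched as a grid search that simply extends below zero. The page does not say which criterion the paper uses, so generalised cross-validation (GCV) is an assumption here, as are the unnormalised Gram matrix `X.T @ X`, the helper names `gcv_score` and `select_reg`, and the `margin`, `reg_max` and `n_grid` settings. This is an illustrative sketch, not the authors' procedure.

```python
import numpy as np

def gcv_score(X, y, reg):
    """GCV score n * RSS / (n - df)^2 for a ridge fit (assumed criterion)."""
    n = X.shape[0]
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    factors = s**2 / (s**2 + reg)        # per-direction factor lambda_i / (lambda_i + reg)
    df = factors.sum()                   # effective degrees of freedom
    if df >= n:                          # GCV is not meaningful once df reaches n
        return np.inf
    y_hat = U @ (factors * (U.T @ y))    # fitted values in the left singular basis
    rss = np.sum((y - y_hat) ** 2)
    return n * rss / (n - df) ** 2

def select_reg(X, y, reg_max=10.0, margin=0.9, n_grid=41):
    """Pick reg on a grid spanning the feasible negative region and a positive range."""
    lam_min = np.linalg.svd(X, compute_uv=False).min() ** 2
    # Stay strictly inside the feasible region reg > -lam_min (hypothetical margin).
    grid = np.linspace(-margin * lam_min, reg_max, n_grid)
    scores = np.array([gcv_score(X, y, r) for r in grid])
    return grid[np.argmin(scores)], lam_min

# Hypothetical toy data, not taken from the paper.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=40)

reg_hat, lam_min = select_reg(X, y)
print(f"selected reg = {reg_hat:.3f}, feasibility bound = {-lam_min:.3f}")
```

Whether the selected value lands in the negative region depends on the data. The sketch only guarantees that every candidate is feasible.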
Where Pith is reading between the lines
- The same anti-shrinkage logic could be tested in other linear estimators that admit tunable penalty signs.
- High-dimensional problems with known low-signal directions may benefit from explicit negative adjustments.
- Real datasets containing weak but predictive features offer a direct test of whether automatic selection prefers the negative region.
Load-bearing premise
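The estimator remains well-posed throughout the negative region in which regularization is applied, that is, for regularization values above the negative of the smallest eigenvalue of the Gram matrix.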
What would settle it
An experiment measuring effective degrees of freedom under negative regularization values that fails to show stronger increases along the smallest eigenvalues would refute the anti-shrinkage mechanism.
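The quantity such an experiment would track can be sketched with the standard ridge per-direction factor λ_i/(λ_i + reg). The spectrum below and the function name `directional_factors` are hypothetical. The sketch measures the increase relative to the unregularised factor of 1, which is one possible reference point and not necessarily the one the paper uses.

```python
import numpy as np

def directional_factors(eigvals, reg):
    """Per-direction factor lambda_i / (lambda_i + reg) of a ridge fit."""
    eigvals = np.asarray(eigvals, dtype=float)
    if reg <= -eigvals.min():
        raise ValueError("reg must exceed -min(eigvals) for a well-posed estimator")
    return eigvals / (eigvals + reg)

# Hypothetical Gram-matrix spectrum, sorted from weakest to strongest direction.
eigvals = np.array([0.5, 2.0, 10.0, 50.0])
reg_neg = -0.25                      # feasible, since -0.25 > -0.5

factors = directional_factors(eigvals, reg_neg)
increase = factors - 1.0             # gain over the unregularised factor of 1
effective_df = factors.sum()         # total effective degrees of freedom

for lam, gain in zip(eigvals, increase):
    print(f"lambda = {lam:6.2f}   increase in factor = {gain:.4f}")
```

Under the anti-shrinkage mechanism the printed increases should fall as the eigenvalue grows. A measured spectrum that breaks this ordering would be the refuting evidence described above.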
Original abstract
Conventional regularization is designed to control variance, but in small-data regression it can also aggravate underfitting when predictive signal is concentrated in weak directions of a restricted representation. We study a negative-capable ridge family that permits a feasible negative region whenever the estimator remains well posed, and show that negative regularization acts there as controlled anti-shrinkage by increasing effective complexity most strongly along weak eigendirections. Building on this mechanism, we formalize weak-spectrum underfitting, derive a sign-switch result under conservative baseline shrinkage, and study criterion-based automatic selection over the full negative-capable family. Synthetic and semi-synthetic experiments support the theory by verifying feasibility, spectral complexity increase, sign-switch behavior, and effective recovery of negative adjustments in the predicted regimes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a negative-capable ridge regression family that permits negative regularization values in the feasible region where the estimator remains well-posed (i.e., when the regularization parameter exceeds the negative of the smallest eigenvalue of the Gram matrix). It shows that negative regularization functions as controlled anti-shrinkage, preferentially increasing effective complexity along weak eigendirections, formalizes weak-spectrum underfitting, derives a sign-switch result relative to a conservative positive baseline, and demonstrates via synthetic and semi-synthetic experiments that criterion-based selection can recover appropriate negative adjustments.
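The feasibility condition in the summary can be made concrete with a short sketch. It assumes the unnormalised Gram matrix `X.T @ X` and the closed-form ridge solution. The function names `feasible_lower_bound` and `negative_capable_ridge` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def feasible_lower_bound(X):
    """Return -lambda_min of the Gram matrix X^T X; reg must stay strictly above it."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
    return -eigvals[0]

def negative_capable_ridge(X, y, reg):
    """Solve (X^T X + reg * I) beta = X^T y, allowing reg < 0 inside the feasible region."""
    lower = feasible_lower_bound(X)
    if reg <= lower:
        raise ValueError(f"reg={reg} is not above the feasibility bound {lower}")
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(p), X.T @ y)

# Hypothetical toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.25]) + 0.1 * rng.normal(size=30)

lower = feasible_lower_bound(X)
beta_neg = negative_capable_ridge(X, y, 0.5 * lower)   # halfway into the negative region
beta_ols = negative_capable_ridge(X, y, 0.0)           # ordinary least squares
beta_pos = negative_capable_ridge(X, y, 1.0)           # conventional ridge

for name, beta in [("negative", beta_neg), ("zero", beta_ols), ("positive", beta_pos)]:
    print(f"{name:>8}: ||beta|| = {np.linalg.norm(beta):.4f}")
```

The coefficient norm grows as `reg` decreases, which is the anti-shrinkage effect in its simplest form. The explicit check keeps the linear system positive definite and therefore uniquely solvable.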
Significance. If the central algebraic claims hold, the work offers a principled mechanism to mitigate over-shrinkage in small-data ridge regression without arbitrary increases in model complexity. A clear strength is the direct derivation of the sign-switch and spectral effects from the standard ridge spectral decomposition, which requires no additional assumptions beyond the stated feasible negative region. The explicit restriction to well-posed cases and the experimental checks on feasibility, complexity increase, and sign-switch behavior provide concrete support. This could be relevant for adaptive regularization in high-dimensional or low-sample regimes.
major comments (1)
- [Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.
minor comments (3)
- [Abstract] The abstract paragraph on the negative-capable family is information-dense; splitting the description of the feasible region and the anti-shrinkage mechanism into separate sentences would improve readability.
- [Experiments] Synthetic experiment details on data generation (e.g., eigenvalue distributions and noise levels) are referenced but could be expanded with explicit parameter values or pseudocode to strengthen reproducibility.
- [Theoretical development] Notation for the effective complexity measure along eigendirections should be defined at first use with a brief reminder of its relation to the shrinkage factor.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. We address the single major comment below and have incorporated a clarification into the revised manuscript.
Point-by-point responses
- Referee: [Derivation of sign-switch result] The sign-switch result and its dependence on the conservative baseline are load-bearing for the main claim. The manuscript should clarify in the derivation section whether the baseline is chosen independently of the data spectrum or if it introduces any implicit parameter sensitivity that could affect the predicted regimes for negative adjustments.
  Authors: We appreciate the referee drawing attention to this aspect of the sign-switch derivation. The conservative baseline is selected independently of the data spectrum: it is a fixed positive regularization value obtained either by cross-validation over a positive grid or by a small preset positive constant, without reference to the eigenvalues of the Gram matrix beyond the requirement that the baseline remain positive. This choice is made prior to considering negative adjustments and does not introduce spectrum-dependent sensitivity into the algebraic comparison. The sign-switch result then follows directly from the spectral decomposition once this fixed positive baseline is subtracted. To eliminate any ambiguity, we have added an explicit paragraph in the derivation section stating the baseline selection rule, confirming its independence from the spectrum, and noting that the predicted regimes for beneficial negative regularization are therefore insensitive to the precise positive value chosen within the conservative range.
  Revision: yes
Circularity Check
No significant circularity; central claims are algebraic identities from standard ridge spectral decomposition
Full rationale
The paper's core results (anti-shrinkage via negative regularization, sign-switch under a conservative baseline, weak-spectrum underfitting) follow directly from the closed-form ridge estimator's eigendecomposition: the per-direction shrinkage factor λ_i/(λ_i + reg) is an identity of the estimator definition, and its differential behavior for small vs. large λ_i when reg is negative (but > -min(λ)) is a direct algebraic comparison, not a fit or redefinition. The feasible negative region is explicitly restricted to where the regularized Gram matrix (the Gram matrix plus reg times the identity) stays positive definite, preserving well-posedness without introducing circularity. No self-citations, ansatzes, or renamed empirical patterns are invoked as load-bearing steps in the provided derivation chain. The argument is self-contained under standard linear-algebra assumptions for ridge regression.
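The algebra behind this rationale can be written out as a short derivation from the standard ridge spectral decomposition. The notation is an assumption made for illustration: X = U diag(σ_i) Vᵀ is the thin SVD of the design matrix, λ_i = σ_i² are the eigenvalues of the unnormalised Gram matrix XᵀX, and reg is the regularization parameter. The paper may use a different normalisation.

```latex
\begin{align*}
\hat\beta(\mathrm{reg})
  &= (X^\top X + \mathrm{reg}\, I)^{-1} X^\top y
   = \sum_i v_i \,\frac{\sigma_i}{\lambda_i + \mathrm{reg}}\, u_i^\top y, \\
\hat y(\mathrm{reg})
  &= X \hat\beta(\mathrm{reg})
   = \sum_i u_i \,\frac{\lambda_i}{\lambda_i + \mathrm{reg}}\, u_i^\top y, \\
\mathrm{df}(\mathrm{reg})
  &= \sum_i \frac{\lambda_i}{\lambda_i + \mathrm{reg}},
  \qquad
  \frac{\partial}{\partial\,\mathrm{reg}}\,
  \frac{\lambda_i}{\lambda_i + \mathrm{reg}}
   = -\frac{\lambda_i}{(\lambda_i + \mathrm{reg})^2} < 0 .
\end{align*}
% For -\min_i \lambda_i < \mathrm{reg} < 0 every factor exceeds 1, and the excess
\[
\frac{\lambda_i}{\lambda_i + \mathrm{reg}} - 1
  = \frac{-\mathrm{reg}}{\lambda_i + \mathrm{reg}} > 0
\]
% is strictly decreasing in \lambda_i, so it is largest along the weakest eigendirection.
```

Each factor is well defined exactly when λ_i + reg > 0 for all i, which is the feasibility condition reg > -min(λ) stated above.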
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization parameter (lambda)
axioms (1)
- domain assumption: The estimator remains well-posed for negative regularization values