Estimating Implicit Regularization in Deep Learning
Pith reviewed 2026-05-08 15:51 UTC · model grok-4.3
The pith
Gradient matching methods can empirically recover the implicit regularization effects induced by complex training procedures in deep networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By solving for the penalty term whose gradient best explains the difference between actual parameter updates and the loss gradient alone, one obtains an estimate of the implicit regularization at work during training; this recovers known explicit and implicit penalties and characterizes dropout as inducing an L2 effect in deep networks.
What carries the argument
Gradient matching: a procedure that finds a penalty function whose gradient, added to the loss gradient, best reproduces the observed weight updates.
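As an illustration of the matching step (our construction, not the paper's code): for full-batch gradient descent with a known explicit ℓ2 penalty, the per-step deviation from the loss gradient is exactly the penalty gradient, so a scalar coefficient can be recovered by least squares. A minimal NumPy sketch, assuming a linear model and a scalar ℓ2 candidate family (the paper fits richer penalty families):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

def loss_grad(w):
    return 2 * X.T @ (X @ w - y) / len(y)

lam_true, eta = 0.3, 0.01   # ground-truth weight decay, step size
w = rng.normal(size=5)
num = den = 0.0
for _ in range(500):
    g_loss = loss_grad(w)
    w_new = w - eta * (g_loss + 2 * lam_true * w)   # GD with explicit L2
    d = (w - w_new) / eta - g_loss                  # observed deviation from pure loss gradient
    num += d @ (2 * w)                              # least-squares fit of d against grad(lam*||w||^2)
    den += (2 * w) @ (2 * w)
    w = w_new
lam_hat = num / den   # recovers lam_true up to floating-point error
```

In this deterministic case the deviation is a clean signal; with minibatching or dropout it also carries noise, which is the central concern of the referee report on this page.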
If this is right
- The method verifies analytical predictions of implicit bias such as the quadratic penalty from early stopping.
- Dropout in deep networks produces an implicit L2 regularization effect that the matching procedure can quantify.
- Practitioners can apply the technique to interpret the net regularization strength of their chosen training modifications.
- The empirical nature of the approach allows characterization of implicit biases in networks too complex for analytic derivation.
Where Pith is reading between the lines
- The same matching idea could be used to compare regularization strength across different optimizers or architectures in controlled experiments.
- Quantified implicit penalties might guide automated hyperparameter search by treating regularization strength as an observable target.
- Extending the method to measure how regularization evolves over training epochs could reveal time-varying bias effects not captured by static penalties.
Load-bearing premise
The deviation between actual weight updates and pure loss gradients can be matched to the gradient of some explicit penalty term.
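In symbols (our notation; the paper may parameterize this differently): if updates follow

```latex
\Delta w_t = -\eta\,\bigl(\nabla L(w_t) + \nabla R(w_t)\bigr),
\qquad
\hat{R} = \arg\min_{R \in \mathcal{R}} \sum_t \Bigl\| \tfrac{-\Delta w_t}{\eta} - \nabla L(w_t) - \nabla R(w_t) \Bigr\|^2,
```

where $\mathcal{R}$ is a chosen family of candidate penalties. The premise is that the residual left after subtracting $\nabla L$ is well explained by some $\nabla R$, rather than by structured noise.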
What would settle it
Running the method on early-stopped gradient descent and failing to recover a quadratic weight penalty would falsify the claim that the matching procedure identifies the implicit regularization.
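The early-stopping half of this test is cheap to prototype on least squares, where the implicit quadratic penalty is known analytically. In a sketch of ours (not the paper's experiment), the fitted ℓ2 strength at an early-stopped iterate can be read off the stationarity condition ∇L(w) + 2λw ≈ 0; it should be positive and shrink as training runs longer, consistent with the weakening quadratic penalty of early stopping:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
y = rng.normal(size=100)

def loss_grad(w):
    return 2 * X.T @ (X @ w - y) / len(y)

def fitted_l2_at_stop(steps, eta=0.01):
    # plain gradient descent from zero, stopped early -- no explicit penalty
    w = np.zeros(8)
    for _ in range(steps):
        w = w - eta * loss_grad(w)
    # fit the scalar L2 strength at which w would be stationary:
    # grad L(w) + 2*lam*w ~ 0  =>  lam = -(grad L . w) / (2 ||w||^2)
    g = loss_grad(w)
    return -(g @ w) / (2 * w @ w)

lam_early = fitted_l2_at_stop(50)    # stronger implicit penalty
lam_late = fitted_l2_at_stop(500)    # weaker penalty after longer training
```

A failure of this qualitative pattern (positive, decaying fitted strength) would be the kind of falsification the pith describes.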
Original abstract
Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes gradient matching methods to empirically estimate implicit regularization (implicit bias) in deep networks with complex training modifications. It validates the approach by recovering known explicit penalties (ℓ1, ℓ2) and known implicit effects (quadratic penalty from early stopping in GD), then applies it to characterize dropout as inducing implicit ℓ2 regularization, claiming the empirical method works for arbitrary networks where analytic derivation is intractable.
Significance. If the method can reliably isolate implicit bias from stochastic gradient noise, it would offer a practical, falsifiable tool for quantifying regularization effects in real training pipelines, aiding both theoretical understanding and hyperparameter interpretation. The grounding in recovery of independently known penalties is a strength, as is the focus on replicable effects like early stopping.
major comments (2)
- §5 (dropout experiments): the central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and the weights. This directly affects whether the dropout result supports the claim for complex networks.
- Method section (gradient matching procedure): the fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.
minor comments (3)
- [Abstract] The abstract states the method 'replicates known implicit effects' but does not specify the quantitative criterion used to declare successful replication (e.g., parameter recovery error, R² of the fit).
- [Method] Notation for the estimated penalty (e.g., how the equivalent regularization term is parameterized and optimized) should be introduced earlier and used consistently when reporting recovered coefficients for ℓ1/ℓ2 and dropout cases.
- [Figures (dropout results)] Figure captions for the dropout results should include error bars or multiple random seeds to indicate variability in the recovered penalty strength.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of validating our gradient matching approach under stochastic conditions. We agree that additional evidence is needed to confirm the method isolates implicit bias from noise in the dropout setting, and we will revise the manuscript accordingly to strengthen these claims.
Point-by-point responses
Referee: [§5 (dropout experiments)] the central claim that the method characterizes implicit regularization in arbitrary modified procedures (including dropout) is load-bearing on the assertion that gradient deviations can be matched to an equivalent penalty even when minibatch and dropout noise are present. The validation recovers penalties only in low-noise/deterministic cases (explicit ℓ1/ℓ2, early stopping); no ablation, noise-only baseline, or variance decomposition is reported to show the fitted ℓ2 term reflects bias rather than absorbing irreducible stochastic variance whose statistics depend on batch size, dropout rate, and weights. This directly affects whether the dropout result supports the claim for complex networks.
Authors: We agree that the current validation leaves open the question of whether the fitted ℓ2 term in the dropout experiments primarily captures implicit bias or absorbs stochastic variance. While the observed scaling of the recovered regularization strength with dropout rate is consistent with known theoretical predictions for dropout, we acknowledge the absence of targeted ablations. In the revision we will add a noise-only baseline (fitting to stochastic gradients from a network without dropout), ablations over batch size and dropout rate that track the stability of the recovered parameter, and reporting of residual fitting error to quantify the portion of variance explained by the bias term versus irreducible noise. These additions will directly test whether the procedure separates the two effects. revision: yes
Referee: [Method section (gradient matching procedure)] the fitting of an equivalent penalty term to observed update deviations assumes the deviation is a clean signal of implicit bias. No analysis is given of how the matching accuracy or recovered parameters degrade as a function of batch size or dropout rate, nor is there a quantitative metric (e.g., residual error after fitting, cross-validation against held-out updates) that would confirm the procedure separates bias from noise when both are simultaneously present.
Authors: The referee is correct that the manuscript does not yet provide a quantitative characterization of robustness to noise. We will expand the method section to include: (i) plots of recovered regularization strength and mean-squared residual error as functions of batch size and dropout probability; (ii) a cross-validation procedure that fits the penalty on one set of observed updates and evaluates predictive accuracy on held-out gradient deviations; and (iii) explicit discussion of the conditions (e.g., sufficient averaging over multiple runs) under which the matching procedure is expected to isolate bias. These changes will give readers clear guidance on the method's operating regime. revision: yes
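The proposed noise-only baseline is straightforward to prototype. A minimal simulation of ours (a linear model with minibatch SGD; not the paper's experiment): fit a scalar ℓ2 coefficient to update deviations with and without true weight decay. The baseline fit should sit near zero, while the weight-decay run should recover its coefficient despite the minibatch noise:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
w_star = rng.normal(size=5)
y = X @ w_star + 0.1 * rng.normal(size=200)

def full_grad(w):
    return 2 * X.T @ (X @ w - y) / len(y)

def fit_l2(lam_true, batch=20, eta=0.01, steps=2000):
    """Minibatch SGD with optional explicit L2; fit a scalar L2 coefficient
    to the deviations between observed updates and the full-batch gradient."""
    w, num, den = rng.normal(size=5), 0.0, 0.0
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch, replace=False)
        g_b = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w_new = w - eta * (g_b + 2 * lam_true * w)
        d = (w - w_new) / eta - full_grad(w)   # penalty gradient + minibatch noise
        num += d @ (2 * w)
        den += (2 * w) @ (2 * w)
        w = w_new
    return num / den

lam_noise = fit_l2(0.0)   # noise-only baseline: should sit near zero
lam_reg = fit_l2(0.3)     # with weight decay: should recover ~0.3
```

Averaging the fit over many steps is what suppresses the mean-zero minibatch noise here; whether the same holds for dropout noise in deep networks is exactly what the proposed ablations would test.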
Circularity Check
Empirical gradient-matching method is self-contained with no circular derivation steps
Full rationale
The paper introduces an empirical procedure for estimating implicit regularization via matching observed deviations between weight updates and loss gradients to equivalent penalty terms. Validation proceeds by recovering externally known explicit penalties (ℓ1, ℓ2) and independently established implicit effects (quadratic penalty from early stopping), none of which are defined or derived from the method's own equations. Application to dropout is presented as a characterization exercise rather than a closed prediction. No load-bearing step reduces by construction to a fitted parameter, self-citation, or ansatz imported from the authors' prior work; the central claim remains falsifiable against external benchmarks and does not loop back to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Implicit regularization can be represented as an equivalent penalty that augments the learning objective and causes observable deviations in weight updates.