Recognition: no theorem link
The Effect of Mini-Batch Noise on the Implicit Bias of Adam
Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3
The pith
Mini-batch noise reverses whether Adam's higher β2 pushes toward sharper or flatter minima.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the case of large batch sizes, higher β2 increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on β2 is reversed. A similar monotonicity shift (in the opposite direction) happens in β1. The common default pair (β1, β2) = (0.9, 0.999) is a good choice when batches are small; for larger batches, moving β1 closer to β2 is much better in terms of validation accuracy in multi-epoch training. The batch-size scale at which the shift happens connects to the scale of the critical batch size.
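The claim turns on the amount of mini-batch gradient noise, whose variance shrinks roughly as 1/B with batch size B. A minimal numpy sketch of that scaling (toy linear-regression gradients; all sizes here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression setup: per-example gradients of 0.5 * (x_i @ w - y_i)^2.
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)
per_example_grads = (X @ w - y)[:, None] * X  # shape (n, d)
full_grad = per_example_grads.mean(axis=0)

def minibatch_grad_variance(batch_size, trials=2000):
    """Mean squared deviation of the mini-batch gradient from the full gradient."""
    devs = [
        np.sum((per_example_grads[rng.choice(n, batch_size, replace=False)].mean(axis=0)
                - full_grad) ** 2)
        for _ in range(trials)
    ]
    return np.mean(devs)

ratio = minibatch_grad_variance(8) / minibatch_grad_variance(128)
print(ratio)  # roughly 128 / 8 = 16, up to sampling error
```

The 1/B scaling is what makes batch size a knob on the noise term that the framework couples to β1 and β2.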
What carries the argument
The interaction between mini-batch noise and Adam's momentum memory parameters β1 and β2, which sets the direction and strength of implicit bias toward sharper or flatter loss minima.
If this is right
- Large-batch training benefits from setting β1 closer to β2 to reduce anti-regularization.
- Small-batch training performs well with the standard default values β1=0.9 and β2=0.999.
- The batch-size scale at which the monotonicity reversal occurs tracks the critical batch size.
- The effect is observable in the about-to-overfit multi-epoch regime on small-scale data.
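The paper's own definition of the critical batch size is not given in this excerpt; one common operationalization, which may differ from the authors', is the "simple" gradient noise scale tr(Σ)/‖g‖² computed from per-example gradients. A hypothetical sketch with synthetic gradients standing in for backprop output:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example gradients; in practice these come from backprop.
n, d = 5000, 10
G = rng.normal(loc=0.1, scale=1.0, size=(n, d))

g_full = G.mean(axis=0)                       # full-batch gradient
sigma_tr = G.var(axis=0, ddof=1).sum()        # tr(Sigma): per-example gradient covariance trace
B_simple = sigma_tr / np.dot(g_full, g_full)  # "simple" gradient noise scale

print(B_simple)  # batch sizes well below this estimate are noise-dominated
```

If the paper's reversal threshold tracks a quantity like this, the monotonicity flip should move with the data's noise-to-signal ratio, which is a testable consequence.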
Where Pith is reading between the lines
- Batch size should be treated as a first-class hyperparameter when choosing momentum values for Adam.
- The reversal implies that noise can counteract memory-driven anti-regularization in ways standard bias analyses miss.
- Similar batch-size-dependent reversals may appear in other momentum-based adaptive methods.
- Dynamic schedules that adjust β1 or β2 as effective batch size changes could improve generalization.
Load-bearing premise
That the interaction between mini-batch noise and momentum memory can be isolated from other optimization dynamics, and that the flatness of the minima reached correlates with generalization in the multi-epoch regime.
What would settle it
An experiment that trains Adam with fixed β2 across a range of batch sizes, measures the curvature of the reached minima, and finds either no reversal in the β2-flatness dependence or no connection between that curvature and validation accuracy.
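That settling experiment can be sketched on a toy problem: train Adam with fixed (β1, β2) at several batch sizes, then measure curvature at the endpoint. Everything below (data, model, sizes) is a hypothetical stand-in for the paper's actual protocol, and the exact logistic-loss Hessian substitutes for whatever curvature measure the authors use:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-in for the proposed experiment; sizes and data are illustrative.
n, d = 512, 20
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -30, 30)))  # clip to avoid overflow

def train_adam(batch_size, steps=2000, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with fixed (beta1, beta2) on mini-batch logistic-regression gradients."""
    w = np.zeros(d); m = np.zeros(d); v = np.zeros(d)
    for t in range(1, steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        g = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / batch_size
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        w -= lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return w

def top_hessian_eig(w):
    """Largest eigenvalue of the exact logistic-loss Hessian, as a curvature proxy."""
    p = sigmoid(X @ w)
    H = (X * (p * (1 - p))[:, None]).T @ X / n
    return np.linalg.eigvalsh(H)[-1]

eigs = {B: top_hessian_eig(train_adam(B)) for B in (8, 256)}
print(eigs)  # endpoint curvature for a small and a large batch size
```

A full falsification run would sweep β2 and batch size jointly and check both the reversal in the curvature dependence and its link to validation accuracy.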
read the original abstract
With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the commonly "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a theoretical framework to analyze how mini-batch noise influences the implicit bias induced by Adam's momentum parameters (β1, β2) toward sharper or flatter regions of the loss landscape in multi-epoch training. It claims that the dependence of this (anti-)regularization effect on β2 reverses with batch size: higher β2 increases anti-regularization for large batches but the dependence reverses for smaller batches, with an opposite monotonicity shift for β1. The framework connects the reversal scale to the critical batch size, and the practical implication is that the default (0.9, 0.999) pair is suitable for small batches while moving β1 closer to β2 is preferable for larger batches. These predictions are illustrated via experiments on small-scale data in the about-to-overfit regime.
Significance. If the isolation of noise-memory coupling holds with the required domination bounds, the result supplies a principled, batch-size-dependent rule for tuning Adam hyperparameters to improve generalization in multi-epoch regimes that are regaining importance under data constraints. The explicit linkage of the reversal point to critical batch size constitutes a falsifiable prediction and a strength of the work. The significance remains conditional on verifying that the modeling assumptions do not introduce artifacts near the critical scale.
major comments (3)
- [Theoretical Framework] The monotonicity reversal for β2 (and the opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.
- [Theoretical Framework] No domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.
- [Experiments] The experiments are characterized only as “small-scale” and “about-to-overfit”, with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.
minor comments (2)
- [Abstract] The phrase “anti-regularization by memory” is used without a concise definition or pointer to the relevant equation, which may hinder readers outside the implicit-bias literature.
- [Notation] Introduce the precise definition of the critical batch size and its relation to the reversal threshold in a dedicated paragraph or equation early in the theoretical section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, committing to revisions that add derivation details, regime bounds, and experimental clarifications while preserving the core contributions.
read point-by-point responses
- Referee: [Theoretical Framework] The monotonicity reversal for β2 (and the opposite shift for β1) is stated without derivation details, error bounds, or explicit assumptions on the loss landscape; it is therefore impossible to verify whether the predicted reversal is independent of the data used to illustrate it or reduces to a quantity fitted from the same observations.
  Authors: We will include the complete derivation in the appendix, starting from the Adam second-moment update with additive mini-batch noise. The assumptions are a local quadratic loss approximation and bounded noise variance. The reversal follows analytically from the closed-form bias term coupling the noise scale to β2, with the transition point tied directly to the critical batch size; no data fitting is involved. Error bounds on the approximation will be stated explicitly. Revision: yes.
- Referee: [Theoretical Framework] No domination bounds or regime conditions are supplied to guarantee that the mini-batch noise–momentum memory interaction dominates curvature evolution, gradient alignment changes, and multi-epoch landscape drift; without these, the reversal could be an artifact of the isolation choice rather than a robust prediction.
  Authors: We agree and will add a dedicated subsection deriving explicit regime conditions. These include the noise variance dominating the curvature evolution rate (by a factor Ω(1/√B)) and memory decay outpacing alignment drift over epochs. The bounds confirm that the noise–memory term governs the reversal within the stated regime, ruling out isolation artifacts. Revision: yes.
- Referee: [Experiments] The experiments are characterized only as “small-scale” and “about-to-overfit”, with no reported controls for post-hoc hyperparameter choices or checks that the observed correlation between flatness and generalization persists outside this narrow regime, weakening support for the practical recommendation on default β values.
  Authors: The experiments target the about-to-overfit regime where implicit bias is most visible in multi-epoch settings. We will expand the section with full hyperparameter grid results to eliminate post-hoc concerns and add explicit discussion of regime limitations. The practical β recommendations rest primarily on the theory; the experiments remain illustrative, and broader validation is noted as future work. Revision: partial.
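The rebuttal's stated setting (local quadratic loss, additive mini-batch noise with bounded variance) is easy to simulate directly. The sketch below runs Adam on a noisy diagonal quadratic and reports the stationary mean-squared distance from the minimum for two β2 values; it does not reproduce the promised closed-form bias term, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

h = np.array([10.0, 1.0, 0.1])  # Hessian eigenvalues of the local quadratic (illustrative)

def stationary_msd(beta2, beta1=0.9, noise_std=0.5, lr=1e-2,
                   steps=20_000, burn=10_000, eps=1e-8):
    """Run Adam on grad = H w + noise; average ||w||^2 after a burn-in period."""
    w = np.ones_like(h); m = np.zeros_like(h); v = np.zeros_like(h)
    acc, k = 0.0, 0
    for t in range(1, steps + 1):
        g = h * w + noise_std * rng.normal(size=h.size)  # bounded-variance mini-batch noise
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        w = w - lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
        if t > burn:
            acc += float(w @ w); k += 1
    return acc / k

msds = {b2: stationary_msd(b2) for b2 in (0.9, 0.999)}
print(msds)  # stationary mean-squared distance from the minimum, per beta2
```

Comparing such simulations against the appendix derivation, once released, would be a direct check that the claimed bias term is not an artifact of the quadratic approximation.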
Circularity Check
Derivation chain self-contained without reduction to inputs
full rationale
The paper introduces a theoretical framework analyzing mini-batch noise effects on Adam's momentum memory (β1, β2) and its implicit bias toward flat or sharp regions, deriving monotonicity reversals in the β dependence as batch size varies and linking the transition scale to the critical batch size. No equations, self-citations, fitted parameters presented as predictions, or imported uniqueness theorems appear in the provided text that would reduce the central claims to their inputs by construction. The isolation of the noise–memory coupling is stated as an assumption within the framework rather than as a self-referential fit, leaving the derivations independent and self-contained.