A Physics-Inspired Optimizer: Velocity Regularized Adam

Lucas Schorling; Michael A. Osborne; Natalia Ares; Pranav Vaidhyanathan

arXiv: 2505.13196 · v2 · submitted 2025-05-19 · 💻 cs.LG · cs.AI · quant-ph


Pith reviewed 2026-05-22 14:30 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · quant-ph

keywords: velocity regularized adam · adam optimizer · edge of stability · convergence bounds · deep neural networks · physics-inspired optimization · non-convex stochastic optimization


The pith

Velocity-Regularized Adam damps oscillations by penalizing high-velocity updates and outperforms AdamW on standard tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Velocity-Regularized Adam, or VRAdam, which adds a higher-order penalty drawn from quartic kinetic energy terms to the Adam optimizer. This penalty automatically shrinks the effective learning rate whenever parameter updates grow large, reducing the rapid oscillations that arise when Adam operates at the edge of stability. The resulting hybrid combines global velocity damping with Adam's per-parameter adaptation. If the method works as described, it yields both stronger empirical performance across image classification, language modeling, and generative modeling and a convergence guarantee of order O(ln(N)/sqrt(N)) for stochastic non-convex problems under mild assumptions.

Core claim

VRAdam adds a velocity-based higher-order penalty to the Adam update rule so that the algorithm automatically slows down in regimes of large weight changes. The penalty is motivated by the stabilizing role of quartic terms in physical kinetic energy and is analyzed from both physical and control-theoretic viewpoints on momentum dynamics. Under mild assumptions the method delivers a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives while, in practice, exceeding the performance of AdamW on CNN image classification, Transformer language modeling, and GFlowNet generative tasks.

What carries the argument

The velocity regularizer: a higher-order penalty that shrinks the effective learning rate as the squared velocity of parameter updates grows, supplying global damping while preserving Adam's per-parameter scaling.
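To make the mechanism concrete, here is a minimal NumPy sketch of one velocity-regularized Adam-style step. It is an illustration under stated assumptions, not the authors' implementation: the gate form `alpha0 / (1 + min(beta3 * |v|^2, alpha1))` and the symbols `alpha0`, `beta3`, `alpha1` are taken from the paper excerpts numbered (45) and (46) in the reference graph of this page, while the choice of the first-moment buffer as the "velocity", the default values, and the helper names `init_state` and `vradam_step` are hypothetical.

```python
import numpy as np


def init_state(theta):
    """Fresh optimizer state: step counter plus first and second moment buffers."""
    return {"t": 0, "m": np.zeros_like(theta), "s": np.zeros_like(theta)}


def vradam_step(theta, grad, state, alpha0=1e-3, beta1=0.9, beta2=0.999,
                beta3=1.0, alpha1=19.0, eps=1e-8):
    """One illustrative velocity-regularized Adam-style update.

    Assumptions (not confirmed by the summary above):
      * the "velocity" is the first-moment (momentum) buffer m,
      * the global gate is alpha0 / (1 + min(beta3 * ||m||^2, alpha1)),
      * the default hyperparameter values are placeholders.
    """
    state["t"] += 1
    t = state["t"]

    # Standard Adam moment estimates with bias correction.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["s"] = beta2 * state["s"] + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** t)
    s_hat = state["s"] / (1.0 - beta2 ** t)

    # Global velocity gate: one scalar learning rate for all parameters,
    # smaller when the squared speed of the momentum buffer is larger.
    speed_sq = float(np.sum(state["m"] ** 2))
    eta = alpha0 / (1.0 + min(beta3 * speed_sq, alpha1))

    # Per-parameter Adam scaling is kept; only the scalar step size changes.
    theta_new = theta - eta * m_hat / (np.sqrt(s_hat) + eps)
    return theta_new, eta


if __name__ == "__main__":
    # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
    theta = np.array([3.0, -4.0])
    state = init_state(theta)
    for _ in range(5):
        theta, eta = vradam_step(theta, theta.copy(), state)
        print(f"eta={eta:.3e}  theta={theta}")
```

In this sketch the cap `alpha1` keeps the gate from shrinking the learning rate below `alpha0 / (1 + alpha1)`, so the damping is global and bounded while the division by `sqrt(s_hat)` remains per-parameter.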

If this is right

VRAdam exceeds AdamW performance on image classification with CNNs, language modeling with Transformers, and generative modeling with GFlowNets.
The effective learning rate shrinks automatically in high-velocity regimes, damping oscillations at the edge of stability.
Convergence bounds of O(ln(N)/sqrt(N)) hold for stochastic non-convex objectives under the paper's mild assumptions.
The optimizer combines Adam-style per-parameter scaling with a single global velocity-based damping mechanism.
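The automatic shrinkage claimed in the list above can be illustrated with a short worked bound. The derivation assumes the uncapped gate $\eta_t = \alpha_0 / (1 + \beta_3 |v_t|^2)$, the form that appears in the paper excerpt numbered (46) in the reference graph of this page; the symbol $s$ for the step length is introduced here for illustration only.

```latex
\begin{align*}
s(|v_t|) &= \eta_t\,|v_t| = \frac{\alpha_0\,|v_t|}{1+\beta_3 |v_t|^2}, \\
\frac{ds}{d|v_t|} &= \frac{\alpha_0\,\bigl(1-\beta_3 |v_t|^2\bigr)}{\bigl(1+\beta_3 |v_t|^2\bigr)^2} = 0
  \iff |v_t| = \frac{1}{\sqrt{\beta_3}}, \\
\max_{|v_t|\ge 0} s(|v_t|) &= \frac{\alpha_0/\sqrt{\beta_3}}{1+1} = \frac{\alpha_0}{2\sqrt{\beta_3}}.
\end{align*}
```

Under that assumption the step length first grows with the velocity, peaks at $|v_t| = 1/\sqrt{\beta_3}$, and then decays, which is one way to read "slows down whenever weight updates become large".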

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar velocity penalties could be attached to other adaptive first-order methods to obtain comparable stabilization.
The same control perspective might be used to design stabilizers for optimization in reinforcement learning or physics-informed neural networks.
The approach suggests that explicit penalties on update speed can reduce reliance on manual learning-rate schedules.

Load-bearing premise

The velocity penalty can be inserted into Adam without creating fresh instabilities and the mild assumptions used in the convergence proof continue to hold for the deep-network objectives and architectures tested.

What would settle it

A controlled run on one of the reported benchmarks in which VRAdam either diverges or records lower final accuracy than AdamW, or a calculation showing that the stated O(ln(N)/sqrt(N)) bound is violated once the velocity term is active.
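One hedged way to operationalize the second test is to compare measured values of the bounded quantity against the envelope `c * ln(N) / sqrt(N)`. The sketch below is only a bookkeeping aid: this summary does not state which quantity the bound controls or what the constant is, so the constant `c`, the measurement list, and the helper names `rate_envelope` and `violations` are all hypothetical.

```python
import math


def rate_envelope(n, c=1.0):
    """Value of c * ln(n) / sqrt(n); c is an unknown, user-supplied constant."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) is positive")
    return c * math.log(n) / math.sqrt(n)


def violations(measurements, c=1.0):
    """Return the (N, value) pairs whose value exceeds the envelope at N.

    `measurements` is an iterable of (N, value) pairs, where `value` stands for
    whatever quantity the convergence bound controls (an assumption here).
    """
    return [(n, value) for n, value in measurements if value > rate_envelope(n, c)]


if __name__ == "__main__":
    # Made-up numbers purely to show the call pattern, not experimental data.
    fake_runs = [(100, 0.10), (10_000, 1.00)]
    print(violations(fake_runs, c=1.0))
```

An empty result for one value of `c` does not prove the bound, and a non-empty result only refutes it for that particular `c`; the check is meaningful once the paper's explicit constant and assumptions are plugged in.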

Figures

Figures reproduced from arXiv: 2505.13196 by Lucas Schorling, Michael A. Osborne, Natalia Ares, Pranav Vaidhyanathan.

**Figure 2.** (a) Training loss curves for VRAdam, Adam, and SAM (Foret et al., 2021) of ResNet 32 on CIFAR-10; (b) training accuracy curves; (c) plot of maximal eigenvalues of the loss Hessian; (d) effective learning rate during training. Hyperparameters for these plots are provided in Appendix D.4.

**Figure 3.** Train (left) and validation (right) loss curves with error envelopes calculated using different …

**Figure 4.** Train (left) and validation (right) loss curves with error envelopes calculated using different …

**Figure 5.** Train (left) and validation (right) loss curves calculated using different run values for …

**Figure 6.** Train (left) and validation (right) loss curves with error envelopes calculated using different …

**Figure 7.** Train (right) curves with error envelopes calculated using different run values for image …

**Figure 8.** Train (right) curves with error envelopes calculated using different run values for image …

**Figure 9.** Train (right) curves with error envelopes calculated using different run values for image …

**Figure 10.** Train (right) curves with error envelopes calculated using different run values for image …

Original abstract

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, and damping oscillations. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non convex objective under mild assumptions. We demonstrate that VRAdam exceeds the performance against standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VRAdam adds a velocity penalty to Adam for damping at the edge of stability and claims an O(ln N / sqrt N) rate, but the assumptions and experiment details look thin.


The main point is that this paper proposes VRAdam, which layers a higher-order penalty on velocity onto standard Adam. The penalty comes from a quartic kinetic energy analogy and is meant to shrink the effective learning rate when updates grow large, reducing oscillations during training at the adaptive edge of stability. They also give a convergence bound of O(ln(N)/sqrt(N)) for stochastic non-convex problems under mild assumptions and report better results than AdamW on image classification, language modeling, and generative tasks with CNNs, transformers, and GFlowNets.

Referee Report

3 major / 2 minor

Summary. The paper introduces Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer that augments Adam with a higher-order velocity penalty derived from quartic kinetic energy terms. This penalty automatically reduces the effective learning rate in high-velocity regimes to damp oscillations at the edge of stability. The manuscript claims a rigorous physical/control-theoretic analysis of momentum dynamics and derives a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives under mild assumptions. Empirically, VRAdam is reported to outperform AdamW and other standard optimizers on image classification, language modeling, and generative modeling tasks across CNNs, Transformers, and GFlowNets.

Significance. If the convergence analysis can be made fully rigorous with explicit assumptions and the empirical gains prove robust under matched hyperparameter budgets, the hybrid of global velocity damping and per-parameter adaptation could provide both practical improvements and new insights into optimizer stability for deep networks. The stated rate would be noteworthy for non-convex stochastic optimization if the assumptions align with typical deep-learning regimes.

major comments (3)

[Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.
[Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.
[Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.

minor comments (2)

[Abstract] Abstract: the phrase 'rigorous theoretical analysis' should include a forward reference to the specific section containing the proof or derivation.
[Method] Notation: the precise mathematical form of the velocity penalty term (e.g., how it modifies the update rule) should be stated explicitly with an equation number for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we have made or plan to make.


Referee: [Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.

Authors: We thank the referee for highlighting this issue. Upon review, we agree that the assumptions and derivation steps should be stated more explicitly to support the central claim. In the revised version, we will add a new subsection in the theoretical analysis that lists all assumptions clearly, presents the key lemmas, and sketches the main steps of the proof for the O(ln(N)/sqrt(N)) rate. This will ensure the analysis is self-contained and rigorous.
Revision: yes

Referee: [Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.

Authors: The referee is right that the experimental results need more quantitative support to allow proper evaluation. We have now included detailed tables with performance metrics for each task and architecture, along with ablation studies on the velocity penalty coefficient. Additionally, we report means and standard deviations over multiple independent runs to provide error bars. These changes were made under the constraint of matched hyperparameter tuning budgets where possible.
Revision: yes

Referee: [Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.

Authors: We appreciate this observation. To address it, we have expanded the optimizer description to analyze the interaction between the velocity regularizer and Adam's adaptive mechanism from both theoretical and practical standpoints. We also conducted experiments testing for instabilities and the need for retuning, showing that the penalty parameter can be set to a default value that works across the tested tasks without significant additional tuning.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The paper introduces VRAdam by adding a velocity-based higher-order penalty to Adam, motivated by a physics analogy to quartic kinetic energy terms, and separately states convergence bounds of O(ln(N)/sqrt(N)) under explicitly labeled mild assumptions for stochastic non-convex objectives. No equations or steps in the provided text reduce a claimed prediction or bound back to a fitted parameter, self-citation, or ansatz by construction. The empirical benchmarks on image classification, language modeling, and generative tasks are presented as independent validation rather than forced outputs of the same inputs used in the analysis. The mild assumptions are not shown to be retrofitted to the target rate, and the physics framing supplies interpretive context without making the mathematical derivation tautological.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claims rest on the existence of an adaptive edge-of-stability regime in standard Adam training and on mild assumptions sufficient for the stated convergence rate; no new particles or dimensions are introduced.

free parameters (1)

velocity penalty strength
The coefficient controlling the higher-order velocity term is a tunable hyperparameter whose value is not derived from first principles.
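Because this coefficient is tuned rather than derived, a quick sensitivity table can help when choosing it. The sketch below assumes the uncapped gate `alpha0 / (1 + beta3 * speed**2)` that appears in the paper excerpt numbered (46) in the reference graph of this page; the function names `effective_lr` and `sweep`, the default `alpha0`, and the grid of values are hypothetical choices for illustration.

```python
def effective_lr(speed, alpha0=1e-3, beta3=1.0):
    """Uncapped gate alpha0 / (1 + beta3 * speed**2); defaults are illustrative."""
    return alpha0 / (1.0 + beta3 * speed ** 2)


def sweep(beta3_values, speeds, alpha0=1e-3):
    """Map each beta3 to the list of effective learning rates at the given speeds."""
    return {b: [effective_lr(s, alpha0, b) for s in speeds] for b in beta3_values}


if __name__ == "__main__":
    table = sweep(beta3_values=(0.0, 0.1, 1.0, 10.0), speeds=(0.0, 0.5, 1.0, 2.0))
    for beta3, row in table.items():
        print(f"beta3={beta3:>5}: " + "  ".join(f"{lr:.2e}" for lr in row))
```

Setting `beta3 = 0` recovers a constant learning rate in this sketch, so the coefficient interpolates between a plain Adam-style step size and increasingly aggressive damping at high speed.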

axioms (2)

domain assumption: Previous Adam-like optimizers operate at the adaptive edge of stability, producing rapid oscillations.
Invoked to motivate the need for the velocity regularizer.
domain assumption: Mild assumptions hold for the stochastic non-convex objective.
Required to obtain the O(ln(N)/sqrt(N)) convergence bound.



Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 10 internal anchors

[1]

Benoît Assi, Bernd A

URL https://arxiv.org/abs/2205.09745. Benoît Assi, Bernd A. Kniehl, and Joan Soto. Matching the standard model to heavy-quark effective theory and nonrelativistic QCD. Nuclear Physics B, 992:116173, July

work page arXiv
[2]

URLhttp://dx.doi.org/10.1016/j

doi: 10.1016/j.nuclphysb.2023.116173. URLhttp://dx.doi.org/10.1016/j. nuclphysb.2023.116173. Yehonatan Avidan, Qianyi Li, and Haim Sompolinsky. Connecting ntk and nngp: A unified theo- retical framework for wide neural network learning dynamics,

work page doi:10.1016/j.nuclphysb.2023.116173 2023
[3]

org/abs/2309.04522

URLhttps://arxiv. org/abs/2309.04522. 9 Michael Betancourt, Michael I. Jordan, and Ashia C. Wilson. On symplectic optimization.arXiv preprint arXiv:1802.03653,

work page arXiv
[4]

URL https://arxiv.org/abs/hep-ph/9702225. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher ...

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Language Models are Few-Shot Learners

URL https://arxiv.org/abs/2005.14165. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in neural information processing systems, 36:49205–49233,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[6]

org/abs/2103.00065

URLhttps://arxiv. org/abs/2103.00065. Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, and Justin Gilmer. Adaptive gradient methods at the edge of stability,

work page arXiv
[7]

Alex Damian, Eshaan Nichani, and Jason D Lee

URLhttps://arxiv.org/abs/ 2207.14484. Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability.arXiv preprint arXiv:2209.15594,

work page arXiv
[8]

doi: 10.1080/10556788.2023.2214837. Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of Adam and Adagrad,

work page doi:10.1080/10556788.2023 2023
[9]

Benjamin Ellis, Matthew T Jackson, Andrei Lupu, Alexander D Goldie, Mattie Fellows, Shimon Whiteson, and Jakob Foerster

URLhttps://arxiv.org/abs/2003.02395. Benjamin Ellis, Matthew T Jackson, Andrei Lupu, Alexander D Goldie, Mattie Fellows, Shimon Whiteson, and Jakob Foerster. Adam on local time: Addressing nonstationarity in rl with relative adam timesteps.Advances in Neural Information Processing Systems, 37:134567–134590,

work page arXiv 2003
[10]

Sepp Hochreiter and Jürgen Schmidhuber

URL https://arxiv.org/abs/1911.11626. Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42,

work page arXiv 1911
[11]

Peter Holderrieth, Yilun Xu, and Tommi Jaakkola

doi: 10.1162/neco.1997.9.1.1. Peter Holderrieth, Yilun Xu, and Tommi Jaakkola. Hamiltonian score matching and generative flows,

work page doi:10.1162/neco.1997.9.1.1 1997
[12]

URLhttps://arxiv.org/abs/2410.20470. John J. Hopfield. Neural networks and physical systems with emergent collective computational abilities.Proceedings of the National Academy of Sciences, 79(8):2554–2558,

work page arXiv
[13]

Adam: A Method for Stochastic Optimization

URL https://arxiv.org/abs/1412.6980. Alex Krizhevsky and Geoffrey E. Hinton. Learning multiple layers of features from tiny im- ages. Technical Report TR-2009-003, University of Toronto,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[14]

Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024a

URLhttps://www.cs. toronto.edu/˜kriz/learning-features-2009-TR.pdf. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training.arXiv preprint arXiv:2305.14342,

work page arXiv 2009
[15]

On the variance of the adaptive learning rate and beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In8th International Conference on Learning Representations, ICLR 2020,

work page 2020
[16]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Hamiltonian Descent Methods

Chris J. Maddison, Daniel Paulin, Yee Whye Teh, Brendan O’Donoghue, and Arnaud Doucet. Hamiltonian descent methods.arXiv preprint arXiv:1809.05042,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf

URL https://arxiv.org/abs/2109.06091. Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track,

work page arXiv
[19]

On the Convergence of Adam and Beyond

ISSN 0893-6080. doi: https://doi.org/10.1016/S0893-6080(98) 00116-6. URLhttps://www.sciencedirect.com/science/article/pii/ S0893608098001166. Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond.arXiv preprint arXiv:1904.09237,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/s0893-6080(98 1904
[20]

An overview of gradient descent optimization algorithms

URLhttps: //arxiv.org/abs/1609.04747. Alfred Shapere and Frank Wilczek. Classical time crystals.Physical Review Letters, 109(16), October

work page internal anchor Pith review Pith/arXiv arXiv
[21]

doi: 10.1103/physrevlett.109.160402

ISSN 1079-7114. doi: 10.1103/physrevlett.109.160402. URLhttp://dx. doi.org/10.1103/PhysRevLett.109.160402. 11 Minhak Song and Chulhee Yun. Trajectory alignment: understanding the edge of stability phe- nomenon via bifurcation theory.arXiv preprint arXiv:2307.04204,

work page doi:10.1103/physrevlett.109.160402
[22]

Attention Is All You Need

URLhttps://arxiv. org/abs/1706.03762. Jing Wang and Anna Choromanska. A survey of optimization methods for training dl models: The- oretical perspective on convergence and generalization,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Zixuan Wang, Zhouzi Li, and Jian Li

URLhttps://arxiv.org/ abs/2501.14458. Zixuan Wang, Zhouzi Li, and Jian Li. Analyzing sharpness along gd trajectory: Progressive sharp- ening and edge of stability.Advances in Neural Information Processing Systems, 35:9983–9994,

work page arXiv
[24]

Large Batch Training of Convolutional Networks

URLhttps://arxiv.org/abs/1708.03888. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learn- ing: Training bert in 76 minutes,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

URLhttps://arxiv.org/abs/1904.00962. Ya-xiang Yuan and Yi Zhang. Symplectic discretization approach for developing new proximal point algorithm.arXiv preprint arXiv:2308.03986,

work page internal anchor Pith review Pith/arXiv arXiv 1904
[26]

A BOUNDINGNRQCD By explicitly breaking certain symmetries—Lorentz invariance in NRQCD and time-translation in time crystals— higher-order kinetic terms paradoxically enhance stability through topological protection mechanisms and the generation of emergent length/time scales Niemi (2021); Guha & Ghose-Choudhury (2019). As a demonstration of this phenomeno...

work page 2021
[27]

$\dfrac{(1-\beta_1)\,\alpha_0}{1 + \min(\beta_3 |v_t|^2,\ \alpha_1)}, \quad (45)$ so the method moves away from instability as oscillations grow. Second, each parameter update is uniformly bounded in norm by the gate, $|x_t - x_{t-1}| = \eta_t |v_t| = \dfrac{\alpha_0 |v_t|}{1 + \beta_3 |v_t|^2} \le \dfrac{\alpha_0}{2\sqrt{\beta_3}}, \quad (46)$ which prevents runaway steps and is not available to classical momentum. These properties are consistent with the design of Algo...

work page 2022
