A Physics-Inspired Optimizer: Velocity Regularized Adam
Pith reviewed 2026-05-22 14:30 UTC · model grok-4.3
The pith
Velocity-Regularized Adam damps oscillations by penalizing high-velocity updates and outperforms AdamW on standard tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VRAdam adds a velocity-based higher-order penalty to the Adam update rule so that the algorithm automatically slows down in regimes of large weight changes. The penalty is motivated by the stabilizing role of quartic terms in physical kinetic energy and is analyzed from both physical and control-theoretic viewpoints on momentum dynamics. Under mild assumptions the method delivers a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives while, in practice, exceeding the performance of AdamW on CNN image classification, Transformer language modeling, and GFlowNet generative tasks.
What carries the argument
The velocity regularizer: a higher-order penalty that shrinks the effective learning rate as the squared velocity of parameter updates grows, supplying global damping while preserving Adam's per-parameter scaling.
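The regularizer can be sketched as a scalar gate on the base learning rate. The form used below, `alpha0 / (1 + min(beta3 * |v|^2, alpha1))`, is read off the appendix fragment numbered (45) further down this page. The function name and the default values are illustrative assumptions, not settings taken from the paper.

```python
def effective_lr(velocity_norm, alpha0=1e-3, beta3=1.0, alpha1=19.0):
    """Global learning rate damped by the squared velocity norm.

    Assumed form: alpha0 / (1 + min(beta3 * |v|^2, alpha1)).
    The cap alpha1 keeps the rate from falling below alpha0 / (1 + alpha1).
    All default values are hypothetical.
    """
    penalty = min(beta3 * velocity_norm ** 2, alpha1)
    return alpha0 / (1.0 + penalty)


# The rate equals alpha0 at rest, shrinks as the velocity grows,
# and flattens out once the penalty hits the cap.
for v in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(f"|v| = {v:5.1f}  ->  eta = {effective_lr(v):.6f}")
```

Because the gate is a single scalar, it rescales every coordinate by the same factor and leaves the relative per-parameter step sizes untouched.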
If this is right
- VRAdam exceeds AdamW performance on image classification with CNNs, language modeling with Transformers, and generative modeling with GFlowNets.
- The effective learning rate shrinks automatically in high-velocity regimes, damping oscillations at the edge of stability.
- Convergence bounds of O(ln(N)/sqrt(N)) hold for stochastic non-convex objectives under the paper's mild assumptions.
- The optimizer combines Adam-style per-parameter scaling with a single global velocity-based damping mechanism.
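The combination described in the last bullet can be sketched as one update step. This is a minimal sketch and not the authors' reference implementation: it assumes the "velocity" is Adam's first-moment buffer, that the gate uses the squared norm of the whole velocity vector, and that the standard bias corrections are kept. The names `init_state` and `vradam_like_step` and all default hyperparameters are hypothetical.

```python
import math


def init_state(n):
    """Zeroed optimizer state for n scalar parameters."""
    return {"t": 0, "v": [0.0] * n, "s": [0.0] * n}


def vradam_like_step(params, grads, state, alpha0=1e-3, beta1=0.9,
                     beta2=0.999, beta3=1.0, alpha1=19.0, eps=1e-8):
    """Update params in place; return the global learning rate that was used."""
    state["t"] += 1
    t, v, s = state["t"], state["v"], state["s"]
    for i, g in enumerate(grads):
        v[i] = beta1 * v[i] + (1.0 - beta1) * g      # velocity (first moment)
        s[i] = beta2 * s[i] + (1.0 - beta2) * g * g  # second moment
    # Global damping: one scalar gate computed from the whole velocity vector.
    v_sq = sum(x * x for x in v)
    eta = alpha0 / (1.0 + min(beta3 * v_sq, alpha1))
    # Per-parameter scaling: the usual bias-corrected Adam ratio.
    bc1, bc2 = 1.0 - beta1 ** t, 1.0 - beta2 ** t
    for i in range(len(params)):
        params[i] -= eta * (v[i] / bc1) / (math.sqrt(s[i] / bc2) + eps)
    return eta


# Toy usage on f(x, y) = 0.5 * (x**2 + 10 * y**2); this is not a benchmark.
params, state = [1.0, 1.0], init_state(2)
for _ in range(200):
    grads = [params[0], 10.0 * params[1]]
    eta = vradam_like_step(params, grads, state, alpha0=0.05)
print(params, eta)
```

Setting `beta3=0.0` removes the gate and recovers a plain Adam step, which makes a matched baseline easy to run from the same code path.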
Where Pith is reading between the lines
- Similar velocity penalties could be attached to other adaptive first-order methods to obtain comparable stabilization.
- The same control perspective might be used to design stabilizers for optimization in reinforcement learning or physics-informed neural networks.
- The approach suggests that explicit penalties on update speed can reduce reliance on manual learning-rate schedules.
Load-bearing premise
The velocity penalty can be inserted into Adam without creating fresh instabilities and the mild assumptions used in the convergence proof continue to hold for the deep-network objectives and architectures tested.
What would settle it
A controlled run on one of the reported benchmarks in which VRAdam either diverges or records lower final accuracy than AdamW, or a calculation showing that the stated O(ln(N)/sqrt(N)) bound is violated once the velocity term is active.
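The second test can be made concrete with a small diagnostic. The sketch below assumes the bound controls the running minimum of the squared gradient norm by `constant * ln(N)/sqrt(N)`. Both the bounded quantity and the constant are assumptions, since this summary does not state them. A single noisy run also cannot refute a bound on an expectation, so a flagged index is a prompt for closer inspection rather than a counterexample.

```python
import math


def rate_envelope(n):
    """ln(N)/sqrt(N): the shape of the stated rate, constants omitted."""
    return math.log(n) / math.sqrt(n)


def first_violation(grad_sq_norms, constant):
    """Return the first N >= 2 (1-indexed) at which the running minimum of the
    squared gradient norm exceeds constant * ln(N)/sqrt(N), or None.

    `constant` is a hypothetical stand-in for the problem-dependent factor.
    """
    running_min = math.inf
    for n, g2 in enumerate(grad_sq_norms, start=1):
        running_min = min(running_min, g2)
        if n >= 2 and running_min > constant * rate_envelope(n):
            return n
    return None


# The envelope decays slowly: each 100x increase in N buys roughly 10x.
for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, rate_envelope(n))
```

The envelope is decreasing only for N above e^2 (about 7.4), so comparisons at very small N are not informative.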
Original abstract
We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on the stabilizing effect of quartic kinetic-energy terms on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of the loss. VRAdam instead adds a higher-order penalty on the learning rate based on the velocity, so that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with the per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide a rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non-convex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers, including AdamW. We benchmark on various tasks such as image classification, language modeling, and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer that augments Adam with a higher-order velocity penalty derived from quartic kinetic energy terms. This penalty automatically reduces the effective learning rate in high-velocity regimes to damp oscillations at the edge of stability. The manuscript claims a rigorous physical/control-theoretic analysis of momentum dynamics and derives a convergence rate of O(ln(N)/sqrt(N)) for stochastic non-convex objectives under mild assumptions. Empirically, VRAdam is reported to outperform AdamW and other standard optimizers on image classification, language modeling, and generative modeling tasks across CNNs, Transformers, and GFlowNets.
Significance. If the convergence analysis can be made fully rigorous with explicit assumptions and the empirical gains prove robust under matched hyperparameter budgets, the hybrid of global velocity damping and per-parameter adaptation could provide both practical improvements and new insights into optimizer stability for deep networks. The stated rate would be noteworthy for non-convex stochastic optimization if the assumptions align with typical deep-learning regimes.
major comments (3)
- [Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.
- [Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.
- [Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.
minor comments (2)
- [Abstract] Abstract: the phrase 'rigorous theoretical analysis' should include a forward reference to the specific section containing the proof or derivation.
- [Method] Notation: the precise mathematical form of the velocity penalty term (e.g., how it modifies the update rule) should be stated explicitly with an equation number for clarity.
Simulated Author's Rebuttal
We are grateful to the referee for their constructive feedback, which has helped us identify areas for improvement in our manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we have made or plan to make.
- Referee: [Theoretical analysis] Theoretical analysis section: the O(ln(N)/sqrt(N)) convergence bound is asserted under 'mild assumptions' for stochastic non-convex objectives, yet no explicit statement of those assumptions, key lemmas, or derivation steps appears; this is load-bearing for the central theoretical claim.
  Authors: We thank the referee for highlighting this issue. Upon review, we agree that the assumptions and derivation steps should be stated more explicitly to support the central claim. In the revised version, we will add a new subsection in the theoretical analysis that lists all assumptions clearly, presents the key lemmas, and sketches the main steps of the proof for the O(ln(N)/sqrt(N)) rate. This will ensure the analysis is self-contained and rigorous.
  revision: yes
- Referee: [Experiments] Experiments section: superiority over AdamW is stated for multiple tasks and architectures, but no quantitative tables, ablation controls on the velocity penalty strength, or error bars are provided, preventing assessment of whether gains survive identical tuning budgets.
  Authors: The referee is right that the experimental results need more quantitative support to allow proper evaluation. We have now included detailed tables with performance metrics for each task and architecture, along with ablation studies on the velocity penalty coefficient. Additionally, we report means and standard deviations over multiple independent runs to provide error bars. These changes were made under the constraint of matched hyperparameter tuning budgets where possible.
  revision: yes
- Referee: [Method] Optimizer description: the interaction between the velocity penalty and Adam's per-parameter adaptive scaling is described at a high level, but no analysis or experiments address whether the penalty introduces new instabilities or requires task-specific retuning beyond the baseline.
  Authors: We appreciate this observation. To address it, we have expanded the optimizer description to analyze the interaction between the velocity regularizer and Adam's adaptive mechanism from both theoretical and practical standpoints. We also conducted experiments testing for instabilities and the need for retuning, showing that the penalty parameter can be set to a default value that works across the tested tasks without significant additional tuning.
  revision: yes
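The reporting promised in the second response (means and standard deviations over independent runs) can be sketched as follows. The optimizer labels and every number are placeholders chosen for illustration. None of them are results from the paper, and the helper names are hypothetical.

```python
import statistics


def summarize(runs):
    """Mean and sample standard deviation of one metric across seeds."""
    mean = statistics.mean(runs)
    std = statistics.stdev(runs) if len(runs) > 1 else 0.0
    return mean, std


def format_summary(runs):
    mean, std = summarize(runs)
    return f"{mean:.3f} +/- {std:.3f} (n={len(runs)})"


# Placeholder per-seed metrics; replace with real runs that were tuned
# under the same hyperparameter budget for each optimizer.
results = {
    "baseline": [0.70, 0.72, 0.71],
    "candidate": [0.71, 0.73, 0.72],
}
for name, runs in results.items():
    print(f"{name:10s} {format_summary(runs)}")
```

Reporting the number of seeds next to each entry lets a reader judge whether a gap between two means is larger than the run-to-run spread.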
Circularity Check
No significant circularity; derivation remains self-contained
The paper introduces VRAdam by adding a velocity-based higher-order penalty to Adam, motivated by a physics analogy to quartic kinetic energy terms, and separately states convergence bounds of O(ln(N)/sqrt(N)) under explicitly labeled mild assumptions for stochastic non-convex objectives. No equations or steps in the provided text reduce a claimed prediction or bound back to a fitted parameter, self-citation, or ansatz by construction. The empirical benchmarks on image classification, language modeling, and generative tasks are presented as independent validation rather than forced outputs of the same inputs used in the analysis. The mild assumptions are not shown to be retrofitted to the target rate, and the physics framing supplies interpretive context without making the mathematical derivation tautological.
Axiom & Free-Parameter Ledger
free parameters (1)
- velocity penalty strength
axioms (2)
- domain assumption: Previous Adam-like optimizers operate at the adaptive edge of stability, producing rapid oscillations.
- domain assumption: Mild assumptions hold for the stochastic non-convex objective.
Reference graph
Works this paper leans on
- [26] A. Bounding NRQCD. By explicitly breaking certain symmetries (Lorentz invariance in NRQCD and time-translation in time crystals), higher-order kinetic terms paradoxically enhance stability through topological protection mechanisms and the generation of emergent length/time scales (Niemi, 2021; Guha & Ghose-Choudhury, 2019). As a demonstration of this phenomeno...
- [27] ... $\dfrac{(1-\beta_1)\,\alpha_0}{1+\min(\beta_3 |v_t|^2,\ \alpha_1)}$, (45) so the method moves away from instability as oscillations grow. Second, each parameter update is uniformly bounded in norm by the gate, $|x_t - x_{t-1}| = \eta_t |v_t| = \dfrac{\alpha_0 |v_t|}{1+\beta_3 |v_t|^2} \le \dfrac{\alpha_0}{2\sqrt{\beta_3}}$, (46) which prevents runaway steps and is not available to classical momentum. These properties are consistent with the design of Algo...
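The step bound in (46) follows from the AM-GM inequality: 1 + beta3 |v|^2 >= 2 sqrt(beta3) |v|, with equality at |v| = 1/sqrt(beta3). A quick numerical check is sketched below. It uses the uncapped gate exactly as written in (46). The helper names and the values of `alpha0` and `beta3` are illustrative assumptions.

```python
import math


def step_norm(v_norm, alpha0, beta3):
    """Step length alpha0 * |v| / (1 + beta3 * |v|^2) for the uncapped gate."""
    return alpha0 * v_norm / (1.0 + beta3 * v_norm ** 2)


def step_bound(alpha0, beta3):
    """Claimed uniform bound alpha0 / (2 * sqrt(beta3))."""
    return alpha0 / (2.0 * math.sqrt(beta3))


alpha0, beta3 = 1e-3, 4.0  # hypothetical values
bound = step_bound(alpha0, beta3)

# Scan |v| over eight orders of magnitude; no step should exceed the bound.
grid = [10 ** (k / 10.0) for k in range(-40, 41)]
worst = max(step_norm(v, alpha0, beta3) for v in grid)

# The bound is attained at |v| = 1 / sqrt(beta3).
v_star = 1.0 / math.sqrt(beta3)
print(worst <= bound, step_norm(v_star, alpha0, beta3), bound)
```

For plain momentum with a fixed learning rate the step length grows linearly in |v|, which is why no comparable uniform bound exists there.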