Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
Pith reviewed 2026-05-20 07:47 UTC · model grok-4.3
The pith
In zero-sum games the first- and second-order momentum terms of Adam-DA reverse the convergence roles they play in minimization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By taking the continuous-time limit of the Adam-DA iterates, the authors obtain a system of ODEs whose equilibria and stability properties can be studied directly. Analysis of these ODEs shows that raising the first-order momentum coefficient destabilizes the saddle while raising the second-order coefficient stabilizes it; the signs of these effects are reversed relative to the standard minimization case. The same ODEs also reveal an implicit regularization term whose form depends on the momentum parameters in the opposite manner from gradient descent.
What carries the argument
The system of ordinary differential equations obtained as the continuous-time limit of the Adam-DA discrete updates.
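For orientation, the discrete updates whose limit is taken can be written out explicitly. The block below is a sketch, assuming the textbook Adam rule applied simultaneously as descent in x and ascent in y on min_x max_y f(x, y). The step size h, the bias correction, the placement of ε, and the simultaneous (rather than alternating) ordering are all assumptions; the paper's exact formulation is not shown on this page and may differ.

```latex
\begin{aligned}
g^x_t &= \nabla_x f(x_t, y_t), &
g^y_t &= \nabla_y f(x_t, y_t), \\
m^x_{t+1} &= \beta_1 m^x_t + (1-\beta_1)\, g^x_t, &
m^y_{t+1} &= \beta_1 m^y_t + (1-\beta_1)\, g^y_t, \\
v^x_{t+1} &= \beta_2 v^x_t + (1-\beta_2)\, (g^x_t)^2, &
v^y_{t+1} &= \beta_2 v^y_t + (1-\beta_2)\, (g^y_t)^2, \\
x_{t+1} &= x_t - h\, \frac{\hat m^x_{t+1}}{\sqrt{\hat v^x_{t+1}} + \epsilon}, &
y_{t+1} &= y_t + h\, \frac{\hat m^y_{t+1}}{\sqrt{\hat v^y_{t+1}} + \epsilon},
\end{aligned}
\qquad
\hat m_{t+1} = \frac{m_{t+1}}{1-\beta_1^{\,t+1}}, \quad
\hat v_{t+1} = \frac{v_{t+1}}{1-\beta_2^{\,t+1}}.
```

Squares and divisions are elementwise, and m_0 = v_0 = 0. In this sketch the two players differ only in the sign of the step.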
If this is right
- Local convergence of Adam-DA to a saddle can be read off from the eigenvalues of the linearized ODE.
- The implicit regularization induced by Adam-DA in games takes the opposite functional form from the regularization induced in minimization.
- Tuning guidelines for Adam-DA in GANs should invert the usual momentum recommendations.
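The eigenvalue criterion in the first point above can be illustrated with a minimal Python sketch. It does not use the paper's Adam-DA ODE, which is not reproduced on this page. Instead it assumes the plain gradient descent-ascent flow on two toy games, f(x, y) = xy and f(x, y) = (a/2)x² + xy − (a/2)y² with a = 0.1, and checks whether every eigenvalue of the linearization has negative real part. The function names are hypothetical.

```python
import cmath


def eig2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from the characteristic polynomial."""
    tr = a + d
    det = a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2


def is_locally_stable(jac, tol=1e-12):
    """True if all eigenvalues of the 2x2 Jacobian have negative real part."""
    (a, b), (c, d) = jac
    return all(lam.real < -tol for lam in eig2x2(a, b, c, d))


# Assumed toy linearizations of the descent-ascent flow
# x' = -df/dx, y' = +df/dy at the saddle (0, 0).
bilinear_gda = ((0.0, -1.0), (1.0, 0.0))   # f = x*y
damped_gda = ((-0.1, -1.0), (1.0, -0.1))   # f = 0.05*x**2 + x*y - 0.05*y**2

if __name__ == "__main__":
    for name, jac in (("bilinear", bilinear_gda), ("damped", damped_gda)):
        print(name, eig2x2(jac[0][0], jac[0][1], jac[1][0], jac[1][1]),
              "stable" if is_locally_stable(jac) else "not asymptotically stable")
```

The same check would apply to the Jacobian of the Adam-DA ODE once its explicit form is taken from the paper. The purely bilinear case sits on the imaginary axis, which is why small corrections to the vector field can decide stability there.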
Where Pith is reading between the lines
- The same ODE construction could be applied to other adaptive optimizers such as RMSProp-DA to check whether the momentum reversal is specific to Adam or generic.
- If the reversal persists in non-convex zero-sum problems, it may explain why small first-order momentum values are often preferred in practice for GAN training.
- The ODE view suggests a possible continuous-time schedule for the momentum coefficients that could improve stability without changing the discrete algorithm.
Load-bearing premise
The discrete Adam-DA steps with typical learning rates and momentum values stay close enough to their continuous ODE trajectories that local stability and regularization results carry over.
What would settle it
Run Adam-DA on a simple bilinear zero-sum game with known saddle and measure whether increasing the first-order momentum visibly slows convergence or increasing the second-order momentum visibly speeds it up; a reversal of either trend would contradict the claim.
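The experiment described above can be sketched in a few lines of Python. This is a minimal illustration, assuming simultaneous updates, the textbook Adam rule with bias correction, and the game f(x, y) = xy with saddle at (0, 0). The function name, step size, step count, and starting point are hypothetical choices, not values from the paper.

```python
import math


def adam_da_bilinear(beta1, beta2, h=1e-2, steps=2000, eps=1e-8,
                     x0=1.0, y0=1.0):
    """Simultaneous Adam descent-ascent on f(x, y) = x * y.

    x is the minimizing player, y the maximizing player.
    Returns the final Euclidean distance to the saddle (0, 0).
    """
    x, y = x0, y0
    mx = my = vx = vy = 0.0
    for t in range(1, steps + 1):
        gx, gy = y, x  # df/dx = y, df/dy = x, both taken at the old iterate
        # First- and second-moment estimates for each player
        mx = beta1 * mx + (1 - beta1) * gx
        my = beta1 * my + (1 - beta1) * gy
        vx = beta2 * vx + (1 - beta2) * gx * gx
        vy = beta2 * vy + (1 - beta2) * gy * gy
        # Bias-correction factors (an assumption of this sketch)
        c1 = 1 - beta1 ** t
        c2 = 1 - beta2 ** t
        x -= h * (mx / c1) / (math.sqrt(vx / c2) + eps)  # descent step
        y += h * (my / c1) / (math.sqrt(vy / c2) + eps)  # ascent step
    return math.hypot(x, y)


if __name__ == "__main__":
    # Sweep both momentum coefficients and report distance to the saddle.
    for beta1 in (0.0, 0.5, 0.9):
        for beta2 in (0.9, 0.99, 0.999):
            dist = adam_da_bilinear(beta1, beta2)
            print(f"beta1={beta1:<4} beta2={beta2:<6} distance={dist:.4f}")
```

A purely bilinear game is a borderline case for descent-ascent dynamics, so a single run at one step size may not show a clean trend. Repeating the sweep over several values of h and several starting points would give a fairer reading of either momentum effect.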
Original abstract
The remarkable success of Adam in training neural networks has naturally led to the widespread use of its descent-ascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives ordinary differential equation (ODE) limits for the discrete Adam-DA algorithm applied to zero-sum games. These ODEs are used to analyze local convergence and implicit gradient regularization, leading to the claim that the first-order momentum parameter (β1) and second-order momentum parameter (β2) play exactly opposite roles compared to their effects in standard minimization problems. The analysis is validated through qualitative GAN experiments on multiple architectures and datasets.
Significance. If the ODE approximation is faithful at practical step sizes and momentum values, the work supplies a useful continuous-time framework for understanding momentum in non-monotone settings and could inform hyperparameter selection for GAN training. The explicit reversal result, if rigorously supported, distinguishes this contribution from prior ODE analyses of Adam in convex or minimization settings.
major comments (2)
- [§3] §3 (ODE derivation): The continuous-time limit is obtained via standard Euler discretization and momentum rescaling, but the manuscript provides no explicit error bounds, timescale-separation conditions, or verification that the approximation remains valid for β2 ≈ 0.999 and learning rates ≈ 10^{-3} when the underlying vector field is non-monotone and oscillatory. This assumption is load-bearing for transferring local convergence and regularization conclusions from the ODE to the discrete Adam-DA updates.
- [§5] §5 (Experiments): The GAN results are presented without quantitative metrics (e.g., FID scores, convergence rates), ablation controls on β1/β2, or direct comparisons to Adam in minimization tasks that would demonstrate the claimed reversal. This weakens the empirical support for the central theoretical prediction.
minor comments (2)
- [§3] The notation for the rescaled momentum terms in the ODE should be aligned more explicitly with the discrete Adam-DA update equations to improve readability.
- [Introduction] Add a brief discussion of how the derived ODEs relate to existing continuous-time analyses of Adam in minimization (e.g., prior works on momentum in convex optimization) for clearer positioning.
Simulated Author's Rebuttal
We thank the referee for the detailed and insightful comments on our manuscript. We have carefully considered each point and outline our responses and planned revisions below.
- Referee: [§3] §3 (ODE derivation): The continuous-time limit is obtained via standard Euler discretization and momentum rescaling, but the manuscript provides no explicit error bounds, timescale-separation conditions, or verification that the approximation remains valid for β2 ≈ 0.999 and learning rates ≈ 10^{-3} when the underlying vector field is non-monotone and oscillatory. This assumption is load-bearing for transferring local convergence and regularization conclusions from the ODE to the discrete Adam-DA updates.
  Authors: We thank the referee for this observation. The ODE limit is derived using the standard Euler method with appropriate rescaling of the momentum terms, as is common in the literature on continuous-time analyses of adaptive optimizers. We acknowledge that the manuscript does not provide explicit error bounds or detailed timescale-separation conditions, particularly for the non-monotone case. Deriving such bounds rigorously for oscillatory dynamics would require substantial additional analysis. In the revised manuscript, we will include a new subsection in §3 discussing the assumptions underlying the approximation and provide numerical verification by comparing discrete trajectories with the ODE solutions for β2 close to 1 and small learning rates in the context of our GAN experiments. This will offer practical evidence for the validity of the limit in the relevant parameter regime.
  Revision: partial
- Referee: [§5] §5 (Experiments): The GAN results are presented without quantitative metrics (e.g., FID scores, convergence rates), ablation controls on β1/β2, or direct comparisons to Adam in minimization tasks that would demonstrate the claimed reversal. This weakens the empirical support for the central theoretical prediction.
  Authors: We agree that incorporating quantitative metrics and ablations would strengthen the empirical section. In the revision, we will augment §5 with FID scores and other relevant quantitative measures for the GAN experiments. We will also add ablation studies on the effects of varying β1 and β2, as well as direct comparisons to the behavior of Adam in standard minimization settings. These changes will better substantiate the claimed reversal of roles for the momentum parameters.
  Revision: yes
- Deriving explicit error bounds and timescale-separation conditions for the ODE approximation in non-monotone and oscillatory settings.
Circularity Check
No circularity: standard ODE limit derivation with independent analysis
The paper derives ODEs as continuous-time limits of discrete Adam-DA updates via standard Euler discretization and momentum rescaling techniques. Local convergence and implicit regularization properties are then analyzed directly on the resulting ODE system in the zero-sum setting, yielding the reversed-momentum observation as a consequence of the vector field structure. This chain is self-contained, does not reduce any prediction to a fitted input or prior self-citation by construction, and is externally validated via GAN experiments. No load-bearing step equates to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Discrete Adam-DA updates admit a continuous-time ODE limit that approximates their trajectory for small learning rates.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Continuous Adam-DA ... J_Adam = (1/√ϵ) (I − h(1+β)/(2√ϵ(1−β)) J) J"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.