Training Deep Learning Models with Norm-Constrained LMOs
Pith reviewed 2026-05-21 21:16 UTC · model grok-4.3
The pith
A family of stochastic algorithms using linear minimization oracles over norm balls unifies existing optimizers and enables hyperparameter transfer across deep model sizes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold. First, linear minimization oracles over norm balls yield stochastic algorithms that apply to unconstrained problems and whose update rule unifies several existing optimization methods under one framework. Second, an explicit choice of norm for deep architectures makes hyperparameters transferable across model sizes; the paper supports this with significant speedups on nanoGPT training using the Scion algorithm.
What carries the argument
The linear minimization oracle over a norm ball, which selects the point inside the ball that minimizes the inner product with the current gradient and thereby encodes the geometry of the chosen norm into each update step.
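As a rough illustration of the oracle described above, the sketch below computes lmo(s) = argmin of ⟨s, x⟩ over the ball ‖x‖ ≤ ρ for three familiar norms. The function names, the radius parameter `rho`, and the use of NumPy are assumptions made for this example and are not taken from the paper or its repository; only the closed forms are standard.

```python
import numpy as np

def lmo_l2(s, rho=1.0):
    """Minimize <s, x> over the Euclidean ball of radius rho."""
    norm = np.linalg.norm(s)
    if norm == 0.0:
        # Every point of the ball is a minimizer; return the origin.
        return np.zeros_like(s)
    return -rho * s / norm

def lmo_linf(s, rho=1.0):
    """Minimize <s, x> over the l-infinity ball: a scaled negative sign vector."""
    return -rho * np.sign(s)

def lmo_spectral(S, rho=1.0):
    """Minimize <S, X> over the spectral-norm ball: -rho * U V^T from the reduced SVD."""
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return -rho * U @ Vt

s = np.array([3.0, -4.0])
x_l2 = lmo_l2(s)      # points opposite to s with unit Euclidean length
x_linf = lmo_linf(s)  # each coordinate is -sign(s_i)
```

In each case the output depends on the gradient only through its direction (or sign pattern, or singular vectors), which is how the chosen norm shapes the step.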
Load-bearing premise
The specific norm chosen for deep architectures must be structurally appropriate and must produce the claimed hyperparameter transferability and speedups without hidden per-model retuning or adjustments.
What would settle it
Training a substantially larger nanoGPT model with the exact same hyperparameters and norm as a smaller model and finding no speedup or degraded convergence would show that transferability does not hold.
Original abstract
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion .
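One way to picture how an LMO-based method could run on an unconstrained problem is sketched below. The loop structure, the momentum-style gradient averaging, the toy quadratic objective, and all names and values (`run_lmo_descent`, `gamma`, `alpha`, `rho`) are assumptions chosen for illustration; this should not be read as the paper's Scion implementation.

```python
import numpy as np

def lmo_linf(s, rho=1.0):
    # Sign-type LMO over an l-infinity ball of radius rho.
    return -rho * np.sign(s)

def run_lmo_descent(grad_fn, x0, steps=200, gamma=0.01, alpha=0.1, rho=1.0, seed=0):
    """Hypothetical unconstrained loop: average noisy gradients, then step along lmo(d)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = np.zeros_like(x)
    for _ in range(steps):
        # Stochastic gradient: exact gradient plus small Gaussian noise.
        g = grad_fn(x) + 0.01 * rng.standard_normal(x.shape)
        # Exponential moving average of gradients (assumed form of momentum).
        d = (1.0 - alpha) * d + alpha * g
        # Each coordinate moves by at most gamma * rho, whatever the size of d.
        x = x + gamma * lmo_linf(d, rho)
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = run_lmo_descent(lambda x: x, x0=[1.0, -2.0])
```

Because the step length is set by `gamma * rho` rather than by the gradient magnitude, the iterates in this toy run settle into a small neighborhood of the minimizer instead of converging exactly; a decaying `gamma` would be a natural variation.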
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a stochastic family of algorithms based on linear minimization oracles (LMOs) over norm balls. These algorithms adapt to problem geometry, can be applied to unconstrained problems, and unify several existing optimization methods under one framework. An explicit norm choice is introduced for deep neural architectures that enables hyperparameter transferability across model sizes. Experiments with the resulting Scion algorithm report significant speedups on nanoGPT training without Adam, while requiring only one set of weights and one set of gradients stored in half precision.
Significance. If the central claims hold, the work supplies a geometric unification of optimizers and a practical method whose hyperparameter transferability could reduce tuning costs when scaling models. The memory efficiency and public code release are concrete strengths that support reproducibility and potential adoption.
major comments (2)
- [Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.
- [Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.
minor comments (2)
- [Abstract] Abstract: replace the qualitative phrase 'significant speedups' with concrete numbers (e.g., wall-clock time or iteration counts) to give readers an immediate sense of the gains.
- [Preliminaries] Notation: ensure the LMO definition and the chosen norm are stated with explicit mathematical symbols before their first algorithmic use.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback. The comments help clarify the norm construction and strengthen the experimental claims. We respond to each major comment below.
Point-by-point responses
- Referee: [Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.
  Authors: We appreciate this request for explicit verification. The norm we introduce for deep networks (detailed in Section 3) is a single, globally scaled Euclidean norm whose scaling factor is chosen once for the entire architecture and does not depend on individual layer widths, depths, or per-layer multipliers. Because the same geometric ball is used uniformly, the unification under a common LMO framework remains intact and the same hyperparameters transfer directly across model sizes without retuning. In the revision we add a short lemma and accompanying remark that formally states the scaling is width-independent, thereby addressing the concern directly.
  Revision: yes
- Referee: [Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.
  Authors: We agree that these details are necessary for a complete assessment. The revised manuscript now reports all speedup and transferability figures with error bars obtained from five independent random seeds. A new appendix supplies the complete hyperparameter schedules, and the main text explicitly states that identical hyperparameters were used for every model size with no per-model adjustments. These additions provide the requested evidence of robustness.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's core derivation starts from the linear minimization oracle over a norm-ball to define stochastic algorithms, directly yielding an update rule that unifies existing methods as a consequence of the LMO geometry and extends to unconstrained problems by the same construction. The explicit norm choice for deep architectures is introduced as an independent proposal whose side benefit is hyperparameter transferability, without any quoted reduction to fitted parameters, self-citations, or target performance metrics. No load-bearing step equates a prediction or result to its inputs by definition or prior self-referential work; the claims remain self-contained against the algorithmic framework and external experimental benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The loss function admits an efficient linear minimization oracle over the chosen norm ball.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose an explicit choice of norm for deep architectures... (Sign → Spectral → Sign) configuration... lmo names defined in Table 2
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The lmo is scale invariant... ∥lmo(s)∥ ≤ ρ... only the direction matters
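The scale-invariance property quoted in the passage above can be checked numerically. The sketch below assumes a spectral-norm ball and a hypothetical helper `lmo_spectral`; the matrix size, the radius `rho`, and the scaling factor are arbitrary choices for illustration.

```python
import numpy as np

def lmo_spectral(S, rho=1.0):
    # Minimizer of <S, X> over the spectral-norm ball of radius rho.
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    return -rho * U @ Vt

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 3))
rho = 0.5

X = lmo_spectral(S, rho)
X_scaled = lmo_spectral(1000.0 * S, rho)  # same direction, much larger magnitude

# Scale invariance: rescaling the input by a positive factor leaves the output unchanged.
same_output = np.allclose(X, X_scaled)
# The output always lies in the ball: its spectral norm is at most rho.
within_ball = np.linalg.norm(X, ord=2) <= rho + 1e-9
```

Both flags should come out true for any full-rank input, which is one way to read "only the direction matters": the magnitude of the gradient never reaches the update.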
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
  Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
- Muon is Not That Special: Random or Inverted Spectra Work Just as Well
  Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
- Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
  Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
- A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
  A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
- On the Convergence of Muon and Beyond
  Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
- HORST: Composing Optimizer Geometries for Sparse Transformer Training
  HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
- Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization
  Introduces Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon with stationarity and O(1/T) objective-gap guarantees, shown to match or improve fixed-scale Muon on GPT-124M and ViT-Tiny models.
- Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
  Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
  Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Demystifying Manifold Constraints in LLM Pre-training
  Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
- SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
  SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...
- MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
  MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
- Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
  Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that ...
- MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
  MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
- Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
  Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
- AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
  AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
- Communication-Efficient Gluon in Federated Learning
  Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
- A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models
  LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...
- Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
  Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
- On the Convergence Analysis of Muon
  Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.