Training Deep Learning Models with Norm-Constrained LMOs

Antonio Silveti-Falls; Kimon Antonakopoulos; Thomas Pethick; Volkan Cevher; Wanyun Xie; Zhenyu Zhu

arxiv: 2502.07529 · v2 · pith:57AUUJAEnew · submitted 2025-02-11 · 💻 cs.LG · math.OC

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher

Pith reviewed 2026-05-21 21:16 UTC · model grok-4.3

classification 💻 cs.LG math.OC

keywords: linear minimization oracle · stochastic optimization · deep learning · norm ball · hyperparameter transfer · unconstrained problems · nanoGPT · memory efficient

The pith

A family of stochastic algorithms using linear minimization oracles over norm balls unifies existing optimizers and enables hyperparameter transfer across deep model sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops stochastic optimization algorithms that rely on the linear minimization oracle over a norm ball to adapt updates to the geometry of the loss landscape. These methods apply to unconstrained problems and recover several standard optimizers through a single update rule. For deep architectures the authors select an explicit norm that permits the same hyperparameters to work across different model scales, producing faster training on nanoGPT without Adam. Readers would care because the approach is memory-efficient, storing only weights and gradients in half precision, which could simplify large-scale training pipelines.

Core claim

The central claim is twofold. First, linear minimization oracles over norm balls yield stochastic algorithms that apply to unconstrained problems and whose update rule unifies several existing optimization methods under one framework. Second, an explicit norm choice for deep architectures makes hyperparameters transfer across model sizes, which the authors demonstrate through significant speedups on nanoGPT training with the Scion algorithm.

What carries the argument

The argument rests on the linear minimization oracle over a norm ball. The oracle selects the point inside the ball that minimizes the inner product with the current gradient, and thereby encodes the geometry of the chosen norm into each update step.
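
As a minimal sketch of what such an oracle computes, the Python snippet below gives closed-form LMOs for three common norm balls. The function names, the radius parameter `rho`, and the choice of these three norms are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def lmo_linf(g, rho=1.0):
    # Ball ||x||_inf <= rho: push every coordinate to -rho * sign(g_i).
    return -rho * np.sign(g)

def lmo_l2(g, rho=1.0):
    # Ball ||x||_2 <= rho: move a distance rho directly against g.
    n = np.linalg.norm(g)
    return -rho * g / n if n > 0 else np.zeros_like(g)

def lmo_spectral(G, rho=1.0):
    # Ball of matrices with spectral norm <= rho: -rho * U V^T,
    # where G = U diag(sigma) V^T is the reduced SVD.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -rho * U @ Vt

g = np.array([3.0, -4.0])
G = np.diag([2.0, 5.0])
step_linf = lmo_linf(g)
step_l2 = lmo_l2(g)
step_spec = lmo_spectral(G)
```

For a nonzero gradient, each oracle returns a point on the boundary of its ball, and the inner product of that point with the gradient equals minus `rho` times the dual norm of the gradient (the l1 norm, the l2 norm, and the nuclear norm, respectively).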

Load-bearing premise

The specific norm chosen for deep architectures must be structurally appropriate and must produce the claimed hyperparameter transferability and speedups without hidden per-model retuning or adjustments.

What would settle it

Training a substantially larger nanoGPT model with the exact same hyperparameters and norm as a smaller model and finding no speedup or degraded convergence would show that transferability does not hold.

Original abstract

In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion .
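
The abstract does not state the update rule itself. One plausible reading, sketched below under stated assumptions, averages stochastic gradients and then moves by a fixed step in the direction returned by the LMO. The function names, the averaging weight `alpha`, the step size `lr`, and the toy quadratic objective are all hypothetical choices for illustration.

```python
import numpy as np

def lmo_sign(d, rho=1.0):
    # LMO over the l-infinity ball of radius rho (illustrative choice of norm).
    return -rho * np.sign(d)

def unconstrained_lmo_step(x, d, grad, lr=0.1, alpha=0.5, lmo=lmo_sign):
    # Exponential average of the (stochastic) gradients.
    d = (1.0 - alpha) * d + alpha * grad
    # The LMO output already points downhill, so it is added, not subtracted.
    x = x + lr * lmo(d)
    return x, d

# Toy problem (assumption): f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.array([1.0, -2.0])
d = np.zeros_like(x)
for _ in range(50):
    x, d = unconstrained_lmo_step(x, d, grad=x, lr=0.05)
```

With the sign oracle this step coincides with sign descent with momentum, which is one concrete sense in which an LMO-based update can recover a familiar optimizer. Swapping in a different `lmo` changes the geometry without changing the loop.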

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a stochastic family of algorithms based on linear minimization oracles (LMOs) over norm balls. These algorithms adapt to problem geometry, can be applied to unconstrained problems, and unify several existing optimization methods under one framework. An explicit norm choice is introduced for deep neural architectures that enables hyperparameter transferability across model sizes. Experiments with the resulting Scion algorithm report significant speedups on nanoGPT training without Adam, while requiring only one set of weights and one set of gradients stored in half precision.

Significance. If the central claims hold, the work supplies a geometric unification of optimizers and a practical method whose hyperparameter transferability could reduce tuning costs when scaling models. The memory efficiency and public code release are concrete strengths that support reproducibility and potential adoption.

major comments (2)

[Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.
[Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.

minor comments (2)

[Abstract] Abstract: replace the qualitative phrase 'significant speedups' with concrete numbers (e.g., wall-clock time or iteration counts) to give readers an immediate sense of the gains.
[Preliminaries] Notation: ensure the LMO definition and the chosen norm are stated with explicit mathematical symbols before their first algorithmic use.
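
For reference, the standard definition that this comment asks for can be written as follows. The symbols `s` and `ρ` follow the paper passage quoted further down this page; the set name `\mathcal{D}` and the dual-norm notation `\|\cdot\|_*` are assumptions made here for illustration.

```latex
\[
  \operatorname{lmo}(s) \in \operatorname*{arg\,min}_{x \in \mathcal{D}} \langle s, x \rangle,
  \qquad
  \mathcal{D} = \{\, x : \|x\| \le \rho \,\},
\]
\[
  \langle s, \operatorname{lmo}(s) \rangle = -\rho \, \|s\|_{*},
  \qquad
  \|s\|_{*} = \max_{\|x\| \le 1} \langle s, x \rangle .
\]
```

The second line is the usual dual-norm identity: the oracle attains the most negative inner product available inside the ball, so its value depends on `s` only through the dual norm.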

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The comments help clarify the norm construction and strengthen the experimental claims. We respond to each major comment below.

Point-by-point responses

Referee: [Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.

Authors: We appreciate this request for explicit verification. The norm we introduce for deep networks (detailed in Section 3) is a single, globally scaled Euclidean norm whose scaling factor is chosen once for the entire architecture and does not depend on individual layer widths, depths, or per-layer multipliers. Because the same geometric ball is used uniformly, the unification under a common LMO framework remains intact and the same hyperparameters transfer directly across model sizes without retuning. In the revision we add a short lemma and an accompanying remark that formally state that the scaling is width-independent, thereby addressing the concern directly.
Revision: yes

Referee: [Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.

Authors: We agree that these details are necessary for a complete assessment. The revised manuscript now reports all speedup and transferability figures with error bars obtained from five independent random seeds. A new appendix supplies the complete hyperparameter schedules, and the main text explicitly states that identical hyperparameters were used for every model size with no per-model adjustments. These additions provide the requested evidence of robustness.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper's core derivation starts from the linear minimization oracle over a norm-ball to define stochastic algorithms, directly yielding an update rule that unifies existing methods as a consequence of the LMO geometry and extends to unconstrained problems by the same construction. The explicit norm choice for deep architectures is introduced as an independent proposal whose side benefit is hyperparameter transferability, without any quoted reduction to fitted parameters, self-citations, or target performance metrics. No load-bearing step equates a prediction or result to its inputs by definition or prior self-referential work; the claims remain self-contained against the algorithmic framework and external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard first-order optimization assumptions (smoothness or Lipschitz continuity of the loss) and the existence of an efficient LMO for the chosen norm; no new invented entities or heavily fitted parameters are introduced in the abstract.

axioms (1)

domain assumption: The chosen norm ball admits an efficient linear minimization oracle.
Invoked when defining the update rule for Scion.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We propose an explicit choice of norm for deep architectures... (Sign → Spectral → Sign) configuration... lmo names defined in Table 2
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

The lmo is scale invariant... ∥lmo(s)∥ ≤ ρ... only the direction matters
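
The scale invariance mentioned in this passage can be checked numerically. The sketch below is a hypothetical test harness, not code from the paper: it rescales an input by several positive constants and confirms that two closed-form oracles return the same point each time.

```python
import numpy as np

def lmo_l2(s, rho=1.0):
    # Euclidean ball of radius rho.
    n = np.linalg.norm(s)
    return -rho * s / n if n > 0 else np.zeros_like(s)

def lmo_sign(s, rho=1.0):
    # l-infinity ball of radius rho.
    return -rho * np.sign(s)

def is_scale_invariant(lmo, s, scales=(1e-3, 1.0, 1e3)):
    # The output should not change when s is multiplied by any c > 0.
    ref = lmo(s)
    return all(np.allclose(lmo(c * s), ref) for c in scales)

s = np.array([3.0, -4.0, 0.5])
results = {name: is_scale_invariant(f, s)
           for name, f in [("l2", lmo_l2), ("sign", lmo_sign)]}
```

A practical consequence is that the size of each update is governed by the radius and the step size rather than by the gradient magnitude.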

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
cs.LG 2026-05 unverdicted novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
cs.LG 2026-05 unverdicted novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds
cs.LG 2026-05 unverdicted novelty 7.0

Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
cs.LG 2026-04 unverdicted novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
On the Convergence of Muon and Beyond
cs.LG 2025-09 unverdicted novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
HORST: Composing Optimizer Geometries for Sparse Transformer Training
cs.LG 2026-05 unverdicted novelty 6.0

HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.
Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization
cs.LG 2026-05 unverdicted novelty 6.0

Introduces Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon with stationarity and O(1/T) objective-gap guarantees, shown to match or improve fixed-scale Muon on GPT-124M and ViT-Tiny models.
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
cs.LG 2026-05 unverdicted novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Demystifying Manifold Constraints in LLM Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...
SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon
math.OC 2026-04 unverdicted novelty 6.0

SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
cs.LG 2025-10 unverdicted novelty 6.0

Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that ...
MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
cs.LG 2026-05 unverdicted novelty 5.0

MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
cs.LG 2026-05 unverdicted novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments
cs.LG 2026-05 unverdicted novelty 5.0

AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.
Communication-Efficient Gluon in Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models
stat.ML 2026-04 unverdicted novelty 5.0

LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
cs.LG 2025-09 unverdicted novelty 5.0

Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
On the Convergence Analysis of Muon
stat.ML 2025-05 unverdicted novelty 5.0

Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.

Reference graph

Works this paper leans on

194 extracted references · 194 canonical work pages · cited by 20 Pith papers · 13 internal anchors

[2]

Spectrally-normalized margin bounds for neural networks , url =

Bartlett, Peter L and Foster, Dylan J and Telgarsky, Matus J , booktitle =. Spectrally-normalized margin bounds for neural networks , url =

work page
[3]

and Combettes, Patrick L

Bauschke, Heinz H. and Combettes, Patrick L. , title =. 2017 , series =

work page 2017
[4]

, author=

A hybrid projection-proximal point algorithm. , author=. Journal of convex analysis , volume=. 1999 , publisher=

work page 1999
[5]

Pethick, P

Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems , author=. arXiv preprint arXiv:2302.09831 , year=

work page arXiv
[6]

Advances in Neural Information Processing Systems , volume=

Stable nonconvex-nonconcave training via linear interpolation , author=. Advances in Neural Information Processing Systems , volume=

work page
[7]

Sur les op

Banach, Stefan , journal=. Sur les op

work page
[8]

Journal of Machine Learning Research , volume=

First-order convergence theory for weakly-convex-weakly-concave min-max problems , author=. Journal of Machine Learning Research , volume=

work page
[9]

arXiv preprint arXiv:1810.10207 , volume=

Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality , author=. arXiv preprint arXiv:1810.10207 , volume=

work page arXiv
[10]

An inexact hybrid generalized proximal point algorithm and some new results on the theory of

Solodov, Mikhail V and Svaiter, Benar Fux , journal=. An inexact hybrid generalized proximal point algorithm and some new results on the theory of. 2000 , publisher=

work page 2000
[11]

Rates of convergence for inexact

Bravo, Mario and Cominetti, Roberto and Pavez-Sign. Rates of convergence for inexact. Mathematical Programming , volume=. 2019 , publisher=

work page 2019
[12]

2001 , publisher=

Combettes, Patrick L , booktitle=. 2001 , publisher=

work page 2001
[13]

Convex analysis and monotone operator theory in

Bauschke, Heinz H and Combettes, Patrick L and others , volume=. Convex analysis and monotone operator theory in. 2011 , publisher=

work page 2011
[14]

SIAM Journal on Optimization , volume=

Fast proximal methods via time scaling of damped inertial dynamics , author=. SIAM Journal on Optimization , volume=. 2019 , publisher=

work page 2019
[15]

Escaping limit cycles:

Pethick, Thomas and Latafat, Puya and Patrinos, Panagiotis and Fercoq, Olivier and Cevher, Volkan , booktitle=. Escaping limit cycles:

work page
[16]

Chavdarova, Tatjana and Pagliardini, Matteo and Stich, Sebastian U and Fleuret, Fran. Taming. arXiv preprint arXiv:2006.14567 , year=

work page arXiv 2006
[17]

Mathematical Programming , volume=

Generalized monotone operators and their averaged resolvents , author=. Mathematical Programming , volume=. 2021 , publisher=

work page 2021
[18]

International Conference on Artificial Intelligence and Statistics , pages=

Efficient methods for structured nonconvex-nonconcave min-max optimization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=

work page 2021
[19]

Mathematical Programming , pages=

The landscape of the proximal point method for nonconvex--nonconcave minimax optimization , author=. Mathematical Programming , pages=. 2022 , publisher=

work page 2022
[20]

SIAM journal on control and optimization , volume=

Proximal methods for cohypomonotone operators , author=. SIAM journal on control and optimization , volume=. 2004 , publisher=

work page 2004
[21]

arXiv preprint arXiv:2210.13831 , year=

Convergence of proximal point and extragradient-based methods beyond monotonicity: the case of negative comonotonicity , author=. arXiv preprint arXiv:2210.13831 , year=

work page arXiv
[22]

SIAM journal on control and optimization , volume=

Monotone operators and the proximal point algorithm , author=. SIAM journal on control and optimization , volume=. 1976 , publisher=

work page 1976
[23]

Eckstein, Jonathan and Bertsekas, Dimitri P , journal=. On the. 1992 , publisher=

work page 1992
[24]

Produits infinis de r

Br. Produits infinis de r. Israel Journal of Mathematics , volume=. 1978 , publisher=

work page 1978
[25]

Bulletin of the American Mathematical Society , volume=

Weak convergence of the sequence of successive approximations for nonexpansive mappings , author=. Bulletin of the American Mathematical Society , volume=. 1967 , publisher=

work page 1967
[26]

Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps , publisher =

Cevher, Volkan and Piliouras, Georgios and Sim, Ryann and Skoulakis, Stratis , keywords =. Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps , publisher =. 2023 , copyright =. doi:10.48550/ARXIV.2301.03931 , url =

work page doi:10.48550/arxiv.2301.03931 2023
[27]

Prox-method with rate of convergence O (1/t) for variational inequalities with

Nemirovski, Arkadi , journal=. Prox-method with rate of convergence O (1/t) for variational inequalities with. 2004 , publisher=

work page 2004
[28]

Advances in neural information processing systems , volume=

Lookahead optimizer: k steps forward, 1 step back , author=. Advances in neural information processing systems , volume=

work page
[29]

ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Lookahead converges to stationary points of smooth non-convex functions , author=. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2020 , organization=

work page 2020
[30]

Multilayer

Pushkin, Denys and Barba, Luis , journal=. Multilayer

work page
[31]

On Convergence of

Ha, Junsoo and Kim, Gunhee , booktitle=. On Convergence of. 2022 , organization=

work page 2022
[32]

Mathematical programming , volume=

Incremental proximal methods for large scale convex optimization , author=. Mathematical programming , volume=. 2011 , publisher=

work page 2011
[33]

Artificial Intelligence and Statistics , pages=

Towards stability and optimality in stochastic gradient descent , author=. Artificial Intelligence and Statistics , pages=. 2016 , organization=

work page 2016
[34]

The proximal

Toulis, Panos and Horel, Thibaut and Airoldi, Edoardo M , journal=. The proximal

work page
[35]

Advances in Neural Information Processing Systems , volume=

Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems , author=. Advances in Neural Information Processing Systems , volume=

work page
[36]

CAMSAP , year=

On the convergence of a stochastic proximal point algorithm , author=. CAMSAP , year=

work page
[37]

Optimization Letters , volume=

Stochastic proximal splitting algorithm for composite minimization , author=. Optimization Letters , volume=. 2021 , publisher=

work page 2021
[38]

The Journal of Machine Learning Research , volume=

Nonasymptotic convergence of stochastic proximal point methods for constrained convex optimization , author=. The Journal of Machine Learning Research , volume=. 2017 , publisher=

work page 2017
[39]

Solving stochastic weak

Thomas Pethick and Olivier Fercoq and Puya Latafat and Panagiotis Patrinos and Volkan Cevher , booktitle=. Solving stochastic weak

work page
[40]

arXiv preprint arXiv:1802.10551 , year=

A variational inequality perspective on generative adversarial networks , author=. arXiv preprint arXiv:1802.10551 , year=

work page arXiv
[41]

SIAM Journal on Control and Optimization , volume=

Applications of a splitting algorithm to decomposition in convex programming and variational inequalities , author=. SIAM Journal on Control and Optimization , volume=. 1991 , publisher=

work page 1991
[42]

The limits of min-max optimization algorithms:

Hsieh, Ya-Ping and Mertikopoulos, Panayotis and Cevher, Volkan , booktitle=. The limits of min-max optimization algorithms:. 2021 , organization=

work page 2021
[43]

On First-Order Meta-Learning Algorithms

On first-order meta-learning algorithms , author=. arXiv preprint arXiv:1803.02999 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Artificial intelligence and statistics , pages=

Communication-efficient learning of deep networks from decentralized data , author=. Artificial intelligence and statistics , pages=. 2017 , organization=

work page 2017
[45]

Heusel, Martin and Ramsauer, Hubert and Unterthiner, Thomas and Nessler, Bernhard and Hochreiter, Sepp , journal=

work page
[46]

Improved techniques for training

Salimans, Tim and Goodfellow, Ian and Zaremba, Wojciech and Cheung, Vicki and Radford, Alec and Chen, Xi , journal=. Improved techniques for training

work page
[47]

2009 , publisher=

Learning multiple layers of features from tiny images , author=. 2009 , publisher=

work page 2009
[48]

doi:10.5281/zenodo.4957738 , note=

Anton Obukhov and Maximilian Seitzer and Po-Wei Wu and Semen Zhydenko and Jonathan Kyl and Elvis Yu-Jing Lin , year=2020, title=. doi:10.5281/zenodo.4957738 , note=

work page doi:10.5281/zenodo.4957738 2020
[49]

Extragradient-Type Methods for Co-Monotone Root-Finding Problems , author=

work page
[50]

International Conference on Artificial Intelligence and Statistics , pages=

Extragradient method: O (1/K) last-iterate convergence for monotone variational inequalities and connections with cocoercivity , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2022 , organization=

work page 2022
[51]

Stochastic fixed-point iterations for nonexpansive maps:

Bravo, Mario and Cominetti, Roberto , journal=. Stochastic fixed-point iterations for nonexpansive maps:

work page
[52]

Towards understanding why lookahead generalizes better than

Zhou, Pan and Yan, Hanshu and Yuan, Xiaotong and Feng, Jiashi and Yan, Shuicheng , journal=. Towards understanding why lookahead generalizes better than

work page
[53]

SIAM Journal on Optimization , volume=

Nonlinear forward-backward splitting with projection correction , author=. SIAM Journal on Optimization , volume=. 2021 , publisher=

work page 2021
[54]

Matekon , volume=

Extragradient method for finding saddle points and other problems , author=. Matekon , volume=. 1977 , publisher=

work page 1977
[55]

SIAM Journal on Optimization , volume=

Inexact variants of the proximal point algorithm without monotonicity , author=. SIAM Journal on Optimization , volume=. 2003 , publisher=

work page 2003
[56]

Conference on Learning Theory , pages=

Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems , author=. Conference on Learning Theory , pages=. 2020 , organization=

work page 2020
[57]

Accelerated Algorithms for Smooth Convex-Concave Minimax Problems with O (1/k\^

Yoon, TaeHo and Ryu, Ernest K , booktitle=. Accelerated Algorithms for Smooth Convex-Concave Minimax Problems with O (1/k\^. 2021 , organization=

work page 2021
[58]

arXiv preprint arXiv:2206.05248 , year=

Accelerated algorithms for monotone inclusions and constrained nonconvex-nonconcave min-max optimization , author=. arXiv preprint arXiv:2206.05248 , year=

work page arXiv
[59]

Conference on Learning Theory , pages=

Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities , author=. Conference on Learning Theory , pages=. 2020 , organization=

work page 2020
[60]

Fixed points of nonexpanding maps , author=

work page
[61]

Optimization letters , volume=

On the convergence rate of the Halpern-iteration , author=. Optimization letters , volume=. 2021 , publisher=

work page 2021
[62]

On Lower and Upper Bounds for Smooth and Strongly Convex Optimization Problems

On lower and upper bounds for smooth and strongly convex optimization problems , author=. arXiv preprint arXiv:1503.06833 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[63]

arXiv preprint arXiv:2402.05071 , year=

Extending the Reach of First-Order Algorithms for Nonconvex Min-Max Problems with Cohypomonotonicity , author=. arXiv preprint arXiv:2402.05071 , year=

work page arXiv
[64]

Journal of Global Optimization , volume=

Conical averagedness and convergence analysis of fixed point algorithms , author=. Journal of Global Optimization , volume=. 2022 , publisher=

