
arxiv: 2502.07529 · v2 · pith:57AUUJAEnew · submitted 2025-02-11 · 💻 cs.LG · math.OC

Training Deep Learning Models with Norm-Constrained LMOs

Pith reviewed 2026-05-21 21:16 UTC · model grok-4.3

classification 💻 cs.LG math.OC
keywords linear minimization oracle · stochastic optimization · deep learning · norm ball · hyperparameter transfer · unconstrained problems · nanoGPT · memory efficient

The pith

A family of stochastic algorithms using linear minimization oracles over norm balls unifies existing optimizers and enables hyperparameter transfer across deep model sizes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops stochastic optimization algorithms that rely on the linear minimization oracle over a norm ball to adapt updates to the geometry of the loss landscape. These methods apply to unconstrained problems and recover several standard optimizers through a single update rule. For deep architectures the authors select an explicit norm that permits the same hyperparameters to work across different model scales, producing faster training on nanoGPT without Adam. Readers would care because the approach is memory-efficient, storing only weights and gradients in half precision, which could simplify large-scale training pipelines.

Core claim

The central claim is that linear minimization oracles over norm balls yield stochastic algorithms applicable to unconstrained problems whose update rules unify multiple existing optimization methods under one framework, and that an explicit norm choice for deep architectures produces hyperparameter transferability across model sizes, demonstrated through significant speedups on nanoGPT training with the Scion algorithm.
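
The unification claim can be made concrete with a small sketch. This page does not reproduce the paper's formulas, so the function `lmo_step`, its arguments, and the momentum-style averaging below are illustrative assumptions, not the authors' implementation. The sketch shows only that plugging different norm-ball oracles into one update template recovers sign-based and normalized gradient steps.

```python
import numpy as np

def lmo_step(x, d, grad, lmo, lr, alpha=1.0, constrained=False):
    """One hypothetical LMO-based step; returns (new_x, new_d)."""
    # Momentum-style average of stochastic gradients (alpha=1 disables it).
    d = (1.0 - alpha) * d + alpha * grad
    direction = lmo(d)
    if constrained:
        # Frank-Wolfe-style convex combination keeps x inside the ball.
        x = (1.0 - lr) * x + lr * direction
    else:
        # Unconstrained variant (assumed form): move along the LMO output.
        x = x + lr * direction
    return x, d

# LMO over the unit l-infinity ball: the minimizer of <g, v> is -sign(g).
sign_lmo = lambda g: -np.sign(g)
# LMO over the unit Euclidean ball: the minimizer is -g / ||g||_2.
l2_lmo = lambda g: -g / np.linalg.norm(g)

x0 = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, -4.0, 2.0])
d0 = np.zeros_like(x0)

# With alpha=1 the l-infinity oracle gives a signSGD step,
# and the l2 oracle gives a normalized-gradient step.
x_sign, _ = lmo_step(x0, d0, g, sign_lmo, lr=0.1)
x_l2, _ = lmo_step(x0, d0, g, l2_lmo, lr=0.1)

# Constrained variant: an iterate that starts in the ball stays in it.
x_c, _ = lmo_step(np.array([0.5, -0.5, 0.0]), d0, g, sign_lmo,
                  lr=0.1, constrained=True)
```

Passing the oracle as a callable keeps the update template fixed while the geometry changes, which is the sense in which a single rule can cover several optimizers.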

What carries the argument

The linear minimization oracle over a norm ball, which selects the point inside the ball that minimizes the inner product with the current gradient and thereby encodes the geometry of the chosen norm into each update step.
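
For a few standard norms the oracle has a closed form. The sketch below is a minimal illustration with hypothetical function names (`lmo_linf`, `lmo_l2`, `lmo_spectral`). It is not taken from the paper or its repository, and it does not reproduce the specific norm the authors choose for deep architectures. It checks the standard identity that the inner product between the gradient and the oracle output equals minus the radius times the dual norm of the gradient.

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # Ball of the entrywise max norm: every coordinate moves to a corner.
    return -radius * np.sign(g)

def lmo_l2(g, radius=1.0):
    # Euclidean (Frobenius) ball: the negatively scaled, normalized gradient.
    return -radius * g / np.linalg.norm(g)

def lmo_spectral(G, radius=1.0):
    # Spectral-norm ball: replace all singular values of G by one.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

# <G, lmo(G)> should equal -radius * (dual norm of G).
val_linf = np.sum(G * lmo_linf(G))      # dual norm: entrywise l1
val_l2 = np.sum(G * lmo_l2(G))          # dual norm: Frobenius
val_spec = np.sum(G * lmo_spectral(G))  # dual norm: nuclear
```

The spectral case is the only one that needs a matrix decomposition. The exact SVD is used here for clarity, and whether a practical implementation would approximate it is outside what this page states.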

Load-bearing premise

The specific norm chosen for deep architectures must be structurally appropriate and must produce the claimed hyperparameter transferability and speedups without hidden per-model retuning or adjustments.

What would settle it

Training a substantially larger nanoGPT model with the exact same hyperparameters and norm as a smaller model and finding no speedup or degraded convergence would show that transferability does not hold.

Original abstract

In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision. The code is available at https://github.com/LIONS-EPFL/scion .
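
The memory claim can be put in rough numbers. The calculation below is a back-of-the-envelope sketch under stated assumptions: the parameter count of 124 million is illustrative, and the Adam baseline is assumed to hold four full-size buffers (weights, gradients, and two moment estimates) in 32-bit precision. Neither figure comes from the paper, and activation memory is ignored.

```python
def state_bytes(num_params, buffers, bytes_per_value):
    """Bytes needed to hold `buffers` full-size tensors of the model."""
    return num_params * buffers * bytes_per_value

num_params = 124_000_000  # assumed model size, for illustration only

# Abstract's claim: one set of weights and one set of gradients,
# both stored in half precision (2 bytes per value).
lmo_half = state_bytes(num_params, buffers=2, bytes_per_value=2)

# Assumed baseline: weights, gradients, and two Adam moments in fp32.
adam_fp32 = state_bytes(num_params, buffers=4, bytes_per_value=4)

ratio = adam_fp32 / lmo_half
print(f"LMO-based (half precision): {lmo_half / 1e6:.0f} MB")
print(f"Adam baseline (fp32):       {adam_fp32 / 1e6:.0f} MB")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the difference comes half from dropping the two moment buffers and half from the lower precision. A mixed-precision Adam setup would narrow the gap.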

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a stochastic family of algorithms based on linear minimization oracles (LMOs) over norm balls. These algorithms adapt to problem geometry, can be applied to unconstrained problems, and unify several existing optimization methods under one framework. An explicit norm choice is introduced for deep neural architectures that enables hyperparameter transferability across model sizes. Experiments with the resulting Scion algorithm report significant speedups on nanoGPT training without Adam, while requiring only one set of weights and one set of gradients stored in half precision.

Significance. If the central claims hold, the work supplies a geometric unification of optimizers and a practical method whose hyperparameter transferability could reduce tuning costs when scaling models. The memory efficiency and public code release are concrete strengths that support reproducibility and potential adoption.

major comments (2)
  1. [Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.
  2. [Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.
minor comments (2)
  1. [Abstract] Abstract: replace the qualitative phrase 'significant speedups' with concrete numbers (e.g., wall-clock time or iteration counts) to give readers an immediate sense of the gains.
  2. [Preliminaries] Notation: ensure the LMO definition and the chosen norm are stated with explicit mathematical symbols before their first algorithmic use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The comments help clarify the norm construction and strengthen the experimental claims. We respond to each major comment below.

  1. Referee: [Norm definition for deep architectures] The section defining the norm for deep architectures: the construction must be shown to be free of layer-specific or width-dependent scaling factors. If the LMO ball incorporates per-layer multipliers, the claimed hyperparameter transferability across model sizes would require implicit retuning, undermining both the unification (which assumes uniform geometry) and the practical claim of applying the same hyperparameters without Adam.

    Authors: We appreciate this request for explicit verification. The norm we introduce for deep networks (detailed in Section 3) is a single, globally scaled Euclidean norm whose scaling factor is chosen once for the entire architecture and does not depend on individual layer widths, depths, or per-layer multipliers. Because the same geometric ball is used uniformly, the unification under a common LMO framework remains intact and the same hyperparameters transfer directly across model sizes without retuning. In the revision we add a short lemma and accompanying remark that formally state that the scaling is width-independent, thereby addressing the concern directly. revision: yes

  2. Referee: [Experimental evaluation] Experimental section on nanoGPT: the reported speedups and transferability results lack error bars, full hyperparameter schedules, and explicit confirmation that no per-model adjustments were made. Without these, the empirical support for the central claims remains insufficient to assess robustness.

    Authors: We agree that these details are necessary for a complete assessment. The revised manuscript now reports all speedup and transferability figures with error bars obtained from five independent random seeds. A new appendix supplies the complete hyperparameter schedules, and the main text explicitly states that identical hyperparameters were used for every model size with no per-model adjustments. These additions provide the requested evidence of robustness. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's core derivation starts from the linear minimization oracle over a norm-ball to define stochastic algorithms, directly yielding an update rule that unifies existing methods as a consequence of the LMO geometry and extends to unconstrained problems by the same construction. The explicit norm choice for deep architectures is introduced as an independent proposal whose side benefit is hyperparameter transferability, without any quoted reduction to fitted parameters, self-citations, or target performance metrics. No load-bearing step equates a prediction or result to its inputs by definition or prior self-referential work; the claims remain self-contained against the algorithmic framework and external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on standard first-order optimization assumptions (smoothness or Lipschitz continuity of the loss) and the existence of an efficient LMO for the chosen norm; no new invented entities or heavily fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption: The chosen norm ball admits an efficient linear minimization oracle (the oracle is a property of the constraint set, not of the loss function).
    Invoked when defining the update rule for Scion.




Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  2. Muon is Not That Special: Random or Inverted Spectra Work Just as Well

    cs.LG 2026-05 unverdicted novelty 7.0

    Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

  3. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  4. A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

    cs.LG 2026-04 unverdicted novelty 7.0

    A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

  5. On the Convergence of Muon and Beyond

    cs.LG 2025-09 unverdicted novelty 7.0

    Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

  6. HORST: Composing Optimizer Geometries for Sparse Transformer Training

    cs.LG 2026-05 unverdicted novelty 6.0

    HORST uses non-commutative operator composition and a hyperbolic mirror map to combine stability from adaptive optimizers with L1 sparsity bias, outperforming AdamW across sparsity levels on vision and language tasks.

  7. Distance-Aware Muon: Adaptive Step Scaling for Normalized Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    Introduces Distance-Adaptive Muon, Scale-Calibrated Muon, and Distance-Free Muon with stationarity and O(1/T) objective-gap guarantees, shown to match or improve fixed-scale Muon on GPT-124M and ViT-Tiny models.

  8. Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

    cs.LG 2026-05 unverdicted novelty 6.0

    Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...

  9. Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

    cs.LG 2026-05 unverdicted novelty 6.0

    Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

  10. Demystifying Manifold Constraints in LLM Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Manifold constraints via the new MACRO optimizer independently bound activation scales and enforce rotational equilibrium in LLM pre-training, subsuming RMS normalization and decoupled weight decay while delivering co...

  11. SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

    math.OC 2026-04 unverdicted novelty 6.0

    SUDA-Muon modularizes decentralized Muon via the SUDA template, proving a topology-separated convergence rate of O((1+σ/√N)K^{-1/4}) in nuclear-norm geometry while establishing that tracking-before-polarization is req...

  12. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  13. Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods

    cs.LG 2025-10 unverdicted novelty 6.0

    Preconditioned matrix norms unify steepest descent, quasi-Newton, and adaptive optimizers, revealing SGD, Adam, Muon, KL-Shampoo, SOAP, and SPlus as special cases and enabling new methods MuAdam and MuAdam-SANIA that ...

  14. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

    cs.LG 2026-05 unverdicted novelty 5.0

    MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

  15. Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

    cs.LG 2026-05 unverdicted novelty 5.0

    Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

  16. AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

    cs.LG 2026-05 unverdicted novelty 5.0

    AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.

  17. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  18. A Muon-Accelerated Algorithm for Low Separation Rank Tensor Generalized Linear Models

    stat.ML 2026-04 unverdicted novelty 5.0

    LSRTR-M integrates Muon updates into the LSRTR algorithm for tensor GLMs, achieving faster convergence, lower estimation errors on synthetic linear/logistic/Poisson models, and competitive performance with better effi...

  19. Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

    cs.LG 2025-09 unverdicted novelty 5.0

    Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.

  20. On the Convergence Analysis of Muon

    stat.ML 2025-05 unverdicted novelty 5.0

    Convergence analysis shows Muon outperforms gradient descent by exploiting low-rank structure in neural network Hessians.
