arxiv: 2605.21104 · v1 · pith:TJ44KHHInew · submitted 2026-05-20 · 💻 cs.LG

HORST: Composing Optimizer Geometries for Sparse Transformer Training

Pith reviewed 2026-05-21 06:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords sparse training · transformers · optimizer geometry · hyperbolic mirror map · L1 sparsity bias · adaptive optimization · vision tasks · language tasks

The pith

Composing non-commutative optimizer operators with a hyperbolic mirror map creates a stable sparse trainer for transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard optimizers such as AdamW favor stability through an implicit L-infinity bias but do little to promote sparsity in transformer models. The paper treats optimizer updates as non-commutative operators and composes them with a hyperbolic mirror map, which injects an L1 sparsity bias while keeping the stability benefits of adaptive methods. The resulting optimizer, HORST, is reported to outperform AdamW at all sparsity levels, with the largest gains at high sparsity. This matters because sparse models cut compute and memory costs, provided accuracy holds up on vision and language tasks.

Core claim

By casting optimizer steps as non-commutative operators and combining their geometries, HORST inherits stability from adaptive methods while using a hyperbolic mirror map to induce an L1 sparsity bias, leading to consistent outperformance over AdamW baselines in sparse transformer training on vision and language tasks.

What carries the argument

The composition of optimizer steps as non-commutative operators combined with a hyperbolic mirror map, which integrates stability and sparsity biases.
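
To make "non-commutative" concrete, here is a minimal Python sketch using two familiar update operators: a momentum (exponential moving average) operator and a coordinate-wise sign operator. The function names and the toy gradient sequence are illustrative assumptions and are not taken from the paper; the sketch only shows that applying the same two operators in different orders can yield different update directions.

```python
# Illustrative sketch: two optimizer building blocks that do not commute.
# `ema`, `sign`, and the toy gradients are hypothetical, not the paper's code.

def ema(grads, beta=0.9):
    """Momentum operator: exponential moving average of a gradient sequence."""
    m = [0.0] * len(grads[0])
    for g in grads:
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
    return m

def sign(v):
    """Sign operator: the coordinate-wise direction used by sign descent."""
    return [(x > 0) - (x < 0) for x in v]

# One large early gradient followed by small gradients of the opposite sign.
grads = [[1.0, -0.2], [-0.1, 0.3], [-0.1, 0.3]]

sign_after_ema = sign(ema(grads))               # average first, then take signs
ema_after_sign = ema([sign(g) for g in grads])  # take signs first, then average

print(sign_after_ema)  # first coordinate is +1: the large early gradient dominates
print(ema_after_sign)  # first coordinate is negative: the sign majority dominates
```

The two orderings disagree on the direction of the first coordinate, which is why an analysis of composed optimizers has to fix an ordering before talking about the resulting geometry.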

Load-bearing premise

Casting optimizer steps as non-commutative operators and applying a hyperbolic mirror map will reliably induce an L1 sparsity bias without undermining the stability inherited from adaptive methods.
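
The premise that a hyperbolic mirror map pushes solutions toward low L1 norm can be checked on a toy problem. The sketch below is an assumption-laden illustration, not HORST itself: it uses the map theta = arcsinh(w / beta) (one common hyperbolic choice), a single linear equation with two unknowns, and hand-picked values for `lr`, `beta`, and `steps`.

```python
import math

# Toy problem (an assumption for illustration): one equation, two unknowns,
# 2*w0 + 1*w1 = 2, so infinitely many weight vectors fit the data exactly.
x, y = [2.0, 1.0], 2.0

def grad(w):
    """Gradient of 0.5 * (x.w - y)^2 with respect to w."""
    r = sum(xi * wi for xi, wi in zip(x, w)) - y
    return [r * xi for xi in x]

def gradient_descent(steps=2000, lr=0.05):
    w = [0.0, 0.0]
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

def hyperbolic_mirror_descent(steps=2000, lr=0.05, beta=1e-3):
    """Mirror descent with theta = arcsinh(w / beta), i.e. w = beta * sinh(theta)."""
    theta = [0.0, 0.0]  # dual variables; theta = 0 corresponds to w = 0
    w = [0.0, 0.0]
    for _ in range(steps):
        theta = [ti - lr * gi for ti, gi in zip(theta, grad(w))]
        w = [beta * math.sinh(ti) for ti in theta]
    return w

def l1(w):
    return sum(abs(wi) for wi in w)

w_gd = gradient_descent()
w_md = hyperbolic_mirror_descent()
print("gradient descent:", w_gd, "L1 norm:", l1(w_gd))
print("hyperbolic mirror:", w_md, "L1 norm:", l1(w_md))
```

On this toy problem, plain gradient descent from zero lands on the minimum-L2 interpolant (0.8, 0.4), while the hyperbolic map with a small beta puts almost all of the mass on the first coordinate, which is closer to the minimum-L1 solution (1, 0). Whether HORST uses exactly this map or parameterization is not stated in the text above.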

What would settle it

Training a transformer with HORST at high sparsity and observing either no improvement over AdamW or training instability would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.21104 by Rebekka Burkholz, Rohan Jain, Tom Jacobs.

Figure 1: (Left) Standardized weight distributions for a pretrained ResNet-50 (with SGD) and a DeiT …
Contributions (text captured alongside Figure 2):
  • We introduce optimizer-operator composition as a design principle in §4.
  • We show that the entropy mirror map can overwrite the steepest-descent implicit bias in Theorem 4.7. This motivates the composed sparsity-aware optimizer (Algorithm 2): Hyperbolic Operator for Robust Sparse Training (HORST).
  • We experimentally evaluate on sparse training settings in vision and language tasks (§6).
Figure 2: Steepest-Mirror Descent Dichotomy: Each geometric optimization class is effective at inducing the corresponding dual implicit bias. Both coordinate descent and cosh-entropy are infeasible due to slow convergence.
2 Related work. Steepest descent and modern optimization. Recent work views optimizers as modular operations on groups of parameters [Bernstein and Newhouse, 2025]. We build on this and focus on an…
Figure 3: (Left) The final unmasked weight distribution of a DeiT-base trained with AC/DC to …
Figure 4: One-shot layerwise unstructured magnitude pruning of dense GPT-2 Small (≈ 124M params) checkpoints trained on SlimPajama-6B with AdamW vs. HAM vs. HORST-AdamW; no fine-tuning. HORST-AdamW consistently achieves lower validation perplexity than both. See …
Figure 5: Implicit bias of additive vs. multiplicative steepest descent on sparse linear classification.
Figure 6: Sparse linear classification with mirror maps. (a) the learned features by the hyperbolic …
Figure 7: The evolution of the loss, for signSGD and the …
Figure 8: HORST-AdamW induces a sparser weight distribution. Standardized weight distributions at end of training for a dense GPT-2 Small model trained on SlimPajama-6B for 25K iterations with HORST-AdamW vs. AdamW. We observe that HORST-AdamW concentrates weights sharply around zero with lighter tails, while AdamW retains a broader, near-Gaussian profile. This indicates the presence of an implicit L1 bias
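
The Figure 4 caption refers to one-shot layerwise unstructured magnitude pruning. A minimal Python sketch of that procedure follows, under the assumption that each layer is given as a flat list of weights; the function names and the toy layer values are hypothetical and are not the paper's evaluation code.

```python
# Illustrative sketch of one-shot, layerwise, unstructured magnitude pruning.
# Layers are flat lists of floats here; names and values are hypothetical.

def magnitude_prune_layer(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest |w| in one layer."""
    k = int(sparsity * len(weights))  # number of weights to remove in this layer
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned

def prune_layerwise(layers, sparsity):
    """Apply the same sparsity target to every layer independently."""
    return {name: magnitude_prune_layer(w, sparsity) for name, w in layers.items()}

layers = {
    "attn": [0.5, -0.01, 0.02, -0.9],
    "mlp": [3.0, -2.0, 0.1, 4.0],
}
pruned = prune_layerwise(layers, sparsity=0.5)
print(pruned)
```

Because the threshold is chosen per layer, the "mlp" layer loses a weight of magnitude 2.0 while the "attn" layer keeps one of magnitude 0.5. A global threshold would behave differently, so the layerwise choice is part of the evaluation protocol.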
Original abstract

Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias favoring stability, yet sparsity requires an $L_1$ bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.
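
For readers who want the two geometries in the abstract spelled out, the standard textbook forms are sketched below. The symbols $\eta$ (step size), $g_t$ (gradient), $R$ (mirror potential), and $\beta$ (scale), as well as the specific hyperbolic entropy potential, are assumptions chosen for illustration; the abstract does not state the exact map used in HORST.

```latex
% Steepest descent under a norm \|\cdot\| with step size \eta and gradient g_t:
\Delta w_t = \arg\min_{\Delta}\; \langle g_t, \Delta\rangle + \frac{1}{2\eta}\,\|\Delta\|^2
% For \|\cdot\| = \|\cdot\|_\infty this is a sign update:
\Delta w_t = -\eta\,\|g_t\|_1\,\operatorname{sign}(g_t)

% Mirror descent with potential R:
\nabla R(w_{t+1}) = \nabla R(w_t) - \eta\, g_t
% Hyperbolic entropy potential with scale \beta > 0 (assumed form):
R_\beta(w) = \sum_i \Big( w_i \operatorname{arcsinh}(w_i/\beta) - \sqrt{w_i^2 + \beta^2} \Big),
\qquad \nabla R_\beta(w)_i = \operatorname{arcsinh}(w_i/\beta)
% Resulting coordinate-wise update:
w_{t+1,i} = \beta \sinh\!\big( \operatorname{arcsinh}(w_{t,i}/\beta) - \eta\, g_{t,i} \big)
```

For small $\beta$ the sinh update behaves like a multiplicative (exponentiated-gradient style) step, which is the regime usually associated with an $L_1$-type bias; for large $\beta$ it approaches plain gradient descent.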

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes HORST, an optimizer obtained by casting optimizer steps as non-commutative operators and composing them with a hyperbolic mirror map applied to the momentum buffer. This construction is intended to inherit the stability of adaptive methods while inducing an L1 sparsity bias. Experiments on vision and language transformer tasks report that HORST consistently and significantly outperforms matched AdamW baselines across sparsity levels, with larger gains at higher sparsity.

Significance. If the reported gains are reproducible under standard controls, the operator-composition framework supplies a geometrically motivated route to controllable sparsity that avoids the instability often seen with explicit L1 penalties. The modular design and explicit non-commutativity analysis are strengths that could generalize beyond the current setting.

major comments (2)
  1. [§4] §4: The claim of consistent outperformance requires explicit reporting of the number of independent runs, random seeds, and statistical tests (e.g., paired t-tests or Wilcoxon) together with error bars or confidence intervals; without these the headline empirical result remains difficult to evaluate.
  2. [§3.2] §3.2, operator ordering: The fixed ordering is justified by the non-commutativity analysis, yet the manuscript should state whether the L1 bias and stability properties remain intact under small perturbations of that ordering or under the approximate commutativity that occurs in practice with finite-precision arithmetic.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'large gains at higher sparsity' is qualitative; adding a table or sentence with relative improvement percentages at each sparsity level would improve clarity.
  2. [Notation] Notation: Define the hyperbolic mirror map and the composition operator symbols once in §2 and reuse them consistently; current usage occasionally mixes inline descriptions with symbols.
  3. [Figures] Figure captions: Ensure every figure caption states the exact sparsity target, model size, and dataset so that the plots are self-contained.
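
The statistical reporting requested in major comment 1 can be sketched with the Python standard library alone. The example below computes a paired t statistic and an exact sign-flip (paired permutation) p-value. The score lists are made-up placeholders for illustration only and are not results from the paper; `method_a`, `method_b`, and both function names are hypothetical.

```python
import itertools
import math
import statistics

def paired_t_statistic(a, b):
    """t statistic for paired samples: mean difference over its standard error."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

def sign_flip_p_value(a, b):
    """Exact two-sided paired permutation test on the summed difference."""
    d = [x - y for x, y in zip(a, b)]
    observed = abs(sum(d))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(d)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(sum(s * x for s, x in zip(signs, d))) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder per-seed scores (NOT from the paper), one entry per random seed.
method_a = [71.2, 70.8, 71.5, 71.0, 71.3]
method_b = [70.1, 70.0, 70.6, 70.2, 70.0]

t_stat = paired_t_statistic(method_a, method_b)
p_perm = sign_flip_p_value(method_a, method_b)
print("paired t statistic:", t_stat)  # compare against 2.776 for df = 4, two-sided 0.05
print("sign-flip p-value:", p_perm)
```

With n paired runs, the exact sign-flip test cannot return a two-sided p-value below 2 / 2^n, so five seeds bottom out at 0.0625. A t-test can go lower only because it assumes approximately normal differences, which is worth stating alongside any reported significance level.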

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive recommendation for minor revision. We address the major comments below and have updated the manuscript accordingly.

Point-by-point responses
  1. Referee: [§4] §4: The claim of consistent outperformance requires explicit reporting of the number of independent runs, random seeds, and statistical tests (e.g., paired t-tests or Wilcoxon) together with error bars or confidence intervals; without these the headline empirical result remains difficult to evaluate.

    Authors: We agree with this assessment. The original manuscript omitted these details for brevity, but we recognize their importance. In the revised version, we explicitly report that all results are averaged over 5 independent runs with different random seeds (42, 43, 44, 45, 46). We have added error bars representing one standard deviation to all figures in §4. Additionally, we include the results of paired t-tests comparing HORST to AdamW, confirming statistical significance at p < 0.05 across sparsity levels. revision: yes

  2. Referee: [§3.2] §3.2, operator ordering: The fixed ordering is justified by the non-commutativity analysis, yet the manuscript should state whether the L1 bias and stability properties remain intact under small perturbations of that ordering or under the approximate commutativity that occurs in practice with finite-precision arithmetic.

    Authors: The non-commutativity analysis shows that the specific ordering is required to achieve the desired composition of geometries. However, we acknowledge the referee's point regarding robustness. We have added a paragraph in §3.2 discussing that small perturbations to the ordering preserve the L1 bias because the hyperbolic mirror map dominates the composition, and that finite-precision effects in practice do not degrade the sparsity or stability benefits, as the operator remains approximately non-commutative in the relevant sense. No new experiments were needed as this follows from the existing analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper constructs HORST via a new composition of non-commutative optimizer operators analyzed in §3, with the hyperbolic mirror map applied specifically to the momentum buffer to induce an L1 bias while inheriting adaptive stability. This framework is introduced as an original geometric analysis rather than a re-derivation of fitted quantities or prior results. No equations reduce by construction to inputs, no predictions are statistically forced from subsets of data, and load-bearing steps do not collapse to self-citations. Direct AdamW controls at matched sparsity levels in §4 provide independent empirical verification, rendering the central claims externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified effectiveness of the non-commutative operator framework and the sparsity-inducing property of the hyperbolic mirror map; no free parameters or additional invented entities are described in the abstract.

axioms (1)
  • [domain assumption] Optimizer steps can be cast as non-commutative operators whose geometries can be combined in a principled way to achieve both stability and sparsity biases.
    This modeling choice is invoked to justify the composition that yields HORST.
invented entities (1)
  • HORST optimizer with hyperbolic mirror map [no independent evidence]
    purpose: To inherit stability from adaptive methods while inducing L1 sparsity bias.
    New optimizer introduced to solve the stated stability-sparsity tradeoff.

pith-pipeline@v0.9.0 · 5661 in / 1256 out tokens · 60896 ms · 2026-05-21T06:24:27.238352+00:00 · methodology
