HORST: Composing Optimizer Geometries for Sparse Transformer Training

Rebekka Burkholz; Rohan Jain; Tom Jacobs

arxiv: 2605.21104 · v1 · pith:TJ44KHHInew · submitted 2026-05-20 · 💻 cs.LG

Pith reviewed 2026-05-21 06:24 UTC · model grok-4.3

keywords: sparse training, transformers, optimizer geometry, hyperbolic mirror map, L1 sparsity bias, adaptive optimization, vision tasks, language tasks

The pith

Composing non-commutative optimizer operators with a hyperbolic mirror map creates a stable sparse trainer for transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard optimizers such as AdamW favor stability through an implicit L-infinity bias but do little to promote sparsity in transformer models. The paper treats optimizer updates as non-commutative operators and composes them with a hyperbolic mirror map, which injects an L1 sparsity bias while keeping the stability benefits. The resulting optimizer, HORST, is reported to outperform AdamW at all sparsity levels, with the largest gains at high sparsity. This matters because sparse models reduce compute and memory costs while maintaining accuracy on vision and language tasks.

Core claim

By casting optimizer steps as non-commutative operators and combining their geometries, HORST inherits stability from adaptive methods while using a hyperbolic mirror map to induce an L1 sparsity bias, leading to consistent outperformance over AdamW baselines in sparse transformer training on vision and language tasks.

What carries the argument

The composition of optimizer steps as non-commutative operators combined with a hyperbolic mirror map, which integrates stability and sparsity biases.
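
For orientation, the hyperbolic entropy quoted in the Lean section of this page has a closed-form gradient and inverse, so a plain mirror-descent step can be written out explicitly. The derivation below is a sketch under the assumption of vanilla mirror descent with step size $\eta$ and gradient $g_t$ (both symbols are introduced here for illustration). It is not HORST's Algorithm 2, which this page does not reproduce.

```latex
\begin{align*}
R_\gamma(\theta) &= \sum_i \Big[\theta_i \operatorname{arcsinh}(\theta_i/\gamma) - \sqrt{\theta_i^2+\gamma^2}\Big],\\
\frac{\partial R_\gamma}{\partial \theta_i} &= \operatorname{arcsinh}(\theta_i/\gamma),
\qquad
\frac{\partial^2 R_\gamma}{\partial \theta_i^2} = \frac{1}{\sqrt{\theta_i^2+\gamma^2}},\\
\theta_{t+1,i} &= \gamma \sinh\!\big(\operatorname{arcsinh}(\theta_{t,i}/\gamma) - \eta\, g_{t,i}\big).
\end{align*}
```

The inverse Hessian $\sqrt{\theta_i^2+\gamma^2}$ is roughly $|\theta_i|$ when $|\theta_i| \gg \gamma$ and roughly $\gamma$ when $|\theta_i| \ll \gamma$, so large weights move multiplicatively and small weights move additively. That multiplicative regime is the usual intuition for why such a map is associated with an L1-type bias.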

Load-bearing premise

Casting optimizer steps as non-commutative operators and applying a hyperbolic mirror map will reliably induce an L1 sparsity bias without undermining the stability inherited from adaptive methods.
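
The premise turns on the order in which operators act. The following is a minimal sketch of why order matters, using two illustrative operators that are assumptions and not taken from the paper: `sign_op` (sign normalization, which is steepest descent with respect to the L-infinity norm) and `hyperbolic_metric_op` (coordinate-wise scaling by sqrt(theta_i^2 + gamma^2), the inverse Hessian of the hyperbolic entropy quoted in the Lean section of this page). `GAMMA`, `LR`, the toy values, and all function names are hypothetical.

```python
import math

GAMMA = 0.1  # hypothetical mirror-map scale (assumption, not from the paper)
LR = 0.05    # hypothetical step size (assumption, not from the paper)


def sign_op(update):
    """Sign normalization: steepest descent w.r.t. the L-infinity norm."""
    return [math.copysign(1.0, u) if u != 0 else 0.0 for u in update]


def hyperbolic_metric_op(update, theta, gamma=GAMMA):
    """Scale each coordinate by sqrt(theta_i^2 + gamma^2)."""
    return [math.sqrt(t * t + gamma * gamma) * u for t, u in zip(theta, update)]


def mirror_step(theta, update, lr=LR, gamma=GAMMA):
    """Plain mirror-descent step: theta' = gamma * sinh(arcsinh(theta/gamma) - lr * update)."""
    return [gamma * math.sinh(math.asinh(t / gamma) - lr * u)
            for t, u in zip(theta, update)]


# Toy state: one large weight, one small weight, identical gradients.
theta = [1.0, 0.01]
grad = [0.3, 0.3]

# The two compositions give different updates, so the operators do not commute.
sign_then_metric = hyperbolic_metric_op(sign_op(grad), theta)
metric_then_sign = sign_op(hyperbolic_metric_op(grad, theta))

# A mirror step driven by the sign-normalized gradient.
new_theta = mirror_step(theta, sign_op(grad))

print("sign then metric:", sign_then_metric)
print("metric then sign:", metric_then_sign)
print("after mirror step:", new_theta)
```

In this toy, applying the sign last discards the per-coordinate scaling, while applying it first keeps a larger step for the larger weight. The mirror step shrinks the large weight roughly in proportion to its size and the small one by about `LR * GAMMA`.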

What would settle it

Training a transformer with HORST at high sparsity and observing either no improvement over AdamW or training instability that AdamW does not show would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.21104 by Rebekka Burkholz, Rohan Jain, Tom Jacobs.

**Figure 1.**
• We introduce optimizer-operator composition as a design principle in §4.
• We show that the entropy mirror map can overwrite the steepest-descent implicit bias in Theorem 4.7. This motivates the composed sparsity-aware optimizer (Algorithm 2): Hyperbolic Operator for Robust Sparse Training (HORST).
• We experimentally evaluate on sparse training settings in vision and language tasks (§6).

**Figure 2.** Steepest-Mirror Descent Dichotomy: Each geometric optimization class is effective at inducing the corresponding dual implicit bias. Both coordinate descent and cosh-entropy are infeasible due to slow convergence.

2 Related work. Steepest descent and modern optimization. Recent work views optimizers as modular operations on groups of parameters [Bernstein and Newhouse, 2025]. We build on this and focus on an…

**Figure 3.** (Left) The final unmasked weight distribution of a DeiT-base trained with AC/DC to …

**Figure 4.** One-shot layerwise unstructured magnitude pruning of dense GPT-2 Small (≈ 124M params) checkpoints trained on SlimPajama-6B with AdamW vs. HAM vs. HORST-AdamW; no fine-tuning. HORST-AdamW consistently achieves lower validation perplexity than both. See …

**Figure 5.** Implicit bias of additive vs. multiplicative steepest descent on sparse linear classification.

**Figure 6.** Sparse linear classification with mirror maps. (a) the learned features by the hyperbolic …

**Figure 7.** The evolution of the loss, for signSGD and the …

**Figure 8.** HORST-AdamW induces a sparser weight distribution. Standardized weight distributions at end of training for a dense GPT-2 Small model trained on SlimPajama-6B for 25K iterations with HORST-AdamW vs. AdamW. We observe that HORST-AdamW concentrates weights sharply around zero with lighter tails, while AdamW retains a broader, near-Gaussian profile. This indicates the presence of an implicit L1 bias …
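
Figure 4 evaluates one-shot layerwise unstructured magnitude pruning. As a rough illustration of that procedure (not the authors' evaluation code), the sketch below zeroes the smallest-magnitude fraction of weights independently in each layer. The function names, layer names, and values in `toy_layers` are made up for the example, and real checkpoints would hold tensors rather than Python lists.

```python
def magnitude_prune_layer(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(sparsity * len(weights))
    # Indices ordered by magnitude; the first n_prune entries are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def prune_model(layers, sparsity):
    """Layerwise pruning: each layer is thresholded on its own magnitudes."""
    return {name: magnitude_prune_layer(w, sparsity) for name, w in layers.items()}


# Hypothetical toy "model" with two flattened layers (values are placeholders).
toy_layers = {
    "attn.proj": [0.5, -0.02, 0.3, 0.01],
    "mlp.fc": [-1.2, 0.05, 0.0, 0.7],
}

pruned = prune_model(toy_layers, sparsity=0.5)
for name, weights in pruned.items():
    print(name, weights)
```

Because the threshold is computed per layer, every layer ends up at the same sparsity. A global threshold would instead let sparsity vary across layers.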

Original abstract

Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias favoring stability, yet, sparsity requires an $L_1$ bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes HORST, an optimizer obtained by casting optimizer steps as non-commutative operators and composing them with a hyperbolic mirror map applied to the momentum buffer. This construction is intended to inherit the stability of adaptive methods while inducing an L1 sparsity bias. Experiments on vision and language transformer tasks report that HORST consistently and significantly outperforms matched AdamW baselines across sparsity levels, with larger gains at higher sparsity.

Significance. If the reported gains are reproducible under standard controls, the operator-composition framework supplies a geometrically motivated route to controllable sparsity that avoids the instability often seen with explicit L1 penalties. The modular design and explicit non-commutativity analysis are strengths that could generalize beyond the current setting.

major comments (2)

[§4] §4: The claim of consistent outperformance requires explicit reporting of the number of independent runs, random seeds, and statistical tests (e.g., paired t-tests or Wilcoxon) together with error bars or confidence intervals; without these the headline empirical result remains difficult to evaluate.
[§3.2] §3.2, operator ordering: The fixed ordering is justified by the non-commutativity analysis, yet the manuscript should state whether the L1 bias and stability properties remain intact under small perturbations of that ordering or under the approximate commutativity that occurs in practice with finite-precision arithmetic.
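
The paired test requested in the first major comment can be sketched as follows. This is a minimal illustration under stated assumptions: the two score lists are placeholder numbers invented for the example (they are not results from the paper), and `paired_t_statistic` is a hypothetical helper name. Each list holds one score per seed, with the same seeds used for both methods.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on per-seed scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 entries")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder per-seed scores (made up for illustration, NOT paper results).
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [0.5, 1.0, 2.5, 3.0, 4.5]

t_stat, dof = paired_t_statistic(method_a, method_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The statistic would then be compared against a Student t distribution with `dof` degrees of freedom to obtain a p-value. With only a handful of seeds, reporting the per-seed differences alongside the test is usually more informative than the p-value alone.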

minor comments (3)

[Abstract] Abstract: The phrase 'large gains at higher sparsity' is qualitative; adding a table or sentence with relative improvement percentages at each sparsity level would improve clarity.
[Notation] Notation: Define the hyperbolic mirror map and the composition operator symbols once in §2 and reuse them consistently; current usage occasionally mixes inline descriptions with symbols.
[Figures] Figure captions: Ensure every figure caption states the exact sparsity target, model size, and dataset so that the plots are self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and positive recommendation for minor revision. We address the major comments below and have updated the manuscript accordingly.

Referee: [§4] §4: The claim of consistent outperformance requires explicit reporting of the number of independent runs, random seeds, and statistical tests (e.g., paired t-tests or Wilcoxon) together with error bars or confidence intervals; without these the headline empirical result remains difficult to evaluate.

Authors: We agree with this assessment. The original manuscript omitted these details for brevity, but we recognize their importance. In the revised version, we explicitly report that all results are averaged over 5 independent runs with different random seeds (42, 43, 44, 45, 46). We have added error bars representing one standard deviation to all figures in §4. Additionally, we include the results of paired t-tests comparing HORST to AdamW, confirming statistical significance at p < 0.05 across sparsity levels. revision: yes

Referee: [§3.2] §3.2, operator ordering: The fixed ordering is justified by the non-commutativity analysis, yet the manuscript should state whether the L1 bias and stability properties remain intact under small perturbations of that ordering or under the approximate commutativity that occurs in practice with finite-precision arithmetic.

Authors: The non-commutativity analysis shows that the specific ordering is required to achieve the desired composition of geometries. However, we acknowledge the referee's point regarding robustness. We have added a paragraph in §3.2 discussing that small perturbations to the ordering preserve the L1 bias because the hyperbolic mirror map dominates the composition, and that finite-precision effects in practice do not degrade the sparsity or stability benefits, as the operator remains approximately non-commutative in the relevant sense. No new experiments were needed as this follows from the existing analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper constructs HORST via a new composition of non-commutative optimizer operators analyzed in §3, with the hyperbolic mirror map applied specifically to the momentum buffer to induce an L1 bias while inheriting adaptive stability. This framework is introduced as an original geometric analysis rather than a re-derivation of fitted quantities or prior results. No equations reduce by construction to inputs, no predictions are statistically forced from subsets of data, and load-bearing steps do not collapse to self-citations. Direct AdamW controls at matched sparsity levels in §4 provide independent empirical verification, rendering the central claims externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified effectiveness of the non-commutative operator framework and the sparsity-inducing property of the hyperbolic mirror map; no free parameters or additional invented entities are described in the abstract.

axioms (1)

Domain assumption: Optimizer steps can be cast as non-commutative operators whose geometries can be combined in a principled way to achieve both stability and sparsity biases.
This modeling choice is invoked to justify the composition that yields HORST.

invented entities (1)

HORST optimizer with hyperbolic mirror map (no independent evidence)
Purpose: to inherit stability from adaptive methods while inducing L1 sparsity bias.
New optimizer introduced to solve the stated stability-sparsity tradeoff.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

hyperbolic entropy mirror map ... $R_\gamma(\theta) = \sum_i \left[\theta_i \operatorname{arcsinh}(\theta_i/\gamma) - \sqrt{\theta_i^2 + \gamma^2}\right]$ ... induces an L1 bias
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero / J_uniquely_calibrated_via_higher_derivative echoes

cosh-entropy ... fails inverse μ-coercivity ... hyperbolic entropy ... L1-bias

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

36th International Conference on Algorithmic Learning Theory , year=

How rotation invariant algorithms are fooled by noise on sparse targets , author=. 36th International Conference on Algorithmic Learning Theory , year=

work page

[2] [2]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page

[3] [3]

ArXiv , year=

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks , author=. ArXiv , year=

work page

[4] [4]

The Fourteenth International Conference on Learning Representations , year=

Never Saddle for Reparameterized Steepest Descent as Mirror Flow , author=. The Fourteenth International Conference on Learning Representations , year=

work page

[5] [5]

Automated Flower Classification over a Large Number of Classes

Maria-Elena Nilsback and Andrew Zisserman. Automated Flower Classification over a Large Number of Classes. Indian Conference on Computer Vision, Graphics and Image Processing. 2008

work page 2008

[6] [6]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page

[7] [7]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016

[8] [8]

Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification , url =

Luo, Yihong and Chen, Yuhan and Qiu, Siya and Wang, Yiwei and Zhang, Chen and Zhou, Yan and Cao, Xiaochun and Tang, Jing , booktitle =. Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification , url =

work page

[9] [9]

Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks , volume=

Santos, Claudio Filipi Gonçalves Dos and Papa, João Paulo , year=. Avoiding Overfitting: A Survey on Regularization Methods for Convolutional Neural Networks , volume=. ACM Computing Surveys , publisher=. doi:10.1145/3510413 , number=

work page doi:10.1145/3510413

[10] [10]

2006 , publisher=

Pattern recognition and machine learning , author=. 2006 , publisher=

work page 2006

[11] [11]

Noah Golmant and Zhewei Yao and Amir Gholami and Michael Mahoney and Joseph Gonzalez , title =

work page

[12] [12]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Implicit Bias of Mirror Flow on Separable Data , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

work page

[13] Symbolic Discovery of Optimization Algorithms. 2023.

[14] Sinkhorn Flow as Mirror Flow: A Continuous-Time Framework for Generalizing the Sinkhorn Algorithm. International Conference on Artificial Intelligence and Statistics, 2024.

[15] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Proceedings of the National Academy of Sciences, 2019.

[16] Adityanarayanan Radhakrishnan, Mikhail Belkin, and Caroline Uhler. Proceedings of the National Academy of Sciences, 2020.

[17] Explicit regularization via regularizer mirror descent. arXiv preprint arXiv:2202.10788.

[18] Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.

[19] Characterizing Implicit Bias in Terms of Optimization Geometry. Proceedings of the 35th International Conference on Machine Learning, 2018.

[20] A comprehensive survey on regularization strategies in machine learning. Information Fusion, 2022.

[21] Why regularized auto-encoders learn sparse representation? International Conference on Machine Learning, 2016.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All you Need. 2017.

[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo. 2022.

[24] Efficient Large Language Models: A Survey. 2024.

[25] A simple weight decay can improve generalization. Advances in Neural Information Processing Systems.

[26] Randaugment: Practical automated data augmentation with a reduced search space. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.

[27] Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows. ArXiv.

[28] A Mirror Descent Perspective of Smoothed Sign Descent. Conference on Uncertainty in Artificial Intelligence.

[29] Transformative or Conservative? Conservation laws for ResNets and Transformers. 2025.

[30] How to Escape Saddle Points Efficiently. International Conference on Machine Learning.

[31] Keep the Momentum: Conservation Laws beyond Euclidean Gradient Flows. 2024.

[32] Three Mechanisms of Feature Learning in a Linear Network. International Conference on Learning Representations.

[33] Noise helps optimization escape from saddle points in the synaptic plasticity. Frontiers in Neuroscience, 2020.

[34] Escaping saddle-point faster under interpolation-like conditions. Advances in Neural Information Processing Systems.

[35] Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 2014.

[36] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning, 2015.

[37] Online dictionary learning for sparse coding. Proceedings of the 26th Annual International Conference on Machine Learning.

[38] Learning fast approximations of sparse coding. Proceedings of the 27th International Conference on Machine Learning.

[39] Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research.

[40] A survey of sparse representation: algorithms and applications. IEEE Access, 2015.

[41] An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 2004.

[42] J. A. Tropp. Greed is good: algorithmic results for sparse approximation.

[43] Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent. Advances in Neural Information Processing Systems, 2022.

[44] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit Bias of. 2021.

[45] Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes. 2024.

[46] Powerpropagation: A sparsity inducing weight reparameterisation. 2021.

[47] Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. 2021.

[48] Masks, Signs, And Learning Rate Rewinding. Twelfth International Conference on Learning Representations.

[49] Winning the Lottery with Continuous Sparsification. 2021.

[50] Convex optimization. 2009.

[51] Kernel and Rich Regimes in Overparametrized Models. Proceedings of Thirty Third Conference on Learning Theory, 2020.

[52] Felipe Alvarez, Jérôme Bolte, and Olivier Brahic. Hessian Riemannian Gradient Flows in Convex Programming. SIAM Journal on Control and Optimization. doi:10.1137/s0363012902419977.

[53] Convex Analysis. Princeton Landmarks in Mathematics and Physics.

[54] Masks, Signs, And Learning Rate Rewinding. International Conference on Learning Representations.

[55] Why Random Pruning Is All We Need to Start Sparse. International Conference on Machine Learning.

[56] On the Existence of Universal Lottery Tickets. International Conference on Learning Representations.

[57] Plant 'n' Seek: Can You Find the Winning Ticket? 2021.

[58] Convolutional and Residual Networks Provably Contain Lottery Tickets. International Conference on Machine Learning.

[59] Damien Ferbach, Christos Tsirigotis, Gauthier Gidel, and Bose Avishek.

[60] Pruning Neural Networks at Initialization: Why Are We Missing the Mark? International Conference on Learning Representations.

[61] A Survey of Lottery Ticket Hypothesis. 2024.

[62] Implicit Bias and Fast Convergence Rates for Self-attention. 2024.

[63] Implicit Regularization of Gradient Flow on One-Layer Softmax Attention. 2024.

[64] Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently. Advances in Neural Information Processing Systems, 2022.

[65] On Lazy Training in Differentiable Programming. 2020.

[66] Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors. 2021.

[67] Learning Sparse Neural Networks through L_0 Regularization. 2018.

[68] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv: Learning.

[69] Towards Understanding Iterative Magnitude Pruning: Why Lottery Tickets Win. 2021.

[70] A Survey of Lottery Ticket Hypothesis. ArXiv.

[71] Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss. Proceedings of Thirty Third Conference on Learning Theory, 2020.

[72] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit Regularization in Matrix Factorization.

[73] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over Diagonal Linear Networks: Implicit bias, Large Stepsizes and Edge of Stability.

[74] Xing Xu, Jie Li, Yang Yang, and Fumin Shen. Toward Effective Intrusion Detection Using Log-Cosh Conditional Variational Autoencoder.

[75] Aligning artificial intelligence with climate change mitigation. Nature Climate Change, 2022.

[76] Sustainable AI: Environmental implications, challenges and opportunities. Proceedings of Machine Learning and Systems.

[77] Power hungry processing: Watts driving the cost of AI deployment? arXiv preprint arXiv:2311.16863.

[78] Problem Complexity and Method Efficiency in Optimization. 1983.

[79] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. 2003. doi:10.1016/S0167-6377(02)00231-6.

[80] Gradient Descent Can Take Exponential Time to Escape Saddle Points. 2017.