Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

Guido Montufar; Tom Jacobs

arxiv: 2605.19458 · v1 · pith:F5FUJCWZnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-20 07:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords mirror flow · implicit bias · homogeneous neural networks · max-margin solutions · feature learning · sparse representations · convex duality · optimization dynamics

The pith

Mirror flow reaches the same max-margin solution for different mirror maps but induces representations ranging from sparse to dense neuron activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the implicit bias of mirror flow in deep neural networks that use homogeneous activation functions. Extending results from gradient flow, the authors derive a balance equation from convex duality that characterizes a horizon function controlling the limiting margin. They show that distinct non-homogeneous mirror maps can converge to the identical max-margin classifier while producing very different internal representations, with neuron activations varying from sparse to dense. The work also gives convergence rates and norm growth estimates, and verifies the predictions on synthetic datasets and standard vision tasks. This offers a unified view on how the choice of mirror map affects both the speed of optimization and the sparsity of learned features.
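
For orientation, the objects named in this summary can be written in standard textbook notation. This is a sketch under assumptions, not the paper's own statement: the symbols \theta, \phi, \mathcal{L}, f, L, and \gamma are chosen here for illustration, the second form of the flow assumes \phi is twice differentiable with invertible Hessian, and the paper's balance equation, horizon function, and Q-margin are not reproduced because their definitions are not given on this page.

```latex
% Mirror flow on a training loss \mathcal{L} with strictly convex potential (mirror map) \phi:
\frac{d}{dt}\,\nabla\phi(\theta_t) = -\nabla\mathcal{L}(\theta_t)
\qquad\Longleftrightarrow\qquad
\dot{\theta}_t = -\bigl(\nabla^2\phi(\theta_t)\bigr)^{-1}\,\nabla\mathcal{L}(\theta_t).

% Gradient flow is the special case \phi(\theta) = \tfrac{1}{2}\|\theta\|_2^2.

% Positive homogeneity of degree L of the network output in its parameters:
f(c\,\theta;\, x) = c^{L} f(\theta;\, x) \quad \text{for all } c > 0.

% \ell_2-normalized margin on data (x_i, y_i) with y_i \in \{-1, +1\},
% the quantity studied for gradient flow:
\gamma(\theta) = \frac{\min_i\; y_i\, f(\theta;\, x_i)}{\|\theta\|_2^{L}}.
```

Because of homogeneity, \gamma is invariant to rescaling \theta, which is why the limiting direction of the parameters, rather than their norm, is the object of interest.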

Core claim

Mirror flow in homogeneous networks satisfies a novel balance equation derived from convex duality. This equation characterizes the horizon function that governs the induced margin. Distinct non-homogeneous mirror maps can induce the same max-margin solution yet produce markedly different representations, ranging from sparse to dense neuron activations.

What carries the argument

The balance equation for mirror flow obtained from convex duality, which defines the horizon function that determines the limiting margin and distinguishes the effect of different mirror maps on representations.

If this is right

Distinct non-homogeneous mirror maps can induce the same max-margin solution.
Convergence of mirror flow can be extremely slow, including exponentially slow regimes.
All considered mirror maps exhibit feature learning, yet they produce representations ranging from sparse to dense neuron activations.
Mirror maps shape both the optimization dynamics and the geometry of the learned classifiers.
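
To make the role of the mirror map concrete, here is a minimal sketch in the simplest possible setting: a linear (1-homogeneous) classifier trained on separable toy data with logistic loss, once by gradient descent and once by mirror descent with the hyperbolic entropy potential. It is not the paper's deep-network experiment. The data generator, step size, step count, and the value of `beta` are arbitrary assumptions, and the potential is written in its commonly used form, whose gradient is `arcsinh(theta / beta)`.

```python
import numpy as np

def make_data(n=40, d=10, seed=0):
    """Toy separable data (an assumption): the label lives in coordinate 0."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=n)
    X = 0.3 * rng.normal(size=(n, d))
    X[:, 0] = y * (1.0 + np.abs(X[:, 0]))  # margin of at least 1 along e_0
    return X, y

def logistic_loss(theta, X, y):
    return np.logaddexp(0.0, -y * (X @ theta)).mean()

def logistic_grad(theta, X, y):
    m = y * (X @ theta)
    s = np.exp(-np.logaddexp(0.0, m))  # sigmoid(-m), computed stably
    return -(X * (y * s)[:, None]).mean(axis=0)

# Hyperbolic entropy: phi(t) = sum_i t_i * arcsinh(t_i / beta) - sqrt(t_i^2 + beta^2)
def grad_phi_hyp(theta, beta):
    return np.arcsinh(theta / beta)

def grad_phi_hyp_inv(z, beta):
    return beta * np.sinh(z)

def mirror_descent(X, y, beta, lr=0.1, steps=2000):
    """Discretized mirror flow: take the gradient step in the dual (mirror) variable."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        z = grad_phi_hyp(theta, beta) - lr * logistic_grad(theta, X, y)
        theta = grad_phi_hyp_inv(z, beta)
    return theta

def gradient_descent(X, y, lr=0.1, steps=2000):
    """Same loop with phi = 0.5 * ||theta||^2, so the mirror map is the identity."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta = theta - lr * logistic_grad(theta, X, y)
    return theta

X, y = make_data()
theta_gd = gradient_descent(X, y)
theta_md = mirror_descent(X, y, beta=1e-3)
for name, th in [("GD", theta_gd), ("hyperbolic entropy", theta_md)]:
    direction = th / np.linalg.norm(th)
    print(name, "loss:", logistic_loss(th, X, y), "direction:", np.round(direction, 3))
```

Comparing the two printed directions shows how the potential changes which coordinates carry the weight at a finite training time. Whether the two directions coincide in the limit is the kind of question the paper's theory addresses, and this sketch does not settle it.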

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The choice of mirror map could be used to control the sparsity or density of learned features without altering the final decision boundary.
Differences in representation sparsity may influence generalization, robustness, or transfer performance in downstream tasks.
Similar balance equations and horizon functions might be derived for other first-order methods or for networks that are not strictly homogeneous.

Load-bearing premise

Mirror flow dynamics in homogeneous networks converge to a max-margin solution as time tends to infinity.

What would settle it

A simulation or calculation in which mirror flow on a homogeneous network fails to approach the predicted max-margin solution for a given mirror map, or in which two different mirror maps produce identical neuron activation patterns.
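
A check of this kind needs a few diagnostics. The sketch below, for a bias-free two-layer ReLU network, computes the ℓ2-normalized margin, the fraction of "active" neurons, and the activation pattern on a data set. The function names, the relative threshold `rel_tol`, and the use of |a_j|·‖w_j‖ as the activity measure (motivated by the rescaled weights in the Figure 2 caption) are assumptions made for illustration. The paper's Q-margin is not implemented because its definition is not given on this page.

```python
import numpy as np

def two_layer_relu(W, a, X):
    """f(theta; x) = sum_j a_j * relu(w_j . x); bias-free, hence 2-homogeneous in (W, a)."""
    return np.maximum(X @ W.T, 0.0) @ a

def normalized_margin(W, a, X, y, degree=2):
    """min_i y_i f(x_i) / ||theta||_2^degree, invariant to rescaling theta."""
    norm = np.sqrt((W ** 2).sum() + (a ** 2).sum())
    return (y * two_layer_relu(W, a, X)).min() / norm ** degree

def active_fraction(W, a, rel_tol=1e-2):
    """Fraction of neurons with |a_j| * ||w_j|| above rel_tol times the largest value."""
    scale = np.abs(a) * np.linalg.norm(W, axis=1)
    return float((scale > rel_tol * scale.max()).mean())

def activation_pattern(W, X):
    """Boolean matrix: entry (i, j) is True when neuron j fires on input i."""
    return (X @ W.T) > 0.0

# Tiny hand-made example (hypothetical weights, not trained).
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
a = np.array([1.0, -2.0, 0.0, 0.0])
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])

print("normalized margin:", normalized_margin(W, a, X, y))
print("active fraction:", active_fraction(W, a))
print("patterns equal to themselves:",
      np.array_equal(activation_pattern(W, X), activation_pattern(W, X)))
```

To run the test described above, one would train the same network under two mirror maps, then compare `normalized_margin` and `activation_pattern` between the two solutions.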

Figures

Figures reproduced from arXiv: 2605.19458 by Guido Montufar, Tom Jacobs.

**Figure 2.** Different types of feature learning under mirror flows. The hyperbolic entropy induces sparse feature learning with fewer active neurons, whereas the smoothed homogeneous potential (p = 3) induces dense feature learning with more active neurons. (Left) Input weight representations w̃ = |a_j| w_j of a two-layer student network in a student-teacher setup: GD (orange) produces diffuse weights near the origin, p…

**Figure 3.** Training with smoothened homogeneous potential (p = 3). This shows that increasing λ > 0 slows convergence toward the max-margin solution, so that for finite training time the learned representation remains closer to that of gradient descent. Reaching the margin. The results presented so far characterize the implicit bias of mirror flow through the introduction of a Q-margin and its maximization in th…

**Figure 4.** Weight distributions for the first layer of a VGG-16 and weight pruning. (Left) We show …

**Figure 5.** Learned representation by mirror descent with hyperbolic entropy for various …

**Figure 6.** Learned representation by mirror descent with smoothened homogeneous (p …

**Figure 7.** Training ten times longer does not change the representation much when …

**Figure 8.** Weight magnitude distribution of the first layer after time rescaling for all mirror maps.

**Figure 9.** Weight magnitude distribution of the second layer after time rescaling for all mirror maps.

**Figure 10.** Weight magnitude distribution of the third layer after time rescaling for all mirror maps.

**Figure 11.** Illustration of the decision boundary and function activation value reached for the …

**Figure 12.** Learned representation in the input layer …

**Figure 13.** Learned representation in the input layer …

**Figure 14.** The weight magnitude distribution of a VGG16’s last layer. We observe that all algorithms …

Original abstract

We study the max-margin solutions reached by mirror flow in deep neural networks with homogeneous activation functions. Extending classical results on gradient flow, we derive a novel balance equation for mirror flow from convex duality, enabling a characterization of the horizon function governing the induced margin. We further establish max-margin characterizations together with convergence rates and norm growth estimates. Finally, we support our theory through experiments on synthetic datasets and standard vision tasks. Concretely, we show that: (1) distinct non-homogeneous mirror maps can induce the same max-margin solution; (2) convergence can be extremely slow, including exponentially slow regimes; and (3) although all considered mirror maps exhibit feature learning, they can produce markedly different representations, ranging from sparse to dense neuron activations. Together, these results provide a unified perspective on sparse and dense feature learning in homogeneous neural networks, highlighting how mirror maps shape both optimization dynamics and the geometry of the learned classifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a balance equation for mirror flow from convex duality that links mirror map choice to sparse versus dense features while preserving the same max-margin solution.

The main thing to know is that this work derives a balance equation for mirror flow from convex duality. The equation characterizes how different mirror maps can yield the same max-margin solution while producing representations that range from sparse to dense neuron activations. This is a clear extension of the gradient-flow implicit-bias results to mirror flow on homogeneous networks. The authors provide max-margin characterizations, convergence rates (including slow regimes), and norm growth estimates, and their experiments on synthetic and vision tasks show the practical differences in feature learning induced by the choice of mirror map. The paper keeps its focus on the geometry of the learned classifiers and on how the optimizer affects internal sparsity without changing the decision boundary.

The soft spot is the convergence assumption. The authors claim that mirror flow reaches the max-margin limit, but the conditions under which this holds for general mirror maps and networks need to be stated. If convergence is not guaranteed in every case they consider, the balance equation may not describe the long-run behavior in those cases. The math uses standard tools from convex duality, so that part looks reproducible, and the citation pattern is appropriate for the subfield.

This paper is for theorists and practitioners interested in implicit bias and in controlling feature sparsity through optimization choices. It offers a unified view that could help in designing mirror maps with desired properties. I recommend it for peer review: the combination of the new derivation and the supporting experiments makes it worth the referees' time.

Referee Report

1 major / 2 minor

Summary. The paper studies the implicit bias of mirror flow on homogeneous neural networks. It derives a balance equation from convex duality to characterize the horizon function and induced margin, establishes max-margin characterizations along with convergence rates and norm growth estimates, and shows via experiments on synthetic data and vision tasks that distinct non-homogeneous mirror maps can reach the same max-margin solution while producing markedly different sparse-to-dense neuron activations and representations.

Significance. If the derivations and convergence results hold, the work extends gradient-flow implicit-bias theory to mirror flow and supplies a unified account of how mirror-map choice shapes both dynamics and the geometry of learned classifiers. The experimental demonstration that the same max-margin classifier can arise with qualitatively different feature sparsity is a concrete contribution to understanding representation learning under different optimizers.

major comments (1)

[Abstract and convergence-rate section] The balance equation and horizon-function characterization are derived under the premise that mirror flow converges to a max-margin solution as t→∞. The manuscript states that max-margin characterizations and convergence rates are established, yet supplies no explicit conditions (strict convexity of the mirror map, depth restrictions, initialization, or loss assumptions) guaranteeing that the limit is attained and that the rates are positive. If convergence is only logarithmic or fails in some regimes, the duality-based balance equation does not govern the observed trajectory, weakening the claim that distinct mirror maps produce different sparse/dense representations while sharing the same max-margin classifier.

minor comments (2)

[Theory sections] Notation for the mirror map and its Legendre dual should be introduced once and used consistently; several passages switch between φ and its conjugate without explicit reminder.
[Experiments] Experimental details (learning-rate schedules, initialization variance, exact synthetic data generation) are only sketched; a short reproducibility paragraph or supplementary table would help.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and valuable feedback. The major comment raises an important point about the assumptions underlying our convergence claims, which we address below by clarifying the scope of our results and committing to revisions that make the conditions explicit.

Referee: [Abstract and convergence-rate section] The balance equation and horizon-function characterization are derived under the premise that mirror flow converges to a max-margin solution as t→∞. The manuscript states that max-margin characterizations and convergence rates are established, yet supplies no explicit conditions (strict convexity of the mirror map, depth restrictions, initialization, or loss assumptions) guaranteeing that the limit is attained and that the rates are positive. If convergence is only logarithmic or fails in some regimes, the duality-based balance equation does not govern the observed trajectory, weakening the claim that distinct mirror maps produce different sparse/dense representations while sharing the same max-margin classifier.

Authors: We thank the referee for this precise observation. Our derivations of the balance equation and horizon-function characterization are indeed performed under the assumption that mirror flow converges to a max-margin solution as t tends to infinity, following the standard methodology in the implicit-bias literature. The manuscript already notes that convergence can be exponentially slow in some regimes and provides supporting experiments. To strengthen the presentation, we will revise the abstract, introduction, and convergence-rate section to state explicit sufficient conditions for convergence to the max-margin limit (including strict convexity of the mirror map, suitable initialization, homogeneity of the network, and standard assumptions on the loss). We will also clarify that the duality-based characterization governs the asymptotic behavior precisely when these conditions hold and that the reported rates are positive under the same assumptions. These changes will ensure the claims about sparse versus dense representations under different mirror maps are rigorously tied to the regimes where the limit is attained.

revision: yes

Circularity Check

0 steps flagged

No circularity: balance equation derived from external convex duality

The paper derives the balance equation and horizon function characterization explicitly from convex duality applied to mirror flow, extending classical gradient flow results without reducing any claimed prediction or first-principles result to a quantity defined by the paper's own fitted parameters or self-referential inputs. The max-margin convergence assumption is stated as an extension of prior work rather than proven here, but this does not create a definitional loop or fitted-input-called-prediction pattern in the derivation chain. No self-citation is load-bearing for the core balance equation, and the sparse/dense representation distinctions follow from the distinct mirror maps under the shared limit, which remains independently characterizable. The analysis is self-contained against external mathematical tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about homogeneous activations and the applicability of convex duality to the continuous-time dynamics; no free parameters or invented entities are indicated in the abstract.

axioms (2)

domain assumption: Neural network activations are homogeneous.
This property is required to extend classical gradient flow results and derive the balance equation for mirror flow.
domain assumption: Convex duality applies directly to mirror flow trajectories.
This is invoked to obtain the novel balance equation and horizon function characterization.
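
The first assumption is easy to test numerically for a given architecture. The sketch below checks degree-L positive homogeneity, f(cθ; x) = c^L f(θ; x), for a bias-free ReLU network, and shows that an output bias breaks it. The helper names, the weights, and the scale factor `c` are hypothetical values chosen for illustration, not taken from the paper.

```python
import numpy as np

def relu_mlp(weights, x, out_bias=0.0):
    """Bias-free ReLU MLP (ReLU on all but the last layer), plus an optional output bias."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h + out_bias

def homogeneity_gap(weights, x, c=2.0, out_bias=0.0):
    """max |f(c*theta; x) - c^L f(theta; x)| with L = number of layers.

    All parameters, including the output bias, are scaled by c.
    """
    degree = len(weights)
    scaled = relu_mlp([c * W for W in weights], x, out_bias=c * out_bias)
    reference = c ** degree * relu_mlp(weights, x, out_bias=out_bias)
    return float(np.max(np.abs(scaled - reference)))

# Hypothetical two-layer example.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
W2 = np.array([[1.0, -1.0]])
x = np.array([1.0, 2.0])

print("gap without bias:", homogeneity_gap([W1, W2], x))
print("gap with output bias:", homogeneity_gap([W1, W2], x, out_bias=1.0))
```

A gap near zero is consistent with the homogeneity assumption for that input. A clearly nonzero gap, as with the output bias here, means the architecture falls outside the setting the summary describes.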

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

[1]

Implicit regularization in deep matrix factorization

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. InAdvances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019
[2]

On the implicit bias of initialization shape: Beyond infinitesimal mirror descent

Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake E Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 468–477. PMLR, 18–24...

work page 2021
[3]

Modular duality in deep learning

Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. InForty-second International Conference on Machine Learning, 2025

work page 2025
[4]

Implicit bias of gradient descent for non-homogeneous deep networks

Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, and Peter Bartlett. Implicit bias of gradient descent for non-homogeneous deep networks. InForty-second Interna- tional Conference on Machine Learning, 2025

work page 2025
[5]

More is less: inducing sparsity via overparameterization.Information and Inference, 12(3):1437–1460, 2023

Hung-Hsu Chou, Johannes Maly, and Holger Rauhut. More is less: inducing sparsity via overparameterization.Information and Inference, 12(3):1437–1460, 2023

work page 2023
[6]

Robust implicit regularization via weight normalization.Information and Inference: A Journal of the IMA, 13(3):iaae022, 09 2024

Hung-Hsu Chou, Holger Rauhut, and Rachel Ward. Robust implicit regularization via weight normalization.Information and Inference: A Journal of the IMA, 13(3):iaae022, 09 2024

work page 2024
[7]

Clarke.Optimization and Nonsmooth Analysis

Frank H. Clarke.Optimization and Nonsmooth Analysis. Wiley-Interscience, 1983

work page 1983
[8]

Kakade, and J

Damek Davis, Dmitriy Drusvyatskiy, Sham M. Kakade, and J. Lee. Stochastic subgradient method converges on tame functions.Foundations of Computational Mathematics, 20:119–154, 2018

work page 2018
[9]

Clémentine Carla Juliette Dominé, Nicolas Anguita, Alexandra Maria Proca, Lukas Braun, Daniel Kunin, Pedro A. M. Mediano, and Andrew M Saxe. From lazy to rich: Exact learning dynamics in deep linear networks. InUniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models, 2024

work page 2024
[10]

A. F. Filippov.Differential Equations with Discontinuous Right-Hand Sides. Springer Dordrecht, 1988

work page 1988
[11]

Sign-in to the lottery: Reparameterizing sparse training

Advait Gadhikar, Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Sign-in to the lottery: Reparameterizing sparse training. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

work page 2026
[12]

Masks, signs, and learning rate rewinding

Advait Harshal Gadhikar and Rebekka Burkholz. Masks, signs, and learning rate rewinding. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[13]

Implicit regularization in matrix factorization

Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017
[14]

Deep networks are reproducing kernel chains.ArXiv, abs/2501.03697, 2025

Tjeerd Jan Heeringa, Len Spek, and Christoph Brune. Deep networks are reproducing kernel chains.ArXiv, abs/2501.03697, 2025

work page arXiv 2025
[15]

Mask in the mirror: Implicit sparsification

Tom Jacobs and Rebekka Burkholz. Mask in the mirror: Implicit sparsification. InThe Thirteenth International Conference on Learning Representations, 2025. 10

work page 2025
[16]

Hyperbolic aware minimization: Implicit bias for sparsity

Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, and Rebekka Burkholz. Hyperbolic aware minimization: Implicit bias for sparsity. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[17]

Mirror, mirror of the flow: How does regularization shape implicit bias? InForty-second International Conference on Machine Learning, 2025

Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Mirror, mirror of the flow: How does regularization shape implicit bias? InForty-second International Conference on Machine Learning, 2025

work page 2025
[18]

Never saddle: Reparameterized steepest descent as mirror flow

Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Never saddle: Reparameterized steepest descent as mirror flow. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[19]

Kakade, and Michael I

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. InInternational Conference on Machine Learning, 2017

work page 2017
[20]

arXiv preprint arXiv:2410.14581 , year=

Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, and Navid Azizan. Optimizing attention with mirror descent: Generalized max-margin token selection.ArXiv, abs/2410.14581, 2024

work page arXiv 2024
[21]

Benign overfitting in leaky ReLU networks with moderate input dimension

Kedar Karhadkar, Erin George, Michael Murray, Guido Montúfar, and Deanna Needell. Benign overfitting in leaky ReLU networks with moderate input dimension. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[22]

Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors.Information and Computation, 132(1):1–63, 1997

work page 1997
[23]

Differentiable sparsity via d-gating: Simple and versatile structured penalization, 2025

Chris Kolb, Laetitia Frost, Bernd Bischl, and David Rügamer. Differentiable sparsity via d-gating: Simple and versatile structured penalization, 2025

work page 2025
[24]

Deep weight factorization: Sparse learning through the lens of artificial symmetries

Chris Kolb, Tobias Weber, Bernd Bischl, and David Rügamer. Deep weight factorization: Sparse learning through the lens of artificial symmetries. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[25]

Learning multiple layers of features from tiny images

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

work page 2009
[26]

Scalable optimization in the modular norm

Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, and Jeremy Bernstein. Scalable optimization in the modular norm.CoRR, abs/2405.14813, 2024

work page arXiv 2024
[27]

Nguyen, Chinmay Hegde, and Raymond K

Jiangyuan Li, Thanh V . Nguyen, Chinmay Hegde, and Raymond K. W. Wong. Implicit sparse regularization: The impact of depth and early stopping, 2021

work page 2021
[28]

Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning

Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. InInternational Conference on Learning Representations, 2021

work page 2021
[29]

Lee, and Sanjeev Arora

Zhiyuan Li, Tianhao Wang, Jason D. Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. InAdvances in Neural Information Processing Systems, 2022

work page 2022
[30]

Implicit bias of mirror flow for shallow neural networks in univariate regression

Shuang Liang and Guido Montúfar. Implicit bias of mirror flow for shallow neural networks in univariate regression. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[31]

Sparse training of neural networks based on multilevel mirror descent, 2026

Yannick Lunk, Sebastian James Scott, and Leon Bungert. Sparse training of neural networks based on multilevel mirror descent, 2026

work page 2026
[32]

Lee, and Wei Hu

Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, and Wei Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. InICLR, 2024

work page 2024
[33]

Gradient descent maximizes the margin of homogeneous neural networks

Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. InInternational Conference on Learning Representations, 2020

work page 2020
[34]

Negin Majidi, Ehsan Amid, Hossein Talebi, and Manfred K. Warmuth. Exponentiated gradient reweighting for robust training under label noise and beyond.ArXiv, abs/2104.01493, 2021. 11

work page arXiv 2021
[35]

Abide by the law and follow the flow: conservation laws for gradient flows

Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Abide by the law and follow the flow: conservation laws for gradient flows. InThirty-seventh Conference on Neural Information Processing Systems, 2023

work page 2023
[36]

Transformative or conservative? conser- vation laws for resnets and transformers

Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Transformative or conservative? conser- vation laws for resnets and transformers. InForty-second International Conference on Machine Learning, 2025

work page 2025
[37]

Keep the momentum: Conservation laws beyond Euclidean gradient flows, 2024

Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Keep the momentum: Conservation laws beyond Euclidean gradient flows, 2024

work page 2024
[38]

Deep linear networks for regression are implicitly regularized towards flat minima

Pierre Marion and Lénaïc Chizat. Deep linear networks for regression are implicitly regularized towards flat minima. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[39]

Minor first, major last: A depth-induced implicit bias of sharpness-aware minimization

Chaewon Moon, Dongkuk Si, and Chulhee Yun. Minor first, major last: A depth-induced implicit bias of sharpness-aware minimization. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[40]

Implicit bias of mirror flow on separable data

Scott Pesme, Radu-Alexandru Dragomir, and Nicolas Flammarion. Implicit bias of mirror flow on separable data. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[41]

Implicit bias of sgd for diag- onal linear networks: a provable benefit of stochasticity

Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diag- onal linear networks: a provable benefit of stochasticity. InAdvances in Neural Information Processing Systems, volume 34, pages 29218–29230. Curran Associates, Inc., 2021

work page 2021
[42]

Pedro H. P. Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? InAnnual Conference Computational Learning Theory, 2019

work page 2019
[43]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.CoRR, abs/1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[44]

The implicit bias of gradient descent on separable data, 2017

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data, 2017

work page 2017
[45]

A unified approach to controlling implicit regularization via mirror descent.ArXiv, abs/2306.13853, 2023

Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, and Navid Azizan. A unified approach to controlling implicit regularization via mirror descent.ArXiv, abs/2306.13853, 2023

work page arXiv 2023
[46]

Flavors of margin: Implicit bias of steepest descent in homogeneous neural networks

Nikolaos Tsilivis, Gal Vardi, and Julia Kempe. Flavors of margin: Implicit bias of steepest descent in homogeneous neural networks. InNeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024

work page 2024
[47]

Simplicity bias of two-layer networks beyond linearly separable data

Nikita Tsoy and Nikola Konstantinov. Simplicity bias of two-layer networks beyond linearly separable data. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 48728–48767. PMLR, 21–27 Jul 2024

work page 2024
[48]

Implicit regularization for optimal sparse recovery, 2019

Tomas Vaškeviˇcius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery, 2019

work page 2019
[49]

Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro

Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. InProceedings of Thirty Third Conference on Learning Theory, volume 125 ofProceedings of Machine Learning Research, pages 3635–3673. PMLR, 09–12 Jul 2020

work page 2020
[50]

Implicit regularization in matrix sensing via mirror descent

Fan Wu and Patrick Rebeschini. Implicit regularization in matrix sensing via mirror descent. In Neural Information Processing Systems, 2021

work page 2021
[51]

Implicit bias of gradient descent for logistic regression at the edge of stability

Jingfeng Wu, Vladimir Braverman, and Jason D Lee. Implicit bias of gradient descent for logistic regression at the edge of stability. InAdvances in Neural Information Processing Systems, volume 36, pages 74229–74256. Curran Associates, Inc., 2023. 12

work page 2023
[52]

Implicit bias of AdamW: ℓ∞-norm constrained optimization

Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: ℓ∞-norm constrained optimization. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[53]

Greg Yang and Edward J. Hu. Tensor programs IV: Feature learning in infinite-width neural networks. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 11727–11737. PMLR, 18–24 Jul 2021

work page 2021
[54]

The implicit bias of adam on separable data

Chenyang Zhang, Difan Zou, and Yuan Cao. The implicit bias of adam on separable data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 13 Appendix Contents A Extended related work 15 B Proof of Lemma 3.1 16 C Verification of Assumption 4.4 17 D Proof of Theorem 4.5 18 E Bounds on the normalized dual iterates 20 F Proo...

work page 2024
[55]

Therefore the max-margin solution has additional constraint ¯u2 = ¯v2

= (u2 in −v 2 in)/(∥ut, vt∥2 2)→0. Therefore the max-margin solution has additional constraint ¯u2 = ¯v2. Changing the objective from 1 2 ∥¯u,¯v∥2 2 =∥ ¯θ∥1 where ¯θ:= ¯u⊙¯v. By the same argument we have that there exist a positive constant b >0 such that ∥ut, vt∥2 2/∥θt∥1 →b , thus ¯θ≃ θ ∥θ∥1 , which concludes the result. This implies that the normalized...

work page

[1] [1]

Implicit regularization in deep matrix factorization

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. InAdvances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019

work page 2019

[2] Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake E Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 468–477. PMLR, 18–24...

[3] Jeremy Bernstein and Laker Newhouse. Modular duality in deep learning. In Forty-second International Conference on Machine Learning, 2025.

[4] Yuhang Cai, Kangjie Zhou, Jingfeng Wu, Song Mei, Michael Lindsey, and Peter Bartlett. Implicit bias of gradient descent for non-homogeneous deep networks. In Forty-second International Conference on Machine Learning, 2025.

[5] Hung-Hsu Chou, Johannes Maly, and Holger Rauhut. More is less: inducing sparsity via overparameterization. Information and Inference, 12(3):1437–1460, 2023.

[6] Hung-Hsu Chou, Holger Rauhut, and Rachel Ward. Robust implicit regularization via weight normalization. Information and Inference: A Journal of the IMA, 13(3):iaae022, 09 2024.

[7] Frank H. Clarke. Optimization and Nonsmooth Analysis. Wiley-Interscience, 1983.

[8] Damek Davis, Dmitriy Drusvyatskiy, Sham M. Kakade, and J. Lee. Stochastic subgradient method converges on tame functions. Foundations of Computational Mathematics, 20:119–154, 2018.

[9] Clémentine Carla Juliette Dominé, Nicolas Anguita, Alexandra Maria Proca, Lukas Braun, Daniel Kunin, Pedro A. M. Mediano, and Andrew M Saxe. From lazy to rich: Exact learning dynamics in deep linear networks. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models, 2024.

[10] A. F. Filippov. Differential Equations with Discontinuous Right-Hand Sides. Springer Dordrecht, 1988.

[11] Advait Gadhikar, Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Sign-in to the lottery: Reparameterizing sparse training. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[12] Advait Harshal Gadhikar and Rebekka Burkholz. Masks, signs, and learning rate rewinding. In The Twelfth International Conference on Learning Representations, 2024.

[13] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.

[14] Tjeerd Jan Heeringa, Len Spek, and Christoph Brune. Deep networks are reproducing kernel chains. ArXiv, abs/2501.03697, 2025.

[15] Tom Jacobs and Rebekka Burkholz. Mask in the mirror: Implicit sparsification. In The Thirteenth International Conference on Learning Representations, 2025.

[16] Tom Jacobs, Advait Gadhikar, Celia Rubio-Madrigal, and Rebekka Burkholz. Hyperbolic aware minimization: Implicit bias for sparsity. In The Fourteenth International Conference on Learning Representations, 2026.

[17] Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Mirror, mirror of the flow: How does regularization shape implicit bias? In Forty-second International Conference on Machine Learning, 2025.

[18] Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Never saddle: Reparameterized steepest descent as mirror flow. In The Fourteenth International Conference on Learning Representations, 2026.

[19] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, 2017.

[20] Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, and Navid Azizan. Optimizing attention with mirror descent: Generalized max-margin token selection. ArXiv, abs/2410.14581, 2024.

[21] Kedar Karhadkar, Erin George, Michael Murray, Guido Montúfar, and Deanna Needell. Benign overfitting in leaky ReLU networks with moderate input dimension. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[22] Jyrki Kivinen and Manfred K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[23] Chris Kolb, Laetitia Frost, Bernd Bischl, and David Rügamer. Differentiable sparsity via d-gating: Simple and versatile structured penalization, 2025.

[24] Chris Kolb, Tobias Weber, Bernd Bischl, and David Rügamer. Deep weight factorization: Sparse learning through the lens of artificial symmetries. In The Thirteenth International Conference on Learning Representations, 2025.

[25] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[26] Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, and Jeremy Bernstein. Scalable optimization in the modular norm. CoRR, abs/2405.14813, 2024.

[27] Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, and Raymond K. W. Wong. Implicit sparse regularization: The impact of depth and early stopping, 2021.

[28] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2021.

[29] Zhiyuan Li, Tianhao Wang, Jason D. Lee, and Sanjeev Arora. Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent. In Advances in Neural Information Processing Systems, 2022.

[30] Shuang Liang and Guido Montúfar. Implicit bias of mirror flow for shallow neural networks in univariate regression. In The Thirteenth International Conference on Learning Representations, 2025.

[31] Yannick Lunk, Sebastian James Scott, and Leon Bungert. Sparse training of neural networks based on multilevel mirror descent, 2026.

[32] Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon Shaolei Du, Jason D. Lee, and Wei Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. In ICLR, 2024.

[33] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2020.

[34] Negin Majidi, Ehsan Amid, Hossein Talebi, and Manfred K. Warmuth. Exponentiated gradient reweighting for robust training under label noise and beyond. ArXiv, abs/2104.01493, 2021.

[35] Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Abide by the law and follow the flow: conservation laws for gradient flows. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[36] Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Transformative or conservative? conservation laws for resnets and transformers. In Forty-second International Conference on Machine Learning, 2025.

[37] Sibylle Marcotte, Rémi Gribonval, and Gabriel Peyré. Keep the momentum: Conservation laws beyond Euclidean gradient flows, 2024.

[38] Pierre Marion and Lénaïc Chizat. Deep linear networks for regression are implicitly regularized towards flat minima. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[39] Chaewon Moon, Dongkuk Si, and Chulhee Yun. Minor first, major last: A depth-induced implicit bias of sharpness-aware minimization. In The Fourteenth International Conference on Learning Representations, 2026.

[40] Scott Pesme, Radu-Alexandru Dragomir, and Nicolas Flammarion. Implicit bias of mirror flow on separable data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[41] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. In Advances in Neural Information Processing Systems, volume 34, pages 29218–29230. Curran Associates, Inc., 2021.

[42] Pedro H. P. Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Annual Conference Computational Learning Theory, 2019.

[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[44] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data, 2017.

[45] Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, and Navid Azizan. A unified approach to controlling implicit regularization via mirror descent. ArXiv, abs/2306.13853, 2023.

[46] Nikolaos Tsilivis, Gal Vardi, and Julia Kempe. Flavors of margin: Implicit bias of steepest descent in homogeneous neural networks. In NeurIPS 2024 Workshop on Mathematics of Modern Machine Learning, 2024.

[47] Nikita Tsoy and Nikola Konstantinov. Simplicity bias of two-layer networks beyond linearly separable data. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 48728–48767. PMLR, 21–27 Jul 2024.

[48] Tomas Vaškevičius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery, 2019.

[49] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3635–3673. PMLR, 09–12 Jul 2020.

[50] Fan Wu and Patrick Rebeschini. Implicit regularization in matrix sensing via mirror descent. In Neural Information Processing Systems, 2021.

[51] Jingfeng Wu, Vladimir Braverman, and Jason D Lee. Implicit bias of gradient descent for logistic regression at the edge of stability. In Advances in Neural Information Processing Systems, volume 36, pages 74229–74256. Curran Associates, Inc., 2023.
