arxiv: 2605.19458 · v1 · pith:F5FUJCWZnew · submitted 2026-05-19 · 💻 cs.LG

Implicit Bias of Mirror Flow in Homogeneous Neural Networks: Sparse and Dense Feature Learning

Pith reviewed 2026-05-20 07:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords mirror flow · implicit bias · homogeneous neural networks · max-margin solutions · feature learning · sparse representations · convex duality · optimization dynamics

The pith

Mirror flow reaches the same max-margin solution for different mirror maps but induces representations ranging from sparse to dense neuron activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the implicit bias of mirror flow in deep neural networks that use homogeneous activation functions. Extending results from gradient flow, the authors derive a balance equation from convex duality that characterizes a horizon function controlling the limiting margin. They show that distinct non-homogeneous mirror maps can converge to the identical max-margin classifier while producing very different internal representations, with neuron activations varying from sparse to dense. The work also gives convergence rates and norm growth estimates, and verifies the predictions on synthetic datasets and standard vision tasks. This offers a unified view on how the choice of mirror map affects both the speed of optimization and the sparsity of learned features.

Core claim

Mirror flow in homogeneous networks satisfies a novel balance equation derived from convex duality. This equation characterizes the horizon function that governs the induced margin. Distinct non-homogeneous mirror maps can induce the same max-margin solution yet produce markedly different representations, ranging from sparse to dense neuron activations.
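To make the role of the mirror map concrete, here is a minimal sketch of mirror descent, read as a discretization of mirror flow, next to plain gradient descent. It is not the paper's setup: the model is a toy linear classifier rather than a homogeneous network, the data set `DATA`, the step size, and the parameter `beta` are hypothetical, and the hyperbolic entropy is taken in its commonly used form with gradient arcsinh(w / beta), which may differ from the paper's parameterization. In this linear toy the two methods end up pointing in different directions; it only illustrates how the dual-space update concentrates weight on fewer coordinates.

```python
import math

# Hypothetical separable toy data: (features, label in {-1, +1}).
DATA = [([1.0, 0.2], 1), ([-1.0, -0.2], -1)]

def grad_exp_loss(w, data):
    """Gradient of the exponential loss sum_i exp(-y_i * <w, x_i>)."""
    g = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wk * xk for wk, xk in zip(w, x))
        coeff = -y * math.exp(-margin)
        for k, xk in enumerate(x):
            g[k] += coeff * xk
    return g

def gradient_descent(data, steps=2000, lr=0.05):
    """Plain gradient descent, i.e. mirror descent with phi(w) = ||w||^2 / 2."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        g = grad_exp_loss(w, data)
        w = [wk - lr * gk for wk, gk in zip(w, g)]
    return w

def mirror_descent(data, steps=2000, lr=0.05, beta=0.1):
    """Mirror descent with an assumed hyperbolic entropy potential.

    The update runs on the dual variable z = grad phi(w) = arcsinh(w / beta);
    the primal weights are recovered as w = beta * sinh(z).
    """
    z = [0.0] * len(data[0][0])
    for _ in range(steps):
        w = [beta * math.sinh(zk) for zk in z]
        g = grad_exp_loss(w, data)
        z = [zk - lr * gk for zk, gk in zip(z, g)]
    return [beta * math.sinh(zk) for zk in z]

w_gd = gradient_descent(DATA)
w_md = mirror_descent(DATA)

# Relative size of the weaker coordinate: smaller means a sparser direction.
ratio_gd = abs(w_gd[1]) / abs(w_gd[0])
ratio_md = abs(w_md[1]) / abs(w_md[0])
print(f"GD ratio: {ratio_gd:.4f}, mirror descent ratio: {ratio_md:.4f}")
```

Both runs separate the toy data, but the mirror descent iterate puts relatively less mass on the second coordinate, because sinh amplifies the larger dual coordinate more strongly.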

What carries the argument

The balance equation for mirror flow obtained from convex duality, which defines the horizon function that determines the limiting margin and distinguishes the effect of different mirror maps on representations.
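For orientation, the standard objects behind this statement can be written out. The paper's own balance equation and horizon function are not reproduced on this page, so the block below gives only the textbook definition of mirror flow and the classical layer-wise balance law for gradient flow that the paper says it extends. The symbols \theta_t (parameters), L (training loss), \phi (mirror potential), and W_\ell (weights of layer \ell) are notation assumed here.

```latex
% Mirror flow with potential \phi (assumed notation):
\[
  \frac{d}{dt}\,\nabla\phi(\theta_t) \;=\; -\nabla L(\theta_t),
\]
% which reduces to gradient flow when \phi(\theta) = \tfrac12\|\theta\|_2^2.

% Classical layer-wise balance under gradient flow for networks with
% positively homogeneous activations (e.g. ReLU or linear):
\[
  \frac{d}{dt}\Bigl(\|W_{\ell+1}(t)\|_F^2 - \|W_{\ell}(t)\|_F^2\Bigr) \;=\; 0 .
\]
```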

If this is right

  • Distinct non-homogeneous mirror maps can induce the same max-margin solution.
  • Convergence of mirror flow can be extremely slow, including exponentially slow regimes.
  • All considered mirror maps exhibit feature learning, yet they produce representations ranging from sparse to dense neuron activations.
  • Mirror maps shape both the optimization dynamics and the geometry of the learned classifiers.
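One simple way to quantify "sparse" versus "dense" neuron usage in a two-layer network f(x) = sum_j a_j relu(<w_j, x>) is to count neurons whose scale |a_j| * ||w_j|| is non-negligible. The sketch below does this under stated assumptions: the relative threshold and the example weights are hypothetical, and the paper may use a different criterion for an "active" neuron.

```python
import math

def neuron_scales(a, W):
    """Per-neuron scale |a_j| * ||w_j||_2 for a two-layer network."""
    return [abs(aj) * math.sqrt(sum(v * v for v in wj)) for aj, wj in zip(a, W)]

def active_fraction(a, W, rel_threshold=1e-2):
    """Fraction of neurons whose scale exceeds rel_threshold times the largest."""
    scales = neuron_scales(a, W)
    top = max(scales)
    if top == 0.0:
        return 0.0
    return sum(s > rel_threshold * top for s in scales) / len(scales)

# Hypothetical weights: one dominant neuron versus evenly used neurons.
sparse_a, sparse_W = [1.0, 0.001, 0.002], [[2.0, 0.0], [0.1, 0.1], [0.0, 0.2]]
dense_a, dense_W = [1.0, 0.9, 1.1], [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

sparse_frac = active_fraction(sparse_a, sparse_W)
dense_frac = active_fraction(dense_a, dense_W)
print(sparse_frac, dense_frac)
```

A relative threshold is used so that the count does not change when all weights are rescaled by a common factor, which matters because norms grow without bound along max-margin trajectories.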

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The choice of mirror map could be used to control the sparsity or density of learned features without altering the final decision boundary.
  • Differences in representation sparsity may influence generalization, robustness, or transfer performance in downstream tasks.
  • Similar balance equations and horizon functions might be derived for other first-order methods or for networks that are not strictly homogeneous.

Load-bearing premise

Mirror flow dynamics in homogeneous networks converge to a max-margin solution as time tends to infinity.

What would settle it

A simulation or calculation in which mirror flow on a homogeneous network fails to approach the predicted max-margin solution for a given mirror map, or in which two different mirror maps produce identical neuron activation patterns.
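A check of this kind needs a margin that can be compared across training runs. The sketch below computes the L2-normalized margin used in the gradient-flow literature for a bias-free two-layer ReLU network, which is homogeneous of degree 2. This is an assumption made for illustration: the paper works with a margin defined through its horizon function, which is not reproduced here, and the weights and data are hypothetical.

```python
def relu(t):
    return t if t > 0.0 else 0.0

def two_layer(a, W, x):
    """Bias-free two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>)."""
    return sum(
        aj * relu(sum(wk * xk for wk, xk in zip(wj, x)))
        for aj, wj in zip(a, W)
    )

def normalized_margin(a, W, data, order=2):
    """min_i y_i f(x_i) / ||theta||_2^order, where order is the homogeneity degree."""
    sq_norm = sum(v * v for v in a) + sum(v * v for wj in W for v in wj)
    raw_margin = min(y * two_layer(a, W, x) for x, y in data)
    return raw_margin / sq_norm ** (order / 2)

# Hypothetical data and weights.
data = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)]
a, W = [1.0, -1.0], [[1.0, 0.0], [-1.0, 0.0]]

gamma = normalized_margin(a, W, data)

# Rescaling every parameter by c leaves the normalized margin unchanged.
c = 3.0
gamma_scaled = normalized_margin(
    [c * v for v in a], [[c * v for v in wj] for wj in W], data
)
print(gamma, gamma_scaled)
```

Tracking such a quantity over training time for two mirror maps, alongside a representation statistic, is one way to test whether the classifiers agree while the internal features differ.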

Figures

Figures reproduced from arXiv: 2605.19458 by Guido Montufar, Tom Jacobs.

Figure 1. This provides an analog of the layer-wise norm balance equation well known for gradient …
Figure 2. Different types of feature learning under mirror flows. The hyperbolic entropy induces sparse feature learning with fewer active neurons, whereas the smoothed homogeneous potential (p = 3) induces dense feature learning with more active neurons. (Left) Input weight representations w̃ = |a_j| w_j of a two-layer student network in a student-teacher setup: GD (orange) produces diffuse weights near the origin, p…
Figure 3. Training with smoothened homogeneous potential (p = 3). This shows that increasing λ > 0 slows convergence toward the max-margin solution, so that for finite training time the learned representation remains closer to that of gradient descent. Reaching the margin. The results presented so far characterize the implicit bias of mirror flow through the introduction of a Q-margin and its maximization in th…
Figure 4. Weight distributions for the first layer of a VGG-16 and weight pruning. (Left) We show …
Figure 5. Learned representation by mirror descent with hyperbolic entropy for various …
Figure 6. Learned representation by mirror descent with smoothened homogeneous (p …
Figure 7. Training ten times longer does not change the representation much when …
Figure 8. Weight magnitude distribution of the first layer after time rescaling for all mirror maps.
Figure 9. Weight magnitude distribution of the second layer after time rescaling for all mirror maps.
Figure 10. Weight magnitude distribution of the third layer after time rescaling for all mirror maps.
Figure 11. Illustration of the decision boundary and function activation value reached for the …
Figure 12. Learned representation in the input layer …
Figure 13. Learned representation in the input layer …
Figure 14. The weight magnitude distribution of a VGG16’s last layer. We observe that all algorithms …

Original abstract

We study the max-margin solutions reached by mirror flow in deep neural networks with homogeneous activation functions. Extending classical results on gradient flow, we derive a novel balance equation for mirror flow from convex duality, enabling a characterization of the horizon function governing the induced margin. We further establish max-margin characterizations together with convergence rates and norm growth estimates. Finally, we support our theory through experiments on synthetic datasets and standard vision tasks. Concretely, we show that: (1) distinct non-homogeneous mirror maps can induce the same max-margin solution; (2) convergence can be extremely slow, including exponentially slow regimes; and (3) although all considered mirror maps exhibit feature learning, they can produce markedly different representations, ranging from sparse to dense neuron activations. Together, these results provide a unified perspective on sparse and dense feature learning in homogeneous neural networks, highlighting how mirror maps shape both optimization dynamics and the geometry of the learned classifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper studies the implicit bias of mirror flow on homogeneous neural networks. It derives a balance equation from convex duality to characterize the horizon function and induced margin, establishes max-margin characterizations along with convergence rates and norm growth estimates, and shows via experiments on synthetic data and vision tasks that distinct non-homogeneous mirror maps can reach the same max-margin solution while producing markedly different sparse-to-dense neuron activations and representations.

Significance. If the derivations and convergence results hold, the work extends gradient-flow implicit-bias theory to mirror flow and supplies a unified account of how mirror-map choice shapes both dynamics and the geometry of learned classifiers. The experimental demonstration that the same max-margin classifier can arise with qualitatively different feature sparsity is a concrete contribution to understanding representation learning under different optimizers.

major comments (1)
  1. [Abstract and convergence-rate section] The balance equation and horizon-function characterization are derived under the premise that mirror flow converges to a max-margin solution as t→∞. The manuscript states that max-margin characterizations and convergence rates are established, yet supplies no explicit conditions (strict convexity of the mirror map, depth restrictions, initialization, or loss assumptions) guaranteeing that the limit is attained and that the rates are positive. If convergence is only logarithmic or fails in some regimes, the duality-based balance equation does not govern the observed trajectory, weakening the claim that distinct mirror maps produce different sparse/dense representations while sharing the same max-margin classifier.
minor comments (2)
  1. [Theory sections] Notation for the mirror map and its Legendre dual should be introduced once and used consistently; several passages switch between φ and its conjugate without explicit reminder.
  2. [Experiments] Experimental details (learning-rate schedules, initialization variance, exact synthetic data generation) are only sketched; a short reproducibility paragraph or supplementary table would help.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and valuable feedback. The major comment raises an important point about the assumptions underlying our convergence claims, which we address below by clarifying the scope of our results and committing to revisions that make the conditions explicit.

Point-by-point responses
  1. Referee: [Abstract and convergence-rate section] The balance equation and horizon-function characterization are derived under the premise that mirror flow converges to a max-margin solution as t→∞. The manuscript states that max-margin characterizations and convergence rates are established, yet supplies no explicit conditions (strict convexity of the mirror map, depth restrictions, initialization, or loss assumptions) guaranteeing that the limit is attained and that the rates are positive. If convergence is only logarithmic or fails in some regimes, the duality-based balance equation does not govern the observed trajectory, weakening the claim that distinct mirror maps produce different sparse/dense representations while sharing the same max-margin classifier.

    Authors: We thank the referee for this precise observation. Our derivations of the balance equation and horizon-function characterization are indeed performed under the assumption that mirror flow converges to a max-margin solution as t tends to infinity, following the standard methodology in the implicit-bias literature. The manuscript already notes that convergence can be exponentially slow in some regimes and provides supporting experiments. To strengthen the presentation, we will revise the abstract, introduction, and convergence-rate section to state explicit sufficient conditions for convergence to the max-margin limit (including strict convexity of the mirror map, suitable initialization, homogeneity of the network, and standard assumptions on the loss). We will also clarify that the duality-based characterization governs the asymptotic behavior precisely when these conditions hold and that the reported rates are positive under the same assumptions. These changes will ensure the claims about sparse versus dense representations under different mirror maps are rigorously tied to the regimes where the limit is attained. revision: yes

Circularity Check

0 steps flagged

No circularity: balance equation derived from external convex duality

full rationale

The paper derives the balance equation and horizon function characterization explicitly from convex duality applied to mirror flow, extending classical gradient flow results without reducing any claimed prediction or first-principles result to a quantity defined by the paper's own fitted parameters or self-referential inputs. The max-margin convergence assumption is stated as an extension of prior work rather than proven here, but this does not create a definitional loop or fitted-input-called-prediction pattern in the derivation chain. No self-citation is load-bearing for the core balance equation, and the sparse/dense representation distinctions follow from the distinct mirror maps under the shared limit, which remains independently characterizable. The analysis is self-contained against external mathematical tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about homogeneous activations and the applicability of convex duality to the continuous-time dynamics; no free parameters or invented entities are indicated in the abstract.

axioms (2)
  • Domain assumption: Neural network activations are homogeneous.
    This property is required to extend classical gradient flow results and derive the balance equation for mirror flow.
  • Domain assumption: Convex duality applies directly to mirror flow trajectories.
    Invoked to obtain the novel balance equation and horizon function characterization.
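The homogeneity assumption can be checked numerically for a given architecture. The sketch below, with hypothetical weights and input, verifies that a bias-free ReLU network with three weight layers satisfies f(c·θ; x) = c^3 · f(θ; x) for positive c. Adding bias terms or non-homogeneous activations would generally break this identity.

```python
import math

def forward(layers, x):
    """Bias-free ReLU network; ReLU follows every layer except the last."""
    h = x
    for idx, W in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) for row in W]
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in h]
    return h[0]

def scale_params(layers, c):
    """Multiply every weight in every layer by c."""
    return [[[c * w for w in row] for row in W] for W in layers]

# Hypothetical weights: three layers mapping 2 -> 2 -> 2 -> 1.
layers = [
    [[1.0, -2.0], [0.5, 1.0]],
    [[1.0, 1.0], [-1.0, 2.0]],
    [[0.5, -1.0]],
]
x = [1.0, 0.25]

base = forward(layers, x)
depth = len(layers)

# Degree-`depth` homogeneity: scaling all weights by c scales the output by c**depth.
checks = [
    math.isclose(forward(scale_params(layers, c), x), c ** depth * base)
    for c in (0.5, 2.0, 10.0)
]
print(base, checks)
```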

pith-pipeline@v0.9.0 · 5687 in / 1382 out tokens · 48267 ms · 2026-05-20T07:12:27.169122+00:00 · methodology

