Understanding and inverse design of implicit bias in stochastic learning: a geometric perspective

Alberto d'Onofrio; Alessio Ansuini; Emanuele Ballarin; Fabio Anselmi; Matteo Biagetti; Nicola Aladrah

arxiv: 2601.06597 · v2 · submitted 2026-01-10 · 💻 cs.LG · stat.ML

Nicola Aladrah, Emanuele Ballarin, Matteo Biagetti, Alessio Ansuini, Alberto d'Onofrio, Fabio Anselmi

Pith reviewed 2026-05-16 15:03 UTC · model grok-4.3

keywords: implicit bias, stochastic gradient descent, geometric correction, continuous symmetries, inverse design, loss landscape, overparameterized models, sparsity

The pith

Implicit bias in stochastic learning arises as a geometric correction from gradient noise interacting with continuous loss symmetries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework showing that implicit bias emerges when stochastic gradient noise interacts with the continuous symmetries of the loss function, producing a predictable geometric shift among solutions that share the same loss value. This mechanism unifies previous observations across models and turns bias from an unexplained side effect into a controllable geometric feature. If the account holds, it becomes possible to engineer parameterizations that preserve the predictor while deliberately steering the bias, for instance toward sparse or spectrally sparse solutions. A reader would care because learned representations determine how models generalize, interpret data, and remain robust, and this view supplies a direct handle on those representations through the training dynamics themselves.

Core claim

Implicit bias is induced as a geometric correction by the interplay between gradient noise and continuous symmetries of the loss. The authors compute this correction for a range of architectures, use it to predict new behaviors and recover known ones, and demonstrate inverse design by constructing predictor-preserving parameterizations that shape the bias, with sparsity and spectral sparsity arising as canonical outcomes. Numerical experiments confirm the predicted corrections and the effectiveness of the inverse-design procedure in controlled settings.

What carries the argument

The geometric correction induced by the interplay between gradient noise and continuous symmetries of the loss; it selects among equivalent-loss solutions by shifting the effective optimization trajectory.
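
The selection mechanism can be seen in the smallest possible case, the factorization θ = u · v shown in Figure 1. The sketch below is illustrative and not taken from the paper: it assumes a squared loss, Gaussian label noise as the source of gradient noise, and hypothetical values for the step size, noise level, and starting point. Every point with u · v = 1 has the same expected loss, yet the noisy updates drift along that level set toward the balanced point u = v.

```python
import numpy as np

def sgd_factorized(u0, v0, theta_star, lr, noise_std, steps, seed=0):
    """SGD on 0.5 * (u*v - y)^2, with a fresh noisy target y at every step.

    Assumption for illustration: gradient noise comes only from label noise.
    """
    rng = np.random.default_rng(seed)
    u, v = float(u0), float(v0)
    for _ in range(steps):
        y = theta_star + noise_std * rng.standard_normal()
        r = u * v - y                               # noisy residual
        u, v = u - lr * r * v, v - lr * r * u       # simultaneous update
    return u, v

# Start on the solution set (u*v = 1) but far from the balanced point u = v.
u0, v0 = 2.0, 0.5
u, v = sgd_factorized(u0, v0, theta_star=1.0, lr=0.05, noise_std=0.5, steps=20000)

print("initial imbalance u^2 - v^2:", u0**2 - v0**2)
print("final   imbalance u^2 - v^2:", u**2 - v**2)
print("final product u*v:", u * v)
```

In this toy model the imbalance u² − v² is multiplied by (1 − lr² r²) at every step, where r is the noisy residual, so it is conserved in the noiseless small-step limit and contracts when noise keeps r away from zero. The sketch does not establish whether this matches the paper's general correction term.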

If this is right

The induced bias can be calculated explicitly for multiple standard architectures.
Previously observed implicit-bias phenomena receive a single geometric explanation.
New bias behaviors can be predicted before training begins.
Predictor-preserving reparameterizations can be designed to steer the bias toward sparsity or spectral sparsity.
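
To make "predictor-preserving reparameterization" concrete, the following sketch rewrites a linear predictor w as an elementwise (Hadamard) product w = u * v. It is a minimal illustration under assumed, randomly generated data (`X`, `y`) and hypothetical helper names, not the paper's construction. It checks three facts: rescaling (u, v) to (λu, v/λ) leaves the loss unchanged, the gradient has no component along that rescaling direction, and the gradient itself still depends on where on the orbit the parameters sit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))   # hypothetical design matrix
y = rng.standard_normal(8)        # hypothetical targets

def loss_direct(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def loss_hadamard(u, v):
    # Same predictor, written as w = u * v (elementwise product).
    return loss_direct(u * v)

def grads_hadamard(u, v):
    g_w = X.T @ (X @ (u * v) - y) / len(y)   # gradient with respect to w
    return g_w * v, g_w * u                  # chain rule for u and v

u = rng.uniform(0.5, 1.5, size=5)
v = rng.uniform(0.5, 1.5, size=5)
lam = rng.uniform(0.5, 2.0, size=5)          # one positive scale per coordinate

# 1) The loss is constant along the orbit (lam * u, v / lam).
same_loss = np.isclose(loss_hadamard(u, v), loss_hadamard(lam * u, v / lam))

# 2) The gradient is orthogonal to the orbit direction (u, -v).
gu, gv = grads_hadamard(u, v)
orbit_component = gu * u - gv * v

# 3) The gradient differs between two points on the same orbit.
gu2, gv2 = grads_hadamard(lam * u, v / lam)
same_grads = np.allclose(gu, gu2) and np.allclose(gv, gv2)

print(same_loss, np.allclose(orbit_component, 0.0), same_grads)
```

Because the loss cannot distinguish points on an orbit while the update rule can, the choice of parameterization leaves the predictor class untouched but changes which equivalent solution the training dynamics favor.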

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same noise-symmetry mechanism may extend to discrete symmetries or to non-gradient optimizers if the effective noise structure can be characterized.
Engineering symmetries into the loss could become a systematic route to built-in regularization without changing the data or the predictor.
The framework suggests checking whether the magnitude of the correction scales with batch size or learning-rate schedule in the way the geometric term predicts.
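
A scaling check of the kind suggested in the last item can be sketched on the scalar factorization θ = u · v. Everything here is an assumption made for illustration: the squared loss, Gaussian label noise averaged over a minibatch, and the specific learning rates and batch sizes. The function name `log_imbalance_decay` is hypothetical. It measures how much log |u² − v²| drops over a fixed number of steps.

```python
import numpy as np

def log_imbalance_decay(lr, batch_size, steps=5000, noise_std=0.5, seed=0):
    """Return log(initial imbalance / final imbalance) for the toy model u*v."""
    rng = np.random.default_rng(seed)
    u, v, theta_star = 2.0, 0.5, 1.0
    start = abs(u * u - v * v)
    for _ in range(steps):
        # Averaging over a minibatch divides the label-noise variance by batch_size.
        y = theta_star + noise_std * rng.standard_normal(batch_size).mean()
        r = u * v - y
        u, v = u - lr * r * v, v - lr * r * u
    return np.log(start / abs(u * u - v * v))

decay_small_lr = log_imbalance_decay(lr=0.02, batch_size=1)
decay_large_lr = log_imbalance_decay(lr=0.04, batch_size=1)
decay_large_batch = log_imbalance_decay(lr=0.04, batch_size=4)

print("lr=0.02, B=1:", decay_small_lr)
print("lr=0.04, B=1:", decay_large_lr)
print("lr=0.04, B=4:", decay_large_batch)
```

In this toy setting the drift toward balance speeds up with a larger learning rate and slows down with a larger batch. Comparing such measured rates against the rate implied by the paper's geometric term would be the actual test, and the sketch does not attempt it.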

Load-bearing premise

Stochastic gradient noise interacts with continuous symmetries of the loss to produce a predictable and computable geometric correction.

What would settle it

A controlled experiment on a loss with known continuous symmetries where the measured implicit bias deviates systematically from the geometric correction computed by the framework under the observed noise statistics.

Figures

Figures reproduced from arXiv: 2601.06597 by Alberto d'Onofrio, Alessio Ansuini, Emanuele Ballarin, Fabio Anselmi, Matteo Biagetti, Nicola Aladrah.

**Figure 1.** Hyperbolic level sets of equivalent parametrizations and symmetry-breaking. a, Hyperbolic level sets u · v = θ in the positive (u, v)-plane. Each branch represents all parameter pairs (u, v) that produce the same predictor θ, making the symmetry of the factorized parameterization explicit. b, The diagonal line u = v defines symmetry-breaking that intersects each orbit once in the positive plane, selecti…

**Figure 2.** Implicit norm equilibration in shallow ReLU networks. A student model ŷ = v⊤ ReLU(Wx) with learnable parameters v and W is trained via SGD on the mean square error loss to replicate the behavior of a teacher oracle y⋆ = v⋆⊤ ReLU(W⋆x) on a regression task. The entries of v⋆ and W⋆ are randomly sampled before training, ensuring that the …

**Figure 3.** Query–key norm equilibration in single-head scaled dot-product attention. A student model implementing single-head scaled dot-product attention, i.e. Y = softmax(XQ(XK)⊤/√r_k) XV, with learnable key (K), query (Q) and value (V) matrices is trained by SGD on the mean square error loss to replicate the behavior of a teacher oracle with the same structure (and matrices respectively K⋆, Q⋆, V⋆) on a…

**Figure 4.** Implicit low-rank recovery in matrix completion. A rank-2 ground-truth matrix T⋆ ∈ ℝ^{20×20} with well-separated singular values is to be recovered from just the 20% of its entries via a factorized model T̂(U, V) = UV⊤ with U ∈ ℝ^{20×20}, V ∈ ℝ^{20×20}. Training performed using SGD on the mean square error loss over the observed entries. Panel a, tracks the estimated singular values σ_i(UV⊤) along training…

**Figure 5.** Sparse spectral recovery via Hadamard-factored parameterization. Two models are compared in the reconstruction of a spectrally sparse signal from a limited number of noise-corrupted observations, under the drive of SGD on the mean square error loss. A signal y⋆ = Σ_{k=0}^{D−1} w⋆_k cos(2πkt) is considered, with amplitudes w⋆ = [w⋆_k] being sparse in the frequency domain (3 nonzero entries with k ≥ 1, amo…

**Figure 6.** Recovery of a piecewise-constant signal from noisy compressed measurements. Two models are compared in the reconstruction of a piecewise-constant signal of length N = 200 from m = 60 noisy compressed measurements y = Ax⋆ + ε, with A ∈ ℝ^{m×N} a random Gaussian measurement matrix and ε additive Gaussian noise, under the drive of SGD on the mean square error loss ‖Ax̂ − y‖_2^2. The baseline model directly l…
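
The models in Figures 2 and 3 each carry a continuous symmetry that can be verified numerically. The sketch below is an illustration with hypothetical shapes and random weights, not code from the paper. For the shallow ReLU network it rescales each hidden unit by a positive factor. For attention it mixes queries and keys with an invertible matrix `G`, which is an assumed, more general version of the query–key rescaling, recovered when `G` is a multiple of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Shallow ReLU network: y_hat = v^T ReLU(W x)
W = rng.standard_normal((4, 3))
v = rng.standard_normal(4)
x = rng.standard_normal(3)
lam = rng.uniform(0.5, 2.0, size=4)              # one positive scale per hidden unit
y_ref = v @ relu(W @ x)
y_rescaled = (v / lam) @ relu((lam[:, None] * W) @ x)

# Single-head scaled dot-product attention
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)        # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Q, K, V):
    r_k = K.shape[1]
    return softmax((X @ Q) @ (X @ K).T / np.sqrt(r_k)) @ (X @ V)

X = rng.standard_normal((6, 5))
Q = rng.standard_normal((5, 2))
K = rng.standard_normal((5, 2))
V = rng.standard_normal((5, 5))
G = np.array([[2.0, 1.0], [0.0, 0.5]])           # invertible (det = 1), chosen by hand
Q_mixed = Q @ G
K_mixed = K @ np.linalg.inv(G).T                 # so that Q_mixed K_mixed^T = Q K^T

out_ref = attention(X, Q, K, V)
out_mixed = attention(X, Q_mixed, K_mixed, V)

print(np.isclose(y_ref, y_rescaled), np.allclose(out_ref, out_mixed))
```

Both transformations change the parameters and their norms without changing any output, which is the kind of degeneracy among equal-loss solutions that the captions describe as being resolved by norm equilibration.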

Original abstract

A key challenge in machine learning is to explain how learning dynamics select among the many solutions that achieve identical loss values in overparameterized models, a phenomenon known as implicit bias. Controlling this bias provides a direct mechanism on learned representations, which are central to interpretability, robustness, and reasoning in modern AI systems. Yet, despite its importance, existing explanations remain largely ad hoc and lack a unifying mechanism. We develop a theoretical and constructive framework in which implicit bias emerges as a geometric correction induced by the interplay between gradient noise and continuous symmetries of the loss. We compute the induced bias across a range of architectures, predicting new behaviors and explaining known ones. The approach also enables inverse design: by engineering predictor-preserving parameterizations, it is possible to shape the bias, with sparsity and spectral sparsity emerging as canonical instances. Numerical experiments support the theory and validate the inverse-design framework in controlled settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The geometric correction from noise-symmetry interplay is a clean framing, but the SDE approximation lacks error control at the finite rates used in the experiments.

The paper's core move is to treat implicit bias as a predictable geometric drift that arises when gradient noise acts on the continuous symmetries of the loss. They derive an effective correction by averaging over the symmetry orbit and then show how to engineer the bias by reparameterizing the model while keeping the predictor fixed. That inverse-design step is the part that feels most useful: it turns an explanatory story into something you can actually tune, with sparsity and spectral sparsity as concrete examples. They also compute the predicted bias for several standard architectures and run controlled numerical checks that line up with the formulas. Those pieces are concrete and go beyond the usual hand-waving about flat minima or symmetry breaking.

The main weakness is the perturbative treatment. The derivation leans on an Itô or Fokker-Planck expansion that assumes small noise and small step size; the paper does not supply bounds showing that the higher-order terms remain negligible at the learning rates they actually use in the experiments. If those terms are comparable to the leading geometric correction, the claimed predictive power drops. The numerics are presented as validation, but without a separate check that the neglected pieces are small, it is hard to tell whether the agreement is robust or just consistent within the approximation's comfort zone. The citation pattern looks standard for the implicit-bias literature and does not appear to over-claim prior results.

This is worth a serious referee for anyone working on geometric or symmetry-based accounts of optimization. The idea is clear enough and the constructive part is new enough that a careful review could tighten the approximation and decide whether the framework holds up. I would bring it to a reading group to see the full derivations, but I would not cite it yet without those bounds.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a theoretical framework in which implicit bias of stochastic gradient descent emerges as a geometric correction induced by the interplay between gradient noise and continuous symmetries of the loss. The authors derive this correction via Lie-algebra averaging over symmetry orbits, compute explicit biases for concrete architectures, predict new behaviors, explain known ones, and demonstrate inverse design by engineering predictor-preserving parameterizations that induce sparsity or spectral sparsity. Numerical experiments in controlled settings are presented to support the theory.

Significance. If the derivation holds, the work supplies a unifying geometric mechanism for implicit bias that moves beyond ad-hoc explanations and directly enables constructive control of learned representations. The inverse-design component is a notable strength, as are the explicit computations across architectures and the attempt to link noise-induced drift to symmetry orbits. These elements could influence both theoretical understanding and practical parameterization choices in overparameterized models.

major comments (2)

[Derivation of the geometric correction (SDE modeling and averaging step)] The central derivation treats the diffusion coefficient perturbatively within an Itô/Fokker-Planck regime to obtain the leading geometric correction (via projection onto the tangent space of the level set). No error bounds or remainder estimates are supplied for the neglected O(η^{3/2}) and higher Itô–Stratonovich terms that appear at finite step-size η. Because the numerical experiments employ practical finite learning rates, the absence of these controls leaves open whether the claimed predictive power survives outside the infinitesimal-noise limit.
[Numerical experiments and architecture-specific computations] The modeling choice that gradient noise interacts with continuous symmetries to produce a computable, architecture-specific bias is load-bearing for all subsequent claims. The paper validates this only within the same perturbative framework used to derive it; no independent test (e.g., comparison against exact discrete SGD trajectories at moderate η or against non-Gaussian noise) is provided to rule out circularity.

minor comments (2)

[Abstract] The abstract introduces 'predictor-preserving parameterizations' without a forward reference; a one-sentence definition or pointer to the relevant section would improve readability.
[Notation and preliminaries] Notation for the Lie-algebra generators, the projection operator, and the diffusion tensor should be collected in a single table or preliminary section to reduce cross-referencing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. The comments highlight important aspects of the perturbative derivation and validation strategy. We address each point below and describe the revisions we will make to strengthen the presentation.

Referee: The central derivation treats the diffusion coefficient perturbatively within an Itô/Fokker-Planck regime to obtain the leading geometric correction (via projection onto the tangent space of the level set). No error bounds or remainder estimates are supplied for the neglected O(η^{3/2}) and higher Itô–Stratonovich terms that appear at finite step-size η. Because the numerical experiments employ practical finite learning rates, the absence of these controls leaves open whether the claimed predictive power survives outside the infinitesimal-noise limit.

Authors: We agree that the derivation is perturbative and that rigorous remainder estimates for the Itô–Stratonovich corrections at finite η are not provided. Obtaining such bounds while preserving the Lie-algebra averaging over symmetry orbits is technically demanding and lies outside the scope of the present work. In the revision we will add a new subsection discussing the regime of validity of the leading-order approximation, including heuristic scaling arguments and additional numerical comparisons of the predicted bias against discrete SGD trajectories at moderate learning rates (η ≈ 10^{-3}–10^{-2}). These checks will clarify the practical range in which the geometric correction remains predictive.

revision: partial

Referee: The modeling choice that gradient noise interacts with continuous symmetries to produce a computable, architecture-specific bias is load-bearing for all subsequent claims. The paper validates this only within the same perturbative framework used to derive it; no independent test (e.g., comparison against exact discrete SGD trajectories at moderate η or against non-Gaussian noise) is provided to rule out circularity.

Authors: We acknowledge the concern about potential circularity. The current experiments were designed to isolate the symmetry-induced drift under the modeling assumptions, but they do not constitute fully independent verification. We will revise the numerical section to include (i) direct comparisons of the analytic bias formula against full discrete SGD trajectories at finite step sizes and (ii) simulations with non-Gaussian noise (e.g., heavy-tailed and clipped gradients). These additions will provide an independent test of the architecture-specific predictions and the robustness of the geometric mechanism.

revision: yes

Circularity Check

0 steps flagged

No circularity: derivation proceeds from SDE geometry and symmetry averaging without reduction to fitted inputs or self-citation chains.

The paper constructs the implicit bias explicitly as a drift correction term arising from averaging stochastic gradient noise over the orbit of continuous symmetries of the loss, using the Lie algebra action and projection onto the tangent space of level sets. This step is derived from the Fokker-Planck or Itô expansion of the SGD SDE and produces computable predictions for specific architectures that are then checked numerically; no parameter is fitted to the target bias and then relabeled as a prediction, and no load-bearing premise rests on a self-citation whose content is itself unverified. The framework therefore remains self-contained against external benchmarks and does not collapse by construction to its modeling assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard domain assumptions about loss symmetries and stochastic noise without introducing new free parameters or invented entities in the abstract description.

axioms (2)

domain assumption: Loss functions possess continuous symmetries.
Invoked as the source of the geometric correction when combined with gradient noise.
domain assumption: Stochastic gradient descent produces noise that interacts geometrically with loss symmetries.
Central to computing the induced bias and enabling inverse design.

pith-pipeline@v0.9.0 · 5472 in / 1371 out tokens · 67444 ms · 2026-05-16T15:03:14.095092+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

L_eff(θ) = L(θ) + σ²/2β log det G(θ); for uv = θ symmetry, det G_χ = 2θ yields log θ term minimized at balanced u = v

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes

rescaling symmetry (λu, λ⁻¹v) preserves product; induced bias |v_i|/‖W[i,:]‖₂ → 1
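
For the scalar rescaling symmetry quoted above, a short standard calculation (worked here for illustration and not quoted from the paper) shows why noiseless gradient flow cannot pick a point on the orbit while noisy discrete steps can. It assumes a loss of the form ℓ(uv), a step size η, and a per-sample residual r_t = ℓ_t'(u_t v_t); these symbols are introduced only for this sketch.

```latex
% Gradient flow on L(u,v) = \ell(uv) conserves the imbalance u^2 - v^2:
\dot u = -\ell'(uv)\,v, \qquad \dot v = -\ell'(uv)\,u
\quad\Longrightarrow\quad
\frac{d}{dt}\bigl(u^2 - v^2\bigr) = -2\,\ell'(uv)\,(uv - vu) = 0 .

% One stochastic gradient step with residual r_t and step size \eta:
u_{t+1} = u_t - \eta\, r_t\, v_t, \qquad v_{t+1} = v_t - \eta\, r_t\, u_t
\quad\Longrightarrow\quad
u_{t+1}^2 - v_{t+1}^2 = \bigl(1 - \eta^2 r_t^2\bigr)\bigl(u_t^2 - v_t^2\bigr).
```

Whenever 0 < η² r_t² < 2 the imbalance contracts, so residuals kept away from zero by sampling noise push the iterates toward the balanced point u = v. This is consistent with the balanced solution named in the entries above, although it is an exact discrete identity for this one model and not the paper's general derivation.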

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2018

[12] [12]

Implicit regularization in deep matrix factorization

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. InAdvances in Neural Information Processing Systems, 2019

work page 2019

[13] [13]

Implicit regularization of discrete gradient dynamics in linear neural networks

Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. InAdvances in Neural Information Processing Systems, 2019

work page 2019

[14] [14]

Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank.Applied and Computational Harmonic Analysis, 68:101595, 2024

Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, and Holger Rauhut. Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank.Applied and Computational Harmonic Analysis, 68:101595, 2024

work page 2024

[15] [15]

Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds.Research, 6:0024, 2023

Mengjia Xu, Akshay Rangamani, Qianli Liao, Tomer Galanti, and Tomaso Poggio. Dynamics in deep classifiers trained with the square loss: Normalization, low rank, neural collapse, and generalization bounds.Research, 6:0024, 2023

work page 2023

[16] [16]

Implicit regularization in deep learning may not be explainable by norms

Noam Razin and Nadav Cohen. Implicit regularization in deep learning may not be explainable by norms. InAdvances in Neural Information Processing Systems, pages 21174–21187, 2020

work page 2020

[17] [17]

What happens after SGD reaches zero loss? – a mathematical framework

Zhiyuan Li, Tianhao Wang, and Sanjeev Arora. What happens after SGD reaches zero loss? – a mathematical framework. InProceedings of the International Conference on Learning Representations, 2022

work page 2022

[18] [18]

Implicit bias of deep linear networks in the large learning rate phase, 2020

Wei Huang, Weitao Du, Richard Yi Da Xu, and Chunrui Liu. Implicit bias of deep linear networks in the large learning rate phase, 2020

work page 2020

[19] [19]

A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information Theory, 39(3):930–945, 1993. 24

work page 1993

[20] [20]

PhD thesis, Toyota Technological Institute at Chicago, 2017

Behnam Neyshabur.Implicit regularization in deep learning. PhD thesis, Toyota Technological Institute at Chicago, 2017

work page 2017

[21] [21]

Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss

Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. InProceedings of the Conference on Learning Theory, pages 1305–1338, 2020

work page 2020

[22] [22]

Stochastic gradient descent as approximate Bayesian inference.Journal of Machine Learning Research, 18(134):1–35, 2017

Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate Bayesian inference.Journal of Machine Learning Research, 18(134):1–35, 2017

work page 2017

[23] [23]

Stochastic modified equations and adaptive stochastic gradient algorithms

Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. InProceedings of the International Conference on Machine Learning, pages 2101– 2110, 2017

work page 2017

[24] [24]

Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

work page 2019

[25] [25]

Theory of deep learning IIb: Optimization properties of SGD, 2018

Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, and Tomaso Poggio. Theory of deep learning IIb: Optimization properties of SGD, 2018

work page 2018

[26] [26]

A Bayesian perspective on generalization and stochastic gradient descent

Samuel L Smith and Quoc V Le. A Bayesian perspective on generalization and stochastic gradient descent. InProceedings of the International Conference on Learning Representations, 2018

work page 2018

[27] [27]

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InProceedings of the International Conference on Learning Representations, 2021

work page 2021

[28] [28]

Topological invariance and breakdown in learning, 2025

Yongyi Yang, Tomaso Poggio, Isaac Chuang, and Liu Ziyin. Topological invariance and breakdown in learning, 2025

work page 2025

[29] [29]

Neural thermodynamics: Entropic forces in deep and universal representation learning

Liu Ziyin, Yizhou Xu, and Isaac Chuang. Neural thermodynamics: Entropic forces in deep and universal representation learning. InAdvances in Neural Information Processing Systems, 2025

work page 2025

[30] [30]

Parameter symmetry and noise equilibrium of stochastic gradient descent

Liu Ziyin, Mingze Wang, Hongchao Li, and Lei Wu. Parameter symmetry and noise equilibrium of stochastic gradient descent. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[31] [31]

Symmetry induces structure and constraint of learning

Liu Ziyin. Symmetry induces structure and constraint of learning. InProceedings of the International Conference on Machine Learning, pages 62847–62866, 2024

work page 2024

[32] [32]

Parameter symmetry potentially unifies deep learning theory, 2025

Liu Ziyin, Yizhou Xu, Tomaso Poggio, and Isaac Chuang. Parameter symmetry potentially unifies deep learning theory, 2025

work page 2025

[33] [33]

Cambridge University Press, Cambridge, UK, 2009

Sumio Watanabe.Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, Cambridge, UK, 2009

work page 2009

[34] [34]

David G. Kendall. A survey of the statistical theory of shape.Statistical Science, 4(2):87–99, 1989

work page 1989

[35] [35]

Intrinsic statistics on Riemannian manifolds: Basic tools for geometric mea- surements.Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006

Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric mea- surements.Journal of Mathematical Imaging and Vision, 25(1):127–154, 2006

work page 2006

[36] [36]

Intrinsic shape analysis: Geodesic principal component analysis for Riemannian manifolds modulo Lie group actions.Statistica Sinica, 20(1):1–100, 2010

Stephan Huckemann, Thomas Hotz, and Axel Munk. Intrinsic shape analysis: Geodesic principal component analysis for Riemannian manifolds modulo Lie group actions.Statistica Sinica, 20(1):1–100, 2010. 25

work page 2010

[37] [37]

Classical statistical mechanics of constraints: A theorem and applications to polymers.The Journal of Chemical Physics, 69(4):1527–1537, 1974

Michael Fixman. Classical statistical mechanics of constraints: A theorem and applications to polymers.The Journal of Chemical Physics, 69(4):1527–1537, 1974

work page 1974

[38] [38]

Imperial College Press, London, 2010

Tony Lelièvre, Mathias Rousset, and Gabriel Stoltz.Free Energy Computations. Imperial College Press, London, 2010

work page 2010

[39] [39]

Numerical-integration of Cartesian equations of motion of a system with constraints – molecular-dynamics of N-alkanes

Jean-Paul Ryckaert, Giovanni Ciccotti, and Herman Berendsen. Numerical-integration of Cartesian equations of motion of a system with constraints – molecular-dynamics of N-alkanes. Journal of Computational Physics, 23:327–341, March 1977

work page 1977

[40] [40]

Riemann manifold Langevin and Hamiltonian Monte Carlo methods.Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011

Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods.Journal of the Royal Statistical Society: Series B, 73(2):123–214, 2011

work page 2011

[41] [41]

Chrysos, YongtaoWu, RazvanPascanu, Philip Torr, andVolkan Cevher

GrigoriosG. Chrysos, YongtaoWu, RazvanPascanu, Philip Torr, andVolkan Cevher. Hadamard product in deep learning: Introduction, advances and challenges.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(8), 2025

work page 2025

[42] [42]

A survey on deep matrix factoriza- tions.Comput

Pierre De Handschutter, Nicolas Gillis, and Xavier Siebert. A survey on deep matrix factoriza- tions.Comput. Sci. Rev., 42(C), November 2021

work page 2021

[43] [43]

Springer, Berlin, 1969

Herbert Federer.Geometric Measure Theory. Springer, Berlin, 1969. 26

work page 1969