Born Discrete, Made Smooth: Variational Formulation of Shallow Neural Networks
Pith reviewed 2026-07-03 06:05 UTC · model grok-4.3
The pith
The optimal parameter density for shallow neural networks is recovered by solving a single linear system.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We replace the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate given by a family of λ-convex functionals over parameter densities in weighted Sobolev spaces. These problems are globally well-posed, stable, and exhibit almost C³ regularity. The resulting Euler-Lagrange equation is linear, so the optimal density is obtained by solving a single linear system. Generalization error is controlled at rate 1/α and finite networks converge to the continuum optimum at O(1/N).
What carries the argument
The family of λ-convex functionals over parameter densities in weighted Sobolev spaces, which turns the training problem into a convex variational problem whose Euler-Lagrange equation is a linear elliptic equation.
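To make the "one linear solve" idea concrete, here is a minimal sketch on a one-dimensional parameter domain. Everything specific in it is an assumption made for illustration and is not taken from the paper: the operator form (a gradient penalty with weight beta, a weighted L² penalty with weight alpha, and a data-fit kernel K), the weight omega, the tanh feature, the Dirichlet boundary condition, and the function name `solve_density`.

```python
import numpy as np

def solve_density(n=200, alpha=1.0, beta=0.1, seed=0):
    """Solve (-beta * Laplacian + alpha * omega + K) u = Q on a grid over [-1, 1].

    Hypothetical toy setup: this is the first-order condition of the quadratic energy
    0.5 * mean((phi @ u * h - y)**2) + 0.5 * alpha * int(omega u^2) + 0.5 * beta * int(u'^2).
    """
    theta = np.linspace(-1.0, 1.0, n)          # parameter grid
    h = theta[1] - theta[0]                    # quadrature weight
    # Second-difference Laplacian with homogeneous Dirichlet boundary (assumed).
    lap = (np.diag(-2.0 * np.ones(n))
           + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / h**2
    omega = (1.0 + theta**2) ** 2              # hypothetical weight function
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, 50)             # toy inputs
    y = np.sin(np.pi * x)                      # toy targets
    phi = np.tanh(np.outer(x, theta))          # hypothetical feature phi(x, theta)
    K = phi.T @ phi * h / len(x)               # discretized kernel integral operator
    Q = phi.T @ y / len(x)                     # right-hand side from the data
    A = -beta * lap + alpha * np.diag(omega) + K
    u = np.linalg.solve(A, Q)                  # the single linear solve
    return theta, u, A, Q

theta, u, A, Q = solve_density()
print("residual:", np.linalg.norm(A @ u - Q))
```

In this toy discretization the matrix is symmetric positive definite with smallest eigenvalue at least alpha, which is the finite-dimensional counterpart of strong convexity and explains why the solve is well-conditioned for larger alpha.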
If this is right
- The optimal continuum density is recovered exactly by one linear solve instead of iterative training.
- Generalization error is controlled explicitly at rate 1/α, where α is the regularization parameter.
- The best finite-width network of width N attains the continuum optimum up to an O(1/N) gap.
- The formulation unifies the NTK regime with feature learning inside a single convex variational problem.
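One common mechanism behind an O(1/N) finite-width rate is Monte Carlo sampling of neurons from the continuum density: the squared error of an N-sample average decays like 1/N. The sketch below illustrates that mechanism only; it is not claimed to be the paper's proof. The density, the tanh feature, the grids, and the name `mc_gap` are all hypothetical, and the density is taken nonnegative for simplicity.

```python
import numpy as np

def mc_gap(N, trials=300, seed=0):
    """Mean squared gap between a continuum network and an N-neuron sampled network."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(-3.0, 3.0, 601)        # parameter grid
    p = np.exp(-(theta - 1.0) ** 2)            # hypothetical nonnegative density
    p /= p.sum()                               # normalize to a probability vector
    x = np.linspace(-1.0, 1.0, 20)             # evaluation points
    f_cont = np.tanh(np.outer(x, theta)) @ p   # continuum output, shape (20,)
    # Draw N neuron parameters per trial from the density.
    atoms = rng.choice(theta, size=(trials, N), p=p)
    # Width-N network output: average of N sampled features, shape (trials, 20).
    f_N = np.tanh(x[None, :, None] * atoms[:, None, :]).mean(axis=2)
    return float(np.mean((f_N - f_cont) ** 2))

for N in (25, 100, 400):
    print(N, mc_gap(N))
```

Under these assumptions, multiplying N by 4 should divide the printed gap by roughly 4, up to sampling noise.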
Where Pith is reading between the lines
- The linear-system solution could be used to derive closed-form expressions for the effective feature map learned by the network.
- The Sobolev-space formulation may allow direct transfer of existing elliptic regularity tools to deeper architectures by lifting the density to higher-dimensional parameter spaces.
- Explicit control of the density in weighted Sobolev norms suggests new regularization penalties that penalize roughness in parameter space rather than weight magnitude alone.
Load-bearing premise
The λ-convex functionals on parameter densities in weighted Sobolev spaces are well-posed and possess the claimed elliptic regularity.
What would settle it
For a fixed shallow architecture and data set, compute the variational minimizer via the linear system and compare its value of the original discrete loss against the loss achieved by standard gradient descent on the same network; a gap larger than the predicted O(1/N) rate would refute the equivalence claim.
Original abstract
Although neural networks are remarkably effective, their underlying optimization principles remain theoretically elusive, often characterized by non-convex landscapes and stochastic heuristics. In this work, we propose a paradigm shift by replacing the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate. We identify a family of $\lambda$-convex functionals over parameter densities in weighted Sobolev spaces and prove that these variational problems are globally well-posed, stable, and exhibit unexpected almost $C^3$ regularity. Unlike existing Wasserstein-based or Mean-Field approaches, which often face limited regularity and discretization challenges, our formulation provides direct access to elliptic regularity and convex analysis. This allows us to prove that the optimal parameter density can be obtained by solving a single linear system, bypassing iterative optimization entirely. We establish explicit generalization error controls at a rate of $1/\alpha$ relative to the regularization parameter, and prove that finite-width networks of size $N$ achieve the continuum optimum at an $O(1/N)$ rate. This perspective bridges the gap between the Neural Tangent Kernel (NTK) and feature-learning regimes, providing a principled framework for understanding over-parameterization through the lens of variational calculus.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a continuum variational formulation for shallow neural networks, replacing discrete training with λ-convex functionals defined over parameter densities in weighted Sobolev spaces. It claims these problems are globally well-posed and stable, exhibit almost C³ regularity, admit an optimal density obtained by solving a single linear system (bypassing iterative optimization), and yield generalization error bounds of order 1/α together with an O(1/N) rate for finite-width networks approximating the continuum optimum.
Significance. If the central claims on well-posedness, linearity of the Euler-Lagrange equation, and the stated rates hold, the work would supply a non-iterative, convex-analysis route to optimal parameter distributions and a variational bridge between NTK and feature-learning regimes. The manuscript does not, however, supply machine-checked proofs, reproducible code, or explicit parameter-free derivations, so the significance remains conditional on verification of the load-bearing analytic steps.
major comments (2)
- [Abstract] Abstract (paragraph 2): the claim that the optimal parameter density is obtained by solving a single linear system requires that the first variation of the proposed λ-convex functional yields a linear Euler-Lagrange operator. λ-convexity alone guarantees strong convexity and uniqueness but does not imply linearity of the variation unless the functional is quadratic in the density; the manuscript must exhibit the explicit functional and derive the EL equation to confirm this step.
- [Abstract] Abstract (paragraph 2): the invocation of elliptic regularity in weighted Sobolev spaces to obtain both the linear system and almost-C³ regularity is not accompanied by the requisite estimates or functional definition; without these, neither the linear-system bypass nor the claimed rates follow.
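The distinction raised in the first comment can be written out in a short worked example. The notation below (the operator $A$, the datum $b$, and the functionals $F$ and $G$) is generic and assumed for illustration; it is not the manuscript's functional.

```latex
% Quadratic, \lambda-convex energy on a Hilbert space H:
\[
F(u) = \tfrac12 \langle A u, u \rangle - \langle b, u \rangle,
\qquad A = A^{*}, \quad \langle A u, u \rangle \ge \lambda \|u\|^{2}.
\]
% First variation in direction v, and the resulting Euler-Lagrange equation:
\[
\delta F(u)[v] = \langle A u - b, v \rangle
\quad\Longrightarrow\quad A u = b \quad \text{(linear in } u\text{)}.
\]
% A \lambda-convex but non-quadratic energy on L^2 \cap L^4:
\[
G(u) = \tfrac{\lambda}{2} \|u\|_{L^{2}}^{2} + \tfrac14 \|u\|_{L^{4}}^{4} - \langle b, u \rangle
\quad\Longrightarrow\quad \lambda u + u^{3} = b \quad \text{(nonlinear in } u\text{)}.
\]
```

Both energies are λ-convex and have a unique minimizer, but only the quadratic one leads to a linear equation, which is why the explicit form of the functional matters for the single-linear-system claim.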
Simulated Author's Rebuttal
We thank the referee for the careful reading and the detailed comments. We address the two major comments point by point below, clarifying where the explicit constructions and derivations appear in the manuscript.
- Referee: [Abstract] Abstract (paragraph 2): the claim that the optimal parameter density is obtained by solving a single linear system requires that the first variation of the proposed λ-convex functional yields a linear Euler-Lagrange operator. λ-convexity alone guarantees strong convexity and uniqueness but does not imply linearity of the variation unless the functional is quadratic in the density; the manuscript must exhibit the explicit functional and derive the EL equation to confirm this step.
  Authors: We agree that λ-convexity by itself does not force a linear Euler-Lagrange operator. Our construction, however, uses an explicitly quadratic functional in the density variable (defined in Section 2 over the weighted Sobolev space). Because the energy is quadratic, its first variation is linear by direct differentiation; the resulting Euler-Lagrange equation is therefore a linear integral equation. The explicit functional and the derivation of the linear operator are given in Sections 3.1–3.2, culminating in Theorem 3.3, which states the linear system solved by the optimal density.
  Revision: no
- Referee: [Abstract] Abstract (paragraph 2): the invocation of elliptic regularity in weighted Sobolev spaces to obtain both the linear system and almost-C³ regularity is not accompanied by the requisite estimates or functional definition; without these, neither the linear-system bypass nor the claimed rates follow.
  Authors: The functional is defined in Section 2.1 as a λ-convex quadratic form on the weighted Sobolev space H¹_w. The elliptic regularity theory for this specific weighted space is developed in Section 4, where we obtain the almost-C³ estimates (Proposition 4.2) that justify both the linear-system representation and the subsequent generalization and approximation rates. These estimates are used in Theorems 5.2 and 6.1 to derive the 1/α generalization bound and the O(1/N) finite-width rate. The derivations are fully contained in the manuscript.
  Revision: no
Circularity Check
No circularity detected; derivation framed as proof from chosen family of functionals
The abstract identifies a specific family of λ-convex functionals in weighted Sobolev spaces and states that this choice yields global well-posedness plus an Euler-Lagrange equation reducible to a single linear system. No equations or self-citations are supplied that would demonstrate the linear system is obtained by definition, by fitting a parameter, or by a load-bearing self-citation chain. The central claim is therefore presented as an independent consequence of the variational setup rather than a tautology or renamed input. Because the provided text contains no explicit reduction (e.g., Eq. X defined in terms of the target optimum), the derivation chain is treated as self-contained.