Lipschitz bounds for integral kernels

Fabrice Gamboa; Justin Reverdi; Serge Gratton; Sixin Zhang

arXiv: 2604.02887 · v1 · submitted 2026-04-03 · stat.ML · cs.LG

Pith reviewed 2026-05-13 18:19 UTC · model grok-4.3

keywords Lipschitz continuity · integral kernels · neural network kernels · Gaussian kernel · ReLU kernel · shift-invariant kernels · feature maps · stability guarantees

The pith

The Lipschitz constant of infinite-width two-layer neural network kernels equals the supremum of a two-dimensional integral over the weight distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit conditions and formulas that guarantee Lipschitz continuity for feature maps of integral kernels under differentiability assumptions. For two-layer networks of infinite width with isotropic Gaussian weights, it shows that the Lipschitz constant reduces exactly to the supremum of a two-dimensional integral, which yields closed-form expressions for the Gaussian kernel and the ReLU random neural network kernel. The same framework covers continuous shift-invariant kernels such as Gaussian, Laplace, and Matérn, which can be viewed as networks with cosine activations; in this case the feature map is Lipschitz continuous if and only if the weight distribution has finite second-order moments. These bounds supply concrete stability and robustness guarantees for kernel methods that rely on the associated feature maps.

Core claim

Under differentiability assumptions on the kernel, the Lipschitz constant of the associated feature map equals the supremum of a certain integral expression involving the kernel and its derivatives. For infinite-width two-layer networks with isotropic Gaussian weight distributions this supremum is taken over a two-dimensional integral, producing explicit characterizations for the Gaussian kernel and the ReLU random neural network kernel. For continuous shift-invariant kernels the feature map is Lipschitz continuous precisely when the weight distribution possesses a finite second-order moment, and the constant is then given by an explicit formula involving that moment.
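
As a worked instance of what such a constant looks like, the Gaussian case can be checked by hand. The sketch below assumes the canonical RKHS feature map $\varphi(x) = k(\cdot, x)$ and the normalization $k(x,y) = \exp(-\|x-y\|^2/(2\sigma^2))$; the paper's own normalization and notation may differ, and its general formula is not reproduced here.

```latex
\begin{align*}
\|\varphi(x)-\varphi(y)\|_{\mathcal{H}}^2
  &= k(x,x) - 2\,k(x,y) + k(y,y) \\
  &= 2\left(1 - e^{-\|x-y\|^2/(2\sigma^2)}\right)
  \;\le\; \frac{\|x-y\|^2}{\sigma^2}
  \qquad \text{since } 1 - e^{-s} \le s \text{ for } s \ge 0 .
\end{align*}
% The ratio \|\varphi(x)-\varphi(y)\|_{\mathcal{H}} / \|x-y\| tends to 1/\sigma as y \to x,
% so under this normalization the Lipschitz constant is exactly 1/\sigma.
```

The supremum is approached only in the limit $y \to x$, which is why the constant is naturally expressed through derivatives of the kernel on the diagonal.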

What carries the argument

The supremum of a two-dimensional integral over the isotropic Gaussian weight distribution, which directly supplies the Lipschitz constant of the infinite-width neural-network kernel.

If this is right

Explicit Lipschitz constants become available for the Gaussian kernel and the ReLU random neural network kernel.
Shift-invariant kernels are Lipschitz continuous exactly when the weight distribution has finite second-order moments.
Stability guarantees for kernel methods follow immediately from these constants.
The asymptotic behavior of the Lipschitz constant as width tends to infinity remains an open question for finite-width networks.
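
The second bullet can be illustrated numerically without any of the paper's machinery. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses the identity ‖φ(x) − φ(y)‖² = 2(1 − k(x, y)) for a kernel with k(x, x) = 1, the assumed normalizations exp(−r²/(2σ²)) for the Gaussian kernel and exp(−r/σ) for the Laplace kernel, and an arbitrary bandwidth. The function names are hypothetical.

```python
import numpy as np

def gaussian_gap(r, sigma):
    """1 - k for the Gaussian kernel k = exp(-r^2 / (2 sigma^2)), computed stably."""
    return -np.expm1(-r**2 / (2.0 * sigma**2))

def laplace_gap(r, sigma):
    """1 - k for the Laplace kernel k = exp(-r / sigma), computed stably."""
    return -np.expm1(-r / sigma)

def distance_ratio(gap, r):
    """||phi(x) - phi(y)|| / ||x - y|| when k(x, x) = 1 and ||x - y|| = r."""
    return np.sqrt(2.0 * gap) / r

sigma = 2.0  # bandwidth, chosen arbitrarily for the illustration
radii = np.logspace(-6, 1, 50)  # distances ||x - y|| from 1e-6 to 10

gauss_ratio = distance_ratio(gaussian_gap(radii, sigma), radii)
laplace_ratio = distance_ratio(laplace_gap(radii, sigma), radii)

print("Gaussian: max ratio =", gauss_ratio.max(), " 1/sigma =", 1.0 / sigma)
print("Laplace: ratio at smallest radius =", laplace_ratio[0])
```

The Gaussian ratio stays below 1/σ and approaches it at small distances, while the Laplace ratio grows without bound as the distance shrinks. This is consistent with the Laplace kernel's heavy-tailed (Cauchy-type) spectral distribution having no finite second moment.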

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The moment condition suggests a practical design rule: choose weight distributions with controlled second moments to enforce a target Lipschitz bound without changing the kernel form.
The two-dimensional integral representation may extend to multi-layer or non-Gaussian weight distributions, offering a route to Lipschitz control in deeper random networks.
These bounds could be used to certify robustness margins in kernel-based classifiers by plugging the derived constant into existing generalization or stability theorems.
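
The design rule in the first extension can be sketched for random Fourier features. The code below is a minimal illustration, assuming the finite feature map φ_m(x) = m^(−1/2) [cos(w_i·x), sin(w_i·x)] and Gaussian weights with scale 1/σ. For this map, ‖φ_m(x) − φ_m(y)‖² = (2/m) Σ (1 − cos(w_i·(x − y))) ≤ (1/m) Σ (w_i·(x − y))², so the square root of the largest eigenvalue of the empirical second-moment matrix is a valid Lipschitz bound. The names `lipschitz_bound` and `fourier_features` are hypothetical and the paper may state its constant in a different form.

```python
import numpy as np

def lipschitz_bound(weights):
    """sqrt of the largest eigenvalue of the empirical second moment (1/m) W^T W."""
    second_moment = weights.T @ weights / weights.shape[0]
    return float(np.sqrt(np.linalg.eigvalsh(second_moment)[-1]))

def fourier_features(x, weights):
    """Random Fourier feature map with paired cosine and sine coordinates."""
    proj = weights @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(weights.shape[0])

rng = np.random.default_rng(1)
d, m, sigma = 3, 50000, 2.0           # illustrative sizes, not taken from the paper
weights = rng.standard_normal((m, d)) / sigma  # assumed: Gaussian kernel, bandwidth sigma
bound = lipschitz_bound(weights)

ratios = []
for _ in range(100):
    x = rng.standard_normal(d)
    # Perturbations at several scales, from 1e-3 up to order one.
    delta = rng.standard_normal(d) * 10.0 ** rng.uniform(-3.0, 0.0)
    diff = fourier_features(x + delta, weights) - fourier_features(x, weights)
    ratios.append(np.linalg.norm(diff) / np.linalg.norm(delta))
max_ratio = max(ratios)

print("second-moment bound:", bound, " largest observed ratio:", max_ratio)
```

Rescaling the weights rescales the bound by the same factor, which is the sense in which the second moment acts as a tuning knob for the Lipschitz constant.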

Load-bearing premise

The general formulas require the kernel to be differentiable; the shift-invariant result requires the weight distribution to have a finite second-order moment.

What would settle it

A direct numerical comparison between the supremum of the two-dimensional integral and an empirical estimate of the Lipschitz constant on a large finite sample of the ReLU random neural network kernel would falsify the claimed equality if the two quantities differ.
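
A check of this kind could be set up along the following lines. This is a sketch under stated assumptions, not the paper's experiment: it assumes the bias-free kernel k(x, y) = E[ReLU(w·x) ReLU(w·y)] with w ~ N(0, I), uses the known closed form of that expectation (the order-1 arc-cosine kernel), and samples random pairs rather than searching for the supremum. The paper's two-dimensional integral is not reproduced; its value would be substituted for the comparison. The only bound used here is the elementary one: ReLU is 1-Lipschitz, so the ratio cannot exceed 1 under this normalization.

```python
import numpy as np

def relu_kernel(x, y):
    """Closed form of E_w[relu(w.x) relu(w.y)] for w ~ N(0, I) (arc-cosine kernel, order 1)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nx * ny * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def exact_distance(x, y):
    """Feature-space distance from the kernel: sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    sq = relu_kernel(x, x) + relu_kernel(y, y) - 2.0 * relu_kernel(x, y)
    return np.sqrt(max(sq, 0.0))

def random_feature_distance(x, y, W):
    """Finite-width estimate using the m rows of W as hidden-layer weights."""
    diff = np.maximum(W @ x, 0.0) - np.maximum(W @ y, 0.0)
    return np.linalg.norm(diff) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
d, m, n_pairs = 3, 20000, 200        # illustrative sizes, not taken from the paper
W = rng.standard_normal((m, d))

exact_ratios, mc_ratios = [], []
for _ in range(n_pairs):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = np.linalg.norm(x - y)
    exact_ratios.append(exact_distance(x, y) / gap)
    mc_ratios.append(random_feature_distance(x, y, W) / gap)
exact_ratios, mc_ratios = np.array(exact_ratios), np.array(mc_ratios)

print("largest ratio (closed form):", exact_ratios.max())
print("largest ratio (width m):    ", mc_ratios.max())
```

Random sampling only gives a lower estimate of the supremum, so a gap between the sampled maximum and the paper's constant would not by itself falsify the claim; an observed ratio above the claimed constant would.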

Figures

Figures reproduced from arXiv: 2604.02887 by Fabrice Gamboa, Justin Reverdi, Serge Gratton, Sixin Zhang.

Original abstract

Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Lipschitz continuity are closely related to robustness and stability guarantees. Despite their importance, explicit characterizations of the Lipschitz constant of kernel feature maps are available only in a limited number of cases. In this paper, we study the Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. We first provide sufficient conditions ensuring Lipschitz continuity and derive explicit formulas for the corresponding Lipschitz constants. We then identify a condition under which the feature map fails to be Lipschitz continuous and apply these results to several important classes of kernels. For infinite-width two-layer neural networks with isotropic Gaussian weight distributions, we show that the Lipschitz constant of the associated kernel can be expressed as the supremum of a two-dimensional integral, leading to an explicit characterization for the Gaussian kernel and the ReLU random neural network kernel. We also study continuous and shift-invariant kernels such as Gaussian, Laplace, and Matérn kernels, which admit an interpretation as neural networks with a cosine activation function. In this setting, we prove that the feature map is Lipschitz continuous if and only if the weight distribution has a finite second-order moment, and we then derive its Lipschitz constant. Finally, we raise an open question concerning the asymptotic behavior of the convergence of the Lipschitz constant in finite-width neural networks. Numerical experiments are provided to support this behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper supplies explicit Lipschitz constants for several integral kernel feature maps including a sup-of-2D-integral formula for the ReLU infinite-width case, but the ReLU application may need extra justification for the differentiability step.

The one thing to know is that this paper derives explicit expressions for the Lipschitz constant of the feature map for several classes of integral kernels, including the supremum of a two-dimensional integral for the infinite-width ReLU network kernel, and proves that for shift-invariant kernels the feature map is Lipschitz if and only if the weight distribution has a finite second moment.

The derivations for the Gaussian kernel and the cosine-activated versions of the Laplace and Matérn kernels look straightforward and useful. The authors also give general sufficient conditions for Lipschitz continuity under differentiability. The application to neural network kernels is the part that stands out, since explicit constants are rare there. The numerics supporting the open question on finite-width asymptotics are a good addition, even though no closed form is given.

The soft spot is the ReLU part. The derivation uses differentiation under the integral sign, which requires the kernel to be differentiable. But the ReLU random neural network kernel is only C^0 in general because ReLU is not differentiable at zero. Without a specific argument showing that the mixed partials exist almost everywhere, or that the interchange is valid anyway, the claimed explicit characterization might not be fully justified. The abstract presents it as leading to an explicit characterization, so I assume they address it, but it is worth checking.

This paper is for people who need concrete Lipschitz bounds in kernel methods or for analyzing stability in wide neural networks. It is not revolutionary but fills in some missing explicit cases. I would recommend sending it to peer review because the main results on the differentiable kernels are likely to hold and the ReLU question is interesting enough to get referee input on.

Referee Report

2 major / 2 minor

Summary. The paper studies Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. It derives explicit formulas for the Lipschitz constants and applies them to infinite-width two-layer neural networks with isotropic Gaussian weight distributions, showing that the Lipschitz constant of the associated kernel can be expressed as the supremum of a two-dimensional integral. This leads to explicit characterizations for the Gaussian kernel and the ReLU random neural network kernel. For continuous and shift-invariant kernels (Gaussian, Laplace, Matérn), interpreted as neural networks with cosine activation, the feature map is Lipschitz continuous if and only if the weight distribution has finite second-order moment, with the Lipschitz constant derived accordingly. An open question on the asymptotic convergence of the Lipschitz constant for finite-width networks is raised, supported by numerical experiments.

Significance. If the derivations are rigorous, the explicit sup-of-integral characterizations and the iff condition for shift-invariant kernels provide valuable concrete tools for analyzing stability and robustness of kernel feature maps in kernel methods and neural network theory. The connection between integral kernels and infinite-width networks, plus the numerical support for finite-width asymptotics, strengthens the contribution to learning theory.

major comments (2)

[ReLU random neural network kernel section] The derivation of the Lipschitz constant via the supremum of a two-dimensional integral relies on differentiability assumptions to justify differentiation under the integral (or expectation). However, the ReLU kernel k(x,y) = E[ReLU(w·x) ReLU(w·y)] is only C^0 in general, with mixed partial derivatives failing to exist on sets of positive measure due to the kink in ReLU at zero. This is load-bearing for the claimed explicit characterization of the ReLU kernel and requires either a separate non-differentiable derivation or explicit justification for the interchange.
[Shift-invariant kernels section] The necessity direction of the iff statement (Lipschitz continuity requires finite second moment) should be checked for completeness against the differentiability assumptions used elsewhere; if it relies on the same integral representation, cross-reference the justification.

minor comments (2)

[Abstract and introduction] The abstract and introduction should explicitly state which theorem or proposition gives the two-dimensional integral representation for the Lipschitz constant.
[Numerical experiments] In the numerical experiments, clarify the network widths, sampling methods for weights, and exact metrics used to illustrate the asymptotic behavior of the Lipschitz constant.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the rigor of the derivations.

Referee: [ReLU random neural network kernel section] The derivation of the Lipschitz constant via the supremum of a two-dimensional integral relies on differentiability assumptions to justify differentiation under the integral (or expectation). However, the ReLU kernel k(x,y) = E[ReLU(w·x) ReLU(w·y)] is only C^0 in general, with mixed partial derivatives failing to exist on sets of positive measure due to the kink in ReLU at zero. This is load-bearing for the claimed explicit characterization of the ReLU kernel and requires either a separate non-differentiable derivation or explicit justification for the interchange.

Authors: We agree that the general theorem relies on differentiability to interchange derivative and expectation, and that the ReLU kernel is only C^0. The explicit characterization for the ReLU case in the manuscript was obtained by formally applying the general formula, which requires additional justification. We will revise the section by adding a direct, non-differentiable proof for the ReLU kernel: we bound ‖φ(x) − φ(y)‖ directly via the expectation of |ReLU(w·x) − ReLU(w·y)| using the explicit form of the ReLU kernel and properties of the Gaussian measure, without invoking differentiation under the integral. This establishes the same sup-of-integral expression rigorously.
revision: yes

Referee: [Shift-invariant kernels section] The necessity direction of the iff statement (Lipschitz continuity requires finite second moment) should be checked for completeness against the differentiability assumptions used elsewhere; if it relies on the same integral representation, cross-reference the justification.

Authors: The necessity direction is established by a separate argument that does not rely on differentiability of the kernel: if the second moment is infinite, the feature map φ(x) grows faster than linearly along certain directions (by direct computation of the integral representation of the cosine feature map), violating the Lipschitz condition. This proof uses only the integral form of the kernel and moment conditions, without invoking derivatives. We will add an explicit cross-reference in the shift-invariant section to the earlier general theorem on failure of Lipschitz continuity (which is proven independently of differentiability assumptions) to clarify the separation of arguments.
revision: yes
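
For orientation, both directions of the moment condition can be sketched with a standard argument. The block below is not the authors' proof and uses assumed notation: $k(x,y) = \mathbb{E}_w[\cos(w^\top(x-y))]$ with feature map $\varphi$, $\delta = x - y$, and $u$ a unit vector.

```latex
\begin{align*}
\|\varphi(x)-\varphi(y)\|^2
  &= 2\,\mathbb{E}_w\!\left[1-\cos(w^\top\delta)\right], \\[2pt]
\text{(sufficiency)}\qquad
2\,\mathbb{E}_w\!\left[1-\cos(w^\top\delta)\right]
  &\le \mathbb{E}_w\!\left[(w^\top\delta)^2\right]
   \le \lambda_{\max}\!\left(\mathbb{E}[ww^\top]\right)\|\delta\|^2,
   \quad\text{using } 1-\cos t \le \tfrac{t^2}{2}, \\[2pt]
\text{(necessity)}\qquad
\liminf_{t\to 0}\;\frac{2\,\mathbb{E}_w\!\left[1-\cos(t\,w^\top u)\right]}{t^2}
  &\ge \mathbb{E}_w\!\left[(w^\top u)^2\right]
   \quad\text{by Fatou's lemma, since } \tfrac{1-\cos(ta)}{t^2}\to\tfrac{a^2}{2}.
\end{align*}
% If E\|w\|^2 = \infty, then E[(w^\top e_i)^2] = \infty for some coordinate direction e_i,
% so the difference quotient along e_i is unbounded and \varphi is not Lipschitz.
```

Neither step differentiates the kernel, which is consistent with the authors' statement that the necessity argument is independent of the differentiability assumptions.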

Circularity Check

0 steps flagged

No circularity: derivations rest on standard integral analysis and explicit conditions

The paper derives Lipschitz constants for integral kernels by providing sufficient differentiability conditions and applying differentiation under the integral sign to obtain an explicit sup-of-two-dimensional-integral formula. This is then specialized to Gaussian and ReLU random NN kernels and to shift-invariant kernels under a finite-second-moment condition on the weight distribution. None of these steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the central claims follow from the stated assumptions via standard analysis rather than tautological renaming or imported uniqueness theorems. The ReLU non-differentiability concern affects applicability but does not create a circular reduction in the derivation chain itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on differentiability assumptions and finite-moment conditions on weight distributions; no free parameters are fitted to data and no new entities are postulated.

axioms (2)

domain assumption: Kernels are differentiable.
Invoked to derive sufficient conditions for Lipschitz continuity of feature maps.
domain assumption: Weight distribution has finite second-order moment.
Required for the if-and-only-if Lipschitz statement on shift-invariant kernels.
