New Equivalences Between Interpolation and SVMs: Kernels and Structured Features

Andrew D. McRae; Chiraag Kaushik; Mark A. Davenport; Vidya Muthukumar

arxiv: 2305.02304 · v1 · submitted 2023-05-03 · 📊 stat.ML · cs.LG

New Equivalences Between Interpolation and SVMs: Kernels and Structured Features

Chiraag Kaushik , Andrew D. McRae , Mark A. Davenport , Vidya Muthukumar This is my paper

Pith reviewed 2026-05-24 08:47 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords support vector proliferationSVMkernel methodsinterpolationreproducing kernel Hilbert spacebounded orthonormal systemssub-Gaussian featuresgeneralization bounds

0 comments

The pith

SVM decision functions coincide with minimum-norm interpolants for bounded orthonormal systems and sub-Gaussian features in arbitrary RKHS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a flexible framework to establish support vector proliferation, the exact coincidence between the SVM classifier and the minimum-norm label interpolant, inside any reproducing kernel Hilbert space. Conditions are given under which this holds for two broad feature families: general bounded orthonormal systems such as Fourier features, and independent sub-Gaussian features. The framework uses only mild generative assumptions on the labels and covers regimes outside earlier analyses. Because the equivalence lets interpolation theory be applied directly to SVMs, the authors obtain new generalization bounds for kernel SVM classification.

Core claim

We present a new and flexible analysis framework for proving support vector proliferation in an arbitrary reproducing kernel Hilbert space with a flexible class of generative models for the labels. We present conditions for SVP for features in the families of general bounded orthonormal systems and independent sub-Gaussian features. In both cases, we show that SVP occurs in many interesting settings not covered by prior work, and we leverage these results to prove novel generalization results for kernel SVM classification.

What carries the argument

Support vector proliferation (SVP), the exact match between the SVM solution and the minimum-norm interpolant, proved by verifying that the interpolant satisfies the SVM optimality (KKT) conditions via spectral or concentration properties of the chosen feature map.

If this is right

SVP holds for general bounded orthonormal systems including Fourier features in regimes not covered by prior analyses.
SVP holds for independent sub-Gaussian features under the framework's generative label models.
Novel generalization bounds for kernel SVM classification follow directly from the SVP equivalence.
The framework applies inside any reproducing kernel Hilbert space without requiring strong assumptions on the kernel spectrum.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditions may transfer other benign-overfitting results from interpolation theory to SVMs in structured kernel settings.
If the concentration properties extend to random features drawn from neural networks, SVP could explain max-margin behavior in overparameterized deep models.
The framework suggests that explicit margin maximization may be redundant once feature maps satisfy the required spectral conditions.

Load-bearing premise

The feature families must have spectral or concentration properties that force the minimum-norm interpolant to satisfy the SVM optimality conditions.

What would settle it

A concrete counter-example in which a bounded orthonormal system or sub-Gaussian feature map meets the paper's stated spectral or concentration conditions yet the resulting SVM solution differs from the minimum-norm interpolant.

Figures

Figures reproduced from arXiv: 2305.02304 by Andrew D. McRae, Chiraag Kaushik, Mark A. Davenport, Vidya Muthukumar.

**Figure 2.** Figure 2: Heatmap showing the proportion of trials (out of 25) that [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

read the original abstract

The support vector machine (SVM) is a supervised learning algorithm that finds a maximum-margin linear classifier, often after mapping the data to a high-dimensional feature space via the kernel trick. Recent work has demonstrated that in certain sufficiently overparameterized settings, the SVM decision function coincides exactly with the minimum-norm label interpolant. This phenomenon of support vector proliferation (SVP) is especially interesting because it allows us to understand SVM performance by leveraging recent analyses of harmless interpolation in linear and kernel models. However, previous work on SVP has made restrictive assumptions on the data/feature distribution and spectrum. In this paper, we present a new and flexible analysis framework for proving SVP in an arbitrary reproducing kernel Hilbert space with a flexible class of generative models for the labels. We present conditions for SVP for features in the families of general bounded orthonormal systems (e.g. Fourier features) and independent sub-Gaussian features. In both cases, we show that SVP occurs in many interesting settings not covered by prior work, and we leverage these results to prove novel generalization results for kernel SVM classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This extends SVP conditions to arbitrary RKHS with bounded orthonormal and sub-Gaussian features under flexible labels, but the actual proofs and tightness of those conditions are what matter.

read the letter

The core advance is a framework that proves support vector proliferation for kernel SVMs in settings previous work left out, specifically for general bounded orthonormal systems like Fourier features and independent sub-Gaussian features inside any RKHS, paired with a broader class of label models. They then turn those equivalences into new generalization bounds for the SVM classifier. That is the concrete step beyond the restrictive spectrum and distribution assumptions in earlier papers. If the derivations hold without extra hidden restrictions, it gives a cleaner way to analyze when the max-margin solution matches the min-norm interpolant in these families. The dependence on spectral or concentration properties is stated up front as the gate, which is honest. The main uncertainty is whether those properties are easy to verify in practice or if they end up being as limiting as the old assumptions once you look at the proofs. The generalization results are presented as novel, but they will stand or fall on how directly they follow from the SVP conditions. This is aimed at researchers already working on interpolation and overparameterized kernels. Someone tracking that subfield would want to see the details and check the conditions against concrete kernels. It is worth sending to referees because the direction fills documented gaps and the abstract positions the claims as conditional on verifiable properties rather than universal.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce a new flexible analysis framework for proving support vector proliferation (SVP) — the exact coincidence of the SVM decision function with the minimum-norm label interpolant — in an arbitrary reproducing kernel Hilbert space, under a flexible class of generative models for the labels. It derives explicit conditions for SVP when features belong to general bounded orthonormal systems (e.g., Fourier features) or independent sub-Gaussian families, shows that these conditions cover many settings outside prior work, and leverages the equivalences to obtain novel generalization bounds for kernel SVM classification.

Significance. If the stated conditions are shown to be sufficient and the derivations are correct, the framework meaningfully relaxes the restrictive distributional and spectral assumptions of earlier SVP analyses, thereby extending the reach of harmless-interpolation techniques to a wider range of kernel SVM settings and yielding new generalization guarantees. The explicit treatment of two practically relevant feature families is a concrete strength.

major comments (2)

[abstract (conditions paragraph)] The paragraph on conditions for SVP (abstract): the claimed equivalence between the minimum-norm interpolant and the SVM solution is stated to hold precisely when the chosen feature family satisfies the requisite spectral or concentration properties; the manuscript should verify, perhaps via a short counter-example or boundary-case calculation, that the equivalence indeed fails when those properties are violated, so that the scope of the new results is unambiguously delimited.
[section on bounded orthonormal systems] The section deriving SVP for bounded orthonormal systems: the argument that the minimum-norm interpolant satisfies the SVM optimality (KKT) conditions appears to rest on a uniform bound on the feature inner products; it is unclear whether this bound is strictly weaker than the eigenvalue-decay assumptions used in prior work, and a direct comparison (e.g., to the conditions of [prior reference]) would strengthen the claim of novelty.

minor comments (2)

[preliminaries] Notation for the RKHS inner product and the feature map should be introduced once and used consistently; occasional re-use of the same symbol for the empirical and population quantities is mildly confusing.
[introduction] A short table summarizing the new SVP conditions alongside those of the most closely related prior papers would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive evaluation and recommendation of minor revision. The comments help clarify the scope and novelty of our results, and we will revise the manuscript to address them.

read point-by-point responses

Referee: [abstract (conditions paragraph)] The paragraph on conditions for SVP (abstract): the claimed equivalence between the minimum-norm interpolant and the SVM solution is stated to hold precisely when the chosen feature family satisfies the requisite spectral or concentration properties; the manuscript should verify, perhaps via a short counter-example or boundary-case calculation, that the equivalence indeed fails when those properties are violated, so that the scope of the new results is unambiguously delimited.

Authors: We agree that a concrete illustration of necessity would better delimit the scope. In the revised manuscript we will add a short boundary-case calculation (in a remark after the abstract discussion of conditions) showing a simple feature family that violates the spectral/concentration properties and for which the min-norm interpolant fails to satisfy the SVM KKT conditions. revision: yes
Referee: [section on bounded orthonormal systems] The section deriving SVP for bounded orthonormal systems: the argument that the minimum-norm interpolant satisfies the SVM optimality (KKT) conditions appears to rest on a uniform bound on the feature inner products; it is unclear whether this bound is strictly weaker than the eigenvalue-decay assumptions used in prior work, and a direct comparison (e.g., to the conditions of [prior reference]) would strengthen the claim of novelty.

Authors: The uniform bound on inner products is strictly weaker than the eigenvalue-decay conditions of prior work because it permits slower or non-decaying spectra provided the features remain bounded; we will insert a short comparison paragraph (with a table of assumptions) in the bounded orthonormal systems section that directly contrasts our condition with the referenced prior work. revision: yes

Circularity Check

0 steps flagged

No circularity: new framework derives SVP conditions from explicit spectral assumptions

full rationale

The paper develops a fresh analysis framework to prove support vector proliferation (SVP) equivalences between minimum-norm interpolants and SVMs inside arbitrary RKHS, conditioned on verifiable spectral or concentration properties of bounded orthonormal systems and sub-Gaussian features. These properties are stated upfront as the gate for the result rather than being fitted, renamed, or smuggled via self-citation. No derivation step reduces a claimed prediction or equivalence to an input by construction, and the central claims rest on new conditions that extend (rather than presuppose) prior work. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, ad-hoc axioms, or invented entities are named. The framework presumably relies on standard RKHS properties and concentration inequalities for the stated feature classes.

axioms (1)

standard math Standard properties of reproducing kernel Hilbert spaces and bounded orthonormal systems or sub-Gaussian concentration hold for the chosen features.
Invoked implicitly when stating conditions for SVP in arbitrary RKHS.

pith-pipeline@v0.9.0 · 5734 in / 1278 out tokens · 48222 ms · 2026-05-24T08:47:04.451219+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present conditions for SVP for features in the families of general bounded orthonormal systems (e.g. Fourier features) and independent sub-Gaussian features... S = nTG(¯α IG + nTG)−1
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the leading kernel eigenfunctions comprise a bounded orthonormal system (BOS)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

[1]

Support ve ctor machines and linear regression coincide with very high-dimensional features

Navid Ardeshir, Clayton Sanford, and Daniel J Hsu. “Support ve ctor machines and linear regression coincide with very high-dimensional features”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 4907–4918. 20

work page 2021
[2]

Fast learning rates for plug-in classiﬁers

Jean-Yves Audibert and Alexandre B. Tsybakov. “Fast learning rates for plug-in classiﬁers”. In: Ann. Stat. 35.2 (2007), pp. 608–633. doi: 10.1214/009053606000001217

work page doi:10.1214/009053606000001217 2007
[3]

Random embeddings with an almost Gaussian distortion

Daniel Bartl and Shahar Mendelson. “Random embeddings with an almost Gaussian distortion”. In: Adv. Math. 400 (2022). doi: 10.1016/j.aim.2022.108261

work page doi:10.1016/j.aim.2022.108261 2022
[4]

Generalization perfor mance of support vector machines and other pattern classiﬁers

Peter Bartlett and John Shawe-Taylor. “Generalization perfor mance of support vector machines and other pattern classiﬁers”. In: Advances in Kernel methods—support vector learning (1999), pp. 43–54

work page 1999
[5]

Benign overﬁtting in linear regression

Peter Bartlett et al. “Benign overﬁtting in linear regression”. In : Proc. Natl. Acad. Sci. U.S.A. 117.48 (2020), pp. 30063–30070. doi: 10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020
[6]

Fit without fear: remarkable mathematical pheno mena of deep learning through the prism of interpolation

Mikhail Belkin. “Fit without fear: remarkable mathematical pheno mena of deep learning through the prism of interpolation”. In: Acta Numer. 30 (2021), pp. 203–248. doi: 10.1017/s0962492921000039

work page doi:10.1017/s0962492921000039 2021
[7]

Two models of double descen t for weak features

Mikhail Belkin, Daniel Hsu, and Ji Xu. “Two models of double descen t for weak features”. In: SIAM J. Math. Data Sci. 2.4 (2020), pp. 1167–1180. doi: 10.1137/20M1336072

work page doi:10.1137/20m1336072 2020
[8]

To Understand Deep Learning We Need to Un- derstand Kernel Learning

Mikhail Belkin, Siyuan Ma, and Soumik Mandal. “To Understand Deep Learning We Need to Un- derstand Kernel Learning”. In: Proc. Int. Conf. Mach. Learn. (ICML) . Vol. 35. Stockholm, Sweden, 2018

work page 2018
[9]

Robust learning and generaliz ation with support vector ma- chines

Arnaud Buhot and Mirta B Gordon. “Robust learning and generaliz ation with support vector ma- chines”. In: Journal of Physics A: Mathematical and General 34.21 (2001), p. 4377

work page 2001
[10]

Risk Bounds for Ove r-parameterized Maximum Margin Classiﬁcation on Sub-Gaussian Mixtures

Yuan Cao, Quanquan Gu, and Mikhail Belkin. “Risk Bounds for Ove r-parameterized Maximum Margin Classiﬁcation on Sub-Gaussian Mixtures”. In: Proc. Conf. Neural Inf. Process. Syst. (NeurIPS) . Virtual conference, Dec. 2021. url: https://proceedings.neurips.cc/paper/2021/hash/46e0eae7d5217c79c3ef6b4c212b8c6f-Abstract.html

work page 2021
[11]

Finite-sample analysis of interpolating linear classiﬁers in the overparameterized regime

Niladri S. Chatterji and Philip M. Long. “Finite-sample analysis of interpolating linear classiﬁers in the overparameterized regime”. In: J. Mach. Learn. Res. 22 (2021)

work page 2021
[12]

Support-vector network s

Corinna Cortes and Vladimir Vapnik. “Support-vector network s”. In: Machine learning 20 (1995), pp. 273–297

work page 1995
[13]

A Fa rewell to the Bias-Variance Tradeoﬀ? An Overview of the Theory of Overparameterized Machine Learning

Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. “A Fa rewell to the Bias-Variance Tradeoﬀ? An Overview of the Theory of Overparameterized Machine Learning ”. In: (2021). arXiv: 2109.02355 [stat.ML]

work page arXiv 2021
[14]

Statist ical mechanics of support vector networks

Rainer Dietrich, Manfred Opper, and Haim Sompolinsky. “Statist ical mechanics of support vector networks”. In: Physical review letters 82.14 (1999), p. 2975

work page 1999
[15]

A PAC-Bayes sample-compression appro ach to kernel methods

Pascal Germain et al. “A PAC-Bayes sample-compression appro ach to kernel methods”. In: ICML. 2011

work page 2011
[16]

Cross-entr opy vs. squared error training: a theoret- ical and experimental comparison

Pavel Golik, Patrick Doetsch, and Hermann Ney. “Cross-entr opy vs. squared error training: a theoret- ical and experimental comparison.” In: Interspeech. Vol. 13. 2013, pp. 1756–1760

work page 2013
[17]

PAC-B ayesian compression bounds on the prediction error of learning algorithms for classiﬁcation

Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. “PAC-B ayesian compression bounds on the prediction error of learning algorithms for classiﬁcation”. In: Machine Learning 59.1 (2005), pp. 55–76

work page 2005
[18]

Implicit bias of gradient descent on linea r convolutional networks

Suriya Gunasekar et al. “Implicit bias of gradient descent on linea r convolutional networks”. In: Ad- vances in neural information processing systems 31 (2018)

work page 2018
[19]

Surprises in High-Dimensional Ridgeless Lea st Squares Interpolation

Trevor Hastie et al. “Surprises in High-Dimensional Ridgeless Lea st Squares Interpolation”. In: Ann. Stat. 50.2 (2022). doi: 10.1214/21-AOS2133

work page doi:10.1214/21-aos2133 2022
[20]

Random Design A nalysis of Ridge Regression

Daniel Hsu, Sham M. Kakade, and Tong Zhang. “Random Design A nalysis of Ridge Regression”. In: Found. Comput. Math. 14 (2014), pp. 569–600. doi: 10.1007/s10208-014-9192-1

work page doi:10.1007/s10208-014-9192-1 2014
[21]

On the proliferation of support vectors in high di- mensions

Daniel Hsu, Vidya Muthukumar, and Ji Xu. “On the proliferation of support vectors in high di- mensions”. In: Proc. Int. Conf. Artif. Intell. Statist. (AISTATS) . Virtual conference, 2021. url: http://proceedings.mlr.press/v130/hsu21a.html

work page 2021
[22]

Evaluation of Neural Architectures Trained with Square Loss vs Cross- Entropy in Classiﬁcation Tasks

Like Hui and Mikhail Belkin. “Evaluation of Neural Architectures Trained with Square Loss vs Cross- Entropy in Classiﬁcation Tasks”. In: Proc. Int. Conf. Learn. Representations (ICLR) . Virtual confer- ence, 2021. url: https://openreview.net/forum?id=hsFN92eQEla

work page 2021
[23]

Directional convergence and a lignment in deep learning

Ziwei Ji and Matus Telgarsky. “Directional convergence and a lignment in deep learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 17176–17186. 21

work page 2020
[24]

The implicit bias of gradient desce nt on nonseparable data

Ziwei Ji and Matus Telgarsky. “The implicit bias of gradient desce nt on nonseparable data”. In: Conference on Learning Theory . PMLR. 2019, pp. 1772–1798

work page 2019
[25]

Gradient descent follows the regularization path for general losses

Ziwei Ji et al. “Gradient descent follows the regularization path for general losses”. In: Conference on Learning Theory. PMLR. 2020, pp. 2109–2136

work page 2020
[26]

Revisiting squared-error a nd cross-entropy functions for training neural network classiﬁers

Douglas M Kline and Victor L Berardi. “Revisiting squared-error a nd cross-entropy functions for training neural network classiﬁers”. In: Neural Computing & Applications 14 (2005), pp. 310–318

work page 2005
[27]

General Loss Functions Lead to (Approximate) Interpolation in High Dimensions

Kuo-Wei Lai and Vidya Muthukumar. “General Loss Functions Lead to (Approximate) Interpolation in High Dimensions”. In: arXiv preprint arXiv:2303.07475 (2023)

work page arXiv 2023
[28]

Just interpolate: Kern el “Ridgeless“ regression can general- ize

Tengyuan Liang and Alexander Rakhlin. “Just interpolate: Kern el “Ridgeless“ regression can general- ize”. In: Ann. Stat. 48.3 (2020), pp. 1329–1347. doi: 10.1214/19-aos1849

work page doi:10.1214/19-aos1849 2020
[29]

A statistical physics ap proach for the analysis of machine learning algorithms on real data

D¨ orthe Malzahn and Manfred Opper. “A statistical physics ap proach for the analysis of machine learning algorithms on real data”. In: Journal of Statistical Mechanics: Theory and Experiment 2005.11 (2005), P11001

work page 2005
[30]

Simpliﬁed PAC-Bayesian margin bounds

David McAllester. “Simpliﬁed PAC-Bayesian margin bounds”. In: Learning theory and Kernel ma- chines. Springer, 2003, pp. 203–215

work page 2003
[31]

Harmless interpolation in regression and classiﬁcation with structured fea- tures

Andrew D. McRae et al. “Harmless interpolation in regression and classiﬁcation with structured fea- tures”. In: Proc. Int. Conf. Artif. Intell. Statist. (AISTATS) . Virtual conference, 2022. url: https://proceedings.mlr.press/v151/mcrae22a.html

work page 2022
[32]

Genera lization error of random feature and kernel methods: Hypercontractivity and kernel matrix concent ration

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. “Genera lization error of random feature and kernel methods: Hypercontractivity and kernel matrix concent ration”. In: Appl. Comput. Harmon. Anal. 59 (2022), pp. 3–84. doi: 10.1016/j.acha.2021.12.003

work page doi:10.1016/j.acha.2021.12.003 2022
[33]

Classiﬁcation vs. regression in overpa rameterized regimes: Does the loss function matter?

Vidya Muthukumar et al. “Classiﬁcation vs. regression in overpa rameterized regimes: Does the loss function matter?” In: J. Mach. Learn. Res. (2021). arXiv: 2005.08054. Forthcoming

work page arXiv 2021
[34]

Harmless Interpolation of Noisy Data in Regression

Vidya Muthukumar et al. “Harmless Interpolation of Noisy Data in Regression”. In: IEEE J. Sel. Areas Inf. Theory 1.1 (2020), pp. 67–83. doi: 10.1109/jsait.2020.2984716

work page doi:10.1109/jsait.2020.2984716 2020
[35]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. “In s earch of the real inductive bias: On the role of implicit regularization in deep learning”. In: arXiv preprint arXiv:1412.6614 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[36]

Everything old is new again: a fresh look at h istorical approaches in machine learning

Ryan Michael Rifkin. “Everything old is new again: a fresh look at h istorical approaches in machine learning”. PhD thesis. Massachusetts Institute of Technology, 2 002

work page
[37]

The implicit bias of gradient descent on sepa rable data

Daniel Soudry et al. “The implicit bias of gradient descent on sepa rable data”. In: The Journal of Machine Learning Research 19.1 (2018), pp. 2822–2878

work page 2018
[38]

Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs

Ingo Steinwart and Clint Scovel. “Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs”. In: Constr. Approx. 35 (2012), pp. 363–417. doi: 10.1007/s00365-012-9153-3

work page doi:10.1007/s00365-012-9153-3 2012
[39]

Benign overﬁtting in rid ge regression

Alexander Tsigler and Peter L Bartlett. “Benign overﬁtting in rid ge regression”. In: (2020). arXiv: 2009.14286 [math.ST]

work page arXiv 2020
[40]

High-Dimensional Probability

Roman Vershynin. High-Dimensional Probability. An Introduction with Appli cations in Data Science . Cambridge, 2018. 296 pp. isbn: 1108415199

work page 2018
[41]

Benig n overﬁtting in multiclass classi- ﬁcation: All roads lead to interpolation

Ke Wang, Vidya Muthukumar, and Christos Thrampoulidis. “Benig n overﬁtting in multiclass classi- ﬁcation: All roads lead to interpolation”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 24164–24179

work page 2021
[42]

Benign Overﬁtting in Binary Classiﬁcation of Gaussian Mix- tures

Ke Wang and Christos Thrampoulidis. “Benign Overﬁtting in Binary Classiﬁcation of Gaussian Mix- tures”. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Proce ssing (ICASSP) . Toronto, Canada, 2021. doi: 10.1109/icassp39728.2021.9413946

work page doi:10.1109/icassp39728.2021.9413946 2021
[43]

Understanding deep learning requires ret hinking generalization

Chiyuan Zhang et al. “Understanding deep learning requires ret hinking generalization”. In: Proc. Int. Conf. Learn. Representations (ICLR). Toulon, France, 2017. url: https://openreview.net/pdf?id=Sy8gdB9xx

work page 2017
[44]

Learning Bounds for Kernel Regression Using Eﬀ ective Data Dimensionality

Tong Zhang. “Learning Bounds for Kernel Regression Using Eﬀ ective Data Dimensionality”. In: Neural Comput. 17 (2005), pp. 2077–2098. doi: 10.1162/0899766054323008. 22

work page doi:10.1162/0899766054323008 2005

[1] [1]

Support ve ctor machines and linear regression coincide with very high-dimensional features

Navid Ardeshir, Clayton Sanford, and Daniel J Hsu. “Support ve ctor machines and linear regression coincide with very high-dimensional features”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 4907–4918. 20

work page 2021

[2] [2]

Fast learning rates for plug-in classiﬁers

Jean-Yves Audibert and Alexandre B. Tsybakov. “Fast learning rates for plug-in classiﬁers”. In: Ann. Stat. 35.2 (2007), pp. 608–633. doi: 10.1214/009053606000001217

work page doi:10.1214/009053606000001217 2007

[3] [3]

Random embeddings with an almost Gaussian distortion

Daniel Bartl and Shahar Mendelson. “Random embeddings with an almost Gaussian distortion”. In: Adv. Math. 400 (2022). doi: 10.1016/j.aim.2022.108261

work page doi:10.1016/j.aim.2022.108261 2022

[4] [4]

Generalization perfor mance of support vector machines and other pattern classiﬁers

Peter Bartlett and John Shawe-Taylor. “Generalization perfor mance of support vector machines and other pattern classiﬁers”. In: Advances in Kernel methods—support vector learning (1999), pp. 43–54

work page 1999

[5] [5]

Benign overﬁtting in linear regression

Peter Bartlett et al. “Benign overﬁtting in linear regression”. In : Proc. Natl. Acad. Sci. U.S.A. 117.48 (2020), pp. 30063–30070. doi: 10.1073/pnas.1907378117

work page doi:10.1073/pnas.1907378117 2020

[6] [6]

Fit without fear: remarkable mathematical pheno mena of deep learning through the prism of interpolation

Mikhail Belkin. “Fit without fear: remarkable mathematical pheno mena of deep learning through the prism of interpolation”. In: Acta Numer. 30 (2021), pp. 203–248. doi: 10.1017/s0962492921000039

work page doi:10.1017/s0962492921000039 2021

[7] [7]

Two models of double descen t for weak features

Mikhail Belkin, Daniel Hsu, and Ji Xu. “Two models of double descen t for weak features”. In: SIAM J. Math. Data Sci. 2.4 (2020), pp. 1167–1180. doi: 10.1137/20M1336072

work page doi:10.1137/20m1336072 2020

[8] [8]

To Understand Deep Learning We Need to Un- derstand Kernel Learning

Mikhail Belkin, Siyuan Ma, and Soumik Mandal. “To Understand Deep Learning We Need to Un- derstand Kernel Learning”. In: Proc. Int. Conf. Mach. Learn. (ICML) . Vol. 35. Stockholm, Sweden, 2018

work page 2018

[9] [9]

Robust learning and generaliz ation with support vector ma- chines

Arnaud Buhot and Mirta B Gordon. “Robust learning and generaliz ation with support vector ma- chines”. In: Journal of Physics A: Mathematical and General 34.21 (2001), p. 4377

work page 2001

[10] [10]

Risk Bounds for Ove r-parameterized Maximum Margin Classiﬁcation on Sub-Gaussian Mixtures

Yuan Cao, Quanquan Gu, and Mikhail Belkin. “Risk Bounds for Ove r-parameterized Maximum Margin Classiﬁcation on Sub-Gaussian Mixtures”. In: Proc. Conf. Neural Inf. Process. Syst. (NeurIPS) . Virtual conference, Dec. 2021. url: https://proceedings.neurips.cc/paper/2021/hash/46e0eae7d5217c79c3ef6b4c212b8c6f-Abstract.html

work page 2021

[11] [11]

Finite-sample analysis of interpolating linear classiﬁers in the overparameterized regime

Niladri S. Chatterji and Philip M. Long. “Finite-sample analysis of interpolating linear classiﬁers in the overparameterized regime”. In: J. Mach. Learn. Res. 22 (2021)

work page 2021

[12] [12]

Support-vector network s

Corinna Cortes and Vladimir Vapnik. “Support-vector network s”. In: Machine learning 20 (1995), pp. 273–297

work page 1995

[13] [13]

A Fa rewell to the Bias-Variance Tradeoﬀ? An Overview of the Theory of Overparameterized Machine Learning

Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. “A Fa rewell to the Bias-Variance Tradeoﬀ? An Overview of the Theory of Overparameterized Machine Learning ”. In: (2021). arXiv: 2109.02355 [stat.ML]

work page arXiv 2021

[14] [14]

Statist ical mechanics of support vector networks

Rainer Dietrich, Manfred Opper, and Haim Sompolinsky. “Statist ical mechanics of support vector networks”. In: Physical review letters 82.14 (1999), p. 2975

work page 1999

[15] [15]

A PAC-Bayes sample-compression appro ach to kernel methods

Pascal Germain et al. “A PAC-Bayes sample-compression appro ach to kernel methods”. In: ICML. 2011

work page 2011

[16] [16]

Cross-entr opy vs. squared error training: a theoret- ical and experimental comparison

Pavel Golik, Patrick Doetsch, and Hermann Ney. “Cross-entr opy vs. squared error training: a theoret- ical and experimental comparison.” In: Interspeech. Vol. 13. 2013, pp. 1756–1760

work page 2013

[17] [17]

PAC-B ayesian compression bounds on the prediction error of learning algorithms for classiﬁcation

Thore Graepel, Ralf Herbrich, and John Shawe-Taylor. “PAC-B ayesian compression bounds on the prediction error of learning algorithms for classiﬁcation”. In: Machine Learning 59.1 (2005), pp. 55–76

work page 2005

[18] [18]

Implicit bias of gradient descent on linea r convolutional networks

Suriya Gunasekar et al. “Implicit bias of gradient descent on linea r convolutional networks”. In: Ad- vances in neural information processing systems 31 (2018)

work page 2018

[19] [19]

Surprises in High-Dimensional Ridgeless Lea st Squares Interpolation

Trevor Hastie et al. “Surprises in High-Dimensional Ridgeless Lea st Squares Interpolation”. In: Ann. Stat. 50.2 (2022). doi: 10.1214/21-AOS2133

work page doi:10.1214/21-aos2133 2022

[20] [20]

Random Design A nalysis of Ridge Regression

Daniel Hsu, Sham M. Kakade, and Tong Zhang. “Random Design A nalysis of Ridge Regression”. In: Found. Comput. Math. 14 (2014), pp. 569–600. doi: 10.1007/s10208-014-9192-1

work page doi:10.1007/s10208-014-9192-1 2014

[21] [21]

On the proliferation of support vectors in high di- mensions

Daniel Hsu, Vidya Muthukumar, and Ji Xu. “On the proliferation of support vectors in high di- mensions”. In: Proc. Int. Conf. Artif. Intell. Statist. (AISTATS) . Virtual conference, 2021. url: http://proceedings.mlr.press/v130/hsu21a.html

work page 2021

[22] [22]

Evaluation of Neural Architectures Trained with Square Loss vs Cross- Entropy in Classiﬁcation Tasks

Like Hui and Mikhail Belkin. “Evaluation of Neural Architectures Trained with Square Loss vs Cross- Entropy in Classiﬁcation Tasks”. In: Proc. Int. Conf. Learn. Representations (ICLR) . Virtual confer- ence, 2021. url: https://openreview.net/forum?id=hsFN92eQEla

work page 2021

[23] [23]

Directional convergence and a lignment in deep learning

Ziwei Ji and Matus Telgarsky. “Directional convergence and a lignment in deep learning”. In: Advances in Neural Information Processing Systems 33 (2020), pp. 17176–17186. 21

work page 2020

[24] [24]

The implicit bias of gradient desce nt on nonseparable data

Ziwei Ji and Matus Telgarsky. “The implicit bias of gradient desce nt on nonseparable data”. In: Conference on Learning Theory . PMLR. 2019, pp. 1772–1798

work page 2019

[25] [25]

Gradient descent follows the regularization path for general losses

Ziwei Ji et al. “Gradient descent follows the regularization path for general losses”. In: Conference on Learning Theory. PMLR. 2020, pp. 2109–2136

work page 2020

[26] [26]

Revisiting squared-error a nd cross-entropy functions for training neural network classiﬁers

Douglas M Kline and Victor L Berardi. “Revisiting squared-error a nd cross-entropy functions for training neural network classiﬁers”. In: Neural Computing & Applications 14 (2005), pp. 310–318

work page 2005

[27] [27]

General Loss Functions Lead to (Approximate) Interpolation in High Dimensions

Kuo-Wei Lai and Vidya Muthukumar. “General Loss Functions Lead to (Approximate) Interpolation in High Dimensions”. In: arXiv preprint arXiv:2303.07475 (2023)

work page arXiv 2023

[28] [28]

Just interpolate: Kern el “Ridgeless“ regression can general- ize

Tengyuan Liang and Alexander Rakhlin. “Just interpolate: Kern el “Ridgeless“ regression can general- ize”. In: Ann. Stat. 48.3 (2020), pp. 1329–1347. doi: 10.1214/19-aos1849

work page doi:10.1214/19-aos1849 2020

[29] [29]

A statistical physics ap proach for the analysis of machine learning algorithms on real data

D¨ orthe Malzahn and Manfred Opper. “A statistical physics ap proach for the analysis of machine learning algorithms on real data”. In: Journal of Statistical Mechanics: Theory and Experiment 2005.11 (2005), P11001

work page 2005

[30] [30]

Simpliﬁed PAC-Bayesian margin bounds

David McAllester. “Simpliﬁed PAC-Bayesian margin bounds”. In: Learning theory and Kernel ma- chines. Springer, 2003, pp. 203–215

work page 2003

[31] [31]

Harmless interpolation in regression and classiﬁcation with structured fea- tures

Andrew D. McRae et al. “Harmless interpolation in regression and classiﬁcation with structured fea- tures”. In: Proc. Int. Conf. Artif. Intell. Statist. (AISTATS) . Virtual conference, 2022. url: https://proceedings.mlr.press/v151/mcrae22a.html

work page 2022

[32] [32]

Genera lization error of random feature and kernel methods: Hypercontractivity and kernel matrix concent ration

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. “Genera lization error of random feature and kernel methods: Hypercontractivity and kernel matrix concent ration”. In: Appl. Comput. Harmon. Anal. 59 (2022), pp. 3–84. doi: 10.1016/j.acha.2021.12.003

work page doi:10.1016/j.acha.2021.12.003 2022

[33] [33]

Classiﬁcation vs. regression in overpa rameterized regimes: Does the loss function matter?

Vidya Muthukumar et al. “Classiﬁcation vs. regression in overpa rameterized regimes: Does the loss function matter?” In: J. Mach. Learn. Res. (2021). arXiv: 2005.08054. Forthcoming

work page arXiv 2021

[34] [34]

Harmless Interpolation of Noisy Data in Regression

Vidya Muthukumar et al. “Harmless Interpolation of Noisy Data in Regression”. In: IEEE J. Sel. Areas Inf. Theory 1.1 (2020), pp. 67–83. doi: 10.1109/jsait.2020.2984716

work page doi:10.1109/jsait.2020.2984716 2020

[35] [35]

In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. “In s earch of the real inductive bias: On the role of implicit regularization in deep learning”. In: arXiv preprint arXiv:1412.6614 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[36] [36]

Everything old is new again: a fresh look at h istorical approaches in machine learning

Ryan Michael Rifkin. “Everything old is new again: a fresh look at h istorical approaches in machine learning”. PhD thesis. Massachusetts Institute of Technology, 2 002

work page

[37] [37]

The implicit bias of gradient descent on sepa rable data

Daniel Soudry et al. “The implicit bias of gradient descent on sepa rable data”. In: The Journal of Machine Learning Research 19.1 (2018), pp. 2822–2878

work page 2018

[38] [38]

Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs

Ingo Steinwart and Clint Scovel. “Mercer’s Theorem on General Domains: On the Interaction between Measures, Kernels, and RKHSs”. In: Constr. Approx. 35 (2012), pp. 363–417. doi: 10.1007/s00365-012-9153-3

work page doi:10.1007/s00365-012-9153-3 2012

[39] [39]

Benign overﬁtting in rid ge regression

Alexander Tsigler and Peter L Bartlett. “Benign overﬁtting in rid ge regression”. In: (2020). arXiv: 2009.14286 [math.ST]

work page arXiv 2020

[40] [40]

High-Dimensional Probability

Roman Vershynin. High-Dimensional Probability. An Introduction with Appli cations in Data Science . Cambridge, 2018. 296 pp. isbn: 1108415199

work page 2018

[41] [41]

Benig n overﬁtting in multiclass classi- ﬁcation: All roads lead to interpolation

Ke Wang, Vidya Muthukumar, and Christos Thrampoulidis. “Benig n overﬁtting in multiclass classi- ﬁcation: All roads lead to interpolation”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 24164–24179

work page 2021

[42] [42]

Benign Overﬁtting in Binary Classiﬁcation of Gaussian Mix- tures

Ke Wang and Christos Thrampoulidis. “Benign Overﬁtting in Binary Classiﬁcation of Gaussian Mix- tures”. In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Proce ssing (ICASSP) . Toronto, Canada, 2021. doi: 10.1109/icassp39728.2021.9413946

work page doi:10.1109/icassp39728.2021.9413946 2021

[43] [43]

Understanding deep learning requires ret hinking generalization

Chiyuan Zhang et al. “Understanding deep learning requires ret hinking generalization”. In: Proc. Int. Conf. Learn. Representations (ICLR). Toulon, France, 2017. url: https://openreview.net/pdf?id=Sy8gdB9xx

work page 2017

[44] [44]

Learning Bounds for Kernel Regression Using Eﬀ ective Data Dimensionality

Tong Zhang. “Learning Bounds for Kernel Regression Using Eﬀ ective Data Dimensionality”. In: Neural Comput. 17 (2005), pp. 2077–2098. doi: 10.1162/0899766054323008. 22

work page doi:10.1162/0899766054323008 2005