
arxiv: 2605.17718 · v1 · pith:JC54ULOCnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

How does feature learning reshape the function space?

Pith reviewed 2026-05-19 22:22 UTC · model grok-4.3

classification stat.ML · cs.LG
keywords feature learning · function space · gradient descent · high-dimensional regime · spiked covariance · adaptive kernel · two-layer networks · spectral structure

The pith

In high dimensions, one large gradient step on a two-layer network produces features whose distribution approximates a target-dependent spiked Gaussian covariance, inducing a data-adaptive kernel that reshapes the function space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that feature learning during gradient descent acts as a specific change in the distribution of network features rather than a simple rescaling of a fixed kernel. In the proportional high-dimensional limit, a sufficiently large gradient step makes the post-update feature distribution look like a Gaussian whose covariance carries an extra spike aligned with the target function. This change creates an adaptive kernel whose eigenstructure favors directions that match the signal in the data. A reader should care because it supplies a concrete function-space account of why a neural network, after training, can represent different functions than a static kernel method.

Core claim

We prove that, in the high-dimensional proportional regime, after a large gradient step the post-update feature distribution is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space and modifies its spectral structure. Feature learning can be viewed as a distributional transformation in parameter space or input space, or equivalently as the introduction of a target-dependent kernel. In particular, the update selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling the top radial mode with a target-aligned quadratic harmonic. The overall effect is a data-adaptive deformation that preferentially enhances directions aligned with the signal in the data.

What carries the argument

Target-dependent spiked Gaussian covariance that approximates the post-update feature distribution and thereby induces the data-adaptive kernel.
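
As a hedged illustration, a rank-one spiked Gaussian law and the kernel it induces can be written as follows. The symbols w (feature vector), v (unit target direction), θ (spike strength) and σ (activation) are notation assumed here for exposition; the paper's exact parametrization is not reproduced on this page.

```latex
w \sim \mathcal{N}(0, \Sigma_\theta),
\qquad
\Sigma_\theta = I_d + \theta\, v v^\top,
\quad \|v\|_2 = 1,\ \theta \ge 0,
\qquad
K_\theta(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, \Sigma_\theta)}
  \big[\sigma(\langle w, x\rangle)\,\sigma(\langle w, x'\rangle)\big].
```

In this form, Σ_θ has eigenvalue 1 + θ along v and eigenvalue 1 on the orthogonal complement, and θ = 0 recovers the isotropic (target-independent) kernel.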

If this is right

  • The induced kernel selectively amplifies eigenvalues aligned with the target direction.
  • Leading eigenfunctions mix, coupling the top radial mode with a target-aligned quadratic harmonic.
  • Feature learning acts as a distributional transformation in parameter space or input space.
  • Early training deforms the function space to preferentially enhance directions aligned with the signal rather than rescaling a fixed kernel.
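
The first and last points above can be illustrated numerically. The sketch below is not the paper's construction: it assumes a rank-one spike I + θ v vᵀ on the weight covariance, a tanh activation, a linear target along v, and a Monte Carlo estimate of the induced kernel. The function name, sizes, and θ are all hypothetical choices. It compares how strongly the isotropic and the spiked kernel respond to the target vector, measured by the Rayleigh quotient yᵀKy / yᵀy.

```python
import numpy as np

def target_rayleigh_quotients(n=300, d=20, n_features=4000, theta=4.0, seed=0):
    """Return (isotropic, spiked) Rayleigh quotients y^T K y / y^T y.

    Illustrative assumptions: weights w ~ N(0, I + theta v v^T),
    tanh features, and a linear target y = <x, v>.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                        # unit target direction
    X = rng.standard_normal((n, d)) / np.sqrt(d)  # inputs with norm about 1
    y = X @ v                                     # target values on the sample

    G = rng.standard_normal((n_features, d))      # isotropic weights
    # Rows of W have covariance I + theta v v^T, because
    # (I + c v v^T)^2 = I + theta v v^T when c = sqrt(1 + theta) - 1.
    c = np.sqrt(1.0 + theta) - 1.0
    W = G + c * np.outer(G @ v, v)

    def quotient(weights):
        # Monte Carlo feature map, so that K = Phi Phi^T estimates the kernel.
        Phi = np.tanh(X @ weights.T) / np.sqrt(n_features)
        K = Phi @ Phi.T
        return float(y @ K @ y / (y @ y))

    return quotient(G), quotient(W)

q_iso, q_spiked = target_rayleigh_quotients()
print(f"isotropic: {q_iso:.2f}  spiked: {q_spiked:.2f}")
```

Both kernels are built from the same Gaussian draws, so the gap between the two quotients reflects the spike rather than sampling noise. Setting theta=0.0 makes the two values coincide.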

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Repeated gradient steps could compound the distributional shift, producing successively more adapted kernels at later training stages.
  • The same spiked approximation may appear in deeper networks if each layer experiences an analogous large update.
  • Low-dimensional regimes or small step sizes offer a direct test of where the reshaping mechanism breaks down.
  • This view connects to analyses of kernel evolution under gradient flow by supplying an explicit distributional mechanism for the adaptation.

Load-bearing premise

The analysis requires the high-dimensional proportional regime together with a large enough gradient step size so that the updated features can be approximated by the spiked Gaussian form.

What would settle it

A numerical computation of the empirical covariance of the hidden features after one large gradient step, at finite n and d with n/d held fixed, checking whether the observed matrix deviates from the predicted target-dependent spike.
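
A minimal sketch of such a check is given below, under assumptions that are not taken from the paper: Gaussian inputs, a linear single-index target y = ⟨x, v⟩, a tanh two-layer network with second-layer weights ±1/p, squared loss, and a step size of eta0 · p. The function name and all default sizes are hypothetical. It takes one full-batch gradient step on the first layer, forms the empirical second-moment matrix of the updated weight rows, and reports its two largest eigenvalues together with the overlap between the top eigenvector and v.

```python
import numpy as np

def one_step_weight_spectrum(n=2000, d=50, p=200, eta0=2.0, seed=0):
    """One gradient step on the first layer, then the spectrum of W1^T W1 / p.

    Returns (top eigenvalue, second eigenvalue, |<top eigenvector, v>|).
    All modelling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                         # hidden target direction
    X = rng.standard_normal((n, d))                # Gaussian inputs
    y = X @ v                                      # assumed linear target

    W0 = rng.standard_normal((p, d)) / np.sqrt(d)  # first-layer initialization
    a = rng.choice([-1.0, 1.0], size=p) / p        # fixed second layer

    pre = X @ W0.T                                 # pre-activations, shape (n, p)
    hidden = np.tanh(pre)
    resid = hidden @ a - y                         # f(x_i) - y_i
    sech2 = 1.0 - hidden ** 2                      # derivative of tanh
    # Gradient of (1 / 2n) * sum_i resid_i^2 with respect to W, shape (p, d).
    grad = ((resid[:, None] * sech2) * a[None, :]).T @ X / n

    W1 = W0 - eta0 * p * grad                      # one large step
    M = W1.T @ W1 / p                              # empirical second moment
    evals, evecs = np.linalg.eigh(M)               # eigenvalues in ascending order
    overlap = abs(float(evecs[:, -1] @ v))
    return float(evals[-1]), float(evals[-2]), overlap

top, second, overlap = one_step_weight_spectrum()
print(f"top={top:.3f}  second={second:.3f}  overlap={overlap:.3f}")
```

Rerunning with a much smaller eta0 (for example 0.01) gives a point of comparison for the small-step regime, where no outlier eigenvalue aligned with v is expected under these assumptions.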

Figures

Figures reproduced from arXiv: 2605.17718 by Bruno Loureiro, Fanghui Liu, João Lobo, Long Tran-Than.

Figure 1: (a) Alignment with the directional feature
Abstract

Feature learning is widely regarded as the key mechanism distinguishing neural networks from fixed-kernel methods, yet its impact on the induced function space remains poorly understood. In this work, we precisely characterize how the function space spanned by the features of a two-layer neural network evolves during gradient descent training. We prove that, in the high-dimensional proportional regime, after a large gradient step the post-update feature distribution is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space and modifies its spectral structure. Our analysis reveals that feature learning can be interpreted as a distributional transformation in either parameter space or input space, equivalently as the introduction of a target-dependent kernel. In particular, it selectively amplifies eigenvalues aligned with the target direction and mixes leading eigenfunctions, coupling the top radial mode with a target-aligned quadratic harmonic. Overall, our results provide a precise function-space perspective on early-stage feature learning: rather than just rescaling a fixed kernel, gradient descent induces a data-adaptive deformation that preferentially enhances directions aligned with the signal in the data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in the high-dimensional proportional regime (n, d → ∞ with n/d fixed), a single large gradient step on a two-layer neural network produces a post-update feature distribution that is well approximated by a target-dependent spiked Gaussian covariance. This induces a data-adaptive kernel that reshapes the function space, selectively amplifying eigenvalues aligned with the target direction and mixing leading eigenfunctions (e.g., coupling the top radial mode with a target-aligned quadratic harmonic). Feature learning is interpreted equivalently as a distributional transformation in parameter space or input space.

Significance. If the central approximation holds with the stated precision, the work supplies a rigorous function-space view of early-stage feature learning that goes beyond fixed-kernel or NTK analyses by exhibiting an explicit data-adaptive deformation of the spectral structure. The result is potentially useful for understanding how gradient descent modifies the effective kernel during the initial phase of training and for designing adaptive kernels that capture target-aligned directions.

major comments (2)
  1. [§3.1, Theorem 1] The spiked-Gaussian approximation is asserted to hold after a 'sufficiently large' gradient step, yet the statement provides neither an explicit lower bound on the step size η nor quantitative error bounds (in total variation or Wasserstein distance) that depend on n, d, and η. Without these, it is impossible to verify the domain of validity of the claimed limit or to assess how the approximation degrades when the step-size condition is relaxed.
  2. [§4.2, Eq. (17)] The induced kernel is defined via the expectation over the spiked covariance; however, the derivation of the eigenvalue amplification and eigenfunction mixing (radial mode coupled to quadratic harmonic) appears to rest on an additional assumption that the target is exactly aligned with a single direction. The paper does not state whether this alignment is necessary or how the spectral reshaping generalizes to targets with multiple relevant directions.
minor comments (2)
  1. [§2] Notation for the proportional limit (n/d → γ) is introduced in §2 but used inconsistently in the statement of the main result; a single displayed definition would improve readability.
  2. [Figure 2] Figure 2 caption does not specify the value of the step size η used in the simulation, making it difficult to relate the plotted spectra to the theoretical regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which help clarify the scope and presentation of our results. We respond to each major comment below.

  1. Referee: [§3.1, Theorem 1] The spiked-Gaussian approximation is asserted to hold after a 'sufficiently large' gradient step, yet the statement provides neither an explicit lower bound on the step size η nor quantitative error bounds (in total variation or Wasserstein distance) that depend on n, d, and η. Without these, it is impossible to verify the domain of validity of the claimed limit or to assess how the approximation degrades when the step-size condition is relaxed.

    Authors: We agree that greater precision on the step-size condition would improve the statement. In the proof of Theorem 1 the requirement that η be sufficiently large arises from ensuring the target-dependent spike dominates the fluctuation terms in the high-dimensional limit; this translates to η exceeding a constant determined by the data variance and the Lipschitz constant of the activation. We will revise the theorem to state an explicit lower bound of this form. Our analysis is strictly asymptotic (n,d→∞ with n/d fixed), so we do not derive non-asymptotic total-variation or Wasserstein bounds; we will add a remark noting this limitation and the resulting domain of validity. revision: partial

  2. Referee: [§4.2, Eq. (17)] The induced kernel is defined via the expectation over the spiked covariance; however, the derivation of the eigenvalue amplification and eigenfunction mixing (radial mode coupled to quadratic harmonic) appears to rest on an additional assumption that the target is exactly aligned with a single direction. The paper does not state whether this alignment is necessary or how the spectral reshaping generalizes to targets with multiple relevant directions.

    Authors: The single-direction setting is adopted for expository clarity, as it already exhibits the essential phenomenon of target-dependent eigenvalue amplification and the specific radial-to-quadratic mixing. The underlying spiked-covariance construction extends immediately to a finite number of spikes aligned with a multi-dimensional target subspace; the induced kernel then amplifies the corresponding eigenspace and produces analogous mixing within that subspace. We will revise §4.2 to state the single-direction assumption explicitly and add a short paragraph describing the multi-spike generalization, confirming that the qualitative conclusions on data-adaptive reshaping remain unchanged. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation proceeds from explicit high-dimensional limit assumptions without reduction to fitted inputs or self-referential definitions


The paper derives the spiked Gaussian covariance approximation for post-update features from gradient descent dynamics under the stated proportional regime (n,d→∞, n/d fixed) and large step-size condition. This is presented as a proven limit result rather than an ansatz, fit, or self-definition. No equations reduce the target-dependent kernel or spectral reshaping to a tautology or to a parameter fitted from the same data being predicted. Self-citations, if present, are not load-bearing for the central claim, which rests on the regime-specific analysis rather than prior author work invoked as uniqueness. The result is not a renaming of a known pattern but a characterization of function-space evolution. The derivation chain is therefore self-contained against the explicit assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the high-dimensional proportional limit and the large-gradient-step approximation; these are standard modeling choices in the field but constitute the main unverified assumptions for the result.

axioms (2)
  • domain assumption: High-dimensional proportional regime, n, d → ∞ with n/d = γ fixed.
    Invoked to obtain the spiked Gaussian approximation after one gradient step.
  • domain assumption: Sufficiently large gradient step size.
    Required for the post-update feature distribution to concentrate around the target-dependent spike.


