Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
Greg Yang
9 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
The work tests perturbative viability of single-layer neural networks for local QFTs at finite neuron number N in phi^4 theory, finding UV-cutoff-sensitive O(1/N) corrections with weak convergence and proposing a modification for better scaling.
citing papers explorer
-
Canonical Regularisation of Wide Feature-Learning Neural Networks
Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
-
GQA-{\mu}P: The maximal parameterization update for grouped query attention
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
-
The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
-
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
-
State-Space NTK Collapse Near Bifurcations
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.
-
Learning Rate Transfer in Normalized Transformers
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
-
Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
-
Viability of perturbative expansion for quantum field theories on neurons
The work tests perturbative viability of single-layer neural networks for local QFTs at finite neuron number N in phi^4 theory, finding UV-cutoff-sensitive O(1/N) corrections with weak convergence and proposing a modification for better scaling.
- The Neural Tangent Kernel for Classification