Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.
Tensor programs ii: Neural tangent kernel for any architecture
9 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
The work tests perturbative viability of single-layer neural networks for local QFTs at finite neuron number N in phi^4 theory, finding UV-cutoff-sensitive O(1/N) corrections with weak convergence and proposing a modification for better scaling.