Greg Yang

Greg Yang, “Tensor Programs II: Neural Tangent Kernel for Any Architecture,” 2020 · arXiv 2006.14548

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Canonical Regularisation of Wide Feature-Learning Neural Networks

stat.ML · 2026-05-18 · unverdicted · novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

GQA-μP: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

State-Space NTK Collapse Near Bifurcations

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.

Learning Rate Transfer in Normalized Transformers

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

Viability of perturbative expansion for quantum field theories on neurons

hep-th · 2025-08-05 · unverdicted · novelty 5.0

The work tests perturbative viability of single-layer neural networks for local QFTs at finite neuron number N in phi^4 theory, finding UV-cutoff-sensitive O(1/N) corrections with weak convergence and proposing a modification for better scaling.

The Neural Tangent Kernel for Classification

cs.LG · 2026-05-17 · unreviewed

citing papers explorer

Canonical Regularisation of Wide Feature-Learning Neural Networks
stat.ML · 2026-05-18 · unverdicted · none · ref 47
Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

GQA-μP: The maximal parameterization update for grouped query attention
cs.LG · 2026-05-14 · unverdicted · none · ref 18
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

The Global Empirical NTK: Self-Referential Bias and Dimensionality of Gradient Descent Learning
cs.LG · 2026-05-09 · unverdicted · none · ref 124
The global empirical NTK for finite-width networks has a universal Kronecker-core form that makes it structurally low-rank and biases gradient descent toward dominant modes of joint input-hidden activity.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
cs.LG · 2026-05-06 · unverdicted · none · ref 42
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

State-Space NTK Collapse Near Bifurcations
cs.LG · 2026-05-12 · unverdicted · none · ref 123
Bifurcations cause sNTK to reduce to a dominant rank-one channel matching normal forms, collapsing effective rank and funneling gradient descent into critical dynamical directions.

Learning Rate Transfer in Normalized Transformers
cs.LG · 2026-04-29 · unverdicted · none · ref 19
νGPT is a modified parameterization of normalized transformers that enables learning rate transfer across width, depth, and token horizon.

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
cs.LG · 2026-03-30 · unverdicted · none · ref 22
Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.

Viability of perturbative expansion for quantum field theories on neurons
hep-th · 2025-08-05 · unverdicted · none · ref 31
The work tests perturbative viability of single-layer neural networks for local QFTs at finite neuron number N in phi^4 theory, finding UV-cutoff-sensitive O(1/N) corrections with weak convergence and proposing a modification for better scaling.

The Neural Tangent Kernel for Classification
cs.LG · 2026-05-17 · unreviewed · ref 6
