The Annals of Statistics
5 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Limitations of Lazy Training of Two-layers Neural Networks
  For quadratic targets in d dimensions, two-layer quadratic networks achieve lower risk when fully trained than in the random features or neural tangent regimes when the number of hidden units is less than d.
- Fixed-order PCA: Theory for Overestimated Factor Models
  Establishes asymptotic consistency of factor estimates and √T-normality in factor-augmented regressions for fixed R ≥ r, using anisotropic local laws from random matrix theory.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
  SPIN lets weak LLMs become strong by generating training data from previous versions of the model and training them to prefer human-annotated responses over their own outputs; on benchmarks it outperforms DPO even when DPO is given extra GPT-4 data.
- Proximal Estimation and Inference
  A proximal operator framework unifies asymptotics and oracle features for penalized estimators and yields new √n-consistent Ridgeless-type estimators for linear regression.
- Asymmetric Scaling Laws from Sparse Features
  A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier that favors dataset growth.