Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.
hub
Explaining neural scaling laws
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.
Bounded performance metrics always favor convergence of AI capabilities to meek models while unbounded metrics allow frontier models to maintain leads indefinitely, with policy implications for capability concentration.
Pretraining data composition can be used to engineer neural scaling laws in hadronic jet classification toward data-heavy rather than model-size-heavy regimes.
Tiny NeRV models using capacity scaling, frequency-aware distillation, and low-precision quantization achieve favorable quality-efficiency trade-offs with far fewer parameters and lower computational costs than standard NeRV.
Effective depth, an operational count of sequential transformations, predicts CNN trainability better than nominal layer count because shortcuts and branches decouple the two.
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.
citing papers explorer
No citing papers match the current filters.