Explaining neural scaling laws
6 Pith papers cite this work.
Citation-role and citation-polarity summary (by year): 2026 — 6 citing papers; role "background" assigned to 2. Representative citing papers are listed below.
Citing papers explorer
-
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World
A new scaling law, L(N, D, T) = E + (L0 - E) h/(1 + h) with h = a/N^α + b/T^β + c N^γ/D^δ, decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
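The saturating form of this law can be sketched in a few lines. This is a minimal illustration of the functional shape only: all coefficient values below (E, L0, a, α, b, β, c, γ, δ) are placeholder assumptions, not the paper's fitted constants.

```python
def scaling_loss(N, D, T, E=1.7, L0=10.0,
                 a=1e3, alpha=0.3, b=1e2, beta=0.3,
                 c=1e-2, gamma=0.4, delta=0.5):
    """Loss as a function of parameters N, dataset size D, training steps T.

    h sums an undercapacity term (a/N^alpha), an undertraining term
    (b/T^beta), and an overfitting term (c*N^gamma/D^delta); the map
    h/(1+h) keeps the loss strictly between E and L0.
    """
    h = a / N**alpha + b / T**beta + c * N**gamma / D**delta
    return E + (L0 - E) * h / (1 + h)
```

Because h/(1+h) lies in (0, 1), the predicted loss approaches E as all three terms shrink and approaches L0 as any term blows up, matching the stated saturation between E and L0.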
-
Predicting Large Model Test Losses with a Noisy Quadratic System
A noisy quadratic system predicts large-model test losses from N, B, and K, and outperforms Chinchilla's scaling model when extrapolating up to 1000x more compute.
-
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
-
TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference
Tiny NeRV models using capacity scaling, frequency-aware distillation, and low-precision quantization achieve favorable quality-efficiency trade-offs with far fewer parameters and lower computational costs than standard NeRV.
-
The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs
Effective depth, an operational count of sequential transformations, predicts CNN trainability better than nominal layer count because shortcuts and branches decouple the two.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions spanning idealized settings, limits, laws, hyperparameters, and universal behaviors.