Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.
ReST improves LLM translation quality on benchmarks via offline RL on self-generated data, achieving gains in a compute-efficient way compared to typical RLHF.
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
Tuning the depth-width ratio positions models in an efficient neural interaction interval that correlates with better generalization under fixed budgets and remains stable with scale.
citing papers explorer
No citing papers match the current filters.