Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
An Empirical Model of Large-Batch Training
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
In an increasing number of domains it has been demonstrated that deep learning models can be trained using relatively large batch sizes without sacrificing data efficiency. However, the limits of this massive data parallelism seem to differ from domain to domain, ranging from batches of tens of thousands in ImageNet to batches of millions in RL agents that play the game Dota 2. To our knowledge there is limited conceptual understanding of why these limits to batch size differ or how we might choose the correct batch size in a new domain. In this paper, we demonstrate that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets (MNIST, SVHN, CIFAR-10, ImageNet, Billion Word), reinforcement learning domains (Atari and Dota), and even generative model training (autoencoders on SVHN). We find that the noise scale increases as the loss decreases over a training run and depends on the model size primarily through improved model performance. Our empirically motivated theory also describes the tradeoff between compute efficiency and time efficiency, and provides a rough model of the benefits of adaptive batch-size training.
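The abstract names the gradient noise scale but does not define it. The sketch below assumes the "simple" form commonly attributed to this paper, B_simple = tr(Σ) / |G|², where G is the true gradient and Σ is the per-example gradient covariance, and estimates it from gradient norms measured at two batch sizes. The function name, argument names, and toy inputs are hypothetical and chosen only for illustration; this is not presented as the authors' code.

```python
def simple_noise_scale(g_small, g_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from two mini-batch gradients.

    Assumption (not stated in the abstract): the squared norm of a
    gradient averaged over B examples satisfies
        E[|g_B|^2] = |G|^2 + tr(Sigma) / B,
    so measurements at two batch sizes can be solved for both terms.

    g_small, g_big: flattened gradients (sequences of floats) averaged
    over b_small and b_big examples, with b_big > b_small.
    """
    if b_big <= b_small:
        raise ValueError("b_big must be larger than b_small")
    n_small = sum(x * x for x in g_small)  # |g_small|^2
    n_big = sum(x * x for x in g_big)      # |g_big|^2
    # Estimate of the squared norm of the true gradient, |G|^2.
    grad_sq = (b_big * n_big - b_small * n_small) / (b_big - b_small)
    # Estimate of tr(Sigma), the summed per-example gradient variance.
    trace_sigma = (n_small - n_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / grad_sq, grad_sq, trace_sigma


# Toy inputs built so that |G|^2 = 4 and tr(Sigma) = 64 hold exactly:
# |g_small|^2 = 4 + 64/8 = 12 and |g_big|^2 = 4 + 64/64 = 5.
noise_scale, grad_sq, trace_sigma = simple_noise_scale(
    [12 ** 0.5, 0.0], [5 ** 0.5, 0.0], b_small=8, b_big=64
)
print(round(noise_scale, 6), round(grad_sq, 6), round(trace_sigma, 6))
```

On real training runs a single pair of measurements is likely to be very noisy, so a practical version would probably average the `grad_sq` and `trace_sigma` estimates over many steps before taking their ratio.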
citation-role summary
roles: background 7
citation-polarity summary
polarities: background 7
representative citing papers
Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.
ABS uses Behavioral Divergence to adaptively scale batch sizes in RL according to policy volatility, enabling effective large-batch large-network training on ALE benchmarks.
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
Effective noise scale non-monotonically governs model merging success with an optimum, unifying effects of learning rate, weight decay, batch size, and augmentation on the loss landscape.
ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.
OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.
DEFault++ applies hierarchical learning with a Fault Propagation Graph to detect, localize, and diagnose faults in transformers, improving F1 to 0.826-0.909 and developer repair accuracy from 57.1% to 83.3% on a new benchmark of 5,556 mutation-tested runs.
HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.
COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
citing papers explorer
- From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression
  Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.
- Scalable Reinforcement Learning via Adaptive Batch Scaling
  ABS uses Behavioral Divergence to adaptively scale batch sizes in RL according to policy volatility, enabling effective large-batch large-network training on ALE benchmarks.
- How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
  The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
- PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
  PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
- HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model
  Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.
- On the Nonlinearity of Learning Rate Scaling for LLM Training
  Optimal learning rate for models from 22M to 707M parameters shows nonlinear upward curvature with scale that disappears under effective learning rate and data-scale extrapolation.
- Hierarchical Fault Detection and Diagnosis for Transformer Architectures
  DEFault++ applies hierarchical learning with a Fault Propagation Graph to detect, localize, and diagnose faults in transformers, improving F1 to 0.826-0.909 and developer repair accuracy from 57.1% to 83.3% on a new benchmark of 5,556 mutation-tested runs.
- Rethinking Language Model Scaling under Transferable Hypersphere Optimization
  HyperP transfers optimal learning rates across model width, depth, tokens, and MoE granularity under Frobenius-sphere constraints, delivering stable scaling and 1.58x efficiency gains.
- Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers
  Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
- Predicting Large Model Test Losses with a Noisy Quadratic System
  A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.
- Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
  Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.
- Quantum Tilted Loss in Variational Optimization: Theory and Applications
  QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement variance.
- COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
  COPUS co-adapts batch size and parallelism during LLM training via goodput to deliver 3.9-8% average faster convergence than fixing one while tuning the other.
- The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
  Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
- GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
  GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.
- The Future of Facts: Tracing the Factual Generation-Verification Gap
  Empirical tracing across model families shows verification precedes and outlasts generation for facts, with updates producing simultaneous verification of old and new answers.
- Intelligence Inertia: Physical Isomorphism and Applications
  Intelligence Inertia models the computational resistance to structural change in neural networks via a heuristic relativistic analogy, yielding a J-shaped cost curve that diverges from classical approximations.
- Unified Neural Scaling Laws
  Presents a single functional form for neural scaling that unifies multiple scaling dimensions and claims higher extrapolation accuracy than prior forms across diverse tasks and architectures.
- Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence
  Proposes Artificial Adaptive Intelligence as the regime between narrow and general AI, defined by elimination of human-specified hyperparameters, and introduces an adaptivity index plus parametric minimality principle grounded in minimum description length.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.