Language models scale reliably with over-training and on downstream tasks


arxiv: 2403.08540 · v2 · pith:BOFWXA6Lnew · submitted 2024-03-13 · 💻 cs.CL · cs.LG


Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt



keywords: models, scaling, downstream, experiments, predict, language, laws, performance


Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., "Chinchilla optimal" regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32× over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run), each from experiments that take 300× less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20× less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
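The over-training multipliers quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that "compute-optimal" corresponds to roughly 20 tokens per parameter (the ratio of the 6.9B parameter, 138B token run) and uses the common approximation C ≈ 6ND for training FLOPs. The function names, the constant, and the 6ND rule are illustrative assumptions, not taken from the authors' code.

```python
# Illustrative sketch only: helper names and constants are assumptions,
# not part of the paper's released code.

# Assumed compute-optimal ratio, inferred from the 6.9B / 138B run (138 / 6.9 = 20).
ASSUMED_OPTIMAL_TOKENS_PER_PARAM = 20.0


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def over_training_factor(
    n_params: float,
    n_tokens: float,
    optimal_ratio: float = ASSUMED_OPTIMAL_TOKENS_PER_PARAM,
) -> float:
    """How many times more tokens per parameter than the assumed optimum."""
    return tokens_per_param(n_params, n_tokens) / optimal_ratio


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost using the common C ~ 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens


# The two target runs named in the abstract: (parameters, tokens).
runs = {
    "1.4B params, 900B tokens": (1.4e9, 900e9),
    "6.9B params, 138B tokens": (6.9e9, 138e9),
}

for name, (n, d) in runs.items():
    print(
        f"{name}: {tokens_per_param(n, d):.0f} tokens/param, "
        f"{over_training_factor(n, d):.1f}x the assumed optimum, "
        f"~{approx_train_flops(n, d):.2e} FLOPs"
    )
```

Under these assumptions the 1.4B parameter run sees about 643 tokens per parameter, roughly 32 times the assumed optimal ratio, which is consistent with the "32× over-trained" figure in the abstract, while the 6.9B parameter run sits at the assumed optimum.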


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
cs.LG 2026-05 unverdicted novelty 8.0

Fixed tokens-per-parameter ratios in scaling law experiments induce ill-conditioned least-squares fits due to Jacobian geometry, making scale coefficients unidentifiable and extrapolations unreliable; diverse TPP cove...
Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
cs.LG 2026-05 unverdicted novelty 7.0

IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related be...
Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors
cs.LG 2026-06 unverdicted novelty 6.0

MD Decoupling factorizes weights into fixed-norm directions and learnable per-row/column magnitudes updated at independent rates, improving Adam and Muon training stability and scale transfer without weight decay or warmup.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
cs.LG 2026-05 conditional novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation
cs.LG 2026-05 unverdicted novelty 6.0

Collinear tokens-per-parameter designs in scaling law fits induce ill-conditioning when N and D exponents are similar, inflating uncertainty and degrading off-ray extrapolations; non-collinear designs are required for...
InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition
cs.CL 2026-05 unverdicted novelty 6.0

InfoLaw models pretraining as information accumulation where quality sets information density and repetition causes scale-dependent diminishing returns, predicting loss with low error on unseen mixtures and larger sca...
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization
cs.LG 2026-03 unverdicted novelty 6.0

CAMEL is a scaling law capturing nonlinear model-size and mixture interactions to extrapolate optimal data mixtures for large LLMs from small-model experiments, reducing optimization cost by 50% and improving benchmar...
Next-Latent Prediction Transformers Learn Compact World Models
cs.LG 2025-11 unverdicted novelty 6.0

NextLat augments next-token prediction with latent next-state prediction, theoretically converging latents to belief states and showing empirical gains in world modeling, reasoning, planning, and faster inference via ...
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
cs.LG 2025-10 unverdicted novelty 6.0

A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
cs.LG 2025-02 unverdicted novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training
cs.LG 2026-06 unverdicted novelty 5.0

A polynomial preconditioning layer controls singular value spectra of transformer weights to stabilize pre-training, shown effective on Llama-1B and supported by convergence theory for deep linear networks.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 5.0

Emergent intelligence corresponds to the limit of a performance function E(N,P,K) as N, P, K go to infinity, originating from a parameter-limit architecture whose existence is governed by Lipschitz conditions, with sc...
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
cs.CL 2026-06 unverdicted novelty 4.0

Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 3.0

Formalizes emergent intelligence in foundation models as the limit of E(N,P,K) as N,P,K approach infinity, proves existence conditions via nonlinear Lipschitz operators, and derives scaling laws from covering numbers.
TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models
cs.CL 2026-04 unverdicted novelty 3.0

TLoRA+ augments LoRA with a dedicated optimizer to improve fine-tuning performance on GLUE tasks without meaningful added compute.