hub Canonical reference

Scaling Laws and Interpretability of Learning from Repeated Data

Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk · 2022 · cs.LG · arXiv 2205.10487

Canonical reference. 100% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the model is exposed to repeated data at the sentence, paragraph, or document level. Some works have reported substantial negative performance effects of this repeated data. In this paper we attempt to study repeated data systematically and to understand its effects mechanistically. To do this, we train a family of models where most of the data is unique but a small fraction of it is repeated many times. We find a strong double descent phenomenon, in which repeated data can lead test loss to increase midway through training. A predictable range of repetition frequency leads to surprisingly severe degradation in performance. For instance, performance of an 800M parameter model can be degraded to that of a 2x smaller model (400M params) by repeating 0.1% of the data 100 times, despite the other 90% of the training tokens remaining unique. We suspect there is a range in the middle where the data can be memorized and doing so consumes a large fraction of the model's capacity, and this may be where the peak of degradation occurs. Finally, we connect these observations to recent mechanistic interpretability work - attempting to reverse engineer the detailed computations performed by the model - by showing that data repetition disproportionately damages copying and internal structures associated with generalization, such as induction heads, providing a possible mechanism for the shift from generalization to memorization. Taken together, these results provide a hypothesis for why repeating a relatively small fraction of data in large language models could lead to disproportionately large harms to performance.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

cs.LG · 2026-05-13 · conditional · novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World

cs.LG · 2026-05-09 · conditional · novelty 6.0

A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.

Prescriptive Scaling Laws for Data Constrained Training

cs.LG · 2026-05-02 · unverdicted · novelty 6.0

A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Superposition Yields Robust Neural Scaling

cs.LG · 2025-05-15 · conditional · novelty 6.0

Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.

Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

cond-mat.dis-nn · 2025-02-07 · unverdicted · novelty 6.0

Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

cs.CL · 2023-06-01 · unverdicted · novelty 6.0

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

The False Promise of Imitating Proprietary LLMs

cs.CL · 2023-05-25 · conditional · novelty 6.0

Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.

Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods

cs.LG · 2026-04-19 · unverdicted · novelty 5.0

ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.

cs.AI · 2025-09-21 · unverdicted · novelty 5.0

Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0 · 2 refs

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

cs.CV · 2025-02-14 · unverdicted · novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

Large Language Models: A Survey

cs.CL · 2024-02-09 · accept · novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

cs.LG · 2026-04-10 · unverdicted · novelty 2.0

A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

cs.CL · 2026-05-01

citing papers explorer

Showing 23 of 23 citing papers.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 104 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency cs.CL · 2026-05-18 · unverdicted · none · ref 19 · internal anchor
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 7 · internal anchor
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Scaling Laws for Mixture Pretraining Under Data Constraints cs.LG · 2026-05-12 · unverdicted · none · ref 17 · 2 links · internal anchor
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World cs.LG · 2026-05-09 · conditional · none · ref 24 · internal anchor
A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.
Prescriptive Scaling Laws for Data Constrained Training cs.LG · 2026-05-02 · unverdicted · none · ref 6 · internal anchor
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource cs.CL · 2025-06-13 · conditional · none · ref 15 · internal anchor
MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.
MAGI-1: Autoregressive Video Generation at Scale cs.CV · 2025-05-19 · unverdicted · none · ref 15 · internal anchor
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Superposition Yields Robust Neural Scaling cs.LG · 2025-05-15 · conditional · none · ref 22 · internal anchor
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models cond-mat.dis-nn · 2025-02-07 · unverdicted · none · ref 15 · internal anchor
Derives a novel two-point deterministic equivalence for random matrix resolvents to obtain unified asymptotics for SGD-trained linear regression, kernel regression, and random feature models.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 290 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023-06-01 · unverdicted · none · ref 26 · internal anchor
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 43 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
The False Promise of Imitating Proprietary LLMs cs.CL · 2023-05-25 · conditional · none · ref 57 · internal anchor
Finetuning open LMs on ChatGPT outputs creates models that mimic style and fool human raters but fail to close the performance gap to proprietary systems on tasks not well-represented in the imitation data.
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods cs.LG · 2026-04-19 · unverdicted · none · ref 13 · internal anchor
ADAPT is an online reweighting framework for LLM training that outperforms offline data selection and mixing methods in cross-benchmark generalization under equal compute.
Similarity Field Theory: A Mathematical Framework for Intelligence cs.AI · 2025-09-21 · unverdicted · none · ref 28 · internal anchor
Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 159 · internal anchor
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
Galactica: A Large Language Model for Science cs.CL · 2022-11-16 · unverdicted · none · ref 92 · 2 links · internal anchor
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model cs.CV · 2025-02-14 · unverdicted · none · ref 195 · internal anchor
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 125 · internal anchor
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 238 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder cs.LG · 2026-04-10 · unverdicted · none · ref 3 · internal anchor
A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL · 2026-05-01 · unreviewed · ref 11 · internal anchor

Scaling Laws and Interpretability of Learning from Repeated Data

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer