hub Mixed citations

Scaling Data-Constrained Language Models

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi · 2023 · cs.CL · arXiv 2305.16264

Mixed citation behavior. Most common role is background (67%).

19 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 19 citing papers arXiv PDF

abstract

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 4 baseline 1 unclear 1

representative citing papers

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Causal inference for social network formation

econ.EM · 2026-04-20 · conditional · novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

OLMo: Accelerating the Science of Language Models

cs.CL · 2024-02-01 · accept · novelty 7.0

OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.

Scalable Extraction of Training Data from (Production) Language Models

cs.LG · 2023-11-28 · conditional · novelty 7.0

Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

RWKV: Reinventing RNNs for the Transformer Era

cs.CL · 2023-05-22 · unverdicted · novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

cs.LG · 2026-05-13 · conditional · novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

Foundation Models for Discovery and Exploration in Chemical Space

physics.chem-ph · 2025-10-20 · unverdicted · novelty 6.0

MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Textbooks Are All You Need

cs.CL · 2023-06-20 · unverdicted · novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

q0: Primitives for Hyper-Epoch Pretraining

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

q0 turns multi-epoch budgets into diverse model populations using three primitives that outperform single-model training and strong ensembles with fewer epochs on a 1.8B model.

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

Unified Neural Scaling Laws

cs.LG · 2026-05-25 · unverdicted · novelty 4.0

Presents a single functional form for neural scaling that unifies multiple scaling dimensions and claims higher extrapolation accuracy than prior forms across diverse tasks and architectures.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

cs.LG · 2026-04-10 · unverdicted · novelty 2.0

A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

cs.CL · 2026-04-22

citing papers explorer

Showing 19 of 19 citing papers.

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts cs.LG · 2026-04-21 · unverdicted · none · ref 38 · 2 links · internal anchor
Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.
Causal inference for social network formation econ.EM · 2026-04-20 · conditional · none · ref 83 · internal anchor
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 13 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
OLMo: Accelerating the Science of Language Models cs.CL · 2024-02-01 · accept · none · ref 6 · internal anchor
OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.
Scalable Extraction of Training Data from (Production) Language Models cs.LG · 2023-11-28 · conditional · none · ref 34 · internal anchor
Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.
C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 37 · internal anchor
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
RWKV: Reinventing RNNs for the Transformer Era cs.CL · 2023-05-22 · unverdicted · none · ref 12 · internal anchor
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 15 · internal anchor
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
Foundation Models for Discovery and Exploration in Chemical Space physics.chem-ph · 2025-10-20 · unverdicted · none · ref 133 · internal anchor
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 130 · internal anchor
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone cs.CL · 2024-04-22 · accept · none · ref 20 · internal anchor
Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 139 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Textbooks Are All You Need cs.CL · 2023-06-20 · unverdicted · none · ref 23 · internal anchor
A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.
q0: Primitives for Hyper-Epoch Pretraining cs.LG · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
q0 turns multi-epoch budgets into diverse model populations using three primitives that outperform single-model training and strong ensembles with fewer epochs on a 1.8B model.
DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models cs.CV · 2026-04-18 · unverdicted · none · ref 5 · internal anchor
Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.
Unified Neural Scaling Laws cs.LG · 2026-05-25 · unverdicted · none · ref 20 · internal anchor
Presents a single functional form for neural scaling that unifies multiple scaling dimensions and claims higher extrapolation accuracy than prior forms across diverse tasks and architectures.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 63 · internal anchor
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder cs.LG · 2026-04-10 · unverdicted · none · ref 8 · internal anchor
A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.
AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models cs.CL · 2026-04-22 · unreviewed · ref 4 · internal anchor

Scaling Data-Constrained Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer