hub Mixed citations

Scaling Data-Constrained Language Models

Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi · 2023 · cs.CL · arXiv 2305.16264

Mixed citation behavior. Most common role is background (67%).

17 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 17 citing papers arXiv PDF

abstract

The current trend of scaling language models involves increasing both parameter count and training dataset size. Extrapolating this trend suggests that training dataset size may soon be limited by the amount of text data available on the internet. Motivated by this limit, we investigate scaling language models in data-constrained regimes. Specifically, we run a large set of experiments varying the extent of data repetition and compute budget, ranging up to 900 billion training tokens and 9 billion parameter models. We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data. However, with more repetition, the value of adding compute eventually decays to zero. We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters. Finally, we experiment with approaches mitigating data scarcity, including augmenting the training dataset with code data or removing commonly used filters. Models and datasets from our 400 training runs are freely available at https://github.com/huggingface/datablations.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6

citation-polarity summary

background 4 baseline 1 unclear 1

representative citing papers

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

cs.LG · 2026-04-21 · unverdicted · novelty 7.0 · 2 refs

Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match fixed-size baseline performance with 32% less compute.

Causal inference for social network formation

econ.EM · 2026-04-20 · conditional · novelty 7.0

Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.

The Art of Scaling Reinforcement Learning Compute for LLMs

cs.LG · 2025-10-15 · unverdicted · novelty 7.0

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.

OLMo: Accelerating the Science of Language Models

cs.CL · 2024-02-01 · accept · novelty 7.0

OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.

Scalable Extraction of Training Data from (Production) Language Models

cs.LG · 2023-11-28 · conditional · novelty 7.0

Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

RWKV: Reinventing RNNs for the Transformer Era

cs.CL · 2023-05-22 · unverdicted · novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

cs.LG · 2026-05-13 · conditional · novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

Foundation Models for Discovery and Exploration in Chemical Space

physics.chem-ph · 2025-10-20 · unverdicted · novelty 6.0

MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

cs.CL · 2024-04-22 · accept · novelty 6.0

Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Textbooks Are All You Need

cs.CL · 2023-06-20 · unverdicted · novelty 6.0

A 1.3B-parameter code model trained on 7B tokens of curated textbook and synthetic data achieves 50.6% on HumanEval, indicating data quality can enable strong performance at small scale.

DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

Off-the-shelf models assess quality and alignment to select diverse multimodal training data, letting models trained on the filtered subset match or exceed full-dataset results on standard benchmarks.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

cs.LG · 2026-04-10 · unverdicted · novelty 2.0

A reduced attention-only decoder shows diminishing returns in dataset scaling, reaching 90% of full accuracy with only 30% of the data.

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

cs.CL · 2026-04-22

citing papers explorer

Showing 4 of 4 citing papers after filters.

Causal inference for social network formation econ.EM · 2026-04-20 · conditional · none · ref 83 · internal anchor
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
Scalable Extraction of Training Data from (Production) Language Models cs.LG · 2023-11-28 · conditional · none · ref 34 · internal anchor
Adversaries can scalably extract gigabytes of training data from open, semi-open, and closed language models via querying attacks, including a divergence method that increases extraction rates 150x on aligned models like ChatGPT.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 15 · internal anchor
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 139 · internal anchor
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

Scaling Data-Constrained Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer