A Bitter Lesson for Data Filtering

Christopher Mohri; John Duchi; Tatsunori Hashimoto

arxiv: 2605.19407 · v1 · pith:SSLAQALSnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-20 07:36 UTC · model grok-4.3

classification cs.LG · cs.AI

keywords data filtering · large models · pretraining · scaling studies · data quality · compute scaling · distractor data


The pith

With enough compute, large models benefit from low-quality and distractor data instead of needing filters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs scaling studies on data filtering for large model pretraining, focused on the high-compute and data-scarce regime. It finds that large parameter models not only tolerate low-quality data but improve when it is included. This challenges the usual practice of heavy filtering to keep only high-quality information. A reader would care because it implies that as models and compute grow, time spent cleaning data could be better used to add more raw data or extend training.

Core claim

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.

What carries the argument

Scaling studies in the high-compute, data-scarce regime that track how large parameter models respond to unfiltered data.

If this is right

Filtering data becomes unnecessary or harmful once compute is high enough.
Training can safely include more abundant but lower-quality sources.
Resources for data preparation can shift toward increasing model size or training steps.
Performance gains come from using nominally poor data rather than removing it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Data collection efforts could prioritize volume and coverage over quality checks for future large-scale training.
The same tolerance to noise may appear in vision or multimodal pretraining once models reach similar scales.
Smaller models or low-compute regimes may still need filtering, creating a scale-dependent rule of thumb.

Load-bearing premise

The scaling studies accurately capture how large models will behave when trained with lots of computing power on limited or mixed-quality data.

What would settle it

A direct comparison where a large model trained on filtered high-quality data outperforms one trained on unfiltered data at the same high compute level.
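One way to read such a comparison off measured curves is to look for the compute budget at which the unfiltered run first matches the filtered one. The following is a minimal sketch, assuming both runs are evaluated at the same compute budgets and that lower loss is better; the function name `observed_crossing` and all numbers are hypothetical and not taken from the paper.

```python
import math


def observed_crossing(compute, loss_unfiltered, loss_filtered):
    """Compute at which the unfiltered curve first drops to or below the filtered one.

    Interpolates linearly in log10(compute) between the two bracketing
    measurements. Returns None when no crossing is observed, including
    the case where the unfiltered run is already ahead at the first point.
    """
    gaps = [u - f for u, f in zip(loss_unfiltered, loss_filtered)]
    for i in range(1, len(gaps)):
        if gaps[i - 1] > 0 and gaps[i] <= 0:
            x0 = math.log10(compute[i - 1])
            x1 = math.log10(compute[i])
            # Fraction of the interval at which the gap reaches zero.
            t = gaps[i - 1] / (gaps[i - 1] - gaps[i])
            return 10 ** (x0 + t * (x1 - x0))
    return None


# Synthetic losses (illustrative only), measured at matched compute budgets.
compute = [1e17, 1e18, 1e19, 1e20]
unfiltered = [4.0, 3.7, 3.3, 3.0]
filtered = [3.8, 3.5, 3.4, 3.3]
print(observed_crossing(compute, unfiltered, filtered))
```

A `None` result at the largest budget tested would mean the filtered run is still ahead (or was never behind) within the measured range, which is the outcome the paragraph above describes as settling the question against the claim.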

Figures

Figures reproduced from arXiv: 2605.19407 by Christopher Mohri, John Duchi, Tatsunori Hashimoto.

**Figure 2.** Pareto frontier of Figure …

**Figure 3.** 670M-token CC pool versus junk-injected versions. Plots show a surprising robustness to …

**Figure 5.** Top: 1B model performance as we vary the pool size; the total needed steps for pool to outperform RefinedWeb grows rapidly. Bottom: Crossing point as a function of pool size for various model sizes. Markers each represent a crossing point (e.g. top panel), with text showing the epoch count. Epochs above the largest observed crossing point (121.6 epochs) are shaded to indicate unreliability at extreme epoch…

**Figure 6.** Scaling laws for optimality of no data filtering. Two scaling laws with token-per-parameter scaling (in orange) and epoch constraints (in blue) both give highly linear scaling and predict similar budgets (1e+30 FLOPs).

We now also vary model size M to understand the joint scaling behavior as model size grows with pool size. Figure 5 shows a sweep over N⋆(M, m) with each panel varying M and the x-axis var…

**Figure 7.** 330M model: loss of 670M pool subset versus +200% dataset.

While CC is too large to exhaustively search through and contains non-factual content such as conspiracy theories, we argue that such actively harmful content is relatively low frequency. We provide a very brief study to support this with a corpus analysis of MMLU-related documents in CC [Hendrycks et al., 2021]. We first match keywords, and then w…

**Figure 8.** Ablation of 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). [Plot text: x-axis Compute (FLOPs, 6ND); panel Social IQA; series Pool (CC), RefinedWeb, English, Repetition, Stop Words, DCLM-Baseline.]

**Figure 9.** Pareto frontier of compute vs. benchmark performance for CC pool and filtered datasets.

**Figure 10.** 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens. [Plot text: x-axis Compute (FLOPs, 6ND); y-axis Avg NLL (C4, Cosmo, FineWeb); panel 100M Tokens; series Pool (CC…]

**Figure 11.** Pareto frontier of compute vs. average negative log-likelihood for CC pool and filtered …

**Figure 12.** 670M CC pool and random injection datasets. Each row is a downstream benchmark.

**Figure 13.** 670M CC pool and shuffled-word injection datasets. Each row is a downstream benchmark.
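Several of the captions above express compute as FLOPs under the 6ND approximation and training length as epochs over a fixed token pool. A minimal sketch of that accounting follows, assuming N is the parameter count and D is the total number of training tokens (pool size times epochs). The helper names and the example values are illustrative assumptions, not figures reported in the paper.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def epochs_for_budget(flops: float, n_params: float, pool_tokens: float) -> float:
    """Number of passes over a fixed pool that a given FLOP budget pays for."""
    total_tokens = flops / (6.0 * n_params)
    return total_tokens / pool_tokens


# Hypothetical example: a 1e9-parameter model trained for 10 epochs
# on a 670e6-token pool.
n_params = 1e9
pool_tokens = 670e6
flops = train_flops(n_params, pool_tokens * 10)
print(f"{flops:.3e} FLOPs")
print(epochs_for_budget(flops, n_params, pool_tokens))
```

Under this accounting, a fixed pool and a growing compute budget can only be spent on more parameters or more epochs, which is why the captions report epoch counts alongside compute.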


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scaling studies indicate that unfiltered data can outperform filtered data for large models in high-compute regimes, though the experiments may still sit short of the decisive crossover point.


The core takeaway is that this paper's scaling experiments suggest the best data filter for large models becomes no filter at all once compute is high enough. Models appear to tolerate and even benefit from low-quality and distractor data under those conditions, which runs against the usual push for aggressive cleaning of pretraining corpora.

What is new here is the deliberate targeting of the high-compute, data-scarce regime with scaling studies. Most earlier filtering work either stays at smaller scales or assumes compute limits that do not match current large-model practice, so this angle adds a useful empirical check on how filtering behaves as we push further out.

The paper does a reasonable job presenting the case through direct comparisons that challenge the filtering-is-essential assumption. The results are framed as showing tolerance and benefit from nominally poor data when models are sufficiently trained and scaled.

The main soft spot is whether the largest tested points actually demonstrate unfiltered data pulling ahead or whether the advantage is still projected from the trend. If filtered data remains competitive or better at the biggest scales reached, the central claim depends on the scaling continuing in the same direction beyond the measured range. That is a proportionate concern rather than a deal-breaker, and it lines up with the stress-test note. Details on exact model sizes, data mixing ratios, and training controls would help pin this down.

This paper is aimed at researchers working on data curation and scaling for foundation models. Readers who care about practical decisions on raw web-scale data will find the targeted studies relevant. It deserves a serious referee because the question is timely for how people allocate compute and data, even if the results need tighter confirmation on the regime they cover. I would send it to peer review rather than desk reject.

Referee Report

1 major / 2 minor

Summary. The paper presents scaling studies targeting the high-compute, data-scarce regime for large-parameter models and concludes that sufficiently trained models tolerate and benefit from unfiltered data containing low-quality and distractor examples, implying that the optimal data filter is no filter at all.

Significance. If the central empirical result holds, the finding would be significant for pretraining practice by challenging the necessity of aggressive data filtering and reinforcing that additional compute can render curation less critical. The explicit targeting of the high-compute regime is a methodological strength that directly engages the relevant scaling limit.

major comments (1)

§4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.

minor comments (2)

Figure 3: axis labels and legends should explicitly state whether token count or FLOPs are held constant across filtered and unfiltered runs.
§2.2: the precise construction of the 'distractor' data mixture is described only at a high level; adding a short table of mixture ratios would improve reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the recommendation of minor revision. We address the single major comment below.


Referee: §4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.

Authors: We agree that an explicit quantification of the crossover threshold and a discussion of behavior at scales beyond those tested would strengthen the presentation of our central claim. Our experiments were designed to demonstrate the qualitative trend that unfiltered data becomes preferable in the high-compute regime rather than to produce a precise predictive model of the transition point. In the revised manuscript we will add a short analysis in §4 that estimates the crossover compute threshold by fitting the observed scaling curves and will include a brief discussion of how the advantage may continue to evolve at larger scales consistent with the reported trends.

Revision: yes
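The kind of analysis promised in this response could look like the sketch below: fit each loss curve as a straight line in log-log space and solve for the compute at which the two fitted lines meet. This assumes a pure power law with no irreducible-loss term, which is a simplification and not necessarily the functional form the authors would use; the function names and the data are hypothetical.

```python
import math


def fit_loglog_line(compute, loss):
    """Ordinary least-squares fit of log10(loss) = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fitted_crossover(fit_a, fit_b):
    """Compute at which two fitted log-log lines intersect, or None if parallel."""
    (slope_a, intercept_a), (slope_b, intercept_b) = fit_a, fit_b
    if abs(slope_a - slope_b) < 1e-12:
        return None
    log_c = (intercept_b - intercept_a) / (slope_a - slope_b)
    return 10 ** log_c


# Synthetic curves (illustrative only): the unfiltered run starts worse
# but improves faster with compute.
compute = [1e17, 1e18, 1e19, 1e20]
unfiltered = [10 ** (1.70 - 0.07 * math.log10(c)) for c in compute]
filtered = [10 ** (1.32 - 0.05 * math.log10(c)) for c in compute]

crossover = fitted_crossover(
    fit_loglog_line(compute, unfiltered),
    fit_loglog_line(compute, filtered),
)
print(f"{crossover:.3e} FLOPs")
```

A crossover estimated this way is an extrapolation whenever it falls outside the measured range, which is the concern raised in the referee's major comment.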

Circularity Check

0 steps flagged

No significant circularity; central claim rests on new empirical scaling studies


The paper derives its conclusion that sufficiently trained large models benefit from unfiltered data directly from new scaling experiments targeting the high-compute, data-scarce regime. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the result is not equivalent to its inputs by construction. The derivation is self-contained because it reports direct performance measurements rather than invoking uniqueness theorems, ansatzes from prior work, or renamings of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on experimental observations from scaling studies in the high-compute data-scarce regime; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5587 in / 930 out tokens · 35210 ms · 2026-05-20T07:36:55.952306+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

with enough compute, the best data filter is no data filter... sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally “poor” data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1] Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic. 2024.
[2] Global Optimality in Low-Rank Matrix Optimization. IEEE Transactions on Signal Processing, 66(13).
[3] Gradient descent only converges to minimizers.
[4] Pierre Baldi and Kurt Hornik. Neural Networks, volume 2.
[5] Scaling Data-Constrained Language Models. 2025.
[6] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023.
[7] A Survey on Data Selection for Language Models. 2024.
[8] DataComp-LM: In search of the next generation of training sets for language models. 2025.
[9] Common Crawl Corpus.
[10] Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. 2025.
[11] Mathurin Videau and Badr Youbi Idrissi and Daniel Haziza and Luca Wehrstedt and Jade Copet and Olivier Teytaud and David Lopez-Paz.
[12] Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT-NeoX-20B: An Open-Source Autoregressive Language Model. 2022. doi:10.18653/v1/2022.bigscience-1.9
[13] The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. 2024.
[14] Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro.
[15] The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only. arXiv preprint arXiv:2306.01116.
[16] Pre-training under infinite compute. 2025.
[17] Training Compute-Optimal Large Language Models. 2022.
[18] Bag of Tricks for Efficient Text Classification. 2016.
[19] Scaling Language Models: Methods, Analysis & Insights from Training Gopher. 2022.
[20] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. 2018.
[21] PIQA: Reasoning about Physical Commonsense in Natural Language. 2019.
[22] Domain Adaptation: Learning Bounds and Algorithms. 2023.
[23] Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang... doi:10.5281/zenodo.12608602
[24] Measuring Massive Multitask Language Understanding. 2021.
[25] Theory and Algorithm for Batch Distribution Drift Problems. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023.
[26] Scaling Laws for Neural Language Models. 2020.
[27] Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality. 2025.
[28] An Empirical Exploration in Quality Filtering of Text Data. 2021.
[29] Goodhart, C. A. E. Problems of Monetary Management: The UK Experience. Monetary Theory and Practice: The UK Experience. 1984. doi:10.1007/978-1-349-17295-5_4
[30] The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining. 2025.
[31] Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little. 2021.
[32] Do we really have to filter out random noise in pre-training data for language models? 2025.
[33] When Bad Data Leads to Good Models. 2025.
[34] SocialIQA: Commonsense Reasoning about Social Interactions. 2019.
[35] Language Models are Few-Shot Learners. 2020.
[36] Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. 2024.
[37] How many labelers do you have? A closer look at gold-standard labels. 2024.
[38] LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. 2021.
[39] Sutton, Richard S. 2019.
[40] Will we run out of data? Limits of LLM scaling based on human-generated data. 2024.
[41] What will AI look like in 2030? 2025.
[42] Grok 4 Model Card.
[43] Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast.
