A Bitter Lesson for Data Filtering
Pith reviewed 2026-05-20 07:36 UTC · model grok-4.3
The pith
With enough compute, large models benefit from low-quality and distractor data instead of needing filters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.
What carries the argument
Scaling studies in the high-compute, data-scarce regime that track how large parameter models respond to unfiltered data.
If this is right
- Filtering data becomes unnecessary or harmful once compute is high enough.
- Training can safely include more abundant but lower-quality sources.
- Resources for data preparation can shift toward increasing model size or training steps.
- Performance gains come from using nominally poor data rather than removing it.
Where Pith is reading between the lines
- Data collection efforts could prioritize volume and coverage over quality checks for future large-scale training.
- The same tolerance to noise may appear in vision or multimodal pretraining once models reach similar scales.
- Smaller models or low-compute regimes may still need filtering, creating a scale-dependent rule of thumb.
Load-bearing premise
The scaling studies accurately capture how large models will behave when trained with lots of computing power on limited or mixed-quality data.
What would settle it
A direct comparison where a large model trained on filtered high-quality data outperforms one trained on unfiltered data at the same high compute level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents scaling studies targeting the high-compute, data-scarce regime for large-parameter models and concludes that sufficiently trained models tolerate and benefit from unfiltered data containing low-quality and distractor examples, implying that the optimal data filter is no filter at all.
Significance. If the central empirical result holds, the finding would be significant for pretraining practice by challenging the necessity of aggressive data filtering and reinforcing that additional compute can render curation less critical. The explicit targeting of the high-compute regime is a methodological strength that directly engages the relevant scaling limit.
Major comments (1)
- §4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.
Minor comments (2)
- Figure 3: axis labels and legends should explicitly state whether token count or FLOPs are held constant across filtered and unfiltered runs.
- §2.2: the precise construction of the 'distractor' data mixture is described only at a high level; adding a short table of mixture ratios would improve reproducibility.
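The Figure 3 comment matters because the two accounting choices imply different amounts of data repetition. As a rough illustration only, the sketch below uses the common approximation C ≈ 6·N·D for training FLOPs (an assumption here; the manuscript is not confirmed to use it) to show how many passes a fixed FLOP budget implies over a smaller filtered pool versus a larger unfiltered one. The function name, model size, budget, pool sizes, and the 20% keep rate are all hypothetical.

```python
def epochs_at_fixed_flops(flops_budget, n_params, unique_tokens):
    """Passes over a dataset implied by a FLOP budget.

    Assumes the rough rule C ~= 6 * N * D (training FLOPs ~= 6 x parameters
    x tokens processed). This is an approximation, not the paper's accounting.
    """
    tokens_processed = flops_budget / (6.0 * n_params)
    return tokens_processed / unique_tokens


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
FLOPS_BUDGET = 6e21        # same compute for both runs
N_PARAMS = 1e9             # same model size for both runs
UNFILTERED_TOKENS = 1e11   # full data pool
FILTERED_TOKENS = 0.2 * UNFILTERED_TOKENS  # a filter that keeps 20% of tokens

unfiltered_epochs = epochs_at_fixed_flops(FLOPS_BUDGET, N_PARAMS, UNFILTERED_TOKENS)
filtered_epochs = epochs_at_fixed_flops(FLOPS_BUDGET, N_PARAMS, FILTERED_TOKENS)

print(f"unfiltered: {unfiltered_epochs:.1f} epochs, filtered: {filtered_epochs:.1f} epochs")
```

Under these assumptions the filtered run repeats its data five times as often as the unfiltered run at equal FLOPs, whereas holding the token count constant would hide that difference. This is why the axis labels need to state which quantity is fixed.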
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation of minor revision. We address the single major comment below.
- Referee: §4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.
Authors: We agree that an explicit quantification of the crossover threshold and a discussion of behavior at scales beyond those tested would strengthen the presentation of our central claim. Our experiments were designed to demonstrate the qualitative trend that unfiltered data becomes preferable in the high-compute regime rather than to produce a precise predictive model of the transition point. In the revised manuscript we will add a short analysis in §4 that estimates the crossover compute threshold by fitting the observed scaling curves and will include a brief discussion of how the advantage may continue to evolve at larger scales consistent with the reported trends.
Revision: yes
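One way such a crossover estimate could be made is sketched below. This is not the authors' method. It assumes each curve follows a pure power law, loss ≈ A · C^(−alpha), fits it by least squares in log-log space, and solves for the compute at which the two fitted curves intersect. The functional form, the function names, and every data point are hypothetical placeholders, not results from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss ~= A * compute**(-alpha) in log-log space.

    Returns (A, alpha). The pure power-law form is an assumption made here
    so that the crossover has a closed-form solution.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def crossover_compute(fit_a, fit_b):
    """Compute budget at which two fitted power laws intersect.

    Solves A1 * C**(-a1) == A2 * C**(-a2) for C. Returns None when the
    exponents are (numerically) equal, since the curves never cross.
    """
    (A1, a1), (A2, a2) = fit_a, fit_b
    if math.isclose(a1, a2):
        return None
    return (A1 / A2) ** (1.0 / (a1 - a2))


# Synthetic placeholder curves: the filtered run starts lower but improves
# more slowly with compute than the unfiltered run.
compute = [1e18, 1e19, 1e20, 1e21]
filtered_loss = [3.0 * (c / 1e18) ** -0.05 for c in compute]
unfiltered_loss = [3.3 * (c / 1e18) ** -0.08 for c in compute]

filtered_fit = fit_power_law(compute, filtered_loss)
unfiltered_fit = fit_power_law(compute, unfiltered_loss)
c_star = crossover_compute(filtered_fit, unfiltered_fit)

print(f"estimated crossover compute: {c_star:.3e} FLOPs")
```

A fit with an irreducible-loss term would need a nonlinear solver and a numerical root search; the pure power law is used here only to keep the sketch short. Whether the gap keeps widening past the largest tested scale depends on how far the fitted exponents can be trusted beyond the observed range.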
Circularity Check
No significant circularity; central claim rests on new empirical scaling studies
The paper derives its conclusion that sufficiently trained large models benefit from unfiltered data directly from new scaling experiments targeting the high-compute, data-scarce regime. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the result is not equivalent to its inputs by construction. The derivation is self-contained because it reports direct performance measurements rather than invoking uniqueness theorems, ansatzes from prior work, or renamings of known patterns.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "with enough compute, the best data filter is no data filter... sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally 'poor' data"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.