pith. sign in

arxiv: 2605.19407 · v1 · pith:SSLAQALSnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

A Bitter Lesson for Data Filtering

Pith reviewed 2026-05-20 07:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords data filteringlarge modelspretrainingscaling studiesdata qualitycompute scalingdistractor data
0
0 comments X

The pith

With enough compute, large models benefit from low-quality and distractor data instead of needing filters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs scaling studies on data filtering for large model pretraining, focused on the high-compute and data-scarce regime. It finds that large parameter models not only tolerate low-quality data but improve when it is included. This challenges the usual practice of heavy filtering to keep only high-quality information. A reader would care because it implies that as models and compute grow, time spent cleaning data could be better used to add more raw data or extend training.

Core claim

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally ``poor'' data.

What carries the argument

Scaling studies in the high-compute, data-scarce regime that track how large parameter models respond to unfiltered data.

If this is right

  • Filtering data becomes unnecessary or harmful once compute is high enough.
  • Training can safely include more abundant but lower-quality sources.
  • Resources for data preparation can shift toward increasing model size or training steps.
  • Performance gains come from using nominally poor data rather than removing it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data collection efforts could prioritize volume and coverage over quality checks for future large-scale training.
  • The same tolerance to noise may appear in vision or multimodal pretraining once models reach similar scales.
  • Smaller models or low-compute regimes may still need filtering, creating a scale-dependent rule of thumb.

Load-bearing premise

The scaling studies accurately capture how large models will behave when trained with lots of computing power on limited or mixed-quality data.

What would settle it

A direct comparison where a large model trained on filtered high-quality data outperforms one trained on unfiltered data at the same high compute level.

Figures

Figures reproduced from arXiv: 2605.19407 by Christopher Mohri, John Duchi, Tatsunori Hashimoto.

Figure 1
Figure 1. Figure 1: Comparison of models on 670M token CC pool and five filtered subsets. For sufficiently [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pareto frontier of Figure [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: 670M-token CC pool versus junk-injected versions. Plots show a surprising robustness to [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Top: 1B model performance as we vary the pool size; the total needed steps for pool to outperform RefinedWeb grows rapidly. Bottom: Crossing point as a function of pool size for various model sizes. Markers each represent a crossing point (e.g. top panel), with text showing the epoch count. Epochs above the largest observed crossing point (121.6 epochs) are shaded to indicate unreliability at extreme epoch… view at source ↗
Figure 6
Figure 6. Figure 6: Scaling laws for optimality of no data filtering. Two scaling laws with token-per-parameter scaling (in orange) and epoch constraints (in blue) both give highly linear scaling and predict similar budgets (1e+30 FLOPs). We now also vary model size M to understand the joint scaling behavior as model size grows with pool size. Fig￾ure 5 shows a sweep over N⋆ (M, m) with each panel varying M and the x-axis var… view at source ↗
Figure 7
Figure 7. Figure 7: 330M model: loss of 670M pool subset versus +200% dataset. While CC is too large to exhaustively search through and contains non-factual content such as conspiracy theories, we argue that such actively harmful content is relatively low frequency. We provide a very brief study to support this with a corpus analysis of MMLU-related documents in CC [Hendrycks et al., 2021]. We first match keywords, and then w… view at source ↗
Figure 8
Figure 8. Figure 8: Ablation of 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). 10 17 10 18 10 19 10 20 Compute (FLOPs, 6ND) 0.35 0.36 0.37 0.38 0.39 Social IQA Pool (CC) RefinedWeb English Repetition Stop Words DCLM-Baseline 10 17 10 18 10 19 10 20 Compute (FLOPs, 6ND) 0.28 0.30 0.32 0.34 0.36 0.3… view at source ↗
Figure 9
Figure 9. Figure 9: Pareto frontier of compute vs. benchmark performance for CC pool and filtered datasets. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: 670M token CC pool and five filtered versions. Each plot is a different model size and the total tokens x-axis corresponds to the number of gradient steps taken (with epoching). The arrow shows the change in DCLM-Baseline performance with about an order of magnitude more tokens. 10 17 10 18 10 19 Compute (FLOPs, 6ND) 3.50 3.75 4.00 4.25 4.50 4.75 5.00 5.25 Avg NLL (C4, Cosmo, FineWeb) 100M Tokens Pool (CC… view at source ↗
Figure 11
Figure 11. Figure 11: Pareto frontier of compute vs. average negative log-likelihood for CC pool and filtered [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: 670M CC pool and random injection datasets. Each row is a downstream benchmark. [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: 670M CC pool and shuffled-word injection datasets. Each row is a downstream benchmark. [PITH_FULL_IMAGE:figures/full_fig_p017_13.png] view at source ↗
read the original abstract

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally ``poor'' data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents scaling studies targeting the high-compute, data-scarce regime for large-parameter models and concludes that sufficiently trained models tolerate and benefit from unfiltered data containing low-quality and distractor examples, implying that the optimal data filter is no filter at all.

Significance. If the central empirical result holds, the finding would be significant for pretraining practice by challenging the necessity of aggressive data filtering and reinforcing that additional compute can render curation less critical. The explicit targeting of the high-compute regime is a methodological strength that directly engages the relevant scaling limit.

major comments (1)
  1. §4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.
minor comments (2)
  1. Figure 3: axis labels and legends should explicitly state whether token count or FLOPs are held constant across filtered and unfiltered runs.
  2. §2.2: the precise construction of the 'distractor' data mixture is described only at a high level; adding a short table of mixture ratios would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and the recommendation of minor revision. We address the single major comment below.

read point-by-point responses
  1. Referee: §4 (Scaling Experiments): the reported curves show unfiltered data eventually surpassing filtered data, but the manuscript does not quantify the compute threshold at which the crossover occurs or test whether the advantage continues to widen beyond the largest model sizes examined; this directly affects the strength of the 'sufficiently trained large parameter models' claim.

    Authors: We agree that an explicit quantification of the crossover threshold and a discussion of behavior at scales beyond those tested would strengthen the presentation of our central claim. Our experiments were designed to demonstrate the qualitative trend that unfiltered data becomes preferable in the high-compute regime rather than to produce a precise predictive model of the transition point. In the revised manuscript we will add a short analysis in §4 that estimates the crossover compute threshold by fitting the observed scaling curves and will include a brief discussion of how the advantage may continue to evolve at larger scales consistent with the reported trends. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central claim rests on new empirical scaling studies

full rationale

The paper derives its conclusion that sufficiently trained large models benefit from unfiltered data directly from new scaling experiments targeting the high-compute, data-scarce regime. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the result is not equivalent to its inputs by construction. The derivation is self-contained because it reports direct performance measurements rather than invoking uniqueness theorems, ansatzes from prior work, or renamings of known patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The claim rests on experimental observations from scaling studies in the high-compute data-scarce regime; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5587 in / 930 out tokens · 35210 ms · 2026-05-20T07:36:55.952306+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    2024 , eprint=

    Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic , author=. 2024 , eprint=

  2. [2]

    IEEE Transactions on Signal Processing , volume = 66, number = 13, pages =

    Global Optimality in Low-Rank Matrix Optimization , author =. IEEE Transactions on Signal Processing , volume = 66, number = 13, pages =

  3. [3]

    Gradient descent only converges to minimizers , author=

  4. [4]

    Neural Networks , volume = 2, pages =

    Pierre Baldi and Kurt Hornik , title =. Neural Networks , volume = 2, pages =

  5. [5]

    2025 , eprint=

    Scaling Data-Constrained Language Models , author=. 2025 , eprint=

  6. [6]

    2023 , eprint=

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. 2023 , eprint=

  7. [7]

    2024 , eprint=

    A Survey on Data Selection for Language Models , author=. 2024 , eprint=

  8. [8]

    2025 , eprint=

    DataComp-LM: In search of the next generation of training sets for language models , author=. 2025 , eprint=

  9. [9]

    Common Crawl Corpus , year =

  10. [10]

    2025 , eprint=

    Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws , author=. 2025 , eprint=

  11. [11]

    Mathurin Videau and Badr Youbi Idrissi and Daniel Haziza and Luca Wehrstedt and Jade Copet and Olivier Teytaud and David Lopez-Paz , title =

  12. [12]

    GPT - N eo X -20 B : An open-source autoregressive language model

    Black, Sidney and Biderman, Stella and Hallahan, Eric and Anthony, Quentin and Gao, Leo and Golding, Laurence and He, Horace and Leahy, Connor and McDonell, Kyle and Phang, Jason and Pieler, Michael and Prashanth, Usvsn Sai and Purohit, Shivanshu and Reynolds, Laria and Tow, Jonathan and Wang, Ben and Weinbach, Samuel. GPT - N eo X -20 B : An Open-Source ...

  13. [13]

    2024 , eprint=

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale , author=. 2024 , eprint=

  14. [14]

    Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro , title =

  15. [15]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only , author=. arXiv preprint arXiv:2306.01116 , year=

  16. [16]

    2025 , eprint=

    Pre-training under infinite compute , author=. 2025 , eprint=

  17. [17]

    2022 , eprint=

    Training Compute-Optimal Large Language Models , author=. 2022 , eprint=

  18. [18]

    2016 , eprint=

    Bag of Tricks for Efficient Text Classification , author=. 2016 , eprint=

  19. [19]

    2022 , eprint=

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher , author=. 2022 , eprint=

  20. [20]

    2018 , eprint=

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge , author=. 2018 , eprint=

  21. [21]

    2019 , eprint=

    PIQA: Reasoning about Physical Commonsense in Natural Language , author=. 2019 , eprint=

  22. [22]

    2023 , eprint=

    Domain Adaptation: Learning Bounds and Algorithms , author=. 2023 , eprint=

  23. [23]

    doi:10.5281/zenodo.12608602 , url =

    Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

  24. [24]

    2021 , eprint=

    Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

  25. [25]

    Proceedings of The 26th International Conference on Artificial Intelligence and Statistics , pages =

    Theory and Algorithm for Batch Distribution Drift Problems , author =. Proceedings of The 26th International Conference on Artificial Intelligence and Statistics , pages =. 2023 , editor =

  26. [26]

    2020 , eprint=

    Scaling Laws for Neural Language Models , author=. 2020 , eprint=

  27. [27]

    2025 , eprint=

    Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality , author=. 2025 , eprint=

  28. [28]

    2021 , eprint=

    An Empirical Exploration in Quality Filtering of Text Data , author=. 2021 , eprint=

  29. [29]

    Goodhart, C. A. E. Problems of Monetary Management: The UK Experience. Monetary Theory and Practice: The UK Experience. 1984. doi:10.1007/978-1-349-17295-5_4

  30. [30]

    2025 , eprint=

    The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining , author=. 2025 , eprint=

  31. [31]

    2021 , eprint=

    Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little , author=. 2021 , eprint=

  32. [32]

    2025 , eprint=

    Do we really have to filter out random noise in pre-training data for language models? , author=. 2025 , eprint=

  33. [33]

    2025 , eprint=

    When Bad Data Leads to Good Models , author=. 2025 , eprint=

  34. [34]

    2019 , eprint=

    SocialIQA: Commonsense Reasoning about Social Interactions , author=. 2019 , eprint=

  35. [35]

    2020 , eprint=

    Language Models are Few-Shot Learners , author=. 2020 , eprint=

  36. [36]

    2024 , eprint=

    Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws , author=. 2024 , eprint=

  37. [37]

    2024 , eprint=

    How many labelers do you have? A closer look at gold-standard labels , author=. 2024 , eprint=

  38. [38]

    2021 , eprint=

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs , author=. 2021 , eprint=

  39. [39]

    , title =

    Sutton, Richard S. , title =. 2019 , howpublished =

  40. [40]

    2024 , eprint=

    Will we run out of data? Limits of LLM scaling based on human-generated data , author=. 2024 , eprint=

  41. [41]

    2025 , url=

    What will AI look like in 2030? , author=. 2025 , url=

  42. [42]

    Grok 4 Model Card , year =

  43. [43]

    Janek Bevendorff and Benno Stein and Matthias Hagen and Martin Potthast , booktitle =