Data filtering methods for training language models

Egor Shevchenko; Elena Bruches

arxiv: 2605.29807 · v1 · pith:5ZZWCMLQnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-29 07:56 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords label error detection · confident learning · dataset cartography · data filtering · Russian text classification · machine learning · noise reduction

The pith

Targeted label error removal using confident learning and dataset cartography outperforms random removal on Russian text classification corpora.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two automatic methods for detecting label errors in training data for language models. It tests Confident Learning and Dataset Cartography on three Russian text classification datasets of different sizes and noise levels. Removing the examples flagged by these methods gives better model performance than removing the same number at random, and on small noisy datasets Confident Learning yields a significant F1-macro improvement. On larger, cleaner datasets, filtering does not help. This suggests the methods pick out examples that matter for training rather than arbitrary data points.

Core claim

Across all three corpora, targeted removal by both Confident Learning and Dataset Cartography outperforms random removal of an equivalent number of examples. On small datasets with high noise, Confident Learning achieves a significant F1-macro improvement, while Dataset Cartography removes fewer examples in a more conservative manner. On large corpora with low noise, filtering does not improve performance.

What carries the argument

Confident Learning and Dataset Cartography applied to a fine-tuned rubert-base-cased model to flag and remove label errors from the training corpora.
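
To make the first method concrete, here is a minimal sketch of a simplified confident-joint rule in the spirit of Confident Learning. It assumes out-of-sample predicted probabilities are already available (for instance from cross-validated fine-tuning). The function names `class_thresholds` and `flag_label_issues` are hypothetical, and this is neither the paper's pipeline nor the Cleanlab implementation.

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]]) -> List[float]:
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    n_classes = len(pred_probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no examples gets threshold 1.0 (an assumption of this sketch).
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)]


def flag_label_issues(labels: Sequence[int],
                      pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices whose most confident above-threshold class
    differs from the given label."""
    thresholds = class_thresholds(labels, pred_probs)
    flagged = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k, pk in enumerate(p) if pk >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        best = max(confident, key=lambda k: p[k])
        if best != y:
            flagged.append(i)
    return flagged
```

The per-class thresholds are what separate this rule from a plain "prediction disagrees with label" check: a class the model is generally unsure about needs less probability mass before an example counts as confidently belonging to it.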

If this is right

Targeted filtering beats random removal across all tested corpora, confirming the methods detect real issues.
The benefit of filtering is strongest on small, high-noise datasets.
Dataset Cartography tends to remove fewer examples than Confident Learning.
Performance does not improve from filtering on large, low-noise corpora.
Both methods are more effective than chance, validating their use for data cleaning.
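
The more conservative behavior attributed to Dataset Cartography can be pictured with a small sketch of its training-dynamics statistics: confidence is the mean probability assigned to the gold label across epochs, and variability is the standard deviation of that probability. The cut-offs `max_confidence` and `max_variability` below are hypothetical placeholders, not values reported by the paper.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def training_dynamics(
    gold_probs_per_epoch: Sequence[Sequence[float]],
) -> List[Tuple[float, float]]:
    """gold_probs_per_epoch[e][i] is the probability the model assigns to
    the gold label of example i after epoch e.
    Returns one (confidence, variability) pair per example."""
    n_examples = len(gold_probs_per_epoch[0])
    stats = []
    for i in range(n_examples):
        trajectory = [epoch[i] for epoch in gold_probs_per_epoch]
        stats.append((fmean(trajectory), pstdev(trajectory)))
    return stats


def hard_to_learn(stats: Sequence[Tuple[float, float]],
                  max_confidence: float = 0.3,
                  max_variability: float = 0.2) -> List[int]:
    """Indices in the low-confidence, low-variability region, which is
    where mislabeled examples are most often expected to sit."""
    return [i for i, (conf, var) in enumerate(stats)
            if conf <= max_confidence and var <= max_variability]
```

Because an example must stay consistently low across every epoch to land in this region, a rule of this shape will typically flag fewer items than a single-snapshot probability threshold.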

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The methods may generalize to other languages or tasks if the underlying model is adapted accordingly.
Hybrid approaches combining both detection methods could balance precision and recall in error removal.
These filtering techniques might reduce the need for large-scale data collection in resource-constrained settings.
Further work could measure the actual precision of the detected errors through manual inspection.

Load-bearing premise

The fine-tuned rubert-base-cased model provides a reliable signal for genuine label errors instead of model-specific or dataset artifacts.

What would settle it

Human annotation of the examples removed by each method to verify they are actual label errors, followed by retraining and performance comparison.
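
If such a human audit were run on a sample of the removed examples, the detector's precision could be reported with an uncertainty range. The sketch below uses a Wilson score interval; the function name and the default 95% level (`z=1.96`) are assumptions of this illustration, not something the paper describes.

```python
import math
from typing import Tuple


def wilson_interval(confirmed: int, audited: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the share of audited flagged examples
    that annotators confirm as genuine label errors."""
    if audited <= 0:
        raise ValueError("audited must be positive")
    p = confirmed / audited
    denom = 1 + z * z / audited
    center = (p + z * z / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited
                         + z * z / (4 * audited * audited)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

The Wilson interval is chosen here because audit samples tend to be small, and it behaves better than the normal approximation when the observed proportion is near 0 or 1.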

Original abstract

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
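
Since every result here is stated in terms of F1-macro, a short reference computation may help. This is a plain-Python sketch of the standard definition (the unweighted mean of per-class F1, with a class scoring 0 when it has no true positives); the function name is hypothetical and the paper's own evaluation code is not shown in the source.

```python
from typing import Hashable, Sequence


def f1_macro(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes that appear
    in either the gold labels or the predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class is never matched.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro averaging gives each class the same weight regardless of frequency, so the metric is sensitive to mistakes on rare classes.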

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Applies two existing label-error detectors to three Russian corpora and shows targeted removal beats random on all, with gains only on the smallest noisiest set, but offers no evidence the flagged items are genuine errors rather than model artifacts.

The main takeaway is that filtering with Confident Learning or Dataset Cartography improves held-out F1 over random removal on these three Russian datasets, but the gains appear only on the small high-noise corpus while the large low-noise ones show no benefit. The paper does a clean job of including the random-removal controls, which rules out the trivial explanation that any removal helps. That control is the strongest part of the work.
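
The random-removal control can be sketched as follows: the baseline drops exactly as many training examples as the detector flagged, chosen uniformly at random under a fixed seed. The helper names and the seed handling are hypothetical; the paper's sampling procedure is not spelled out in the abstract.

```python
import random
from typing import Iterable, List


def targeted_removal(n_examples: int, flagged: Iterable[int]) -> List[int]:
    """Indices kept after dropping the examples a detector flagged."""
    drop = set(flagged)
    return [i for i in range(n_examples) if i not in drop]


def random_removal_control(n_examples: int, n_removed: int,
                           seed: int = 0) -> List[int]:
    """Indices kept after dropping n_removed examples uniformly at random,
    so the control set has the same size as the targeted one."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(n_examples), n_removed))
    return [i for i in range(n_examples) if i not in drop]
```

Matching the number of removed examples is the point of the design: any gap between the two retrained models then cannot be explained by training set size alone. Repeating the control over several seeds would also show how much of the gap is sampling noise.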

The soft spot is the load-bearing premise. Both methods rely on predictions from rubert-base-cased fine-tuned on the full target corpus, so the examples they flag could simply be the ones the model finds hard rather than actual label mistakes. The abstract reports no human re-labeling, no known-error subset, and no cross-model agreement to check this. Without that, the claim that the methods detect label errors rests on the assumption that the fine-tuned model is a reliable oracle, which is the weakest link.
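
One inexpensive form of the missing cross-model check is to run the same detector with several different models and measure how much the flagged sets overlap. The sketch below is purely illustrative: the helper names, the Jaccard measure, and the voting rule are assumptions, not part of the paper.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Overlap of two flagged-index sets (1.0 means identical sets)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def consensus_flags(flag_sets: Sequence[Iterable[int]],
                    min_votes: int) -> List[int]:
    """Indices flagged by at least min_votes of the models."""
    votes = Counter(i for flags in flag_sets for i in set(flags))
    return sorted(i for i, v in votes.items() if v >= min_votes)
```

Low overlap between models would point toward model-specific artifacts, while examples flagged by most models are better candidates for genuine label errors.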

The scope is narrow: three specific corpora, two off-the-shelf methods, no new algorithm or theory. The results are dataset-dependent in a way that matches intuition, but they do not generalize beyond these cases. Citation pattern looks standard for an empirical comparison.

This is the kind of paper that might interest someone already working on Russian text classification or small noisy datasets who wants a quick data point on whether these two detectors are worth trying. It does not rise to the level of a must-read for the broader data-cleaning literature. I would send it to peer review if the full text supplies the missing validation checks or at least reports inter-annotator agreement on a sample of flagged items; otherwise a desk reject seems reasonable given how little new ground is broken.

Referee Report

1 major / 1 minor

Summary. The manuscript compares Confident Learning and Dataset Cartography for automatic label error detection on three Russian text classification corpora (ru_emotion_e-culture, RuCoLA, TERRa) of varying size and noise. It fine-tunes rubert-base-cased on each full corpus to generate predictions, applies the two methods to flag and remove examples, and evaluates downstream performance against random-removal controls of equal size. The central empirical claim is that targeted removal outperforms random removal on all three corpora and that Confident Learning yields a significant F1-macro gain on the smallest, noisiest dataset, while Dataset Cartography is more conservative; effectiveness is reported to depend on dataset characteristics.

Significance. If the reported performance differences hold under scrutiny, the work supplies a useful empirical comparison of two established filtering techniques on non-English data and illustrates their dataset-dependent utility, especially for small noisy corpora. The random-removal controls are a positive design choice that rules out uniform noise as the sole driver of gains.

major comments (1)

[Abstract] Abstract: the claim that outperforming random removal 'confirm[s] the meaningfulness of the approaches' as label-error detectors is not supported by the described evidence. Both methods rely on predictions from rubert-base-cased fine-tuned on the identical target corpus; without human re-annotation, a known-error subset, or cross-model agreement, it remains possible that the flagged examples reflect model-specific artifacts or dataset biases rather than genuine label errors. The random-removal control addresses uniform noise but does not isolate label-error-specific signal.

minor comments (1)

[Abstract] Abstract: the exact fractions or absolute numbers of examples removed by each method, the precise definition of 'significant' F1-macro improvement, and any statistical tests used are not stated.
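
One way the missing significance check could be supplied is a paired bootstrap over the test set, comparing the model trained on filtered data (A) against the baseline (B). This is a sketch under stated assumptions: the function names are hypothetical, `accuracy` is a stand-in metric, and in practice an F1-macro function would be passed as `metric`.

```python
import random
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Stand-in metric used only to keep this sketch self-contained."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap(y_true: Sequence[int],
                     pred_a: Sequence[int],
                     pred_b: Sequence[int],
                     metric: Callable[[Sequence[int], Sequence[int]], float],
                     n_resamples: int = 1000,
                     seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which system A does NOT beat
    system B. Small values suggest A's advantage is not a sampling fluke."""
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, same items for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        score_a = metric(truth, [pred_a[i] for i in idx])
        score_b = metric(truth, [pred_b[i] for i in idx])
        if score_a - score_b <= 0:
            not_better += 1
    return not_better / n_resamples
```

Resampling the same test items for both systems keeps the comparison paired, which matters when the two models make correlated errors.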

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable feedback on our manuscript. We address the major comment regarding the abstract below.

Referee: [Abstract] Abstract: the claim that outperforming random removal 'confirm[s] the meaningfulness of the approaches' as label-error detectors is not supported by the described evidence. Both methods rely on predictions from rubert-base-cased fine-tuned on the identical target corpus; without human re-annotation, a known-error subset, or cross-model agreement, it remains possible that the flagged examples reflect model-specific artifacts or dataset biases rather than genuine label errors. The random-removal control addresses uniform noise but does not isolate label-error-specific signal.

Authors: We thank the referee for highlighting this important distinction. Our experiments demonstrate that both Confident Learning and Dataset Cartography identify subsets whose removal leads to greater performance improvements than random removal of the same number of examples. This provides evidence that the flagged examples contain signal that is relevant to the model's performance, beyond what would be expected from uniform noise. While we agree that direct validation through human re-annotation would provide stronger confirmation that these are indeed label errors rather than other types of difficult or biased examples, the current design with random controls supports the practical utility of these methods for data filtering. We will revise the abstract to more precisely state that the results confirm the utility of the approaches for improving downstream performance, rather than directly confirming their accuracy as label-error detectors.

revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparison of external methods vs random baseline

The paper applies two established label-error detection techniques (Confident Learning and Dataset Cartography) to three Russian corpora, using a fine-tuned rubert-base-cased model only as a practical detector. It then measures downstream F1-macro after targeted vs random removal. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear in the reported chain. The central claim (targeted removal outperforms random) is a direct experimental outcome on held-out performance and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study relying on standard supervised fine-tuning assumptions; no free parameters, axioms, or invented entities are stated in the abstract.

Reference graph

Works this paper leans on

9 extracted references

[1] Cleanlab: Data-centric AI library for data quality, https://github.com/cleanlab/cleanlab, 2026

[2] Devlin J., Chang M.-W., Lee K., Toutanova K., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of NAACL-HLT, 2019, pp. 4171–4186

[3] Kostya165/ru_emotion_e-culture, HuggingFace Datasets, https://huggingface.co/datasets/Kostya165/ru_emotion_e-culture, 2026

[4] Kuratov Y., Arkhipov M., Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language, Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2019”, 2019, pp. 333–339

[5] Mikhailov V., Shamardina T., Ryabinin M., Pestova A., Smurov I., Artemova E., RuCoLA: Russian Corpus of Linguistic Acceptability, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022, pp. 5765–5785

[6] Northcutt C. G., Jiang L., Chuang I. L., Confident Learning: Estimating Uncertainty in Dataset Labels, Journal of Artificial Intelligence Research, vol. 70, 2021, pp. 1373–1411

[7] Pang J., Wei J., Shah A. P., Zhu Z., Wang Y., Qian C., Liu Y., Bao Y., Wei W., Improving Data Efficiency via Curating LLM-Driven Rating Systems, Proceedings of the 13th International Conference on Learning Representations (ICLR), 2025

[8] Shavrina T. et al., RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4717–4726

[9] Swayamdipta S. et al., Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9275–9293
