CRAFT: Clustered Regression for Adaptive Filtering of Training data

Asheswari Swain; Parthasarathi Panda; Subhrakanta Panda

arxiv: 2604.22693 · v1 · submitted 2026-04-24 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-08 11:52 UTC · model grok-4.3

keywords: data selection · machine translation · clustering · KL divergence · fine-tuning · sequence-to-sequence models · adaptive filtering

The pith

CRAFT selects high-quality training subsets from large corpora by clustering source embeddings and matching target distributions within clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CRAFT as a method to pick small effective subsets from tens of millions of sequence-to-sequence training pairs for fine-tuning. It splits the task into proportional budget allocation across k-means clusters on source embeddings to match the validation source distribution, followed by target selection inside each cluster that minimizes a conditional expected distance to the validation targets. The authors prove that this proportional allocation bounds the continuous KL divergence between the chosen subset and the validation set, with the remaining gap limited by how large the clusters are. On English-to-Hindi translation drawn from 33 million NLLB pairs and fine-tuned with mBART via LoRA, CRAFT reaches 43.34 BLEU while running more than 40 times faster than the TSDS baseline on the same data pool. The pipeline also finishes in under one minute on CPU when TF-IDF vectors are used instead of embeddings.

Core claim

CRAFT performs two-stage selection: first allocate the training budget proportionally across k-means clusters on source embeddings to approximate the validation source distribution, then inside each cluster pick pairs whose target embeddings minimize the conditional expected distance to the validation target distribution. Proportional cluster allocation is shown to bound the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters. When applied to 33 million English-Hindi pairs for mBART fine-tuning, this yields 43.34 BLEU, 2.13 points above TSDS, at over 40 times the selection speed.
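The first stage can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `proportional_allocation`, the largest-remainder rounding rule, and the toy cluster labels are all assumptions made for the example. It also ignores the case where a cluster holds fewer training pairs than its allocated share.

```python
from collections import Counter

def proportional_allocation(val_cluster_ids, budget):
    """Split `budget` across clusters in proportion to validation cluster frequencies.

    Largest-remainder rounding (an assumption of this sketch) makes the
    per-cluster counts sum exactly to `budget`.
    """
    counts = Counter(val_cluster_ids)
    total = sum(counts.values())
    # Ideal fractional share of the budget for each cluster.
    ideal = {c: budget * n / total for c, n in counts.items()}
    alloc = {c: int(share) for c, share in ideal.items()}
    # Hand leftover slots to the clusters with the largest fractional parts.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(ideal, key=lambda c: (-(ideal[c] - alloc[c]), c))
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc

# Toy validation set: 5 sources in cluster 0, 3 in cluster 1, 2 in cluster 2.
val_ids = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
print(proportional_allocation(val_ids, budget=7))  # {0: 4, 1: 2, 2: 1}
```

With a budget of 7 the ideal shares are 3.5, 2.1 and 1.4, so the single leftover slot goes to cluster 0.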

What carries the argument

Proportional cluster allocation on source embeddings combined with conditional expected-distance selection on target embeddings within each cluster.
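The within-cluster step might look roughly like the following. The summary does not give the exact form of the conditional expected distance, so this sketch assumes the simplest reading: score each training pair by the mean Euclidean distance from its target embedding to the validation target embeddings in the same source cluster, and keep the lowest-scoring pairs. The names `mean_distance` and `select_in_cluster` are hypothetical.

```python
import math

def mean_distance(vec, refs):
    """Average Euclidean distance from `vec` to every reference vector."""
    return sum(math.dist(vec, r) for r in refs) / len(refs)

def select_in_cluster(train_targets, val_targets, n_select):
    """Return indices of the `n_select` training pairs whose target embeddings
    are closest, on average, to the validation targets of this cluster.

    Assumption: mean Euclidean distance stands in for the paper's
    conditional expected distance, which is not specified here.
    """
    scored = sorted(
        (mean_distance(t, val_targets), i) for i, t in enumerate(train_targets)
    )
    return [i for _, i in scored[:n_select]]

# Toy 2-D target embeddings for one source cluster.
train_targets = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.1)]
val_targets = [(1.0, 1.0), (1.0, 1.2)]
print(select_in_cluster(train_targets, val_targets, n_select=2))  # [1, 3]
```

In a full pipeline, `n_select` for each cluster would come from the allocation stage.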

Load-bearing premise

k-means clusters on source embeddings together with conditional target selection inside clusters sufficiently approximate the joint source-target distribution for the KL bound to remain useful.
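One way to see why this premise matters is the standard chain rule for KL divergence, stated here as background rather than as the paper's theorem. For two joint distributions $P$ and $Q$ over source-target pairs $(x, y)$ (with $P$ as the selected subset and $Q$ as the validation set, following the referee report's notation), the joint divergence splits into a source term and a conditional target term:

```latex
\mathrm{KL}\big(P(x,y)\,\|\,Q(x,y)\big)
  = \mathrm{KL}\big(P(x)\,\|\,Q(x)\big)
  + \mathbb{E}_{x \sim P(x)}\Big[\mathrm{KL}\big(P(y \mid x)\,\|\,Q(y \mid x)\big)\Big]
```

Under this reading, proportional allocation targets the first term and within-cluster target selection targets the second; replacing the exact source $x$ with its cluster is the approximation the premise relies on.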

What would settle it

Direct measurement showing that the empirical KL divergence between the selected subset and validation set exceeds the diameter-controlled bound, or that BLEU gains disappear when the same selection is repeated on a different validation split.
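A coarse version of the first check can be run on cluster histograms alone. The sketch below computes a discrete KL divergence between the cluster-membership distributions of the selected subset and the validation set; it is only a proxy for the continuous KL the paper bounds, and the function name `cluster_kl`, the direction KL(selected || validation), and the `eps` smoothing are assumptions of this example.

```python
import math
from collections import Counter

def cluster_kl(sel_ids, val_ids, eps=1e-12):
    """Discrete KL(selected || validation) over cluster-membership histograms.

    `eps` guards against clusters that appear in the selection but not in
    the validation set (an assumption; other smoothing choices are possible).
    """
    sel, val = Counter(sel_ids), Counter(val_ids)
    n_sel, n_val = len(sel_ids), len(val_ids)
    kl = 0.0
    for c, n in sel.items():
        p = n / n_sel                          # selected mass in cluster c
        q = max(val.get(c, 0) / n_val, eps)    # validation mass in cluster c
        kl += p * math.log(p / q)
    return kl

# Matching proportions give zero divergence; a skewed selection does not.
print(cluster_kl([0, 1], [0, 0, 1, 1]))        # 0.0
print(cluster_kl([0, 0, 0, 1], [0, 0, 1, 1]) > 0.0)  # True
```

The second check, stability across validation splits, would repeat selection and fine-tuning on a held-out split and is not covered by this snippet.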

Figures

Figures reproduced from arXiv: 2604.22693 by Asheswari Swain, Parthasarathi Panda, Subhrakanta Panda.

**Figure 1.** Illustrates how this differs from direct distribution matching. The x-axis represents the source embedding, the y-axis the target embedding, and the dashed lines indicate cluster boundaries. Naïve distribution matching selects points that cover the full joint distribution but includes many pairs far from the conditional structure. Our approach selects points along the conditional relationship, concent…

**Figure 2.** The end-to-end pipeline used across all experiments.

Original abstract

Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT's 75.6 seconds, a 2.8× speedup.
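The headline figures in the abstract can be cross-checked with simple arithmetic. The snippet below uses only numbers quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract.
craft_bleu, tsds_bleu, tarot_bleu = 43.34, 41.21, 45.61
craft_sec, tarot_sec = 26.86, 75.6

gain_over_tsds = round(craft_bleu - tsds_bleu, 2)    # BLEU gained over TSDS
gap_to_tarot = round(tarot_bleu - craft_bleu, 2)     # BLEU still behind TAROT
speedup_vs_tarot = round(tarot_sec / craft_sec, 1)   # selection-time ratio

print(gain_over_tsds)    # 2.13
print(gap_to_tarot)      # 2.27
print(speedup_vs_tarot)  # 2.8
```

The reported 2.13-point gain and 2.8× speedup are consistent with the raw numbers, and the remaining gap to TAROT (2.27 BLEU) is slightly larger than the gain over TSDS.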

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CRAFT gives a faster two-stage selection method with a formal KL bound, but the bound's practical tightness is not shown and the quality gains are modest compared to TAROT.

CRAFT's core move is to split the selection into proportional allocation across source k-means clusters followed by within-cluster target picking that minimizes expected distance to the validation targets. They prove that the proportional step bounds the continuous KL between the selected and validation distributions, with the leftover error set by cluster diameters. That combination plus the speed numbers is what stands out from the TSDS and TAROT baselines they cite.

The paper does a few things cleanly. It reports a concrete 2.13 BLEU lift over TSDS on the same 33-million NLLB pool and mBART-LoRA setup, while running the selection more than 40 times faster. It also shows the pipeline works with plain TF-IDF and finishes in under a minute on CPU, which matters for people who cannot afford heavy embedding models just to filter data. The two-stage decomposition itself is a straightforward way to handle the joint source-target distribution at scale.

The soft spots are real but not fatal. The KL bound's residual term depends on cluster diameters, yet the abstract supplies no diameter statistics, no sensitivity checks on k, and no numerical evaluation of how large the residual actually is on their data. Without those, the guarantee stays formal and does not clearly explain the BLEU improvement; the within-cluster heuristic could be carrying most of the weight. TAROT still reaches 45.61 BLEU, so CRAFT is trading some quality for speed rather than dominating on both. No error bars appear on the reported scores either.

This work is aimed at researchers who fine-tune sequence-to-sequence models on corpora too large for full training and need a practical filter that is faster than current options. A reader already running data-selection experiments on MT or similar tasks would find a usable algorithm and a starting theoretical claim. It deserves peer review because the method is coherent, the empirical comparison is head-to-head, and the proof sketch is present even if it requires more supporting measurements in revision.

Referee Report

2 major / 1 minor

Summary. The paper proposes CRAFT, a two-stage method for selecting high-quality subsets from large training corpora (e.g., 33M NLLB pairs) for fine-tuning seq2seq models such as mBART via LoRA. Stage (i) applies k-means clustering to source embeddings and allocates the selection budget proportionally across clusters to match the validation source distribution; stage (ii) selects, within each cluster, the training pairs whose target embeddings minimize a conditional expected distance to the validation target distribution. The authors prove that the proportional-allocation rule bounds the continuous KL divergence between the selected and validation distributions, with the residual term controlled by cluster diameters. On English-Hindi translation, CRAFT reports 43.34 BLEU (vs. TSDS at 41.21) while completing selection >40× faster than TSDS and 2.8× faster than TAROT; with TF-IDF vectorization the pipeline runs in <1 minute on CPU.

Significance. If the KL bound is tight in practice and the BLEU gains are attributable to the controlled distributional approximation rather than the particular within-cluster heuristic, CRAFT would supply a computationally lightweight, theoretically grounded alternative to existing data-filtering techniques for large-scale machine translation and other seq2seq tasks. The reported speed-ups and vectorization flexibility are concrete practical strengths.

major comments (2)

[Abstract] Abstract: the central claim that proportional cluster allocation bounds continuous KL(source_selected || source_val) with residual governed by cluster diameters is presented without any reported cluster-diameter statistics, sensitivity analysis over k, or numerical evaluation of the residual term on the 33 M NLLB pool. Because the usefulness of the bound for explaining the 2.13 BLEU improvement rests on the diameters being small relative to the embedding scale, the absence of these measurements leaves the distributional guarantee formal rather than empirically supported.
[Abstract] Abstract / evaluation section: the reported 43.34 BLEU is given as a single point estimate with no error bars, no multiple random seeds, and no ablation isolating the contribution of the proportional-allocation rule versus the conditional expected-distance heuristic inside clusters. This makes it impossible to determine whether the observed gain over TSDS is robust or driven by the KL-bound mechanism.

minor comments (1)

[Abstract] The abstract states the method is 'vectorization-agnostic' yet immediately reports TF-IDF results; clarify whether source embeddings are required for the clustering step or whether the method truly operates with arbitrary vectorizers.
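To make the vectorizer question concrete: a TF-IDF front end only needs to map each sentence to a fixed-length vector, after which clustering and distance computations proceed as with dense embeddings. The sketch below is a toy pure-Python TF-IDF, not the paper's vectorizer and not any library's exact formula; the function name `tfidf_vectors`, whitespace tokenization, and the `log(N/df) + 1` weighting are assumptions, and it expects non-empty sentences.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Map whitespace-tokenized sentences to dense TF-IDF vectors.

    Assumptions of this sketch: lowercase whitespace tokens, term frequency
    normalized by sentence length, idf = log(N / df) + 1.
    """
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # Document frequency: in how many sentences each word occurs.
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * idf[w] for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["the cat sat", "the dog sat down"])
print(vocab)  # ['cat', 'dog', 'down', 'sat', 'the']
```

Words that appear in only one sentence ("cat") receive a higher weight than words shared by both ("the"), which is the property that lets sparse lexical vectors stand in for embeddings in the clustering step.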

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, acknowledging where the current manuscript is limited and outlining specific revisions that will strengthen the empirical support for the theoretical claims and the robustness of the reported results.

Referee: [Abstract] Abstract: the central claim that proportional cluster allocation bounds continuous KL(source_selected || source_val) with residual governed by cluster diameters is presented without any reported cluster-diameter statistics, sensitivity analysis over k, or numerical evaluation of the residual term on the 33 M NLLB pool. Because the usefulness of the bound for explaining the 2.13 BLEU improvement rests on the diameters being small relative to the embedding scale, the absence of these measurements leaves the distributional guarantee formal rather than empirically supported.

Authors: We agree that the manuscript would benefit from empirical measurements to demonstrate that the theoretical bound is practically meaningful. In the revised version we will add: (i) average and maximum cluster diameters for the k-means clustering performed on the 33 M NLLB source embeddings, (ii) a sensitivity table showing BLEU and estimated residual values for k in {10, 50, 100}, and (iii) a direct numerical evaluation of the residual term on the selected subset to confirm it remains small relative to the embedding scale. These additions will make the connection between the KL bound and the observed 2.13 BLEU gain explicit rather than purely formal. revision: yes
Referee: [Abstract] Abstract / evaluation section: the reported 43.34 BLEU is given as a single point estimate with no error bars, no multiple random seeds, and no ablation isolating the contribution of the proportional-allocation rule versus the conditional expected-distance heuristic inside clusters. This makes it impossible to determine whether the observed gain over TSDS is robust or driven by the KL-bound mechanism.

Authors: We acknowledge that a single-run result and the lack of targeted ablations limit the ability to attribute gains specifically to the proportional-allocation rule. In the revision we will report BLEU scores averaged over three independent random seeds with standard deviations for both CRAFT and the TSDS baseline. We will also add an ablation that replaces proportional cluster allocation with uniform allocation while retaining the within-cluster conditional selection; the resulting performance difference will isolate the contribution of the KL-bounding step. These experiments are feasible within the existing experimental framework and will be included in the updated evaluation section. revision: yes
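The two promised additions are easy to prototype. The sketch below shows a uniform-allocation baseline of the kind the ablation describes, plus a mean and sample standard deviation helper for multi-seed reporting. Function names are hypothetical, and the scores passed to `mean_and_std` are placeholder numbers, not results from the paper.

```python
import statistics

def uniform_allocation(cluster_ids, budget):
    """Ablation baseline: give every cluster an equal share of the budget,
    ignoring how the validation set is distributed across clusters."""
    clusters = sorted(set(cluster_ids))
    base, extra = divmod(budget, len(clusters))
    # The first `extra` clusters (by id) absorb the remainder.
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(clusters)}

def mean_and_std(scores):
    """Mean and sample standard deviation across independent seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

# Toy labels: 5 validation sources in cluster 0, 3 in cluster 1, 2 in cluster 2.
val_ids = [0, 0, 0, 0, 0, 1, 1, 1, 2, 2]
print(uniform_allocation(val_ids, budget=7))  # {0: 3, 1: 2, 2: 2}

# Placeholder values only, NOT scores from the paper.
print(mean_and_std([1.0, 2.0, 3.0]))  # (2.0, 1.0)
```

Comparing downstream scores under this allocation against the proportional one, with the within-cluster step held fixed, is what would isolate the allocation rule's contribution.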

Circularity Check

0 steps flagged

No circularity: KL bound follows directly from proportional allocation and cluster properties

The paper's central derivation states that proportional budget allocation across k-means clusters on source embeddings bounds the continuous KL divergence between selected and validation source distributions, with the residual term controlled by cluster diameters. This is presented as a mathematical consequence of the two-stage selection rule (proportional allocation followed by within-cluster conditional expected-distance selection on targets) rather than a self-referential definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The empirical BLEU results (43.34 vs. 41.21) are reported as separate experimental evaluations on the 33M NLLB pool and mBART fine-tuning, not as quantities derived from or forced by the bound itself. No ansatz smuggling, uniqueness theorems, or renaming of known results appear in the derivation chain. The method is self-contained against external benchmarks.
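A small worked identity shows why the cluster-level part of such a bound is not circular. This is an illustration under idealized assumptions (no rounding, every cluster has enough candidates), not a reproduction of the paper's proof. Let $q_c$ be the fraction of validation sources in cluster $c$, $B$ the budget, and $n_c = B\,q_c$ the proportional allocation. The selected subset's cluster marginal $\hat{p}_c$ then equals $q_c$, so the divergence between cluster marginals vanishes:

```latex
\hat{p}_c = \frac{n_c}{B} = q_c
\quad\Longrightarrow\quad
\sum_{c} \hat{p}_c \log \frac{\hat{p}_c}{q_c} = 0
```

Whatever mismatch remains must come from how points are distributed inside each cluster, which is where a diameter-dependent residual would enter.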

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard assumptions about k-means and embedding spaces rather than new postulates; no invented entities are introduced.

free parameters (1)

number of clusters k
Hyperparameter controlling the granularity of source clustering; value not specified in abstract.
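For readers who want to see where k enters, here is a minimal Lloyd-style k-means in pure Python. It is a teaching sketch under simplifying assumptions: initialization takes the first k points (deterministic but naive), a fixed number of iterations is run, and nothing here reflects the clustering implementation or the value of k used in the paper.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: alternate nearest-centroid assignment
    and centroid recomputation for a fixed number of iterations."""
    centroids = [tuple(p) for p in points[:k]]  # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centroids = kmeans(points, k=2)
print(labels)     # [0, 1, 0, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```

Larger k yields smaller clusters, which is the lever the diameter-controlled residual depends on, at the cost of fewer validation points per cluster.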

axioms (1)

Domain assumption: k-means clustering on source embeddings produces clusters whose diameters control residual distribution mismatch.
Invoked to bound the KL divergence after proportional allocation.
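Whether this assumption holds on a given dataset can be probed by measuring cluster diameters directly. The sketch below defines the diameter as the largest pairwise Euclidean distance inside a cluster, which is one common definition and an assumption here. The brute-force pairwise loop is quadratic in cluster size, so at the scale of tens of millions of pairs it would need sampling or an approximation; the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def cluster_diameters(points, labels):
    """Largest pairwise distance within each cluster (0.0 for singletons)."""
    groups = defaultdict(list)
    for p, c in zip(points, labels):
        groups[c].append(p)
    return {
        c: max((math.dist(a, b) for a, b in combinations(g, 2)), default=0.0)
        for c, g in groups.items()
    }

def diameter_summary(diameters):
    """Average and maximum diameter across clusters."""
    values = list(diameters.values())
    return sum(values) / len(values), max(values)

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
labels = [0, 0, 1, 1, 2]
diams = cluster_diameters(points, labels)
print(diams)  # {0: 5.0, 1: 0.0, 2: 0.0}
```

Reporting the output of `diameter_summary` relative to the typical norm of the embeddings would indicate whether the residual term is small in practice.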

Reference graph

Works this paper leans on

3 extracted references · 1 canonical work pages

[1] Albalak, A., Elazar, Y., Xie, S. M., Longpre, S., Lambert, N., Wang, X., Muennighoff, N., Hou, B., Pan, L., Jeong, H., et al. (2024). A survey on data selection for language models. arXiv preprint arXiv:2402.16827. Anthropic (2024). Claude: A family of large language models. Anthropic. Accessed:

[2] Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Goldstein, J., Lavie, A., Lin, C.-Y., and Voss, C., editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Ass...

[3] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86. Liu, Z., Karbasi, A., and Rekatsinas, T. (2024). TSDS: Data selection for task-specific model finetuning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. Liu, Z., Zhou, K., Zhao, W. X., Gao, D., ...
