EvoSelect: Data-Efficient LLM Evolution for Targeted Task Adaptation

Hanghang Tong; Jiaru Zou; Jingrui He; Sirui Chen; Tianxin Wei; Ting-Wei Li; Yingbing Huang

arxiv: 2604.26170 · v1 · submitted 2026-04-28 · 💻 cs.CL


Ting-Wei Li, Sirui Chen, Jiaru Zou, Yingbing Huang, Tianxin Wei, Jingrui He, Hanghang Tong

Pith reviewed 2026-05-07 15:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords: LLM adaptation · data selection · synthetic data · optimal transport · task alignment · diversification · iterative training


The pith

EvoSelect improves LLM adaptation to targeted tasks by selecting aligned and diverse synthetic data before each training update.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that inserting a deliberate selection step between data generation and model training allows LLMs to adapt more effectively to specific tasks while using fewer or lower-quality generated examples. It claims this works by scoring each candidate sample for how closely it matches the target task and then ensuring the chosen set covers different useful aspects rather than repeating similar ones. A reader would care because human-labeled data remains expensive and scarce, so any method that extracts more signal from cheap synthetic data reduces cost and risk of performance degradation. The method is tested across benchmarks and shown to beat prior selection techniques whether the generator itself is weak or already strong.

Core claim

EvoSelect refines the standard iterative generation-training loop into a generation-selection-training loop in which candidate samples are scored for task relevance via optimal transport distances computed on proxy gradient representations and then further filtered by a diversification step that favors complementary coverage, producing training batches that drive progressive improvement of the LLM toward the desired task distribution.

What carries the argument

Joint scoring of task alignment through optimal transport on proxy gradient representations together with a diversification mechanism that selects complementary samples.
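The two ingredients named above can be sketched on toy vectors. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vectors stand in for proxy gradient representations, the entropic (Sinkhorn) solver, the squared-Euclidean cost, the greedy rule, and every name below (`sinkhorn_plan`, `alignment_costs`, `select`, `diversity_weight`) are hypothetical choices made for the example.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sinkhorn_plan(cost, reg=0.5, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations.

    Toy-scale only: exp(-cost / reg) can underflow when costs are very large.
    """
    n, m = len(cost), len(cost[0])
    a, b = 1.0 / n, 1.0 / m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def alignment_costs(candidates, targets):
    """Average transport cost paid by each candidate (lower = better aligned)."""
    cost = [[sq_dist(c, t) for t in targets] for c in candidates]
    plan = sinkhorn_plan(cost)
    return [
        sum(p * c for p, c in zip(plan_row, cost_row)) / sum(plan_row)
        for plan_row, cost_row in zip(plan, cost)
    ]

def select(candidates, targets, budget, diversity_weight=0.3):
    """Greedy pick: low transport cost plus distance to what is already chosen.

    The trade-off rule is an assumption made for this sketch.
    """
    costs = alignment_costs(candidates, targets)
    chosen = []
    while len(chosen) < min(budget, len(candidates)):
        def gain(i):
            spread = min(
                (sq_dist(candidates[i], candidates[j]) for j in chosen),
                default=0.0,
            )
            return -costs[i] + diversity_weight * spread
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        chosen.append(max(remaining, key=gain))
    return chosen

# Toy data: two near-duplicates, one complementary point, one misaligned point.
candidates = [[0.0, 0.1], [0.0, 0.12], [1.0, 0.9], [5.0, 5.0]]
targets = [[0.0, 0.0], [1.0, 1.0]]
selected = select(candidates, targets, budget=2)
```

On this toy input the misaligned point pays a large transport cost and the second near-duplicate earns almost no diversity bonus, so the selection should keep one of the duplicates together with the complementary point.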

If this is right

Models reach higher performance on targeted tasks using the same number of generated samples.
The gains hold whether the external data generator is weak or already strong.
Redundancy and misalignment in synthetic data are reduced before they dilute the training signal.
Iterative evolution proceeds more steadily across multiple rounds of generation and update.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selection logic could be tested on non-language tasks that also rely on iterative synthetic-data training.
Alternative proxy representations for alignment might be substituted if gradient-based ones prove expensive for very large models.
The approach highlights the value of treating data selection as an explicit optimization problem rather than a post-hoc filter.

Load-bearing premise

That distances computed by optimal transport on proxy gradient representations give an accurate, unbiased measure of how well a sample aligns with the target task distribution, and that adding diversification reliably aids learning without creating new biases.

What would settle it

Rerunning the same generators and benchmarks and finding that EvoSelect's final model performance is no higher than that of random selection or existing data-selection baselines.

Figures

Figures reproduced from arXiv: 2604.26170 by Hanghang Tong, Jiaru Zou, Jingrui He, Sirui Chen, Tianxin Wei, Ting-Wei Li, Yingbing Huang.

**Figure 1.** Issues of synthetic training samples and naive …

**Figure 2.** Trade-off between target alignment and diversity. Left: Attribution (…

**Figure 3.** Performance gain over 3B base model. The figure shows that only EVOSELECT has consistent improvement across selection ratios and datasets.

**Figure 4.** Performance gain vs. task difficulty. EVOSELECT's advantage becomes more significant as task difficulty increases (i.e., when base performance is lower).

**Figure 5.** Performance relative to full-data training.

**Figure 6.** Win-rate across task clusters. EVOSELECT attains the highest rank-1 rate across all task clusters, demonstrating robust top-performing behavior across different task domains.

Original abstract

Adapting large language models (LLMs) to a targeted task efficiently and effectively remains a fundamental challenge. Such adaptation often requires iteratively improving the model toward a targeted task, yet collecting high-quality human-labeled data to support this process is costly and difficult to scale. As a result, synthetic data generation has emerged as a flexible and scalable alternative. One straightforward approach is through an iterative generation-training loop, where candidate data are synthesized through an external generator, the model is updated using these data and the process is repeated over iterations. However, generated samples can be noisy, highly redundant, or even misaligned with the targeted task distribution. Training indiscriminately on such data can dilute useful learning signals and even degrade model performance. To address this, we introduce a refined paradigm, namely an iterative generation-selection-training loop, which incorporates a selection step prior to model updates. Building on this paradigm, we propose EvoSelect, a data-efficient framework to evolve LLM effectively. Given candidate samples produced by the data generator, EvoSelect selects training data by jointly modeling targeted task alignment and diversity. We estimate task relevance through optimal transport with proxy gradient representations, which quantifies how well candidate samples align with the targeted task distribution. To mitigate redundancy, we incorporate a diversification mechanism that promotes coverage of complementary training samples. By interleaving alignment and diversification, EvoSelect enables progressive LLM evolution toward targeted tasks. Extensive experiments on various benchmarks demonstrate that with either weak or strong data generators, EvoSelect consistently improves adaptation efficacy over existing data selection methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

EvoSelect wraps the synthetic data loop with OT selection on proxy gradients plus diversification, but the proxy alignment claim rests on an assumption that needs direct checks.


The main takeaway is that this paper adds a selection step to the usual iterative generate-and-train process for adapting LLMs on synthetic data. EvoSelect scores candidate samples for task alignment using optimal transport distances computed from proxy gradient representations, then applies a diversification step to reduce redundancy before the next training round. The result is framed as a more controlled evolution toward the target task distribution, and the abstract reports consistent gains over existing selection methods whether the generator is weak or strong.

That joint alignment-plus-diversity loop inside the iteration is the clearest new element relative to simpler baselines. It directly targets the practical issue of noisy or repetitive synthetic data diluting the signal, and the choice to keep the method generator-agnostic makes it more usable in real pipelines. The experiments are described as extensive across benchmarks, which at least suggests the authors tested the idea under varied conditions.

The soft spot is the unverified link between the proxy gradient OT distances and actual learning gains. Proxy representations from an initial or surrogate model may not preserve the directions that matter for the target task, especially when the generator produces misaligned samples; without a correlation study or ablation showing that lower OT distance predicts bigger downstream improvements, the selection could retain low-value data or drop useful ones. The diversification step is a reasonable heuristic, but it could introduce its own bias if it over-penalizes useful clusters. The abstract does not detail statistical significance, exact baseline implementations, or controls for total data volume, so the reported improvements need closer inspection in the full text.

This work is aimed at researchers doing targeted fine-tuning or synthetic data curation rather than core theory. A reader already running generation-training loops would get a concrete recipe worth trying, and the claims are specific enough to be falsifiable. It deserves peer review because the problem is real and the proposed fix is straightforward to implement and test, even if the proxy assumption will likely require extra evidence or ablations.
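The correlation study the letter asks for could start from a rank correlation between each selected subset's OT distance and the downstream gain it produces. The sketch below is one hedged way to do that with the standard library only; the function names are hypothetical, and the inputs are whatever per-subset measurements an experimenter collects, not values reported in the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(ot_distances, downstream_gains):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(ot_distances), average_ranks(downstream_gains))
```

If the proxy is informative, the coefficient should be clearly negative (smaller distance, larger gain); a value near zero would support the concern raised above.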

Referee Report

3 major / 2 minor

Summary. The paper introduces EvoSelect, an iterative generation-selection-training framework for data-efficient adaptation of LLMs to targeted tasks. Given synthetic candidate samples from a generator (weak or strong), it selects a subset by estimating task alignment via optimal transport distances computed on proxy gradient representations and interleaving this with a diversification step to promote coverage of complementary samples. The central claim is that this yields consistent improvements in adaptation efficacy over existing data selection methods across various benchmarks.

Significance. If the empirical results hold under rigorous verification, EvoSelect would offer a practical, scalable approach to mitigating noise and redundancy in synthetic data for targeted LLM fine-tuning, potentially reducing the need for costly human annotations while enabling progressive model evolution. The use of optimal transport on gradients and explicit diversification are standard tools but combined here in a targeted iterative loop; credit is due for the reproducible experimental protocol if code and full details are released.

major comments (3)

[§4 and §3.2] §4 (Experiments) and §3.2 (Proxy Gradient OT): the central claim that lower OT distances on proxy gradients reliably quantify alignment with the target task distribution is load-bearing, yet the manuscript provides no correlation analysis, ablation, or derivation showing that selected samples with smaller OT distances produce larger downstream gains than random or baseline selection. This leaves the improvement over baselines dependent on an unverified proxy, especially for weak generators where the representation space may not preserve relevant directions.
[Table 2 and Figure 3] Table 2 and Figure 3 (main results): the reported consistent improvements lack details on statistical significance testing, number of runs, variance, or controls for confounds such as total data volume, learning rate schedules, or generator-specific biases. Without these, it is unclear whether the gains are robust or attributable to the selection mechanism itself.
[§3.3] §3.3 (Diversification mechanism): the claim that interleaving diversification reliably improves learning without introducing new biases requires explicit ablation showing that the diversification heuristic does not discard high-alignment samples or favor low-value diversity; current description relies on standard heuristics without quantifying the trade-off against pure alignment selection.

minor comments (2)

[§3.1] Notation for proxy gradients and OT formulation in §3.1 should be clarified with explicit equations for the cost function and transport plan to avoid ambiguity in implementation.
[Abstract and §1] The abstract and introduction would benefit from a concise statement of the precise baselines compared (e.g., random, perplexity-based, gradient-based) and the exact benchmarks used.
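For reference, the kind of explicit statement the first minor comment asks for could take the standard discrete (Kantorovich) form below. This is a generic textbook formulation offered as a sketch, not the paper's notation: the symbols $g_i$ (candidate proxy gradient features), $h_j$ (target-task features), the marginal weights $a$ and $b$, and the squared-Euclidean cost are all assumptions.

```latex
\[
\mathrm{OT}(a, b) \;=\; \min_{P \in \Pi(a, b)} \sum_{i=1}^{n} \sum_{j=1}^{m} P_{ij}\, C_{ij},
\qquad
\Pi(a, b) \;=\; \bigl\{ P \in \mathbb{R}_{\ge 0}^{n \times m} : P \mathbf{1}_m = a,\; P^{\top} \mathbf{1}_n = b \bigr\},
\]
\[
C_{ij} \;=\; \lVert g_i - h_j \rVert_2^2 .
\]
```

Here $P$ is the transport plan and $C$ the cost matrix; stating which cost and which marginals the method actually uses would remove the ambiguity the referee points to.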

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with point-by-point responses and indicate planned revisions to strengthen the manuscript.


Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Proxy Gradient OT): the central claim that lower OT distances on proxy gradients reliably quantify alignment with the target task distribution is load-bearing, yet the manuscript provides no correlation analysis, ablation, or derivation showing that selected samples with smaller OT distances produce larger downstream gains than random or baseline selection. This leaves the improvement over baselines dependent on an unverified proxy, especially for weak generators where the representation space may not preserve relevant directions.

Authors: We agree that a direct correlation analysis would provide stronger support for the proxy. The current results demonstrate end-to-end gains across benchmarks, but to address this concern we will add (i) a correlation plot between OT distances and downstream task improvements, (ii) an ablation comparing selection by OT distance versus random/baseline selection, and (iii) additional discussion of why proxy gradients preserve task-relevant directions even for weaker generators. These additions will be placed in a new subsection of §3.2 and expanded experiments in §4. revision: yes

Referee: [Table 2 and Figure 3] Table 2 and Figure 3 (main results): the reported consistent improvements lack details on statistical significance testing, number of runs, variance, or controls for confounds such as total data volume, learning rate schedules, or generator-specific biases. Without these, it is unclear whether the gains are robust or attributable to the selection mechanism itself.

Authors: We acknowledge the need for these details. The experiments were conducted with three random seeds and fixed data volumes across methods, but reporting was incomplete. We will revise Table 2 and Figure 3 to include standard deviations, results of paired t-tests for significance, and explicit statements confirming that learning-rate schedules, total token counts, and generator configurations were held constant. A new paragraph in §4 will describe these controls. revision: yes

Referee: [§3.3] §3.3 (Diversification mechanism): the claim that interleaving diversification reliably improves learning without introducing new biases requires explicit ablation showing that the diversification heuristic does not discard high-alignment samples or favor low-value diversity; current description relies on standard heuristics without quantifying the trade-off against pure alignment selection.

Authors: We will add the requested ablation. A new experiment will compare (a) pure OT-based alignment selection, (b) the interleaved alignment-plus-diversification procedure, and (c) pure diversity selection. Metrics will include average OT distance of the selected set, coverage of task-relevant directions, and final downstream performance. This will be reported in §3.3 and §4 to quantify any trade-offs and confirm that high-alignment samples are retained. revision: yes
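The paired t-test promised in the second response can be sketched in a few lines. This is a minimal stdlib illustration, assuming one score per seed for each of two methods; `paired_t` and `T_CRIT_DF2` are hypothetical names, and no numbers from the paper are used.

```python
import math
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t with 2 degrees of freedom,
# i.e. the threshold that applies when only three seeds are paired.
T_CRIT_DF2 = 4.303

def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

def significant_with_three_seeds(scores_a, scores_b):
    """True if |t| exceeds the df=2 threshold; only valid for exactly three pairs."""
    t_stat, dof = paired_t(scores_a, scores_b)
    if dof != 2:
        raise ValueError("threshold is hard-coded for three seeds")
    return abs(t_stat) > T_CRIT_DF2
```

With three seeds the threshold is high, so even a consistent per-seed advantage can fail to reach significance; reporting the per-seed differences alongside the test would make that visible.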

Circularity Check

0 steps flagged

No circularity detected in EvoSelect's selection framework


The paper defines EvoSelect via standard optimal transport on proxy gradients for task alignment plus an independent diversification step; neither component is defined in terms of the downstream performance metric, nor does any prediction reduce to a fitted input by construction. The abstract and described method contain no self-referential equations, no load-bearing self-citations, and no renaming of known results as novel derivations. Empirical gains are reported on external benchmarks, leaving the framework self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that proxy gradient representations combined with optimal transport yield a reliable alignment signal, plus the assumption that diversification adds net positive value to training.

axioms (1)

domain assumption Proxy gradient representations can be used to quantify alignment between candidate samples and the targeted task distribution via optimal transport
This is the core mechanism for the selection step as described in the abstract.



Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1] Carefully analyze the question and all options

[2] Provide a concise 1-2 sentence rationale explaining your reasoning

[3] Select the correct answer with confidence Always provide a 1-2 sentence brief explanation before giving the final answer. < Instruction Prompt > Answer the following multiple-choice question by selecting the correct option. Provide a concise 1-2 sentence rationale explaining your reasoning, then give your answer in the format: Answer: [LETTER] ... (2022)

[4] Attribution (Xia et al., 2024) selects top-k samples with highest scores, which are computed by the inner product between training and the average validation representations. Gradients are obtained from proxy model and no warm-up training is involved (2024)

[5] Diversity (Jung et al., 2025) first runs K-means clustering over the training representations, then keeps filling the candidate pool from the smallest clusters until the budget is exhausted. We follow the default configuration: cluster ratio = 0.1 (i.e. #clusters = #samples * cluster ratio) (2025)

[6] Attr-Div first filters out all the bottom-25% data by attribution scores, then applies the same selection strategy as Diversity to select data from the pruned candidates

[7] TSDS (Liu et al., 2024) finds candidate samples that are closest to the validation examples via OT and down-weights redundant ones using a kernel density estimate (KDE). It then samples from the associated probability distribution. We follow the default configuration: max k (maximum kNN neighbors) is set to 5000, kde k (neighborhood size for KDE) is set ... (2024)
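The Attribution baseline and the Attr-Div pre-filter described in entries [4] and [6] are simple enough to sketch. The code below is an illustrative reading of those one-line descriptions, not the baselines' released implementations; the function names are hypothetical, and plain Python lists stand in for gradient representations.

```python
def attribution_top_k(train_reps, val_reps, k):
    """Score each training vector by its inner product with the mean validation vector.

    Returns the indices of the k highest-scoring samples and all scores.
    """
    dim = len(val_reps[0])
    avg_val = [sum(v[d] for v in val_reps) / len(val_reps) for d in range(dim)]
    scores = [sum(a * b for a, b in zip(rep, avg_val)) for rep in train_reps]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

def drop_bottom_quartile(scores):
    """Indices that remain after discarding the lowest-scoring 25% of samples."""
    n_drop = len(scores) // 4
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[n_drop:])
```

Attr-Div, as described, would then run the Diversity selection over the surviving indices; that clustering step is omitted here.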
