pith. machine review for the scientific record.

arxiv: 2605.09404 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.CL · cs.CV

Recognition: 2 theorem links


Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV
keywords data selection · target-aligned paths · validation-induced flow · reference path bias · loss drop scoring · machine learning · instruction tuning

The pith

A short warmup on target validation data creates a reusable reference path that scores candidate training samples by their normalized loss drop.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix reference path bias in data selection, where trajectories built from a heterogeneous candidate pool misalign with the actual dynamics needed for a specific target task. Instead of aggregating attributions along the full pool's path, it runs a brief capacity-limited warmup on the available target validation data to induce an aligned flow. Candidates are then ranked by how much they reduce loss at the end of this flow, using a simple normalized drop that needs no gradients or Hessian information. This yields a zero-order rule that matches strong dynamic baselines in logistic, vision, and instruction-tuning experiments while cutting warmup time and storage. Because the path depends only on the target proxy, the same compact warmup can be reused for any number of new candidate pools without recomputation.
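The selection rule described above can be sketched end to end on a toy logistic problem. This is a hedged reconstruction from the summary only: the warmup length, learning rate, pool construction, and dimensions are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # Mean logistic loss with labels y in {-1, +1}.
    return float(np.mean(np.log1p(np.exp(-y * (X @ w)))))

def grad(w, X, y):
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / len(y)

# Target validation proxy: a small labeled sample from the target task.
d = 5
w_true = np.ones(d)
X_val = rng.normal(size=(40, d))
y_val = np.sign(X_val @ w_true + 0.1 * rng.normal(size=40))

# Short warmup on the proxy induces the reference path theta_0 -> theta_T.
theta_0 = np.zeros(d)
w = theta_0.copy()
for _ in range(30):                     # "short, capacity-limited" warmup
    w = w - 0.5 * grad(w, X_val, y_val)
theta_T = w

def normalized_loss_drop(x, y_i):
    # Zero-order score: relative loss reduction of one candidate between
    # the endpoints of the warmup path (no candidate gradients needed).
    l0 = loss(theta_0, x[None, :], np.array([y_i]))
    lT = loss(theta_T, x[None, :], np.array([y_i]))
    return (l0 - lT) / l0

# Candidate pool: indices 0-99 are label-flipped noise, 100-199 are clean.
X_pool = rng.normal(size=(200, d))
y_pool = np.sign(X_pool @ w_true)
y_pool[:100] *= -1

scores = np.array([normalized_loss_drop(x, yi) for x, yi in zip(X_pool, y_pool)])
selected = np.argsort(scores)[::-1][:50]      # top-50 by normalized drop
clean_frac = float(np.mean(selected >= 100))  # fraction of clean picks
print(f"clean fraction among selected: {clean_frac:.2f}")
```

On this toy pool the high-score end of the ranking is dominated by clean, target-consistent candidates, which is the qualitative behavior the review attributes to the method under noisy pools.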

Core claim

The paper claims that a validation-induced flow from a short capacity-limited warmup on the target validation proxy supplies an aligned reference trajectory, and that scoring each candidate by its normalized endpoint loss drop along this trajectory produces effective data selection for the downstream task.

What carries the argument

The validation-induced flow obtained from the short warmup, used as the reference path along which candidates are scored by normalized endpoint loss drop.

If this is right

  • Data selection no longer requires candidate gradients or second-order approximations.
  • A single warmup computation can be reused across multiple heterogeneous candidate pools.
  • Warmup cost and storage scale with the small validation proxy rather than the full candidate pool.
  • The method remains competitive with dynamic attribution techniques on controlled logistic, vision, and instruction-tuning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same warmup path could support repeated selection rounds as new candidates arrive without recomputing trajectories.
  • Testing progressively shorter warmups would reveal the minimal target data needed to keep the flow representative.
  • The loss-drop score could be combined with simple diversity filters to address potential sample interactions the current proxy ignores.
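The diversity-filter extension floated in the last bullet could take the form of a greedy pass over score-ranked candidates. Everything here is hypothetical: the scores, embeddings, and the `min_dist` threshold stand in for quantities the paper does not specify.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs standing in for the paper's quantities: per-candidate
# scores (e.g., normalized loss drops) and unit-norm feature embeddings.
scores = rng.normal(size=100)
feats = rng.normal(size=(100, 8))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

def diverse_top_k(scores, feats, k, min_dist=0.5):
    """Greedy pass: visit candidates in descending score order and keep one
    only if it lies at least min_dist (Euclidean) from every kept candidate."""
    chosen = []
    for i in np.argsort(scores)[::-1]:
        if all(np.linalg.norm(feats[i] - feats[j]) >= min_dist for j in chosen):
            chosen.append(int(i))
        if len(chosen) == k:
            break
    return chosen

picked = diverse_top_k(scores, feats, k=10)
print(len(picked), "candidates selected")
```

The filter leaves the loss-drop ranking untouched and only vetoes near-duplicates, so it addresses redundancy among selected samples without modeling their interactions directly.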

Load-bearing premise

The brief warmup on target validation data generates a flow whose dynamics closely match those of the full target task, and the endpoint loss drop reliably measures a candidate's standalone utility.

What would settle it

Rerun the same logistic, vision, and instruction-tuning experiments and test whether subsets chosen by the proposed scores underperform strong gradient-based attribution baselines by a clear margin on the target metric; a consistent gap would overturn the competitiveness claim, while parity would support it.

Figures

Figures reproduced from arXiv: 2605.09404 by Guang Cheng, Hengzhi He, Huitao Yang.

Figure 1. Conceptual illustration. Left: a 2D slice of parameter space; each colored region marks where that data type has high local alignment. Right: two possible reference trajectories on different manifolds. A pool-induced path can miss regions on the target manifold where some candidates become visible, even though those candidates may be valuable under a more target-aligned trajectory.

Figure 2. Toy logistic mixture. (a) Endpoint displacement directions for retraining, validation, and pool warmups, projected to the 2D plane spanned by the final retraining and validation displacements. (b) Shape-distance ratio; below 1 means retraining is closer to the validation path. (c) Better path alignment tracks lower target error.

Figure 3. Binary CIFAR-10 selection with ResNet-18. Full FT updates all parameters; partial FT updates layer4 + head. (a) Accuracy drop from clean to noisy pools at k = 500 (lower is better). (b) Clean-label and target fractions under 40% label noise (higher is better).

Figure 4. TyDiQA ablations. (a) Warmup rank controls the quality-stability tradeoff. (b) More warmup epochs improve selection. (c) Trajectory scoring outperforms a base-model perturbation score. Uncertainty is over warmup seeds only.

Figure 5. Dynamic selection paradigms: forward attribution along a pool path, online gradient …

Figure 6. Controlled logistic-regression metrics across budgets, mean …

Figure 7. Binary CIFAR-10 target accuracy across budgets under clean and 40% noisy pools. Full …

Figure 8. Binary CIFAR-10 partial fine-tuning: only layer4 and the classification head are retrained …

Figure 9. First-loss normalization vs. selected-example length. Each point is one task-source cell, …

Figure 10. Selected-length distributions under raw and normalized endpoint loss gaps. Gray his…

Figure 11. Score-variant ablation in noisy CIFAR-10 and logistic regression. In CIFAR-10, normal…
Original abstract

Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes 'target-aligned paths' for data selection: a short, capacity-limited warmup on the target validation proxy generates a reference trajectory, along which candidate samples are scored via their normalized endpoint loss drop. This zero-order rule is claimed to be competitive with dynamic attribution baselines in logistic, vision, and instruction-tuning experiments while reducing warmup and storage costs, and the trajectory can be reused across pools.

Significance. If the results hold, this provides a simple and efficient alternative to gradient-based attribution methods for targeted data selection. The decoupling of the reference path from the candidate pool is a notable strength, enabling reuse and lowering computational overhead. The approach addresses a relevant issue in data selection for heterogeneous datasets.

major comments (3)
  1. [§3 (Proposed Method)] The central assumption that the short capacity-limited warmup induces a flow representative of the target task's optimization dynamics is load-bearing but insufficiently validated. The skeptic note highlights that in heterogeneous pools or instruction-tuning, the loss landscape changes after the initial phase, so the induced flow may diverge. A concrete test, such as comparing trajectories from short vs. longer warmups or analyzing alignment metrics, is needed to support the claim.
  2. [Experiments section (likely §4)] The claim of competitive results lacks specific metrics, baseline details, and analysis of failure cases. For instance, without reported accuracy deltas, exact comparisons to methods like influence functions or TracIn, or R² values in the logistic experiments, it is difficult to assess if the data supports the competitiveness assertion.
  3. [§3.2 (Scoring rule)] The normalized endpoint loss drop is presented as a reliable proxy for utility, but it omits sample interactions and higher-order effects. While the paper treats these as negligible, no theoretical bounds or empirical ablation on this approximation are provided, which could undermine the zero-order rule's validity in complex settings.
minor comments (2)
  1. [Abstract] The abstract states 'substantially reducing warmup and storage cost' without quantifying the savings (e.g., number of epochs or memory footprint compared to baselines).
  2. [Notation and §3] Clarify the exact definition of the 'normalized endpoint loss drop' early in the paper, including the normalization factor and how the endpoint is chosen.
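On the second minor comment, one plausible reading of the score is consistent with the abstract's zero-order description and the "first-loss normalization" named in Figure 9. The formula below is a hedged reconstruction, not quoted from the paper:

```latex
s(z) \;=\; \frac{\ell(z;\theta_0) - \ell(z;\theta_T)}{\ell(z;\theta_0)},
```

where $\theta_0$ and $\theta_T$ are the start and endpoint of the validation-induced warmup trajectory and $\ell(z;\theta)$ is the loss of candidate $z$ at parameters $\theta$.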

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential of target-aligned paths as an efficient alternative for data selection. We address each major comment below and will revise the manuscript accordingly to strengthen the validation, quantitative details, and analysis of the proposed method.

Point-by-point responses
  1. Referee: [§3 (Proposed Method)] The central assumption that the short capacity-limited warmup induces a flow representative of the target task's optimization dynamics is load-bearing but insufficiently validated. The skeptic note highlights that in heterogeneous pools or instruction-tuning, the loss landscape changes after the initial phase, so the induced flow may diverge. A concrete test, such as comparing trajectories from short vs. longer warmups or analyzing alignment metrics, is needed to support the claim.

    Authors: We agree that further empirical validation of the reference path's alignment with target optimization dynamics would strengthen the paper, particularly for heterogeneous pools. In the revised manuscript, we will add direct comparisons of short versus extended warmups (e.g., 5% vs. 20% of training steps) along with alignment metrics such as average cosine similarity of parameter updates and divergence in validation loss curves. These additions will appear in Section 3 and the experimental analysis to address potential divergence concerns. revision: yes

  2. Referee: [Experiments section (likely §4)] The claim of competitive results lacks specific metrics, baseline details, and analysis of failure cases. For instance, without reported accuracy deltas, exact comparisons to methods like influence functions or TracIn, or R² values in the logistic experiments, it is difficult to assess if the data supports the competitiveness assertion.

    Authors: We will expand the experiments section with precise quantitative results, including accuracy deltas relative to full training and baselines, direct numerical comparisons to influence functions and TracIn (with reported values and standard deviations), and R² statistics for the logistic regression experiments. A new subsection will discuss observed failure cases and conditions under which performance degrades. revision: yes

  3. Referee: [§3.2 (Scoring rule)] The normalized endpoint loss drop is presented as a reliable proxy for utility, but it omits sample interactions and higher-order effects. While the paper treats these as negligible, no theoretical bounds or empirical ablation on this approximation are provided, which could undermine the zero-order rule's validity in complex settings.

    Authors: The normalized endpoint loss drop is intentionally a zero-order heuristic to avoid the computational cost of higher-order terms or pairwise interactions. While theoretical bounds on the approximation error are not derived in the current work (as the method prioritizes practicality), we will add an empirical ablation in the revised manuscript that measures the effect of sample interactions via controlled subsets and discusses limitations in highly complex or non-convex settings. revision: partial
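One concrete way to implement the alignment metric proposed in response 1 (mean cosine similarity of per-step parameter updates) is sketched below; the trajectory format and toy values are illustrative assumptions, not the authors' protocol.

```python
import numpy as np

def update_cosine(traj_a, traj_b):
    """Mean cosine similarity between the per-step parameter updates of two
    equal-length trajectories (each a list of parameter vectors)."""
    sims = []
    for a0, a1, b0, b1 in zip(traj_a, traj_a[1:], traj_b, traj_b[1:]):
        u, v = a1 - a0, b1 - b0
        sims.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    return float(np.mean(sims))

# Toy check: a trajectory is perfectly aligned with itself and
# anti-aligned with its mirror image.
t = [np.array([0.0, 0.0]), np.array([1.0, 0.5]), np.array([2.0, 1.5])]
mirror = [-p for p in t]
print(update_cosine(t, t), update_cosine(t, mirror))  # 1.0 -1.0
```

Applied to short versus extended warmups, a value near 1 would indicate the short warmup already tracks the longer trajectory's update directions; divergence toward 0 would flag the representativeness concern the referee raises.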

Circularity Check

0 steps flagged

No circularity: reference path and scoring rule are direct constructions from independent validation data

full rationale

The paper defines a validation-induced flow via short capacity-limited warmup on the target validation proxy, then applies a normalized endpoint loss drop as the candidate score. This is a direct, zero-order computation on the induced trajectory and does not reduce by construction to any fitted parameter, self-citation chain, or input-derived prediction. No equations or claims in the provided text equate the output selection rule to its own inputs, rename a known result, or import uniqueness via author-overlapping citations. The method is presented as an alternative to existing attribution baselines with explicit decoupling from the candidate pool, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach introduces minimal new entities but rests on assumptions about the representativeness of the validation warmup and the effectiveness of the loss-drop scoring rule.

free parameters (2)
  • warmup length
    The length of the short capacity-limited warmup is a design choice not specified numerically in the abstract.
  • capacity limit
    The capacity limit for the warmup training is a hyperparameter.
axioms (2)
  • domain assumption A short warmup on the target validation proxy generates a flow aligned with the target task.
    Central to creating the reference path decoupled from the candidate pool.
  • domain assumption Normalized endpoint loss drop is a valid zero-order measure of candidate utility.
    Used as the selection score without further justification in the abstract.
invented entities (1)
  • reference path bias · no independent evidence
    purpose: To describe the misalignment issue in heterogeneous pools
    Conceptual term introduced to motivate the new method.

pith-pipeline@v0.9.0 · 5469 in / 1513 out tokens · 74539 ms · 2026-05-12T02:43:28.382551+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors
