LESS: Selecting Influential Data for Targeted Instruction Tuning
Instruction tuning has unlocked powerful capabilities in large language models (LLMs), effectively using combined datasets to develop general-purpose chatbots. However, real-world applications often require a specialized suite of skills (e.g., reasoning). The challenge lies in identifying the most relevant data from these extensive datasets to effectively develop specific capabilities, a setting we frame as targeted instruction tuning. We propose LESS, an optimizer-aware and practically efficient algorithm to effectively estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. Crucially, LESS adapts existing influence formulations to work with the Adam optimizer and variable-length instruction data. LESS first constructs a highly reusable and transferable gradient datastore with low-dimensional gradient features and then selects examples based on their similarity to few-shot examples embodying a specific capability. Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Furthermore, the selected data is highly transferable: smaller models can be leveraged to select useful data for larger models and models from different families. Our qualitative analysis shows that our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
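The two stages described in the abstract (build low-dimensional gradient features, then select by similarity to a few target examples) can be illustrated with a rough sketch. This is not the authors' implementation: it uses synthetic vectors in place of real per-example gradients, a plain Gaussian random projection, and cosine similarity to the mean target feature, and it leaves out the Adam-aware influence formulation and any aggregation over checkpoints. The function names, dimensions, and the scoring rule are all assumptions made for illustration.

```python
import numpy as np


def project_gradients(grads, dim, seed=0):
    """Map (n, d) gradient vectors to (n, dim) features with a fixed random projection.

    Assumption: a Gaussian projection stands in for whatever low-rank
    feature construction the paper uses. Reusing the same seed and input
    width gives the same projection for training and target gradients.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ proj


def select_top_fraction(train_feats, target_feats, fraction=0.05):
    """Return indices of the top `fraction` of training examples.

    Assumption: each training example is scored by cosine similarity
    to the mean target feature.
    """
    target = target_feats.mean(axis=0)
    target = target / max(np.linalg.norm(target), 1e-12)
    norms = np.maximum(np.linalg.norm(train_feats, axis=1, keepdims=True), 1e-12)
    scores = (train_feats / norms) @ target
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores)[:k], scores


# Synthetic demo: 200 "training gradients", the first 10 aligned with the target skill.
rng = np.random.default_rng(42)
direction = rng.standard_normal(1000)
train_grads = rng.standard_normal((200, 1000))
train_grads[:10] = direction + 0.1 * rng.standard_normal((10, 1000))
target_grads = direction + 0.1 * rng.standard_normal((5, 1000))  # few-shot target examples

train_feats = project_gradients(train_grads, dim=64)    # the reusable "datastore"
target_feats = project_gradients(target_grads, dim=64)  # same seed, same projection
selected, scores = select_top_fraction(train_feats, target_feats, fraction=0.05)
print(sorted(selected.tolist()))
```

In this toy setup the training features are computed once and can be reused for any new set of target examples, which is the property the abstract refers to as a reusable gradient datastore.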
Forward citations
Cited by 29 Pith papers
-
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures
HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.
-
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
TAC is a bandit curriculum for multi-domain RLVR that prioritizes domains whose gradient updates align with and benefit other domains, yielding up to 2.8-point macro accuracy gains over learnability-only baselines on ...
-
Transferability for General Reasoning: An Automated Curriculum for Multi-Domain RLVR
TAC is a new online curriculum for multi-domain RLVR that estimates cross-domain transferability from gradient geometry to achieve up to 2.8 point macro-averaged accuracy gains over learnability-only baselines on six-...
-
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
-
CODEBLOCK: Learning to Supervise Code at the Right Granularity
CodeBlock partitions code responses into syntactically coherent blocks, scores them with generalized cross-entropy and data-flow signals, and applies sparse supervision to achieve higher pass@1 than full SFT using 1.9...
-
Selective Contrastive Learning For Gloss Free Sign Language Translation
A pair selection strategy based on negative similarity dynamics strengthens contrastive supervision in gloss-free sign language translation by reducing noisy negatives.
-
When Does Generating More Help? Disentangling Fixed-Source Synthesis from Source Expansion in Synthetic Data Scaling
Fixed-source synthesis is bounded; a derived scaling law predicts high-budget performance from low-budget fits, and source expansion outperforms fixed-source at large matched budgets.
-
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity
On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.
-
Quantifying the Agreement Between Data-Influence and Data-Similarity to Understand LLM Behavior
Data-similarity and data-influence produce significantly overlapping rankings of training documents for LLM outputs, with asymmetry allowing a favorable cost-accuracy trade-off.
-
DRIFT: Refining Instruction Data via On-Policy Data Attribution
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
-
Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection
SHIFT selects compact RLVR training subsets using the magnitude of hidden-state change from a single inference rollout plus quality-weighted farthest-first coverage, outperforming training-free baselines on math reaso...
-
Unified Data Selection for LLM Reasoning
High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.
-
PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
PRISM weights target examples by the current model's preference to build a better representation for influence-function scoring of training samples in efficient LLM fine-tuning.
-
PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning
PRISM weights target examples by model preference to build an improved direction for influence-based data selection in LLM fine-tuning.
-
Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
For a fixed data budget in LLM supervised fine-tuning, optimal data difficulty shifts toward harder examples as the budget grows because of the tradeoff between in-distribution generalization gap and extrapolation gap.
-
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Target-aligned data selection via normalized endpoint loss drop on a validation-induced reference path achieves competitive performance with reduced computational overhead.
-
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-paramete...
-
LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning
LLM agents iteratively generate and optimize data processing strategies for fine-tuning, delivering over 80% win rates versus unprocessed data and 65% versus LLM-based AutoML baselines while cutting search time by up to 10x.
-
Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Fin-PRM is a domain-specialized process reward model that supplies binary step-level and trajectory-level supervision signals for financial reasoning in LLMs and outperforms general PRMs on CFLUE and FinQA benchmarks.
-
DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergen...
-
Data Selection Through Iterative Self-Filtering for Vision-Language Settings
An iterative bootstrapped self-filtering approach selects balanced clean and diverse subsets from noisy vision-language datasets to train improved CLIP models.
-
SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning
SLAP is a new batch-aware pruning framework that uses distribution-aware stratified sampling and Hessian-approximated gradients to select data, claiming 20-40% less data while matching or exceeding full-dataset perfor...
-
Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning
Optimal data difficulty for LLM supervised fine-tuning shifts toward harder examples as data budget increases due to the generalization-extrapolation tradeoff.
-
Toward Communication-Efficient Space Data Centers: Bottlenecks, Architectures, and New Paradigms
Semantic communication in a multi-layer heterogeneous space data center framework can substantially reduce uplink pressure for orbital AI by sending compact representations rather than raw data.
-
Rigorous Interpretation Is a Form of Evaluation
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
-
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
-
Retrieval-Augmented Generation for AI-Generated Content: A Survey
A survey classifying RAG foundations for AIGC, summarizing enhancements, cross-modal applications, benchmarks, limitations, and future directions.
-
An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models
Data-influence-score filtering using validation-set loss on downstream coding tasks improves Code-LLM performance, with the most beneficial training data varying significantly across different programming tasks.
-
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.