A survey on data selection for language models

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. · 2024 · arXiv 2402.16827

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

cs.LG · 2026-05-16 · conditional · novelty 8.0

BoLT is a benchmark of surrogate models fitted to real LLM experiment data that enables evaluation of Bayesian and black-box optimization methods on multi-fidelity, multi-objective, high-dimensional LLM tasks.

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing the entropy of the highest-entropy tokens, matching full-dataset performance with the top 20% in SFT and outperforming baselines in RFT and RL.

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.

SEED: Targeted Data Selection by Weighted Independent Set

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods on instruction tuning and segmentation tasks.

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Synthetic pre-pre-training on structured data improves LLM robustness to noisy pre-training, matching baseline loss with up to 49% fewer natural tokens for a 1B model.

CRAFT: Clustered Regression for Adaptive Filtering of Training data

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

CRAFT filters training data via source clustering and conditional target selection to bound KL divergence to validation distributions, yielding 43.34 BLEU on English-Hindi translation from 33M pairs while running over 40x faster than TSDS.

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

cs.LG · 2025-02-01 · unverdicted · novelty 6.0

DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergence to the optimal mixture.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

The DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points with 40% less compute.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

cs.CL · 2025-11-26 · unverdicted · novelty 5.0

Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

cs.CL · 2025-01-02 · unverdicted · novelty 5.0

SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.

Factual Inconsistencies in Multilingual Wikipedia Tables

cs.CL · 2025-07-24 · unverdicted · novelty 4.0

The study introduces a method for detecting and categorizing cross-lingual factual inconsistencies in Wikipedia tables using alignment techniques and metrics on sample data.

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · novelty 4.0

A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

citing papers explorer

Showing 17 of 17 citing papers.

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

cs.LG · 2026-05-16 · conditional · none · ref 2

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · none · ref 41

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

cs.AI · 2026-05-19 · unverdicted · none · ref 25

SEED: Targeted Data Selection by Weighted Independent Set

cs.LG · 2026-05-15 · unverdicted · none · ref 3

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

cs.CL · 2026-05-11 · unverdicted · none · ref 54

CRAFT: Clustered Regression for Adaptive Filtering of Training data

cs.CL · 2026-04-24 · unverdicted · none · ref 1

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · none · ref 59

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · none · ref 8

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

cs.LG · 2025-02-01 · unverdicted · none · ref 2

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · none · ref 9

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · none · ref 8

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · none · ref 153

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

cs.CL · 2025-11-26 · unverdicted · none · ref 2

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

cs.CL · 2025-01-02 · unverdicted · none · ref 1

Factual Inconsistencies in Multilingual Wikipedia Tables

cs.CL · 2025-07-24 · unverdicted · none · ref 1

From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

cs.SE · 2024-10-28 · unverdicted · none · ref 21

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · none · ref 6
