A survey on data selection for language models

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · other: 1

citation-polarity summary

background: 1 · unclear: 1

representative citing papers

Unified Data Selection for LLM Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing the entropies of the highest-entropy tokens, matching full-dataset performance with the top 20% in SFT and outperforming baselines in RFT and RL.

SEED: Targeted Data Selection by Weighted Independent Set

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

SEED models data selection as Weighted Independent Set on a similarity graph, using node value calibration and local scale normalization to produce compact high-quality training subsets that outperform prior methods on instruction tuning and segmentation tasks.

CRAFT: Clustered Regression for Adaptive Filtering of Training data

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

CRAFT filters training data via source clustering and conditional target selection to bound KL divergence to validation distributions, yielding 43.34 BLEU on English-Hindi translation from 33M pairs while running over 40x faster than TSDS.

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

SEDD: Scalable and Efficient Dataset Deduplication with GPUs

cs.CL · 2025-01-02 · unverdicted · novelty 5.0

SEDD delivers a distributed GPU deduplication system that reports up to 158x speedup over CPU baselines and 7.8x over NeMo Curator on 30M documents while preserving MinHash fidelity above 0.95 Jaccard.

Factual Inconsistencies in Multilingual Wikipedia Tables

cs.CL · 2025-07-24 · unverdicted · novelty 4.0

The study introduces a method for detecting and categorizing cross-lingual factual inconsistencies in Wikipedia tables using alignment techniques and metrics on sample data.
