Unified Data Selection for LLM Reasoning

Chengpeng Li; Dayiheng Liu; Fengbin Zhu; Fuli Feng; Keqin Bao; Wenjie Wang; Xiaoyuan Li; Yiyao Yu; Yubo Ma

arxiv: 2605.22389 · v1 · pith:4VVT6V4Inew · submitted 2026-05-21 · 💻 cs.CL

Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu

Pith reviewed 2026-05-22 05:29 UTC · model grok-4.3

classification 💻 cs.CL

keywords: High-Entropy Sum, data selection, LLM reasoning, supervised fine-tuning, rejection fine-tuning, reinforcement learning, training-free metric, token entropy

The pith

Summing entropy from the top 0.5 percent highest-entropy tokens ranks reasoning samples for effective LLM training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that High-Entropy Sum provides a training-free way to score reasoning quality in large language models by focusing only on the most uncertain tokens within each sample. This metric lets researchers select small subsets of data that deliver strong results when training models for complex reasoning. A sympathetic reader would care because building advanced reasoning capabilities currently demands large volumes of high-quality traces, and an efficient selection method reduces the data and compute burden across multiple training approaches. The work shows consistent benefits whether the selected data is used for supervised fine-tuning, rejection sampling, or reinforcement learning.

Core claim

High-Entropy Sum quantifies reasoning quality by summing the entropy of only the top 0.5 percent highest-entropy tokens in each sample. When samples are ranked by this score, the highest-ranked 20 percent match the performance of training on the entire dataset under supervised fine-tuning. The same ranking produces better outcomes than baseline methods in rejection fine-tuning and allows reinforcement learning to extract stronger reasoning patterns from selected successful trajectories.
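The selection step in this claim can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes every sample already has a precomputed HES score, and the function name `select_top_fraction`, the sample ids, and the toy scores are all hypothetical.

```python
from typing import Dict, List


def select_top_fraction(scores: Dict[str, float], fraction: float = 0.2) -> List[str]:
    """Return the ids of the highest-scoring samples, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Rank sample ids by score, highest first.
    ranked = sorted(scores, key=lambda sid: scores[sid], reverse=True)
    # Assumption: keep at least one sample and round to the nearest count.
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Hypothetical HES scores for ten samples (made-up values).
toy_scores = {f"s{i}": float(i) for i in range(10)}
selected = select_top_fraction(toy_scores)  # top 20% of 10 samples
```

Swapping `reverse=True` for `reverse=False` would give the lowest-HES subset, which is the comparison condition the abstract reports as degrading SFT performance.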

What carries the argument

High-Entropy Sum (HES), which aggregates entropy values from the top 0.5 percent highest-entropy tokens per reasoning sample to produce a quality score without any model training.
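A rough sketch of the score itself, under stated assumptions: per-token entropies are taken as Shannon entropy in nats over the model's next-token distribution, the number of kept tokens is rounded up, and at least one token is always kept. The rounding rule and the helper names (`token_entropy`, `high_entropy_sum`) are illustrative choices, not details confirmed by the source.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_sum(entropies: Sequence[float], top_ratio: float = 0.005) -> float:
    """Sum the entropies of the top `top_ratio` fraction of tokens.

    Assumption: the kept-token count is rounded up and is never below one;
    the paper's exact rounding rule is not given in this summary.
    """
    if not entropies:
        return 0.0
    k = max(1, math.ceil(top_ratio * len(entropies)))
    return sum(sorted(entropies, reverse=True)[:k])


# Toy trace (made-up values): 400 tokens, mostly confident, three uncertain.
toy_entropies: List[float] = [0.01] * 397 + [2.0, 1.5, 0.9]
toy_score = high_entropy_sum(toy_entropies)  # keeps the 2 highest-entropy tokens
```

In practice the entropies would come from a forward pass of a scoring model over the trace; that step is omitted here because it depends on the model and library used.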

Load-bearing premise

The assumption that entropy concentrated in a tiny fraction of tokens per sample reliably signals high overall quality of the full reasoning trace, and that this signal remains consistent across models and tasks.

What would settle it

Training an LLM on the top 20 percent of HES-ranked samples and observing lower accuracy than the full dataset on a held-out reasoning benchmark such as GSM8K would falsify the central effectiveness claim.

Figures

Figures reproduced from arXiv: 2605.22389 by Chengpeng Li, Dayiheng Liu, Fengbin Zhu, Fuli Feng, Keqin Bao, Wenjie Wang, Xiaoyuan Li, Yiyao Yu, Yubo Ma.

**Figure 1.** The comparative analysis of discriminative ability is conducted on 512 responses per problem generated…

**Figure 2.** Calculation of High-Entropy Sum (HES). The colored tokens are the tokens involved in the calculation.

**Figure 3.** Parameter sensitivity across diverse domains, including data selection ratio and high-entropy token ratio.

**Figure 4.** Wordcloud of high-entropy tokens. (a) Wordcloud of OpenR1-Math-220k. (b) Wordcloud of Open-Math-Reasoning

**Figure 5.** Wordcloud of high-entropy tokens. including AIME24, HMMT23, HMMT24, and HMMT25. As illustrated by the score distributions in…

**Figure 6.** Comparative analysis of discriminative ability between HES and other metrics based on 512 responses per…

**Figure 7.** Comparative analysis of discriminative ability between HES and other metrics based on 512 responses per…

**Figure 8.** Comparative analysis of discriminative ability between HES and other metrics based on 512 responses per…

**Figure 9.** Comparative analysis of discriminative ability between HES and other metrics based on 512 responses per…

**Figure 10.** Comparative analysis of discriminative ability between HES and other metrics based on 512 responses…

Original abstract

Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HES gives a cheap training-free filter for reasoning data that hits full SFT performance at 20% and beats baselines in RFT/RL, but the 0.5% top-entropy cutoff is presented without visible justification or stability checks.

The core idea is straightforward: compute per-token entropy on a reasoning trace, keep only the top 0.5% highest values, sum them, and use that score to rank and select data. They apply the same score to pick subsets for SFT, RFT, and RL, reporting that the top 20% matches full-dataset SFT results while low-HES data hurts, and that HES selection improves the other two paradigms over the baselines they tried. That unified application across three setups is the main new piece, and the training-free nature keeps overhead low, which is useful when data volume is the bottleneck for long-CoT work. The abstract shows consistent directional gains, so the basic filter idea has some empirical support on the tasks they ran.

The weak point is exactly the cutoff. The 0.5% figure is stated as the operating point with no derivation, no sensitivity plot across other percentiles, and no check on whether the ranking stays stable if entropy comes from a different base model or task distribution. Without those controls the claim that HES reliably separates high- from low-quality traces rests on the reported outcomes alone. If the full paper includes cross-percentile ablations and model-variation tests, that would tighten the argument; if not, the metric still needs that validation before it can be treated as robust.

This is the sort of incremental data-selection paper that people actually training reasoning models would want to see the details on. It is worth sending to referees so the community can check the stability of the threshold and the baseline implementations.

Referee Report

2 major / 2 minor

Summary. The paper proposes High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality in LLM traces by summing the per-token entropy of only the top 0.5% highest-entropy tokens per sample. It validates this metric across three training paradigms—SFT, RFT, and RL—claiming that top-20% HES data matches full-dataset SFT performance, outperforms baselines in RFT, and yields stronger reasoning in RL while reducing computational overhead.

Significance. If the central claims hold after addressing the noted gaps, the work offers a practical, training-free data-selection tool that could lower the barrier to scaling long-CoT reasoning training. The unified evaluation across SFT/RFT/RL and the reported efficiency gains are genuine strengths; the manuscript also ships concrete empirical comparisons that could be reproduced if code and exact data splits are released.

major comments (2)

[Abstract and §3] HES definition: The metric is defined by summing entropy exclusively over the top 0.5% highest-entropy tokens, yet no derivation, sensitivity analysis, or cross-percentile ablation is presented. If the selected subset changes materially at 0.1% or 1%, the reported SFT/RFT/RL gains no longer demonstrate that HES is a stable, robust quality signal.
[§4] Experimental results: The manuscript reports consistent gains but provides no statistical significance tests, variance across random seeds, or controls that isolate the effect of the HES ranking from model size or task difficulty. Without these, it is unclear whether the top-20% HES subset truly matches full-dataset performance or merely reflects easier subsets.
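One way to probe the first concern is to measure how much the selected subset changes as the token percentile moves. The sketch below is an illustration under assumptions, not a procedure from the paper: it scores toy traces at several percentiles, selects the top 20% each time, and reports Jaccard overlap with the subset chosen at 0.5%. All function names and the toy entropies are hypothetical, and the rounding rules are assumed.

```python
import math
from typing import Dict, Sequence, Set


def hes(entropies: Sequence[float], top_ratio: float) -> float:
    """Sum of the top `top_ratio` fraction of token entropies (rounded up)."""
    if not entropies:
        return 0.0
    k = max(1, math.ceil(top_ratio * len(entropies)))
    return sum(sorted(entropies, reverse=True)[:k])


def top_ids(traces: Dict[str, Sequence[float]], top_ratio: float,
            fraction: float) -> Set[str]:
    """Ids of the top `fraction` of traces when ranked by HES."""
    scores = {sid: hes(ent, top_ratio) for sid, ent in traces.items()}
    ranked = sorted(scores, key=lambda sid: scores[sid], reverse=True)
    return set(ranked[: max(1, round(fraction * len(ranked)))])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection over union; 1.0 means identical selections."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability(traces: Dict[str, Sequence[float]],
              ratios: Sequence[float] = (0.001, 0.005, 0.01),
              fraction: float = 0.2,
              reference: float = 0.005) -> Dict[float, float]:
    """Overlap of each percentile's selection with the reference selection."""
    ref = top_ids(traces, reference, fraction)
    return {r: jaccard(ref, top_ids(traces, r, fraction)) for r in ratios}


# Made-up traces of 1000 tokens each, built so the ranking flips:
# "a" has one very uncertain token, "b" has ten moderately uncertain ones.
toy_traces = {
    "a": [3.0] + [0.0] * 999,
    "b": [1.0] * 10 + [0.0] * 990,
    "c": [0.1] * 1000,
}
toy_stability = stability(toy_traces)
```

On this contrived input the 0.1% selection shares nothing with the 0.5% selection, while 1% agrees with it fully. Real data may behave very differently; the point is only that the overlap is cheap to measure.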

minor comments (2)

[§3] Notation for entropy computation (per-token vs. sequence-level) should be stated explicitly in the method section to avoid ambiguity when readers re-implement HES.
[Figures in §4] Figure captions and axis labels in the experimental plots could be expanded to include the exact percentile threshold and base model used for entropy extraction.
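For concreteness, one explicit notation the first minor comment could be asking for is sketched below. The symbols are assumptions made for illustration and are not quoted from the paper: $x = (x_1, \dots, x_T)$ is one reasoning sample, $\pi_\theta$ is the scoring model with vocabulary $\mathcal{V}$, $p$ is the high-entropy token ratio, and the rounding in $k$ is a guess.

```latex
% Notation sketch (assumed, not taken from the paper).
\begin{align}
  H_t &= -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x_{<t}) \,
          \log \pi_\theta(v \mid x_{<t}),
          && \text{per-token entropy at position } t \\
  k &= \max\bigl(1, \lceil p\,T \rceil\bigr), \qquad p = 0.005,
          && \text{number of tokens kept} \\
  \mathrm{HES}(x) &= \sum_{t \in \mathcal{T}_k(x)} H_t,
          && \mathcal{T}_k(x) = \text{indices of the } k \text{ largest } H_t
\end{align}
```

Written this way, the score is a per-sample sum over token-level entropies and not a sequence-level entropy, which is the distinction the comment raises.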

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.

Referee: [Abstract and §3] HES definition: The metric is defined by summing entropy exclusively over the top 0.5% highest-entropy tokens, yet no derivation, sensitivity analysis, or cross-percentile ablation is presented. If the selected subset changes materially at 0.1% or 1%, the reported SFT/RFT/RL gains no longer demonstrate that HES is a stable, robust quality signal.

Authors: We acknowledge that the manuscript does not include an explicit derivation or sensitivity analysis for the 0.5% threshold. This percentile was chosen empirically during preliminary experiments as it isolates the most uncertain tokens while avoiding dilution from lower-entropy ones. To address the concern directly, we will add a cross-percentile ablation study (0.1%, 0.5%, 1%, and 2%) to Section 3 and the appendix in the revised manuscript. Preliminary checks indicate that performance trends remain qualitatively stable across this range for the reported SFT, RFT, and RL settings, supporting robustness; the full results will be included to allow readers to evaluate stability.

Revision: yes

Referee: [§4] Experimental results: The manuscript reports consistent gains but provides no statistical significance tests, variance across random seeds, or controls that isolate the effect of the HES ranking from model size or task difficulty. Without these, it is unclear whether the top-20% HES subset truly matches full-dataset performance or merely reflects easier subsets.

Authors: We agree that reporting variance and statistical tests would improve interpretability. In the revision we will rerun the core SFT and RFT experiments across three random seeds, reporting means and standard deviations, and include paired t-tests or Wilcoxon tests against the full dataset and random baselines of equal size. To isolate HES ranking from task difficulty, we will add per-task breakdowns and comparisons against difficulty-stratified random subsets (using proxy metrics such as trace length or token count). These controls will help demonstrate that gains are driven by the entropy-based selection rather than incidental subset properties. Model-size effects are already partially controlled by using the same base model across conditions, but we will note this limitation explicitly.

Revision: yes
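The seed-level comparison described in this response could look roughly like the following. It is a sketch only: the accuracy values are invented placeholders, not results from the paper, and `paired_t` is a hypothetical helper that returns the paired t statistic and its degrees of freedom without computing a p-value.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs))), len(diffs) - 1


# Placeholder accuracies for three seeds (made up, NOT reported results).
toy_hes_runs = [52.0, 54.0, 53.0]   # model trained on the top-20% HES subset
toy_full_runs = [51.0, 52.0, 53.0]  # model trained on the full dataset

toy_summary = {
    "hes": (mean(toy_hes_runs), stdev(toy_hes_runs)),
    "full": (mean(toy_full_runs), stdev(toy_full_runs)),
}
toy_t, toy_df = paired_t(toy_hes_runs, toy_full_runs)
```

With only three seeds there are two degrees of freedom, so the test has little power; a small t statistic here would not by itself show that the two conditions match.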

Circularity Check

0 steps flagged

No significant circularity: HES is an explicitly defined empirical metric validated by downstream experiments.

The paper defines HES directly as the sum of per-token entropies for the top 0.5% highest-entropy tokens within each sample. This definition is not derived from or equated to the target performance outcomes; instead, the authors report separate experimental results showing that selecting top-20% HES data for SFT matches full-dataset performance, and that HES-based selection improves RFT and RL. No equations, self-citations, or uniqueness theorems are invoked that would reduce the claimed effectiveness to the definition itself. The 0.5% cutoff is presented as a design choice rather than a fitted parameter whose value is then 'predicted' from the same data. The derivation chain remains self-contained and externally falsifiable via the reported training outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that token-level entropy is a valid proxy for reasoning quality; one tunable threshold (top 0.5%) is introduced without independent justification.

free parameters (1)

top entropy percentile threshold
Example value of 0.5% used to select which tokens contribute to the sum; chosen to focus on highest-uncertainty positions.

axioms (1)

Domain assumption: Entropy of selected tokens correlates with overall reasoning quality of the sample
Invoked when defining HES as a quality metric in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $\mathrm{HES}_{\mathrm{relative}} = \sum_{t \,:\, \operatorname{rank}(H_t) \ge 1-p} H_t$ with $p = 0.005$; high-entropy tokens are identified as the top 0.5% of the entropy distribution.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

Paper passage: Figure 1 and Section 3.1 contrast Sum Entropy of High-Entropy Tokens against Avg Entropy and total ES.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

arXiv preprint arXiv:2504.16891 , year=

Aimo-2 winning solution: Building state-of-the-art mathematical reasoning models with openmathreasoning dataset , author=. arXiv preprint arXiv:2504.16891 , year=

work page arXiv

[2] [2]

Open R1: A fully open reproduction of DeepSeek-R1 , url =

work page

[3] [3]

Hugging Face repository , volume=

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=

work page

[4] [4]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models , author=. arXiv preprint arXiv:2503.21380 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

First Conference on Language Modeling , year=

Gpqa: A graduate-level google-proof q&a benchmark , author=. First Conference on Language Modeling , year=

work page

[7] [7]

DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL , author=

work page

[8] [8]

2025 , eprint=

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning , author=. 2025 , eprint=

work page 2025

[9] [9]

Proceedings of the Twentieth European Conference on Computer Systems , pages=

Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=

work page

[10] [10]

OpenAI o1 System Card

Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

arXiv preprint arXiv:2407.14622 , year=

Bond: Aligning llms with best-of-n distillation , author=. arXiv preprint arXiv:2407.14622 , year=

work page arXiv

[13] [13]

arXiv preprint arXiv:2501.11110 , year=

Chain-of-reasoning: Towards unified mathematical reasoning in large language models via a multi-paradigm perspective , author=. arXiv preprint arXiv:2501.11110 , year=

work page arXiv

[14] [14]

Advances in neural information processing systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in neural information processing systems , volume=

work page

[15] [15]

Proceedings of the AAAI conference on artificial intelligence , volume=

Graph of thoughts: Solving elaborate problems with large language models , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

work page

[16] Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models. arXiv preprint arXiv:2308.10379, 2023.

[17] Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding. Proceedings ENLSP-III.

[18] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv preprint arXiv:2211.12588.

[19] ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving. arXiv preprint arXiv:2309.17452.

[20] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance. arXiv preprint arXiv:2405.06682, 2024.

[21] LightThinker: Thinking Step-by-Step Compression. arXiv preprint arXiv:2502.15589.

[22] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.

[23] Chain of Draft: Thinking Faster by Writing Less. arXiv preprint arXiv:2502.18600.

[24] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching. arXiv preprint arXiv:2503.05179.

[25] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.

[26] LoRA: Low-Rank Adaptation of Large Language Models. ICLR.

[27] Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification. arXiv preprint arXiv:2503.06639.

[28] Let's Verify Step by Step. The Twelfth International Conference on Learning Representations.

[29] Process Reinforcement through Implicit Rewards. arXiv preprint arXiv:2502.01456.

[30] URSA: Understanding and Verifying Chain-of-Thought Reasoning in Multimodal Mathematics. arXiv preprint arXiv:2501.04686.

[31] DataComp-LM: In Search of the Next Generation of Training Sets for Language Models. Advances in Neural Information Processing Systems.

[32] Guilherme Penedo, Hynek Kydlíček, et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale.

[33] Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. CoRR, 2023. arXiv:2308.01825. doi:10.48550/ARXIV.2308.01825.

[34] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems.

[35] Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. arXiv:2308.01825.

[36] From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.

[37] Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models. arXiv preprint arXiv:2503.09567.

[38] GenSelect: A Generative Approach to Best-of-N. 2nd AI for Math Workshop @ ICML 2025, 2025.

[39] Deep Think with Confidence. arXiv preprint arXiv:2508.15260.

[40] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv preprint arXiv:2408.03314.

[41] A Survey on Data Selection for Language Models. arXiv preprint arXiv:2402.16827.

[42] The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends, 2009.

[43] Data Selection for Language Models via Importance Resampling. Advances in Neural Information Processing Systems.

[44] QuRating: Selecting High-Quality Data for Training Language Models. arXiv preprint arXiv:2402.09739.

[45] When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale. arXiv preprint arXiv:2309.04564.

[46] Data-Efficient Finetuning Using Cross-Task Nearest Neighbors. arXiv preprint arXiv:2212.00196.

[47] Grad-Match: Gradient Matching Based Data Subset Selection for Efficient Deep Model Training. International Conference on Machine Learning, 2021.

[48] In-Context Alignment: Chat with Vanilla Language Models Before Fine-Tuning. arXiv preprint arXiv:2308.04275.

[49] LESS: Selecting Influential Data for Targeted Instruction Tuning. arXiv preprint arXiv:2402.04333.

[50] Predictive Data Selection: The Data That Predicts Is the Data That Teaches. Forty-second International Conference on Machine Learning.

[51] Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv preprint arXiv:2112.11446.

[52] Comprehensive Benchmarking of Entropy and Margin Based Scoring Metrics for Data Selection. arXiv preprint arXiv:2311.16302.

[53] Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. arXiv preprint arXiv:2506.01939.

[54] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

[55] Llama-Nemotron: Efficient Reasoning Models. 2025.

[56] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations.

[57] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. The Thirteenth International Conference on Learning Representations.

[58] LIMA: Less Is More for Alignment. Advances in Neural Information Processing Systems.