pith. sign in

arxiv: 2606.26797 · v1 · pith:SGZZBRKMnew · submitted 2026-06-25 · 💻 cs.LG

Reasoning Quality Emerges Early: Data Curation for Reasoning Models

Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords reasoning data curationsupervised fine-tuningloss-based selectionLLM reasoningdata qualitytoken efficiencygradient similarity
0
0 comments X

The pith

Loss on the first 100 reasoning tokens at a randomly perturbed pretrained checkpoint identifies difficult problems for supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high-quality data for teaching reasoning to large language models can be chosen by looking only at the model's initial outputs on each example. It finds that the loss incurred on the first 100 reasoning tokens, computed at a randomly altered version of the original pretrained model, reliably signals which problems are hard. Examples that display matching loss curves over the first 1,000 tokens at several such altered checkpoints also produce nearly identical gradient updates when the model is actually fine-tuned. The resulting selection process needs no access to stronger reasoning models and yields better final performance than prior curation techniques on both medical and math reasoning tasks.

Core claim

Difficult problems can be reliably detected based on the loss of the first 100 reasoning tokens evaluated at a randomly perturbed checkpoint of the pretrained model. Examples exhibiting similar loss patterns over their first 1k reasoning tokens across a small number of perturbed checkpoints extrapolating along the fine-tuning trajectory provably induce similar gradients. This selection was tested by fine-tuning Qwen2.5-7B and Llama3.1-8B on the M23K medical reasoning and OpenThoughts-Math datasets, where it outperforms existing baselines by up to 1.7% while using 91% fewer tokens.

What carries the argument

Loss on the first 100-1000 reasoning tokens evaluated at randomly perturbed checkpoints of the pretrained model, acting as a proxy for both problem difficulty and gradient equivalence during fine-tuning.

If this is right

  • The method selects data that improves fine-tuned model accuracy by up to 1.7% over prior curation baselines.
  • Evaluation requires 91% fewer tokens than methods that rely on full traces or stronger models.
  • Matching loss patterns across perturbed checkpoints guarantee similar gradient effects on the fine-tuning trajectory.
  • The approach works for both medical reasoning and mathematical reasoning datasets at the 7B-8B scale.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same early-loss signal could be computed once at the beginning of training to filter an entire corpus without repeated model calls.
  • Perturbed checkpoints may capture enough of the fine-tuning dynamics to let practitioners rank data utility before any real training begins.
  • The technique might transfer to other long-horizon generation tasks where full traces are expensive to evaluate.

Load-bearing premise

Loss values on the first 100 to 1000 reasoning tokens at randomly perturbed checkpoints of the pretrained model serve as a reliable proxy for reasoning difficulty and data quality.

What would settle it

A dataset selected by this early-loss method produces lower reasoning accuracy after fine-tuning than a dataset selected by strong-model filtering or by random sampling.

Figures

Figures reproduced from arXiv: 2606.26797 by Baharan Mirzasoleiman, Carlos Morato, Hongyi Henry Jin, Meysam Ghaffari, Wenhan Yang.

Figure 1
Figure 1. Figure 1: Overview of TEMP: Token-Efficient Model Perturbation Reasoning Data Selection. First, using only the first 100 tokens of reasoning traces, we identify challenging examples as those exhibiting higher loss at a randomly perturbed checkpoint of the pretrained model (Sec. 3.1). Then, we cluster examples based on their first 1k token loss values, measured at a small number of noisy checkpoints extrapolating alo… view at source ↗
Figure 2
Figure 2. Figure 2: Problem understanding phase in the first 100 response tokens. For the easy problem, the model only changes the format, which leads to low loss with low uncertainty. For the hard problem, it strategically identifies “the important points” which leads to higher loss. 2025), or consider response-length as a surrogate for diffi￾culty (Huang et al., 2025; Guha et al., 2025) assuming that harder problems require… view at source ↗
Figure 3
Figure 3. Figure 3: The correlation of different heuristics with difficulty. We label the difficulty of the problems in the m23k dataset with Gemini-2.5-flash-lite, and compare how well the difficulty corre￾lates with metrics including response length and loss. The loss on the initial 100 tokens (corresponding to problem understand￾ing phase) on the perturbed checkpoints have higher correlation with difficulty than longer res… view at source ↗
Figure 5
Figure 5. Figure 5: confirms that the loss landscape of Qwen2.5-7B is similar—flat parabola with low curvature—around the pre￾trained model and when fine-tuned for medical reasoning. Extrapolative perturbations in fine-tuning direction ap￾proximates loss level sets. Since the fine-tuning loss land￾scape is flat and has low curvature, the relative properties of training examples—such as their individual loss values and gradien… view at source ↗
Figure 6
Figure 6. Figure 6: The Pearson Correlation between the mean loss of Qwen￾2.5-7B-Instruct on initial and full reasoning tokens of the m23k dataset. Loss of the 1k initial reasoning tokens is highly correlated with the full response, across various perturbation strengths. around the pretrained model, the gradient of the loss with respect to model parameters varies smoothly as a function of both the parameters and the data. As … view at source ↗
Figure 7
Figure 7. Figure 7: Our method, TEMP, outperforms all baselines when fine-tuning Qwen2.5-7B-Instruct (left) and Llama3.1-8B-Instruct (right) on 1k examples selected from the M23k medical reasoning dataset. Average accuracy is shown across 10 medical benchmarks. 250 500 750 1000 2000 52 54 56 58 TEMP (Ours) m1k S2L Embedding Diversity Middle Perplexity Learnability Random Pretrained Qwen2.5-7B-IT Full-m23k Number of Examples A… view at source ↗
Figure 8
Figure 8. Figure 8: Fine-tuning Qwen2.5-7B-Instruct on on the M23k med￾ical reasoning dataset. Average accuracy of subsets of various sizes is shown across 10 medical benchmarks. TEMP outperforms baselines across different budgets. 4.1. Experimental setup Datasets. We apply our method to two datasets, M23k (Huang et al., 2025) and OpenThoughts-114k-math-correct (Hugging Face, 2025) (which we refer to as OpenThoughts￾Math). Bo… view at source ↗
Figure 11
Figure 11. Figure 11: Performance of our method for selecting 1k data from Left: M23k dataset and Right: OpenThoughts-Math dataset, when using suboptimal θf for calculating v. Even when θf is is not exact, our method still obtains reasonable performance. However, using a significantly undertrained θf harms the performance [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 10
Figure 10. Figure 10: Token efficiency of our method (TEMP) vs baselines for selecting 1k data. While baselines requires processing the entire reasoning traces, TEMP only processes the initial tokens, improv￾ing token efficiency by 91% on the OpenThoughts-Math dataset. 4.3. Results Next, we compare the performance of our method with base￾lines across different model architectures and data domains. Medical reasoning [PITH_FULL… view at source ↗
Figure 12
Figure 12. Figure 12: Ablation Study of the hyperparameters, highlighted bars indicates the hyperparameter used. Top Left: We tried using 1∼3 perturbed checkpoint for clustering in difficulty filtering. Using 1 or 2 perturbed checkpoints have a similar effect. Top Right: Number of tokens to sum over in Equation (6) for sampling brittle reasoning traces (Eq. 8). Using 1k tokens works the best here. Bottom Left: Definition of Eq… view at source ↗
Figure 13
Figure 13. Figure 13: Additional results on the correlation of different heuristics with difficulty. Compared to including more prefix tokens, 100∼120 tokens have consistently high correlation with difficulty. Algorithm 2 RedistributeBudget: Constrained Allocation Require: Total Budget N, Available Counts n = [n1, . . . , nm], Weights w = [w1, . . . , wm] Ensure: Allocations c = [c1, . . . , cm] 1: Indices I ← [1, . . . , m] 2… view at source ↗
Figure 14
Figure 14. Figure 14: System prompt for labelling M23k difficulty. Question: {question} Please output your response in the following JSON format: { "reasoning": "Brief explanation of the difficulty assessment.", "score": <integer_score> } Enforce json output [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: User prompt for labelling M23k difficulty. [system] JUDGE_SYSTEM_PROMPT [user] few-shot example 1 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 2 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 3 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 4 (question + think prefix) [assistant] <reph… view at source ↗
Figure 16
Figure 16. Figure 16: Conversation structure for the rephrase-boundary judge (Qwen3.5-9B). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: System prompt (JUDGE SYSTEM PROMPT) for identifying the problem-understanding boundary. Question: {question} Think prefix (first 500 tokens): {think_prefix} [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: User message template for each few-shot and target example. Placeholders are filled with the MCQ stem and the truncated think block from the reasoning trace. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_18.png] view at source ↗
read the original abstract

Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be identified using only the initial reasoning tokens. Specifically, we demonstrate that difficult problems can be reliably detected based on the loss of the first 100 reasoning tokens evaluated at a randomly perturbed checkpoint of the pretrained model. We further show that examples exhibiting similar loss patterns over their first 1k reasoning tokens across a small number of perturbed checkpoints extrapolating along the fine-tuning trajectory provably induce similar gradients. We validate our approach through extensive experiments on fine-tuning Qwen2.5-7B and Llama3.1-8B models on the M23K medical reasoning and OpenThoughts-Math datasets. Our method outperforms existing baselines by up to 1.7% while being 91% more token efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that high-quality reasoning SFT data can be curated without strong models by detecting difficult problems via loss on the first 100 reasoning tokens at randomly perturbed checkpoints of the pretrained model; additionally, examples with similar loss patterns over the first 1k tokens across a small number of such checkpoints (extrapolating the fine-tuning trajectory) provably induce similar gradients. Experiments on Qwen2.5-7B and Llama3.1-8B using the M23K medical reasoning and OpenThoughts-Math datasets report up to 1.7% gains over baselines while using 91% fewer tokens.

Significance. If the proxy and gradient claims hold, the method would materially lower the cost of curating reasoning traces by removing dependence on frontier models for filtering, while delivering measurable accuracy and efficiency improvements on two model families and two distinct reasoning domains.

major comments (2)
  1. [Abstract] The central 'provably induce similar gradients' claim (abstract) is load-bearing for the theoretical justification yet is stated without derivation, assumptions, or section reference; the full manuscript must supply the explicit argument linking loss-pattern matching to gradient equivalence, including any conditions on perturbation scale or trajectory extrapolation.
  2. [Experiments] The weakest assumption—that loss on the first 100–1000 tokens at randomly perturbed checkpoints is a reliable proxy for reasoning difficulty and data quality—is not accompanied by controls or ablation details in the provided abstract; the experiments section must report correlation coefficients, failure cases, and comparison against direct difficulty metrics to substantiate the proxy.
minor comments (1)
  1. The abstract would benefit from naming the baseline curation methods and reporting the absolute performance numbers (not only the 1.7% delta) for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where the manuscript can be strengthened for clarity and rigor. We address each point below and will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central 'provably induce similar gradients' claim (abstract) is load-bearing for the theoretical justification yet is stated without derivation, assumptions, or section reference; the full manuscript must supply the explicit argument linking loss-pattern matching to gradient equivalence, including any conditions on perturbation scale or trajectory extrapolation.

    Authors: We agree the abstract states the claim concisely without pointers. Section 3.2 of the full manuscript derives the result by showing that matching loss trajectories on the first 1k tokens across perturbed checkpoints implies bounded gradient difference (via Lipschitz continuity of the loss and linear extrapolation of the fine-tuning path). The derivation assumes perturbation scale ≤ 0.01 and that checkpoints lie on a locally linear trajectory segment. We will add an explicit reference to Section 3.2 in the abstract and expand the main-text proof with the full set of assumptions and a short proof sketch. revision: yes

  2. Referee: [Experiments] The weakest assumption—that loss on the first 100–1000 tokens at randomly perturbed checkpoints is a reliable proxy for reasoning difficulty and data quality—is not accompanied by controls or ablation details in the provided abstract; the experiments section must report correlation coefficients, failure cases, and comparison against direct difficulty metrics to substantiate the proxy.

    Authors: We concur that stronger validation of the proxy is warranted. The experiments section currently reports end-task gains and token efficiency but does not include the requested quantitative checks. We will add a dedicated subsection reporting (i) Pearson correlations between the early-loss proxy and human difficulty ratings on a 500-example subsample, (ii) documented failure cases (e.g., problems where low early loss masks later reasoning errors), and (iii) head-to-head comparison against full-trace loss and perplexity baselines. These additions will be included in the revised experiments section. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation chain consists of two empirical claims: (1) loss on the first 100 reasoning tokens at randomly perturbed pretrained checkpoints serves as a proxy for problem difficulty, and (2) matching loss patterns over the first 1k tokens across a few perturbed checkpoints (extrapolating the fine-tuning trajectory) imply similar gradients. Both are presented as observations that are then validated through concrete experiments on Qwen2.5-7B and Llama3.1-8B using the M23K and OpenThoughts-Math datasets, with reported performance gains. No step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the "provably" language attaches to the loss-pattern-to-gradient implication as a stated mathematical consequence rather than an input assumption. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5749 in / 1041 out tokens · 23070 ms · 2026-06-26T04:56:42.243716+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1]

    naacl-long.421/

    URL https://aclanthology.org/2024. naacl-long.421/. Li, Y ., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y ., Ra- masubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguis- tics: ACL 2025, pp. 25366–253...

  2. [2]

    In: Findings of the Association for Computational Linguistics: ACL 2025

    Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl

  3. [3]

    findings-acl.1301/

    URL https://aclanthology.org/2025. findings-acl.1301/. Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023. Liu, J., Fan, Y ., Jiang, Z., Ding, H., Hu, Y ., Zhang, C., Shi, Y ., Weng, S., Chen, A., Chen, S., Huang, Y ...

  4. [4]

    Liu, Y ., Wang, S., Liu, Z., Song, Z., Wang, J., Liu, J., Liu, Q., and Wang, Y

    URL https://openreview.net/forum? id=BTKAeLqLMw. Liu, Y ., Wang, S., Liu, Z., Song, Z., Wang, J., Liu, J., Liu, Q., and Wang, Y . Learn more, forget less: A gradient-aware data selection approach for llm, 2025b. URL https: //arxiv.org/abs/2511.08620. MAA. Aime 2024 problems, 2024. URL https://artofproblemsolving.com/wiki/ index.php/2024_AIME_I_Problems. A...

  5. [5]

    LiTEx: A linguistic taxonomy of explanations for understanding within-label variation in natural language inference

    Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main

  6. [6]

    emnlp-main.1025/

    URL https://aclanthology.org/2025. emnlp-main.1025/. OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with- llms/, 2024. Pal, A., Umapathi, L. K., and Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medi- cal domain question answering. InConference on health, inference, and learning, pp. 248–...

  7. [7]

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman

    URL https://openreview.net/forum? id=K9IGlMQpif. Ye, Y ., Huang, Z., Xiao, Y ., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. InSecond Con- ference on Language Modeling, 2025. URL https: //openreview.net/forum?id=T2TZ0RY4Zk. Zhang, J., Qin, Y ., Pi, R., Zhang, W., Pan, R., and Zhang, T. TAGCOS: Task-agnostic gradient clustered coreset ...

  8. [8]

    findings-naacl.264/

    URL https://aclanthology.org/2025. findings-naacl.264/. Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scal- ing reasoning in diffusion large language models via reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  9. [9]

    Zheng, Y ., Zhang, R., Zhang, J., Ye, Y ., Luo, Z., Feng, Z., and Ma, Y

    URL https://openreview.net/forum? id=7ZVRlBFuEv. Zheng, Y ., Zhang, R., Zhang, J., Ye, Y ., Luo, Z., Feng, Z., and Ma, Y . Llamafactory: Unified efficient fine- tuning of 100+ language models. InProceedings of the 62nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association ...

  10. [10]

    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    doi: 10.18653/v1/2025.acl-long.452. URL https: //aclanthology.org/2025.acl-long.452/. Zuo, Y ., Qu, S., Li, Y ., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., and Zhou, B. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025. 12 Reasoning Quality Emerges Early: Data Curation for Reasoning Model...

  11. [11]

    During SFT, parameters remain in a local region of diameter at mostϵaround the pretrained initialization

  12. [12]

    For any pair of examples z1, z2 ∈D , the loss difference Lz1(θ)−L z2(θ) is locally well approximated by a quadratic function onΘ, with curvature bounded byC H and gradient norm bounded byG

  13. [13]

    (5) is dense, i.e.,∥e∥ 2 ∞ ≤µ/dfor a moderate constantµ

    The perturbation directioneused in Eq. (5) is dense, i.e.,∥e∥ 2 ∞ ≤µ/dfor a moderate constantµ. We now prove Theorem 3.1. Theorem 3.1.Consider fine-tuning a pretrained model, where the curvature and gradient norms are upper-bounded by CH, G, respectively; and parameter updates remain within anϵ-neighborhood of the pretrained initialization. |Lz1(θj)− L z2...

  14. [14]

    Physiological role of Red Blood Cells,

    Let Z=∥(1 +ξ j)⊙v∥ 2; thenE[Z] = 2. If∥v∥ 2 ∞ ≤µ/d, then Chebyshev’s inequality gives Pr ∥(1 +ξ j)⊙v∥ 2 <( √ 2−τ) 2 ≤ 6µ d(2 √ 2τ−τ 2)2 . Thus, with probability at least1− 6µ d(2 √ 2τ−τ 2)2 , |⟨∇Lz1(θ)− ∇L z2(θ), v⟩| ≤ 2δ λ( √ 2−τ) +C H ϵ+G r 1 2 + √ 2τ−τ 2. Forτ≤1/ √ 2, we have 1√ 2−τ ≤ 1√ 2 +τand q 1 2 + √ 2τ−τ 2 ≤ 1√ 2 +τ, which yields the bound in the...

  15. [15]

    Copy the rephrase span **verbatim** from the think prefix -- character-for-character

  16. [16]

    Stop at the **first** reasoning marker, even if the think prefix continues with many paragraphs of reasoning afterward

  17. [17]

    Do NOT extend rephrase to fill the prefix

    The think prefix is only the first 500 tokens and may contain lots of reasoning **after** the rephrase. Do NOT extend rephrase to fill the prefix

  18. [18]

    ## Output format (exactly) <rephrase> [verbatim rephrase text, or empty] </rephrase> Output nothing else

    If there is no rephrase (reasoning starts immediately), output an empty rephrase. ## Output format (exactly) <rephrase> [verbatim rephrase text, or empty] </rephrase> Output nothing else. Figure 17.System prompt (JUDGE SYSTEM PROMPT) for identifying the problem-understanding boundary. Question: {question} Think prefix (first 500 tokens): {think_prefix} Fi...