Reasoning Quality Emerges Early: Data Curation for Reasoning Models

Baharan Mirzasoleiman; Carlos Morato; Hongyi Henry Jin; Meysam Ghaffari; Wenhan Yang

arxiv: 2606.26797 · v1 · pith:SGZZBRKMnew · submitted 2026-06-25 · 💻 cs.LG

Hongyi Henry Jin, Wenhan Yang, Meysam Ghaffari, Carlos Morato, Baharan Mirzasoleiman

Pith reviewed 2026-06-26 04:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords reasoning data curation · supervised fine-tuning · loss-based selection · LLM reasoning · data quality · token efficiency · gradient similarity

The pith

Loss on the first 100 reasoning tokens at a randomly perturbed pretrained checkpoint identifies difficult problems for supervised fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that high-quality data for teaching reasoning to large language models can be chosen by looking only at the model's initial outputs on each example. It finds that the loss incurred on the first 100 reasoning tokens, computed at a randomly altered version of the original pretrained model, reliably signals which problems are hard. Examples that display matching loss curves over the first 1,000 tokens at several such altered checkpoints also produce nearly identical gradient updates when the model is actually fine-tuned. The resulting selection process needs no access to stronger reasoning models and yields better final performance than prior curation techniques on both medical and math reasoning tasks.

Core claim

Difficult problems can be reliably detected from the loss on the first 100 reasoning tokens, evaluated at a randomly perturbed checkpoint of the pretrained model. Examples whose losses over their first 1k reasoning tokens follow similar patterns across a small number of perturbed checkpoints, obtained by extrapolating along the fine-tuning trajectory, provably induce similar gradients. The selection was tested by fine-tuning Qwen2.5-7B and Llama3.1-8B on the M23K medical reasoning and OpenThoughts-Math datasets, where it outperforms existing baselines by up to 1.7% while being 91% more token efficient.
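The difficulty signal described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-token log-probabilities of each reasoning trace have already been computed at a perturbed checkpoint, and it assumes the prefix loss is a mean negative log-likelihood. The names `prefix_loss` and `rank_by_prefix_loss` are hypothetical.

```python
from typing import List, Sequence


def prefix_loss(token_logprobs: Sequence[float], k: int = 100) -> float:
    """Mean negative log-likelihood over the first k reasoning tokens.

    token_logprobs holds log p(token | context) for each token of one trace,
    as scored by the (perturbed) model. Averaging is an assumption here.
    """
    prefix = token_logprobs[:k]
    if not prefix:
        raise ValueError("reasoning trace has no tokens")
    return -sum(prefix) / len(prefix)


def rank_by_prefix_loss(traces: Sequence[Sequence[float]], k: int = 100) -> List[int]:
    """Return example indices ordered from highest to lowest prefix loss."""
    losses = [prefix_loss(t, k) for t in traces]
    return sorted(range(len(traces)), key=lambda i: losses[i], reverse=True)
```

Under this sketch, the examples at the front of the returned ordering are the candidates for "difficult" problems; how many to keep is a separate budget decision.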

What carries the argument

Loss on the first 100-1000 reasoning tokens evaluated at randomly perturbed checkpoints of the pretrained model, acting as a proxy for both problem difficulty and gradient equivalence during fine-tuning.
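One way to picture a "perturbed checkpoint" is as the pretrained weights moved along a noisy version of the fine-tuning direction. The sketch below assumes the form θ_j = θ_0 + λ(1 + ξ_j) ⊙ v with v a unit-norm direction toward a fine-tuned model θ_f and ξ_j standard Gaussian noise. That form is an assumption made for illustration; the paper's own perturbation equation is not reproduced on this page. Parameters are flat Python lists, and `fine_tuning_direction` and `perturbed_checkpoint` are hypothetical names.

```python
import math
import random
from typing import List, Sequence


def fine_tuning_direction(theta_0: Sequence[float], theta_f: Sequence[float]) -> List[float]:
    """Unit-norm direction from the pretrained weights toward fine-tuned weights."""
    diff = [f - p for p, f in zip(theta_0, theta_f)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("theta_f equals theta_0; direction is undefined")
    return [x / norm for x in diff]


def perturbed_checkpoint(theta_0: Sequence[float], v: Sequence[float],
                         lam: float, rng: random.Random) -> List[float]:
    """Assumed form: theta_0 + lam * (1 + xi) * v, with xi ~ N(0, 1) per coordinate."""
    return [p + lam * (1.0 + rng.gauss(0.0, 1.0)) * vi for p, vi in zip(theta_0, v)]
```

Passing a seeded `random.Random` keeps the perturbed checkpoints reproducible, which matters if losses for different examples are to be compared at the same checkpoint.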

If this is right

The method selects data that improves fine-tuned model accuracy by up to 1.7% over prior curation baselines.
Data selection processes 91% fewer tokens than baselines that read entire reasoning traces (reported for OpenThoughts-Math).
Matching loss patterns across perturbed checkpoints imply, under the paper's assumptions, similar gradients along the fine-tuning trajectory.
The approach works for both medical reasoning and mathematical reasoning datasets at the 7B-8B scale.
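The idea of grouping examples by loss pattern can be sketched as follows. Each example is represented by a short vector of prefix losses, one entry per perturbed checkpoint, and examples whose vectors agree to within a tolerance `delta` at every checkpoint land in the same group. The greedy leader-based grouping is a stand-in chosen for brevity, not the paper's clustering procedure; `delta` and both function names are hypothetical.

```python
from typing import List, Sequence


def loss_pattern_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest per-checkpoint loss gap between two examples."""
    return max(abs(x - y) for x, y in zip(a, b))


def group_by_loss_pattern(patterns: Sequence[Sequence[float]], delta: float) -> List[List[int]]:
    """Greedy grouping: an example joins the first group whose leader is within delta."""
    groups: List[List[int]] = []  # each group lists example indices, leader first
    for i, pattern in enumerate(patterns):
        for group in groups:
            if loss_pattern_distance(patterns[group[0]], pattern) <= delta:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

Comparing against a single leader keeps the pass linear in the number of groups; a centroid-based method would be less order-dependent at the cost of more code.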

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-loss signal could be computed once at the beginning of training to filter an entire corpus without repeated model calls.
Perturbed checkpoints may capture enough of the fine-tuning dynamics to let practitioners rank data utility before any real training begins.
The technique might transfer to other long-horizon generation tasks where full traces are expensive to evaluate.

Load-bearing premise

Loss values on the first 100 to 1000 reasoning tokens at randomly perturbed checkpoints of the pretrained model serve as a reliable proxy for reasoning difficulty and data quality.

What would settle it

A dataset selected by this early-loss method produces lower reasoning accuracy after fine-tuning than a dataset selected by strong-model filtering or by random sampling.

Figures

Figures reproduced from arXiv: 2606.26797 by Baharan Mirzasoleiman, Carlos Morato, Hongyi Henry Jin, Meysam Ghaffari, Wenhan Yang.

**Figure 1.** Overview of TEMP: Token-Efficient Model Perturbation Reasoning Data Selection. First, using only the first 100 tokens of reasoning traces, we identify challenging examples as those exhibiting higher loss at a randomly perturbed checkpoint of the pretrained model (Sec. 3.1). Then, we cluster examples based on their first 1k token loss values, measured at a small number of noisy checkpoints extrapolating alo…

**Figure 2.** Problem understanding phase in the first 100 response tokens. For the easy problem, the model only changes the format, which leads to low loss with low uncertainty. For the hard problem, it strategically identifies “the important points”, which leads to higher loss.

**Figure 3.** The correlation of different heuristics with difficulty. We label the difficulty of the problems in the m23k dataset with Gemini-2.5-flash-lite, and compare how well the difficulty correlates with metrics including response length and loss. The loss on the initial 100 tokens (corresponding to the problem understanding phase) at the perturbed checkpoints has a higher correlation with difficulty than longer res…

**Figure 5.** Figure 5 confirms that the loss landscape of Qwen2.5-7B is similar (a flat parabola with low curvature) around the pretrained model and when fine-tuned for medical reasoning. Extrapolative perturbations in the fine-tuning direction approximate loss level sets. Since the fine-tuning loss landscape is flat and has low curvature, the relative properties of training examples, such as their individual loss values and gradien…

**Figure 6.** The Pearson correlation between the mean loss of Qwen2.5-7B-Instruct on initial and full reasoning tokens of the m23k dataset. Loss of the 1k initial reasoning tokens is highly correlated with the full response, across various perturbation strengths.
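The Figure 6 caption refers to a Pearson correlation between prefix loss and full-trace loss. As a reminder of what that statistic measures, here is a small self-contained sketch; the function name `pearson` and the toy inputs in the usage note are illustrative and are not data from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (spread_x * spread_y)
```

In this setting `xs` would hold per-example losses on the first 1k tokens and `ys` the per-example losses on the full trace, both scored at the same checkpoint.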

**Figure 7.** Our method, TEMP, outperforms all baselines when fine-tuning Qwen2.5-7B-Instruct (left) and Llama3.1-8B-Instruct (right) on 1k examples selected from the M23k medical reasoning dataset. Average accuracy is shown across 10 medical benchmarks. [Plot legend: TEMP (Ours) m1k S2L Embedding Diversity Middle Perplexity Learnability Random Pretrained Qwen2.5-7B-IT Full-m23k; x-axis: Number of Examples]

**Figure 8.** Fine-tuning Qwen2.5-7B-Instruct on the M23k medical reasoning dataset. Average accuracy of subsets of various sizes is shown across 10 medical benchmarks. TEMP outperforms baselines across different budgets.

4.1. Experimental setup. Datasets. We apply our method to two datasets, M23k (Huang et al., 2025) and OpenThoughts-114k-math-correct (Hugging Face, 2025) (which we refer to as OpenThoughtsMath). Bo…

**Figure 11.** Performance of our method for selecting 1k data from Left: M23k dataset and Right: OpenThoughts-Math dataset, when using suboptimal θf for calculating v. Even when θf is not exact, our method still obtains reasonable performance. However, using a significantly undertrained θf harms the performance.

**Figure 10.** Token efficiency of our method (TEMP) vs baselines for selecting 1k data. While baselines require processing the entire reasoning traces, TEMP only processes the initial tokens, improving token efficiency by 91% on the OpenThoughts-Math dataset.

4.3. Results. Next, we compare the performance of our method with baselines across different model architectures and data domains.
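The source of the token savings in Figure 10 is simple arithmetic: a method that scores only a fixed-length prefix reads at most that many tokens per trace, while a full-trace method reads everything. The sketch below computes the fraction saved for a given prefix length. It is a simplified illustration with a hypothetical function name and does not model TEMP's two-stage use of 100-token and 1k-token prefixes or multiple checkpoints.

```python
from typing import Sequence


def token_savings(trace_lengths: Sequence[int], prefix_tokens: int) -> float:
    """Fraction of tokens skipped when only the first prefix_tokens of each trace are read."""
    full = sum(trace_lengths)
    if full == 0:
        raise ValueError("no tokens to process")
    read = sum(min(n, prefix_tokens) for n in trace_lengths)
    return 1.0 - read / full
```

For example, with toy trace lengths of 10,000, 12,000 and 500 tokens and a 1,000-token prefix, 2,500 of 22,500 tokens are read, so roughly 89% are skipped. The savings grow with trace length, which is why long reasoning traces benefit most.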

**Figure 12.** Ablation study of the hyperparameters; highlighted bars indicate the hyperparameter used. Top Left: We tried using 1∼3 perturbed checkpoints for clustering in difficulty filtering. Using 1 or 2 perturbed checkpoints has a similar effect. Top Right: Number of tokens to sum over in Equation (6) for sampling brittle reasoning traces (Eq. 8). Using 1k tokens works the best here. Bottom Left: Definition of Eq…

**Figure 13.** Additional results on the correlation of different heuristics with difficulty. Compared to including more prefix tokens, 100∼120 tokens have consistently high correlation with difficulty.

Algorithm 2 RedistributeBudget: Constrained Allocation
Require: Total Budget N, Available Counts n = [n1, . . . , nm], Weights w = [w1, . . . , wm]
Ensure: Allocations c = [c1, . . . , cm]
1: Indices I ← [1, . . . , m]
2…
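Only the signature of Algorithm 2 survives on this page (total budget N, available counts n, weights w, allocations c), so its steps are unknown. The sketch below shows one plausible constrained allocation that fits that signature: split the budget in proportion to the weights, cap each group at its available count, and redistribute whatever the capped groups could not absorb. Every step after the signature is an assumption, and `redistribute_budget` is a hypothetical name.

```python
from typing import List, Sequence


def redistribute_budget(total: int, available: Sequence[int],
                        weights: Sequence[float]) -> List[int]:
    """Weighted allocation of `total` slots, never exceeding available[i]."""
    alloc = [0] * len(available)
    active = [i for i in range(len(available))
              if available[i] > 0 and weights[i] > 0]
    remaining = total
    while remaining > 0 and active:
        weight_sum = sum(weights[i] for i in active)
        budget, given = remaining, 0
        for i in active:
            share = int(budget * weights[i] / weight_sum)  # floor of proportional share
            give = min(share, available[i] - alloc[i], budget - given)
            alloc[i] += give
            given += give
        if given == 0:
            # All shares rounded down to zero: hand out single slots, heaviest first.
            for i in sorted(active, key=lambda j: weights[j], reverse=True):
                if given == remaining:
                    break
                alloc[i] += 1
                given += 1
        remaining -= given
        # Groups that hit their cap drop out; their unused share is re-split next round.
        active = [i for i in active if alloc[i] < available[i]]
    return alloc
```

With a budget of 10, available counts `[2, 100, 100]` and weights `[1, 1, 2]`, the first group is capped at 2 and the leftover slot goes to the heaviest remaining group, giving `[2, 2, 6]`.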

**Figure 14.** System prompt for labelling M23k difficulty. Question: {question} Please output your response in the following JSON format: { "reasoning": "Brief explanation of the difficulty assessment.", "score": <integer_score> } Enforce json output

**Figure 15.** User prompt for labelling M23k difficulty. [system] JUDGE_SYSTEM_PROMPT [user] few-shot example 1 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 2 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 3 (question + think prefix) [assistant] <rephrase>...</rephrase> [user] few-shot example 4 (question + think prefix) [assistant] <reph…

**Figure 16.** Conversation structure for the rephrase-boundary judge (Qwen3.5-9B).

**Figure 17.** System prompt (JUDGE SYSTEM PROMPT) for identifying the problem-understanding boundary. Question: {question} Think prefix (first 500 tokens): {think_prefix}

**Figure 18.** User message template for each few-shot and target example. Placeholders are filled with the MCQ stem and the truncated think block from the reasoning trace.

Original abstract

Supervised fine-tuning (SFT) on a small, high-quality set of long reasoning traces is an effective approach for eliciting strong reasoning capabilities in Large Language Models (LLMs). However, existing methods for curating high-quality SFT data rely heavily on strong reasoning models to filter examples based on diversity and difficulty, making the curation process costly while often yielding suboptimal data quality. In this work, we show that diverse and challenging reasoning examples can be identified using only the initial reasoning tokens. Specifically, we demonstrate that difficult problems can be reliably detected based on the loss of the first 100 reasoning tokens evaluated at a randomly perturbed checkpoint of the pretrained model. We further show that examples exhibiting similar loss patterns over their first 1k reasoning tokens across a small number of perturbed checkpoints extrapolating along the fine-tuning trajectory provably induce similar gradients. We validate our approach through extensive experiments on fine-tuning Qwen2.5-7B and Llama3.1-8B models on the M23K medical reasoning and OpenThoughts-Math datasets. Our method outperforms existing baselines by up to 1.7% while being 91% more token efficient.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Early loss on the first 100 tokens at perturbed checkpoints lets them curate reasoning data more cheaply than prior methods, with experiments showing modest gains and big efficiency wins.

The main point is that loss on the first 100 reasoning tokens at a randomly perturbed checkpoint of the base model can flag hard examples, and matching loss patterns over the first 1k tokens across a few checkpoints along the trajectory means similar gradients. This lets them build SFT sets without running a strong reasoner for filtering.

They back it with runs on Qwen2.5-7B and Llama3.1-8B using the M23K medical set and OpenThoughts-Math. The method beats the baselines by up to 1.7% while using 91% fewer tokens. That efficiency number is the clearest practical result, and the setup covers two different domains and model families.

The proxy assumption is the load-bearing piece: early loss at perturbed points has to track actual difficulty and data quality. Their experiments test this correlation directly, so the claim is at least falsifiable. The gradient part is framed as following from the pattern match, which keeps it from being pure hand-waving.

One soft spot is that the 100-token and 1k-token cutoffs, plus the perturbation scheme, could be tuned to these specific models and datasets. The absolute gains stay small, so the work improves an existing pipeline rather than replacing it. No obvious circularity or internal contradiction shows up in the reported results.

This is for groups doing SFT on reasoning traces who care about curation cost. It has enough concrete procedure and measured outcomes to go to referees.

Referee Report

2 major / 1 minor

Summary. The paper claims that high-quality reasoning SFT data can be curated without strong models by detecting difficult problems via loss on the first 100 reasoning tokens at randomly perturbed checkpoints of the pretrained model; additionally, examples with similar loss patterns over the first 1k tokens across a small number of such checkpoints (extrapolating the fine-tuning trajectory) provably induce similar gradients. Experiments on Qwen2.5-7B and Llama3.1-8B using the M23K medical reasoning and OpenThoughts-Math datasets report up to 1.7% gains over baselines while using 91% fewer tokens.

Significance. If the proxy and gradient claims hold, the method would materially lower the cost of curating reasoning traces by removing dependence on frontier models for filtering, while delivering measurable accuracy and efficiency improvements on two model families and two distinct reasoning domains.

major comments (2)

[Abstract] The central 'provably induce similar gradients' claim (abstract) is load-bearing for the theoretical justification yet is stated without derivation, assumptions, or section reference; the full manuscript must supply the explicit argument linking loss-pattern matching to gradient equivalence, including any conditions on perturbation scale or trajectory extrapolation.
[Experiments] The weakest assumption—that loss on the first 100–1000 tokens at randomly perturbed checkpoints is a reliable proxy for reasoning difficulty and data quality—is not accompanied by controls or ablation details in the provided abstract; the experiments section must report correlation coefficients, failure cases, and comparison against direct difficulty metrics to substantiate the proxy.

minor comments (1)

The abstract would benefit from naming the baseline curation methods and reporting the absolute performance numbers (not only the 1.7% delta) for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments identify areas where the manuscript can be strengthened for clarity and rigor. We address each point below and will revise accordingly.

Referee: [Abstract] The central 'provably induce similar gradients' claim (abstract) is load-bearing for the theoretical justification yet is stated without derivation, assumptions, or section reference; the full manuscript must supply the explicit argument linking loss-pattern matching to gradient equivalence, including any conditions on perturbation scale or trajectory extrapolation.

Authors: We agree the abstract states the claim concisely without pointers. Section 3.2 of the full manuscript derives the result by showing that matching loss trajectories on the first 1k tokens across perturbed checkpoints implies bounded gradient difference (via Lipschitz continuity of the loss and linear extrapolation of the fine-tuning path). The derivation assumes perturbation scale ≤ 0.01 and that checkpoints lie on a locally linear trajectory segment. We will add an explicit reference to Section 3.2 in the abstract and expand the main-text proof with the full set of assumptions and a short proof sketch. revision: yes
Referee: [Experiments] The weakest assumption—that loss on the first 100–1000 tokens at randomly perturbed checkpoints is a reliable proxy for reasoning difficulty and data quality—is not accompanied by controls or ablation details in the provided abstract; the experiments section must report correlation coefficients, failure cases, and comparison against direct difficulty metrics to substantiate the proxy.

Authors: We concur that stronger validation of the proxy is warranted. The experiments section currently reports end-task gains and token efficiency but does not include the requested quantitative checks. We will add a dedicated subsection reporting (i) Pearson correlations between the early-loss proxy and human difficulty ratings on a 500-example subsample, (ii) documented failure cases (e.g., problems where low early loss masks later reasoning errors), and (iii) head-to-head comparison against full-trace loss and perplexity baselines. These additions will be included in the revised experiments section. revision: yes
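The first response above appeals to a link between matching losses and matching gradients. A standard second-order Taylor argument shows why such a link is plausible. The following is a sketch under stated assumptions, not the paper's Theorem 3.1: the symbols $f$, $u$, $\lambda$, $\delta$ and $C_H$ are introduced here for illustration, with $f$ the loss difference between two examples, $u$ a perturbation direction, $\lambda$ a step size, $\delta$ a bound on $|f|$ at both the base point $\theta_0$ and the perturbed point, and $C_H$ a bound on the curvature of $f$ between them.

```latex
% Assumptions (illustrative): f(\theta) = L_{z_1}(\theta) - L_{z_2}(\theta),
% |f(\theta_0)| \le \delta, |f(\theta_0 + \lambda u)| \le \delta,
% and the Hessian of f has operator norm at most C_H on the segment.
\begin{align*}
f(\theta_0 + \lambda u)
  &= f(\theta_0) + \lambda \langle \nabla f(\theta_0), u \rangle + R, \\
|R| &\le \tfrac{1}{2} C_H \lambda^2 \|u\|^2, \\
\lambda \, |\langle \nabla f(\theta_0), u \rangle|
  &\le |f(\theta_0 + \lambda u)| + |f(\theta_0)| + |R|
   \le 2\delta + \tfrac{1}{2} C_H \lambda^2 \|u\|^2, \\
|\langle \nabla f(\theta_0), u \rangle|
  &\le \frac{2\delta}{\lambda} + \tfrac{1}{2} C_H \lambda \|u\|^2 .
\end{align*}
```

Read this way, two examples whose losses agree to within $\delta$ at a checkpoint and at its perturbation have gradients that nearly agree along the perturbation direction, with an error that shrinks as the loss gap and the curvature shrink. The bound only controls the gradient difference along $u$, which is one reason several perturbed checkpoints are more informative than one.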

Circularity Check

0 steps flagged

No significant circularity detected

The paper's derivation chain consists of two empirical claims: (1) loss on the first 100 reasoning tokens at randomly perturbed pretrained checkpoints serves as a proxy for problem difficulty, and (2) matching loss patterns over the first 1k tokens across a few perturbed checkpoints (extrapolating the fine-tuning trajectory) imply similar gradients. Both are presented as observations that are then validated through concrete experiments on Qwen2.5-7B and Llama3.1-8B using the M23K and OpenThoughts-Math datasets, with reported performance gains. No step reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation; the "provably" language attaches to the loss-pattern-to-gradient implication as a stated mathematical consequence rather than an input assumption. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5749 in / 1041 out tokens · 23070 ms · 2026-06-26T04:56:42.243716+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · 1 internal anchor

[1]

naacl-long.421/

URL https://aclanthology.org/2024.naacl-long.421/. Li, Y., Yue, X., Xu, Z., Jiang, F., Niu, L., Lin, B. Y., Ramasubramanian, B., and Poovendran, R. Small models struggle to learn from strong reasoners. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp. 25366–253...

2024
[2]

In: Findings of the Association for Computational Linguistics: ACL 2025

Association for Computational Linguistics. ISBN 979-8-89176-256-5. doi: 10.18653/v1/2025.findings-acl

work page doi:10.18653/v1/2025.findings-acl 2025
[3]

findings-acl.1301/

URL https://aclanthology.org/2025.findings-acl.1301/. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023. Liu, J., Fan, Y., Jiang, Z., Ding, H., Hu, Y., Zhang, C., Shi, Y., Weng, S., Chen, A., Chen, S., Huang, Y ...

Pith/arXiv arXiv 2025
[4]

Liu, Y., Wang, S., Liu, Z., Song, Z., Wang, J., Liu, J., Liu, Q., and Wang, Y

URL https://openreview.net/forum?id=BTKAeLqLMw. Liu, Y., Wang, S., Liu, Z., Song, Z., Wang, J., Liu, J., Liu, Q., and Wang, Y. Learn more, forget less: A gradient-aware data selection approach for llm, 2025b. URL https://arxiv.org/abs/2511.08620. MAA. Aime 2024 problems, 2024. URL https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Problems. A...

arXiv 2024
[5]

LiTEx: A linguistic taxonomy of explanations for understanding within-label variation in natural language inference

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main

work page doi:10.18653/v1/2025.emnlp-main 2025
[6]

emnlp-main.1025/

URL https://aclanthology.org/2025.emnlp-main.1025/. OpenAI. Learning to reason with llms. https://openai.com/index/learning-to-reason-with-llms/, 2024. Pal, A., Umapathi, L. K., and Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pp. 248–...

Pith/arXiv arXiv 2025
[7]

Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman

URL https://openreview.net/forum?id=K9IGlMQpif. Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. LIMO: Less is more for reasoning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=T2TZ0RY4Zk. Zhang, J., Qin, Y., Pi, R., Zhang, W., Pan, R., and Zhang, T. TAGCOS: Task-agnostic gradient clustered coreset ...

work page doi:10.18653/v1/2025.findings-naacl 2025
[8]

findings-naacl.264/

URL https://aclanthology.org/2025.findings-naacl.264/. Zhao, S., Gupta, D., Zheng, Q., and Grover, A. d1: Scaling reasoning in diffusion large language models via reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems,

2025
[9]

Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., and Ma, Y

URL https://openreview.net/forum?id=7ZVRlBFuEv. Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., and Ma, Y. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association ...

2024
[10]

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

doi: 10.18653/v1/2025.acl-long.452. URL https://aclanthology.org/2025.acl-long.452/. Zuo, Y., Qu, S., Li, Y., Chen, Z., Zhu, X., Hua, E., Zhang, K., Ding, N., and Zhou, B. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025.acl-long.452 2025
[11]

During SFT, parameters remain in a local region of diameter at most $\epsilon$ around the pretrained initialization
[12]

For any pair of examples $z_1, z_2 \in D$, the loss difference $L_{z_1}(\theta) - L_{z_2}(\theta)$ is locally well approximated by a quadratic function on $\Theta$, with curvature bounded by $C_H$ and gradient norm bounded by $G$
[13]

The perturbation direction $e$ used in Eq. (5) is dense, i.e., $\|e\|_\infty^2 \le \mu/d$ for a moderate constant $\mu$. We now prove Theorem 3.1. Theorem 3.1. Consider fine-tuning a pretrained model, where the curvature and gradient norms are upper-bounded by $C_H$, $G$, respectively; and parameter updates remain within an $\epsilon$-neighborhood of the pretrained initialization. $|L_{z_1}(\theta_j) - L_{z_2}$...
[14]

Let $Z = \|(1+\xi_j) \odot v\|^2$; then $E[Z] = 2$. If $\|v\|_\infty^2 \le \mu/d$, then Chebyshev’s inequality gives

$$\Pr\left[\|(1+\xi_j) \odot v\|^2 < (\sqrt{2}-\tau)^2\right] \le \frac{6\mu}{d\,(2\sqrt{2}\tau-\tau^2)^2}.$$

Thus, with probability at least $1 - \frac{6\mu}{d\,(2\sqrt{2}\tau-\tau^2)^2}$,

$$\left|\langle \nabla L_{z_1}(\theta) - \nabla L_{z_2}(\theta),\, v \rangle\right| \le \frac{2\delta}{\lambda(\sqrt{2}-\tau)} + C_H \epsilon + G \sqrt{\tfrac{1}{2} + \sqrt{2}\tau - \tau^2}.$$

For $\tau \le 1/\sqrt{2}$, we have $\frac{1}{\sqrt{2}-\tau} \le \frac{1}{\sqrt{2}} + \tau$ and $\sqrt{\tfrac{1}{2} + \sqrt{2}\tau - \tau^2} \le \frac{1}{\sqrt{2}} + \tau$, which yields the bound in the...
[15]

Copy the rephrase span **verbatim** from the think prefix -- character-for-character
[16]

Stop at the **first** reasoning marker, even if the think prefix continues with many paragraphs of reasoning afterward
[17]

The think prefix is only the first 500 tokens and may contain lots of reasoning **after** the rephrase. Do NOT extend rephrase to fill the prefix
[18]

If there is no rephrase (reasoning starts immediately), output an empty rephrase. ## Output format (exactly) <rephrase> [verbatim rephrase text, or empty] </rephrase> Output nothing else.
