arxiv: 2605.13165 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: unknown

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Chenjun Xu , Zhennan Zhou , Zhan Su , Bill Howe , Lucy Lu Wang , Bingbing Wen

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords structured pruninglong chain-of-thoughton-policy pruninglow-data fine-tuningreasoning efficiencytoken reductionmath reasoning

0 comments

The pith

Structured on-policy pruning cuts long reasoning traces by 19-42% while preserving accuracy in low-data fine-tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STOP, an algorithm that prunes unnecessary length from chain-of-thought reasoning generated by language models during low-data fine-tuning. It turns the model's own traces into structured trees by segmenting them into nodes and labeling their roles, then keeps only the shortest prefix that reaches the first correct answer. A sympathetic reader would care because this lowers token count and inference cost without large teacher models or extra supervision. Results on GSM8K, Math 500, and AIME 2024 with DeepSeek-R1-Distill models show token reductions of 19.4 to 42.4 percent and largely unchanged accuracy. The method also produces smaller shifts in the model's output distribution than pruning guided by external teachers.

Core claim

STOP constructs self-distilled traces from the model, maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction, then applies the Earliest Correct Node rule to retain the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity.

What carries the argument

The Earliest Correct Node (ECN) rule applied to a reasoning tree built from self-distilled traces via node segmentation and taxonomy annotation

If this is right

Token reduction of 19.4-42.4% on GSM8K, Math 500, and AIME 2024 in low-data fine-tuning of DeepSeek-R1-Distill models
Smaller distributional shift in model outputs than teacher-guided pruning methods
Higher structural efficiency in the reasoning traces that remain after pruning
Reallocation of reasoning effort away from redundant verification toward productive exploration

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The node-based structure could be adapted to prune reasoning in non-math domains if similar segmentation rules are defined for those tasks
Shorter generated outputs would reduce latency in real-time applications that rely on the same fine-tuned models
Iterating the on-policy self-distillation step multiple times could compound the efficiency gains without additional labeled data

Load-bearing premise

Mapping raw reasoning traces into a structured interface with nodes and taxonomy labels preserves original semantic meaning and does not introduce new errors that affect final-answer correctness.

What would settle it

Applying STOP to a held-out math benchmark or different model family and measuring an accuracy drop exceeding 5 percentage points while tokens decrease would settle whether accuracy is preserved.

Figures

Figures reproduced from arXiv: 2605.13165 by Bill Howe, Bingbing Wen, Chenjun Xu, Lucy Lu Wang, Zhan Su, Zhennan Zhou.

**Figure 3.** Figure 3: Reasoning node-type distribution on Math 500 Level 3 for Qwen Base and Self-distilled-ECN. Self-distilled-ECN shifts reasoning toward more Exploration and less Verification/Backtracking, while largely preserving Conclusion. On GSM8K, Self-distilled-ECN preserves performance while substantially reducing token usage. For Qwen, it reaches 91.1% accuracy versus 90.1% for Base while reducing tokens by 27.8%. F… view at source ↗

**Figure 4.** Figure 4: Accuracy gains and token savings of self-distilled models across three supervision [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Response-length distribution of the base Qwen model and the selfdistilled ECN-Qwen model on Math 500 (Level 3). Overthinking and Reasoning-Length Control Recent reasoning language models (RLMs) achieve strong performance on multi-step tasks by generating long chain-of-thought (CoT) traces (Jaech et al., 2024; Guo et al., 2025). However, extended reasoning often introduces substantial computational overh… view at source ↗

**Figure 6.** Figure 6: Token accuracy over training steps for different supervision and pruning strategies. C Human Annotation Evaluation To evaluate the reliability of the model-assisted annotation pipeline, we conduct a human evaluation comparing LLM-generated annotations with human annotations for two key tasks: taxonomy labeling and conclusion correctness judgment. C.1 Annotation Protocol To evaluate the reliability of the … view at source ↗

read the original abstract

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

STOP gives a concrete on-policy way to prune redundant reasoning steps in low-data fine-tuning, but the node segmentation step lacks direct checks that it keeps semantics intact.

read the letter

The core contribution is an on-policy pruning method that turns a model's own long traces into a tree of nodes with taxonomy labels, then keeps only the shortest prefix up to the first node that both concludes the answer and gets it right. On DeepSeek-R1-Distill models fine-tuned on GSM8K, Math 500, and AIME 2024, this cuts generated tokens 19-42% while accuracy stays close to the unpruned baseline. The approach avoids external teachers and produces smaller distributional shifts than teacher-guided pruning, which is useful in the low-data setting the paper targets. The ECN rule itself is a clear algorithmic piece that goes beyond simple length or confidence cutoffs. The analyses of how reasoning effort shifts away from backtracking are also worth noting. The main gap is that the paper does not report any direct test of whether the node segmentation and taxonomy steps preserve the original reasoning semantics. If a boundary splits a necessary step or a label misclassifies a node, the retained prefix could drop critical information even when the final answer still matches. No ablation removes the annotation step, no inter-annotator or model agreement numbers are given, and the abstract supplies no run-to-run variance or statistical tests on the efficiency numbers. Those omissions make it hard to judge how robust the gains are. The work is aimed at groups doing low-data adaptation of reasoning models who need lower inference cost without large-scale distillation. It has a well-defined procedure and measurable results on standard benchmarks, so it should go to peer review rather than desk rejection. The authors will need to add the missing checks on segmentation quality and basic statistical reporting, but the idea is solid enough to review.

Referee Report

2 major / 2 minor

Summary. The paper proposes STOP, an on-policy pruning method for long chain-of-thought reasoning traces. It generates self-distilled traces, segments them into nodes with taxonomy annotations, builds a reasoning tree, and applies the Earliest Correct Node (ECN) rule to retain only the shortest prefix ending at the first node that concludes with the correct final answer. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B using GSM8K, Math 500, and AIME 2024 in low-data fine-tuning regimes report token reductions of 19.4-42.4% with largely preserved accuracy, smaller distributional shift than teacher-guided baselines, and improved structural efficiency.

Significance. If the efficiency gains and semantic-preservation claims hold under statistical scrutiny, the work would be significant for low-data adaptation of reasoning models: it offers a self-supervised alternative to distillation or test-time control, quantifies reductions in redundant verification/backtracking, and demonstrates that structured on-policy pruning can reallocate reasoning effort productively without large teacher models.

major comments (2)

[Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.
[Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.

minor comments (2)

[Method] Notation for the reasoning-tree construction and ECN selection rule could be clarified with a small worked example showing a full trace, its segmented nodes, taxonomy labels, and the retained prefix.
[Related Work] The paper should cite and briefly contrast with recent on-policy or self-distillation pruning baselines (e.g., those using reward models or length penalties) to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve statistical reporting and methodological validation.

read point-by-point responses

Referee: [Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.

Authors: We agree that stronger statistical reporting would better support the claims in low-data settings. The manuscript uses fixed, publicly documented train/validation/test splits for each benchmark (detailed in Section 4.1 and Appendix B), but reports single-run results due to compute constraints. We will revise to explicitly restate the splits, add p-values where appropriate for accuracy comparisons, and include standard deviations from at least three random seeds for the main token-reduction and accuracy metrics in the revised experiments section. revision: yes
Referee: [Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.

Authors: This observation is correct and highlights a gap in validation. While ECN is defined to select the earliest node that both concludes the solution and matches the ground-truth answer (thereby preserving the prefix up to that point), we did not quantify boundary reliability or run ablations. In the revision we will add: (1) an error analysis of all cases where pruning at an ECN-labeled node yields an incorrect final answer, and (2) an ablation that removes the taxonomy annotation step to measure its impact on segmentation quality and downstream token savings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in algorithmic procedure or empirical claims

full rationale

The paper presents STOP as an on-policy algorithmic pipeline: self-distilled trace construction followed by node segmentation, taxonomy annotation, reasoning-tree building, and ECN selection of the shortest correct prefix. Reported token reductions (19.4-42.4%) and accuracy preservation are measured outcomes from fine-tuning experiments on GSM8K, Math 500, and AIME 2024 using specific DeepSeek-R1-Distill models. No equations, fitted parameters, or self-citations are shown to reduce these gains to quantities defined by construction within the paper. The core claims rest on the independent execution of the described procedure and external evaluation, satisfying the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review performed on abstract only; full paper text unavailable for exhaustive ledger construction.

invented entities (1)

Earliest Correct Node (ECN) no independent evidence
purpose: Identify the shortest prefix of a reasoning trace that reaches a correct final answer
Central pruning criterion introduced by the method

pith-pipeline@v0.9.0 · 5584 in / 1059 out tokens · 60333 ms · 2026-05-14T19:21:43.248451+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 28 canonical work pages · 7 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

1972
[2]

Publications Manual , year = "1983", publisher =

1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of
[5]

Dan Gusfield , title =. 1997

1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
[9]

Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296, 2025

Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning , author=. arXiv preprint arXiv:2504.01296 , year=

work page arXiv
[10]

L1: Controlling how long a reasoning model thinks with reinforcement learning.ArXiv, abs/2503.04697, 2025

L1: Controlling how long a reasoning model thinks with reinforcement learning , author=. arXiv preprint arXiv:2503.04697 , year=

work page arXiv
[12]

Transactions on Machine Learning Research , issn=

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2025 , url=

2025
[15]

arXiv preprint arXiv:2502.09601 , year=

Cot-valve: Length-compressible chain-of-thought tuning , author=. arXiv preprint arXiv:2502.09601 , year=

work page arXiv
[19]

Findings of the Association for Computational Linguistics: ACL 2025 , year =

Token-Budget-Aware LLM Reasoning , author =. Findings of the Association for Computational Linguistics: ACL 2025 , year =

2025
[22]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

CoT-Valve: Length-Compressible Chain-of-Thought Tuning , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =
[24]

International Conference on Learning Representations (ICLR) , year =

LoRA: Low-Rank Adaptation of Large Language Models , author =. International Conference on Learning Representations (ICLR) , year =
[25]

Xingyu Chen and Jiahao Xu and Tian Liang and Zhiwei He and Jianhui Pang and Dian Yu and Linfeng Song and Qiuzhi Liu and Mengfei Zhou and Zhuosheng Zhang and Rui Wang and Zhaopeng Tu and Haitao Mi and Dong Yu , booktitle=. Do. 2025 , url=

2025
[27]

Towards reasoning in large language models: A survey

Towards reasoning in large language models: A survey , author=. arXiv preprint arXiv:2212.10403 , year=

work page arXiv
[28]

Advances in Neural Information Processing Systems , editor=

Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

2022
[29]

Advances in Neural Information Processing Systems , volume=

Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=
[31]

Reasoning with language model is planning with world model

Reasoning with language model is planning with world model , author=. arXiv preprint arXiv:2305.14992 , year=

work page arXiv
[32]

arXiv preprint arXiv:2505.22148 , year=

What makes a good reasoning chain? Uncovering structural patterns in long chain-of-thought reasoning , author=. arXiv preprint arXiv:2505.22148 , year=

work page arXiv
[33]

1972 , publisher=

Human problem solving , author=. 1972 , publisher=

1972
[34]

IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

Demystifying chains, trees, and graphs of thoughts , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
[35]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[36]

Let's Verify Step by Step

Let's verify step by step , author=. arXiv preprint arXiv:2305.20050 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Towards reasoning era: A survey of long chain-of-thought for reasoning large language models , author=. arXiv preprint arXiv:2503.09567 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[39]

2024 , note=

MATH-500: A curated benchmark of 500 advanced math problems , author=. 2024 , note=

2024
[40]

2024 , note=

American Invitational Mathematics Examination 2024 (AIME 2024) , author=. 2024 , note=

2024
[41]

2025 , howpublished =

MixChain‑Z‑PRM12K: Math Reasoning Dataset (12K) on Hugging Face , author =. 2025 , howpublished =

2025
[42]

2025 , journal =

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting , author =. 2025 , journal =

2025
[43]

2025 , eprint=

Small Models Struggle to Learn from Strong Reasoners , author=. 2025 , eprint=

2025
[44]

2025 , eprint=

NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks , author=. 2025 , eprint=

2025
[45]

Stańczyk and Sabela Ramos and Matthieu Geist and Olivier Bachem , booktitle =

Rishabh Agarwal and Nino Vieillard and Yongchao Zhou and P. Stańczyk and Sabela Ramos and Matthieu Geist and Olivier Bachem , booktitle =. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , year =
[46]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

Demystifying Chains, Trees, and Graphs of Thoughts , author =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =
[47]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

2025
[48]

Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025

Your Thoughts Tell Who You Are: Characterize the Reasoning Patterns of LRMs , author =. arXiv preprint arXiv:2509.24147 , year =

work page arXiv
[50]

Stańczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, P. Stańczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2023

2023
[51]

u rgen M \

Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwa \'s niewski, J \"u rgen M \"u ller, et al. Demystifying chains, trees, and graphs of thoughts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47 0 (12): 0 10967--10989, 2025

2025
[52]

Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint, arXiv:2510.18874, 2025 a . URL https://arxiv.org/abs/2510.18874

work page arXiv 2025
[53]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025 b . URL https://openreview.net/forum?...

2025
[54]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

work page arXiv 2025
[56]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Token-budget-aware llm reasoning

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

2025
[58]

Mixchain‑z‑prm12k: Math reasoning dataset (12k) on hugging face

horseee . Mixchain‑z‑prm12k: Math reasoning dataset (12k) on hugging face. https://huggingface.co/datasets/horseee/MixChain-Z-PRM12K, 2025. Accessed 2025‑01‑06; contains 12,000 math reasoning samples with multi‑solution annotations

2025
[59]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

2022
[60]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning

Gangwei Jiang, Yahui Liu, Zhaoyi Li, Wei Bi, Fuzheng Zhang, Linqi Song, Ying Wei, and Defu Lian. What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 6490--6514, 2025

2025
[62]

Math-500: A curated benchmark of 500 advanced math problems, 2024

Authors Lightman. Math-500: A curated benchmark of 500 advanced math problems, 2024. Subset of the MATH dataset focusing on 500 challenging problems

2024
[63]

Style over substance: Distilled language models reason via stylistic replication

Philip Lippmann and Jie Yang. Style over substance: Distilled language models reason via stylistic replication. arXiv preprint arXiv:2504.01738, 2025

work page arXiv 2025
[64]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.ArXiv, abs/2501.12570, 2025

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025

work page arXiv 2025
[65]

Cot-valve: Length-compressible chain-of-thought tuning

Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 6025--6035, 2025

2025
[66]

American invitational mathematics examination 2024 (aime 2024), 2024

Mathematical Association of America (MAA). American invitational mathematics examination 2024 (aime 2024), 2024. Benchmark dataset of 30 challenging math problems from the 2024 AIME contest

2024
[67]

Codi: Compressing chain-of-thought into continuous space via self-distillation.arXiv preprint arXiv: 2502.21074,

Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025

work page arXiv 2025
[68]

Stop overthinking: A survey on efficient reasoning for large language models

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=HvoG8SxggZ

2025
[69]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[70]

Thoughts are all over the place: On the underthinking of o1-like llms

Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025

work page arXiv 2025
[71]

Chi, Quoc V Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=...

2022
[72]

The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis

Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, and Xueqi Cheng. The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis. arXiv preprint arXiv:2508.17627, 2026

work page arXiv 2026
[73]

Tokenskip: Controllable chain-of-thought compression in llms

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025

work page arXiv 2025
[74]

Think when you need: Self-adaptive chain-of-thought learning

Junjie Yang, Ke Lin, and Xing Yu. Think when you need: Self-adaptive chain-of-thought learning. arXiv preprint arXiv:2504.03234, 2025

work page arXiv 2025
[75]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024

2024
[76]

Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning.ArXiv, abs/2504.21370, 2025

Jingyang Yi and Jiazheng Wang. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. arXiv preprint arXiv:2504.21370, 2025

work page arXiv 2025
[77]

Naturalthoughts: Selecting and distilling reasoning traces for general reasoning tasks, 2025

Weizhe Yuan, Xian Li, Jason Weston, Dong Wang, Shang-Wen Li, Yang Li, Ilia Kulikov, Thao Nguyen, Karthik Padthe, Jack Lanchantin, and Youssef Emad. Naturalthoughts: Selecting and distilling reasoning traces for general reasoning tasks, 2025. URL https://arxiv.org/abs/2507.01921

work page arXiv 2025
[78]

Small models struggle to learn from strong reasoners, 2025

Xiang Yue, Bill Yuchen Lin, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Radha Poovendran, Yuetai Li, and Bhaskar Ramasubramanian. Small models struggle to learn from strong reasoners, 2025. URL https://arxiv.org/abs/2502.12143

work page arXiv 2025
[79]

Can pruning improve reasoning? revisit- ing long-cot compression with capability in mind for bet- ter reasoning.arXiv preprint arXiv:2505.14582, 2025

Yuxin Zhang et al. Can pruning improve reasoning? revisiting long-cot compression with capability in mind. arXiv preprint arXiv:2505.14582, 2025

work page arXiv 2025