pith. machine review for the scientific record. sign in

arxiv: 2605.13165 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: unknown

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords structured pruninglong chain-of-thoughton-policy pruninglow-data fine-tuningreasoning efficiencytoken reductionmath reasoning
0
0 comments X

The pith

Structured on-policy pruning cuts long reasoning traces by 19-42% while preserving accuracy in low-data fine-tuning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STOP, an algorithm that prunes unnecessary length from chain-of-thought reasoning generated by language models during low-data fine-tuning. It turns the model's own traces into structured trees by segmenting them into nodes and labeling their roles, then keeps only the shortest prefix that reaches the first correct answer. A sympathetic reader would care because this lowers token count and inference cost without large teacher models or extra supervision. Results on GSM8K, Math 500, and AIME 2024 with DeepSeek-R1-Distill models show token reductions of 19.4 to 42.4 percent and largely unchanged accuracy. The method also produces smaller shifts in the model's output distribution than pruning guided by external teachers.

Core claim

STOP constructs self-distilled traces from the model, maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction, then applies the Earliest Correct Node rule to retain the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity.

What carries the argument

The Earliest Correct Node (ECN) rule applied to a reasoning tree built from self-distilled traces via node segmentation and taxonomy annotation

If this is right

  • Token reduction of 19.4-42.4% on GSM8K, Math 500, and AIME 2024 in low-data fine-tuning of DeepSeek-R1-Distill models
  • Smaller distributional shift in model outputs than teacher-guided pruning methods
  • Higher structural efficiency in the reasoning traces that remain after pruning
  • Reallocation of reasoning effort away from redundant verification toward productive exploration

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The node-based structure could be adapted to prune reasoning in non-math domains if similar segmentation rules are defined for those tasks
  • Shorter generated outputs would reduce latency in real-time applications that rely on the same fine-tuned models
  • Iterating the on-policy self-distillation step multiple times could compound the efficiency gains without additional labeled data

Load-bearing premise

Mapping raw reasoning traces into a structured interface with nodes and taxonomy labels preserves original semantic meaning and does not introduce new errors that affect final-answer correctness.

What would settle it

Applying STOP to a held-out math benchmark or different model family and measuring an accuracy drop exceeding 5 percentage points while tokens decrease would settle whether accuracy is preserved.

Figures

Figures reproduced from arXiv: 2605.13165 by Bill Howe, Bingbing Wen, Chenjun Xu, Lucy Lu Wang, Zhan Su, Zhennan Zhou.

Figure 1
Figure 1. Figure 1: Overview of STOP. The framework first constructs successful self-distilled traces, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Reasoning node-type distribution on Math 500 Level 3 for Qwen Base and Self-distilled-ECN. Self-distilled-ECN shifts reasoning toward more Exploration and less Verification/Backtracking, while largely pre￾serving Conclusion. On GSM8K, Self-distilled-ECN preserves performance while substantially reducing token usage. For Qwen, it reaches 91.1% accuracy versus 90.1% for Base while reducing tokens by 27.8%. F… view at source ↗
Figure 4
Figure 4. Figure 4: Accuracy gains and token savings of self-distilled models across three supervision [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Response-length distribution of the base Qwen model and the self￾distilled ECN-Qwen model on Math 500 (Level 3). Overthinking and Reasoning-Length Control Recent reasoning language models (RLMs) achieve strong performance on multi-step tasks by generating long chain-of-thought (CoT) traces (Jaech et al., 2024; Guo et al., 2025). How￾ever, extended reasoning often introduces sub￾stantial computational overh… view at source ↗
Figure 6
Figure 6. Figure 6: Token accuracy over training steps for different supervision and pruning strate￾gies. C Human Annotation Evaluation To evaluate the reliability of the model-assisted annotation pipeline, we conduct a human evaluation comparing LLM-generated annotations with human annotations for two key tasks: taxonomy labeling and conclusion correctness judgment. C.1 Annotation Protocol To evaluate the reliability of the … view at source ↗
read the original abstract

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes STOP, an on-policy pruning method for long chain-of-thought reasoning traces. It generates self-distilled traces, segments them into nodes with taxonomy annotations, builds a reasoning tree, and applies the Earliest Correct Node (ECN) rule to retain only the shortest prefix ending at the first node that concludes with the correct final answer. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B using GSM8K, Math 500, and AIME 2024 in low-data fine-tuning regimes report token reductions of 19.4-42.4% with largely preserved accuracy, smaller distributional shift than teacher-guided baselines, and improved structural efficiency.

Significance. If the efficiency gains and semantic-preservation claims hold under statistical scrutiny, the work would be significant for low-data adaptation of reasoning models: it offers a self-supervised alternative to distillation or test-time control, quantifies reductions in redundant verification/backtracking, and demonstrates that structured on-policy pruning can reallocate reasoning effort productively without large teacher models.

major comments (2)
  1. [Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.
  2. [Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.
minor comments (2)
  1. [Method] Notation for the reasoning-tree construction and ECN selection rule could be clarified with a small worked example showing a full trace, its segmented nodes, taxonomy labels, and the retained prefix.
  2. [Related Work] The paper should cite and briefly contrast with recent on-policy or self-distillation pruning baselines (e.g., those using reward models or length penalties) to better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve statistical reporting and methodological validation.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.

    Authors: We agree that stronger statistical reporting would better support the claims in low-data settings. The manuscript uses fixed, publicly documented train/validation/test splits for each benchmark (detailed in Section 4.1 and Appendix B), but reports single-run results due to compute constraints. We will revise to explicitly restate the splits, add p-values where appropriate for accuracy comparisons, and include standard deviations from at least three random seeds for the main token-reduction and accuracy metrics in the revised experiments section. revision: yes

  2. Referee: [Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.

    Authors: This observation is correct and highlights a gap in validation. While ECN is defined to select the earliest node that both concludes the solution and matches the ground-truth answer (thereby preserving the prefix up to that point), we did not quantify boundary reliability or run ablations. In the revision we will add: (1) an error analysis of all cases where pruning at an ECN-labeled node yields an incorrect final answer, and (2) an ablation that removes the taxonomy annotation step to measure its impact on segmentation quality and downstream token savings. revision: yes

Circularity Check

0 steps flagged

No significant circularity in algorithmic procedure or empirical claims

full rationale

The paper presents STOP as an on-policy algorithmic pipeline: self-distilled trace construction followed by node segmentation, taxonomy annotation, reasoning-tree building, and ECN selection of the shortest correct prefix. Reported token reductions (19.4-42.4%) and accuracy preservation are measured outcomes from fine-tuning experiments on GSM8K, Math 500, and AIME 2024 using specific DeepSeek-R1-Distill models. No equations, fitted parameters, or self-citations are shown to reduce these gains to quantities defined by construction within the paper. The core claims rest on the independent execution of the described procedure and external evaluation, satisfying the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Review performed on abstract only; full paper text unavailable for exhaustive ledger construction.

invented entities (1)
  • Earliest Correct Node (ECN) no independent evidence
    purpose: Identify the shortest prefix of a reasoning trace that reaches a correct final answer
    Central pruning criterion introduced by the method

pith-pipeline@v0.9.0 · 5584 in / 1059 out tokens · 60333 ms · 2026-05-14T19:21:43.248451+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

65 extracted references · 28 canonical work pages · 7 internal anchors

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [9]

    Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296, 2025

    Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning , author=. arXiv preprint arXiv:2504.01296 , year=

  9. [10]

    L1: Controlling how long a reasoning model thinks with reinforcement learning.ArXiv, abs/2503.04697, 2025

    L1: Controlling how long a reasoning model thinks with reinforcement learning , author=. arXiv preprint arXiv:2503.04697 , year=

  10. [12]

    Transactions on Machine Learning Research , issn=

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models , author=. Transactions on Machine Learning Research , issn=. 2025 , url=

  11. [15]

    arXiv preprint arXiv:2502.09601 , year=

    Cot-valve: Length-compressible chain-of-thought tuning , author=. arXiv preprint arXiv:2502.09601 , year=

  12. [19]

    Findings of the Association for Computational Linguistics: ACL 2025 , year =

    Token-Budget-Aware LLM Reasoning , author =. Findings of the Association for Computational Linguistics: ACL 2025 , year =

  13. [22]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

    CoT-Valve: Length-Compressible Chain-of-Thought Tuning , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

  14. [24]

    International Conference on Learning Representations (ICLR) , year =

    LoRA: Low-Rank Adaptation of Large Language Models , author =. International Conference on Learning Representations (ICLR) , year =

  15. [25]

    Xingyu Chen and Jiahao Xu and Tian Liang and Zhiwei He and Jianhui Pang and Dian Yu and Linfeng Song and Qiuzhi Liu and Mengfei Zhou and Zhuosheng Zhang and Rui Wang and Zhaopeng Tu and Haitao Mi and Dong Yu , booktitle=. Do. 2025 , url=

  16. [27]

    Towards reasoning in large language models: A survey

    Towards reasoning in large language models: A survey , author=. arXiv preprint arXiv:2212.10403 , year=

  17. [28]

    Advances in Neural Information Processing Systems , editor=

    Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. Advances in Neural Information Processing Systems , editor=. 2022 , url=

  18. [29]

    Advances in Neural Information Processing Systems , volume=

    Tree of thoughts: Deliberate problem solving with large language models , author=. Advances in Neural Information Processing Systems , volume=

  19. [31]

    Reasoning with language model is planning with world model

    Reasoning with language model is planning with world model , author=. arXiv preprint arXiv:2305.14992 , year=

  20. [32]

    arXiv preprint arXiv:2505.22148 , year=

    What makes a good reasoning chain? Uncovering structural patterns in long chain-of-thought reasoning , author=. arXiv preprint arXiv:2505.22148 , year=

  21. [33]

    1972 , publisher=

    Human problem solving , author=. 1972 , publisher=

  22. [34]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

    Demystifying chains, trees, and graphs of thoughts , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=

  23. [35]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  24. [36]

    Let's Verify Step by Step

    Let's verify step by step , author=. arXiv preprint arXiv:2305.20050 , year=

  25. [37]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Towards reasoning era: A survey of long chain-of-thought for reasoning large language models , author=. arXiv preprint arXiv:2503.09567 , year=

  26. [39]

    2024 , note=

    MATH-500: A curated benchmark of 500 advanced math problems , author=. 2024 , note=

  27. [40]

    2024 , note=

    American Invitational Mathematics Examination 2024 (AIME 2024) , author=. 2024 , note=

  28. [41]

    2025 , howpublished =

    MixChain‑Z‑PRM12K: Math Reasoning Dataset (12K) on Hugging Face , author =. 2025 , howpublished =

  29. [42]

    2025 , journal =

    Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting , author =. 2025 , journal =

  30. [43]

    2025 , eprint=

    Small Models Struggle to Learn from Strong Reasoners , author=. 2025 , eprint=

  31. [44]

    2025 , eprint=

    NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks , author=. 2025 , eprint=

  32. [45]

    Stańczyk and Sabela Ramos and Matthieu Geist and Olivier Bachem , booktitle =

    Rishabh Agarwal and Nino Vieillard and Yongchao Zhou and P. Stańczyk and Sabela Ramos and Matthieu Geist and Olivier Bachem , booktitle =. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes , year =

  33. [46]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

    Demystifying Chains, Trees, and Graphs of Thoughts , author =. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume =

  34. [47]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

    What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning , author =. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages =

  35. [48]

    Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025

    Your Thoughts Tell Who You Are: Characterize the Reasoning Patterns of LRMs , author =. arXiv preprint arXiv:2509.24147 , year =

  36. [50]

    Stańczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, P. Stańczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2023

  37. [51]

    u rgen M \

    Maciej Besta, Florim Memedi, Zhenyu Zhang, Robert Gerstenberger, Guangyuan Piao, Nils Blach, Piotr Nyczyk, Marcin Copik, Grzegorz Kwa \'s niewski, J \"u rgen M \"u ller, et al. Demystifying chains, trees, and graphs of thoughts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47 0 (12): 0 10967--10989, 2025

  38. [52]

    Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint, arXiv:2510.18874, 2025 a . URL https://arxiv.org/abs/2510.18874

  39. [53]

    Do NOT think that much for 2+3=? on the overthinking of long reasoning models

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025 b . URL https://openreview.net/forum?...

  40. [54]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  41. [55]

    Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars. arXiv preprint arXiv:2503.01307, 2025

  42. [56]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  43. [57]

    Token-budget-aware llm reasoning

    Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, 2025

  44. [58]

    Mixchain‑z‑prm12k: Math reasoning dataset (12k) on hugging face

    horseee . Mixchain‑z‑prm12k: Math reasoning dataset (12k) on hugging face. https://huggingface.co/datasets/horseee/MixChain-Z-PRM12K, 2025. Accessed 2025‑01‑06; contains 12,000 math reasoning samples with multi‑solution annotations

  45. [59]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  46. [60]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  47. [61]

    What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning

    Gangwei Jiang, Yahui Liu, Zhaoyi Li, Wei Bi, Fuzheng Zhang, Linqi Song, Ying Wei, and Defu Lian. What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 6490--6514, 2025

  48. [62]

    Math-500: A curated benchmark of 500 advanced math problems, 2024

    Authors Lightman. Math-500: A curated benchmark of 500 advanced math problems, 2024. Subset of the MATH dataset focusing on 500 challenging problems

  49. [63]

    Style over substance: Distilled language models reason via stylistic replication

    Philip Lippmann and Jie Yang. Style over substance: Distilled language models reason via stylistic replication. arXiv preprint arXiv:2504.01738, 2025

  50. [64]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.ArXiv, abs/2501.12570, 2025

    Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570, 2025

  51. [65]

    Cot-valve: Length-compressible chain-of-thought tuning

    Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, and Xinchao Wang. Cot-valve: Length-compressible chain-of-thought tuning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 6025--6035, 2025

  52. [66]

    American invitational mathematics examination 2024 (aime 2024), 2024

    Mathematical Association of America (MAA). American invitational mathematics examination 2024 (aime 2024), 2024. Benchmark dataset of 30 challenging math problems from the 2024 AIME contest

  53. [67]

    Codi: Compressing chain-of-thought into continuous space via self-distillation.arXiv preprint arXiv: 2502.21074,

    Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, and Yulan He. Codi: Compressing chain-of-thought into continuous space via self-distillation. arXiv preprint arXiv:2502.21074, 2025

  54. [68]

    Stop overthinking: A survey on efficient reasoning for large language models

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=HvoG8SxggZ

  55. [69]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025

  56. [70]

    Thoughts are all over the place: On the underthinking of o1-like llms

    Yue Wang, Qiuzhi Liu, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Linfeng Song, Dian Yu, Juntao Li, Zhuosheng Zhang, et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025

  57. [71]

    Chi, Quoc V Le, and Denny Zhou

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=...

  58. [72]

    The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis

    Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, and Xueqi Cheng. The evolution of thought: Tracking llm overthinking via reasoning dynamics analysis. arXiv preprint arXiv:2508.17627, 2026

  59. [73]

    Tokenskip: Controllable chain-of-thought compression in llms

    Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. arXiv preprint arXiv:2502.12067, 2025

  60. [74]

    Think when you need: Self-adaptive chain-of-thought learning

    Junjie Yang, Ke Lin, and Xing Yu. Think when you need: Self-adaptive chain-of-thought learning. arXiv preprint arXiv:2504.03234, 2025

  61. [75]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024

  62. [76]

    Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning.ArXiv, abs/2504.21370, 2025

    Jingyang Yi and Jiazheng Wang. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. arXiv preprint arXiv:2504.21370, 2025

  63. [77]

    Naturalthoughts: Selecting and distilling reasoning traces for general reasoning tasks, 2025

    Weizhe Yuan, Xian Li, Jason Weston, Dong Wang, Shang-Wen Li, Yang Li, Ilia Kulikov, Thao Nguyen, Karthik Padthe, Jack Lanchantin, and Youssef Emad. Naturalthoughts: Selecting and distilling reasoning traces for general reasoning tasks, 2025. URL https://arxiv.org/abs/2507.01921

  64. [78]

    Small models struggle to learn from strong reasoners, 2025

    Xiang Yue, Bill Yuchen Lin, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Radha Poovendran, Yuetai Li, and Bhaskar Ramasubramanian. Small models struggle to learn from strong reasoners, 2025. URL https://arxiv.org/abs/2502.12143

  65. [79]

    Can pruning improve reasoning? revisit- ing long-cot compression with capability in mind for bet- ter reasoning.arXiv preprint arXiv:2505.14582, 2025

    Yuxin Zhang et al. Can pruning improve reasoning? revisiting long-cot compression with capability in mind. arXiv preprint arXiv:2505.14582, 2025