STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
Pith reviewed 2026-05-14 19:21 UTC · model grok-4.3
The pith
Structured on-policy pruning cuts long reasoning traces by 19.4-42.4% while preserving accuracy in low-data fine-tuning
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STOP constructs self-distilled traces from the model, maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction, then applies the Earliest Correct Node rule to retain the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity.
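The prefix-selection step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Node` fields and the taxonomy label names ("explore", "verify", "conclude") are assumptions standing in for whatever schema the paper's annotator produces.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    text: str                          # the segment of the reasoning trace
    label: str                         # hypothetical taxonomy label, e.g. "explore", "verify", "conclude"
    final_answer: Optional[str] = None # answer string if this node states one

def earliest_correct_node_prefix(nodes: List[Node], gold: str) -> List[Node]:
    """Keep the shortest prefix ending at the earliest node that both
    functions as an answering conclusion and matches the gold answer."""
    for i, node in enumerate(nodes):
        if node.label == "conclude" and node.final_answer == gold:
            return nodes[: i + 1]   # prune everything after the ECN
    return nodes                    # no correct conclusion: keep the full trace
```

On this reading, any verification or backtracking nodes after the earliest correct conclusion are discarded, which is exactly the "redundant post-solution reasoning" the abstract describes.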
What carries the argument
The Earliest Correct Node (ECN) rule applied to a reasoning tree built from self-distilled traces via node segmentation and taxonomy annotation
If this is right
- Token reduction of 19.4-42.4% on GSM8K, Math 500, and AIME 2024 in low-data fine-tuning of DeepSeek-R1-Distill models
- Smaller distributional shift in model outputs than teacher-guided pruning methods
- Higher structural efficiency in the reasoning traces that remain after pruning
- Reallocation of reasoning effort away from redundant verification toward productive exploration
Where Pith is reading between the lines
- The node-based structure could be adapted to prune reasoning in non-math domains if similar segmentation rules are defined for those tasks
- Shorter generated outputs would reduce latency in real-time applications that rely on the same fine-tuned models
- Iterating the on-policy self-distillation step multiple times could compound the efficiency gains without additional labeled data
Load-bearing premise
Mapping raw reasoning traces into a structured interface with nodes and taxonomy labels preserves original semantic meaning and does not introduce new errors that affect final-answer correctness.
What would settle it
Applying STOP to a held-out math benchmark or a different model family and observing an accuracy drop of more than 5 percentage points while token counts fall would refute the accuracy-preservation claim; the absence of such a drop under those conditions would support it.
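That falsification criterion reduces to simple arithmetic. The helper below is a sketch of the check, with the 5-percentage-point threshold taken from the sentence above; the function name and argument layout are illustrative, not from the paper.

```python
def preservation_check(acc_base: float, acc_pruned: float,
                       tok_base: float, tok_pruned: float,
                       drop_pp: float = 5.0):
    """Return (refuted, token_reduction_pct, accuracy_drop_pp).

    `refuted` is True when tokens shrink yet accuracy falls by more than
    `drop_pp` percentage points, i.e. the preservation claim fails."""
    token_reduction = 100.0 * (tok_base - tok_pruned) / tok_base
    acc_drop = 100.0 * (acc_base - acc_pruned)
    return (token_reduction > 0 and acc_drop > drop_pp), token_reduction, acc_drop
```

Accuracies are fractions in [0, 1] and token counts are per-problem averages; both would come from the held-out evaluation run.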
Original abstract
Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STOP, an on-policy pruning method for long chain-of-thought reasoning traces. It generates self-distilled traces, segments them into nodes with taxonomy annotations, builds a reasoning tree, and applies the Earliest Correct Node (ECN) rule to retain only the shortest prefix ending at the first node that concludes with the correct final answer. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B using GSM8K, Math 500, and AIME 2024 in low-data fine-tuning regimes report token reductions of 19.4-42.4% with largely preserved accuracy, smaller distributional shift than teacher-guided baselines, and improved structural efficiency.
Significance. If the efficiency gains and semantic-preservation claims hold under statistical scrutiny, the work would be significant for low-data adaptation of reasoning models: it offers a self-supervised alternative to distillation or test-time control, quantifies reductions in redundant verification/backtracking, and demonstrates that structured on-policy pruning can reallocate reasoning effort productively without large teacher models.
major comments (2)
- [Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.
- [Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.
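The statistical reporting the first major comment asks for is cheap to produce. A minimal sketch, assuming the per-seed metric values are already collected (the numbers below are hypothetical, not from the paper):

```python
from statistics import mean, stdev

def seed_summary(per_seed_values):
    """Mean and sample standard deviation across seed runs (needs n >= 2):
    the minimum the referee asks for alongside each headline number."""
    return mean(per_seed_values), stdev(per_seed_values)

# e.g. token-reduction percentages from three hypothetical seeds
reductions = [19.8, 21.1, 20.3]
m, s = seed_summary(reductions)  # report as "20.4 ± 0.7"
```

With three or more seeds per configuration, the same values feed directly into a paired significance test against the baseline.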
minor comments (2)
- [Method] Notation for the reasoning-tree construction and ECN selection rule could be clarified with a small worked example showing a full trace, its segmented nodes, taxonomy labels, and the retained prefix.
- [Related Work] The paper should cite and briefly contrast with recent on-policy or self-distillation pruning baselines (e.g., those using reward models or length penalties) to better situate the contribution.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript accordingly to improve statistical reporting and methodological validation.
Point-by-point responses
Referee: [Experiments] Experiments section (results on GSM8K/Math 500/AIME 2024): the headline token-reduction range (19.4-42.4%) and accuracy-preservation claim are reported without standard deviations across runs, p-values, or explicit train/validation/test splits, which is load-bearing for the central efficiency claim in low-data regimes where variance is typically high.
Authors: We agree that stronger statistical reporting would better support the claims in low-data settings. The manuscript uses fixed, publicly documented train/validation/test splits for each benchmark (detailed in Section 4.1 and Appendix B), but reports single-run results due to compute constraints. We will revise to explicitly restate the splits, add p-values where appropriate for accuracy comparisons, and include standard deviations from at least three random seeds for the main token-reduction and accuracy metrics in the revised experiments section. revision: yes
Referee: [Method] Method section on node segmentation + taxonomy annotation + ECN: the assertion that ECN preserves semantic continuity (and therefore final-answer correctness) is not supported by any quantitative check such as inter-annotator agreement on node boundaries, ablation of the taxonomy step, or error analysis of cases where a labeled-correct node leads to an incorrect final answer after pruning.
Authors: This observation is correct and highlights a gap in validation. While ECN is defined to select the earliest node that both concludes the solution and matches the ground-truth answer (thereby preserving the prefix up to that point), we did not quantify boundary reliability or run ablations. In the revision we will add: (1) an error analysis of all cases where pruning at an ECN-labeled node yields an incorrect final answer, and (2) an ablation that removes the taxonomy annotation step to measure its impact on segmentation quality and downstream token savings. revision: yes
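The promised error analysis amounts to tallying the traces where pruning at the ECN-labeled node breaks the final answer. A sketch under the assumption that each pruned trace has been re-evaluated against its gold answer (the function and tuple layout are illustrative):

```python
def ecn_error_analysis(cases):
    """cases: iterable of (example_id, answer_after_pruning, gold_answer).

    Returns the ids where the pruned trace no longer yields the gold answer,
    plus the error rate over all pruned traces -- the quantity the authors
    commit to reporting."""
    cases = list(cases)
    failures = [eid for eid, pruned, gold in cases if pruned != gold]
    rate = len(failures) / len(cases) if cases else 0.0
    return failures, rate
```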
Circularity Check
No significant circularity in algorithmic procedure or empirical claims
full rationale
The paper presents STOP as an on-policy algorithmic pipeline: self-distilled trace construction followed by node segmentation, taxonomy annotation, reasoning-tree building, and ECN selection of the shortest correct prefix. Reported token reductions (19.4-42.4%) and accuracy preservation are measured outcomes from fine-tuning experiments on GSM8K, Math 500, and AIME 2024 using specific DeepSeek-R1-Distill models. No equations, fitted parameters, or self-citations are shown to reduce these gains to quantities defined by construction within the paper. The core claims rest on the independent execution of the described procedure and external evaluation, satisfying the criteria for a self-contained, non-circular derivation.
Axiom & Free-Parameter Ledger
invented entities (1)
- Earliest Correct Node (ECN): no independent evidence