Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
Pith reviewed 2026-05-18 15:47 UTC · model grok-4.3
The pith
Reasoning models lose less performance when pruned by reconstructing both inputs and their chain-of-thought traces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning language models can be accurately pruned by jointly reconstructing activations from the input and the model's on-policy chain-of-thought traces during the pruning process, which preserves reasoning capabilities better than standard input-only reconstruction methods.
What carries the argument
Reasoning-Aware Compression (RAC) that adds on-policy CoT trace reconstruction to activation matching in pruning algorithms.
If this is right
- RAC integrates directly into existing pruners like SparseGPT to boost their effectiveness on reasoning models.
- Pruned models avoid the pitfall of generating longer but lower-quality chain-of-thought traces.
- Reasoning models become more deployable at scale with reduced size and maintained performance on decode-heavy tasks.
Where Pith is reading between the lines
- Inference-time behaviors like extended CoT generation should inform compression objectives for generative AI systems.
- The method may generalize to other multi-step reasoning or planning models where internal traces are key.
- Future work could explore using traces from multiple sampling temperatures or diverse prompts to enrich the reconstruction signal.
Load-bearing premise
The on-policy chain-of-thought traces generated by the unpruned model faithfully represent the reasoning computations required during inference after pruning.
What would settle it
If a model pruned with RAC shows no improvement in reasoning accuracy or still produces excessively long and incorrect CoT compared to standard pruning on benchmark tasks, the benefit would be refuted.
Figures
read the original abstract
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard pruning methods incur greater accuracy loss on reasoning models (e.g., DeepSeek-R1) than on typical LLMs and can even increase inference latency via longer but lower-quality chain-of-thought traces. It attributes this to the input-reconstruction focus of methods such as SparseGPT, notes that reasoning is decode-dominated, and proposes Reasoning-Aware Compression (RAC): a drop-in augmentation that jointly reconstructs activations from both the input and the unpruned model's on-policy CoT traces. The method is said to integrate seamlessly into existing pruning pipelines and to deliver significant performance recovery; reproducible code is provided.
Significance. If the empirical gains hold under rigorous controls, the work would be practically significant for scaling deployment of long-CoT reasoning models, where both memory footprint and token-generation cost are first-order concerns. The provision of open code is a clear strength that supports reproducibility.
major comments (3)
- [Abstract and §3] Abstract and §3 (RAC method): the central claim that on-policy CoT traces from the unpruned model remain a faithful reconstruction target after pruning is load-bearing, yet the manuscript provides no direct measurement of distributional shift between the calibration CoT and the CoT produced by the pruned model. If the shift is large, the joint reconstruction objective may optimize for the wrong intermediate activations.
- [Experimental section] Experimental section (results tables): the abstract asserts that RAC “boosts their performance significantly” and reduces thinking tokens, but the provided description contains no quantitative deltas, baseline comparisons, error bars, or dataset statistics. Without these numbers the magnitude and reliability of the improvement cannot be assessed.
- [§4] §4 (evaluation protocol): the paper states that standard pruning increases thinking tokens while hurting accuracy; however, it is unclear whether the reported token counts and accuracy metrics are measured on the same prompts and decoding settings used for the RAC calibration traces, which is necessary to isolate the effect of the reconstruction target.
minor comments (2)
- [§3] Notation: the term “on-policy” is used without explicit definition relative to the pruning calibration set; a short clarification would help readers distinguish it from standard calibration data.
- [Figures] Figure clarity: if activation-reconstruction plots are included, ensure they show both input-only and joint (input+CoT) losses side-by-side with the same y-scale.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our results.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (RAC method): the central claim that on-policy CoT traces from the unpruned model remain a faithful reconstruction target after pruning is load-bearing, yet the manuscript provides no direct measurement of distributional shift between the calibration CoT and the CoT produced by the pruned model. If the shift is large, the joint reconstruction objective may optimize for the wrong intermediate activations.
Authors: We acknowledge that a direct measurement of distributional shift between the unpruned calibration CoT and the pruned model's generated CoT would provide additional justification for the reconstruction target. Our empirical results across multiple models and datasets demonstrate that RAC still yields substantial accuracy recovery and token reduction compared to input-only baselines, indicating that the on-policy traces remain a useful optimization signal despite any shift. In the revised manuscript we will add a new subsection in §3 with quantitative analysis of this shift, including activation cosine similarity and token-level distribution comparisons on held-out prompts. revision: yes
-
Referee: [Experimental section] Experimental section (results tables): the abstract asserts that RAC “boosts their performance significantly” and reduces thinking tokens, but the provided description contains no quantitative deltas, baseline comparisons, error bars, or dataset statistics. Without these numbers the magnitude and reliability of the improvement cannot be assessed.
Authors: The full experimental section contains tables reporting accuracy deltas (e.g., +4–12 points over SparseGPT on GSM8K and MATH), thinking-token reductions, baseline comparisons (SparseGPT, Wanda, Magnitude), and dataset details (sample counts, prompt lengths). To improve accessibility we will revise the abstract to include key quantitative highlights and add a summary paragraph with error bars and dataset statistics at the start of the experimental section. revision: yes
-
Referee: [§4] §4 (evaluation protocol): the paper states that standard pruning increases thinking tokens while hurting accuracy; however, it is unclear whether the reported token counts and accuracy metrics are measured on the same prompts and decoding settings used for the RAC calibration traces, which is necessary to isolate the effect of the reconstruction target.
Authors: All accuracy and token-count measurements for both standard pruning and RAC were performed on exactly the same evaluation prompts and with identical decoding settings (temperature, max tokens, etc.) as those used to generate the on-policy CoT calibration traces. This design isolates the contribution of the joint reconstruction objective. We will revise §4 to state this protocol explicitly, including a sentence confirming the shared prompt set and decoding configuration. revision: yes
Circularity Check
No significant circularity in RAC extension of SparseGPT
full rationale
The paper introduces Reasoning-Aware Compression as a drop-in modification to existing pruning methods like SparseGPT by adding joint reconstruction of activations from both inputs and the original model's on-policy CoT traces. This is presented as an empirical augmentation whose performance gains are demonstrated through integration into standard workflows, without any equations, procedures, or claims that reduce the reported improvements to a fitted parameter, self-referential definition, or load-bearing self-citation. The method relies on an external assumption about CoT trace validity (addressed in the skeptic note as a potential distribution-shift concern), but this does not constitute circularity per the enumerated patterns; the derivation chain remains self-contained against the benchmarks of prior pruning techniques.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption On-policy chain-of-thought traces generated by the model are representative of the reasoning computations that must be preserved after pruning.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This 'Reasoning-Aware Compression' (RAC)
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.leanJ_uniquely_calibrated_via_higher_derivative unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
min cWℓ∈W ||(Wℓ − cWℓ)XRACℓ||²F
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Gkd: Generalized knowledge distillation for auto-regressive sequence models
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024. URL https://arxiv.org/abs/2306.13649
-
[2]
Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning
Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://arxiv.org/abs/2410.07461
-
[4]
Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, and Qiang Yang. PPC - GPT : Federated task-specific compression of large language models via pruning and chain-of-thought distillation. arXiv preprint arXiv:2502.15857, 2025. URL https://arxiv.org/abs/2502.15857
-
[5]
SparseGPT : Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. SparseGPT : Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning (ICML), pp.\ 10323--10337. PMLR, 2023 a
work page 2023
-
[6]
Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023 b . URL https://arxiv.org/abs/2301.00774
-
[8]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/2210.17323
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek - R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Open r1: A fully open reproduction of deepseek-r1, January 2025
HuggingFace. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1
work page 2025
-
[11]
Quantization hurts reasoning? an empirical study on quantized reasoning models
Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou. Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823, 2025. URL https://arxiv.org/abs/2504.04823
-
[12]
Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024
Xiang Meng, Kayhan Behdin, Haoyue Wang, and Rahul Mazumder. Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024. URL https://arxiv.org/abs/2406.07831
-
[13]
Learning to reason with llms, September 2024
OpenAI . Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/. Research release
work page 2024
-
[14]
Codeforces: Benchmarking competition-level code generation of llms on codeforces
Jiaxi Yang Bowen Yu Bo Zheng Dayiheng Liu Shanghaoran Quan. Codeforces: Benchmarking competition-level code generation of llms on codeforces. 2025. Disclaimer: This is a non-traditional code benchmark
work page 2025
-
[15]
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024. URL https://arxiv.org/abs/2306.11695
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp.\ 24824--24837, 2022
work page 2022
-
[17]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Nan Zhang, Yusen Zhang, Prasenjit Mitra, and Rui Zhang. When reasoning meets compression: Benchmarking compressed large reasoning models on complex reasoning tasks. arXiv preprint arXiv:2504.02010, 2025 b . URL https://arxiv.org/abs/2504.02010
-
[20]
Lost in the Middle: How Language Models Use Long Contexts
Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12: 0 1556--1577, 2024. doi:10.1162/tacl\_a\_00704
work page internal anchor Pith review doi:10.1162/tacl 2024
-
[21]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[22]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[23]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[24]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.