Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

Kayhan Behdin; Qingquan Song; Rahul Mazumder; Ryan Lucas; Shao Tang; Zhipeng Wang

arxiv: 2509.12464 · v2 · submitted 2025-09-15 · 💻 cs.AI

Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

Ryan Lucas , Kayhan Behdin , Zhipeng Wang , Qingquan Song , Shao Tang , Rahul Mazumder This is my paper

Pith reviewed 2026-05-18 15:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords model pruningchain-of-thought reconstructionreasoning language modelsactivation reconstructionSparseGPTneural compressionefficient inference

0 comments

The pith

Reasoning models lose less performance when pruned by reconstructing both inputs and their chain-of-thought traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard pruning techniques cause greater performance loss in reasoning language models than in typical tasks, sometimes leading to slower inference due to more thinking tokens. This occurs because conventional methods focus on input reconstruction, while reasoning is primarily a decoding process. The paper proposes Reasoning-Aware Compression (RAC), which jointly reconstructs activations from both the input and the model's on-policy chain-of-thought traces. This simple addition integrates into workflows like SparseGPT and significantly improves pruning outcomes for models such as DeepSeek-R1.

Core claim

Reasoning language models can be accurately pruned by jointly reconstructing activations from the input and the model's on-policy chain-of-thought traces during the pruning process, which preserves reasoning capabilities better than standard input-only reconstruction methods.

What carries the argument

Reasoning-Aware Compression (RAC) that adds on-policy CoT trace reconstruction to activation matching in pruning algorithms.

If this is right

RAC integrates directly into existing pruners like SparseGPT to boost their effectiveness on reasoning models.
Pruned models avoid the pitfall of generating longer but lower-quality chain-of-thought traces.
Reasoning models become more deployable at scale with reduced size and maintained performance on decode-heavy tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Inference-time behaviors like extended CoT generation should inform compression objectives for generative AI systems.
The method may generalize to other multi-step reasoning or planning models where internal traces are key.
Future work could explore using traces from multiple sampling temperatures or diverse prompts to enrich the reconstruction signal.

Load-bearing premise

The on-policy chain-of-thought traces generated by the unpruned model faithfully represent the reasoning computations required during inference after pruning.

What would settle it

If a model pruned with RAC shows no improvement in reasoning accuracy or still produces excessively long and incorrect CoT compared to standard pruning on benchmark tasks, the benefit would be refuted.

Figures

Figures reproduced from arXiv: 2509.12464 by Kayhan Behdin, Qingquan Song, Rahul Mazumder, Ryan Lucas, Shao Tang, Zhipeng Wang.

**Figure 1.** Figure 1: Pruning hurts both accuracy and runtime on MATH-500. At sparsity levels of 30%, 50%, and 70%, we evaluate each pruned model on the MATH-500 benchmark (zero-shot, 32k max tokens). As sparsity increases, MATH-500 accuracy falls while total evaluation time grows sharply. Counter-intuitively, heavy pruning also slows down inference because the model produces much longer chains of thought, rambling more yet an… view at source ↗

**Figure 2.** Figure 2: Tokenwise reconstruction error ratio on M [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAC adds on-policy CoT reconstruction to SparseGPT-style pruning and reports better accuracy and token counts on reasoning models, but the post-pruning distribution shift remains an open risk.

read the letter

Hey, the core claim here is that standard pruning hurts reasoning models more than regular LLMs because it ignores the decode-heavy CoT phase, and their fix is to jointly reconstruct activations from both the input and the original model's own long CoT traces inside an existing SparseGPT workflow. They call it RAC and say it integrates as a drop-in that cuts the accuracy loss and sometimes even reduces the extra thinking tokens that pruning usually triggers on models like DeepSeek-R1. The code release is a practical plus for anyone who wants to test it directly. What they do well is flag that input-only reconstruction misses the mark for tasks where most compute happens during generation, and they give a concrete way to bring the model's reasoning behavior into the pruning objective. That observation about increased thinking tokens after pruning is useful on its own. The soft spot is the assumption that the unpruned model's CoT traces stay a faithful target once weights change. Pruning shifts the output distribution, so the pruned model can generate different reasoning steps at inference; if the mismatch is large, the joint loss is optimizing for the wrong intermediate activations. The abstract does not show numbers or ablations that directly test how much the CoT diverges post-pruning, so the strength of the recovery depends on whether their full experiments control for that. This is for engineers and researchers who need to shrink long-CoT models for deployment without losing too much reasoning quality. A reader working on efficient inference would get a usable method to try and some evidence that the tweak helps. It deserves peer review because the problem is real, the change is simple, and the potential payoff is practical even if the distribution-shift concern needs tighter checks in revision.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard pruning methods incur greater accuracy loss on reasoning models (e.g., DeepSeek-R1) than on typical LLMs and can even increase inference latency via longer but lower-quality chain-of-thought traces. It attributes this to the input-reconstruction focus of methods such as SparseGPT, notes that reasoning is decode-dominated, and proposes Reasoning-Aware Compression (RAC): a drop-in augmentation that jointly reconstructs activations from both the input and the unpruned model's on-policy CoT traces. The method is said to integrate seamlessly into existing pruning pipelines and to deliver significant performance recovery; reproducible code is provided.

Significance. If the empirical gains hold under rigorous controls, the work would be practically significant for scaling deployment of long-CoT reasoning models, where both memory footprint and token-generation cost are first-order concerns. The provision of open code is a clear strength that supports reproducibility.

major comments (3)

[Abstract and §3] Abstract and §3 (RAC method): the central claim that on-policy CoT traces from the unpruned model remain a faithful reconstruction target after pruning is load-bearing, yet the manuscript provides no direct measurement of distributional shift between the calibration CoT and the CoT produced by the pruned model. If the shift is large, the joint reconstruction objective may optimize for the wrong intermediate activations.
[Experimental section] Experimental section (results tables): the abstract asserts that RAC “boosts their performance significantly” and reduces thinking tokens, but the provided description contains no quantitative deltas, baseline comparisons, error bars, or dataset statistics. Without these numbers the magnitude and reliability of the improvement cannot be assessed.
[§4] §4 (evaluation protocol): the paper states that standard pruning increases thinking tokens while hurting accuracy; however, it is unclear whether the reported token counts and accuracy metrics are measured on the same prompts and decoding settings used for the RAC calibration traces, which is necessary to isolate the effect of the reconstruction target.

minor comments (2)

[§3] Notation: the term “on-policy” is used without explicit definition relative to the pruning calibration set; a short clarification would help readers distinguish it from standard calibration data.
[Figures] Figure clarity: if activation-reconstruction plots are included, ensure they show both input-only and joint (input+CoT) losses side-by-side with the same y-scale.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our results.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (RAC method): the central claim that on-policy CoT traces from the unpruned model remain a faithful reconstruction target after pruning is load-bearing, yet the manuscript provides no direct measurement of distributional shift between the calibration CoT and the CoT produced by the pruned model. If the shift is large, the joint reconstruction objective may optimize for the wrong intermediate activations.

Authors: We acknowledge that a direct measurement of distributional shift between the unpruned calibration CoT and the pruned model's generated CoT would provide additional justification for the reconstruction target. Our empirical results across multiple models and datasets demonstrate that RAC still yields substantial accuracy recovery and token reduction compared to input-only baselines, indicating that the on-policy traces remain a useful optimization signal despite any shift. In the revised manuscript we will add a new subsection in §3 with quantitative analysis of this shift, including activation cosine similarity and token-level distribution comparisons on held-out prompts. revision: yes
Referee: [Experimental section] Experimental section (results tables): the abstract asserts that RAC “boosts their performance significantly” and reduces thinking tokens, but the provided description contains no quantitative deltas, baseline comparisons, error bars, or dataset statistics. Without these numbers the magnitude and reliability of the improvement cannot be assessed.

Authors: The full experimental section contains tables reporting accuracy deltas (e.g., +4–12 points over SparseGPT on GSM8K and MATH), thinking-token reductions, baseline comparisons (SparseGPT, Wanda, Magnitude), and dataset details (sample counts, prompt lengths). To improve accessibility we will revise the abstract to include key quantitative highlights and add a summary paragraph with error bars and dataset statistics at the start of the experimental section. revision: yes
Referee: [§4] §4 (evaluation protocol): the paper states that standard pruning increases thinking tokens while hurting accuracy; however, it is unclear whether the reported token counts and accuracy metrics are measured on the same prompts and decoding settings used for the RAC calibration traces, which is necessary to isolate the effect of the reconstruction target.

Authors: All accuracy and token-count measurements for both standard pruning and RAC were performed on exactly the same evaluation prompts and with identical decoding settings (temperature, max tokens, etc.) as those used to generate the on-policy CoT calibration traces. This design isolates the contribution of the joint reconstruction objective. We will revise §4 to state this protocol explicitly, including a sentence confirming the shared prompt set and decoding configuration. revision: yes

Circularity Check

0 steps flagged

No significant circularity in RAC extension of SparseGPT

full rationale

The paper introduces Reasoning-Aware Compression as a drop-in modification to existing pruning methods like SparseGPT by adding joint reconstruction of activations from both inputs and the original model's on-policy CoT traces. This is presented as an empirical augmentation whose performance gains are demonstrated through integration into standard workflows, without any equations, procedures, or claims that reduce the reported improvements to a fitted parameter, self-referential definition, or load-bearing self-citation. The method relies on an external assumption about CoT trace validity (addressed in the skeptic note as a potential distribution-shift concern), but this does not constitute circularity per the enumerated patterns; the derivation chain remains self-contained against the benchmarks of prior pruning techniques.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that chain-of-thought traces provide an independent and useful signal for activation reconstruction during pruning; no free parameters or new entities are introduced in the abstract description.

axioms (1)

domain assumption On-policy chain-of-thought traces generated by the model are representative of the reasoning computations that must be preserved after pruning.
This premise is invoked to justify why joint reconstruction outperforms input-only reconstruction.

pith-pipeline@v0.9.0 · 5703 in / 1318 out tokens · 44145 ms · 2026-05-18T15:47:52.375674+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This 'Reasoning-Aware Compression' (RAC)
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

min cWℓ∈W ||(Wℓ − cWℓ)XRACℓ||²F

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 5 internal anchors

[1]

Gkd: Generalized knowledge distillation for auto-regressive sequence models

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024. URL https://arxiv.org/abs/2306.13649

work page arXiv 2024
[2]

Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning

Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://arxiv.org/abs/2410.07461

work page arXiv 2024
[4]

PPC - GPT : Federated task-specific compression of large language models via pruning and chain-of-thought distillation

Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, and Qiang Yang. PPC - GPT : Federated task-specific compression of large language models via pruning and chain-of-thought distillation. arXiv preprint arXiv:2502.15857, 2025. URL https://arxiv.org/abs/2502.15857

work page arXiv 2025
[5]

SparseGPT : Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT : Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning (ICML), pp.\ 10323--10337. PMLR, 2023 a

work page 2023
[6]

SparseGPT: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774, 2023

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023 b . URL https://arxiv.org/abs/2301.00774

work page arXiv 2023
[8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek - R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Open r1: A fully open reproduction of deepseek-r1, January 2025

HuggingFace. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

work page 2025
[11]

Quantization hurts reasoning? an empirical study on quantized reasoning models

Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou. Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823, 2025. URL https://arxiv.org/abs/2504.04823

work page arXiv 2025
[12]

Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024

Xiang Meng, Kayhan Behdin, Haoyue Wang, and Rahul Mazumder. Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024. URL https://arxiv.org/abs/2406.07831

work page arXiv 2024
[13]

Learning to reason with llms, September 2024

OpenAI . Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/. Research release

work page 2024
[14]

Codeforces: Benchmarking competition-level code generation of llms on codeforces

Jiaxi Yang Bowen Yu Bo Zheng Dayiheng Liu Shanghaoran Quan. Codeforces: Benchmarking competition-level code generation of llms on codeforces. 2025. Disclaimer: This is a non-traditional code benchmark

work page 2025
[15]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024. URL https://arxiv.org/abs/2306.11695

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp.\ 24824--24837, 2022

work page 2022
[17]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

When reasoning meets compression: Benchmarking compressed large reasoning models on complex reasoning tasks

Nan Zhang, Yusen Zhang, Prasenjit Mitra, and Rui Zhang. When reasoning meets compression: Benchmarking compressed large reasoning models on complex reasoning tasks. arXiv preprint arXiv:2504.02010, 2025 b . URL https://arxiv.org/abs/2504.02010

work page arXiv 2025
[20]

Lost in the Middle: How Language Models Use Long Contexts

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12: 0 1556--1577, 2024. doi:10.1162/tacl\_a\_00704

work page internal anchor Pith review doi:10.1162/tacl 2024
[21]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[22]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[23]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[24]

۱ v 6 4/z@)= 6| qu ѧj֭p ?Ahܸ1NסP 9r ىrș3gеk׼ (O 7 P(/XlY^ADDDDDD9& e+J?WW׼ Б#GgϞ (W 5 s5h [[[ -Z d9 e & :::

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page

[1] [1]

Gkd: Generalized knowledge distillation for auto-regressive sequence models

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024. URL https://arxiv.org/abs/2306.13649

work page arXiv 2024

[2] [2]

Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning

Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, and Shiwei Liu. Is C4 dataset optimal for pruning? an investigation of calibration data for LLM pruning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://arxiv.org/abs/2410.07461

work page arXiv 2024

[3] [4]

PPC - GPT : Federated task-specific compression of large language models via pruning and chain-of-thought distillation

Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, and Qiang Yang. PPC - GPT : Federated task-specific compression of large language models via pruning and chain-of-thought distillation. arXiv preprint arXiv:2502.15857, 2025. URL https://arxiv.org/abs/2502.15857

work page arXiv 2025

[4] [5]

SparseGPT : Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. SparseGPT : Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning (ICML), pp.\ 10323--10337. PMLR, 2023 a

work page 2023

[5] [6]

SparseGPT: Massive language models can be accurately pruned in one-shot.arXiv preprint arXiv:2301.00774, 2023

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023 b . URL https://arxiv.org/abs/2301.00774

work page arXiv 2023

[6] [8]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers, 2023. URL https://arxiv.org/abs/2210.17323

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [9]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek - R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [10]

Open r1: A fully open reproduction of deepseek-r1, January 2025

HuggingFace. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

work page 2025

[9] [11]

Quantization hurts reasoning? an empirical study on quantized reasoning models

Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou. Quantization hurts reasoning? an empirical study on quantized reasoning models. arXiv preprint arXiv:2504.04823, 2025. URL https://arxiv.org/abs/2504.04823

work page arXiv 2025

[10] [12]

Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024

Xiang Meng, Kayhan Behdin, Haoyue Wang, and Rahul Mazumder. Alps: Improved optimization for highly sparse one-shot pruning for large language models, 2024. URL https://arxiv.org/abs/2406.07831

work page arXiv 2024

[11] [13]

Learning to reason with llms, September 2024

OpenAI . Learning to reason with llms, September 2024. URL https://openai.com/index/learning-to-reason-with-llms/. Research release

work page 2024

[12] [14]

Codeforces: Benchmarking competition-level code generation of llms on codeforces

Jiaxi Yang Bowen Yu Bo Zheng Dayiheng Liu Shanghaoran Quan. Codeforces: Benchmarking competition-level code generation of llms on codeforces. 2025. Disclaimer: This is a non-traditional code benchmark

work page 2025

[13] [15]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models, 2024. URL https://arxiv.org/abs/2306.11695

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [16]

Le, and Denny Zhou

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pp.\ 24824--24837, 2022

work page 2022

[15] [17]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023. URL https://arxiv.org/abs/2305.10601

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [19]

When reasoning meets compression: Benchmarking compressed large reasoning models on complex reasoning tasks

Nan Zhang, Yusen Zhang, Prasenjit Mitra, and Rui Zhang. When reasoning meets compression: Benchmarking compressed large reasoning models on complex reasoning tasks. arXiv preprint arXiv:2504.02010, 2025 b . URL https://arxiv.org/abs/2504.02010

work page arXiv 2025

[17] [20]

Lost in the Middle: How Language Models Use Long Contexts

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, and Weiping Wang. A survey on model compression for large language models. Transactions of the Association for Computational Linguistics, 12: 0 1556--1577, 2024. doi:10.1162/tacl\_a\_00704

work page internal anchor Pith review doi:10.1162/tacl 2024

[18] [21]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[19] [22]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page

[20] [23]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page

[21] [24]

۱ v 6 4/z@)= 6| qu ѧj֭p ?Ahܸ1NסP 9r ىrș3gеk׼ (O 7 P(/XlY^ADDDDDD9& e+J?WW׼ Б#GgϞ (W 5 s5h [[[ -Z d9 e & :::

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page