pith. machine review for the scientific record.

arxiv: 2603.05433 · v6 · submitted 2026-03-05 · 💻 cs.LG

Recognition: no theorem link

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation

Pith reviewed 2026-05-15 15:39 UTC · model grok-4.3

classification 💻 cs.LG
keywords compressed reasoning · self-policy distillation · reverse KL · token compression · concise reasoning · LLM efficiency · MATH-500 · AIME 2024

The pith

CRISP teaches models to compress reasoning by self-distilling concise responses, cutting tokens by over half while raising accuracy on math benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CRISP, a method that teaches language models to reason more concisely through iterative self-policy distillation. The same model is conditioned on a "be concise" instruction to produce teacher logits, and the student then minimizes per-token reverse KL divergence against them on its own rollouts. No ground-truth labels, token budgets, or difficulty estimators are needed. The compression adapts automatically: easy problems shorten more aggressively than hard ones. Experiments report 57-59% token reduction on MATH-500 with accuracy gains of 9-16 points on Qwen3-8B and Qwen3-14B, and a 10-point gain on AIME 2024 for the 14B variant.

Core claim

CRISP establishes that iterative self-policy distillation, where the model is conditioned on a conciseness instruction to obtain teacher logits and then minimizes reverse KL on its own rollouts, produces reasoning policies that are simultaneously shorter and more accurate. On Qwen3-8B and Qwen3-14B this yields 57-59% token reduction on MATH-500 with absolute accuracy improvements of 9-16 points, and the 14B model gains 10 points on AIME 2024 at 41% compression. The same loop transfers to other model families (DeepSeek-R1-Distill-Llama-8B gains up to 5 points with 17-32% compression) and to multi-step agentic planning, where it cuts tokens by 42-51% while preserving planning quality.

What carries the argument

Iterative self-policy distillation that conditions the model on a "be concise" instruction to generate teacher logits and then minimizes per-token reverse KL divergence on the student's own rollouts.
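
A minimal sketch of the quantity being minimized, under stated assumptions: the student and teacher next-token distributions are given here as small hand-written probability lists over a toy three-token vocabulary, whereas in the paper both come from the same model under different prompts. The function names and all numbers are illustrative and do not come from the paper or its code release.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs) if q > 0.0)

def per_token_reverse_kl(student_steps, teacher_steps):
    """Mean reverse KL over the positions of one student-sampled rollout."""
    kls = [reverse_kl(q, p) for q, p in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)

# Toy rollout with two positions (made-up numbers).
# student: model given the problem only.
# teacher: same model given the conciseness instruction plus the problem,
#          scored on the prefix the student actually sampled.
student = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
teacher = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1]]

loss = per_token_reverse_kl(student, teacher)
```

The first position contributes nothing because the two distributions agree; the second contributes because the student keeps mass on a token the teacher rates as unlikely. In an actual training setup one would presumably backpropagate only through the student side, which is an assumption of this sketch rather than something this page states.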

If this is right

  • Token usage on MATH-500 falls 57-59% while accuracy rises 9-16 points on Qwen3-8B and 14B models.
  • The 14B model gains 10 accuracy points on AIME 2024 at 41% compression.
  • Compression occurs more on easy problems and less on hard problems without any explicit difficulty signal.
  • The method transfers to other model families and yields 42-51% token savings on multi-step planning tasks while keeping planning quality intact.
  • Qualitative conciseness instructions outperform explicit token-target instructions in both compression and accuracy.
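
The bullets above mix two kinds of numbers: relative token reduction (a percentage of the baseline length) and absolute accuracy gain (percentage points). A small sketch of the arithmetic may help keep them apart. The accuracy values are the ones quoted in the Figure 1 caption; the token counts are hypothetical, chosen only so that the reduction comes out at 57%.

```python
def relative_reduction(before, after):
    """Fractional drop relative to the baseline, e.g. in mean tokens per response."""
    return (before - after) / before

def point_gain(acc_before, acc_after):
    """Absolute accuracy change in percentage points."""
    return acc_after - acc_before

# Accuracy quoted in the Figure 1 caption (Qwen3-14B, MATH-500).
gain_points = point_gain(70.0, 86.1)          # about 16.1 points
gain_relative = gain_points / 70.0            # about 23% relative to the baseline

# Hypothetical token counts, not taken from the paper.
token_cut = relative_reduction(8000, 3440)    # 0.57, i.e. a 57% reduction
```

Reading "9-16 points" as a relative improvement would understate the claim: a 16.1-point gain from a 70.0% baseline is roughly a 23% relative improvement.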

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Much of the length in standard chain-of-thought traces appears to be removable noise that self-supervision can filter without external data.
  • If the compression scales to larger models and longer horizons, inference latency and cost for deployed reasoning systems could drop substantially.
  • Periodic teacher refreshes create a stable training regime that might generalize to distilling other behavioral traits beyond length.
  • The same self-distillation loop could be tested on non-reasoning tasks such as code generation or dialogue to measure whether similar noise removal occurs.
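
The periodic teacher refresh mentioned above can be pictured as a frozen teacher copy that is re-synced to the student every M steps. The sketch below only tracks which student snapshot serves as the teacher at each step. It assumes the refresh is a plain weight copy at fixed intervals, which this page does not confirm, and the function name is hypothetical.

```python
def teacher_versions(total_steps, interval):
    """For each training step, the step index of the student snapshot used as teacher."""
    versions = []
    snapshot = 0                      # teacher starts as the initial model
    for step in range(1, total_steps + 1):
        versions.append(snapshot)     # teacher in effect for this step's loss
        if step % interval == 0:
            snapshot = step           # assumed refresh: copy current student weights
    return versions

# Small example: refresh every 2 steps over 6 steps.
schedule = teacher_versions(6, 2)     # [0, 0, 2, 2, 4, 4]

# With M=50 over 200 steps (values mentioned in the figure captions),
# only four distinct teacher snapshots are ever used.
distinct_teachers = len(set(teacher_versions(200, 50)))
```

With an interval of 1 the teacher trails the student by a single step, which corresponds to the M=1 setting that the Figure 5 caption describes as unstable.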

Load-bearing premise

The approach assumes that prompting the model with a generic "be concise" instruction produces teacher logits that remain both correct and sufficiently informative, so that minimizing reverse KL on student rollouts does not silently degrade reasoning quality on hard problems.

What would settle it

Running the full CRISP loop on a new hard reasoning benchmark such as AIME and observing accuracy drop below the uncompressed baseline while token count decreases would falsify the claim that the distillation preserves or improves quality.

Figures

Figures reproduced from arXiv: 2603.05433 by Hejian Sang, Jiachen Sun, Ran He, Yuanda Xu, Zhengze Zhou, Zhipeng Wang.

Figure 1: The paradox of reasoning compression: less thinking, better answers. Results for Qwen3-14B across three benchmarks of increasing difficulty (30K response token budget). CRISP compresses reasoning traces by 35–57% while largely preserving or improving accuracy, most dramatically on MATH-500, where accuracy jumps from 70.0% to 86.1%.
Paper footnotes: ∗Equal contribution. †Correspondence to hejian@alumni.iastate.edu
Figure 2: Student mean accuracy on training data increases during self-distillation. Qwen3-8B improves from ∼52% to ∼66% and Qwen3-14B from ∼46% to ∼72%, despite no correctness reward. The concise teacher’s implicit reward reshapes the student’s output distribution, concentrating probability mass on direct, correct reasoning paths. This follows from mode-seeking reverse KL (§3.2): the student is penalized for placin…
Figure 3: Prompt example for student and teacher policies. Both policies share the same model parameters but differ in conditioning context. The teacher receives only a conciseness instruction c prepended to the problem; no ground-truth answers or reference solutions are provided. This is the key distinction from prior self-distillation work (Shenfeld et al., 2026), where the teacher receives the ground-truth soluti…
Figure 4: Soft budget teacher prompt. Unlike the qualitative conciseness instruction in …
Figure 5: Teacher update interval M controls the stability–compression trade-off. Accuracy (left) and output entropy (right) over 100 training steps for Qwen3-14B on MATH-500 with varying M. M=1 (updating every step) causes entropy explosion and accuracy collapse to ∼2% by step 100, consistent with the instability observed by Shenfeld et al. (2026). M ∈ {40, 50, 60} produce stable trajectories reaching ∼86–87% accur…
Figure 6: Average response token count over 200 training steps for Qwen3-8B and Qwen3-14B using the qualitative concise instruction with periodic teacher update (M=50). Token count decreases rapidly in the first ∼80 steps before plateauing around 3000–3500 tokens. Further compression between steps 100 and 200 is limited, indicating that most compression is learned early and additional training yields diminishing re…
Figure 7: Validation accuracy (mean@8) over training steps for Qwen3-8B and Qwen3-14B using the qualitative concise instruction with periodic teacher update (M=50), evaluated on MATH-500, AIME 2024, and AIME 2025. MATH-500 accuracy improves steadily for both models, rising from ∼78% to ∼87% (8B) and ∼70% to ∼87% (14B). AIME 2024 and AIME 2025 results exhibit substantially larger variance due to their small sample si…
Figure 8: Self-distillation preserves model entropy throughout training. Average per-token entropy of the student model over training steps for Qwen3-8B (left) and Qwen3-14B (right) using the concise instruction. Unlike RL with length penalties, which drives entropy toward collapse (Liu et al., 2025; Cui et al., 2025), CRISP maintains stable entropy: the model learns to be concise without losing its exploratory capa…
Figure 9: Validation accuracy of reverse KL (solid blue) and forward KL (dashed coral) …
Figure 10: Mean response length on each validation set. Forward KL compresses responses …
Figure 11: Mean response length on training set travel planning and validation set shopping …
Figure 12: Full model outputs: base Qwen3-8B vs. CRISP on three MATH-500 problems of increasing difficulty. Each output consists of hidden reasoning (between <think> and </think>) followed by the visible answer (below the gray rule). Blue text highlights redundancy: within reasoning (self-doubt, re-derivation, verification) and in the visible answer (the base model repeats the full derivation as a formatted step-by…
Figure 13: Full model outputs (continued from …
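
The Figure 2 caption attributes the accuracy gain to the mode-seeking behavior of reverse KL, and Figures 9 and 10 compare it with forward KL. The toy example below illustrates that general property on a discrete grid; it is not taken from the paper. A bimodal "teacher" is approximated by a family of unimodal "students", and the best student is picked under each divergence. All names, the grid, and the bump width are assumptions made for illustration.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def bump(center, xs, width=0.7):
    """Unnormalized Gaussian-shaped weights on a discrete grid."""
    return [math.exp(-((x - center) ** 2) / (2 * width ** 2)) for x in xs]

def kl(a, b):
    """KL(a || b) for two discrete distributions on the same support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

xs = list(range(11))
# Bimodal teacher: equal mass near 2 and near 8.
teacher = normalize([l + r for l, r in zip(bump(2, xs), bump(8, xs))])
# Unimodal students, one per candidate center.
students = {c: normalize(bump(c, xs)) for c in xs}

# Reverse KL, KL(student || teacher): the student commits to one teacher mode.
reverse_best = min(xs, key=lambda c: kl(students[c], teacher))
# Forward KL, KL(teacher || student): the student spreads to cover both modes.
forward_best = min(xs, key=lambda c: kl(teacher, students[c]))
```

Under reverse KL the best student sits on one of the two teacher modes (center 2 or 8), while under forward KL it sits at the midpoint (center 5), where the teacher itself has almost no mass. This is the sense in which reverse KL penalizes the student for placing probability where the teacher does not.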
Original abstract

Reasoning models think out loud, but much of what they say is noise. We introduce CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: CRISP automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. Ablations show that qualitative conciseness instructions outperform explicit token targets, and periodic teacher refreshes yield a broad stable regime. The method generalizes across model families (DeepSeek-R1-Distill-Llama-8B improves accuracy by up to 5 points with 17-32% compression) and transfers beyond math to multi-step agentic planning (DeepPlanning), reducing token usage by 42-51% while preserving planning quality. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CRISP, a self-distillation technique for compressing LLM reasoning traces: the same model is prompted with a generic 'be concise' instruction to produce teacher logits, and the per-token reverse KL between the student's and the teacher's next-token distributions is then minimized on the student's own rollouts; this process is iterated without ground-truth labels, token budgets, or external supervision. The central empirical claim is that the procedure yields 57-59% token reduction on MATH-500 with accuracy gains of 9-16 points (Qwen3-8B/14B) and a 10-point gain on AIME 2024 at 41% compression (Qwen3-14B), and that it generalizes to other model families and to multi-step planning tasks.

Significance. If the reported gains are robust, CRISP supplies a strikingly simple, parameter-free route to more efficient reasoning models that automatically allocates deliberation according to problem difficulty. The absence of fitted hyperparameters, the public code release, and the cross-family transfer results would constitute a useful contribution to the literature on test-time compute and self-improvement.

major comments (3)
  1. [§3 and §4] §3 (Method) and §4 (Experiments): the central claim that reverse-KL distillation from the 'be concise' teacher improves reasoning quality presupposes that the teacher logits remain at least as accurate and step-complete as the base policy on hard items. No table or figure reports the accuracy of the teacher model (under the identical 'be concise' prompt) on MATH-500 or AIME 2024 before distillation begins; without this baseline the observed 9-16 point gains could arise from iteration effects, prompt sensitivity, or evaluation variance rather than the claimed mechanism.
  2. [Table 1 and Figure 2] Table 1 and Figure 2: the reported accuracy improvements are given as point estimates without error bars, number of evaluation seeds, or statistical significance tests. Given that the method relies on stochastic rollouts and iterative self-distillation, the absence of variance estimates makes it impossible to assess whether the 9-16 point gains on MATH-500 are reliable or within the noise of the base model.
  3. [§4.3] §4.3 (Ablations): the claim that qualitative 'be concise' instructions outperform explicit token targets is load-bearing for the method's simplicity argument, yet the ablation only compares a single generic instruction against one explicit budget; no sweep over instruction phrasing or verification that the chosen instruction preserves teacher accuracy on the hardest problems is provided.
minor comments (2)
  1. [Abstract and §3] The abstract states 'periodic teacher refreshes yield a broad stable regime' but the main text does not specify the refresh frequency or the criterion used to decide when to refresh; a short clarifying sentence would help reproducibility.
  2. [§2] Notation for the reverse-KL objective is introduced without an explicit equation number; adding Eq. (X) would make the loss definition easier to reference.
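
Major comment 2 asks whether the reported gains could sit within evaluation noise. A rough way to size that noise is the binomial standard error of an accuracy measured on n problems. The sketch below assumes one independent pass/fail outcome per problem and ignores run-to-run training variance, so it is a lower bound on the uncertainty rather than a substitute for the seeds the referee requests. The 0.70 accuracy is the MATH-500 baseline from the Figure 1 caption; reusing 0.70 for AIME is purely illustrative.

```python
import math

def accuracy_standard_error(accuracy, n_problems):
    """Binomial standard error of a pass rate measured on n independent problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)

# MATH-500 contains 500 problems.
se_math500 = accuracy_standard_error(0.70, 500)   # roughly 0.02, i.e. 2 points

# AIME 2024 contains 30 problems; 0.70 is a placeholder accuracy here.
se_aime = accuracy_standard_error(0.70, 30)       # roughly 0.08, i.e. 8 points
```

On this crude estimate a 9-16 point change on MATH-500 is several standard errors wide, while a 10-point change on a 30-problem set is close to a single standard error, which is consistent with the larger AIME variance noted in the Figure 7 caption.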

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 and §4] §3 (Method) and §4 (Experiments): the central claim that reverse-KL distillation from the 'be concise' teacher improves reasoning quality presupposes that the teacher logits remain at least as accurate and step-complete as the base policy on hard items. No table or figure reports the accuracy of the teacher model (under the identical 'be concise' prompt) on MATH-500 or AIME 2024 before distillation begins; without this baseline the observed 9-16 point gains could arise from iteration effects, prompt sensitivity, or evaluation variance rather than the claimed mechanism.

    Authors: We agree that an explicit baseline for the 'be concise' teacher is necessary to isolate the contribution of the distillation process. In the revised manuscript we will add a new table (or additional columns to Table 1) reporting accuracy and token usage for the base models under the identical 'be concise' prompt before any distillation iterations begin. This will demonstrate that the teacher starts from accuracy comparable to the unprompted base policy and that the observed gains arise from the iterative self-distillation rather than prompt effects alone. revision: yes

  2. Referee: [Table 1 and Figure 2] Table 1 and Figure 2: the reported accuracy improvements are given as point estimates without error bars, number of evaluation seeds, or statistical significance tests. Given that the method relies on stochastic rollouts and iterative self-distillation, the absence of variance estimates makes it impossible to assess whether the 9-16 point gains on MATH-500 are reliable or within the noise of the base model.

    Authors: We acknowledge that variance estimates would strengthen confidence in the results. Our main experiments used single runs owing to the high computational cost of iterative distillation on 8B–14B models. The large effect sizes (9–16 points) and their replication across two model scales, two benchmarks, and additional model families make random variation unlikely. In the revision we will add an explicit discussion of this limitation and, where compute permits, report results from three evaluation seeds with standard deviations for the primary MATH-500 experiments. revision: partial

  3. Referee: [§4.3] §4.3 (Ablations): the claim that qualitative 'be concise' instructions outperform explicit token targets is load-bearing for the method's simplicity argument, yet the ablation only compares a single generic instruction against one explicit budget; no sweep over instruction phrasing or verification that the chosen instruction preserves teacher accuracy on the hardest problems is provided.

    Authors: We appreciate the call to expand the ablation. The existing comparison was chosen to emphasize the method's lack of hyperparameters, but we agree it is insufficient. In the revised §4.3 we will include a sweep over several instruction phrasings and will separately report the teacher's accuracy on the hardest subset of MATH-500 (level-5 problems) to confirm that the selected instruction preserves or improves performance on difficult items. revision: yes

Circularity Check

0 steps flagged

No significant circularity: self-distillation method is self-contained with empirical gains

Full rationale

The paper defines CRISP explicitly as prompting the identical model with a fixed 'be concise' instruction to generate teacher logits, then applying standard per-token reverse KL to the student's own rollouts. No equations derive a target quantity from fitted parameters, no self-citations supply load-bearing uniqueness theorems, and no ansatz or renaming reduces the central claim to its inputs by construction. Accuracy improvements (9-16 points on MATH-500) are presented as observed outcomes rather than predictions forced by the objective. The setup contains no derivation chain that collapses to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on the standard assumption that reverse KL between teacher and student distributions is a suitable training objective for preserving reasoning quality while encouraging brevity. No new entities or fitted constants are introduced beyond ordinary training hyperparameters.

axioms (1)
  • domain assumption: Reverse KL divergence is an appropriate objective for distilling concise reasoning behavior from self-generated teacher logits.
    Invoked in the description of the training objective; standard in distillation literature but treated as given here.

pith-pipeline@v0.9.0 · 5586 in / 1318 out tokens · 34374 ms · 2026-05-15T15:39:45.259663+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.

  2. Multi-Rollout On-Policy Distillation via Peer Successes and Failures

    cs.LG 2026-05 unverdicted novelty 7.0

    MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.

  3. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  4. Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

    cs.LG 2026-05 unverdicted novelty 7.0

    PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.

  5. Self-Distilled RLVR

    cs.LG 2026-04 unverdicted novelty 7.0

    RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.

  6. PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    cs.AI 2026-03 conditional novelty 7.0

    PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.

  7. Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.

  8. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

  9. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...

  10. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  11. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  12. TIP: Token Importance in On-Policy Distillation

    cs.LG 2026-04 conditional novelty 6.0

    In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.

  13. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  14. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

  15. Reasoning Compression with Mixed-Policy Distillation

    cs.AI 2026-05 unverdicted novelty 5.0

    Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.

  16. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  17. Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

    cs.CL 2026-04 accept novelty 5.0

    LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 15 Pith papers · 13 internal anchors

  1. [1]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697.

  2. [2]

    Distilling the essence: Efficient reasoning distillation via sequence truncation

    Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang, Qingquan Song, Zhipeng Wang, and Muhammad Abdul-Mageed. Distilling the essence: Efficient reasoning distillation via sequence truncation. arXiv preprint arXiv:2512.21002, 2025a. Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, and Maosong Sun. The overthink...

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  4. [4]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617.

  5. [5]

    S3-cot: Self-sampled succinct reasoning enables efficient chain-of-thought llms

    Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin, Ming Ma, Jiayun Li, Yi Jiang, Kai He, Qianyi Xu, et al. S3-cot: Self-sampled succinct reasoning enables efficient chain-of-thought llms. arXiv preprint arXiv:2602.01982.

  6. [6]

    MiniLLM: On-Policy Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In arXiv preprint arXiv:2306.08543.

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

  8. [8]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

  9. [9]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.

  10. [10]

    Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning

    Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296.

  11. [11]

    Reasoning efficiently through adaptive chain-of-thought compression: A self-optimizing framework

    Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, and Xin Xia. Reasoning efficiently through adaptive chain-of-thought compression: A self-optimizing framework. arXiv preprint arXiv:2509.14093.

  12. [12]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802.

  13. [13]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720.

  14. [14]

    The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward

    Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, et al. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430, 2025a. Yanhao Li, Lu Ma, Jiaran Zhang, Lexiang Tang, Wentao Zhang, and Gui...

  15. [15]

    Trimr: Verifier-based training-free thinking compression for efficient test-time scaling.arXiv preprint arXiv:2505.17155,

    Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, and Mingxuan Yuan. Trimr: Verifier-based training-free thinking compression for efficient test-time scaling.arXiv preprint arXiv:2505.17155,

  16. [16]

    DLER: Doing length penalty right – incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110,

    SY Liu, X Dong, X Lu, S Diao, M Liu, MH Chen, H Yin, YCF Wang, KT Cheng, Y Choi, et al. DLER: Doing length penalty right – incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110,

  17. [17]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332,

  18. [18]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897,

  19. [19]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute opti- mally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,

  20. [20]

    Mitigating overthinking in large reasoning models via difficulty-aware reinforcement learning.arXiv preprint arXiv:2601.21418,

    Qian Wan, Ziao Xu, Luona Wei, Xiaoxuan Shen, and Jianwen Sun. Mitigating overthinking in large reasoning models via difficulty-aware reinforcement learning.arXiv preprint arXiv:2601.21418,

  21. [21]

    Wait, we don’t need to" wait"! removing thinking tokens improves reasoning efficiency.arXiv preprint arXiv:2506.08343, 2025a

    Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, and Tianyi Zhou. Wait, we don’t need to" wait"! removing thinking tokens improves reasoning efficiency.arXiv preprint arXiv:2506.08343, 2025a. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 8...

  22. [22]

    Under review

    11 Preprint. Under review. Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Control- lable chain-of-thought compression in llms.Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3351–3363,

  23. [23]

    Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600,

    Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less.arXiv preprint arXiv:2502.18600,

  24. [24]

    Overconfident errors need stronger correction: Asymmetric confidence penalties for reinforcement learning

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Overconfident errors need stronger correction: Asymmetric confidence penalties for reinforcement learning. arXiv preprint arXiv:2602.21420, 2026a.

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation at the frontier of student competence. arXiv preprint arXiv:260...

  25. [25]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275,

  26. [26]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,

  27. [27]

    DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

    Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin. DeepPlanning: Benchmarking long-horizon agentic planning with verifiable constraints. arXiv preprint arXiv:2601.18137,

  28. [28]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,

  29. [29]

    equals the sequence-level KL divergence between student and teacher. Lemma 1 (Chain rule of KL for autoregressive models). For autoregressive distributions $q(y \mid x) = \prod_t q(y_t \mid x, y_{<t})$ and $p(y \mid x) = \prod_t p(y_t \mid x, y_{<t})$, the sequence-level KL decomposes as $D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{y \sim q}\big[\sum_t D_{\mathrm{KL}}\big(q(\cdot \mid x, y_{<t}) \,\|\, p(\cdot \mid x, y_{<t})\big)\big]$. This follows directly from expanding log q(y)/p(y) = ∑...
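
The chain rule in Lemma 1 can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes a hypothetical three-token vocabulary and length-3 sequences, builds two random autoregressive models (`q` standing in for the student, `p` for the teacher), and compares the sequence-level KL computed by brute-force enumeration with the expected sum of per-token KLs under `q`. The conditioning on the prompt x is dropped for brevity.

```python
import itertools
import math
import random

VOCAB = (0, 1, 2)   # toy vocabulary (assumption for illustration)
LENGTH = 3          # toy sequence length (assumption for illustration)


def make_autoregressive_model(seed):
    """Return cond(prefix) -> next-token probabilities, fixed per prefix."""
    rng = random.Random(seed)
    table = {}
    for t in range(LENGTH):
        for prefix in itertools.product(VOCAB, repeat=t):
            weights = [rng.uniform(0.1, 1.0) for _ in VOCAB]
            total = sum(weights)
            table[prefix] = [w / total for w in weights]
    return lambda prefix: table[prefix]


def sequence_prob(cond, y):
    """Probability of the full sequence y under the autoregressive model."""
    prob = 1.0
    for t, token in enumerate(y):
        prob *= cond(y[:t])[token]
    return prob


def token_kl(q_dist, p_dist):
    """KL divergence between two next-token distributions."""
    return sum(qv * math.log(qv / pv) for qv, pv in zip(q_dist, p_dist))


def sequence_level_kl(q, p):
    """Left-hand side: KL over whole sequences, by enumeration."""
    total = 0.0
    for y in itertools.product(VOCAB, repeat=LENGTH):
        qy = sequence_prob(q, y)
        total += qy * math.log(qy / sequence_prob(p, y))
    return total


def expected_sum_of_token_kls(q, p):
    """Right-hand side: expectation under q of the summed per-token KLs."""
    total = 0.0
    for y in itertools.product(VOCAB, repeat=LENGTH):
        per_token = sum(token_kl(q(y[:t]), p(y[:t])) for t in range(LENGTH))
        total += sequence_prob(q, y) * per_token
    return total


q = make_autoregressive_model(seed=0)   # plays the role of the student
p = make_autoregressive_model(seed=1)   # plays the role of the teacher
lhs = sequence_level_kl(q, p)
rhs = expected_sum_of_token_kls(q, p)
print(f"sequence-level KL = {lhs:.6f}, expected per-token sum = {rhs:.6f}")
```

The two quantities agree up to floating-point error, which is why a per-token reverse-KL loss evaluated on student rollouts can stand in for the sequence-level divergence.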

  30. [30]

    · D_C − D_E ≥ 0. (20)

    Remark 2. The core assumption is (A1): harder problems have a larger fraction of essential tokens. This is empirically supported by Table 2 (MATH-500: 57–59% compression vs. AIME 2025: ∼35%). Assumption (A2) is a modeling simplification; the proposition should be interpreted as holding for category-averaged KL v...

  31. [31]

    be concise

    Soft budgets achieve higher compression but substantially lower accuracy than the concise instruction, particularly on competition-level benchmarks. Accuracy (Acc, %), token reduction (Red., %), and accuracy change vs. base model (∆Acc, pp).

    Context | MATH-500: Acc, Red., ∆Acc | AIME 2024: Acc, Red., ∆Acc | AIME 2025: Acc, Red., ∆Acc

    Qwen3-8B, Concise | 86.6, 58.8%, +8.9 | 69.6, 35.4%, −...

  32. [32]

    replaces the qualitative conciseness instruction with a specific reduction target while keeping all other aspects identical. Table 4 reveals a clear compression–accuracy tradeoff across context variants: soft budgets achieve higher compression but substantially lower accuracy than the qualitative concise instruction. Soft budgets compress more aggressively...

  33. [33]

    This mirrors the finding of Shenfeld et al

    to ∼2% (step 100). This mirrors the finding of Shenfeld et al. (2026) that overly aggressive teacher updates create a moving-target problem: the student chases a teacher that is itself changing in response to the student’s updates, leading to a positive feedback loop of increasingly degenerate outputs. M ∈ {40, 50, 60} form a stable plateau. These interval...

  34. [34]

    but still trails the M ≥ 40 regime by 2–3 percentage points. Based on these results, we use M = 50 for all other experiments in this paper, as it sits comfortably in the stable plateau while allowing progressive compression through periodic teacher refresh.

    D Survey of Reasoning Compression Methods

    Table 5 summarizes 19 reasoning compression methods along four...

  35. [35]

    Practical recommendation: step 100 is the sweet spot. Based on these results, we recommend step 100 (roughly 3,200 training examples) as the default checkpoint

    MATH-500 accuracy remains robust throughout, suggesting that compression on easier benchmarks is more sustainable. Practical recommendation: step 100 is the sweet spot. Based on these results, we recommend step 100 (roughly 3,200 training examples) as the default checkpoint. It achieves 57–59% compression on MATH-500 with accuracy gains of 9–16 pp, while...

  36. [36]

    Our implementation is built on top of the verl library Sheng et al

    86.1 1,686 56.5% | 76.3 7,577 41.0% | 61.7 10,137 35.2%

    CRISP (step 200): 86.2 1,191 69.2% | 66.3 6,089 52.6% | 53.8 7,496 52.1%

    F Training and Implementation Details

    Technical setup. All experiments are conducted on a single node equipped with eight NVIDIA H200 GPUs. Our implementation is built on top of the verl library (Sheng et al., 2025), which provides a HybridEngine for...

  37. [37]

    Mixed-precision training is performed in bfloat16, and gradient checkpointing is enabled to reduce peak memory usage

    is used for inference. Mixed-precision training is performed in bfloat16, and gradient checkpointing is enabled to reduce peak memory usage. Training data. Our training data is derived from DAPO-Math-17k (Yu et al., 2025), a deduplicated set of ∼17,000 competition-level math problems. We randomly split the dataset into 80% training (∼13,600 prompts) and 20% ...

  38. [38]

    For each benchmark, we generate 8 responses per problem with temperature 0.6, top-p = 0.95, and top-k = 20, and report

    | Parameter | Value |
    | --- | --- |
    | General | |
    | Models | Qwen3-8B, Qwen3-14B |
    | Loss function | Reverse KL: KL(π_student ∥ π_teacher) |
    | Teacher | Periodic update (M = 50 steps) |
    | Data | |
    | Training prompts | ∼13,600 (from DAPO-Math-17k) |
    | Validation prompts | ∼3,400 ... |
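
To make the quoted decoding settings concrete (temperature 0.6, top-p 0.95, top-k 20, 8 responses per problem), here is a minimal sketch of how the three parameters shape a single next-token distribution. It is an illustration only, not the paper's inference code: the 50-token vocabulary and its logits are hypothetical, and the filter order (temperature, then top-k, then top-p) is one common convention that individual inference engines may implement differently.

```python
import math
import random

TEMPERATURE = 0.6   # values quoted in the evaluation setup
TOP_P = 0.95
TOP_K = 20
NUM_SAMPLES = 8     # responses generated per problem


def filtered_distribution(logits, temperature=TEMPERATURE, top_k=TOP_K, top_p=TOP_P):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    final_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / final_mass for i in kept}


def sample_tokens(logits, num_samples=NUM_SAMPLES, seed=0):
    """Draw independent samples from the filtered next-token distribution."""
    rng = random.Random(seed)
    dist = filtered_distribution(logits)
    token_ids = list(dist)
    weights = [dist[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=num_samples)


toy_logits = [math.log(i + 1) for i in range(50)]  # hypothetical 50-token vocabulary
dist = filtered_distribution(toy_logits)
samples = sample_tokens(toy_logits)
print(len(dist), samples)
```

In a real evaluation the same filtering is applied at every decoding step, and the 8 draws are whole responses rather than single tokens.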

  39. [39]

    The discrete refresh avoids continuous co-adaptation while still allowing compression to deepen over training

    enables progressive compression: after each refresh, the updated teacher, having already internalized compression from the previous round, produces even more concise traces under instruction c, providing a stronger compression signal. The discrete refresh avoids continuous co-adaptation while still allowing compression to deepen over training. Our ablatio...
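
The interaction between the periodic refresh and the reverse-KL objective can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's training code: the policy is a single two-token distribution (index 0 a "verbose" token, index 1 a "concise" token), the effect of the conciseness instruction c is modeled as a fixed additive logit bias `CONCISE_BIAS`, and the learning rate is arbitrary. Only the refresh interval M = 50 and the loss direction KL(student ∥ teacher) come from the text.

```python
import math

M = 50                      # teacher refresh interval reported in the text
CONCISE_BIAS = [0.0, 1.0]   # hypothetical stand-in for the effect of instruction c
LEARNING_RATE = 0.5         # hypothetical toy step size


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student, teacher):
    """KL(student || teacher) for one next-token distribution."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher))


def reverse_kl_grad(student_logits, teacher):
    """Gradient of KL(softmax(z) || teacher) with respect to the logits z."""
    student = softmax(student_logits)
    kl = reverse_kl(student, teacher)
    return [s * (math.log(s / t) - kl) for s, t in zip(student, teacher)]


def train(num_steps):
    student_logits = [0.0, 0.0]          # index 0: verbose token, index 1: concise token
    teacher_base = list(student_logits)  # frozen snapshot used by the teacher
    concise_prob_at_refresh = []
    for step in range(num_steps):
        if step % M == 0:
            # Discrete refresh: copy the current student into the teacher.
            teacher_base = list(student_logits)
            concise_prob_at_refresh.append(softmax(student_logits)[1])
        # Teacher = frozen snapshot plus the toy "conciseness instruction" bias.
        teacher = softmax([b + c for b, c in zip(teacher_base, CONCISE_BIAS)])
        grad = reverse_kl_grad(student_logits, teacher)
        student_logits = [z - LEARNING_RATE * g for z, g in zip(student_logits, grad)]
    return concise_prob_at_refresh, softmax(student_logits)[1]


history, final_concise_prob = train(num_steps=200)
print([round(p, 3) for p in history], round(final_concise_prob, 3))
```

Between refreshes the teacher is a fixed target, so the student is not chasing a moving distribution; at each refresh the target shifts further toward the concise token, which mirrors the progressive compression described above.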

  40. [40]

    exploratory

    MATH-500 accuracy improves steadily for both models, rising from ∼78% to ∼87% (8B) and ∼70% to ∼87% (14B). AIME 2024 and AIME 2025 results exhibit substantially larger variance due to their small sample sizes (30 problems each), though the overall trend remains stable or slightly improving.

    H.1 Entropy Stability Throughout Training

    Figure 8 shows that CRI...

  41. [41]

    Plan a 3-day trip to Beijing for two people with a $2,000 budget

    The same pattern is visible but milder on Qwen3-8B. Figure 10 reveals a parallel instability: forward KL compresses responses more aggressively, with length drops synchronized to the same teacher-update boundaries. On Qwen3-14B the gap widens to ∼500 tokens by step 190 (1,229 vs. 1,766, a 30% shortfall). On hard reasoning benchmarks like AIME, this trunca...

  42. [42]

    “Wait,” “Let me check”

    We hypothesize that this reflects a difference in the reasoning structure of the two tasks: travel planning relies on a fixed sequence of tool calls (search → book → verify) where the reasoning overhead is largely redundant narration, whereas shopping planning requires more adaptive, branching logic (compare prices, evaluate coupons, backtrack). Compres...