CRISP: Compressed Reasoning via Iterative Self-Policy Distillation
Pith reviewed 2026-05-15 15:39 UTC · model grok-4.3
The pith
CRISP teaches models to compress reasoning by self-distilling concise responses, cutting tokens by over half while raising accuracy on math benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRISP establishes that iterative self-policy distillation, where the model is conditioned on a conciseness instruction to obtain teacher logits and then minimizes reverse KL on its own rollouts, produces reasoning policies that are simultaneously shorter and more accurate. On Qwen3-8B and Qwen3-14B this yields 57-59% token reduction on MATH-500 with absolute accuracy improvements of 9-16 points; on AIME 2024 the 14B model gains 10 points at 41% compression. The same loop transfers to other model families and to multi-step agentic planning, where it cuts tokens by 42-51% while preserving planning quality.
What carries the argument
Iterative self-policy distillation that conditions the model on a "be concise" instruction to generate teacher logits, then minimizes per-token reverse KL divergence on the student's own rollouts.
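A minimal NumPy sketch of the per-token reverse KL term may make the mechanism concrete. It is an illustration under stated assumptions, not the authors' implementation: the function names, the array shapes, and the random logits are all hypothetical, and the teacher logits are simply assumed to come from the same model scored with the conciseness instruction in its prompt.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of one rollout.

    Both inputs have shape (num_tokens, vocab_size) and are assumed to be
    computed on the same student-sampled tokens; the teacher is assumed
    to additionally see the conciseness instruction.
    """
    log_q = log_softmax(student_logits)  # student distribution
    log_p = log_softmax(teacher_logits)  # teacher distribution (held fixed)
    return (np.exp(log_q) * (log_q - log_p)).sum(axis=-1)

# Hypothetical data: a 5-token rollout over an 8-entry vocabulary.
rng = np.random.default_rng(0)
student = rng.normal(size=(5, 8))
teacher = student + 0.5 * rng.normal(size=(5, 8))

kl = per_token_reverse_kl(student, teacher)  # one value per token
loss = kl.mean()                             # averaged over the rollout
```

Because the expectation is taken under the student's own distribution, the term penalizes the student for placing mass where the teacher places little, which is the usual mode-seeking behavior of reverse KL.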
If this is right
- Token usage on MATH-500 falls 57-59% while accuracy rises 9-16 points on Qwen3-8B and 14B models.
- The 14B model gains 10 accuracy points on AIME 2024 at 41% compression.
- Compression occurs more on easy problems and less on hard problems without any explicit difficulty signal.
- The method transfers to other model families and yields 42-51% token savings on multi-step planning tasks while keeping planning quality intact.
- Qualitative conciseness instructions outperform explicit token-target instructions in both compression and accuracy.
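The token-reduction percentages in the list above can be read as a relative drop in mean response length against the base model. The sketch below assumes that definition; the function name and the token counts are hypothetical and chosen only to show the arithmetic, not taken from the paper's tables.

```python
def token_reduction(base_tokens, compressed_tokens):
    """Fraction of tokens removed relative to the base model."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return 1.0 - compressed_tokens / base_tokens

# Hypothetical mean response lengths, for illustration only.
easy = token_reduction(base_tokens=4000, compressed_tokens=1600)   # about 0.60
hard = token_reduction(base_tokens=12000, compressed_tokens=7800)  # about 0.35
```

Under this reading, a smaller percentage on hard problems means more of the original deliberation is retained there.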
Where Pith is reading between the lines
- Much of the length in standard chain-of-thought traces appears to be removable noise that self-supervision can filter without external data.
- If the compression scales to larger models and longer horizons, inference latency and cost for deployed reasoning systems could drop substantially.
- Periodic teacher refreshes create a stable training regime that might generalize to distilling other behavioral traits beyond length.
- The same self-distillation loop could be tested on non-reasoning tasks such as code generation or dialogue to measure whether similar noise removal occurs.
Load-bearing premise
The approach assumes that prompting the model with a generic "be concise" instruction produces teacher logits that remain both correct and sufficiently informative, so that minimizing reverse KL on student rollouts does not silently degrade reasoning quality on hard problems.
What would settle it
Running the full CRISP loop on a hard reasoning benchmark (for example, an AIME-style competition set not used in the paper) and observing accuracy drop below the uncompressed baseline while token count decreases would falsify the claim that the distillation preserves or improves quality.
Original abstract
Reasoning models think out loud, but much of what they say is noise. We introduce CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a "be concise" instruction to obtain teacher logits, and minimize per-token reverse KL on the student's own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: CRISP automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. Ablations show that qualitative conciseness instructions outperform explicit token targets, and periodic teacher refreshes yield a broad stable regime. The method generalizes across model families (DeepSeek-R1-Distill-Llama-8B improves accuracy by up to 5 points with 17-32% compression) and transfers beyond math to multi-step agentic planning (DeepPlanning), reducing token usage by 42-51% while preserving planning quality. Code is available at https://github.com/HJSang/OPSD_Reasoning_Compression.
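The periodic teacher refresh mentioned in the abstract can be illustrated with a deliberately tiny toy. Everything here is an assumption made for illustration: the "policy" is a categorical distribution over three response styles, the effect of the conciseness instruction is modeled as a fixed logit shift (`concise_shift`), and plain gradient descent replaces the real optimizer. It is not the paper's training loop, only a sketch of why refreshing a frozen teacher lets compression deepen in stages.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reverse_kl_grad(student_logits, teacher_logits):
    """Gradient of KL(softmax(s) || softmax(t)) with respect to s."""
    q, p = softmax(student_logits), softmax(teacher_logits)
    r = np.log(q) - np.log(p)
    return q * (r - (q * r).sum())

def train(steps, refresh_every, lr=0.5):
    # Toy policy over three response styles: 0 = short, 1 = medium, 2 = long.
    student = np.zeros(3)
    # Hypothetical stand-in for the conciseness instruction: it shifts
    # the teacher's logits toward the short style.
    concise_shift = np.array([1.0, 0.0, -1.0])
    p_short = []
    for step in range(steps):
        if step % refresh_every == 0:
            # Teacher refresh: freeze a copy of the current student,
            # then apply the (assumed) instruction effect.
            teacher = student.copy() + concise_shift
        student = student - lr * reverse_kl_grad(student, teacher)
        p_short.append(softmax(student)[0])
    return p_short

with_refresh = train(steps=200, refresh_every=50)
single_teacher = train(steps=200, refresh_every=10_000)  # never refreshed
```

In this toy, the run with a single fixed teacher settles near that teacher's distribution, while the refreshed run keeps moving probability toward the short style after each refresh. The toy has no notion of correctness, so it says nothing about whether accuracy is preserved.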
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CRISP, a self-distillation technique for compressing LLM reasoning traces: the same model is prompted with a generic 'be concise' instruction to produce teacher logits, after which per-token reverse KL is minimized between those logits and the student's own rollouts; this process is iterated without ground-truth labels, token budgets, or external supervision. The central empirical claim is that the procedure yields 57-59% token reduction on MATH-500 while raising accuracy by 9-16 points (Qwen3-8B/14B), a 10-point gain on AIME 2024 at 41% compression (Qwen3-14B), and generalization to other model families and to multi-step planning tasks.
Significance. If the reported gains are robust, CRISP supplies a strikingly simple, parameter-free route to more efficient reasoning models that automatically allocates deliberation according to problem difficulty. The absence of fitted hyperparameters, the public code release, and the cross-family transfer results would constitute a useful contribution to the literature on test-time compute and self-improvement.
major comments (3)
- [§3 and §4] §3 (Method) and §4 (Experiments): the central claim that reverse-KL distillation from the 'be concise' teacher improves reasoning quality presupposes that the teacher logits remain at least as accurate and step-complete as the base policy on hard items. No table or figure reports the accuracy of the teacher model (under the identical 'be concise' prompt) on MATH-500 or AIME 2024 before distillation begins; without this baseline the observed 9-16 point gains could arise from iteration effects, prompt sensitivity, or evaluation variance rather than the claimed mechanism.
- [Table 1 and Figure 2] Table 1 and Figure 2: the reported accuracy improvements are given as point estimates without error bars, number of evaluation seeds, or statistical significance tests. Given that the method relies on stochastic rollouts and iterative self-distillation, the absence of variance estimates makes it impossible to assess whether the 9-16 point gains on MATH-500 are reliable or within the noise of the base model.
- [§4.3] §4.3 (Ablations): the claim that qualitative 'be concise' instructions outperform explicit token targets is load-bearing for the method's simplicity argument, yet the ablation only compares a single generic instruction against one explicit budget; no sweep over instruction phrasing or verification that the chosen instruction preserves teacher accuracy on the hardest problems is provided.
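To put the request for variance estimates in perspective, a rough binomial standard error shows the scale of sampling noise on each benchmark. This is a back-of-the-envelope sketch: the function name is hypothetical, the accuracies are illustrative placeholders rather than values from the paper's tables, and treating graded responses as independent draws is an optimistic simplification.

```python
import math

def accuracy_standard_error(accuracy, num_problems, samples_per_problem=1):
    """Binomial standard error of a mean accuracy, on the 0-1 scale.

    Treats every graded response as an independent Bernoulli draw, which
    is optimistic when several samples share one problem.
    """
    n = num_problems * samples_per_problem
    return math.sqrt(accuracy * (1.0 - accuracy) / n)

# MATH-500 has 500 problems; each AIME set has 30.
# The accuracies are placeholders chosen for illustration.
se_math = accuracy_standard_error(0.87, 500)
se_aime = accuracy_standard_error(0.70, 30)
print(f"MATH-500: +/- {100 * se_math:.1f} pp, AIME: +/- {100 * se_aime:.1f} pp")
```

With these placeholder values the standard error is roughly 1.5 points on MATH-500 and roughly 8.4 points on a 30-problem AIME set, so a 9-16 point gain on MATH-500 is large relative to sampling noise, whereas a 10-point gain on AIME is much closer to it. Seed-to-seed training variance is a separate source of uncertainty that this estimate does not capture.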
minor comments (2)
- [Abstract and §3] The abstract states 'periodic teacher refreshes yield a broad stable regime' but the main text does not specify the refresh frequency or the criterion used to decide when to refresh; a short clarifying sentence would help reproducibility.
- [§2] Notation for the reverse-KL objective is introduced without an explicit equation number; adding Eq. (X) would make the loss definition easier to reference.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
-
Referee: [§3 and §4] §3 (Method) and §4 (Experiments): the central claim that reverse-KL distillation from the 'be concise' teacher improves reasoning quality presupposes that the teacher logits remain at least as accurate and step-complete as the base policy on hard items. No table or figure reports the accuracy of the teacher model (under the identical 'be concise' prompt) on MATH-500 or AIME 2024 before distillation begins; without this baseline the observed 9-16 point gains could arise from iteration effects, prompt sensitivity, or evaluation variance rather than the claimed mechanism.
Authors: We agree that an explicit baseline for the 'be concise' teacher is necessary to isolate the contribution of the distillation process. In the revised manuscript we will add a new table (or additional columns to Table 1) reporting accuracy and token usage for the base models under the identical 'be concise' prompt before any distillation iterations begin. This will demonstrate that the teacher starts from accuracy comparable to the unprompted base policy and that the observed gains arise from the iterative self-distillation rather than prompt effects alone. revision: yes
-
Referee: [Table 1 and Figure 2] Table 1 and Figure 2: the reported accuracy improvements are given as point estimates without error bars, number of evaluation seeds, or statistical significance tests. Given that the method relies on stochastic rollouts and iterative self-distillation, the absence of variance estimates makes it impossible to assess whether the 9-16 point gains on MATH-500 are reliable or within the noise of the base model.
Authors: We acknowledge that variance estimates would strengthen confidence in the results. Our main experiments used single runs owing to the high computational cost of iterative distillation on 8B–14B models. The large effect sizes (9–16 points) and their replication across two model scales, two benchmarks, and additional model families make random variation unlikely. In the revision we will add an explicit discussion of this limitation and, where compute permits, report results from three evaluation seeds with standard deviations for the primary MATH-500 experiments. revision: partial
-
Referee: [§4.3] §4.3 (Ablations): the claim that qualitative 'be concise' instructions outperform explicit token targets is load-bearing for the method's simplicity argument, yet the ablation only compares a single generic instruction against one explicit budget; no sweep over instruction phrasing or verification that the chosen instruction preserves teacher accuracy on the hardest problems is provided.
Authors: We appreciate the call to expand the ablation. The existing comparison was chosen to emphasize the method's lack of hyperparameters, but we agree it is insufficient. In the revised §4.3 we will include a sweep over several instruction phrasings and will separately report the teacher's accuracy on the hardest subset of MATH-500 (level-5 problems) to confirm that the selected instruction preserves or improves performance on difficult items. revision: yes
Circularity Check
No significant circularity: self-distillation method is self-contained with empirical gains
The paper defines CRISP explicitly as prompting the identical model with a fixed 'be concise' instruction to generate teacher logits, then applying standard per-token reverse KL to the student's own rollouts. No equations derive a target quantity from fitted parameters, no self-citations supply load-bearing uniqueness theorems, and no ansatz or renaming reduces the central claim to its inputs by construction. Accuracy improvements (9-16 points on MATH-500) are presented as observed outcomes rather than predictions forced by the objective. The setup contains no derivation chain that collapses to tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: reverse KL divergence is an appropriate objective for distilling concise reasoning behavior from self-generated teacher logits.
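For reference, the objective this assumption concerns can be written out as below. The notation is assumed here rather than quoted from the paper: $\pi_\theta$ is the student, $\pi_{\bar\theta}$ a frozen copy of the same model serving as teacher, $c$ the conciseness instruction, $x$ the prompt, and $y$ a rollout sampled from the student. The form is chosen to agree with the chain-rule lemma quoted in the reference excerpts.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t} D_{\mathrm{KL}}\!\left(
      \pi_\theta(\cdot \mid x, y_{<t})
      \,\middle\|\,
      \pi_{\bar\theta}(\cdot \mid x, c, y_{<t})
    \right) \right]
```

The only asymmetry between the two policies in this sketch is the instruction $c$ in the teacher's context; the weights differ only between teacher refreshes.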
Forward citations
Cited by 17 Pith papers
-
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
EGRSD and CL-EGRSD advance the accuracy-length frontier in LLM reasoning by entropy-guided weighting of token-level distillation signals from the teacher.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
-
Self-Distilled RLVR
RLSD mixes self-distillation for token-level policy difference magnitudes with RLVR for reliable update directions from response correctness to reach higher convergence and better training stability.
-
PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence
PACED applies student pass-rate weighting w(p)=p(1-p) to distillation, concentrating on the zone of proximal development and delivering up to +8.2 gains on AIME tasks with reduced forgetting.
-
Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.
-
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
ATESD makes teacher exposure to reference reasoning a learnable control variable via a Beta-policy optimized on future student improvement, yielding gains of up to +2.33 points over fixed-exposure self-distillation on...
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Multilingual Safety Alignment via Self-Distillation
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
-
TIP: Token Importance in On-Policy Distillation
In on-policy distillation, tokens with high student entropy or low entropy plus high teacher divergence provide dense corrective signal, allowing effective training on under 20% of tokens across math and planning tasks.
-
$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...
-
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.
-
Reasoning Compression with Mixed-Policy Distillation
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
-
Multilingual Safety Alignment via Self-Distillation
MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
-
Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning
LLM post-training is unified as off-policy or on-policy interventions that expand support for useful behaviors, reshape policies within reachable states, or consolidate behavior across training stages.
Reference graph
Works this paper leans on
-
[1]
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697,
-
[2]
Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang, Qingquan Song, Zhipeng Wang, and Muhammad Abdul-Mageed. Distilling the essence: Efficient reasoning distillation via sequence truncation. arXiv preprint arXiv:2512.21002, 2025a.
Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, and Maosong Sun. The overthink...
-
[3]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261,
-
[4]
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617,
-
[5]
Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin, Ming Ma, Jiayun Li, Yi Jiang, Kai He, Qianyi Xu, et al. S3-cot: Self-sampled succinct reasoning enables efficient chain-of-thought llms. arXiv preprint arXiv:2602.01982,
-
[6]
MiniLLM: On-Policy Distillation of Large Language Models
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In arXiv preprint arXiv:2306.08543,
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,
-
[8]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[9]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874,
-
[10]
Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296,
-
[11]
Reasoning efficiently through adaptive chain-of-thought compression: A self-optimizing framework
Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, and Xin Xia. Reasoning efficiently through adaptive chain-of-thought compression: A self-optimizing framework. arXiv preprint arXiv:2509.14093,
-
[12]
Reinforcement Learning via Self-Distillation
10 Preprint. Under review. Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802,
-
[13]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720,
-
[14]
Long Li, Jiaran Hao, Jason Klein Liu, Zhijian Zhou, Yanting Miao, Wei Pang, Xiaoyu Tan, Wei Chu, Zhe Wang, Shirui Pan, et al. The choice of divergence: A neglected key to mitigating diversity collapse in reinforcement learning with verifiable reward. arXiv preprint arXiv:2509.07430, 2025a.
Yanhao Li, Lu Ma, Jiaran Zhang, Lexiang Tang, Wentao Zhang, and Gui...
-
[15]
Weizhe Lin, Xing Li, Zhiyuan Yang, Xiaojin Fu, Hui-Ling Zhen, Yaoyuan Wang, Xianzhi Yu, Wulong Liu, Xiaosong Li, and Mingxuan Yuan. Trimr: Verifier-based training-free thinking compression for efficient test-time scaling. arXiv preprint arXiv:2505.17155,
-
[16]
SY Liu, X Dong, X Lu, S Diao, M Liu, MH Chen, H Yin, YCF Wang, KT Cheng, Y Choi, et al. DLER: Doing length penalty right – incentivizing more intelligence per token via reinforcement learning. arXiv preprint arXiv:2510.15110,
-
[17]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332,
-
[18]
Self-Distillation Enables Continual Learning
Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897,
-
[19]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,
-
[20]
Qian Wan, Ziao Xu, Luona Wei, Xiaoxuan Shen, and Jianwen Sun. Mitigating overthinking in large reasoning models via difficulty-aware reinforcement learning. arXiv preprint arXiv:2601.21418,
-
[21]
Chenlong Wang, Yuanning Feng, Dongping Chen, Zhaoyang Chu, Ranjay Krishna, and Tianyi Zhou. Wait, we don’t need to "wait"! removing thinking tokens improves reasoning efficiency. arXiv preprint arXiv:2506.08343, 2025a.
Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 8...
-
[22]
Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 3351–3363,
-
[23]
Silei Xu, Wenhao Xie, Lingxiao Zhao, and Pengcheng He. Chain of draft: Thinking faster by writing less. arXiv preprint arXiv:2502.18600,
-
[24]
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Overconfident errors need stronger correction: Asymmetric confidence penalties for reinforcement learning. arXiv preprint arXiv:2602.21420, 2026a.
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation at the frontier of student competence. arXiv preprint arXiv:260...
-
[25]
On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275,
-
[26]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476,
-
[27]
Yinger Zhang, Shutong Jiang, Renhao Li, Jianhong Tu, Yang Su, Lianghao Deng, Xudong Guo, Chenxu Lv, and Junyang Lin. Deepplanning: Benchmarking long-horizon agentic planning with verifiable constraints. arXiv preprint arXiv:2601.18137,
-
[28]
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734,
-
[29]
equals the sequence-level KL divergence between student and teacher. Lemma 1 (Chain rule of KL for autoregressive models). For autoregressive distributions $q(y \mid x) = \prod_t q(y_t \mid x, y_{<t})$ and $p(y \mid x) = \prod_t p(y_t \mid x, y_{<t})$, the sequence-level KL decomposes as $D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{y \sim q}\big[\sum_t D_{\mathrm{KL}}(q(\cdot \mid x, y_{<t}) \,\|\, p(\cdot \mid x, y_{<t}))\big]$. This follows directly from expanding $\log\big(q(y)/p(y)\big) = \sum$...
-
[30]
· $D_C - D_E \ge 0$. (20)
Remark 2. The core assumption is (A1): harder problems have a larger fraction of essential tokens. This is empirically supported by Table 2 (MATH-500: 57–59% compression vs. AIME 2025: ∼35%). Assumption (A2) is a modeling simplification; the proposition should be interpreted as holding for category-averaged KL v...
-
[31]
Soft budgets achieve higher compression but substantially lower accuracy than the concise instruction, particularly on competition-level benchmarks. Accuracy (Acc, %), token reduction (Red., %), and accuracy change vs. base model (∆Acc, pp).
Columns: Context | MATH-500: Acc, Red., ∆Acc | AIME 2024: Acc, Red., ∆Acc | AIME 2025: Acc, Red., ∆Acc
Qwen3-8B
Concise | 86.6, 58.8%, +8.9 | 69.6, 35.4%, −...
-
[32]
replaces the qualitative conciseness instruction with a specific reduction target while keeping all other aspects identical. Table 4 reveals a clear compression–accuracy tradeoff across context variants: soft budgets achieve higher compression but substantially lower accuracy than the qualitative concise instruction. Soft budgets compress more aggressively...
-
[33]
to ∼2% (step 100). This mirrors the finding of Shenfeld et al. (2026) that overly aggressive teacher updates create a moving target problem: the student chases a teacher that is itself changing in response to the student’s updates, leading to a positive feedback loop of increasingly degenerate outputs.
M ∈ {40, 50, 60} form a stable plateau. These interval...
-
[34]
but still trails the M ≥ 40 regime by 2–3 percentage points. Based on these results, we use M = 50 for all other experiments in this paper, as it sits comfortably in the stable plateau while allowing progressive compression through periodic teacher refresh.
D Survey of Reasoning Compression Methods
Table 5 summarizes 19 reasoning compression methods along four...
-
[35]
MATH-500 accuracy remains robust throughout, suggesting that compression on easier benchmarks is more sustainable.
Practical recommendation: step 100 is the sweet spot. Based on these results, we recommend step 100 (roughly 3,200 training examples) as the default checkpoint. It achieves 57–59% compression on MATH-500 with accuracy gains of 9–16 pp, while...
-
[36]
86.1 1,686 56.5% | 76.3 7,577 41.0% | 61.7 10,137 35.2%
CRISP (step 200) 86.2 1,191 69.2% | 66.3 6,089 52.6% | 53.8 7,496 52.1%
F Training and Implementation Details
Technical setup. All experiments are conducted on a single node equipped with eight NVIDIA H200 GPUs. Our implementation is built on top of the verl library Sheng et al. (2025), which provides a HybridEngine for...
-
[37]
is used for inference. Mixed-precision training is performed in bfloat16, and gradient checkpointing is enabled to reduce peak memory usage.
Training data. Our training data is derived from DAPO-Math-17k Yu et al. (2025), a deduplicated set of ∼17,000 competition-level math problems. We randomly split the dataset into 80% training (∼13,600 prompts) and 20% ...
-
[38]
For each benchmark, we generate 8 responses per problem with temperature 0.6, top-p = 0.95, and top-k = 20, and report
Parameter | Value
General
Models | Qwen3-8B, Qwen3-14B
Loss function | Reverse KL: KL(π_student ∥ π_teacher)
Teacher | Periodic update (M = 50 steps)
Data
Training prompts | ∼13,600 (from DAPO-Math-17k)
Validation prompts | ∼3,400 ...
-
[39]
enables progressive compression: after each refresh, the updated teacher, having already internalized compression from the previous round, produces even more concise traces under instruction c, providing a stronger compression signal. The discrete refresh avoids continuous co-adaptation while still allowing compression to deepen over training. Our ablatio...
-
[40]
MATH-500 accuracy improves steadily for both models, rising from ∼78% to ∼87% (8B) and ∼70% to ∼87% (14B). AIME 2024 and AIME 2025 results exhibit substantially larger variance due to their small sample sizes (30 problems each), though the overall trend remains stable or slightly improving. H.1 Entropy Stability Throughout Training Figure 8 shows that CRI...
-
[41]
Plan a 3-day trip to Beijing for two people with a $2,000 budget
The same pattern is visible but milder on Qwen3-8B. Figure 10 reveals a parallel instability: forward KL compresses responses more aggressively, with length drops synchronized to the same teacher-update boundaries. On Qwen3-14B the gap widens to ∼500 tokens by step 190 (1,229 vs. 1,766, a 30% shortfall). On hard reasoning benchmarks like AIME, this trunca...
-
[42]
We hypothesize that this reflects a difference in the reasoning structure of the two tasks: travel planning relies on a fixed sequence of tool calls (search → book → verify) where the reasoning overhead is largely redundant narration, whereas shopping planning requires more adaptive, branching logic (compare prices, evaluate coupons, backtrack). Compres...