
arxiv: 2602.08324 · v3 · pith:6O3MBUNUnew · submitted 2026-02-09 · 💻 cs.LG

Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression

Pith reviewed 2026-05-21 13:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords chain-of-thought compression · extreme ratio compression · large language models · mathematical reasoning · supervised fine-tuning · reinforcement learning · token efficiency · inference optimization

The pith

Extra-CoT compresses chain-of-thought to extreme ratios while improving accuracy on math tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that chain-of-thought reasoning in language models can be shortened to a small fraction of its usual length without losing the ability to reach correct answers. It does this by first training a compressor on math reasoning traces that carry detailed annotations, then applying mixed-ratio supervised fine-tuning to teach the model many different compression levels at once, and finally running reinforcement learning with a hierarchical reward that pushes for strong performance even when the token budget is very low. If the approach holds, reasoning models could deliver the same results with far less computation at inference time. Readers would care because long chain-of-thought sequences currently dominate the cost of using these models for hard problems.

Core claim

Extra-CoT produces reliable high-fidelity supervision at extreme compression ratios by training a dedicated semantically-preserved compressor on fine-grained mathematical CoT data, followed by mixed-ratio SFT that exposes the model to a spectrum of token budgets and CHRPO that uses constrained hierarchical rewards to incentivize question-solving ability under lower budgets, yielding over 73 percent token reduction and a 0.6 percent accuracy gain on MATH-500 with Qwen3-1.7B while outperforming prior methods on three mathematical reasoning benchmarks.

What carries the argument

Extra-CoT framework, whose core mechanisms are a fine-grained compressor that generates compressed yet semantically faithful CoT pairs and Constrained and Hierarchical Ratio Policy Optimization (CHRPO) that explicitly rewards accurate answers at successively tighter token limits.
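
As a rough illustration of what a hierarchical reward of this kind could look like, the sketch below gates a budget term behind answer correctness and adds a separate first-token signal. The Figure 3 caption on this page names the components (a main reward over all tokens and a control-head reward on the first token) but gives no formulas, so everything here is an assumption: the names `Rollout`, `main_reward`, `control_head_reward`, the weight `w_budget`, the linear calibration term, and the idea that the first token encodes a ratio choice are hypothetical. The rationale-integrity and mode criteria are omitted.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    is_correct: bool        # final answer matches the reference
    cot_tokens: int         # tokens actually spent on the chain-of-thought
    original_tokens: int    # length of the uncompressed reference CoT
    target_ratio: float     # requested budget, e.g. 0.2 keeps about 20% of tokens
    chosen_ratio: float     # ASSUMPTION: ratio signalled by the first (control) token


def budget_calibration(r: Rollout) -> float:
    """1.0 when the achieved ratio hits the target, falling linearly to 0.0."""
    achieved = r.cot_tokens / r.original_tokens
    return max(0.0, 1.0 - abs(achieved - r.target_ratio))


def main_reward(r: Rollout, w_budget: float = 0.5) -> float:
    """Hierarchical gate: the budget term only counts once the answer is correct."""
    if not r.is_correct:
        return 0.0
    return 1.0 + w_budget * budget_calibration(r)


def control_head_reward(r: Rollout) -> float:
    """First-token signal only: a correct rollout earns more for a tighter ratio."""
    if not r.is_correct:
        return 0.0
    return 1.0 - r.chosen_ratio
```

The gate reflects one plausible reading of "hierarchical": a short but wrong rollout earns nothing, so brevity cannot be traded against correctness.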

If this is right

  • Models learn to follow a continuous range of compression budgets after mixed-ratio SFT.
  • Hierarchical rewards in the RL stage directly improve solving ability when token counts are forced lower.
  • The same pipeline outperforms earlier CoT compression techniques at the highest ratios tested.
  • Token budgets can be reduced by more than 70 percent on standard math benchmarks while accuracy holds or rises.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the compressor stays faithful across domains, the same extreme-ratio recipe could shorten reasoning traces in code generation or scientific problem solving.
  • Lower average token counts would reduce energy use when many reasoning queries run in parallel on shared hardware.
  • One direct test would be to measure whether the accuracy advantage persists when the base model size increases or when the training data includes non-math tasks.

Load-bearing premise

A compressor trained on annotated mathematical reasoning traces can produce compressed chains that remain logically correct at extreme ratios so that later supervised and reinforcement stages can keep or improve final answer accuracy.

What would settle it

Running Extra-CoT on MATH-500 with Qwen3-1.7B and measuring either less than 70 percent token reduction or an accuracy drop instead of the reported 0.6 percent gain would falsify the central performance claim.
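
The test above reduces to two comparisons, sketched here for concreteness. Following the Figure 1 caption, the compression ratio is compressed CoT length divided by original length, so token reduction is one minus that ratio. The function names and the 0.70 default are illustrative choices, not part of the paper.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of CoT tokens removed; 0.73 means 73% fewer tokens."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - compressed_tokens / original_tokens


def claim_survives(original_tokens: int, compressed_tokens: int,
                   base_acc: float, compressed_acc: float,
                   min_reduction: float = 0.70) -> bool:
    """False when the run shows under 70% reduction or an accuracy drop."""
    reduction = token_reduction(original_tokens, compressed_tokens)
    return reduction >= min_reduction and compressed_acc >= base_acc
```

With made-up numbers, `claim_survives(1000, 270, 0.800, 0.806)` returns `True`, while raising the compressed length to 400 tokens or lowering the compressed accuracy to 0.790 returns `False`. These inputs are placeholders; this page reports only the 73 percent reduction and the 0.6 percent gain, not absolute token counts or accuracies.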

Figures

Figures reproduced from arXiv: 2602.08324 by Bohan Jia, Jiao Xie, Jie Hu, Lianyue Zhang, Shaohui Lin, Wei Li, Wenxi Li, Wenxuan Huang, Xinghao Chen, Rongrong Ji, Yuntian Tang.

Figure 1
Figure 1. Comparison between accuracy and actual compression ratio of CoT tokens, defined as the ratio of the compressed CoT token length to the original length, across three math benchmarks evaluated on Qwen3-1.7B. Extra-CoT outperforms TokenSkip and Thinkless in the extremely low-ratio regime. CHRPO policy further improves performance at the lowest inference budgets, validating the effectiveness of our RL optimiza…
Figure 2
Figure 2. Overall pipeline of the proposed Extra-CoT, which includes three-stage training: (a) Semantically-preserved, question-aware CoT compressor training, (b) Mixed-ratio SFT and (c) CHRPO. We first train a CoT compressor on mathematical CoT data with fine-grained annotations to generate in-domain fixed-ratio compressed data. During mixed-ratio SFT stage, a reasoning LLM is fine-tuned on these fixed-ratio data c…
Figure 3
Figure 3. An illustration of our proposed CHRPO’s hierarchical reward mechanism, which features a main reward and a control-head reward. The main reward, targeting all tokens, integrates four criteria: accuracy, rationale integrity, budget calibration, and rationale-optimized mode. In contrast, the control-head reward is applied only to the first token, providing a direct and immediate signal to shape the policy’s r…
Figure 4
Figure 4. Comparison of output quality between our compressor and LLMLingua-2 at 0.2 and 0.4 compression ratios. While our compressor produces a coherent and semantically faithful output that preserves structural and formula integrity, LLMLingua-2’s output degrades into a fragmented text with semantic discontinuities and incomplete formulas.
Figure 5
Figure 5. Compressor quality comparison between our method (Ours) and LLMLingua-2. Both compressors were used to compress the same dataset at four fixed compression ratios. LLMs then scored the outputs on a 1-5 scale across three metrics: Math Fidelity, Reasoning Coherence, and Clarity & Readability. Quantitative Compressor Evaluation. To quantify compressor quality and test our hypothesis that low-fidelity super…
Figure 6
Figure 6. Compression Labeling Prompt used to generate supervision data. A. Prompt Templates for Compression and Evaluation. Compression Labeling Prompt. We employ a specialized prompt to leverage GPT-4o as our primary annotator for CoT compression. Provided with a question and a word-indexed CoT, the model is tasked with identifying the minimal subsequence of token indices necessary to reconstruct a complete, qu…
Figure 7
Figure 7. Compression Evaluation Prompt. Clarity & Readability. The judge is explicitly instructed to verify whether the compressed text retains the logical validity of the original solution. The output is structured as a JSON object containing individual scores and a brief justification to facilitate automated aggregation and statistical analysis. The complete evaluation prompt is presented in …
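
The labeling setup described in the Figure 6 caption (a word-indexed CoT goes in, a subsequence of kept indices comes out) can be sketched as follows. This is a minimal sketch, not the authors' tooling: it assumes whitespace splitting as the indexing unit and 0-based indices, and the helper names `index_words`, `reconstruct`, and `achieved_ratio` are hypothetical. The annotator call itself is left out.

```python
from __future__ import annotations


def index_words(cot: str) -> list[tuple[int, str]]:
    """Split a CoT on whitespace and attach a 0-based index to each word."""
    return list(enumerate(cot.split()))


def reconstruct(cot: str, keep: list[int]) -> str:
    """Rebuild the compressed CoT from kept indices, preserving original order."""
    words = cot.split()
    # Drop out-of-range and duplicate indices so a noisy annotation cannot crash.
    kept = sorted({i for i in keep if 0 <= i < len(words)})
    return " ".join(words[i] for i in kept)


def achieved_ratio(cot: str, keep: list[int]) -> float:
    """Compressed length divided by original length, measured in words."""
    total = len(cot.split())
    if total == 0:
        return 0.0
    return len(reconstruct(cot, keep).split()) / total
```

For the 14-word trace `"First we note that 2 + 3 = 5 so the answer is 5"`, keeping indices `[4, 5, 6, 7, 8, 11, 13]` yields `"2 + 3 = 5 answer 5"` at a ratio of 0.5. The paper's ratios are defined over tokens, so a word-level ratio like this one is only an approximation.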
Original abstract

Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source codes have been released at https://github.com/Mwie1024/Extra-CoT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Extra-CoT, a framework for extreme-ratio Chain-of-Thought compression. It first trains a dedicated compressor on fine-grained mathematical CoT annotations to produce high-fidelity compressed sequences, then performs mixed-ratio supervised fine-tuning on an LLM, and finally applies Constrained and Hierarchical Ratio Policy Optimization (CHRPO) with hierarchical rewards to maintain question-solving accuracy under reduced token budgets. Experiments on three mathematical reasoning benchmarks, including MATH-500 with Qwen3-1.7B, report over 73% token reduction accompanied by a 0.6% accuracy gain while outperforming prior methods; source code is released.

Significance. If the central results hold under rigorous verification, the work could meaningfully advance efficient inference for reasoning LLMs by demonstrating that aggressive CoT compression need not degrade (and may even improve) final-answer accuracy. The explicit release of source code and the use of a hierarchical reward structure in CHRPO are constructive elements that support reproducibility and targeted optimization.

major comments (2)
  1. Abstract: The headline result (73% token reduction +0.6% accuracy on MATH-500) is load-bearing for the central claim yet rests on the unverified assumption that the dedicated compressor preserves full logical structure at extreme ratios. No quantitative fidelity metrics, error analysis, or examples of preserved versus omitted reasoning steps are referenced, leaving open the possibility that downstream SFT and CHRPO merely compensate for introduced inconsistencies rather than benefiting from true high-fidelity compression.
  2. Method description of CHRPO: The hierarchical reward is defined primarily in terms of final-answer correctness and token budget. This creates a potential mismatch with the compressor-fidelity concern; if subtle logical errors survive compression, the reward signal may not penalize them, undermining the claim that CHRPO explicitly incentivizes reliable reasoning under lower budgets.
minor comments (2)
  1. Abstract and experimental section: Baseline implementations, data splits, statistical significance tests, and ablation results on compressor quality are not described, which hinders direct comparison and assessment of robustness.
  2. Notation for mixed compression ratios: The spectrum of budgets used in SFT is referenced but not formalized with an equation or explicit sampling procedure, making the training protocol harder to replicate.
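
One way the missing sampling procedure could be written down is sketched below, purely as an assumption. It draws one ratio per example uniformly from a fixed grid and pairs the prompt with a compressed target at that ratio. The grid reuses the 0.2 to 0.8 budgets that appear in the compressor comparison on this page; whether SFT uses the same grid, uniform sampling, or a budget tag in the prompt is not stated here, and `build_mixed_ratio_set`, `truncate_compressor`, and the `[ratio=...]` tag are hypothetical names.

```python
import random
from typing import Callable, Iterable

RATIOS = (0.2, 0.4, 0.6, 0.8)  # ASSUMED budget grid, not confirmed for SFT


def build_mixed_ratio_set(
    pairs: Iterable[tuple[str, str]],
    compress: Callable[[str, str, float], str],
    ratios: tuple[float, ...] = RATIOS,
    seed: int = 0,
) -> list[dict]:
    """Attach one uniformly sampled ratio to each (question, cot) pair."""
    rng = random.Random(seed)  # seeded so the mixture is reproducible
    examples = []
    for question, cot in pairs:
        ratio = rng.choice(ratios)
        examples.append({
            "prompt": f"{question}\n[ratio={ratio}]",  # hypothetical budget tag
            "target": compress(question, cot, ratio),
            "ratio": ratio,
        })
    return examples


def truncate_compressor(question: str, cot: str, ratio: float) -> str:
    """Stand-in compressor that keeps the leading words. NOT the paper's model."""
    words = cot.split()
    return " ".join(words[: max(1, round(len(words) * ratio))])
```

Passing the compressor as a callable keeps the sampling logic separate from the compressor itself, so the stand-in can be swapped for a trained model without touching the data builder.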

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional evidence and clarification strengthen the manuscript. We address each major comment below and have revised the paper accordingly.

Point-by-point responses
  1. Referee: Abstract: The headline result (73% token reduction +0.6% accuracy on MATH-500) is load-bearing for the central claim yet rests on the unverified assumption that the dedicated compressor preserves full logical structure at extreme ratios. No quantitative fidelity metrics, error analysis, or examples of preserved versus omitted reasoning steps are referenced, leaving open the possibility that downstream SFT and CHRPO merely compensate for introduced inconsistencies rather than benefiting from true high-fidelity compression.

    Authors: We agree that explicit evidence of compressor fidelity is essential to support the headline claims. The original manuscript describes training the compressor on fine-grained mathematical CoT annotations to achieve semantic preservation, but we acknowledge that quantitative fidelity metrics, error analysis, and concrete examples were not included in the abstract or sufficiently highlighted in the main text. In the revised version we have added a dedicated subsection (Section 3.2) reporting step-level fidelity metrics (BERTScore and ROUGE on reasoning steps) together with representative examples of preserved versus omitted steps and an accompanying error analysis. These additions demonstrate that the compressor maintains logical structure at extreme ratios and that the observed accuracy gains arise from high-fidelity compression rather than downstream compensation. revision: yes

  2. Referee: Method description of CHRPO: The hierarchical reward is defined primarily in terms of final-answer correctness and token budget. This creates a potential mismatch with the compressor-fidelity concern; if subtle logical errors survive compression, the reward signal may not penalize them, undermining the claim that CHRPO explicitly incentivizes reliable reasoning under lower budgets.

    Authors: We appreciate the referee’s observation on the reward design. The hierarchical reward indeed centers on final-answer correctness as the primary term and token budget as a secondary constraint. Because the SFT stage is performed on high-fidelity compressed CoTs produced by the dedicated compressor, logical errors are largely eliminated before RL begins; any residual inconsistency that leads to an incorrect answer is directly penalized by the correctness reward. To make this interaction explicit, we have expanded the CHRPO method section with a clearer breakdown of the hierarchical reward components and added a short discussion of how upstream fidelity and the correctness signal together ensure reliable reasoning. We have also included an ablation showing performance degradation when the compressor is replaced by a lower-fidelity baseline, further supporting the design. revision: yes
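
The first response cites ROUGE computed on reasoning steps. To show what such a score measures, here is a simplified ROUGE-L (longest common subsequence) on whitespace tokens. It is an illustrative sketch only: official ROUGE implementations apply their own tokenization and optional stemming, and nothing here reproduces the authors' evaluation code.

```python
from __future__ import annotations


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> dict[str, float]:
    """Precision, recall and F1 of the LCS between two whitespace-split texts."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    p, r = lcs / len(cand), lcs / len(ref)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

When the compressed step is an exact subsequence of the original, precision is 1.0 and recall equals the fraction of words kept, so a lexical score like this tracks the compression ratio more than logical validity. That limitation bears directly on the referee's fidelity concern.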

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central claims rest on empirical results from training a compressor on annotated CoT data, followed by mixed-ratio SFT and CHRPO-based RL, then measuring accuracy on held-out benchmarks such as MATH-500. These accuracy numbers are obtained after training and are not equivalent to the training inputs by construction. The hierarchical reward in CHRPO is a training objective tied to question-solving but does not reduce the reported benchmark gains to a definitional tautology or fitted input renamed as prediction. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing way that collapses the result to prior author work or ansatz. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The framework rests on empirical training with several tunable components and the unproven assumption that semantic compression preserves logical validity at extreme ratios.

free parameters (2)
  • mixed compression ratios
    Spectrum of token budgets used during SFT and RL stages; values chosen to cover the target extreme-ratio regime.
  • CHRPO reward coefficients
    Weights balancing question-solving accuracy against token budget in the hierarchical policy optimization.
axioms (1)
  • domain assumption: A compressor trained on annotated mathematical CoT can generate high-fidelity compressed traces at extreme ratios.
    Invoked to justify the first training stage that supplies supervision for the main model.
invented entities (1)
  • CHRPO (Constrained and Hierarchical Ratio Policy Optimization) · no independent evidence
    purpose: RL algorithm that explicitly rewards correct answers under progressively tighter token budgets.
    New policy optimization method introduced to stabilize training at extreme compression.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.

per rater.

Budget γ   Ours (Wins)   llmlingua-2 (Wins)   Ours Pref. (%)
0.2        49.4          0.6                  98.8
0.4        49.8          0.2                  100.0
0.6        47.6          2.4                  95.2
0.8        42.0          8.0                  84.0

stage using the exact same RL dataset S and base model backbone as our Extra-CoT method. By keeping the core decoding strategies and optimization hyperparameters identical, this re-implementation isolates the algorithm...