Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression
Pith reviewed 2026-05-21 13:38 UTC · model grok-4.3
The pith
Extra-CoT compresses chain-of-thought to extreme ratios while improving accuracy on math tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extra-CoT produces reliable, high-fidelity supervision at extreme compression ratios in three stages. First, it trains a dedicated semantically-preserved compressor on fine-grained mathematical CoT data. Second, mixed-ratio supervised fine-tuning (SFT) exposes the model to a spectrum of token budgets. Third, CHRPO uses constrained hierarchical rewards to incentivize question-solving ability under lower budgets. With Qwen3-1.7B on MATH-500, the pipeline yields over 73 percent token reduction and a 0.6 percent accuracy gain, and it outperforms prior methods on three mathematical reasoning benchmarks.
What carries the argument
The argument rests on the Extra-CoT framework. Its core mechanisms are (1) a fine-grained compressor that generates compressed yet semantically faithful CoT pairs and (2) Constrained and Hierarchical Ratio Policy Optimization (CHRPO), which explicitly rewards accurate answers at successively tighter token limits.
If this is right
- Models learn to follow a continuous range of compression budgets after mixed-ratio SFT.
- Hierarchical rewards in the RL stage directly improve solving ability when token counts are forced lower.
- The same pipeline outperforms earlier CoT compression techniques at the highest ratios tested.
- Token budgets can be reduced by more than 70 percent on standard math benchmarks while accuracy holds or rises.
Where Pith is reading between the lines
- If the compressor stays faithful across domains, the same extreme-ratio recipe could shorten reasoning traces in code generation or scientific problem solving.
- Lower average token counts would reduce energy use when many reasoning queries run in parallel on shared hardware.
- One direct test would be to measure whether the accuracy advantage persists when the base model size increases or when the training data includes non-math tasks.
Load-bearing premise
A compressor trained on annotated mathematical reasoning traces can produce compressed chains that remain logically correct at extreme ratios so that later supervised and reinforcement stages can keep or improve final answer accuracy.
What would settle it
Running Extra-CoT on MATH-500 with Qwen3-1.7B and measuring either less than 70 percent token reduction or an accuracy drop instead of the reported 0.6 percent gain would falsify the central performance claim.
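The falsification criterion above reduces to two numbers per run, so it can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `RunResult` container and a 70 percent threshold taken from the sentence above; none of the names come from the paper or its released code, and the figures in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Aggregate numbers from one evaluation run (hypothetical container)."""
    accuracy: float     # fraction of problems answered correctly, in [0, 1]
    mean_tokens: float  # average number of generated tokens per problem


def token_reduction(baseline: RunResult, compressed: RunResult) -> float:
    """Relative reduction in mean tokens; 0.73 means 73 percent fewer tokens."""
    if baseline.mean_tokens <= 0:
        raise ValueError("baseline.mean_tokens must be positive")
    return 1.0 - compressed.mean_tokens / baseline.mean_tokens


def claim_is_falsified(
    baseline: RunResult,
    compressed: RunResult,
    min_reduction: float = 0.70,  # threshold from the criterion above
) -> bool:
    """True if the run shows too little compression or an accuracy drop."""
    reduction = token_reduction(baseline, compressed)
    accuracy_delta = compressed.accuracy - baseline.accuracy
    return reduction < min_reduction or accuracy_delta < 0.0
```

For example, with an invented baseline of `RunResult(0.800, 1000.0)` and a compressed run of `RunResult(0.806, 270.0)`, the reduction is 0.73 and the accuracy delta is positive, so the check returns `False`.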
Original abstract
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs), yet it incurs substantial computational overhead for inference. Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios, resulting in significant performance degradation. To achieve high-fidelity, fast reasoning, we propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. To generate reliable, high-fidelity supervision, we first train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations. An LLM is then fine-tuned on these compressed pairs via a mixed-ratio supervised fine-tuning (SFT), teaching it to follow a spectrum of compression budgets and providing a stable initialization for reinforcement learning (RL). We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward. Experiments on three mathematical reasoning benchmarks show the superiority of Extra-CoT. For example, on MATH-500 using Qwen3-1.7B, Extra-CoT achieves over 73% token reduction with an accuracy improvement of 0.6%, significantly outperforming state-of-the-art (SOTA) methods. Our source codes have been released at https://github.com/Mwie1024/Extra-CoT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Extra-CoT, a framework for extreme-ratio Chain-of-Thought compression. It first trains a dedicated compressor on fine-grained mathematical CoT annotations to produce high-fidelity compressed sequences, then performs mixed-ratio supervised fine-tuning on an LLM, and finally applies Constrained and Hierarchical Ratio Policy Optimization (CHRPO) with hierarchical rewards to maintain question-solving accuracy under reduced token budgets. Experiments on three mathematical reasoning benchmarks, including MATH-500 with Qwen3-1.7B, report over 73% token reduction accompanied by a 0.6% accuracy gain while outperforming prior methods; source code is released.
Significance. If the central results hold under rigorous verification, the work could meaningfully advance efficient inference for reasoning LLMs by demonstrating that aggressive CoT compression need not degrade (and may even improve) final-answer accuracy. The explicit release of source code and the use of a hierarchical reward structure in CHRPO are constructive elements that support reproducibility and targeted optimization.
major comments (2)
- Abstract: The headline result (73% token reduction +0.6% accuracy on MATH-500) is load-bearing for the central claim yet rests on the unverified assumption that the dedicated compressor preserves full logical structure at extreme ratios. No quantitative fidelity metrics, error analysis, or examples of preserved versus omitted reasoning steps are referenced, leaving open the possibility that downstream SFT and CHRPO merely compensate for introduced inconsistencies rather than benefiting from true high-fidelity compression.
- Method description of CHRPO: The hierarchical reward is defined primarily in terms of final-answer correctness and token budget. This creates a potential mismatch with the compressor-fidelity concern; if subtle logical errors survive compression, the reward signal may not penalize them, undermining the claim that CHRPO explicitly incentivizes reliable reasoning under lower budgets.
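The second major comment can be made concrete with a toy reward. The sketch below is an assumption-laden illustration, not the paper's CHRPO reward: the function name, the tier structure, and the `tier_bonus` value are all hypothetical. It gates on final-answer correctness and then adds a bonus for each successively tighter token budget that the response satisfies.

```python
from typing import Sequence


def hierarchical_reward(
    is_correct: bool,
    num_tokens: int,
    budgets: Sequence[int],
    tier_bonus: float = 0.25,  # hypothetical coefficient, not from the paper
) -> float:
    """Toy hierarchical reward: correctness first, then budget tiers.

    An incorrect answer earns nothing regardless of length. A correct answer
    earns 1.0 plus `tier_bonus` for every budget it fits, walking from the
    loosest budget to the tightest and stopping at the first one it exceeds.
    """
    if not is_correct:
        return 0.0
    reward = 1.0
    for budget in sorted(budgets, reverse=True):  # loosest to tightest
        if num_tokens <= budget:
            reward += tier_bonus
        else:
            break
    return reward
```

Note that the inputs are only `is_correct` and `num_tokens`. A chain containing a flawed intermediate step that still reaches the right answer receives the same reward as a sound one, which is the mismatch the referee describes.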
minor comments (2)
- Abstract and experimental section: Baseline implementations, data splits, statistical significance tests, and ablation results on compressor quality are not described, which hinders direct comparison and assessment of robustness.
- Notation for mixed compression ratios: The spectrum of budgets used in SFT is referenced but not formalized with an equation or explicit sampling procedure, making the training protocol harder to replicate.
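One way the missing sampling procedure could be written down is shown below. This is a sketch under stated assumptions: uniform sampling over a discrete ratio set, a bracketed budget tag in the prompt, and the helper name `build_mixed_ratio_dataset` are all hypothetical and are not taken from the paper. The compressed target for each example would come from the compressor at the sampled ratio and is omitted here.

```python
import random
from typing import Dict, List, Sequence


def build_mixed_ratio_dataset(
    questions: Sequence[str],
    ratios: Sequence[float],
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Assign each question one compression ratio, sampled uniformly.

    The ratio is recorded and also written into the prompt as a tag, so a
    model fine-tuned on the result sees a mixture of budgets.
    """
    if not ratios:
        raise ValueError("ratios must be non-empty")
    rng = random.Random(seed)  # seeded for a reproducible assignment
    dataset = []
    for question in questions:
        ratio = rng.choice(list(ratios))
        dataset.append({
            "ratio": ratio,
            "prompt": f"[budget ratio={ratio:.1f}] {question}",
        })
    return dataset
```

Stating the ratio set, the sampling distribution, and the seed in this form would be enough for a reader to reproduce the SFT data mixture.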
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments have helped us identify areas where additional evidence and clarification strengthen the manuscript. We address each major comment below and have revised the paper accordingly.
- Referee: Abstract: The headline result (73% token reduction +0.6% accuracy on MATH-500) is load-bearing for the central claim yet rests on the unverified assumption that the dedicated compressor preserves full logical structure at extreme ratios. No quantitative fidelity metrics, error analysis, or examples of preserved versus omitted reasoning steps are referenced, leaving open the possibility that downstream SFT and CHRPO merely compensate for introduced inconsistencies rather than benefiting from true high-fidelity compression.
Authors: We agree that explicit evidence of compressor fidelity is essential to support the headline claims. The original manuscript describes training the compressor on fine-grained mathematical CoT annotations to achieve semantic preservation, but we acknowledge that quantitative fidelity metrics, error analysis, and concrete examples were not included in the abstract or sufficiently highlighted in the main text. In the revised version we have added a dedicated subsection (Section 3.2) reporting step-level fidelity metrics (BERTScore and ROUGE on reasoning steps) together with representative examples of preserved versus omitted steps and an accompanying error analysis. These additions demonstrate that the compressor maintains logical structure at extreme ratios and that the observed accuracy gains arise from high-fidelity compression rather than downstream compensation. revision: yes
- Referee: Method description of CHRPO: The hierarchical reward is defined primarily in terms of final-answer correctness and token budget. This creates a potential mismatch with the compressor-fidelity concern; if subtle logical errors survive compression, the reward signal may not penalize them, undermining the claim that CHRPO explicitly incentivizes reliable reasoning under lower budgets.
Authors: We appreciate the referee’s observation on the reward design. The hierarchical reward indeed centers on final-answer correctness as the primary term and token budget as a secondary constraint. Because the SFT stage is performed on high-fidelity compressed CoTs produced by the dedicated compressor, logical errors are largely eliminated before RL begins; any residual inconsistency that leads to an incorrect answer is directly penalized by the correctness reward. To make this interaction explicit, we have expanded the CHRPO method section with a clearer breakdown of the hierarchical reward components and added a short discussion of how upstream fidelity and the correctness signal together ensure reliable reasoning. We have also included an ablation showing performance degradation when the compressor is replaced by a lower-fidelity baseline, further supporting the design. revision: yes
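The step-level fidelity metrics mentioned in the first response can be approximated with a simple lexical proxy. The sketch below swaps in a unigram-overlap F1 (in the spirit of ROUGE-1) for the BERTScore and ROUGE measurements the authors describe; it is not their evaluation code, and the function names are hypothetical. Each original reasoning step is scored against its best-matching compressed step, and the scores are averaged.

```python
from collections import Counter
from typing import Sequence


def unigram_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def step_coverage(
    original_steps: Sequence[str],
    compressed_steps: Sequence[str],
) -> float:
    """Mean best-match F1 of each original step against the compressed steps."""
    if not original_steps:
        raise ValueError("original_steps must be non-empty")
    if not compressed_steps:
        return 0.0
    best = [
        max(unigram_f1(step, c) for c in compressed_steps)
        for step in original_steps
    ]
    return sum(best) / len(best)
```

A lexical score of this kind only measures surface overlap. It cannot confirm that the logical dependencies between steps survive compression, so it would complement rather than replace the error analysis the referee asks for.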
Circularity Check
No significant circularity in derivation chain
The paper's central claims rest on empirical results from training a compressor on annotated CoT data, followed by mixed-ratio SFT and CHRPO-based RL, then measuring accuracy on held-out benchmarks such as MATH-500. These accuracy numbers are obtained after training and are not equivalent to the training inputs by construction. The hierarchical reward in CHRPO is a training objective tied to question-solving but does not reduce the reported benchmark gains to a definitional tautology or fitted input renamed as prediction. No equations, self-citations, or uniqueness theorems are invoked in a load-bearing way that collapses the result to prior author work or ansatz. The method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- mixed compression ratios
- CHRPO reward coefficients
axioms (1)
- domain assumption: A compressor trained on annotated mathematical CoT can generate high-fidelity compressed traces at extreme ratios.
invented entities (1)
- CHRPO (Constrained and Hierarchical Ratio Policy Optimization): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy. ... train a dedicated semantically-preserved compressor on mathematical CoT data with fine-grained annotations.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We further propose Constrained and Hierarchical Ratio Policy Optimization (CHRPO) to explicitly incentivize question-solving ability under lower budgets by a hierarchical reward.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Shorthand for Thought: Compressing LLM Reasoning via Entropy-Guided Supertokens
  Entropy-guided supertokens from BPE on reasoning traces compress LLM outputs by 8.1% on average across models and math benchmarks with no accuracy loss while exposing strategy differences between correct and incorrect traces.