ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

Haiwei Wang; Jinchang Luo; Jing Jin; Miaohui Wang; MingQuan Cheng; Tingcheng Bian; Wenyuan Jiang; Yuzhe Zhang

arxiv: 2605.07501 · v2 · pith:W2QNV2A7new · submitted 2026-05-08 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-20 23:06 UTC · model grok-4.3

keywords reinforcement learning · chain-of-thought compression · reward shaping · adaptive advantage · mathematical reasoning · token efficiency · large reasoning models

The pith

Experience-guided RL compresses chain-of-thought by up to 77% while improving accuracy

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ExpThink to address excessive token use in large reasoning models during chain-of-thought reasoning. It introduces experience-guided reward shaping that tracks the shortest correct answer for each problem and applies rewards that tighten as the model improves. A difficulty-adaptive advantage mechanism normalizes gradients based on correct solution counts to focus learning on harder problems. This results in shorter responses that maintain or increase accuracy on mathematical reasoning tasks.

Core claim

ExpThink claims that two changes let reinforcement learning enforce concise yet accurate reasoning: (1) tracking the shortest correct solution per problem and using it to shape a three-tier reward, and (2) replacing standard-deviation normalization of advantages with correct-count normalization. The reported result is up to 77% shorter responses and an accuracy-efficiency ratio up to 3 times higher than the vanilla baseline.

What carries the argument

Experience-guided reward shaping, which keeps a per-problem record of the shortest correct solution and uses it to set the thresholds for full, discounted, or zero credit, together with difficulty-adaptive advantage, which uses correct-count normalization to produce difficulty-scaled learning signals.
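The reward mechanism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the source only says that the shortest correct solution per problem sets the threshold for full, discounted, or zero credit. The names `ExperienceBuffer` and `three_tier_reward`, and the parameters `slack` and `discount` with their default values, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExperienceBuffer:
    """Per-problem record of the shortest correct response length seen so far."""
    best_len: Dict[str, int] = field(default_factory=dict)

    def threshold(self, problem_id: str) -> Optional[int]:
        """Current length target for a problem, or None if it was never solved."""
        return self.best_len.get(problem_id)

    def update(self, problem_id: str, length: int, correct: bool) -> None:
        """Record a rollout; only a shorter correct response tightens the target."""
        if not correct:
            return
        prev = self.best_len.get(problem_id)
        if prev is None or length < prev:
            self.best_len[problem_id] = length


def three_tier_reward(length: int, correct: bool, threshold: Optional[int],
                      slack: float = 1.0, discount: float = 0.5) -> float:
    """Full, discounted, or zero credit.

    `slack` and `discount` are hypothetical knobs, not values from the paper.
    """
    if not correct:
        return 0.0                      # incorrect: zero credit
    if threshold is None or length <= slack * threshold:
        return 1.0                      # concise and correct: full credit
    return discount                     # verbose but correct: discounted credit


buf = ExperienceBuffer()
for length, correct in [(900, True), (400, True), (300, False)]:
    buf.update("q1", length, correct)

print(buf.threshold("q1"))                                   # 400
print(three_tier_reward(1200, True, buf.threshold("q1")))    # 0.5
```

Because the threshold only moves when a shorter correct response appears, the length target tightens as the policy improves, which is the sense in which the source calls the curriculum self-evolving. Whether rewards are scored against the record before or after the current batch is not stated in the source.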

If this is right

Reduces average response length by up to 77% on multiple mathematical reasoning benchmarks.
Improves accuracy simultaneously with the length reduction.
Achieves up to 3 times higher accuracy-efficiency ratio than the vanilla baseline.
Outperforms existing RL-based compression methods on both length and accuracy metrics.
Requires no manual scheduling for reward thresholds due to the self-evolving curriculum.
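The accuracy-efficiency ratio in the list above is defined in the abstract as accuracy divided by average token count. A small sketch of that definition follows; how the paper aggregates it across problems and benchmarks, and whether it applies a scaling factor, is not stated, so the single-pool aggregation here is an assumption. The function name and the toy inputs are illustrative only and are not results from the paper.

```python
from typing import Sequence


def accuracy_efficiency_ratio(correct: Sequence[bool], lengths: Sequence[int]) -> float:
    """Accuracy divided by average token count over one pool of responses."""
    if len(correct) != len(lengths) or len(correct) == 0:
        raise ValueError("need one length per response and at least one response")
    accuracy = sum(correct) / len(correct)
    avg_tokens = sum(lengths) / len(lengths)
    return accuracy / avg_tokens


# Toy inputs (made up for illustration).
baseline = accuracy_efficiency_ratio([True, False, True, False], [8000, 8000, 8000, 8000])
compressed = accuracy_efficiency_ratio([True, True, True, False], [2000, 2000, 2000, 2000])
print(compressed / baseline)  # 6.0 for these toy inputs
```

The ratio rises either when accuracy goes up or when average length goes down, so a gain in the ratio alone does not show which of the two moved.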

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This mechanism could apply to other sequential decision tasks where balancing correctness and brevity matters.
Deployment of such models in resource-constrained environments would see reduced latency and cost.
Future work might explore combining this with prompt engineering or other efficiency techniques for compounded benefits.
Similar per-instance tracking could improve stability in other RL applications with variable difficulty.

Load-bearing premise

Tracking the shortest correct solution found so far for each problem and tightening rewards based on it will produce stable unbiased gradients without manual tuning or selection biases favoring certain problem types.

What would settle it

Experiments on additional benchmarks showing that accuracy drops below the baseline when length is reduced, or that the accuracy-efficiency ratio does not exceed that of standard methods, would undercut the claim.

Figures

Figures reproduced from arXiv: 2605.07501 by Haiwei Wang, Jinchang Luo, Jing Jin, Miaohui Wang, MingQuan Cheng, Tingcheng Bian, Wenyuan Jiang, Yuzhe Zhang.

**Figure 1.** Top: Standard RL treats each epoch independently, discarding all trajectory information after each update. Bottom: ExpThink accumulates successful trajectories into an experience buffer, enabling a self-evolving compression curriculum that tightens automatically as the policy improves. […]ciency via intelligence per token (IPT), defined as the ratio of correctness to generation length, and find that current L…

**Figure 2.** Response length dynamics during RL train…

**Figure 3.** Overview of the ExpThink framework. For each query, the policy samples a group of…

**Figure 4.** Experience buffer dynamics during ExpThink training. (a) Step-level response length trajectories; the running best-correct curve marks the tightening length target maintained by the experience buffer. (b) Average batch length and accuracy across training. […] (2) Larger models benefit more. IPT increases steadily with model size: from 7.23 to 23.29 on the 1.5B model, from 11.15 to 32.84 on 7B, and from 8.64 to…

**Figure 5.** Training dynamics under different r_pen settings and advantage functions. (a) AMC23 Pass@1 over training steps. (b) Average response length (tokens) over training steps. (c) Wall-clock time per training step. […] wrong answers: AIME24 accuracy collapses to 7.92% and MATH-500 to 43.6%, far below the unmodified baseline. Relaxing the penalty to 0.3 partially restores accuracy but is still too aggressive to mainta…

**Figure 6.** Analysis of ExpThink’s behaviour. (a) Overthinking suppression across keywords and datasets; (b) Difficulty-adaptive behaviour on MATH-500. Level-1 responses shrink by 79.8% while Level-5 shrinks by 65.4%. This happens because easy problems are solved correctly by more rollouts, giving them a larger |C_q| that weakens the advantage signal and pushes the model toward brevity. For harder problems where fewer …

**Figure 7.** Inference-time analysis on AIME24. (a) Average response length vs. per-problem coeffi…

**Figure 8.** Training curves for ablation experiments on DeepSeek-R1-Distill-Qwen-1.5B.

**Figure 9.** Response length dynamics of ExpThink on DeepSeek-R1-Distill-Qwen-7B. We analyze the evolution of the training dynamics over 300 update steps using the DeepSeek-R1-Distill-Qwen-7B backbone, tracking the mean response length at each step

**Figure 10.** Token usage comparison between Vanilla and…

**Figure 11.** Case Study 1 (AIME24, Both Correct): ExpThink solves the problem in 1,416 tokens vs. Vanilla’s 6,421 tokens (−77.9%).

**Figure 12.** Case Study 2 (AMC23, Both Correct): ExpThink solves the problem in 1,309 tokens vs. Vanilla’s 15,606 tokens (−91.6%).

**Figure 13.** Case Study 3 (MATH-500, ExpThink Correct / Vanilla Incorrect): ExpThink applies twin-prime reasoning in 829 tokens; Vanilla uses 12,523 tokens and returns the wrong answer.

**Figure 14.** Case Study 4 (Minerva Math, ExpThink Correct / Vanilla Incorrect): ExpThink applies the magnitude formula in 892 tokens; Vanilla uses 13,942 tokens and returns an answer wrong by 103.

**Figure 15.** Case Study 5 (OlympiadBench, ExpThink Correct / Vanilla Incorrect): ExpThink correctly solves the hexagon problem in 1,525 tokens; Vanilla uses 15,361 tokens and returns the wrong answer.

**Figure 16.** Case Study 6 (MMLU, Both Correct): ExpThink answers in 768 tokens vs. Vanilla’s 15,493 tokens (−95.0%).

**Figure 17.** Case Study 7 (GPQA-Diamond, ExpThink Correct / Vanilla Incorrect): ExpThink finds the correct answer in 1,143 tokens; Vanilla uses 13,024 tokens and returns the wrong answer.

**Figure 18.** Case Study 8 (LiveCodeBench, Both Correct): …
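The percentage reductions quoted in the case-study captions can be checked with a few lines of arithmetic. The sketch below uses only the token counts given in the captions for Figures 11, 12, and 16; the helper name `length_reduction_pct` is illustrative.

```python
def length_reduction_pct(compressed_tokens: int, baseline_tokens: int) -> float:
    """Relative length reduction of the compressed response, in percent."""
    return 100.0 * (baseline_tokens - compressed_tokens) / baseline_tokens


# (ExpThink tokens, Vanilla tokens) as stated in the figure captions.
cases = {
    "Case Study 1 (AIME24)": (1416, 6421),
    "Case Study 2 (AMC23)": (1309, 15606),
    "Case Study 6 (MMLU)": (768, 15493),
}

for name, (compressed, baseline) in cases.items():
    print(f"{name}: -{length_reduction_pct(compressed, baseline):.1f}%")
```

The computed values round to 77.9%, 91.6%, and 95.0%, matching the captions. These are single hand-picked examples, so they say nothing about the average reduction on a benchmark.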

Original abstract

Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to 3× higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
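To make the second mechanism concrete, the sketch below contrasts standard-deviation normalization of group advantages with one possible reading of correct-count normalization. The exact formula is not given in the abstract, so dividing the mean-centred reward by the number of correct rollouts in the group is an assumption, chosen to match the qualitative description (more correct rollouts weaken the signal, fewer strengthen it). Function names and the toy reward groups are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def std_normalized_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Common group-relative baseline: centre by the mean, divide by the std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def correct_count_advantages(rewards: Sequence[float], correct: Sequence[bool]) -> List[float]:
    """Assumed variant: centre by the mean, divide by the number of correct rollouts.

    Fewer correct rollouts (a harder problem) gives a larger gradient scale.
    """
    mu = mean(rewards)
    n_correct = max(sum(correct), 1)  # guard against an all-wrong group
    return [(r - mu) / n_correct for r in rewards]


# Toy groups of four rollouts each (illustrative values, not from the paper).
easy = correct_count_advantages([1.0, 1.0, 0.5, 1.0], [True, True, True, True])
hard = correct_count_advantages([1.0, 0.0, 0.0, 0.0], [True, False, False, False])

print(max(abs(a) for a in easy))  # 0.09375
print(max(abs(a) for a in hard))  # 0.75
```

Under this reading the hard group receives a much larger learning signal than the easy one, while standard-deviation normalization would rescale both groups to a comparable magnitude.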

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ExpThink pairs per-problem shortest-correct tracking with correct-count advantage normalization to drive adaptive CoT compression, but the abstract leaves the bias-mitigation details and full experimental controls unshown.

The paper's main move is to replace static length penalties with an experience-guided three-tier reward that tightens automatically as the model finds shorter correct traces per problem, paired with correct-count normalization for advantages instead of standard deviation. This is presented as a way to create a self-evolving curriculum that keeps accuracy first while pushing brevity, and the headline numbers are up to 77% shorter responses with accuracy gains and a 3x better accuracy-per-token ratio on math benchmarks, beating prior RL compression baselines. If the full runs hold, the mechanisms look like a practical step toward cheaper inference for reasoning models without manual reward schedules.

What stands out is the attempt to make the reward thresholds problem-specific and dynamic rather than uniform, and the normalization choice that tries to amplify signal on hard items. That combination is not just another length penalty. The approach earns credit for targeting both model improvement over time and per-problem difficulty variation in one framework.

On the soft side, the abstract supplies no ablation tables, no statistical tests, and no per-problem breakdown showing whether early short solutions on easy items create the selection effect the stress-test flags. The interaction between the tightening schedule and the correct-count scaling is claimed to balance things, but without those controls visible it is hard to judge if gradients stay uniformly difficulty-scaled or concentrate on compressible subsets. Reproducibility also looks thin until code or exact threshold schedules appear.

This is aimed at groups doing RL post-training for large reasoning models who care about token budgets at deployment. A reader already working on CoT efficiency would pick up the two mechanisms and test them directly.

The work deserves a serious referee because the problem is real and the proposed levers are concrete, even if the current write-up needs more evidence on stability and bias before the claims can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper proposes ExpThink, an RL framework for chain-of-thought compression in large reasoning models. It introduces experience-guided reward shaping that tracks the shortest correct solution found so far per problem to automatically tighten a three-tier reward (full credit for concise correct, discounted for verbose correct, zero for incorrect), creating a self-evolving curriculum. It also uses difficulty-adaptive advantage normalization based on correct-count rather than standard deviation to scale gradients monotonically with difficulty. Experiments on mathematical reasoning benchmarks claim up to 77% reduction in average response length with simultaneous accuracy gains, up to 3× higher accuracy-efficiency ratio than the vanilla baseline, and outperformance over existing RL-based compression methods.

Significance. If the results hold after addressing the noted concerns, the work would be significant for practical deployment of reasoning models, as it offers a parameter-light way to dynamically trade off accuracy and token efficiency without static penalties or manual schedules. The self-evolving per-problem threshold and correct-count normalization are conceptually appealing for handling capability dynamics and difficulty variation.

major comments (2)

Abstract and §4 (Experiments): The headline claims of 77% length reduction, accuracy improvement, and 3× accuracy-efficiency gains are presented without any reported baselines, statistical tests, ablation results, or implementation details (e.g., RL algorithm, hyperparameters, or number of runs). This prevents verification of whether the gains are attributable to the proposed mechanisms or to other factors.
§3.1 (experience-guided reward shaping): The per-problem tracking of the shortest correct solution to tighten thresholds creates a potential selection effect, as problems that yield short traces early receive progressively stricter length penalties while harder problems lag. The interaction with difficulty-adaptive advantage normalization (claimed to yield monotonically difficulty-scaled gradients) is not shown via analysis or ablation to eliminate bias in the learning signal; this is load-bearing for the robustness of the 77% compression + accuracy claim.

minor comments (2)

[Abstract] Define the accuracy-efficiency ratio explicitly (accuracy divided by average token count) and specify how it is aggregated across problems and benchmarks.
[§3.1] Clarify the exact form of the three-tier reward function and the schedule for automatic threshold tightening (e.g., how the shortest-solution length is updated and applied).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major concerns point by point below, and have revised the manuscript to incorporate additional details, analyses, and ablations as suggested.

Referee: Abstract and §4 (Experiments): The headline claims of 77% length reduction, accuracy improvement, and 3× accuracy-efficiency gains are presented without any reported baselines, statistical tests, ablation results, or implementation details (e.g., RL algorithm, hyperparameters, or number of runs). This prevents verification of whether the gains are attributable to the proposed mechanisms or to other factors.

Authors: We agree that more comprehensive reporting is necessary for reproducibility and verification. In the revised version, we have expanded §4 to include comparisons against additional baselines such as standard PPO without our mechanisms, as well as prior RL compression methods. We report results averaged over 5 independent runs with standard deviations, and include statistical significance tests (paired t-tests) where appropriate. Implementation details, including the specific RL algorithm (PPO), all hyperparameters, and training setup, are now provided in Appendix A. Ablation studies isolating the contribution of experience-guided reward shaping and difficulty-adaptive advantage are added in §4.3, confirming that both components are necessary for the observed gains in accuracy-efficiency ratio.

Revision: yes

Referee: §3.1 (experience-guided reward shaping): The per-problem tracking of the shortest correct solution to tighten thresholds creates a potential selection effect, as problems that yield short traces early receive progressively stricter length penalties while harder problems lag. The interaction with difficulty-adaptive advantage normalization (claimed to yield monotonically difficulty-scaled gradients) is not shown via analysis or ablation to eliminate bias in the learning signal; this is load-bearing for the robustness of the 77% compression + accuracy claim.

Authors: This is a valid concern regarding potential bias in the learning dynamics. To clarify, the difficulty-adaptive advantage uses the number of correct solutions found so far (across all attempts) to normalize, which increases the gradient scale for problems with fewer successes, thereby prioritizing accuracy on harder problems even as the length threshold tightens for easier ones. We have added a theoretical analysis in the revised §3.1 demonstrating that this normalization ensures monotonic scaling with difficulty, independent of the per-problem reward threshold. Furthermore, we include an ablation in the experiments where we disable the per-problem tracking and use a fixed global threshold; this results in lower accuracy on hard problems, supporting that the combination mitigates selection bias. These additions strengthen the robustness claim.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: mechanisms defined from external observations and explicit design choices

The paper's core mechanisms—experience-guided reward shaping that tracks the shortest correct solution found so far per problem to set three-tier thresholds, and difficulty-adaptive advantage using correct-count normalization—are presented as explicit algorithmic choices rather than derived results. These draw directly from training-time observations (external per-problem data) and a deliberate replacement of standard deviation normalization, without reducing any claimed performance gains to fitted parameters or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation. The empirical claims rest on benchmark experiments, making the chain self-contained with independent content.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the framework implicitly relies on standard RL assumptions such as reward shaping being sufficient to guide compression without accuracy loss.

free parameters (1)

reward threshold tightening schedule
Thresholds tighten automatically with model improvement but exact update rule and initial values are unspecified.

pith-pipeline@v0.9.0 · 5805 in / 1106 out tokens · 31731 ms · 2026-05-20T23:06:17.129120+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
"experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward... difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
"reduces average response length by up to 77% while simultaneously improving accuracy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [4]

STEP: success-rate-aware trajectory-efficient policy optimization

Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. STEP: success-rate-aware trajectory-efficient policy optimization. CoRR, abs/2511.13091, 2025. doi: 10.48550/ARXIV.2511.13091. URL https://doi.org/10.48550/arXiv.2511.13091

work page doi:10.48550/arxiv.2511.13091 2025

[2] [6]

American invitational mathematics examination-aime 2024

MAA Codeforces. American invitational mathematics examination-aime 2024, 2024

work page 2024

[3] [8]

Conciserl: Conciseness-guided reinforcement learning for efficient reasoning models

Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, and Liangming Pan. Conciserl: Conciseness-guided reinforcement learning for efficient reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025,...

work page 2025

[4] [9]

Complexity-based prompting for multi-step reasoning

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=yf1icZHC-l9

work page 2023

[5] [11]

Reasoning without self-doubt: More efficient chain-of-thought through certainty probing

Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025

work page 2025

[6] [14]

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proc...

work page doi:10.18653/v1/2024.acl-long.211 2024

[7] [15]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ

work page 2021

[8] [16]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks ...

work page 2021

[9] [17]

Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. Trans. Mach. Learn. Res., 2026, 2026. URL https://openreview.net/forum?id=V51gPu1uQD

work page 2026

[10] [18]

Efficient reasoning for large reasoning language models via certainty-guided reflection suppression

Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Int...

work page doi:10.1609/aaai.v40i37.40379 2026

[11] [19]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2...

work page 2025

[12] [20]

Solving quantitative reasoning problems with language models

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. O...

work page 2022

[13] [21]

LANPO: bootstrapping language and numerical feedback for reinforcement learning in llms

Ang Li, Yifei Wang, Zhihang Yuan, Stefanie Jegelka, and Yisen Wang. LANPO: bootstrapping language and numerical feedback for reinforcement learning in llms. CoRR, abs/2510.16552,

work page arXiv

[14] [27]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

work page 2024

[15] [30]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. CoRR, abs/2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv

[16] [33]

Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, et al. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. Notion Blog, 3(5), 2025

work page 2025

[17] [35]

Amc23 dataset

math-ai. Amc23 dataset. https://huggingface.co/datasets/math-ai/amc23, 2023. Accessed: 2025-01-26

work page 2023

[18] [37]

ProRL v2: Scaling LLM reinforcement learning with prolonged training

NVIDIA Research. ProRL v2: Scaling LLM reinforcement learning with prolonged training. NVIDIA Technical Blog, 2025. URL https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/

work page 2025

[19] [41]

DAST: difficulty-adaptive slow-thinking for large reasoning models

Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. DAST: difficulty-adaptive slow-thinking for large reasoning models. In Saloni Potdar, Lina Maria Rojas-Barahona, and Sébastien Montella, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...

work page doi:10.18653/v1/2025.emnlp-industry 2025

[20] [45]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neura...

work page 2022

[21] [47]

Learning to hint for reinforcement learning

Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, and Yuxiong He. Learning to hint for reinforcement learning. arXiv preprint arXiv:2604.00698, 2026

work page arXiv 2026

[22] [51]

Large reasoning models know how to think efficiently

Zeyu Xing, Xing Li, Huiling Zhen, Xianzhi Yu, Mingxuan Yuan, and Sinno Jialin Pan. Large reasoning models know how to think efficiently. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025

work page 2025

