ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression
Pith reviewed 2026-05-11 01:56 UTC · model grok-4.3
The pith
ExpThink applies experience-guided rewards and adaptive normalization in reinforcement learning to shorten chain-of-thought reasoning by up to 77% while increasing accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ExpThink introduces experience-guided reward shaping that awards full credit only to concise correct responses, partial credit to verbose correct ones, and none to incorrect ones, with the conciseness threshold tightening automatically based on the model's best solutions so far. It combines this with difficulty-adaptive advantage estimation that normalizes by correct count instead of standard deviation, producing stronger gradients on hard problems to maintain accuracy and weaker ones on easy problems to promote shorter answers. On mathematical reasoning benchmarks this yields up to 77% shorter average responses, higher accuracy than the baseline, and up to three times the accuracy per token.
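To make the reward structure concrete, here is a minimal Python sketch of the three-tier scheme, assuming a per-problem record of the shortest correct solution; the tier values, the slack factor, and all names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of ExpThink-style three-tier reward shaping.
# The tier values (1.0 / 0.5 / 0.0), the slack factor, and all names
# are assumptions for illustration, not the paper's implementation.

best_len = {}  # problem_id -> shortest correct CoT length seen so far

def three_tier_reward(problem_id, is_correct, cot_len, slack=1.2):
    """Full credit for concise correct responses, discounted credit
    for verbose correct ones, zero for incorrect ones."""
    if not is_correct:
        return 0.0                      # tier 3: incorrect, no credit
    prev_best = best_len.get(problem_id, float("inf"))
    if cot_len < prev_best:             # a new shortest correct solution
        best_len[problem_id] = cot_len  # threshold tightens automatically
    if cot_len <= slack * prev_best:    # within slack of the prior best
        return 1.0                      # tier 1: concise and correct
    return 0.5                          # tier 2: verbose but correct
```

Because the threshold only ever tightens when the policy itself finds a shorter correct solution, the reward acts as the self-evolving curriculum the paper describes.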
What carries the argument
Experience-guided reward shaping, which tracks the shortest correct solution per problem to set the three-tier reward, and difficulty-adaptive advantage estimation, which uses correct-count normalization to scale learning signals with problem difficulty.
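A minimal sketch of the second mechanism, assuming GRPO-style grouped rollouts per problem; the exact normalizer and function names are assumptions rather than the paper's formula.

```python
import numpy as np

def correct_count_advantage(rewards, correct_mask):
    """Group-relative advantage normalized by the number of correct
    rollouts instead of their standard deviation. Few correct rollouts
    (a hard problem) give a small divisor and amplified gradients; many
    (an easy problem) give a large divisor and suppressed gradients.
    The exact functional form here is assumed for illustration."""
    rewards = np.asarray(rewards, dtype=float)
    n_correct = max(int(np.sum(correct_mask)), 1)  # guard divide-by-zero
    return (rewards - rewards.mean()) / n_correct

# Example: 8 rollouts on a hard problem, only 2 of them correct.
adv = correct_count_advantage(
    rewards=[1.0, 0.5, 0, 0, 0, 0, 0, 0],
    correct_mask=[True, True, False, False, False, False, False, False],
)
```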
If this is right
- Large reasoning models can produce shorter, more efficient responses on math problems while maintaining or improving accuracy.
- The self-evolving threshold creates a curriculum that requires no manual adjustment as the model gets better.
- Correct-count normalization amplifies learning on difficult problems to avoid accuracy loss during compression.
- The method outperforms other reinforcement learning approaches to compression on both length reduction and accuracy metrics.
Where Pith is reading between the lines
- Similar experience-tracking mechanisms could help compress reasoning in other domains such as code generation or scientific hypothesis testing.
- Models trained this way might exhibit less overthinking on simple queries in deployed systems.
- Resource-constrained environments could benefit from the reduced token counts for real-time applications.
- The approach implies that a model's own performance history can serve as a better signal than fixed penalties for balancing efficiency and correctness.
Load-bearing premise
That the three-tier rewards based on shortest correct solutions and the correct-count normalization generalize stably to new problems and models, without accuracy losses that the tested benchmarks fail to reveal.
What would settle it
Evaluating the trained model on a held-out set of harder or differently distributed math problems and finding that accuracy decreases as response lengths are forced shorter.
Original abstract
Large reasoning models (LRMs) achieve strong performance via extended chain-of-thought (CoT) reasoning, yet suffer from excessive token consumption and high inference latency. Existing reinforcement learning (RL) approaches for CoT compression rely on uniform, static length penalties that neglect model capability dynamics and problem-level difficulty variation. We propose ExpThink, an RL framework that addresses both dimensions through two complementary mechanisms. First, experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward: full credit for concise correct responses, discounted credit for verbose correct ones, and zero for incorrect ones. The threshold tightens automatically with model improvement, forming a self-evolving curriculum that requires no manual scheduling. Second, difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization, yielding monotonically difficulty-scaled gradients that amplify learning on hard problems to preserve accuracy while suppressing gradients on easy ones to encourage brevity. Together, these mechanisms enforce an accuracy-first, compression-second training objective. Experiments on multiple mathematical reasoning benchmarks demonstrate that ExpThink reduces average response length by up to 77% while simultaneously improving accuracy, achieving up to 3× higher accuracy-efficiency ratio (accuracy divided by average token count) than the vanilla baseline and outperforming existing RL-based compression methods on both metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ExpThink, an RL framework for compressing chain-of-thought (CoT) reasoning in large reasoning models. It introduces two mechanisms: experience-guided reward shaping, which tracks the shortest correct solution found so far per problem and applies a three-tier reward (full credit for concise correct responses, discounted for verbose correct ones, zero for incorrect), creating a self-evolving curriculum; and difficulty-adaptive advantage, which replaces standard deviation normalization with correct-count normalization to scale gradients monotonically by problem difficulty. Experiments on mathematical reasoning benchmarks claim up to 77% reduction in average response length while improving accuracy, yielding up to 3× higher accuracy-efficiency ratio (accuracy / average token count) versus the vanilla baseline and outperforming prior RL compression methods.
Significance. If the empirical gains prove robust, the work offers a practical advance for efficient inference in reasoning models by replacing static length penalties with adaptive, accuracy-first mechanisms that require no manual curriculum scheduling. The per-problem experience tracking and correct-count normalization are conceptually appealing for handling capability dynamics and difficulty variation. Credit is due for the falsifiable prediction of simultaneous length reduction and accuracy improvement on standard benchmarks, though significance is tempered by the empirical focus and need for stronger controls.
major comments (2)
- §3.1 (Experience-Guided Reward Shaping): The three-tier reward is defined using a per-problem record of the shortest correct CoT found so far, with full credit only for matching or beating that length. This couples the reward directly to individual training instances. When standard benchmarks (GSM8K, MATH) are used for both training and evaluation, the design risks instance-specific memorization of concise paths rather than learning generalizable compression, directly undermining the central claim of up to 77% length reduction with simultaneous accuracy gains.
- §4 (Experiments): The reported accuracy-efficiency ratio improvements and length reductions lack ablations that isolate the two proposed mechanisms, multiple random seeds with error bars, or statistical significance tests. Without these, it is unclear whether the gains are load-bearing results of the experience-guided and difficulty-adaptive components or artifacts of hyperparameter choices and benchmark overlap.
minor comments (2)
- The abstract and method description refer to 'multiple mathematical reasoning benchmarks' without a summary table listing per-benchmark length, accuracy, and ratio values; adding such a table would improve clarity and allow direct comparison to baselines.
- Notation for the accuracy-efficiency ratio is introduced in the abstract but should be formalized with an equation in §3 to ensure consistent use across the paper; one possible form is sketched below.
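One plausible formalization, matching the parenthetical definition in the abstract (notation ours, not the paper's):

```latex
% Accuracy-efficiency ratio as defined parenthetically in the abstract:
% accuracy divided by average token count (notation assumed).
\mathrm{AER} \;=\; \frac{\text{Accuracy}}{\tfrac{1}{N}\sum_{i=1}^{N} |o_i|}
```

where |o_i| is the token count of response i over the N evaluation problems.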
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on our methodological choices and commit to specific revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: §3.1 (Experience-Guided Reward Shaping): The three-tier reward is defined using a per-problem record of the shortest correct CoT found so far, with full credit only for matching or beating that length. This couples the reward directly to individual training instances. When standard benchmarks (GSM8K, MATH) are used for both training and evaluation, the design risks instance-specific memorization of concise paths rather than learning generalizable compression, directly undermining the central claim of up to 77% length reduction with simultaneous accuracy gains.
Authors: We appreciate the referee's concern about potential memorization. The experience-guided reward is explicitly designed to avoid static per-instance targets by dynamically updating the length threshold only when a shorter correct solution is discovered during training; this creates a self-improving curriculum that rewards the policy for finding generalizable compression strategies rather than recalling fixed paths. Because each rollout generates a fresh CoT from the current policy (not retrieval), and the same problem is typically sampled multiple times with stochastic generation, the model must learn transferable reasoning patterns to consistently beat its own prior best. Training uses the standard train splits while evaluation is performed on the corresponding test splits, mitigating direct overlap. In the revision we will expand Section 3.1 with a paragraph on generalization, add qualitative examples of compressed reasoning on novel problem variants, and include a small out-of-distribution evaluation to further demonstrate that the compression policy transfers beyond the training instances.
Revision: partial
-
Referee: §4 (Experiments): The reported accuracy-efficiency ratio improvements and length reductions lack ablations that isolate the two proposed mechanisms, multiple random seeds with error bars, or statistical significance tests. Without these, it is unclear whether the gains are load-bearing results of the experience-guided and difficulty-adaptive components or artifacts of hyperparameter choices and benchmark overlap.
Authors: We agree that isolating the contributions of each component and providing statistical controls would strengthen the empirical section. The original experiments emphasized the joint effect because the two mechanisms are complementary (one shapes the reward landscape while the other scales the advantage), yet we recognize the value of separate ablations. In the revised manuscript we will add a dedicated ablation study in Section 4 that evaluates (i) experience-guided reward alone, (ii) difficulty-adaptive advantage alone, and (iii) both together, using the same hyperparameter settings. Regarding multiple seeds and statistical tests, the high computational cost of RL fine-tuning on large reasoning models limited us to single-run reporting in the initial submission; we will rerun the key experiments with at least three independent seeds, report mean and standard deviation, and include paired t-tests or Wilcoxon tests to establish statistical significance of the accuracy-efficiency gains. These additions will be included in the camera-ready version.
Revision: yes
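A minimal sketch of the committed statistical comparison, assuming per-seed accuracy-efficiency ratios are collected for each method; the values below are placeholders, not results from the paper.

```python
# Sketch of the paired comparison the authors commit to: per-seed
# accuracy-efficiency ratios for ExpThink vs. the vanilla baseline.
# All numbers are placeholders, not reported results.
from scipy import stats

expthink_aer = [0.91, 0.88, 0.93]  # placeholder: one value per seed
baseline_aer = [0.31, 0.29, 0.33]  # placeholder: one value per seed

t_stat, t_p = stats.ttest_rel(expthink_aer, baseline_aer)
w_stat, w_p = stats.wilcoxon(expthink_aer, baseline_aer)
print(f"paired t-test p={t_p:.4f}, Wilcoxon p={w_p:.4f}")
```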
Circularity Check
No significant circularity; empirical RL method validated on external benchmarks
Full rationale
The paper proposes ExpThink as an RL framework with experience-guided reward shaping (tracking per-problem shortest correct CoT) and difficulty-adaptive advantage (correct-count normalization). All central claims of length reduction and accuracy gains are presented as outcomes of experiments on mathematical reasoning benchmarks (e.g., GSM8K, MATH). No equations, predictions, or first-principles derivations are offered that reduce by construction to fitted parameters, self-citations, or ansatzes within the paper. The per-problem tracking is a deliberate design choice whose generalization is an empirical question, not a definitional loop. This matches the default case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard policy gradient assumptions hold for the shaped rewards and normalized advantages.
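Schematically, the assumed objective with the shaped reward R and a correct-count-normalized advantage (notation ours, not the paper's):

```latex
% Policy-gradient update with shaped reward R and a correct-count-
% normalized advantage; notation assumed for illustration.
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{q,\;\{o_i\} \sim \pi_\theta(\cdot \mid q)}
  \left[\, \hat{A}_i \,\nabla_\theta \log \pi_\theta(o_i \mid q) \,\right],
\qquad
\hat{A}_i \;=\; \frac{R(o_i) - \bar{R}(q)}{c(q)},
```

where c(q) counts the correct rollouts for problem q, so hard problems (small c) receive amplified gradients.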
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Passage: "experience-guided reward shaping tracks the shortest correct solution found so far for each problem and applies a three-tier reward"
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Passage: "difficulty-adaptive advantage replaces standard deviation normalization with correct-count normalization"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[4]
STEP: success-rate-aware trajectory-efficient policy optimization
Yuhan Chen, Yuxuan Liu, Long Zhang, Pengzhi Gao, Jian Luan, and Wei Liu. STEP: success-rate-aware trajectory-efficient policy optimization. CoRR, abs/2511.13091, 2025. doi:10.48550/arXiv.2511.13091. URL https://doi.org/10.48550/arXiv.2511.13091
-
[6]
American invitational mathematics examination - AIME 2024
MAA Codeforces. American invitational mathematics examination - AIME 2024, 2024
work page 2024
-
[8]
Conciserl: Conciseness-guided reinforcement learning for efficient reasoning models
Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, and Liangming Pan. Conciserl: Conciseness-guided reinforcement learning for efficient reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025, ...
work page 2025
-
[9]
Complexity-based prompting for multi-step reasoning
Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/forum?id=yf1icZHC-l9
work page 2023
-
[11]
Reasoning without self-doubt: More efficient chain-of-thought through certainty probing
Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild, 2025
work page 2025
-
[14]
Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiadbench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proc...
-
[15]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[16]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks ...
work page 2021
-
[17]
Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning
Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. Trans. Mach. Learn. Res., 2026. URL https://openreview.net/forum?id=V51gPu1uQD
work page 2026
-
[18]
Efficient reasoning for large reasoning language models via certainty-guided reflection suppression
Jiameng Huang, Baijiong Lin, Guhao Feng, Jierun Chen, Di He, and Lu Hou. Efficient reasoning for large reasoning language models via certainty-guided reflection suppression. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Int...
-
[19]
Livecodebench: Holistic and contamination free evaluation of large language models for code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2...
work page 2025
-
[20]
Solving quantitative reasoning problems with language models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. O...
work page 2022
-
[21]
LANPO: Bootstrapping language and numerical feedback for reinforcement learning in LLMs
Ang Li, Yifei Wang, Zhihang Yuan, Stefanie Jegelka, and Yisen Wang. LANPO: bootstrapping language and numerical feedback for reinforcement learning in llms. CoRR, abs/2510.16552, 2025
-
[27]
Let's verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi
work page 2024
-
[30]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. CoRR, abs/2503.20783, 2025
-
[33]
Deepscaler: Surpassing o1-preview with a 1.5B model by scaling RL
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Tianjun Zhang, Li Erran Li, et al. Deepscaler: Surpassing o1-preview with a 1.5B model by scaling RL. Notion Blog, 3(5), 2025
work page 2025
-
[35]
AMC23 dataset
math-ai. AMC23 dataset. https://huggingface.co/datasets/math-ai/amc23, 2023. Accessed: 2025-01-26
work page 2023
-
[37]
ProRL v2: Scaling LLM reinforcement learning with prolonged training
NVIDIA Research. ProRL v2: Scaling LLM reinforcement learning with prolonged training. NVIDIA Technical Blog, 2025. URL https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/
work page 2025
-
[41]
DAST: difficulty-adaptive slow-thinking for large reasoning models
Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. DAST: difficulty-adaptive slow-thinking for large reasoning models. In Saloni Potdar, Lina Maria Rojas-Barahona, and Sébastien Montella, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Proc...
-
[45]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neura...
work page 2022
-
[47]
Learning to hint for reinforcement learning
Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, and Yuxiong He. Learning to hint for reinforcement learning. arXiv preprint arXiv:2604.00698, 2026
-
[51]
Large reasoning models know how to think efficiently
XING Zeyu, Xing Li, Huiling Zhen, Xianzhi Yu, Mingxuan Yuan, and Sinno Jialin Pan. Large reasoning models know how to think efficiently. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025
work page 2025
-
[52]
Adaptthink: Reasoning models can learn when to think
Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. Adaptthink: Reasoning models can learn when to think. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pages 3716–3730. ...
-
[53]
Alphaone: Reasoning models thinking slow and fast at test time
Junyu Zhang, Runpei Dong, Han Wang, Xuying Ning, Haoran Geng, Peihao Li, Xialin He, Yutong Bai, Jitendra Malik, Saurabh Gupta, and Huan Zhang. Alphaone: Reasoning models thinking slow and fast at test time. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in ...
-
[54]
DART: difficulty-adaptive reasoning truncation for efficient large language models
Ruofan Zhang, Bin Xia, Zhen Cheng, Cairen Jian, Minglun Yang, Ngai Wong, and Yuan Cheng. DART: difficulty-adaptive reasoning truncation for efficient large language models. CoRR, abs/2511.01170, 2025. doi:10.48550/arXiv.2511.01170. URL https://doi.org/10.48550/arXiv.2511.01170
work page 2025