CLORE: Content-Level Optimization for Reasoning Efficiency

Guanxing Lu; Manling Li; Olexandr Isayev; Qiyao Xue; Weichen Liu; Yuyang Wu; Zihan Wang

arxiv: 2605.22211 · v1 · pith:IYHAY4R4new · submitted 2026-05-21 · 💻 cs.AI

Yuyang Wu, Qiyao Xue, Guanxing Lu, Weichen Liu, Zihan Wang, Manling Li, Olexandr Isayev

Pith reviewed 2026-05-22 05:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords reasoning efficiency · content optimization · reinforcement learning · large language models · mathematical reasoning · DPO · post-training · rollout editing

The pith

Editing correct reasoning traces to remove repetition improves the accuracy-efficiency trade-off in language model post-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that existing methods for efficient reasoning in large language models focus too much on controlling response length and leave the actual content of reasoning steps weakly supervised. CLORE addresses this by using an external model to edit only correct on-policy rollouts, deleting repetitive segments, illegible content, or reasoning that continues after the answer is found while keeping the final answer intact. The edited and original rollouts form preference pairs that are trained with an auxiliary reference-free DPO objective alongside normal policy-gradient updates. The approach keeps changes local so the edited examples stay close to the original policy distribution. Experiments on two 7B models across five math benchmarks show a better accuracy-efficiency trade-off and compatibility with other training techniques.

Core claim

CLORE improves reasoning efficiency by editing correct on-policy rollouts. An external augmentation model deletes repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch.
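
One way to read the combined objective described above is sketched below. This is a minimal illustration, not the paper's formulation: the page does not give the exact loss, so the log-sigmoid form, the use of summed sequence log-probabilities, and the names `beta` and `dpo_weight` are all assumptions. The augmented rollout is treated as the preferred sample and the original rollout as the dispreferred one, and no reference-model term appears.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)) for either sign of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reference_free_dpo_loss(logp_aug: float, logp_orig: float, beta: float = 0.1) -> float:
    """Preference loss with the augmented rollout as 'chosen'.

    logp_aug / logp_orig are policy log-probabilities of the two rollouts
    (assumed here to be summed over tokens). No reference model is used.
    """
    return -log_sigmoid(beta * (logp_aug - logp_orig))


def combined_loss(pg_loss: float, pairs: list[tuple[float, float]],
                  dpo_weight: float = 0.1, beta: float = 0.1) -> float:
    """Policy-gradient loss plus a weighted auxiliary preference term.

    `pairs` holds (logp_aug, logp_orig) for each correct rollout that was edited.
    `dpo_weight` is a hypothetical name for the auxiliary weight.
    """
    if not pairs:
        return pg_loss
    dpo = sum(reference_free_dpo_loss(a, o, beta) for a, o in pairs) / len(pairs)
    return pg_loss + dpo_weight * dpo
```

In a real training loop the log-probabilities would be differentiable tensors; plain floats are used here only to keep the arithmetic visible.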

What carries the argument

Content-level editing of correct on-policy rollouts by an external augmentation model, followed by auxiliary reference-free DPO optimization.

If this is right

The accuracy-efficiency trade-off improves on mathematical reasoning benchmarks for models such as DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B.
The method remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune.
Content analyses show reductions in repetitive reasoning, illegible content, and post-answer exploration.
Content-level supervision acts as a complementary direction to length-level control.
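
The post-answer exploration mentioned in the content analyses can be approximated with a simple token count. The sketch below is an assumption-laden illustration: it tokenizes on whitespace and treats the first token containing the gold answer string as the point where the solution is reached, which is a crude proxy and not necessarily how the paper detects it.

```python
def post_answer_fraction(tokens: list[str], answer: str) -> float:
    """Fraction of tokens that come after the first token containing `answer`."""
    for i, tok in enumerate(tokens):
        if answer in tok:
            return (len(tokens) - i - 1) / len(tokens)
    # Answer never appears (or the response is empty): nothing counts as post-answer.
    return 0.0


def mean_post_answer_percent(responses: list[str], answers: list[str]) -> float:
    """Average percentage of post-answer tokens over a set of responses."""
    if not responses:
        return 0.0
    fractions = [
        post_answer_fraction(resp.split(), ans)
        for resp, ans in zip(responses, answers)
    ]
    return 100.0 * sum(fractions) / len(fractions)
```

Substring matching will misfire on short answers (for example, "4" matches "42"), so a stricter answer detector would be needed for anything beyond a quick check.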

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The editing approach could be tested on non-mathematical tasks such as code generation or scientific explanation to check for broader applicability.
Jointly training the augmentation model with the main policy might reduce dependence on a separate external model.
Direct content supervision of this kind could be combined with length-based rewards to further refine reasoning traces.

Load-bearing premise

The external augmentation model can reliably delete only repetitive, illegible, or post-answer segments from correct trajectories without introducing factual errors or changing the final answer, and the resulting edited rollouts remain sufficiently close to the policy distribution to avoid harmful off-policy effects.
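
The premise can be partly checked mechanically. The sketch below assumes whitespace tokenization and a `\boxed{...}` final-answer format (common in math benchmarks, but not confirmed by this page); the function names are hypothetical. It tests that an edit only deletes tokens and leaves the final answer unchanged. It cannot detect a deletion that removes a logically necessary step.

```python
import re
from typing import Optional


def is_deletion_only(original: list[str], edited: list[str]) -> bool:
    """True if `edited` is a subsequence of `original` (no insertions, no reordering)."""
    remaining = iter(original)
    # `tok in remaining` advances the iterator, so order is enforced.
    return all(tok in remaining for tok in edited)


def final_answer(text: str) -> Optional[str]:
    """Content of the last boxed answer, or None (assumes no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def edit_is_safe(original: str, edited: str) -> bool:
    """Deletion-only edit that keeps the same, non-missing final answer."""
    ans = final_answer(original)
    if ans is None or ans != final_answer(edited):
        return False
    return is_deletion_only(original.split(), edited.split())
```

Rejecting any pair that fails `edit_is_safe` before it enters the preference data would be one conservative way to enforce the premise.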

What would settle it

If running CLORE on the five mathematical reasoning benchmarks produces lower accuracy or worse efficiency metrics than standard RL post-training without the editing step, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22211 by Guanxing Lu, Manling Li, Olexandr Isayev, Qiyao Xue, Weichen Liu, Yuyang Wu, Zihan Wang.

**Figure 1.** Overview of CLORE. Existing length-based efficient reasoning methods regulate response length but fail to remove low-quality reasoning content. CLORE improves efficient reasoning by augmenting correct trajectories to delete repetitive, illegible, and uninformative reasoning segments, while remaining compatible with existing efficient RL training methods. These patterns characterize a form of redundant unin…

**Figure 2.** Overview of the CLORE framework. The policy model samples multiple reasoning trajectories, the correct ones are augmented to remove low-quality content, and the resulting preference pairs are optimized via a reference-free DPO objective jointly with policy-gradient training. … of correct trajectories. We restrict augmentation to correct trajectories because they generally contain less low-quality reasoning …

**Figure 3.** DeepSeek-R1-Distill-Qwen-7B training dynamics, with average response length on DAPO-Math-17K and validation accuracy on OlympiadBench, MATH500, and AMC2023 (left to right). … Minerva. On Qwen2.5-Math-7B, the gains are more pronounced, with average AE improvements mostly exceeding 1.0 for GRPO, DAPO, and ThinkPrune. The largest improvement is observed when applying CLORE to GRPO on Minerva, where the AE score inc…

**Figure 4.** Qwen2.5-Math-7B training dynamics, with average response length on DAPO-Math-17K and validation accuracy on OlympiadBench, MATH500, and AMC2023 (left to right).

**Figure 5.** Ablation study on the DPO weight and augmentation model.

**Figure 6.** Augmentation analysis.

**Figure 7.** Repetitive reasoning analysis. … level supervision reduces repetitive reasoning patterns. We compare the base model, GRPO, and GRPO with CLORE using repetition metrics rep-4, rep-8, and the longest repeated span [47]. GRPO increases repetition relative to the base model, suggesting that rewards alone may reinforce redundant reasoning during RL training. In contrast, GRPO+CLORE reduces both local n-gram rep…

**Figure 8.** Post-answer reasoning analysis. … solution, measured as the fraction of generated reasoning that appears after the final answer has been reached, computed as the average percentage of post-answer tokens in responses. The base model frequently continues reasoning after reaching the solution, while RL training only partially reduces this post-answer content. Adding CLORE consistently suppresses this behavior…

**Figure 9.** Illegible reasoning analysis. … quality by measuring the presence of uninterpretable or task-irrelevant content through averaged LLM-as-a-judge result over various strong commercial models. The base model and standard RL-trained methods still produce illegible reasoning, showing that final-answer correctness alone does not sufficiently control intermediate reasoning quality. CLORE consistently reduces su…
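
The repetition metrics named in the Figure 7 text (rep-4, rep-8, longest repeated span) can be sketched as follows. The rep-n definition used here, one minus the ratio of unique to total n-grams, is the form commonly used in the text-generation literature; whether the paper uses exactly this form, and how it tokenizes, is an assumption.

```python
def rep_n(tokens: list[str], n: int) -> float:
    """Share of duplicated n-grams: 1 - unique / total. Returns 0.0 for short inputs."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(total)]
    return 1.0 - len(set(ngrams)) / total


def longest_repeated_span(tokens: list[str]) -> int:
    """Length of the longest token span occurring at least twice (overlap allowed).

    Dynamic programming over pairs of end positions i < j, O(len(tokens)^2) time.
    """
    n = len(tokens)
    best = 0
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if tokens[i - 1] == tokens[j - 1]:
                # Extend the common span that ended at positions i-1 and j-1.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Calling `rep_n(tokens, 4)` and `rep_n(tokens, 8)` on a tokenized response gives the two local measures; the quadratic span search is fine for single responses but would need a suffix-based structure for large corpora.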

Original abstract

Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CLORE adds content-level editing of correct rollouts plus reference-free DPO but leaves the augmentation model's edit quality unmeasured.

The main takeaway is that CLORE uses an external model to edit correct reasoning rollouts at the content level before applying reference-free DPO, aiming to reduce unnecessary parts without hurting accuracy. It focuses on deleting repetitive, illegible, or post-answer segments while keeping the final answer intact, then optimizes the augmented-original pairs alongside standard policy-gradient training. This is positioned as a complement to length-based methods like budgets or rewards.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes CLORE, a content-level optimization framework for LLM reasoning efficiency. It uses an external augmentation model to edit correct on-policy rollouts via local deletions of repetitive, illegible, or post-answer segments (while preserving the final answer), then optimizes the resulting augmented-original pairs with an auxiliary reference-free DPO objective alongside standard policy-gradient training. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks report improved accuracy-efficiency trade-offs and compatibility with GRPO, DAPO, Training Efficient, and ThinkPrune; content analyses show reductions in repetitive and post-answer content.

Significance. If the augmentation preserves reasoning validity, CLORE supplies a complementary content-level supervision signal to existing length-based or reward-shaping approaches for efficient reasoning. The reported compatibility with multiple RL methods and the supporting content-level analyses would strengthen the case for content supervision as a distinct direction.

major comments (2)

[Abstract and Methods] The central claim that CLORE improves the accuracy-efficiency trade-off via content-level supervision rests on the external augmentation model performing only safe, local deletions on correct trajectories without introducing factual errors, changing solution paths, or altering the final answer. No quantitative checks (edit-induced answer change rate, semantic similarity, or human-rated reasoning quality) are reported on the augmented data, leaving open the possibility that observed gains arise from implicit filtering or length reduction rather than genuine content supervision.
[Experiments] The abstract reports accuracy-efficiency gains on five benchmarks but provides no details on data splits, run-to-run variance, statistical significance testing, or full hyperparameter settings for the auxiliary DPO component, preventing verification of the robustness of the claimed improvements.

minor comments (2)

[Methods] Clarify in the methods section how the reference-free DPO loss is exactly formulated and weighted relative to the primary policy-gradient objective.
[Experiments] Ensure all benchmark results include both accuracy and efficiency metrics (e.g., token count or latency) in the same table for direct trade-off comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the presentation of our results and experimental details.

Referee: [Abstract and Methods] The central claim that CLORE improves the accuracy-efficiency trade-off via content-level supervision rests on the external augmentation model performing only safe, local deletions on correct trajectories without introducing factual errors, changing solution paths, or altering the final answer. No quantitative checks (edit-induced answer change rate, semantic similarity, or human-rated reasoning quality) are reported on the augmented data, leaving open the possibility that observed gains arise from implicit filtering or length reduction rather than genuine content supervision.

Authors: We agree that quantitative validation of the augmentation process would provide stronger support for the claim that gains stem from content-level supervision rather than incidental effects. By construction, augmentation is restricted to correct on-policy rollouts and limited to local deletions that preserve the final answer, which keeps edited trajectories close to the original distribution. However, we acknowledge the absence of explicit metrics in the current version. In the revised manuscript, we will add an analysis subsection reporting: (i) the edit-induced answer change rate (verified to be zero by construction but quantified across the dataset), (ii) semantic similarity (e.g., embedding cosine similarity) between original and augmented trajectories, and (iii) a small-scale human evaluation of reasoning quality on a random sample of 100 pairs. These additions will appear in the Methods and Experiments sections to address the concern directly.
revision: yes
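
Two of the promised checks, the edit-induced answer change rate and a cosine similarity between original and augmented trajectories, could look roughly like the sketch below. It is illustrative only: the rebuttal mentions embedding cosine similarity, while this sketch substitutes bag-of-words count vectors as a dependency-free stand-in, and all function names are hypothetical.

```python
import math
from collections import Counter


def answer_change_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (original_answer, edited_answer) pairs whose answers differ."""
    if not pairs:
        return 0.0
    return sum(orig != edited for orig, edited in pairs) / len(pairs)


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (stand-in for an embedding model)."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return cosine_similarity([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])
```

Swapping `text_similarity` for vectors from a sentence-embedding model would match the rebuttal's wording more closely; `cosine_similarity` itself stays the same.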
Referee: [Experiments] The abstract reports accuracy-efficiency gains on five benchmarks but provides no details on data splits, run-to-run variance, statistical significance testing, or full hyperparameter settings for the auxiliary DPO component, preventing verification of the robustness of the claimed improvements.

Authors: We concur that additional experimental details are necessary for reproducibility and to demonstrate robustness. In the revised manuscript, we will expand the Experiments section to include: (i) explicit description of training/validation/test splits for each benchmark, (ii) mean and standard deviation of accuracy and efficiency metrics across at least three independent runs with different random seeds, (iii) results of statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing CLORE against baselines, and (iv) complete hyperparameter settings for the auxiliary DPO objective, including the DPO beta coefficient, learning rate, batch size, and number of epochs. These details will also be summarized in a new table or appendix for clarity.
revision: yes
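
The per-seed reporting and paired test described in this response can be sketched with the standard library alone. The example is a generic illustration, not the authors' analysis code: it computes the mean and sample standard deviation across seeds and the paired t statistic with its degrees of freedom. Turning the statistic into a p-value needs a t-distribution table or a statistics package, which is left out here.

```python
import math
import statistics


def summarize(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across independent runs (needs >= 2 runs)."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method: list[float], baseline: list[float]) -> tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched runs or benchmarks."""
    if len(method) != len(baseline) or len(method) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [m - b for m, b in zip(method, baseline)]
    sd = statistics.stdev(diffs)
    if sd == 0.0:
        raise ValueError("differences have zero variance; t statistic is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only three seeds the test has two degrees of freedom, so even a large t statistic gives weak evidence; pairing over benchmarks as well as seeds would give more pairs.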

Circularity Check

0 steps flagged

No circularity: CLORE framework is empirically grounded without self-referential derivations

full rationale

The paper introduces CLORE as a practical framework that augments correct on-policy rollouts via an external model and applies auxiliary DPO alongside standard policy-gradient training. No mathematical derivations, equations, or first-principles predictions are presented that reduce the accuracy-efficiency improvements to fitted parameters defined by the result itself or to self-citation chains. The method explicitly builds on existing techniques (GRPO, DAPO, DPO) with stated assumptions about the augmentation model's behavior, and experimental results on multiple benchmarks provide independent empirical support. The central claims do not collapse into tautological redefinitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that an external model can perform safe, correctness-preserving edits and that the resulting pairs remain on-policy enough for stable training.

axioms (1)

domain assumption: External augmentation model deletes only superfluous content while preserving correctness and final answer
Invoked when restricting augmentation to correct trajectories and performing local deletion to keep edited rollouts close to policy distribution.

pith-pipeline@v0.9.0 · 5772 in / 1245 out tokens · 40117 ms · 2026-05-22T05:47:02.194382+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented–original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 15 internal anchors

[1]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning.arXiv preprint arXiv:2503.04697, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024

work page 2024
[3]

Large language models for mathematical reasoning: Progresses and challenges

Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges.arXiv preprint arXiv:2402.00157, 2024

work page arXiv 2024
[4]

Training language models to reason efficiently.arXiv preprint arXiv:2502.04463,2025

Daman Arora and Andrea Zanette. Training language models to reason efficiently.arXiv preprint arXiv:2502.04463, 2025

work page arXiv 2025
[5]

Trims: Real-time tracking of minimal sufficient length for efficient reasoning via rl.arXiv preprint arXiv:2603.17449, 2026

Tingcheng Bian, Jinchang Luo, Mingquan Cheng, Jinyu Zhang, Xiaoling Xia, Ni Li, Yan Tao, and Haiwei Wang. Trims: Real-time tracking of minimal sufficient length for efficient reasoning via rl.arXiv preprint arXiv:2603.17449, 2026

work page arXiv 2026
[6]

Do not think that much for 2+ 3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025

work page 2025
[7]

Large reasoning mod- els are not thinking straight: on the unreliability of thinking trajectories.arXiv preprint arXiv:2507.00711, 2025

Jhouben Cuesta-Ramirez, Samuel Beaussant, and Mehdi Mounsif. Large reasoning mod- els are not thinking straight: on the unreliability of thinking trajectories.arXiv preprint arXiv:2507.00711, 2025

work page arXiv 2025
[8]

Stable reinforcement learning for efficient reasoning

Muzhi Dai, Shixuan Liu, and Qingyi Si. Stable reinforcement learning for efficient reasoning. arXiv preprint arXiv:2505.18086, 2025

work page arXiv 2025
[9]

Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, et al. Competitive programming with large reasoning models.arXiv preprint arXiv:2502.06807, 2025

work page arXiv 2025
[10]

A pragmatic way to measure chain-of-thought monitorability.arXiv preprint arXiv:2510.23966, 2025

Scott Emmons, Roland S Zimmermann, David K Elson, and Rohin Shah. A pragmatic way to measure chain-of-thought monitorability.arXiv preprint arXiv:2510.23966, 2025

work page arXiv 2025
[11]

Serl: Self-play reinforcement learning for large language models with limited data.arXiv preprint arXiv:2505.20347, 2025

Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, and Dacheng Tao. Serl: Self-play reinforcement learning for large language models with limited data.arXiv preprint arXiv:2505.20347, 2025

work page arXiv 2025
[12]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Token-budget-aware llm reasoning

Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware llm reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 24842–24855, 2025

work page 2025
[14]

Don’t overthink it

Michael Hassid, Gabriel Synnaeve, Yossi Adi, and Roy Schwartz. Don’t overthink it. preferring shorter thinking chains for improved llm reasoning.arXiv preprint arXiv:2505.17813, 2025

work page arXiv 2025
[15]

Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Paper...

work page 2024
[16]

Thinkdial: An open recipe for controlling reasoning effort in large language models.arXiv preprint arXiv:2508.18773, 2025

Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, and Jiangjie Chen. Thinkdial: An open recipe for controlling reasoning effort in large language models.arXiv preprint arXiv:2508.18773, 2025. 10

work page arXiv 2025
[17]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: An efficient rlhf algorithm with robustness to both prompt and reward models.arXiv preprint arXiv:2501.03262, 1(3):5, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Adactrl: Towards adaptive and controllable reasoning via difficulty-aware budgeting

Shijue Huang, Hongru Wang, Wanjun Zhong, Zhaochen Su, Jiazhan Feng, Bowen Cao, and Yi R Fung. Adactrl: Towards adaptive and controllable reasoning via difficulty-aware budgeting. arXiv preprint arXiv:2505.18822, 2025

work page arXiv 2025
[21]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

What makes a good reasoning chain? uncovering structural patterns in long chain-of- thought reasoning

Gangwei Jiang, Yahui Liu, Zhaoyi Li, Wei Bi, Fuzheng Zhang, Linqi Song, Ying Wei, and Defu Lian. What makes a good reasoning chain? uncovering structural patterns in long chain-of- thought reasoning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6501–6525, 2025

work page 2025
[23]

Reasoning models sometimes output illegible chains of thought.arXiv preprint arXiv:2510.27338, 2025

Arun Jose. Reasoning models sometimes output illegible chains of thought.arXiv preprint arXiv:2510.27338, 2025

work page arXiv 2025
[24]

Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

Jan Hendrik Kirchner, Yining Chen, Harri Edwards, Jan Leike, Nat McAleese, and Yuri Burda. Prover-verifier games improve legibility of llm outputs.arXiv preprint arXiv:2407.13692, 2024

work page arXiv 2024
[25]

Overthink: Slowdown attacks on reasoning llms.arXiv preprint arXiv:2502.02542, 2025

Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. Overthink: Slowdown attacks on reasoning llms.arXiv preprint arXiv:2502.02542, 2025

work page arXiv 2025
[26]

Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quan- titative reasoning problems with language models.Advances in neural information processing systems, 35:3843–3857, 2022

work page 2022
[27]

Aalc: Large language model efficient reasoning via adaptive accuracy-length control.arXiv preprint arXiv:2506.20160, 2025

Ruosen Li, Ziming Luo, Quan Zhang, Ruochen Li, Ben Zhou, Ali Payani, and Xinya Du. Aalc: Large language model efficient reasoning via adaptive accuracy-length control.arXiv preprint arXiv:2506.20160, 2025

work page arXiv 2025
[28]

When thinking fails: The pitfalls of reasoning for instruction- following in llms.arXiv preprint arXiv:2505.11423, 2025

Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, and Anurag Beniwal. When thinking fails: The pitfalls of reasoning for instruction- following in llms.arXiv preprint arXiv:2505.11423, 2025

work page arXiv 2025
[29]

SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, and Zhifang Sui. Selfbudgeter: Adaptive token allocation for efficient llm reasoning.arXiv preprint arXiv:2505.11274, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

work page 2023
[31]

Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025

Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, and Junxian He. Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025

work page arXiv 2025
[32]

Craft: Calibrated reasoning with answer-faithful traces via reinforcement learning for multi-hop question answering.arXiv preprint arXiv:2602.01348, 2026

Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, et al. Craft: Calibrated reasoning with answer-faithful traces via reinforcement learning for multi-hop question answering.arXiv preprint arXiv:2602.01348, 2026. 11

work page arXiv 2026
[33]

O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.arXiv preprint arXiv:2501.12570, 2025

Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.arXiv preprint arXiv:2501.12570, 2025

work page arXiv 2025
[34]

Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

Sania Nayab, Giulio Rossolini, Marco Simoni, Andrea Saracino, Giorgio Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on llm reasoning and cost.arXiv preprint arXiv:2407.19825, 2024

work page arXiv 2024
[35]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

work page 2022
[36]

Revisiting overthinking in long chain-of-thought from the perspective of self-doubt.arXiv preprint arXiv:2505.23480, 2025

Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, and Dacheng Tao. Revisiting overthinking in long chain-of-thought from the perspective of self-doubt.arXiv preprint arXiv:2505.23480, 2025

work page arXiv 2025
[37]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

work page 2023
[38]

A principled approach to chain-of-thought monitorability in reasoning models

Avinash Reddy, Souradip Chakraborty, and Amrit Singh Bedi. A principled approach to chain-of-thought monitorability in reasoning models

work page
[39]

Dani Roytburg, Shreya Sridhar, and Daphne Ippolito. Measuring reasoning trace legibility: Can those who understand teach? arXiv preprint arXiv:2603.20508, 2026
[40]

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
[41]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[42]

Jinyan Su and Claire Cardie. Thinking fast and right: Balancing accuracy and reasoning length with adaptive rewards. arXiv preprint arXiv:2505.18298, 2025
[43]

Jinyan Su, Jennifer Healey, Preslav Nakov, and Claire Cardie. Between underthinking and overthinking: An empirical study of reasoning length and correctness in llms. arXiv preprint arXiv:2505.00127, 2025
[44]

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025
[45]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
[46]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022
[47]

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319, 2019
[48]

Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models? arXiv preprint arXiv:2404.03302, 2024
[49]

Taiqiang Wu, Zenan Zu, Bo Zhou, and Ngai Wong. The art of efficient reasoning: Data, reward, and optimization. arXiv preprint arXiv:2602.20945, 2026
[50]

Yuyang Wu, Yifei Wang, Ziyu Ye, Tianqi Du, Stefanie Jegelka, and Yisen Wang. When more is less: Understanding chain-of-thought length in llms. arXiv preprint arXiv:2502.07266, 2025
[51]

Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, and Nick Haber. Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning. arXiv preprint arXiv:2506.05256, 2025
[52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
[53]

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024
[54]

Minglai Yang, Ethan Huang, Liang Zhang, Mihai Surdeanu, William Yang Wang, and Liangming Pan. How is llm reasoning distracted by irrelevant context? an analysis using a controlled benchmark. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13340–13358, 2025
[55]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023
[56]

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms. arXiv preprint arXiv:2502.03373, 2025
[57]

Jingyang Yi, Jiazheng Wang, and Sida Li. Shorterbetter: Guiding reasoning models to find optimal inference length for efficient reasoning. arXiv preprint arXiv:2504.21370, 2025
[58]

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025
[59]

Danlong Yuan, Tian Xie, Shaohan Huang, Zhuocheng Gong, Huishuai Zhang, Chong Luo, Furu Wei, and Dongyan Zhao. Efficient rl training for reasoning models via length-aware optimization. arXiv preprint arXiv:2505.12284, 2025
[60]

Xinliang Frederick Zhang, Anhad Mohananey, Alexandra Chronopoulou, Pinelopi Papalampidi, Somit Gupta, Tsendsuren Munkhdalai, Lu Wang, and Shyam Upadhyay. Do llms really need 10+ thoughts for "find the time 1000 days later"? towards structural understanding of llm overthinking. arXiv preprint arXiv:2510.07880, 2025
[61]

Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? Advances in Neural Information Processing Systems, 37:123846–123910, 2024

Appendix: Table of Contents

A. Limitations ...
[62]

What is the most probable sum he can get? ROLLOUT (EXCERPT). ... The possible sums he can end up with are 125, 126, 127. ... The probability of ending with a sum of 127 is the probability of picking the card with value 127 first, which is 1/7. ... From the above analysis, the most probable sum is 126, as it has the highest probability of occurring. im...
