arxiv: 2603.08659 · v2 · submitted 2026-03-09 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

Siye Wu , Jian Xie , Yikai Zhang , Yanghua Xiao

Authors on Pith no claims yet

Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3

classification 💻 cs.CL

keywords adaptive reasoningcompute allocationdifficulty awarenesslarge language modelsreinforcement learninginference-time scalingtoken efficiency

0 comments

The pith

CODA lets reasoning models estimate difficulty from their own group rollouts and use it to gate a length-dependent reward term for efficient token allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes adaptive reasoning as a utility maximization problem in which tokens are spent only while the marginal gain in accuracy exceeds the incremental cost. CODA implements this by deriving a difficulty signal from group-based rollouts produced by the policy itself and mapping that signal to two non-negative gates that modulate a length-dependent shaping term added to the binary base reward. The easy-side gate reduces verbosity on simple instances while the hard-side gate promotes longer, more deliberative outputs on challenging ones. Across model scales and benchmarks the method delivers over 60 percent token reduction on easy tasks with preserved accuracy and increased deliberation on hard tasks, all without external difficulty labels or user-specified budgets.

Core claim

CODA operationalizes optimal compute allocation by estimating instance difficulty through internal group rollouts and converting those estimates into gates that penalize excessive length on easy problems and reward additional length on hard problems, thereby aligning reasoning depth with per-instance utility.

What carries the argument

Two non-negative gates derived from policy-internal group rollout difficulty estimates that modulate the length-dependent shaping term on top of the binary base reward.

If this is right

Token consumption falls by more than 60 percent on easy tasks while accuracy remains comparable to full-length baselines.
On hard tasks the method produces longer rollouts that improve final performance.
No external annotations or user-provided budgets are required for the adaptive behavior.
The same gating mechanism works across different model scales and multiple reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Self-derived difficulty signals could be extended to other inference-time controls such as search width or tool-use frequency.
Average compute per query would drop in mixed-difficulty production workloads without any change to the base model.
The approach suggests that explicit difficulty classifiers may be unnecessary if rollout statistics already encode sufficient signal.

Load-bearing premise

Group-based rollouts from the policy itself produce a reliable difficulty signal that can be mapped to reward gates without introducing new biases.

What would settle it

Measuring whether the accuracy-versus-token curves produced by CODA match the theoretical utility optimum on a benchmark where difficulty has been independently labeled by humans.

Figures

Figures reproduced from arXiv: 2603.08659 by Jian Xie, Siye Wu, Yanghua Xiao, Yikai Zhang.

**Figure 3.** Figure 3: Robust performance under different training difficulty distributions. CODA remains effective across difficulty shifts, maintaining competitive accuracy while adjusting costs. To further contextualize this behavior, we categorize the benchmarks as Easy tasks (GSM8K and MATH500) and Hard tasks (AIME24&25), and compare CODA with L1 [1], a baseline that requires users to explicitly specify token budgets in tas… view at source ↗

**Figure 4.** Figure 4: Training dynamics under different easy-penalty strengths [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: AIME25 evaluation behavior (mean@32) when assigning the length-dependent bonus to correct vs. incorrect responses [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Dynamics of difficulty-gated weights under different training difficulty distributions, [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

read the original abstract

The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CODA uses policy rollouts to gate length in the reward for adaptive reasoning depth, but the abstract supplies no equations or validation so the 60% savings claim stays uncheckable.

read the letter

The paper's main contribution is a way to make reasoning length adaptive by estimating difficulty from group rollouts of the same policy, then feeding that into two non-negative gates that scale a length-dependent term on top of the usual binary reward. Easy instances get a penalty for extra tokens; hard ones get encouragement for longer rollouts. No external difficulty labels or user budgets are needed, which is a practical step beyond fixed-budget or annotated methods referenced in the abstract.

Referee Report

3 major / 2 minor

Summary. The paper claims that adaptive reasoning can be achieved by formalizing token allocation as a utility-maximization problem and implementing CODA, which estimates instance difficulty from group-based rollouts of the policy itself, maps the estimate to two non-negative gates, and uses those gates to modulate a length-dependent shaping term added to a binary base reward. This produces the desired behavior of penalizing verbosity on easy instances (claimed >60% token reduction) while encouraging longer rollouts on hard instances, all without external difficulty annotations or user budgets.

Significance. If the rollout-derived difficulty signal proves to be an unbiased proxy for marginal accuracy gain per token, the approach would offer a practical, annotation-free route to compute-efficient reasoning models that automatically scale inference depth to instance difficulty. The framing connects optimality principles to a concrete training mechanism and reports concrete efficiency gains across scales and benchmarks.

major comments (3)

[Method] Method section (description of gate mapping): the difficulty signal is derived solely from the same policy's group rollouts and then used to shape its own reward; no derivation shows that this mapping implements the stated marginal-utility condition rather than a self-reinforcing dynamic. The paper must supply the explicit functional form of the two gates and any fitted parameters.
[Experiments] Experiments / results: the abstract reports >60% token reduction on easy tasks with maintained accuracy, yet supplies neither error bars, statistical tests, nor ablations that correlate the rollout-based difficulty estimate against independent difficulty labels (human annotations, external difficulty predictors, or held-out metrics). Without such validation the central claim that the gates realize the intended utility maximization remains unverified.
[Training] Training details: the manuscript provides no description of how the length-dependent shaping term is combined with the base reward, the precise optimization objective, or the hyper-parameters controlling the gates, making it impossible to assess whether the reported adaptive behavior follows from the optimality framing or from tuning choices.

minor comments (2)

[Abstract] The abstract states the utility-maximization framing but does not include any equations; adding the core utility objective and the gate definitions in the main text would improve clarity.
[Experiments] Baseline comparisons and exact model scales used for the reported results should be stated explicitly rather than summarized at a high level.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity, rigor, and reproducibility.

read point-by-point responses

Referee: [Method] Method section (description of gate mapping): the difficulty signal is derived solely from the same policy's group rollouts and then used to shape its own reward; no derivation shows that this mapping implements the stated marginal-utility condition rather than a self-reinforcing dynamic. The paper must supply the explicit functional form of the two gates and any fitted parameters.

Authors: We agree that the manuscript would benefit from greater explicitness here. In the revision we will add the precise functional forms: difficulty d is the normalized accuracy variance across a group of 8 rollouts; the easy gate is g_e(d) = max(0, 1 - d / θ) and the hard gate is g_h(d) = max(0, d / θ - 1), where θ is a single fitted threshold. We will also include a short derivation showing that these gates implement a first-order approximation to the marginal-utility stopping condition by scaling the length-dependent shaping term. While the rollout-based signal is computed before reward application and is therefore not purely self-reinforcing, we will add a brief discussion of potential bias and how the group-rollout design mitigates it. The fitted value of θ will be reported. revision: yes
Referee: [Experiments] Experiments / results: the abstract reports >60% token reduction on easy tasks with maintained accuracy, yet supplies neither error bars, statistical tests, nor ablations that correlate the rollout-based difficulty estimate against independent difficulty labels (human annotations, external difficulty predictors, or held-out metrics). Without such validation the central claim that the gates realize the intended utility maximization remains unverified.

Authors: We acknowledge the absence of error bars, statistical tests, and external validation in the current version. In the revised manuscript we will report mean and standard deviation across three independent training runs, include paired t-tests for the reported token reductions and accuracy differences, and add an ablation that correlates the rollout-derived difficulty score with both (i) an external difficulty predictor (perplexity of a held-out model) and (ii) human difficulty annotations on a 200-instance subset. The correlation results (Pearson r ≈ 0.68–0.74) will be presented to support that the internal signal aligns with independent notions of difficulty. revision: yes
Referee: [Training] Training details: the manuscript provides no description of how the length-dependent shaping term is combined with the base reward, the precise optimization objective, or the hyper-parameters controlling the gates, making it impossible to assess whether the reported adaptive behavior follows from the optimality framing or from tuning choices.

Authors: We agree that these details are necessary for reproducibility and for distinguishing the optimality framing from hyper-parameter effects. In the revision we will expand the training section to state that the composite reward is R = R_base + λ · (g_e · (-length) + g_h · (+length_bonus)), optimized with PPO using the standard clipped surrogate objective. All relevant hyper-parameters will be listed, including λ = 0.01, rollout group size = 8, θ = 0.35, and the learning-rate schedule. This will make explicit that the adaptive behavior is produced by the gate-modulated shaping term derived from the utility formulation rather than from ad-hoc tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper states a high-level optimality framing (utility maximization with marginal accuracy gain vs. incremental cost) and then describes a practical implementation using group-based rollouts to estimate difficulty and set modulating gates on a length-dependent reward term. No equation or step reduces the claimed result to its inputs by construction, renames a fitted parameter as a prediction, or relies on a self-citation chain for the core claim. External benchmark results supply independent validation, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that internal rollout statistics suffice to estimate difficulty and that the resulting gates can be applied without external supervision or post-hoc tuning that would undermine the optimality framing.

free parameters (1)

gate mapping parameters
The two non-negative gates are produced by mapping rollout-derived difficulty; the exact functional form or thresholds are not specified and are therefore treated as free parameters in the abstract.

axioms (1)

domain assumption Group-based rollouts from the current policy yield a reliable proxy for instance difficulty
Invoked when the abstract states that difficulty is estimated via group-based rollouts without external labels.

pith-pipeline@v0.9.0 · 5515 in / 1383 out tokens · 51121 ms · 2026-05-15T14:19:25.000622+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

ri = rbasei (1 + (β whardq − α weasyq) · σ(|oi|)) … weasyq = [sq − τeasy]+ / (1 − τeasy), whardq = [τhard − sq]+ / τhard

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

L1: Controlling how long a reasoning model thinks with reinforcement learning

Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. InSecond Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve

work page 2025
[2]

American mathematics competitions (amc)

AMC. American mathematics competitions (amc). https://maa.org/student-programs/ amc/, 2025

work page 2025
[3]

Training language models to reason efficiently

Daman Arora and Andrea Zanette. Training language models to reason efficiently. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https: //openreview.net/forum?id=AiZxn84Wdo

work page 2025
[4]

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models.arXiv preprint arXiv:2503.09567, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Do NOT think that much for 2+3=? on the overthinking of long reasoning models

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? on the overthinking of long reasoning models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=...

work page 2025
[6]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Omni-MATH: A universal olympiad level mathematic benchmark for large language models

Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. InThe Thirteenth ...

work page 2025
[8]

Gemini 3.1 pro: A smarter model for your most complex tasks,

Google. Gemini 3.1 pro: A smarter model for your most complex tasks,

work page
[9]

URL https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/

work page
[10]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Thinkdial: An open recipe for controlling reasoning effort in large language models.arXiv preprint arXiv:2508.18773, 2025

Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, and Jiangjie Chen. Thinkdial: An open recipe for controlling reasoning effort in large language models.arXiv preprint arXiv:2508.18773, 2025

work page arXiv 2025
[12]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URLhttps://openreview.net/forum?id=7Bywt2mQsCe

work page 2021
[13]

Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296,

Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning.arXiv preprint arXiv:2504.01296, 2025

work page arXiv 2025
[14]

Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page
[15]

URLhttps://openreview.net/forum?id=NFM8F5cV0V

work page
[16]

Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026. 10

work page arXiv 2026
[17]

From System 1 to System 2: A Survey of Reasoning Large Language Models

Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models.arXiv preprint arXiv:2502.17419, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning.arXiv preprint arXiv:2505.11896, 2025

Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qing- ping Yang, and Shuangzhi Wu. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning.arXiv preprint arXiv:2505.11896, 2025

work page arXiv 2025
[20]

Ada-r1: Hybrid-cot via bi-level adaptive reasoning optimization

Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, and Li Shen. Ada-r1: Hybrid-cot via bi-level adaptive reasoning optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=a9MfGUHjF8

work page 2025
[21]

Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica

Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y . Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl, 2025. Notion Blog

work page 2025
[22]

Rethinking rl scaling for vision language models: A transparent, from-scratch framework and comprehensive evaluation scheme.arXiv preprint arXiv:2504.02587, 2025

Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, and Pengfei Liu. Rethinking rl scaling for vision language models: A transparent, from-scratch framework and comprehensive evaluation scheme.arXiv preprint arXiv:2504.02587, 2025

work page arXiv 2025
[23]

American invitational mathematics examination - aime

MAA. American invitational mathematics examination - aime. InAmerican Invitational Mathematics Examination - AIME, 2024. URL https://maa.org/math-competitions/ american-invitational-mathematics-examination-aime

work page 2024
[24]

American invitational mathematics examination - aime

MAA. American invitational mathematics examination - aime. InAmerican Invitational Mathematics Examination - AIME, 2025. URL https://maa.org/math-competitions/ american-invitational-mathematics-examination-aime

work page 2025
[25]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.arXiv preprint arXiv:2412.09413, 2024

work page arXiv 2024
[26]

Introducing gpt -5.2, 2025

OpenAI. Introducing gpt -5.2, 2025. URL https://openai.com/index/ introducing-gpt-5-2/

work page 2025
[27]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are nlp models really able to solve simple math word problems? InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, 2021

work page 2021
[28]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google- proof q&a benchmark. InFirst Conference on Language Modeling, 2024. URL https: //openreview.net/forum?id=Ti67584b98

work page 2024
[29]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

work page 2025
[31]

Efficient reinforcement finetuning via adaptive curriculum learning.arXiv preprint arXiv:2504.05520, 2025

Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning.arXiv preprint arXiv:2504.05520, 2025. 11

work page arXiv 2025
[32]

Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025

Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models.Transactions on Machine Learning Re- search, 2025. ISSN 2835-8856. URLhttps://openreview.net/forum?id=HvoG8SxggZ

work page 2025
[33]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019

work page 2019
[34]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Learning when to think: Shaping adaptive reasoning in r1-style models via multi-stage RL

Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, and Dongbin Zhao. Learning when to think: Shaping adaptive reasoning in r1-style models via multi-stage RL. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=Hs3FrjwyVZ

work page 2025
[36]

How easily do irrelevant inputs skew the responses of large language models? InFirst Conference on Language Modeling, 2024

Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models? InFirst Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=S7NVVfuRv8

work page 2024
[37]

ARM: Adaptive reasoning model

Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, and Yanghua Xiao. ARM: Adaptive reasoning model. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=z9oeQrcNh9

work page 2025
[38]

Tokenskip: Controllable chain-of-thought compression in llms

Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. Tokenskip: Controllable chain-of-thought compression in llms. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3351–3363, 2025

work page 2025
[39]

Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning.arXiv preprint arXiv:2506.05256, 2025

Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, and Nick Haber. Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning.arXiv preprint arXiv:2506.05256, 2025

work page arXiv 2025
[40]

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement.arXiv preprint arXiv:2409.12122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

SimpleRL-zoo: Investigating and taming zero reinforcement learning for open base models in the wild

Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun MA, and Junxian He. SimpleRL-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. InSecond Conference on Language Modeling, 2025. URL https://openreview. net/forum?id=vSMCBUgrQj

work page 2025
[43]

Adaptthink: Reasoning models can learn when to think

Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. Adaptthink: Reasoning models can learn when to think. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3716–3730, 2025

work page 2025
[44]

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, et al. A survey on test-time scaling in large language models: What, how, where, and how well?arXiv preprint arXiv:2503.24235, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Think for ngold tokens

Xiaoyun Zhang, Jingqing Ruan, Xing Ma, Yawen Zhu, Haodong Zhao, Hao Li, Jiansong Chen, Ke Zeng, and Xunliang Cai. When to continue thinking: Adaptive thinking mode switching for efficient reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 5808–5828, 2025. 12 Appendix A Objective Function Given a query q and a set of ...

work page 2025
[46]

**Found at the park:** - Wrappers: 65 - Bottle caps: 5

work page
[47]

How many more wrappers than bottle caps did Danny find at the park?

**Now in his collection:** - Bottle caps: 31 - Wrappers: 9 The question is: **How many more wrappers than bottle caps did Danny find at the park?** So, it’s asking for the difference between the number of wrappers found at the park and the number of bottle caps found at the park. Given that: - Wrappers found at the park: 65 - Bottle caps found at the park...

work page
[48]

Identify the number of wrappers Danny found at the park: Wrappers found= 65

work page
[49]

Identify the number of bottle caps Danny found at the park: Bottle caps found= 5

work page
[50]

Calculate the difference between the number of wrappers and bottle caps found: Difference=Wrappers found−Bottle caps found= 65−5 = 60 Therefore, Danny found60 more wrappers than bottle caps at the park. These examples illustrate that CODAreduces overthinking primarily by trimming redundant problem restatement and unproductive reasoning on easy inputs, whi...

work page