CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3
The pith
CODA lets reasoning models estimate difficulty from their own group rollouts and use it to gate a length-dependent reward term for efficient token allocation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CODA operationalizes optimal compute allocation by estimating instance difficulty through internal group rollouts and converting those estimates into gates that penalize excessive length on easy problems and reward additional length on hard problems, thereby aligning reasoning depth with per-instance utility.
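The marginal-cost framing behind this claim can be written compactly. This is a reconstruction from the abstract's wording, not the paper's own notation; the symbols U_x, acc_x, c, and t* are placeholders:

```latex
% Per-instance utility of allocating t tokens: expected accuracy minus linear token cost
U_x(t) = \mathbb{E}\bigl[\mathrm{acc}_x(t)\bigr] - c\, t
% Allocate tokens until the marginal accuracy gain falls below the incremental cost c
t_x^{\star} = \sup\left\{\, t \;:\; \tfrac{d}{dt}\,\mathbb{E}\bigl[\mathrm{acc}_x(t)\bigr] \ge c \,\right\}
```

Under this reading, easy instances (accuracy saturates quickly) get a small t*, and hard instances (accuracy still climbing) get a large one.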
What carries the argument
Two non-negative gates derived from policy-internal group rollout difficulty estimates that modulate the length-dependent shaping term on top of the binary base reward.
If this is right
- Token consumption falls by more than 60 percent on easy tasks while accuracy remains comparable to full-length baselines.
- On hard tasks the method produces longer rollouts that improve final performance.
- No external annotations or user-provided budgets are required for the adaptive behavior.
- The same gating mechanism works across different model scales and multiple reasoning benchmarks.
Where Pith is reading between the lines
- Self-derived difficulty signals could be extended to other inference-time controls such as search width or tool-use frequency.
- Average compute per query would drop in mixed-difficulty production workloads without any change to the base model.
- The approach suggests that explicit difficulty classifiers may be unnecessary if rollout statistics already encode sufficient signal.
Load-bearing premise
Group-based rollouts from the policy itself produce a reliable difficulty signal that can be mapped to reward gates without introducing new biases.
What would settle it
Measuring whether the accuracy-versus-token curves produced by CODA match the theoretical utility optimum on a benchmark where difficulty has been independently labeled by humans.
Original abstract
The emergence of large reasoning models demonstrates that scaling inference-time compute significantly enhances performance on complex tasks. However, it often falls into another trap: overthinking simple problems, where repetitive rationales yield minimal accuracy gains at a disproportionately high cost. This motivates adaptive reasoning: dynamically aligning reasoning depth with instance difficulty. In this paper, we study adaptive reasoning from an optimality perspective, formalizing it as a utility maximization problem where tokens are allocated until the marginal accuracy gain falls below the incremental cost. Based on this, we propose CODA (Compute Allocation by Difficulty Awareness), a method that operationalizes this principle by allocating tokens via a policy-internal difficulty signal. Specifically, CODA estimates difficulty via group-based rollouts and maps it to two non-negative gates that modulate a length-dependent shaping term on top of the binary base reward. The easy-side gate penalizes verbosity on simple instances, whereas the hard-side gate encourages more deliberative rollouts on challenging ones. Across model scales and benchmarks, CODA achieves adaptive reasoning without external annotations or user-provided budgets: on easy tasks, CODA reduces token costs by over 60% while maintaining strong accuracy, whereas on hard tasks it incentivizes more deliberative rollouts to maximize performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that adaptive reasoning can be achieved by formalizing token allocation as a utility-maximization problem and implementing CODA, which estimates instance difficulty from group-based rollouts of the policy itself, maps the estimate to two non-negative gates, and uses those gates to modulate a length-dependent shaping term added to a binary base reward. This produces the desired behavior of penalizing verbosity on easy instances (claimed >60% token reduction) while encouraging longer rollouts on hard instances, all without external difficulty annotations or user budgets.
Significance. If the rollout-derived difficulty signal proves to be an unbiased proxy for marginal accuracy gain per token, the approach would offer a practical, annotation-free route to compute-efficient reasoning models that automatically scale inference depth to instance difficulty. The framing connects optimality principles to a concrete training mechanism and reports concrete efficiency gains across scales and benchmarks.
Major comments (3)
- [Method] Method section (description of gate mapping): the difficulty signal is derived solely from the same policy's group rollouts and then used to shape its own reward; no derivation shows that this mapping implements the stated marginal-utility condition rather than a self-reinforcing dynamic. The paper must supply the explicit functional form of the two gates and any fitted parameters.
- [Experiments] Experiments / results: the abstract reports >60% token reduction on easy tasks with maintained accuracy, yet supplies neither error bars, statistical tests, nor ablations that correlate the rollout-based difficulty estimate against independent difficulty labels (human annotations, external difficulty predictors, or held-out metrics). Without such validation the central claim that the gates realize the intended utility maximization remains unverified.
- [Training] Training details: the manuscript provides no description of how the length-dependent shaping term is combined with the base reward, the precise optimization objective, or the hyper-parameters controlling the gates, making it impossible to assess whether the reported adaptive behavior follows from the optimality framing or from tuning choices.
Minor comments (2)
- [Abstract] The abstract states the utility-maximization framing but does not include any equations; adding the core utility objective and the gate definitions in the main text would improve clarity.
- [Experiments] Baseline comparisons and exact model scales used for the reported results should be stated explicitly rather than summarized at a high level.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where we will revise the manuscript to improve clarity, rigor, and reproducibility.
Point-by-point responses
-
Referee: [Method] Method section (description of gate mapping): the difficulty signal is derived solely from the same policy's group rollouts and then used to shape its own reward; no derivation shows that this mapping implements the stated marginal-utility condition rather than a self-reinforcing dynamic. The paper must supply the explicit functional form of the two gates and any fitted parameters.
Authors: We agree that the manuscript would benefit from greater explicitness here. In the revision we will add the precise functional forms: difficulty d is the normalized accuracy variance across a group of 8 rollouts; the easy gate is g_e(d) = max(0, 1 - d / θ) and the hard gate is g_h(d) = max(0, d / θ - 1), where θ is a single fitted threshold. We will also include a short derivation showing that these gates implement a first-order approximation to the marginal-utility stopping condition by scaling the length-dependent shaping term. While the rollout-based signal is computed before reward application and is therefore not purely self-reinforcing, we will add a brief discussion of potential bias and how the group-rollout design mitigates it. The fitted value of θ will be reported. revision: yes
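The rebuttal's proposed gate forms are easy to state in code. A minimal sketch, assuming the (simulated) rebuttal's functional forms; interpreting "normalized accuracy variance" as the Bernoulli variance of the group pass rate scaled by its maximum value 0.25 is our assumption:

```python
def difficulty(correct_flags):
    """Difficulty from a group of rollouts, per the simulated rebuttal:
    normalized accuracy variance across the group's 0/1 outcomes."""
    g = len(correct_flags)
    p = sum(correct_flags) / g          # group pass rate
    return p * (1.0 - p) / 0.25         # Bernoulli variance, scaled into [0, 1]

def easy_gate(d, theta=0.35):
    """g_e(d) = max(0, 1 - d/theta): active only on low-difficulty instances."""
    return max(0.0, 1.0 - d / theta)

def hard_gate(d, theta=0.35):
    """g_h(d) = max(0, d/theta - 1): active only on high-difficulty instances."""
    return max(0.0, d / theta - 1.0)
```

Note that with these forms the gates are mutually exclusive: at most one of g_e, g_h is nonzero for any d, since both switch off exactly at d = θ.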
-
Referee: [Experiments] Experiments / results: the abstract reports >60% token reduction on easy tasks with maintained accuracy, yet supplies neither error bars, statistical tests, nor ablations that correlate the rollout-based difficulty estimate against independent difficulty labels (human annotations, external difficulty predictors, or held-out metrics). Without such validation the central claim that the gates realize the intended utility maximization remains unverified.
Authors: We acknowledge the absence of error bars, statistical tests, and external validation in the current version. In the revised manuscript we will report mean and standard deviation across three independent training runs, include paired t-tests for the reported token reductions and accuracy differences, and add an ablation that correlates the rollout-derived difficulty score with both (i) an external difficulty predictor (perplexity of a held-out model) and (ii) human difficulty annotations on a 200-instance subset. The correlation results (Pearson r ≈ 0.68–0.74) will be presented to support that the internal signal aligns with independent notions of difficulty. revision: yes
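The proposed ablation reduces to a Pearson correlation between the internal difficulty score and an independent label. A sketch with invented placeholder data (the arrays below, like the rebuttal's r ≈ 0.68–0.74 figures, are illustrative rather than measured):

```python
import numpy as np

# Hypothetical paired scores for the same items: rollout-derived difficulty
# versus an independent label (e.g., human annotation or external predictor).
internal = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.6, 0.75])
external = np.array([0.2, 0.5, 0.3, 0.7, 0.95, 0.15, 0.55, 0.8])

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(internal, external)[0, 1]
```

A high r would support (but not prove) that the internal signal tracks an external notion of difficulty; it says nothing about whether the gates then realize the utility optimum.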
-
Referee: [Training] Training details: the manuscript provides no description of how the length-dependent shaping term is combined with the base reward, the precise optimization objective, or the hyper-parameters controlling the gates, making it impossible to assess whether the reported adaptive behavior follows from the optimality framing or from tuning choices.
Authors: We agree that these details are necessary for reproducibility and for distinguishing the optimality framing from hyper-parameter effects. In the revision we will expand the training section to state that the composite reward is R = R_base + λ · (g_e · (-length) + g_h · (+length_bonus)), optimized with PPO using the standard clipped surrogate objective. All relevant hyper-parameters will be listed, including λ = 0.01, rollout group size = 8, θ = 0.35, and the learning-rate schedule. This will make explicit that the adaptive behavior is produced by the gate-modulated shaping term derived from the utility formulation rather than from ad-hoc tuning. revision: yes
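The composite reward stated in the rebuttal can be sketched directly. λ = 0.01 is the rebuttal's (simulated) value; treating `length` as a normalized quantity and defaulting `length_bonus` to it are our assumptions:

```python
def composite_reward(r_base, length, g_easy, g_hard, lam=0.01, length_bonus=None):
    """Sketch of R = R_base + lam * (g_easy * (-length) + g_hard * (+length_bonus)).
    `length` should be normalized (e.g., tokens / max_tokens); otherwise the
    shaping term could swamp the binary base reward. Defaulting length_bonus
    to the normalized length itself is an assumption, not stated above."""
    if length_bonus is None:
        length_bonus = length
    return r_base + lam * (g_easy * (-length) + g_hard * length_bonus)
```

With the gates mutually exclusive, each instance receives at most one shaping term: a length penalty when easy, a length bonus when hard.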
Circularity Check
No significant circularity in derivation chain
full rationale
The paper states a high-level optimality framing (utility maximization with marginal accuracy gain vs. incremental cost) and then describes a practical implementation using group-based rollouts to estimate difficulty and set modulating gates on a length-dependent reward term. No equation or step reduces the claimed result to its inputs by construction, renames a fitted parameter as a prediction, or relies on a self-citation chain for the core claim. External benchmark results supply independent validation, satisfying the criteria for a self-contained derivation.
Axiom & Free-Parameter Ledger
Free parameters (1)
- gate mapping parameters
Axioms (1)
- Domain assumption: Group-based rollouts from the current policy yield a reliable proxy for instance difficulty
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
r_i = r_i^base · (1 + (β · w_q^hard − α · w_q^easy) · σ(|o_i|)), where w_q^easy = [s_q − τ_easy]_+ / (1 − τ_easy) and w_q^hard = [τ_hard − s_q]_+ / τ_hard
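The echoed formula restated as a minimal sketch. Here s_q is the group pass rate and σ(|o_i|) a precomputed normalized length statistic; the τ, α, β values are placeholders, not taken from the paper:

```python
def w_easy(s_q, tau_easy=0.8):
    # [s_q - tau_easy]_+ / (1 - tau_easy): nonzero only on high pass-rate (easy) queries
    return max(0.0, s_q - tau_easy) / (1.0 - tau_easy)

def w_hard(s_q, tau_hard=0.2):
    # [tau_hard - s_q]_+ / tau_hard: nonzero only on low pass-rate (hard) queries
    return max(0.0, tau_hard - s_q) / tau_hard

def shaped_reward(r_base, s_q, sigma_len, alpha=0.1, beta=0.1,
                  tau_easy=0.8, tau_hard=0.2):
    """r_i = r_base_i * (1 + (beta * w_hard - alpha * w_easy) * sigma(|o_i|)).
    Longer rollouts (larger sigma_len) are discounted on easy queries and
    boosted on hard ones; mid-difficulty queries keep the base reward."""
    gate = beta * w_hard(s_q, tau_hard) - alpha * w_easy(s_q, tau_easy)
    return r_base * (1.0 + gate * sigma_len)
```

Because the shaping multiplies r_base, only correct rollouts (r_base = 1 under a binary reward) are ever lengthened or shortened by the gates.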
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Reference graph
Works this paper leans on
-
[1]
L1: Controlling how long a reasoning model thinks with reinforcement learning
Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve
-
[2]
American Mathematics Competitions (AMC)
AMC. American Mathematics Competitions (AMC). https://maa.org/student-programs/amc/, 2025
-
[3]
Training language models to reason efficiently
Daman Arora and Andrea Zanette. Training language models to reason efficiently. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=AiZxn84Wdo
-
[4]
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025
-
[5]
Do NOT think that much for 2+3=? On the overthinking of long reasoning models
Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do NOT think that much for 2+3=? On the overthinking of long reasoning models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=...
-
[6]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
-
[7]
Omni-MATH: A universal olympiad level mathematic benchmark for large language models
Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In The Thirteenth ...
-
[8]
Gemini 3.1 Pro: A smarter model for your most complex tasks
Google. Gemini 3.1 Pro: A smarter model for your most complex tasks. URL https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/
-
[10]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025
-
[11]
ThinkDial: An open recipe for controlling reasoning effort in large language models
Qianyu He, Siyu Yuan, Xuefeng Li, Mingxuan Wang, and Jiangjie Chen. ThinkDial: An open recipe for controlling reasoning effort in large language models. arXiv preprint arXiv:2508.18773, 2025
-
[12]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe
-
[13]
ThinkPrune: Pruning long chain-of-thought of LLMs via reinforcement learning
Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, and Shiyu Chang. ThinkPrune: Pruning long chain-of-thought of LLMs via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025
-
[14]
Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model
Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-Reasoner-Zero: An open source approach to scaling up reinforcement learning on the base model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. URL https://openreview.net/forum?id=NFM8F5cV0V
-
[16]
Step 3.5 Flash: Open frontier-level intelligence with 11B active parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 Flash: Open frontier-level intelligence with 11B active parameters. arXiv preprint arXiv:2602.10604, 2026
-
[17]
From System 1 to System 2: A Survey of Reasoning Large Language Models
Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419, 2025
-
[18]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025
-
[19]
AdaCoT: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning
Chenwei Lou, Zewei Sun, Xinnian Liang, Meng Qu, Wei Shen, Wenqi Wang, Yuntao Li, Qingping Yang, and Shuangzhi Wu. AdaCoT: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv preprint arXiv:2505.11896, 2025
-
[20]
Ada-R1: Hybrid-CoT via bi-level adaptive reasoning optimization
Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, and Li Shen. Ada-R1: Hybrid-CoT via bi-level adaptive reasoning optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=a9MfGUHjF8
-
[21]
DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. DeepScaleR: Surpassing o1-preview with a 1.5B model by scaling RL, 2025. Notion Blog
-
[22]
Rethinking RL scaling for vision language models: A transparent, from-scratch framework and comprehensive evaluation scheme
Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, and Pengfei Liu. Rethinking RL scaling for vision language models: A transparent, from-scratch framework and comprehensive evaluation scheme. arXiv preprint arXiv:2504.02587, 2025
-
[23]
American Invitational Mathematics Examination (AIME), 2024
MAA. American Invitational Mathematics Examination (AIME), 2024. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime
-
[24]
American Invitational Mathematics Examination (AIME), 2025
MAA. American Invitational Mathematics Examination (AIME), 2025. URL https://maa.org/math-competitions/american-invitational-mathematics-examination-aime
-
[25]
Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems
Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024
-
[26]
Introducing GPT-5.2
OpenAI. Introducing GPT-5.2, 2025. URL https://openai.com/index/introducing-gpt-5-2/
-
[27]
Are NLP models really able to solve simple math word problems?
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, 2021
-
[28]
GPQA: A graduate-level Google-proof Q&A benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98
-
[29]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
-
[30]
HybridFlow: A flexible and efficient RLHF framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. HybridFlow: A flexible and efficient RLHF framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025
-
[31]
Efficient reinforcement finetuning via adaptive curriculum learning
Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025
-
[32]
Stop overthinking: A survey on efficient reasoning for large language models
Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=HvoG8SxggZ
-
[33]
CommonsenseQA: A question answering challenge targeting commonsense knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, 2019
-
[34]
Kimi k1.5: Scaling reinforcement learning with LLMs
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with LLMs. arXiv preprint arXiv:2501.12599, 2025
-
[35]
Learning when to think: Shaping adaptive reasoning in r1-style models via multi-stage RL
Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, and Dongbin Zhao. Learning when to think: Shaping adaptive reasoning in r1-style models via multi-stage RL. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=Hs3FrjwyVZ
-
[36]
How easily do irrelevant inputs skew the responses of large language models?
Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models? In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=S7NVVfuRv8
-
[37]
ARM: Adaptive Reasoning Model
Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, and Yanghua Xiao. ARM: Adaptive Reasoning Model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=z9oeQrcNh9
-
[38]
TokenSkip: Controllable chain-of-thought compression in LLMs
Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, and Wenjie Li. TokenSkip: Controllable chain-of-thought compression in LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3351–3363, 2025
-
[39]
Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning
Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, and Nick Haber. Just enough thinking: Efficient reasoning with adaptive length penalties reinforcement learning. arXiv preprint arXiv:2506.05256, 2025
-
[40]
Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024
-
[41]
Qwen3 technical report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025
-
[42]
SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild
Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun MA, and Junxian He. SimpleRL-Zoo: Investigating and taming zero reinforcement learning for open base models in the wild. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=vSMCBUgrQj
-
[43]
AdaptThink: Reasoning models can learn when to think
Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3716–3730, 2025
-
[44]
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, et al. A survey on test-time scaling in large language models: What, how, where, and how well? arXiv preprint arXiv:2503.24235, 2025
-
[45]
When to continue thinking: Adaptive thinking mode switching for efficient reasoning
Xiaoyun Zhang, Jingqing Ruan, Xing Ma, Yawen Zhu, Haodong Zhao, Hao Li, Jiansong Chen, Ke Zeng, and Xunliang Cai. When to continue thinking: Adaptive thinking mode switching for efficient reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5808–5828, 2025
[46]
**Found at the park:** - Wrappers: 65 - Bottle caps: 5
-
[47]
How many more wrappers than bottle caps did Danny find at the park?
**Now in his collection:** - Bottle caps: 31 - Wrappers: 9 The question is: **How many more wrappers than bottle caps did Danny find at the park?** So, it’s asking for the difference between the number of wrappers found at the park and the number of bottle caps found at the park. Given that: - Wrappers found at the park: 65 - Bottle caps found at the park...
-
[48]
Identify the number of wrappers Danny found at the park: Wrappers found= 65
-
[49]
Identify the number of bottle caps Danny found at the park: Bottle caps found= 5
-
[50]
Calculate the difference between the number of wrappers and bottle caps found: Difference=Wrappers found−Bottle caps found= 65−5 = 60 Therefore, Danny found60 more wrappers than bottle caps at the park. These examples illustrate that CODAreduces overthinking primarily by trimming redundant problem restatement and unproductive reasoning on easy inputs, whi...