
arxiv: 2606.17687 · v1 · pith:NUUJGLKVnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Pith reviewed 2026-06-27 00:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords minimal sufficient chain-of-thought · adaptive reasoning · chain-of-thought compression · reinforcement learning for reasoning · large reasoning models · sufficiency-aware optimization · reasoning efficiency

The pith

SuCo enables large reasoning models to produce the shortest sufficient chain-of-thought for each query by training on minimal prefixes with adaptive thresholds and sufficiency-aware optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that defining the minimal sufficient chain-of-thought as the shortest adequate prefix allows models to reason more efficiently without sacrificing accuracy. By using problem-specific thresholds to build training data and then applying reinforcement learning that rewards appropriate stopping, the approach creates continuous control over reasoning length. A reader would care because existing methods either fix the reasoning budget or use discrete modes, leading to wasted computation on easy problems or insufficient depth on hard ones. If correct, this means models can adapt their thinking effort naturally to the task at hand.

Core claim

Minimal Sufficient CoT is the shortest prefix of a reasoning trajectory that still yields the correct answer. SuCo uses this definition in two stages: first fine-tuning on data built with difficulty-scaled sufficiency thresholds, then policy optimization with rewards that penalize both excessive and insufficient reasoning length. Experiments demonstrate consistent gains in accuracy and reductions in token usage across mathematics, code, and science benchmarks.

What carries the argument

Minimal Sufficient CoT (MSC): the shortest prefix of a CoT trajectory that is adequate for producing the correct answer. It serves as the basis for constructing aligned training data and sufficiency-aware rewards.
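To make the MSC definition concrete, here is a minimal Python sketch. It assumes, following the Figure 2 caption on this page, that a sufficiency score is the geometric mean of the probabilities assigned to the ground-truth answer tokens and that the MSC is the shortest prefix whose score exceeds a threshold δ. How those probabilities are obtained at each position is not described here, so the function names, the `l_min` handling, and the input format are illustrative assumptions rather than the paper's implementation.

```python
import math
from typing import Optional, Sequence


def sufficiency_score(answer_token_probs: Sequence[float]) -> float:
    """Geometric mean of the probabilities of the ground-truth answer tokens."""
    if not answer_token_probs:
        raise ValueError("need at least one answer-token probability")
    if any(p <= 0.0 for p in answer_token_probs):
        return 0.0  # a zero-probability token drives the geometric mean to zero
    log_sum = sum(math.log(p) for p in answer_token_probs)
    return math.exp(log_sum / len(answer_token_probs))


def minimal_sufficient_prefix(
    scores: Sequence[float], delta: float, l_min: int = 1
) -> Optional[int]:
    """Length of the shortest prefix whose score exceeds delta.

    scores[i] is the (assumed precomputed) sufficiency score after i + 1
    reasoning units. Prefixes shorter than l_min are skipped. Returns None
    when no prefix is sufficient.
    """
    for length, score in enumerate(scores, start=1):
        if length >= l_min and score > delta:
            return length
    return None
```

For example, with scores `[0.1, 0.4, 0.7, 0.9, 0.3]` and `delta=0.6`, the sketch returns a prefix length of 3; later positions are never inspected, which matches the "shortest prefix" reading of the definition.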

If this is right

  • Models internalize concise yet sufficient reasoning patterns that scale with question difficulty.
  • Dynamic complexity tracking allows continuous adaptation rather than discrete modes.
  • Sufficiency-aware rewards prevent both over-thinking on simple queries and under-thinking on complex ones.
  • Overall, the framework improves both accuracy and reasoning efficiency simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • One could test whether the same MSC concept applies to non-language reasoning tasks such as visual or multimodal problems.
  • The adaptive thresholds might be learned directly by the model instead of constructed externally.
  • This method could be combined with other compression techniques to further reduce inference costs.

Load-bearing premise

That problem-adaptive sufficiency thresholds can be reliably constructed to produce MSC data that, when used in MSC-Aligned Fine-Tuning (MFT) and Sufficiency-Aware Policy Optimization (SAPO), cause the model to internalize concise yet sufficient reasoning patterns without degrading performance on harder problems.
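The figure excerpts on this page report a base threshold δ0 = 0.5 and a sensitivity coefficient α = 0.4, giving δ(x) ∈ [0.5, 0.9]. One form consistent with that range is δ(x) = δ0 + α·C(x) with a complexity score C(x) in [0, 1]. The linear form below is an assumption inferred from the reported range, not a formula confirmed by the text, and `adaptive_threshold` is a hypothetical name.

```python
def adaptive_threshold(
    complexity: float, delta0: float = 0.5, alpha: float = 0.4
) -> float:
    """Problem-adaptive sufficiency threshold under an assumed linear form.

    complexity is expected in [0, 1] and is clamped to that interval, so the
    result stays within [delta0, delta0 + alpha].
    """
    c = min(max(complexity, 0.0), 1.0)
    return delta0 + alpha * c
```

Under this assumption, harder problems (higher complexity) require a higher sufficiency score before the prefix is cut, which is one way the thresholds could "scale with question difficulty".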

What would settle it

Observing that SuCo-trained models generate longer or less accurate responses on simple problems compared to standard fine-tuned models, or fail to improve on hard problems, would indicate the approach does not achieve the claimed adaptive control.

Figures

Figures reproduced from arXiv: 2606.17687 by Bingyu Liang, Chenhao Hu, Jiahao Wang, Jing Li, Longhui Zhang, Min Zhang, Xuebo Liu, Xuelong Li.

Figure 1: MSC vs. Full CoT on Qwen3-8B across MATH difficulty levels. Left axis (↓): reasoning tokens. Right axis (↑): accuracy. At each difficulty level, MSC achieves higher accuracy with significantly fewer tokens.
1. Introduction. Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks (Zhao et al., 2023; Wang et al., 2025a; Zhang et al., 2025c), yet continue to strugg…
Figure 2: Illustration of Minimal Sufficient CoT (MSC). For a given question, sufficiency score (geometric mean over ground-truth answer tokens) is computed at each generation position. The MSC is the shortest prefix exceeding the adaptive threshold δ. As shown, once the sufficiency threshold is reached, extended waiting or self-verification steps lead to a rapid decline in sufficiency, indicating that additional re…
Figure 3: Token length distribution comparison between full CoT and MSC across training datasets.
Implementation details. All trainings are performed on 8 × NVIDIA H100 80GB GPUs. MFT Stage. We set the base threshold δ0 = 0.5 and the sensitivity coefficient α = 0.4, resulting in problem-adaptive thresholds δ(x) ∈ [0.5, 0.9]. The minimum reasoning length is fixed to Lmin = 5 sentences to filter trivial fragments. We…
Figure 4: Distribution of reasoning lengths in training data constructed by different MSC variants.
…ing all static configurations with comparable token usage. ▶ Percentile-Based Complexity Estimation. We compare against two alternatives: Min-Max estimation $C(x_i) = \frac{\|z_i\| - \min_j \|z_j\|}{\max_j \|z_j\| - \min_j \|z_j\|}$ and Log-Scaled normalization $C(x_i) = \frac{\log(1+\|z_i\|) - \log(1+\min_j \|z_j\|)}{\log(1+\max_j \|z_j\|) - \log(1+\min_j \|z_j\|)}$. Min-…
Figure 6: Response length distribution across MATH difficulty levels for SuCo-1.5B (top) and base LRM DeepSeek-R1-Distill-1.5B (bottom). SuCo continuously adapts reasoning effort to problem complexity with significantly higher efficiency.
Difficulty-conditioned reasoning length. We compare response length distributions across MATH (Hendrycks et al., 2021b) difficulty levels between SuCo-1.5B and DeepSeek-R1-Distil…
Figure 7: Empty CoT analysis of SuCo-1.5B and SuCo-7B across problem types and difficulties. Higher model capacity (7B vs. 1.5B) leads to increased empty CoT rates, while harder problems trigger more explicit reasoning.
…derivation. Despite a substantial fraction of empty CoT outputs, SuCo maintains strong overall accuracy (…
Figure 8: Effect of the minimum reasoning length Lmin.
Figure 9: The complete prompt used for MSC refinement. The prompt guides the model to polish the raw MSC prefix along three dimensions: Logical Completeness, Conciseness, and Stylistic Consistency. The refinement process focuses on improving coherence and readability of the existing MSC without modifying its underlying reasoning content.
Figure 10: Refinement example demonstrating logical completion. Raw MSC stops mid-reasoning; refined MSC completes the derivation while preserving the original flow.
Figure 11: Refinement example: reasoning optimization. Raw MSC contains exploratory backtracking; refined MSC eliminates redundancy while maintaining the core logic.
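The complexity normalizations quoted in the Figure 4 excerpt can be sketched as follows. Min-Max and Log-Scaled follow the formulas in that excerpt, taking each ‖z_i‖ as a non-negative scalar per problem (what z_i represents is not stated on this page). The percentile-based estimator is named in the excerpt but not defined there, so the rank-based version below is an assumption, as are all function names.

```python
import math
from typing import List, Sequence


def min_max_complexity(norms: Sequence[float]) -> List[float]:
    """(||z_i|| - min_j ||z_j||) / (max_j ||z_j|| - min_j ||z_j||)."""
    lo, hi = min(norms), max(norms)
    if hi == lo:
        return [0.0] * len(norms)  # degenerate case: all problems look alike
    return [(v - lo) / (hi - lo) for v in norms]


def log_scaled_complexity(norms: Sequence[float]) -> List[float]:
    """Min-max normalization applied to log(1 + ||z_i||)."""
    return min_max_complexity([math.log1p(v) for v in norms])


def percentile_complexity(norms: Sequence[float]) -> List[float]:
    """Assumed rank-based estimate: fraction of other problems with a smaller norm."""
    n = len(norms)
    if n == 1:
        return [0.0]
    return [sum(1 for w in norms if w < v) / (n - 1) for v in norms]
```

With norms `[1.0, 2.0, 3.0, 100.0]`, the single outlier compresses the Min-Max scores of the first three problems toward zero, while the rank-based scores stay evenly spread. That sensitivity to outliers is one plausible reason to prefer a percentile estimate, though the excerpt is truncated before it gives the authors' own rationale.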
Original abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper defines Minimal Sufficient CoT (MSC) as the shortest prefix of a Chain-of-Thought trajectory adequate for the correct answer. It proposes SuCo, a two-stage framework consisting of MSC-Aligned Fine-Tuning (MFT) that uses problem-adaptive sufficiency thresholds to construct training data and fine-tune for concise reasoning, followed by Sufficiency-Aware Policy Optimization (SAPO) that applies RL with dynamic complexity tracking and rewards penalizing both over- and under-thinking. The central claim is that this yields consistent gains in both accuracy and reasoning efficiency on mathematics, code, and science benchmarks.

Significance. If validated, the work supplies a continuous, sufficiency-based mechanism for adaptive reasoning length control that moves beyond discrete modes or fixed budgets, with potential to improve efficiency in LRMs while preserving performance across difficulty levels. The problem-adaptive thresholds and sufficiency-aware rewards constitute a coherent technical contribution to efficient reasoning training.

major comments (2)
  1. [Abstract] The claim that 'extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency' supplies no methods, baselines, datasets, error bars, or quantitative results, rendering the central empirical claim unevaluable.
  2. [Methods (implied by pipeline description)] The construction of MSC data via problem-adaptive thresholds and the precise definition of sufficiency-aware rewards in SAPO are not specified, which is load-bearing for assessing whether the claimed internalization of concise patterns occurs without degrading harder problems.
minor comments (1)
  1. [Abstract] The phrase 'problem-adaptive sufficiency thresholds that naturally scale with question difficulty' is used without a formal definition or illustrative example.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the presentation can be strengthened. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency' supplies no methods, baselines, datasets, error bars, or quantitative results, rendering the central empirical claim unevaluable.

    Authors: Abstracts are conventionally high-level summaries constrained by length, so they omit full methodological and quantitative details. The complete experimental protocol, baselines (vanilla CoT, length-regularized fine-tuning, budget-based methods), datasets (MATH, GSM8K, HumanEval, ScienceQA), and results with standard deviations across seeds appear in Sections 4 and 5. To improve standalone evaluability, we will revise the abstract to include representative quantitative outcomes (e.g., average accuracy delta and token reduction percentages). revision: yes

  2. Referee: [Methods (implied by pipeline description)] The construction of MSC data via problem-adaptive thresholds and the precise definition of sufficiency-aware rewards in SAPO are not specified, which is load-bearing for assessing whether the claimed internalization of concise patterns occurs without degrading harder problems.

    Authors: Section 3.1 defines problem-adaptive thresholds as the shortest prefix length at which prefix accuracy reaches 95% of full-CoT accuracy, scaled by a difficulty proxy obtained from an initial model rollout. Section 3.2 defines the SAPO reward as accuracy_reward − β·|length − MSC_length| + τ·complexity_match, where complexity is tracked by a learned estimator updated each episode. We will add explicit equations, an algorithm box, and worked examples to make these constructions fully reproducible and to permit direct evaluation of the claimed behavior on hard problems. revision: yes
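The reward expression stated in the simulated rebuttal above can be written out as a short Python sketch. It takes the expression at face value: an accuracy term, a symmetric penalty on the distance from the MSC length, and a bonus for matching the tracked complexity. The two weighting coefficients are named `beta` and `tau` here as an assumption, no values for them are given on this page, and the binary accuracy reward is also an assumption.

```python
def sapo_reward(
    correct: bool,
    length: int,
    msc_length: int,
    complexity_match: float,
    beta: float,
    tau: float,
) -> float:
    """Reward = accuracy - beta * |length - msc_length| + tau * complexity_match.

    The absolute value penalizes responses that are longer than the MSC
    (over-thinking) and shorter than it (under-thinking) equally.
    """
    accuracy_reward = 1.0 if correct else 0.0  # assumed binary correctness signal
    return accuracy_reward - beta * abs(length - msc_length) + tau * complexity_match
```

Because the penalty is symmetric, a correct response 20 units longer than the MSC and one 20 units shorter receive the same reward under this sketch; whether the paper weights the two directions differently cannot be determined from this page.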

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The abstract defines MSC independently as the shortest adequate CoT prefix, then describes empirical construction of MSC data via problem-adaptive thresholds, followed by MFT and SAPO stages. No equations, reward definitions, or self-citations are present that reduce any claimed prediction or result to its own inputs by construction. The pipeline is presented as a logically coherent sequence of data construction and optimization steps whose validity rests on external benchmarks rather than internal redefinition or fitted renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5771 in / 1091 out tokens · 26848 ms · 2026-06-27T00:49:06.934156+00:00 · methodology


Reference graph

Works this paper leans on

77 extracted references · 1 canonical work pages

  1. [2]

    Let's Verify Step by Step. The Twelfth International Conference on Learning Representations.

  2. [4]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. The Thirteenth International Conference on Learning Representations.

  3. [5]

    Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).

  4. [6]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024.

  5. [7]

    Llama-Nemotron: Efficient Reasoning Models. 2025.

  6. [8]

    Open R1: A fully open reproduction of DeepSeek-R1.

  7. [9]

    OpenR1-Math-220k. Hugging Face repository, 2025.

  8. [11]

    s1: Simple test-time scaling. 2025.

  9. [12]

    Qwen3 Technical Report. 2025.

  10. [13]

    Think Only When You Need with Large Hybrid-Reasoning Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  11. [15]

    AdaptThink: Reasoning Models Can Learn When to Think

    Zhang, Jiajie, Lin, Nianyi, Hou, Lei, Feng, Ling, and Li, Juanzi. AdaptThink: Reasoning Models Can Learn When to Think. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025

  12. [20]

    Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems.

  13. [21]

    Large language models are zero-shot reasoners. Advances in neural information processing systems.

  14. [22]

    gpt-oss-120b & gpt-oss-20b Model Card. 2025.

  15. [23]

    Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.

  16. [24]

    Qwen Team. Qwen2.5: A Party of Foundation Models.

  17. [27]

    Mz Dai, Chenxu Yang, and Qingyi Si. S-. 2025.

  18. [30]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning. Second Conference on Language Modeling.

  19. [31]

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling. 2025.

  20. [32]

    Deduplicating training data makes language models better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  21. [33]

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024.

  22. [38]

    T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. Forty-second International Conference on Machine Learning.

  23. [41]

    CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling. The Fourteenth International Conference on Learning Representations.

  24. [42]

    Alphaone: Reasoning models thinking slow and fast at test time. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

  25. [44]

    Does thinking more always help? mirage of test-time scaling in reasoning models. Advances in Neural Information Processing Systems.

  26. [45]

    Geva, Mor, Khashabi, Daniel, Segal, Elad, Khot, Tushar, Roth, Dan, and Berant, Jonathan.

  27. [46]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

  28. [47]

    Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023.

  29. [48]

    System Report for CCL 25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition

    Wang, Jiahao, Liu, Ramen, Zhang, Longhui, and Li, Jing. System Report for CCL 25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition. Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025). 2025

  30. [50]

    Nemotron-4 340b technical report

    Adler, B., Agarwal, N., Aithal, A., Anh, D. H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro, B., Clay, S., Cohen, J., et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024

  31. [51]

    L1: Controlling how long a reasoning model thinks with reinforcement learning

    Aggarwal, P. and Welleck, S. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve

  32. [52]

    Opencodereasoning: Advancing data distillation for competitive coding

    Ahmad, W. U., Narenthiran, S., Majumdar, S., Ficek, A., Jain, S., Huang, J., Noroozi, V., and Ginsburg, B. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025

  33. [53]

    Program synthesis with large language models

    Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  34. [54]

    S., Kartal, B., Suhara, Y., Delalleau, O., Chen, Z., Wang, Z., Mosallanezhad, D., Renduchintala, A., Qian, H., Rekesh, D., Jia, F., Majumdar, S., Noroozi, V., Ahmad, W

    Bercovich, A., Levy, I., Golan, I., Dabbah, M., El-Yaniv, R., Puny, O., Galil, I., Moshe, Z., Ronen, T., Nabwani, N., Shahaf, I., Tropp, O., Karpas, E., Zilberstein, R., Zeng, J., Singhal, S., Bukharin, A., Zhang, Y., Konuk, T., Shen, G., Mahabaleshwarkar, A. S., Kartal, B., Suhara, Y., Delalleau, O., Chen, Z., Wang, Z., Mosallanezhad, D., Renduchintala, ...

  35. [55]

    Large language monkeys: Scaling inference compute with repeated sampling

    Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024

  36. [56]

    Training verifiers to solve math word problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  37. [57]

    S-GRPO: Early exit via reinforcement learning in reasoning models

    Dai, M., Yang, C., and Si, Q. S-GRPO: Early exit via reinforcement learning in reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=wNMK5o0Vfg

  38. [58]

    Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling

    Fan, C., Zhang, Y., Jia, J., Hero, A. O., and Liu, S. Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling. In The Fourteenth International Conference on Learning Representations, 2026

  39. [59]

    Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

    Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., and Berant, J. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL), 2021

  40. [60]

    Does thinking more always help? mirage of test-time scaling in reasoning models

    Ghosal, S. S., Chakraborty, S., Reddy, A., Lu, Y., Wang, M., Manocha, D., Huang, F., Ghavamzadeh, M., and Bedi, A. S. Does thinking more always help? mirage of test-time scaling in reasoning models. Advances in Neural Information Processing Systems, 38:172664–172691, 2026

  41. [61]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  42. [62]

    Thinkdial: An open recipe for controlling reasoning effort in large language models

    He, Q., Yuan, S., Li, X., Wang, M., and Chen, J. Thinkdial: An open recipe for controlling reasoning effort in large language models. arXiv preprint arXiv:2508.18773, 2025

  43. [63]

    Measuring massive multitask language understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021a

  44. [64]

    Measuring mathematical problem solving with the math dataset

    Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021b

  45. [65]

    Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning

    Hou, B., Zhang, Y., Ji, J., Liu, Y., Qian, K., Andreas, J., and Chang, S. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025a

  46. [66]

    T1: Advancing language model reasoning through reinforcement learning and inference scaling

    Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. T1: Advancing language model reasoning through reinforcement learning and inference scaling. In Forty-second International Conference on Machine Learning, 2025b. URL https://openreview.net/forum?id=tnxONP8zTE

  47. [67]

    Open r1: A fully open reproduction of deepseek-r1, January 2025

    Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1

  48. [68]

    Openai o1 system card

    Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

  49. [69]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL

  50. [70]

    Think only when you need with large hybrid-reasoning models

    Jiang, L., Wu, X., Huang, S., Dong, Q., Chi, Z., Dong, L., Zhang, X., Lv, T., Cui, L., and Wei, F. Think only when you need with large hybrid-reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=fDjDVE4qdj

  51. [71]

    SWE-bench: Can language models resolve real-world github issues?

    Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. R. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

  52. [72]

    Large language models are zero-shot reasoners

    Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022

  53. [73]

    Deduplicating training data makes language models better

    Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8424–8445, 2022

  54. [74]

    Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  55. [75]

    Let's verify step by step

    Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  56. [76]

    Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning

    Lou, C., Sun, Z., Liang, X., Qu, M., Shen, W., Wang, W., Li, Y., Yang, Q., and Wu, S. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv preprint arXiv:2505.11896, 2025

  57. [77]

    Openr1-math-220k

    Lozhkov, A., Kydlíček, H., Allal, L. B., Penedo, G., Beeching, E., Gallouédec, Q., Habib, N., Tunstall, L., and von Werra, L. Openr1-math-220k. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025

  58. [78]

    s1: Simple test-time scaling

    Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

  59. [79]

    gpt-oss-120b & gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

  60. [80]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  61. [81]

    GPQA: A graduate-level google-proof q&a benchmark

    Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  62. [82]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  63. [83]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Snell, C. V., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  64. [84]

    Stop overthinking: A survey on efficient reasoning for large language models

    Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

  65. [85]

    Commonsenseqa: A question answering challenge targeting commonsense knowledge

    Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019

  66. [86]

    System report for CCL 25-eval task 10: SRAG-MAV for fine-grained Chinese hate speech recognition

    Wang, J., Liu, R., Zhang, L., and Li, J. System report for CCL 25-eval task 10: SRAG-MAV for fine-grained Chinese hate speech recognition. In Lin, H., Li, B., and Tan, H. (eds.), Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pp. 395–402, Jinan, China, August 2025a. Chinese Information Processing Socie...

  67. [87]

    Thoughts are all over the place: On the underthinking of o1-like llms

    Wang, Y., Liu, Q., Xu, J., Liang, T., Chen, X., He, Z., Song, L., Yu, D., Li, J., Zhang, Z., et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025b

  68. [88]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  69. [89]

    From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models

    Wu, C., Li, B., Gao, M., and Wang, Z. From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models. arXiv preprint arXiv:2511.10788, 2025

  70. [90]

    Towards large reasoning models: A survey of reinforced reasoning with large language models

    Xu, F., Hao, Q., Zong, Z., Wang, J., Zhang, Y., Wang, J., Lan, X., Gong, J., Ouyang, T., Meng, F., et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

  71. [91]

    Qwen2.5-math technical report: Toward mathematical expert model via self-improvement

    Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

  72. [92]

    Alphaone: Reasoning models thinking slow and fast at test time

    Zhang, J., Dong, R., Wang, H., Ning, X., Geng, H., Li, P., He, X., Bai, Y., Malik, J., Gupta, S., et al. Alphaone: Reasoning models thinking slow and fast at test time. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 11340–11365, 2025a

  73. [93]

    AdaptThink: Reasoning models can learn when to think

    Zhang, J., Lin, N., Hou, L., Feng, L., and Li, J. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025b. URL https://aclanthology.org/2025.emnlp-main.184/

  74. [94]

    Speed up your code: Progressive code acceleration through bidirectional tree editing

    Zhang, L., Wang, J., Zhang, M., Cao, G., Shi, E., Ma, Y., Yu, J., Liu, H., Li, J., and Zhang, M. Speed up your code: Progressive code acceleration through bidirectional tree editing. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  75. [95]

    Tinyllama: An open-source small language model

    Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

  76. [96]

    Saber: Switchable and balanced training for efficient llm reasoning

    Zhao, K., Zhao, Y., Song, J., He, S., Zhang, L., Zhang, Q., and Li, T. Saber: Switchable and balanced training for efficient llm reasoning. arXiv preprint arXiv:2508.10026, 2025

  77. [97]

    A survey of large language models

    Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023