SuCo: Sufficiency-guided Continuous Adaptive Reasoning

Bingyu Liang; Chenhao Hu; Jiahao Wang; Jing Li; Longhui Zhang; Min Zhang; Xuebo Liu; Xuelong Li

arxiv: 2606.17687 · v1 · pith:NUUJGLKVnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

Pith reviewed 2026-06-27 00:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords minimal sufficient chain-of-thought · adaptive reasoning · chain-of-thought compression · reinforcement learning for reasoning · large reasoning models · sufficiency-aware optimization · reasoning efficiency


The pith

SuCo enables large reasoning models to produce the shortest sufficient chain-of-thought for each query by training on minimal prefixes with adaptive thresholds and sufficiency-aware optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that defining the minimal sufficient chain-of-thought as the shortest adequate prefix allows models to reason more efficiently without sacrificing accuracy. By using problem-specific thresholds to build training data and then applying reinforcement learning that rewards appropriate stopping, the approach creates continuous control over reasoning length. A reader would care because existing methods either fix the reasoning budget or use discrete modes, leading to wasted computation on easy problems or insufficient depth on hard ones. If correct, this means models can adapt their thinking effort naturally to the task at hand.

Core claim

Minimal Sufficient CoT is the shortest prefix of a reasoning trajectory that still yields the correct answer. SuCo uses this definition in two stages: first fine-tuning on data built with difficulty-scaled sufficiency thresholds, then policy optimization with rewards that penalize both excessive and insufficient reasoning length. Experiments demonstrate consistent gains in accuracy and reductions in token usage across mathematics, code, and science benchmarks.

What carries the argument

Minimal Sufficient CoT (MSC): the shortest prefix of a CoT trajectory that is adequate for producing the correct answer. It serves as the basis for constructing aligned training data and sufficiency-aware rewards.

If this is right

Models internalize concise yet sufficient reasoning patterns that scale with question difficulty.
Dynamic complexity tracking allows continuous adaptation rather than discrete modes.
Sufficiency-aware rewards prevent both over-thinking on simple queries and under-thinking on complex ones.
Overall, the framework improves both accuracy and reasoning efficiency simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

One could test whether the same MSC concept applies to non-language reasoning tasks such as visual or multimodal problems.
The adaptive thresholds might be learned directly by the model instead of constructed externally.
This method could be combined with other compression techniques to further reduce inference costs.

Load-bearing premise

That problem-adaptive sufficiency thresholds can be reliably constructed to produce MSC data that, when used in MSC-Aligned Fine-Tuning (MFT) and Sufficiency-Aware Policy Optimization (SAPO), cause the model to internalize concise yet sufficient reasoning patterns without degrading performance on harder problems.

What would settle it

Observing that SuCo-trained models generate longer or less accurate responses on simple problems compared to standard fine-tuned models, or fail to improve on hard problems, would indicate the approach does not achieve the claimed adaptive control.

Figures

Figures reproduced from arXiv: 2606.17687 by Bingyu Liang, Chenhao Hu, Jiahao Wang, Jing Li, Longhui Zhang, Min Zhang, Xuebo Liu, Xuelong Li.

**Figure 1.** MSC vs. Full CoT on Qwen3-8B across MATH difficulty levels. Left axis (↓): reasoning tokens. Right axis (↑): accuracy. At each difficulty level, MSC achieves higher accuracy with significantly fewer tokens.

1. Introduction. Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks (Zhao et al., 2023; Wang et al., 2025a; Zhang et al., 2025c), yet continue to strugg…

**Figure 2.** Illustration of Minimal Sufficient CoT (MSC). For a given question, the sufficiency score (geometric mean over ground-truth answer tokens) is computed at each generation position. The MSC is the shortest prefix exceeding the adaptive threshold δ. As shown, once the sufficiency threshold is reached, extended waiting or self-verification steps lead to a rapid decline in sufficiency, indicating that additional re…
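The definition in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that, for every candidate prefix, the model's probabilities for the ground-truth answer tokens have already been computed (the hypothetical `probs_per_prefix` input). It takes their geometric mean as the sufficiency score and returns the first prefix whose score exceeds δ. Function names and the input layout are assumptions.

```python
import math
from typing import Optional, Sequence


def sufficiency_score(answer_token_probs: Sequence[float]) -> float:
    """Geometric mean of the probabilities of the ground-truth answer tokens."""
    if not answer_token_probs:
        raise ValueError("need at least one answer-token probability")
    if any(p <= 0.0 for p in answer_token_probs):
        return 0.0  # a zero-probability token drives the geometric mean to zero
    log_sum = sum(math.log(p) for p in answer_token_probs)
    return math.exp(log_sum / len(answer_token_probs))


def minimal_sufficient_prefix(
    probs_per_prefix: Sequence[Sequence[float]], delta: float
) -> Optional[int]:
    """Return the 1-based index of the shortest prefix whose score exceeds delta.

    probs_per_prefix[k] holds the answer-token probabilities obtained after
    conditioning on the first k+1 reasoning steps (hypothetical layout).
    Returns None if no prefix is sufficient.
    """
    for position, probs in enumerate(probs_per_prefix, start=1):
        if sufficiency_score(probs) > delta:
            return position
    return None
```

Scanning from the shortest prefix and stopping at the first hit is what makes the result minimal; later prefixes are never inspected, which matches the caption's observation that sufficiency can drop again after the threshold is reached.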

**Figure 3.** Token length distribution comparison between full CoT and MSC across training datasets.

Implementation details. All trainings are performed on 8 × NVIDIA H100 80GB GPUs. MFT Stage. We set the base threshold δ0 = 0.5 and the sensitivity coefficient α = 0.4, resulting in problem-adaptive thresholds δ(x) ∈ [0.5, 0.9]. The minimum reasoning length is fixed to Lmin = 5 sentences to filter trivial fragments. We…
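The reported range δ(x) ∈ [0.5, 0.9] is what a linear rule δ(x) = δ0 + α·C(x) would give when the complexity score C(x) lies in [0, 1], since 0.5 + 0.4·0 = 0.5 and 0.5 + 0.4·1 = 0.9. The excerpt does not state the formula, so the linear form below is an assumption chosen only because it reproduces that range; the function name is hypothetical.

```python
def adaptive_threshold(
    complexity: float, base: float = 0.5, sensitivity: float = 0.4
) -> float:
    """Problem-adaptive sufficiency threshold (assumed linear form).

    complexity  -- C(x) in [0, 1]; larger means a harder problem
    base        -- delta_0, the base threshold reported in the excerpt
    sensitivity -- alpha, the sensitivity coefficient reported in the excerpt
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must lie in [0, 1]")
    return base + sensitivity * complexity
```

Under this assumed rule, harder problems demand a higher sufficiency score before a prefix counts as sufficient, so their MSC prefixes tend to be longer.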

**Figure 4.** Distribution of reasoning lengths in training data constructed by different MSC variants.

…ing all static configurations with comparable token usage. ▶ Percentile-Based Complexity Estimation. We compare against two alternatives: Min-Max estimation $C(x_i) = \frac{\|z_i\| - \min_j \|z_j\|}{\max_j \|z_j\| - \min_j \|z_j\|}$ and Log-Scaled normalization $C(x_i) = \frac{\log(1+\|z_i\|) - \log(1+\min_j \|z_j\|)}{\log(1+\max_j \|z_j\|) - \log(1+\min_j \|z_j\|)}$. Min-…
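The two alternative estimators named in the excerpt are simple to write down. The sketch below implements only Min-Max and Log-Scaled normalization, because the excerpt does not define the percentile-based estimator. It assumes the norms ∥z_j∥ are supplied as a plain list (the excerpt does not say what z is), and returning all zeros when every norm is equal is a choice made here, not something the source specifies.

```python
import math
from typing import List, Sequence


def min_max_complexity(norms: Sequence[float]) -> List[float]:
    """C(x_i) = (n_i - min_j n_j) / (max_j n_j - min_j n_j)."""
    lo, hi = min(norms), max(norms)
    if hi == lo:
        return [0.0] * len(norms)  # degenerate case: assumption, see lead-in
    return [(n - lo) / (hi - lo) for n in norms]


def log_scaled_complexity(norms: Sequence[float]) -> List[float]:
    """Min-max normalization applied to log(1 + n_i).

    log1p is monotonic, so the min and max of the logs are the logs of the
    min and max norms, which matches the Log-Scaled formula term by term.
    """
    return min_max_complexity([math.log1p(n) for n in norms])
```

Both estimators map the smallest norm to 0 and the largest to 1; the log-scaled variant compresses large values, so mid-range items receive higher scores than under plain min-max.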

**Figure 6.** Response length distribution across MATH difficulty levels for SuCo-1.5B (top) and base LRM DeepSeek-R1-Distill-1.5B (bottom). SuCo continuously adapts reasoning effort to problem complexity with significantly higher efficiency.

Difficulty-conditioned reasoning length. We compare response length distributions across MATH (Hendrycks et al., 2021b) difficulty levels between SuCo-1.5B and DeepSeekR1-Distil…

**Figure 7.** Empty CoT analysis of SuCo-1.5B and SuCo-7B across problem types and difficulties. Higher model capacity (7B vs. 1.5B) leads to increased empty CoT rates, while harder problems trigger more explicit reasoning.

…derivation. Despite a substantial fraction of empty CoT outputs, SuCo maintains strong overall accuracy (…

**Figure 8.** Effect of the minimum reasoning length Lmin.

**Figure 9.** The complete prompt used for MSC refinement. The prompt guides the model to polish the raw MSC prefix along three dimensions: Logical Completeness, Conciseness, and Stylistic Consistency. The refinement process focuses on improving coherence and readability of the existing MSC without modifying its underlying reasoning content.

**Figure 10.** Refinement example demonstrating logical completion. Raw MSC stops mid-reasoning; refined MSC completes the derivation while preserving the original flow.

**Figure 11.** Refinement example: reasoning optimization. Raw MSC contains exploratory backtracking; refined MSC eliminates redundancy while maintaining the core logic.

Original abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines minimal sufficient CoT and builds a two-stage training pipeline around adaptive thresholds and sufficiency-aware RL to shorten reasoning traces.


The main point is that this work defines Minimal Sufficient CoT as the shortest prefix of a trace that still yields the correct answer, then trains models to produce traces that stop at that point.

The new element is the continuous, problem-dependent framing. Stage one uses difficulty-scaled thresholds to build MSC data for fine-tuning. Stage two adds RL with dynamic tracking and rewards that penalize both over- and under-thinking. This is a step past fixed budgets or discrete modes.

The paper does well at naming a real deployment cost in current LRMs and showing claimed gains in both accuracy and token count across math, code, and science benchmarks. The pipeline description is internally consistent.

Soft spots are in the empirical details. The abstract gives no information on how thresholds are chosen in practice, what baselines are used, or whether gains survive different random seeds and out-of-distribution cases. The RL reward design could still be sensitive to tuning even if the high-level logic holds.

This is for groups working on efficient inference for reasoning models. A serious editor should send it to peer review because the framing is clear and the problem is worth referee attention, even if the results need tighter validation.

Referee Report

2 major / 1 minor

Summary. The paper defines Minimal Sufficient CoT (MSC) as the shortest prefix of a Chain-of-Thought trajectory adequate for the correct answer. It proposes SuCo, a two-stage framework consisting of MSC-Aligned Fine-Tuning (MFT) that uses problem-adaptive sufficiency thresholds to construct training data and fine-tune for concise reasoning, followed by Sufficiency-Aware Policy Optimization (SAPO) that applies RL with dynamic complexity tracking and rewards penalizing both over- and under-thinking. The central claim is that this yields consistent gains in both accuracy and reasoning efficiency on mathematics, code, and science benchmarks.

Significance. If validated, the work supplies a continuous, sufficiency-based mechanism for adaptive reasoning length control that moves beyond discrete modes or fixed budgets, with potential to improve efficiency in LRMs while preserving performance across difficulty levels. The problem-adaptive thresholds and sufficiency-aware rewards constitute a coherent technical contribution to efficient reasoning training.

major comments (2)

[Abstract] Abstract: the claim that 'extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency' supplies no methods, baselines, datasets, error bars, or quantitative results, rendering the central empirical claim unevaluable.
[Methods (implied by pipeline description)] The construction of MSC data via problem-adaptive thresholds and the precise definition of sufficiency-aware rewards in SAPO are not specified, which is load-bearing for assessing whether the claimed internalization of concise patterns occurs without degrading harder problems.

minor comments (1)

[Abstract] The phrase 'problem-adaptive sufficiency thresholds that naturally scale with question difficulty' is used without a formal definition or illustrative example.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for highlighting areas where the presentation can be strengthened. We address each major comment below.


Referee: [Abstract] Abstract: the claim that 'extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency' supplies no methods, baselines, datasets, error bars, or quantitative results, rendering the central empirical claim unevaluable.

Authors: Abstracts are conventionally high-level summaries constrained by length, so they omit full methodological and quantitative details. The complete experimental protocol, baselines (vanilla CoT, length-regularized fine-tuning, budget-based methods), datasets (MATH, GSM8K, HumanEval, ScienceQA), and results with standard deviations across seeds appear in Sections 4 and 5. To improve standalone evaluability, we will revise the abstract to include representative quantitative outcomes (e.g., average accuracy delta and token reduction percentages). revision: yes
Referee: [Methods (implied by pipeline description)] The construction of MSC data via problem-adaptive thresholds and the precise definition of sufficiency-aware rewards in SAPO are not specified, which is load-bearing for assessing whether the claimed internalization of concise patterns occurs without degrading harder problems.

Authors: Section 3.1 defines problem-adaptive thresholds as the shortest prefix length at which prefix accuracy reaches 95% of full-CoT accuracy, scaled by a difficulty proxy obtained from an initial model rollout. Section 3.2 defines the SAPO reward as accuracy_reward − β·|length − MSC_length| + τ·complexity_match, where complexity is tracked by a learned estimator updated each episode. We will add explicit equations, an algorithm box, and worked examples to make these constructions fully reproducible and to permit direct evaluation of the claimed behavior on hard problems. revision: yes
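The reward quoted in the simulated rebuttal can be written as a short function. This is a sketch of that one-line formula only. The coefficient names `beta` and `tau`, their default values, the binary accuracy term, and the way `complexity_match` is supplied are all assumptions; because the rebuttal is simulated, the formula itself may not match what the paper specifies.

```python
def sapo_reward(
    is_correct: bool,
    length: int,
    msc_length: int,
    complexity_match: float,
    beta: float = 0.001,  # weight of the length-deviation penalty (assumed value)
    tau: float = 0.1,     # weight of the complexity-match bonus (assumed value)
) -> float:
    """accuracy_reward - beta * |length - msc_length| + tau * complexity_match."""
    accuracy_reward = 1.0 if is_correct else 0.0
    # The absolute value penalizes over- and under-thinking symmetrically.
    length_penalty = beta * abs(length - msc_length)
    return accuracy_reward - length_penalty + tau * complexity_match
```

Because the penalty depends on |length − MSC_length|, a response 20 tokens too long and one 20 tokens too short receive the same deduction, which is how a single term can discourage both over- and under-thinking.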

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The abstract defines MSC independently as the shortest adequate CoT prefix, then describes empirical construction of MSC data via problem-adaptive thresholds, followed by MFT and SAPO stages. No equations, reward definitions, or self-citations are present that reduce any claimed prediction or result to its own inputs by construction. The pipeline is presented as a logically coherent sequence of data construction and optimization steps whose validity rests on external benchmarks rather than internal redefinition or fitted renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities can be identified.



Reference graph

Works this paper leans on

77 extracted references · 1 canonical work pages

[2] Let's Verify Step by Step. The Twelfth International Conference on Learning Representations.
[4] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. The Thirteenth International Conference on Learning Representations.
[5] Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR).
[6] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. 2024.
[7] Llama-Nemotron: Efficient Reasoning Models. 2025.
[8] Open R1: A fully open reproduction of DeepSeek-R1.
[9] OpenR1-Math-220k. Hugging Face repository, 2025.
[11] s1: Simple test-time scaling. 2025.
[12] Qwen3 Technical Report. 2025.
[13] Think Only When You Need with Large Hybrid-Reasoning Models. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
[15] Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. AdaptThink: Reasoning Models Can Learn When to Think. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[20] Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.
[21] Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems.
[22] gpt-oss-120b & gpt-oss-20b Model Card. 2025.
[23] Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.
[24] Qwen Team. Qwen2.5: A Party of Foundation Models.
[27] Mz Dai, Chenxu Yang, and Qingyi Si. S-. 2025.
[30] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning. Second Conference on Language Modeling.
[31] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling. 2025.
[32] Deduplicating training data makes language models better. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[33] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024.
[38] T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling. Forty-second International Conference on Machine Learning.
[41] CyclicReflex: Improving Reasoning Models via Cyclical Reflection Token Scheduling. The Fourteenth International Conference on Learning Representations.
[42] Alphaone: Reasoning models thinking slow and fast at test time. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[44] Does thinking more always help? mirage of test-time scaling in reasoning models. Advances in Neural Information Processing Systems.
[45] Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant.
[46] Commonsenseqa: A question answering challenge targeting commonsense knowledge. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[47] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023.
[48] Jiahao Wang, Ramen Liu, Longhui Zhang, and Jing Li. System Report for CCL 25-Eval Task 10: SRAG-MAV for Fine-Grained Chinese Hate Speech Recognition. Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), 2025.
[50] Adler, B., Agarwal, N., Aithal, A., Anh, D. H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro, B., Clay, S., Cohen, J., et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704, 2024.
[51] Aggarwal, P. and Welleck, S. L1: Controlling how long a reasoning model thinks with reinforcement learning. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=4jdIxXBNve
[52] Ahmad, W. U., Narenthiran, S., Majumdar, S., Ficek, A., Jain, S., Huang, J., Noroozi, V., and Ginsburg, B. Opencodereasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025.
[53] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[54] Bercovich, A., Levy, I., Golan, I., Dabbah, M., El-Yaniv, R., Puny, O., Galil, I., Moshe, Z., Ronen, T., Nabwani, N., Shahaf, I., Tropp, O., Karpas, E., Zilberstein, R., Zeng, J., Singhal, S., Bukharin, A., Zhang, Y., Konuk, T., Shen, G., Mahabaleshwarkar, A. S., Kartal, B., Suhara, Y., Delalleau, O., Chen, Z., Wang, Z., Mosallanezhad, D., Renduchintala, ... arXiv, 2025.
[55] Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V., Ré, C., and Mirhoseini, A. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787, 2024.
[56] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[57] Dai, M., Yang, C., and Si, Q. S-GRPO: Early exit via reinforcement learning in reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=wNMK5o0Vfg
[58] Fan, C., Zhang, Y., Jia, J., Hero, A. O., and Liu, S. Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling. In The Fourteenth International Conference on Learning Representations, 2026.
[59] Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D., and Berant, J. Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies. Transactions of the Association for Computational Linguistics (TACL), 2021.
[60] Ghosal, S. S., Chakraborty, S., Reddy, A., Lu, Y., Wang, M., Manocha, D., Huang, F., Ghavamzadeh, M., and Bedi, A. S. Does thinking more always help? mirage of test-time scaling in reasoning models. Advances in Neural Information Processing Systems, 38:172664-172691, 2026.
[61] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[62] He, Q., Yuan, S., Li, X., Wang, M., and Chen, J. Thinkdial: An open recipe for controlling reasoning effort in large language models. arXiv preprint arXiv:2508.18773, 2025.
[63] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021a.
[64] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021b.
[65] Hou, B., Zhang, Y., Ji, J., Liu, Y., Qian, K., Andreas, J., and Chang, S. Thinkprune: Pruning long chain-of-thought of llms via reinforcement learning. arXiv preprint arXiv:2504.01296, 2025a.
[66] Hou, Z., Lv, X., Lu, R., Zhang, J., Li, Y., Yao, Z., Li, J., Tang, J., and Dong, Y. T1: Advancing language model reasoning through reinforcement learning and inference scaling. In Forty-second International Conference on Machine Learning, 2025b. URL https://openreview.net/forum?id=tnxONP8zTE
[67] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. URL https://github.com/huggingface/open-r1
[68] Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[69] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=chfJJYC3iL
[70] Jiang, L., Wu, X., Huang, S., Dong, Q., Chi, Z., Dong, L., Zhang, X., Lv, T., Cui, L., and Wei, F. Think only when you need with large hybrid-reasoning models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=fDjDVE4qdj
[71]

E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. R. SWE -bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66

2024
[72]

S., Reid, M., Matsuo, Y., and Iwasawa, Y

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35: 0 22199--22213, 2022

2022
[73]

Deduplicating training data makes language models better

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 8424--8445, 2022

2022
[74]

Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

2023
[75]

Let's verify step by step

Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

2024
[76]

Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning

Lou, C., Sun, Z., Liang, X., Qu, M., Shen, W., Wang, W., Li, Y., Yang, Q., and Wu, S. Adacot: Pareto-optimal adaptive chain-of-thought triggering via reinforcement learning. arXiv preprint arXiv:2505.11896, 2025

arXiv 2025
[77]

B., Penedo, G., Beeching, E., Gallouédec, Q., Habib, N., Tunstall, L., and von Werra, L

Lozhkov, A., Kydlíček, H., Allal, L. B., Penedo, G., Beeching, E., Gallouédec, Q., Habib, N., Tunstall, L., and von Werra, L. Openr1-math-220k. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025

2025
[78]

L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

Pith/arXiv arXiv 2025
[79]

gpt-oss-120b & gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

Pith/arXiv arXiv 2025
[80]

Qwen3 technical report, 2025

Qwen Team . Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025
[81]

L., Stickland, A

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA : A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

2024
[82]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[83]

V., Lee, J., Xu, K., and Kumar, A

Snell, C. V., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

2025
[84]

Stop overthinking: A survey on efficient reasoning for large language models

Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

2025
[85]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019

2019
[86]

System report for CCL25-eval task 10: SRAG-MAV for fine-grained Chinese hate speech recognition

Wang, J., Liu, R., Zhang, L., and Li, J. System report for CCL25-eval task 10: SRAG-MAV for fine-grained Chinese hate speech recognition. In Lin, H., Li, B., and Tan, H. (eds.), Proceedings of the 24th China National Conference on Computational Linguistics (CCL 2025), pp. 395–402, Jinan, China, August 2025a. Chinese Information Processing Socie...

2025
[87]

Thoughts are all over the place: On the underthinking of o1-like llms

Wang, Y., Liu, Q., Xu, J., Liang, T., Chen, X., He, Z., Song, L., Yu, D., Li, J., Zhang, Z., et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025b

2025
[88]

Chain-of-thought prompting elicits reasoning in large language models

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

2022
[89]

From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models

Wu, C., Li, B., Gao, M., and Wang, Z. From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models. arXiv preprint arXiv:2511.10788, 2025

2025
[90]

Towards large reasoning models: A survey of reinforced reasoning with large language models

Xu, F., Hao, Q., Zong, Z., Wang, J., Zhang, Y., Wang, J., Lan, X., Gong, J., Ouyang, T., Meng, F., et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

2025
[91]

Qwen2.5-math technical report: Toward mathematical expert model via self-improvement

Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

2024
[92]

Alphaone: Reasoning models thinking slow and fast at test time

Zhang, J., Dong, R., Wang, H., Ning, X., Geng, H., Li, P., He, X., Bai, Y., Malik, J., Gupta, S., et al. Alphaone: Reasoning models thinking slow and fast at test time. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 11340–11365, 2025a

2025
[93]

AdaptThink: Reasoning models can learn when to think

Zhang, J., Lin, N., Hou, L., Feng, L., and Li, J. AdaptThink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025b. URL https://aclanthology.org/2025.emnlp-main.184/

2025
[94]

Speed up your code: Progressive code acceleration through bidirectional tree editing

Zhang, L., Wang, J., Zhang, M., Cao, G., Shi, E., Ma, Y., Yu, J., Liu, H., Li, J., and Zhang, M. Speed up your code: Progressive code acceleration through bidirectional tree editing. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

doi:10.18653/v1/2025.acl-long.1387 2025
[95]

Tinyllama: An open-source small language model

Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

2024
[96]

Saber: Switchable and balanced training for efficient llm reasoning

Zhao, K., Zhao, Y., Song, J., He, S., Zhang, L., Zhang, Q., and Li, T. Saber: Switchable and balanced training for efficient llm reasoning. arXiv preprint arXiv:2508.10026, 2025

2025
[97]

A survey of large language models

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2), 2023

2023

[57] [77]

B., Penedo, G., Beeching, E., Gallouédec, Q., Habib, N., Tunstall, L., and von Werra, L

Lozhkov, A., Kydlíček, H., Allal, L. B., Penedo, G., Beeching, E., Gallouédec, Q., Habib, N., Tunstall, L., and von Werra, L. Openr1-math-220k. https://huggingface.co/datasets/open-r1/OpenR1-Math-220k, 2025

2025

[58] [78]

L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T

Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

Pith/arXiv arXiv 2025

[59] [79]

gpt-oss-120b & gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

Pith/arXiv arXiv 2025

[60] [80]

Qwen3 technical report, 2025

Qwen Team . Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

Pith/arXiv arXiv 2025

[61] [81]

L., Stickland, A

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA : A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

2024

[62] [82]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[63] [83]

V., Lee, J., Xu, K., and Kumar, A

Snell, C. V., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

2025

[64] [84]

Stop overthinking: A survey on efficient reasoning for large language models

Sui, Y., Chuang, Y.-N., Wang, G., Zhang, J., Zhang, T., Yuan, J., Liu, H., Wen, A., Zhong, S., Zou, N., et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419, 2025

Pith/arXiv arXiv 2025

[65] [85]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.\ 4149--4158, 2019

2019

[66] [86]

System report for CCL 25-eval task 10: SRAG - MAV for fine-grained C hinese hate speech recognition

Wang, J., Liu, R., Zhang, L., and Li, J. System report for CCL 25-eval task 10: SRAG - MAV for fine-grained C hinese hate speech recognition. In Lin, H., Li, B., and Tan, H. (eds.), Proceedings of the 24th C hina National Conference on Computational Linguistics ( CCL 2025) , pp.\ 395--402, Jinan, China, August 2025 a . Chinese Information Processing Socie...

2025

[67] [87]

Thoughts are all over the place: On the underthinking of o1-like llms

Wang, Y., Liu, Q., Xu, J., Liang, T., Chen, X., He, Z., Song, L., Yu, D., Li, J., Zhang, Z., et al. Thoughts are all over the place: On the underthinking of o1-like llms. arXiv preprint arXiv:2501.18585, 2025 b

arXiv 2025

[68] [88]

V., Zhou, D., et al

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 0 24824--24837, 2022

2022

[69] [89]

From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models

Wu, C., Li, B., Gao, M., and Wang, Z. From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models. arXiv preprint arXiv:2511.10788, 2025

arXiv 2025

[70] [90]

Towards large reasoning models: A survey of reinforced reasoning with large language models

Xu, F., Hao, Q., Zong, Z., Wang, J., Zhang, Y., Wang, J., Lan, X., Gong, J., Ouyang, T., Meng, F., et al. Towards large reasoning models: A survey of reinforced reasoning with large language models. arXiv preprint arXiv:2501.09686, 2025

Pith/arXiv arXiv 2025

[71] [91]

Qwen2.5-math technical report: Toward mathematical expert model via self-improvement

Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

Pith/arXiv arXiv 2024

[72] [92]

Alphaone: Reasoning models thinking slow and fast at test time

Zhang, J., Dong, R., Wang, H., Ning, X., Geng, H., Li, P., He, X., Bai, Y., Malik, J., Gupta, S., et al. Alphaone: Reasoning models thinking slow and fast at test time. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp.\ 11340--11365, 2025 a

2025

[73] [93]

A dapt T hink: Reasoning models can learn when to think

Zhang, J., Lin, N., Hou, L., Feng, L., and Li, J. A dapt T hink: Reasoning models can learn when to think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025 b . URL https://aclanthology.org/2025.emnlp-main.184/

2025

[74] [94]

Speed up your code: Progressive code acceleration through bidirectional tree editing

Zhang, L., Wang, J., Zhang, M., Cao, G., Shi, E., Ma, Y., Yu, J., Liu, H., Li, J., and Zhang, M. Speed up your code: Progressive code acceleration through bidirectional tree editing. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

work page doi:10.18653/v1/2025.acl-long.1387 2025

[75] [95]

Tinyllama: An open-source small language model

Zhang, P., Zeng, G., Wang, T., and Lu, W. Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

Pith/arXiv arXiv 2024

[76] [96]

Saber: Switchable and balanced training for efficient llm reasoning

Zhao, K., Zhao, Y., Song, J., He, S., Zhang, L., Zhang, Q., and Li, T. Saber: Switchable and balanced training for efficient llm reasoning. arXiv preprint arXiv:2508.10026, 2025

arXiv 2025

[77] [97]

X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. A survey of large language models. arXiv preprint arXiv:2303.18223, 1 0 (2), 2023

Pith/arXiv arXiv 2023