Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Aditi Raghunathan; Andrej Risteski; Chen Wu; Christina Baek; Guanning Zeng; Jingchu Gai; J. Zico Kolter

arxiv: 2605.24396 · v1 · pith:BPC7EWQNnew · submitted 2026-05-23 · 💻 cs.AI

Jingchu Gai, Guanning Zeng, Christina Baek, Chen Wu, J. Zico Kolter, Andrej Risteski, Aditi Raghunathan

Pith reviewed 2026-06-30 13:58 UTC · model grok-4.3

classification 💻 cs.AI

keywords premature confidence · chain-of-thought reasoning · reinforcement learning · LLM reasoning · progressive confidence shaping · reasoning faithfulness · process supervision


The pith

LLMs that commit to an answer early in chain-of-thought reasoning produce more errors, and a label-free RL objective that penalizes early commitment improves accuracy and faithfulness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies premature confidence—the tendency for models to settle on an answer early in a reasoning trace and then rationalize it with later tokens—as a strong predictor of flawed reasoning across tasks and model sizes. This pattern limits the value of longer chains of thought because the model is no longer updating its beliefs as new steps are generated. The authors introduce progressive confidence shaping, a reinforcement learning objective that rewards gradual increases in confidence during reasoning and penalizes early commitment, requiring no external labels or reward models. When applied, the method raises accuracy and reduces logical gaps on arithmetic, math, and science benchmarks while also making reasoning traces more transparent about misleading content. The correlation between confidence dynamics and reasoning quality holds from 1.5B to 8B parameters and grows stronger with task difficulty.

Core claim

Premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. Progressive confidence shaping is a reinforcement learning objective that rewards gradual confidence growth and penalizes early commitment, so that models update their confidence as they reason. It improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic, math, and science tasks without external labels or reward models.
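One way to make the premature/progressive distinction concrete is sketched below. The figure captions on this page refer to Spearman thresholds (ρ < τ versus ρ ≥ τ, with a threshold of 0.4 in one panel), so the sketch assumes ρ is the Spearman rank correlation between the reasoning-step index and the model's confidence at that step. That pairing, the function names, and the example trajectories are assumptions for illustration, not the paper's stated procedure.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of ties that starts at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # flat trajectory: treated here as "no monotone trend"
    return cov / (vx * vy) ** 0.5


def is_premature(confidences: Sequence[float], tau: float = 0.4) -> bool:
    """Hypothetical rule: premature if confidence does not rise with step index."""
    steps = list(range(len(confidences)))
    return spearman_rho(steps, confidences) < tau


# Illustrative trajectories (made up, not data from the paper).
progressive = [0.12, 0.30, 0.55, 0.80, 0.99]  # rises steadily
premature = [0.97, 0.98, 0.97, 0.98, 0.97]    # high from the first step
```

Under this rule a trace that is confident from the first step and merely fluctuates gets a low ρ and is flagged, while a steadily rising trace is not. How the per-step confidence values are obtained is left open here.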

What carries the argument

Progressive confidence shaping, a reinforcement learning objective that rewards gradual confidence growth during reasoning and penalizes early commitment.
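The reward formulation itself is not given on this page, so the following is only a toy sketch of what a shaping term with these two ingredients could look like. Every name, weight, and threshold (`early_fraction`, `early_cap`, `growth_weight`, `penalty_weight`) is hypothetical, and how such a term would be combined with any task reward is deliberately left out.

```python
from typing import Sequence


def confidence_shaping_term(
    confidences: Sequence[float],
    early_fraction: float = 0.25,  # hypothetical: share of steps counted as "early"
    early_cap: float = 0.5,        # hypothetical: early confidence tolerated for free
    growth_weight: float = 1.0,    # hypothetical weight on the growth bonus
    penalty_weight: float = 1.0,   # hypothetical weight on the early-commitment penalty
) -> float:
    """Toy shaping term: bonus for rising confidence, penalty for early certainty."""
    if len(confidences) < 2:
        return 0.0  # no trajectory to shape

    # Growth bonus: fraction of consecutive steps where confidence increases.
    pairs = list(zip(confidences, confidences[1:]))
    growth = sum(1 for a, b in pairs if b > a) / len(pairs)

    # Early-commitment penalty: mean early confidence above the cap.
    n_early = max(1, int(len(confidences) * early_fraction))
    early_mean = sum(confidences[:n_early]) / n_early
    early_excess = max(0.0, early_mean - early_cap)

    return growth_weight * growth - penalty_weight * early_excess


# Illustrative trajectories (made up, not data from the paper).
progressive_term = confidence_shaping_term([0.12, 0.30, 0.55, 0.80, 0.99])
premature_term = confidence_shaping_term([0.97, 0.98, 0.97, 0.98, 0.97])
```

With these made-up inputs the steadily rising trajectory scores higher than the one that starts near certainty, which is the qualitative behavior the description calls for; nothing here should be read as the authors' objective.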

If this is right

Accuracy on Countdown improves 3.2x with a 42 percentage point gain and flawed reasoning drops 48 percentage points.
On AIME, Pass@64 improves by 6.6 percentage points.
The method improves faithfulness on safety benchmarks by making models more likely to surface misleading content transparently rather than conceal it.
Both the prevalence of premature confidence and the size of the accuracy gains increase together with model scale and task difficulty.
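The two Countdown figures pin down the implied absolute accuracies, assuming both describe the same before/after comparison: if the baseline is b, then 3.2·b = b + 42, so b = 42 / 2.2. The short check below works this out; the function name is illustrative, and because the reported figures are rounded the results are approximate.

```python
def implied_accuracies(ratio: float, gain_pp: float) -> tuple:
    """Solve ratio * b = b + gain_pp for the baseline b (in percent)."""
    if ratio <= 1.0:
        raise ValueError("ratio must exceed 1 for a positive gain")
    baseline = gain_pp / (ratio - 1.0)
    return baseline, baseline + gain_pp


baseline, improved = implied_accuracies(3.2, 42.0)
print(f"baseline ~ {baseline:.1f}%, improved ~ {improved:.1f}%")
# prints "baseline ~ 19.1%, improved ~ 61.1%"
```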

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same confidence-shaping signal could be tested as a lightweight detector of reasoning errors at inference time without any retraining.
If the mechanism generalizes, progressive confidence shaping might reduce the need for expensive human-annotated process reward models in other sequential tasks such as code generation.
The approach could be extended to inference-time interventions that steer token generation toward slower confidence growth on difficult problems.
It remains open whether the same objective would help or hurt on tasks where early commitment is sometimes advantageous.

Load-bearing premise

That training models via reinforcement learning to grow their reported confidence gradually will improve actual reasoning quality rather than merely changing how confidence scores are reported.

What would settle it

A controlled experiment in which the RL-trained models display the targeted gradual confidence growth curves yet show no gains in accuracy or no reduction in logical gaps on held-out tasks would indicate that the intervention does not causally fix the reasoning flaws.

Figures

Figures reproduced from arXiv: 2605.24396 by Aditi Raghunathan, Andrej Risteski, Chen Wu, Christina Baek, Guanning Zeng, Jingchu Gai, J. Zico Kolter.

**Figure 1.** Overview. Left: a prematurely confident CoT with logical errors, where the answer is not derived from the reasoning. Middle: a progressively confident CoT with 0 errors; confidence rises from 12% to 99% as the model derives the answer. Right: our method penalizes premature CoT.
… intermediate points of its CoT, we can see that it often commits to an answer well before the reasoning chain is complete: the remainin…

**Figure 2.** Premature vs. progressive. (a) GPQA: issue proportion across Spearman thresholds τ (magenta: ρ < τ, cyan: ρ ≥ τ). (b) Avg. issue count, four benchmarks, ρ_thr = 0.4. (c) Same as (b), correct samples only.
… restrict the analysis to correctly answered samples only. The gap proportion difference persists: on CSQA at threshold 0.5, prematurely confident correct samples have a 12.5% issue rate versus 3.7% for p…

**Figure 3.** (no caption extracted)

**Figure 4.** SciQA evaluation (Qwen3). (a) Accuracy across model scales; SELF (Nguyen et al., 2025) is a concurrent baseline for hard reasoning. (b) Method comparison on 1.7B. (c) Logical shortcut proportion (fraction of samples with ≥1 shortcut) on 1.7B. Our method improves accuracy by up to +5.8pp and reduces shortcuts.
… 3.2.3 Evaluation on Math Reasoning. We evaluate on mathematical problem solving at two scales. For …

**Figure 5.** Math and safety evaluation. (a,b) Pass@K on AIME and HMMT (1.5B). (c) Pass@K on DAPO hard (7B). (d) Safety benchmark of Nguyen et al. (2025) (7B).
… 3.2.4 Evaluation on Safety Benchmark. Beyond accuracy, we ask whether progressive confidence shaping also produces more faithful CoT: models that more transparently surface external evidence influencing their answers. We evaluate the same Qwen2.5-Math-7B checkpoi…

**Figure 6.** Factor analysis across Countdown tasks of increasing difficulty.
… Effect of Task Difficulty on Premature Confidence. Based on the results in …

**Figure 7.** Premature confidence increases with model size (base Qwen3 models on SciQA Chemistry, before any RL training). At every threshold, the fraction of prematurely confident samples grows monotonically from 1.7B to 4B to 8B, for both all samples (a) and correct-only samples (b).
… Effect of Model Size on Premature Confidence. We examine whether premature confidence is an intrinsic property of model scale by eva…

**Figure 8.** Inner product quantification. No-gap proportion for prematurely confident (dark) vs. progressively confident (light) using inner product classification across Spearman thresholds (ρ = 0.4, 0.5, 0.6, 0.7).
… D Countdown Case Study: Detailed Results. D.1 Task Description. The Countdown task requires the model to find an arithmetic expression that equals a given target number, using a provided set of numbers exa…

Original abstract

Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from additional test-time compute. Improving reasoning quality directly would require process reward models, but the step-level annotations needed to train them are expensive and scarce. We find such a signal in how the model's confidence evolves during reasoning: premature confidence, the tendency to commit to an answer early and use the remaining tokens to rationalize it, strongly predicts flawed reasoning across tasks and model scales. We exploit this in progressive confidence shaping, a reinforcement learning objective that trains models to update their confidence as they reason rather than commit early -- rewarding gradual confidence growth and penalizing early commitment, with no external labels or reward models. The method improves accuracy and reasoning quality from 1.5B to 8B parameters across arithmetic (Countdown), math (DAPO, AIME), and science (ScienceQA): on Countdown, accuracy improves 3.2x (+42.0pp) and flawed reasoning drops 48pp; on AIME, Pass@64 improves 6.6pp. Consistent with this mechanism, the method also improves faithfulness: on a safety benchmark, our models more transparently surface misleading content in their reasoning traces rather than concealing it. Controlled experiments reveal that the problem and its remedy scale together: premature confidence grows with model size and task difficulty, and so do the gains from addressing it.
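For readers unfamiliar with the Pass@64 metric quoted above: Pass@k is commonly estimated from n sampled solutions, of which c are correct, as 1 − C(n−c, k) / C(n, k), the probability that at least one of k samples drawn without replacement is correct. Whether the paper uses this particular estimator is not stated here, so the sketch below is an assumption about the evaluation, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples, 1 correct, draw 2 -> 1 - C(3,2)/C(4,2) = 1 - 3/6
example = pass_at_k(n=4, c=1, k=2)
```

A single correct sample among many already lifts Pass@k well above Pass@1 when k is large, which is why gains at Pass@64 and gains in single-sample accuracy need not move together.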

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper links early overconfidence in CoT to flawed reasoning and uses a label-free RL objective to encourage gradual confidence growth, with large reported gains on math tasks, but the abstract leaves the causal mechanism unclear.


The core observation is that premature confidence in reasoning traces predicts errors across tasks and scales. The authors train with an RL objective that rewards gradual confidence increases instead of early commitment. This produces big accuracy lifts like +42pp on Countdown and +6.6pp Pass@64 on AIME, plus drops in flawed reasoning and better faithfulness on a safety benchmark.

The work does well by avoiding any need for process labels or reward models, which is practical. They document that both the problem and the gains scale with model size and difficulty, and the faithfulness result is a useful extra signal. The approach is straightforward to implement if the details hold.

The soft spot is the lack of visible controls for whether the RL actually changes the logical steps or mainly shifts when confidence gets verbalized. The abstract mentions accuracy and flawed-reasoning drops but gives no information on how flawed reasoning is scored, no ablations against confidence-matched baselines with unrelated rewards, and no step-level checks independent of language. That matches the stress-test concern, so the causal claim needs stronger evidence in the full paper.

This is for groups working on test-time scaling, CoT reliability, and label-efficient supervision in math and science domains. A reader focused on practical RL tweaks for reasoning would find the empirical scale worth discussing.

It deserves peer review because the label-free angle and reported effect sizes are large enough to merit referee scrutiny, even if the mechanism section will likely need expansion and ablations.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that premature confidence—early high-confidence commitment to an answer followed by rationalization—strongly predicts flawed reasoning in long CoT outputs across tasks and model scales. It introduces progressive confidence shaping, a label-free RL objective that rewards gradual confidence growth during reasoning and penalizes early commitment. Reported outcomes include a 3.2x accuracy gain (+42pp) and 48pp drop in flawed reasoning on Countdown, Pass@64 gains of 6.6pp on AIME, improvements on DAPO, ScienceQA, and a safety benchmark from 1.5B–8B models, with both the phenomenon and the remedy scaling with model size and difficulty.

Significance. If the causal mechanism is established, the work supplies a scalable, annotation-free route to better reasoning quality that exploits an observable internal signal rather than process supervision. The reported scaling behavior with model size and task difficulty, together with the faithfulness improvement on the safety benchmark, would be a useful empirical contribution to understanding CoT limitations.

major comments (3)

[Abstract] The central claim that premature confidence 'strongly predicts flawed reasoning' and that the RL intervention reduces it by 48pp rests on an unspecified measurement of flawed reasoning; without details on annotation protocol, inter-annotator agreement, or independence from confidence language itself, the correlation and the accuracy gains cannot be evaluated.
[Abstract] The reported accuracy improvements are presented without controls that isolate the proposed mechanism (e.g., an RL baseline that matches confidence schedules but uses unrelated rewards, or step-level correctness checks independent of verbalized confidence); this leaves open whether gains arise from altered logical content or only from changed confidence phrasing.
[Abstract] No information is supplied on statistical controls, hyperparameter robustness, or ablation studies (e.g., reward-function variants), which are load-bearing for the claim that the method improves reasoning quality rather than merely shifting reported confidence.

minor comments (1)

The abstract would benefit from a concise statement of how confidence is quantified (token-level probability, verbalized score, etc.) to allow readers to assess the RL objective.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We will revise the abstract to include concise summaries of the evaluation details, controls, and robustness checks while referencing the relevant manuscript sections for full protocols.


Referee: [Abstract] The central claim that premature confidence 'strongly predicts flawed reasoning' and that the RL intervention reduces it by 48pp rests on an unspecified measurement of flawed reasoning; without details on annotation protocol, inter-annotator agreement, or independence from confidence language itself, the correlation and the accuracy gains cannot be evaluated.

Authors: We agree the abstract should briefly describe the measurement. Section 3.2 and Appendix B detail that flawed reasoning is assessed by human annotators identifying logical gaps and unjustified leaps. Annotations use a standardized protocol with three independent annotators (Fleiss' κ = 0.82), and confidence-related phrases are masked to ensure independence from verbalized confidence. We will add a one-sentence summary of this protocol to the abstract.
Revision: yes

Referee: [Abstract] The reported accuracy improvements are presented without controls that isolate the proposed mechanism (e.g., an RL baseline that matches confidence schedules but uses unrelated rewards, or step-level correctness checks independent of verbalized confidence); this leaves open whether gains arise from altered logical content or only from changed confidence phrasing.

Authors: We acknowledge that an explicit baseline matching confidence schedules with unrelated rewards is not present. Our experiments do include standard RL baselines and reward variants that do not target confidence (Section 4.4), and final-answer accuracy is measured independently of phrasing. We will add explicit discussion of this limitation and, space permitting, include the suggested control in the revision or supplementary material.
Revision: partial

Referee: [Abstract] No information is supplied on statistical controls, hyperparameter robustness, or ablation studies (e.g., reward-function variants), which are load-bearing for the claim that the method improves reasoning quality rather than merely shifting reported confidence.

Authors: The manuscript reports standard errors across runs, ablation studies on reward-function variants (Section 4.3), and hyperparameter sensitivity (Appendix C). We will revise the abstract to reference these analyses and note that robustness checks support the mechanism beyond phrasing changes.
Revision: yes
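The simulated rebuttal quotes an inter-annotator agreement of Fleiss' κ = 0.82 for three annotators. As a reminder of what that statistic measures, here is a minimal sketch of the standard Fleiss' kappa computation from a table of per-item category counts. The function name and the toy rating table are illustrative assumptions; nothing here reproduces the annotation data referred to above.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n >= 2."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Per-item agreement: share of rater pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if 1.0 - p_expected < 1e-12:
        return 1.0  # degenerate case: all ratings fall in a single category
    return (p_bar - p_expected) / (1.0 - p_expected)


# Toy table: 4 items, 3 raters, categories = (flawed, not flawed).
toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]
toy_kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate agreement well beyond chance and values near 0 indicate chance-level agreement; the toy table above lands in between.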

Circularity Check

0 steps flagged

No circularity: empirical correlation drives RL objective without definitional reduction

full rationale

The paper identifies premature confidence via direct observation of token-level confidence evolution during CoT generation, then defines a reward for gradual growth in an RL objective. No equations or steps reduce a claimed prediction to a fitted input by construction, nor does any load-bearing premise rest on self-citation chains or imported uniqueness theorems. The method is presented as a label-free empirical intervention whose validity is checked against held-out accuracy and faithfulness metrics on multiple tasks and scales, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the central claim rests on an empirical correlation between confidence trajectories and reasoning quality plus the effectiveness of an unspecified RL reward formulation.

pith-pipeline@v0.9.1-grok · 5805 in / 1197 out tokens · 42175 ms · 2026-06-30T13:58:57.224416+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 22 canonical work pages · 16 internal anchors

[2] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[4] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025.
[5] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025.
[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[8] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[9] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.
[10] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
[11] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023.
[12] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, volume 2024, pp. 39578–39601, 2024.
[13] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
[14] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278–287. Citeseer, 1999.
[15] Phuc Minh Nguyen, Chinh D La, Duy MH Nguyen, Nitesh V Chawla, Binh T Nguyen, and Khoa D Doan. The reasoning boundary paradox: How reinforcement learning constrains language models. arXiv preprint arXiv:2510.02230, 2025.
[16] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021.
[17] Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. TinyZero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24.
[18] Jacob Pfau, William Merrill, and Samuel R Bowman. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024.
[19] Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025.
[20] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024.
[21] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
[22] Fabien Roger and Ryan Greenblatt. Preventing language models from hiding their reasoning. arXiv preprint arXiv:2310.18512, 2023.
[23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[24] Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In International Conference on Learning Representations, volume 2024, pp. 14670–14728, 2024.
[25] Zayne Sprague, Fangcong Yin, Juan Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. In International Conference on Learning Representations, volume 2025, pp. 94118–94162, 2025.
[26] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051, 2023.
[27] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, 2019.
[28] Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, and Himabindu Lakkaraju. On the hardness of faithful chain-of-thought reasoning in large language models. arXiv preprint arXiv:2406.10625, 2024.
[29] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023.
[30] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
[31] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426–9439, 2024.
[32] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[34] Zidi Xiong, Shan Chen, Zhenting Qi, and Himabindu Lakkaraju. Measuring the faithfulness of thinking drafts in large reasoning models. Advances in Neural Information Processing Systems, 38:139438–139467, 2026.
[35] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
[36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[37] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222–113244, 2026.
[38] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

[2] Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679, 2025

[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

[4] Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, and David Farhi. Monitoring reasoning models for misbehavior and the risks of promoting obfuscation. arXiv preprint arXiv:2503.11926, 2025

[5] Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, et al. Reasoning models don't always say what they think. arXiv preprint arXiv:2505.05410, 2025

[6] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

[8] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

[9] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024

[10] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199--22213, 2022

[11] Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023

[12] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In International Conference on Learning Representations, volume 2024, pp. 39578--39601, 2024

[13] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521, 2022

[14] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pp. 278--287. Citeseer, 1999

[15] Phuc Minh Nguyen, Chinh D La, Duy MH Nguyen, Nitesh V Chawla, Binh T Nguyen, and Khoa D Doan. The reasoning boundary paradox: How reinforcement learning constrains language models. arXiv preprint arXiv:2510.02230, 2025

[16] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. 2021

[17] Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. TinyZero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24

[18] Jacob Pfau, William Merrill, and Samuel R Bowman. Let's think dot by dot: Hidden computation in transformer language models. arXiv preprint arXiv:2404.15758, 2024

[19] Yuxiao Qu, Matthew YR Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, and Aviral Kumar. Optimizing test-time compute via meta reinforcement fine-tuning. arXiv preprint arXiv:2503.07572, 2025

[20] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[21] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023

[22] Fabien Roger and Ryan Greenblatt. Preventing language models from hiding their reasoning. arXiv preprint arXiv:2310.18512, 2023

[23] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[24] Zayne Sprague, Xi Ye, Kaj Bostrom, Swarat Chaudhuri, and Greg Durrett. MuSR: Testing the limits of chain-of-thought with multistep soft reasoning. In International Conference on Learning Representations, volume 2024, pp. 14670--14728, 2024

[25] Zayne Sprague, Fangcong Yin, Juan Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. In International Conference on Learning Representations, volume 2025, pp. 94118--94162, 2025

[26] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003--13051, 2023

[27] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149--4158, 2019

[28] Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, and Himabindu Lakkaraju. On the hardness of faithful chain-of-thought reasoning in large language models. arXiv preprint arXiv:2406.10625, 2024

[29] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952--74965, 2023

[30] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022

[31] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426--9439, 2024

[32] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[33] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824--24837, 2022

[34] Zidi Xiong, Shan Chen, Zhenting Qi, and Himabindu Lakkaraju. Measuring the faithfulness of thinking drafts in large reasoning models. Advances in Neural Information Processing Systems, 38:139438--139467, 2026

[35] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

[36] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[37] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale. Advances in Neural Information Processing Systems, 38:113222--113244, 2026

[38] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476--15488, 2022
