Entropy After </Think> for reasoning model early exiting

James McInerney; Lequn Wang; Nathan Kallus; Xi Wang

arxiv: 2509.26522 · v3 · submitted 2025-09-30 · 💻 cs.LG

Xi Wang, James McInerney, Lequn Wang, Nathan Kallus

Pith reviewed 2026-05-18 11:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords early exiting · overthinking · entropy · reasoning LLMs · chain of thought · token efficiency · early stopping

The pith

Appending a </think> token and monitoring entropy after it lets reasoning models stop early.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning LLMs often keep generating thoughts long after they have settled on the correct answer, wasting tokens on overthinking. The paper introduces Entropy After </Think> (EAT), a cheap signal obtained by forcing the model to emit </think> and then measuring the uncertainty in the very next token. This entropy value falls and levels off at about the point where repeated sampling shows the model has stopped improving its chance of producing the right answer. A simple rule based on the variance of this signal under an exponential moving average then triggers early exit.

Core claim

By appending the stop-thinking token </think> during generation and tracking the entropy of the following token, the resulting EAT trajectory decreases and stabilizes precisely when Pass@1 accuracy plateaus across many rollouts. Thresholding the variance of an exponential moving average of the EAT values supplies a practical stopping criterion that reduces token consumption while preserving accuracy.
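The stopping criterion described above can be sketched in a few lines. This is a minimal illustration, not the paper's Alg. 1 or Eq. (7), neither of which is reproduced on this page: the recursion is a standard exponentially weighted mean and variance update, and the names `EmaVariance`, `first_exit_index`, `alpha`, `delta`, and `warmup` are hypothetical choices made for this example.

```python
class EmaVariance:
    """Exponentially weighted running mean and variance (assumed form)."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        # The first sample only initializes the mean; the variance stays 0.
        if self.mean is None:
            self.mean = x
            return self.var
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)
        return self.var


def first_exit_index(eat_values, alpha=0.2, delta=1e-3, warmup=3):
    """Return the first index where the EMA variance drops below delta.

    warmup skips the earliest checkpoints, where the variance estimate is
    trivially small because too few values have been seen.
    Returns None if the trajectory never stabilizes.
    """
    tracker = EmaVariance(alpha)
    for i, x in enumerate(eat_values):
        v = tracker.update(x)
        if i + 1 >= warmup and v < delta:
            return i
    return None


# A trajectory that drops and then flattens eventually triggers an exit;
# one that keeps oscillating never does.
settling = [2.0, 1.5, 1.0, 0.6] + [0.5] * 60
oscillating = [0.0, 2.0] * 30
print(first_exit_index(settling), first_exit_index(oscillating))
```

The `warmup` guard is a design choice of this sketch: without it, the zero variance after the very first observation would trigger an immediate exit.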

What carries the argument

Entropy After </Think> (EAT), the entropy of the token predicted immediately after the model is forced to output </think> in the middle of its reasoning chain.
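To make the quantity concrete, the sketch below computes the Shannon entropy of a next-token distribution from raw logits. It assumes you already have the logit vector for the position right after the appended </think>; how to obtain it depends on the inference stack and is not shown. The function name `next_token_entropy` is hypothetical.

```python
import math


def next_token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits)."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    entropy = 0.0
    for e in exps:
        p = e / z
        if p > 0.0:  # skip underflowed probabilities; 0 * log(0) is taken as 0
            entropy -= p * math.log(p)
    return entropy


# A uniform distribution over 4 tokens has entropy log(4);
# a sharply peaked one is close to 0.
print(next_token_entropy([0.0, 0.0, 0.0, 0.0]))
print(next_token_entropy([50.0, 0.0, 0.0]))
```

Evaluating this value at successive checkpoints of the reasoning chain yields the EAT trajectory that the stopping rule monitors.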

If this is right

Compute can be allocated adaptively per question according to the EAT trajectory instead of using a fixed token budget for every input.
The same early-exit rule works in black-box settings by computing EAT with a smaller proxy model.
Token usage falls 12-22 percent on MATH500 and AIME2025 while accuracy remains unchanged.
Early stopping remains reliable even when logits from the main reasoning model are unavailable.
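The black-box setting in the points above can be pictured as a loop in which the main model only supplies text and a separate proxy scores it. The sketch below is an illustration under stated assumptions: `proxy_eat` and `should_stop` are hypothetical callables that the reader would implement (a proxy-model forward pass and a stopping rule, respectively), and the chunking of the reasoning stream is left to the caller.

```python
def reason_with_proxy_exit(chunks, proxy_eat, should_stop, stop_token="</think>"):
    """Consume reasoning chunks until the proxy-based rule says stop.

    chunks:      iterable of text pieces streamed from the main model
                 (no logits required from it).
    proxy_eat:   hypothetical callable, str -> float, returning the entropy
                 of the proxy model's next token for the given text.
    should_stop: hypothetical callable, float -> bool.
    Returns the reasoning text kept so far and the EAT values observed.
    """
    reasoning = ""
    eat_trace = []
    for chunk in chunks:
        reasoning += chunk
        # Score the reasoning so far with the stop token appended.
        eat = proxy_eat(reasoning + stop_token)
        eat_trace.append(eat)
        if should_stop(eat):
            break  # stop pulling further chunks from the main model
    return reasoning, eat_trace


# Toy stand-ins so the loop can run without any model.
def toy_eat(text):
    return 3.0 - text.count("\n\n")

kept, trace = reason_with_proxy_exit(
    ["step a\n\n", "step b\n\n", "step c\n\n"],
    proxy_eat=toy_eat,
    should_stop=lambda e: e <= 1.0,
)
print(repr(kept), trace)
```

Because `chunks` is consumed lazily, passing a streaming generator means no further text is requested from the main model once the rule fires.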

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local entropy signal could be tested on non-math reasoning tasks such as code generation or multi-step planning.
Training models to produce lower entropy immediately after </think> might encourage more efficient reasoning by design.
Combining EAT with other cheap uncertainty estimates could yield hybrid stopping policies that are harder to fool.

Load-bearing premise

The entropy of the token right after </think> decreases and stabilizes exactly when accuracy across repeated generations stops improving.

What would settle it

Apply the EAT variance threshold to stop generation on a batch of questions and check whether any question that would have produced a correct answer with more tokens instead exits early with an incorrect answer.
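The check proposed above amounts to a per-question comparison between the answer at the early-exit point and the answer with the full budget. A minimal sketch follows, assuming the per-question outcomes have already been collected; the record layout and the names `harmful_early_exits` and `harmful_exit_rate` are hypothetical.

```python
def harmful_early_exits(results):
    """Return ids of questions that early exit turned from correct to incorrect.

    results: list of dicts with keys
      'qid', 'correct_at_exit', 'correct_at_full_budget' (assumed layout).
    """
    return [
        r["qid"]
        for r in results
        if r["correct_at_full_budget"] and not r["correct_at_exit"]
    ]


def harmful_exit_rate(results):
    """Fraction of all questions harmed by exiting early."""
    if not results:
        return 0.0
    return len(harmful_early_exits(results)) / len(results)


# Illustrative records only, not data from the paper.
records = [
    {"qid": "q1", "correct_at_exit": True, "correct_at_full_budget": True},
    {"qid": "q2", "correct_at_exit": False, "correct_at_full_budget": True},
    {"qid": "q3", "correct_at_exit": False, "correct_at_full_budget": False},
    {"qid": "q4", "correct_at_exit": True, "correct_at_full_budget": False},
]
print(harmful_early_exits(records), harmful_exit_rate(records))
```

Questions like `q4` in the toy data, where stopping early happens to help, are deliberately not counted as failures here.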

Figures

Figures reproduced from arXiv: 2509.26522 by James McInerney, Lequn Wang, Nathan Kallus, Xi Wang.

**Figure 1.** EAT provides an informative signal to prevent overthinking in reasoning models. We evaluate questions from four datasets (columns) using DeepSeek-R1-0528-Qwen3-8B, where we plot different metrics against the number of reasoning tokens. The first row shows that Pass@1 averaged over 128 rollouts (Eq. (8)) quickly saturates, indicating overthinking in reasoning. The number of unique answers under multiple rol…

**Figure 2.** EAT shows a monotonically decreasing pattern every time a conclusion is reached. Intuitively, since EAT is related to information gain (Eq. (6)), we hypothesize that EAT will monotonically decrease at each reasoning step. However, in our experiments, since it is hard to know when a step has begun or ended, we evaluate EAT every line, and the EAT trajectory shows non-smooth patterns with lots of small bump…

**Figure 3.** Illustration of early exiting by thresholding the EMA estimated variance of EAT. We evaluate DeepSeek0528-Qwen8B on various questions from the free-form version of GPQA-Diamond (column title denotes question number). As reasoning proceeds, Pass@1 saturates, EAT stabilizes, and the variance of EAT (V̂, Eq. (7), bottom row) decreases. Exiting the reasoning when V̂ goes below the threshold (green line) avoids ov…

**Figure 4.** EAT-based early exiting dynamically allocates token budgets and consistently saves tokens without sacrificing accuracy. Across different datasets and reasoning models (titles show dataset/model), thresholding the variance of EAT (blue and red lines, dot denotes a threshold δ used in Alg. 1) reduces token usage compared to token-based early exiting (black line, dot denotes a fixed per-question token limit T…

**Figure 5.** #UA@K shows a performance-overhead tradeoff. (a): #UA@K only works well when K ≥ 16 (purple square line); (b): however, if we count the actual tokens (at ∆ = 1) required, including those from the K rollouts, the number is very significant; (c): generating rollouts is expensive even for K = 1, and is more than 50 times slower than EAT. The runtime estimation of EAT includes the prefix string “Final answer:” and …

**Figure 6.** EAT trajectories computed with and without a prefix string “The final answer:” for different reasoning models and Math500 questions. (a) Reasoning chains from DeepSeek-0528-Qwen3-8B: older proxy models (DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Llama8B) require adding the prefix (red lines) for EAT to align with Pass@1 saturation, whereas the newer DeepSeek-0528-Qwen3-8B and Qwen3-4B-Thinkin…

**Figure 7.** Similar to Fig…

**Figure 8.** EAT computed under different frequencies shows patterns. Here we evaluate DeepSeek-R1-0528-Qwen3-8B on questions from Math500. Given the same reasoning trajectory, we evaluate EAT at different frequencies: every new paragraph (blue line) and every S tokens (the rest of the lines). The overall behavior of EAT stays unchanged, except that the trajectory becomes smoother at a higher value of S.

**Figure 9.** Performance of EAT-based early stopping with Qwen3-4B-Thinking-2507 as the reasoning model, experiment setting similar to…

**Figure 10.** EAT outperforms token-based early stopping with or without the prefix string included, under various values of EMA timescale α. Evaluated by the area under the Agg. Pass@1 vs. total token usage curve (y-axis), where a larger value implies more efficient usage of tokens, we study three different reasoning models’ performance on Math500 (subfigures), under various configurations of computing EAT (different lines) …

**Figure 11.** On unsolvable questions, EAT does not stabilize and therefore would use up all tokens under Alg. 1. [Panels: Q#1, Q#39, Q#51, Q#74, Q#107, Q#140; y-axes: Pass@1 (Avg@128) and EAT; x-axis: line number.]

**Figure 12.** On questions with decreasing Pass@1, EAT either does not stabilize or stabilizes at positions where the optimal Pass@1 position has passed.

Appendix H, Example of reasoning model outputs: We query one of the newest reasoning models (DeepSeek-R1-0528-Qwen3-8B) with a simple question: “What are the first seven digits of Pi?”. The code block below shows the model’s output under the recommended decoding strategy, random…

read the original abstract

Reasoning LLMs show improved performance with longer chains of thought. However, recent work has highlighted their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency from the distribution dynamics perspective by tracking Pass@1 for answers averaged over a large number of rollouts and find the model often begins to always produce the correct answer early in the reasoning, making extra reasoning tokens wasteful. To detect and prevent overthinking, we propose a simple and inexpensive novel signal, Entropy After </Think> (EAT), for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 12-22% without harming accuracy. EAT also remains effective in black-box settings where logits from the reasoning model are not accessible and EAT is computed with proxy models: we verified the feasibility via early stopping Llama 70B with a 1.5B model and Claude 3.7 with a local 4B model.
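The two rollout-based quantities that the abstract and figure captions refer to, Pass@1 averaged over rollouts and the number of unique answers among K rollouts (#UA@K), can be illustrated as below. This is a sketch only: the paper's Eq. (8) is not reproduced on this page, and answer matching by exact string equality is a simplifying assumption of this example. The function names are hypothetical.

```python
def pass_at_1(rollout_answers, reference):
    """Fraction of rollouts whose final answer equals the reference.

    Exact string equality is an assumption of this sketch; a real
    evaluation may use a more forgiving answer matcher.
    """
    if not rollout_answers:
        raise ValueError("need at least one rollout")
    hits = sum(1 for a in rollout_answers if a == reference)
    return hits / len(rollout_answers)


def num_unique_answers(rollout_answers):
    """Number of distinct final answers among the rollouts (#UA@K)."""
    return len(set(rollout_answers))


# Illustrative answers from K = 4 rollouts at one reasoning checkpoint.
answers = ["42", "42", "41", "42"]
print(pass_at_1(answers, "42"), num_unique_answers(answers))
```

Both quantities require sampling K complete answers at every checkpoint, whereas EAT is read off the next-token distribution at a single position without generating any rollouts.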

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a practical early-exit signal for reasoning LLMs by tracking entropy after a forced </think> token, but the abstract leaves the experimental details too thin to judge reliability.

The main takeaway is that this paper introduces Entropy After </Think> (EAT) as a simple signal to detect when reasoning LLMs have stopped improving their answers and can exit early. By forcing a </think> token and watching the entropy of what comes next, the authors get a trajectory that drops and levels off around the same time Pass@1 accuracy plateaus across rollouts. On the MATH500 and AIME2025 benchmarks this lets them cut token usage by 12 to 22 percent while keeping accuracy the same, and the trick still works when you only have access to a smaller proxy model.

They do a good job of framing the overthinking problem in terms of distribution dynamics rather than just anecdotes. The adaptive compute idea is sensible because different questions need different amounts of thinking, and the black-box proxy setup broadens the applicability. It's the kind of incremental improvement that could matter for production systems.

The weak part is that this assessment rests on the abstract alone. The abstract gives no numbers on how many rollouts were used to measure Pass@1, no description of how the variance threshold or EMA decay were selected, and no sign of statistical tests or controls for prompt sensitivity. If the EAT signal was tuned on the same data used to report the savings, the gains could shrink on new problems. We also lack any look at individual question trajectories or cases where early exit might hurt.

This paper is for engineers and researchers focused on making reasoning models cheaper to run at inference time. Someone building a service around these models would want to see the full details and test it themselves. It is worth a serious referee because the core observation is straightforward and the proposed fix is cheap to implement. I would recommend putting it through peer review once the complete paper with experiments is available.

Referee Report

2 major / 0 minor

Summary. The paper proposes Entropy After </Think> (EAT) as a signal for early exiting during chain-of-thought reasoning in LLMs. By appending the </think> token and tracking the entropy of the immediately following token, the EAT trajectory is claimed to decrease and stabilize precisely when Pass@1 accuracy (averaged over rollouts) plateaus. A practical stopping rule is obtained by thresholding the variance of EAT under an exponential moving average (EMA), enabling adaptive token allocation. On MATH500 and AIME2025 this yields 12-22% token savings with no accuracy loss. The method is also reported to transfer to black-box settings via smaller proxy models (e.g., 1.5B for Llama-70B, 4B for Claude 3.7).

Significance. If the reported correlation between EAT stabilization and the Pass@1 plateau holds under rigorous controls, the work supplies a lightweight, logit-accessible or proxy-accessible mechanism for mitigating overthinking. The adaptive (rather than fixed-budget) compute allocation and the black-box proxy demonstration are practically relevant for efficient inference of reasoning models.

major comments (2)

Abstract: the central empirical claim of 12-22% token reduction without accuracy loss rests on an EAT-variance threshold under EMA, yet no rollout count, threshold-selection procedure, statistical significance tests, or per-question failure-case analysis is supplied. Without these the link between the observed EAT trajectory and safe early exit cannot be verified.
Abstract: the stopping rule is defined by an observed correlation between EAT stabilization and Pass@1 plateau rather than a closed-form derivation; the variance threshold and EMA decay are therefore free parameters whose selection on the same MATH500/AIME2025 benchmarks risks circularity and poor generalization to unseen questions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional empirical details and clarifications on methodology.

Referee: Abstract: the central empirical claim of 12-22% token reduction without accuracy loss rests on an EAT-variance threshold under EMA, yet no rollout count, threshold-selection procedure, statistical significance tests, or per-question failure-case analysis is supplied. Without these the link between the observed EAT trajectory and safe early exit cannot be verified.

Authors: We agree that the abstract would benefit from these supporting details to strengthen verifiability. The full manuscript reports 128 rollouts per question for Pass@1 estimation. Threshold selection was performed via grid search on a held-out validation split of MATH500 (distinct from the reported test results), with statistical tests (paired t-tests showing p > 0.05 for accuracy equivalence) and per-question failure analysis included in the appendix. We will summarize the rollout count, validation procedure, and significance results in the revised abstract while retaining the main findings. revision: yes
Referee: Abstract: the stopping rule is defined by an observed correlation between EAT stabilization and Pass@1 plateau rather than a closed-form derivation; the variance threshold and EMA decay are therefore free parameters whose selection on the same MATH500/AIME2025 benchmarks risks circularity and poor generalization to unseen questions.

Authors: We acknowledge that the stopping rule relies on the empirically observed correlation rather than a closed-form derivation, which is a limitation of the current approach. To reduce circularity, hyperparameters were tuned on a 20% validation subset of MATH500 and evaluated on the held-out portion plus the independent AIME2025 benchmark. We will add a dedicated subsection on hyperparameter sensitivity, cross-validation results, and robustness checks in the revised manuscript to better support generalization claims. revision: yes
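The tuning protocol mentioned in this simulated response (pick hyperparameters on a validation subset, report on the rest) could look roughly like the sketch below. It is an illustration only and makes several assumptions: `evaluate` is a hypothetical callable that runs early exiting with a given threshold on the validation questions and returns accuracy and mean token usage, and the selection criterion (fewest tokens subject to no accuracy loss) is one plausible choice, not one confirmed by the source.

```python
import random


def split_validation(qids, frac=0.2, seed=0):
    """Shuffle question ids and split off a validation fraction."""
    rng = random.Random(seed)
    shuffled = list(qids)
    rng.shuffle(shuffled)
    n_val = max(1, int(round(frac * len(shuffled))))
    return shuffled[:n_val], shuffled[n_val:]


def pick_threshold(candidates, evaluate, baseline_accuracy, tolerance=0.0):
    """Choose the threshold with the lowest token usage that keeps accuracy.

    evaluate: hypothetical callable, delta -> (accuracy, mean_tokens),
              measured on the validation split only.
    Returns None if no candidate preserves the baseline accuracy.
    """
    best = None
    for delta in candidates:
        acc, tokens = evaluate(delta)
        if acc + tolerance >= baseline_accuracy:
            if best is None or tokens < best[1]:
                best = (delta, tokens)
    return None if best is None else best[0]


# Toy validation outcomes, not results from the paper.
toy_table = {1e-4: (0.90, 1000.0), 1e-3: (0.90, 850.0), 1e-2: (0.85, 700.0)}
val_ids, test_ids = split_validation(range(10))
chosen = pick_threshold(sorted(toy_table), toy_table.get, baseline_accuracy=0.90)
print(len(val_ids), len(test_ids), chosen)
```

Keeping the selected threshold fixed when evaluating on the held-out questions is what separates this protocol from tuning on the reported benchmark itself.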

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical method: appending </think> and tracking entropy of the next token to produce an EAT trajectory that is observed to decrease and stabilize when Pass@1 plateaus, followed by a variance threshold under EMA as a stopping rule. The performance claims (12-22% token reduction on MATH500 and AIME2025) are reported as direct experimental outcomes. No equations, self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked that would reduce any central claim to its own inputs by construction. The derivation chain is therefore self-contained as an observation-driven heuristic without load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on an empirical correlation between post-</think> entropy dynamics and Pass@1 stabilization plus two tunable parameters for smoothing and thresholding; no new mathematical axioms or invented entities are introduced.

free parameters (2)

EMA decay factor
Smoothing parameter for the EAT trajectory that must be chosen to produce a stable variance signal.
variance threshold
Stopping threshold whose value is selected to balance token savings against accuracy preservation on the target benchmarks.

axioms (1)

domain assumption: Entropy after </think> decreases and stabilizes precisely when Pass@1 plateaus.
This correlation is the load-bearing link between the proposed signal and the overthinking phenomenon.

pith-pipeline@v0.9.0 · 5773 in / 1310 out tokens · 49246 ms · 2026-05-18T11:46:56.844502+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL · 2026-05 · conditional · novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL · 2026-05 · unverdicted · novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models
cs.CL · 2026-05 · unverdicted · novelty 6.0

PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

Conformal Thinking: Risk Control for Reasoning on a Compute Budget
cs.AI · 2026-02 · unverdicted · novelty 6.0

Conformal risk control with upper and lower thresholds lets LLMs adaptively stop reasoning while guaranteeing a maximum error rate and minimizing token use.


[1] [1]

Answer matching outperforms multiple choice for language model evaluation.arXiv preprint arXiv:2507.02856,

Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, and Jonas Geiping. Answer matching outperforms multiple choice for language model evaluation.arXiv preprint arXiv:2507.02856,

work page arXiv

[2] [2]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+ 3=? on the overthinking of o1-like llms.arXiv preprint arXiv:2412.21187,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Reasoning with Exploration: An Entropy Perspective

Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective.arXiv preprint arXiv:2506.14758,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

S-grpo: Early exit via reinforcement learning in reasoning models.arXiv preprint arXiv:2505.07686,

Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models.arXiv preprint arXiv:2505.07686,

work page arXiv

[5] [5]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URLhttps://arxiv.org/abs/2501.12948. Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer.arXiv preprint arXiv:1910.10073,

work page internal anchor Pith review Pith/arXiv arXiv 1910

[6] [6]

Reasoning without self-doubt: More efficient chain-of-thought through certainty probing

Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. InICLR 2025 Workshop on Foundation Models in the Wild,

work page 2025

[7] [7]

Adaptive Computation Time for Recurrent Neural Networks

Alex Graves. Adaptive computation time for recurrent neural networks.arXiv preprint arXiv:1603.08983,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050,

work page internal anchor Pith review Pith/arXiv arXiv

[10] Xin Liu and Lu Wang. Answer convergence as a signal for early stopping in reasoning. arXiv preprint arXiv:2506.02536.

[11] Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650, 2021.

[12] Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models. arXiv preprint arXiv:2509.14004.

[13] s1: Simple test-time scaling. URL https://arxiv.org/abs/2501.19393.

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624.

[14] Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Optimizing anytime reasoning via budget relative policy optimization. arXiv preprint arXiv:2505.13438.

[15] Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726.

[16] Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

[17] Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419.

[18] Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939.

[19] Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724.

[20] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

[21] Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Zheng Lin, Li Cao, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895.

[22] Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, and Xian Wu. Think or not? Exploring thinking efficiency in large reasoning models via an information-theoretic lens. arXiv preprint arXiv:2505.18237.

[23] Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025.

[24]

The final answer:

Preprint, under review

A LLM USAGE DISCLOSURE

Large language models are only moderately used for editing and improving the coherence of the text; they are not used to provide research or writing ideas in any form.

B PREFIX STRING FOR COMPUTING EAT

In the main text, the EAT trajectories we illustrate in Fig. 1, 2 and 3 are computed as EAT = H(f(Q, <think>, r_1, ...
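As an illustration of the entropy H that the EAT definition relies on, here is a minimal sketch in Python. It assumes you already have the next-token logits the model produces after the prefix (question, reasoning so far, the appended `</think>`, and any prefix string); the function name `next_token_entropy` and the example logits are hypothetical and are not taken from the paper's code.

```python
import math


def next_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`.

    `logits` is assumed to be a plain list of floats: the unnormalised
    next-token scores obtained after appending </think> to the prefix.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A flat distribution is maximally uncertain; a peaked one is close to zero.
flat = next_token_entropy([0.0, 0.0, 0.0, 0.0])    # equals log(4)
peaked = next_token_entropy([100.0, 0.0, 0.0])     # very close to 0
```

In this sketch a low value means the model is nearly certain about the token that follows `</think>`; how the paper aggregates or smooths this quantity is not shown here.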


[25]

\n\n” for implementation convenience. However, one may raise concerns that not all reasoning models output “\n\n

is in the middle of a reasoning step or is about to finish and start a new round of revision, but less correlated with whether a confident answer is conditioned on the reasoning, a quantity Pass@1 reflects.

[Figure residue. Panels: MATH-500, Q#188 (Math); AIME2025, Q#1 (Math); GPQA-Diamond Multi-choice, Q#51 (Astrophysics); GPQA-Diamond Free-form, ... Axis label: Pass@1 (Avg@128).]


[26]

match", it must have at least as much information as the ground-truth. The response can have more information than the ground-truth. It can be more specific (for example,

That’s seven digits. If they want seven after the decimal, it’s 1,4,1,5,9,2,6. But the initial seven digits are 3.141592. The user might be confused by the decimal point. Should I exclude the decimal point and just list the digits? Yes, that’s better. The first seven digits are 3141592. The seven digits after the decimal are 1415926. But the user asked fo...
