
arxiv: 2509.26522 · v3 · submitted 2025-09-30 · 💻 cs.LG

Entropy After </Think> for reasoning model early exiting

Pith reviewed 2026-05-18 11:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords: early exiting, overthinking, entropy, reasoning LLMs, chain of thought, token efficiency, early stopping

The pith

Appending a </think> token and monitoring entropy after it lets reasoning models stop early.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reasoning LLMs often keep generating thoughts long after they have settled on the correct answer, wasting tokens on overthinking. The paper introduces Entropy After </Think> (EAT) as a cheap signal obtained by forcing the model to emit </think> and then measuring the uncertainty in the very next token. This entropy value falls and levels off at the same point at which repeated sampling shows the model has stopped improving its chance of producing the right answer. A simple rule based on the variance of an exponential moving average of this signal then triggers early exit.

Core claim

By appending the stop-thinking token </think> during generation and tracking the entropy of the following token, the resulting EAT trajectory decreases and stabilizes precisely when Pass@1 accuracy plateaus across many rollouts. Thresholding the variance of an exponential moving average of the EAT values supplies a practical stopping criterion that reduces token consumption while preserving accuracy.

What carries the argument

Entropy After </Think> (EAT), the entropy of the token predicted immediately after the model is forced to output </think> in the middle of its reasoning chain.
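
A minimal sketch of how one EAT probe could be computed, assuming access to a function that returns next-token logits for a text prefix. The name `next_token_logits` is hypothetical and stands in for whatever model or proxy model supplies the logits; the paper's exact formulation (including the optional prefix string mentioned in Figure 6) is not reproduced here.

```python
import math
from typing import Callable, Sequence

STOP_THINKING = "</think>"  # stop-thinking token named in the paper


def entropy_from_logits(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def eat(partial_reasoning: str,
        next_token_logits: Callable[[str], Sequence[float]]) -> float:
    """One EAT probe: append </think>, then measure next-token entropy.

    `next_token_logits` is a hypothetical callable (an assumption of this
    sketch), mapping a text prefix to a vector of logits.
    """
    return entropy_from_logits(next_token_logits(partial_reasoning + STOP_THINKING))
```

The probe does not alter the reasoning chain itself: the appended token is only used to read off a distribution, after which generation can continue from the original prefix.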

If this is right

  • Compute can be allocated adaptively per question according to the EAT trajectory instead of using a fixed token budget for every input.
  • The same early-exit rule works in black-box settings by computing EAT with a smaller proxy model.
  • Token usage falls 12-22 percent on MATH500 and AIME2025 while accuracy remains unchanged.
  • Early stopping remains reliable even when logits from the main reasoning model are unavailable.
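
To make the token-savings bullet concrete, here is a hedged sketch of how a relative saving could be measured against a baseline that lets every question run to its full reasoning length. The function name and the accounting are assumptions of this sketch; in particular it ignores any overhead from the EAT probes themselves, and the paper's own accounting may differ.

```python
from typing import Sequence


def token_savings(exit_tokens: Sequence[int], full_tokens: Sequence[int]) -> float:
    """Fraction of reasoning tokens saved by early exit.

    exit_tokens[i]: tokens generated for question i before the exit rule fired.
    full_tokens[i]: tokens the same question uses without early exit.
    """
    if len(exit_tokens) != len(full_tokens):
        raise ValueError("one exit position is needed per question")
    # An exit can never cost more than running to completion.
    used = sum(min(e, f) for e, f in zip(exit_tokens, full_tokens))
    total = sum(full_tokens)
    return 1.0 - used / total
```

Because the exit position varies per question, the saving is an aggregate over the benchmark rather than a fixed per-question cut.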

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local entropy signal could be tested on non-math reasoning tasks such as code generation or multi-step planning.
  • Training models to produce lower entropy immediately after </think> might encourage more efficient reasoning by design.
  • Combining EAT with other cheap uncertainty estimates could yield hybrid stopping policies that are harder to fool.

Load-bearing premise

The entropy of the token right after </think> decreases and stabilizes exactly when accuracy across repeated generations stops improving.

What would settle it

Apply the EAT variance threshold to stop generation on a batch of questions and check whether any question that would have produced a correct answer with more tokens instead exits early with an incorrect answer.
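
The check described above can be sketched as a simple audit over per-question outcomes. The `Outcome` record and its field names are hypothetical; how correctness is judged at each point (for example, majority answer over rollouts) is left to the evaluator.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Outcome:
    question_id: str
    correct_at_exit: bool   # answer produced when the EAT rule stopped reasoning
    correct_at_full: bool   # answer produced when reasoning ran to completion


def harmful_exits(outcomes: Iterable[Outcome]) -> List[str]:
    """IDs of questions that early exit turned from correct into incorrect."""
    return [o.question_id for o in outcomes
            if o.correct_at_full and not o.correct_at_exit]
```

An empty result on a held-out batch would support the load-bearing premise; any listed question is a direct counterexample worth inspecting.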

Figures

Figures reproduced from arXiv: 2509.26522 by James McInerney, Lequn Wang, Nathan Kallus, Xi Wang.

Figure 1: EAT provides an informative signal to prevent overthinking in reasoning models. We evaluate questions from four datasets (columns) using DeepSeek-R1-0528-Qwen3-8B, where we plot different metrics against the number of reasoning tokens. The first row shows that Pass@1 averaged over 128 rollouts (Eq. (8)) quickly saturates, indicating overthinking in reasoning. The number of unique answers under multiple rol…
Figure 2: EAT shows a monotonically decreasing pattern every time a conclusion is reached. Intuitively, since EAT is related to information gain (Eq. (6)), we hypothesize that EAT will monotonically decrease at each reasoning step. However, in our experiments, since it is hard to know when a step has begun or ended, we evaluate EAT every line, and the EAT trajectory shows non-smooth patterns with lots of small bump…
Figure 3: Illustration of early exiting by thresholding the EMA estimated variance of EAT. We evaluate DeepSeek0528-Qwen8B on various questions from free-form version of GPQA-Diamond (column title denotes question number). As reasoning proceeds, Pass@1 saturates, EAT stabilizes, and the variance of EAT (V̂, Eq. (7), bottom row) decreases. Exiting the reasoning when V̂ goes below the threshold (green line) avoids ov…
Figure 4: EAT-based early exiting dynamically allocates token budgets and consistently saves tokens without sacrificing accuracy. Across different datasets and reasoning models (titles show dataset/model), thresholding the variance of EAT (blue and red lines, dot denotes a threshold δ used in Alg. 1) reduces token usage compared to token-based early exiting (black line, dot denotes a fixed per-question token limit T…
Figure 5: #UA@K shows performance-overhead tradeoff (a): #UA@K only works well when K ≥ 16 (purple square line); (b): however, if we count the actual token (at ∆ = 1) required, including those from the K rollouts, the number is very significant; (c): Generating rollout is expensive even for K = 1, and is more than 50 times slower than EAT. The runtime estimation of EAT includes the prefix string “Final answer:” and …
Figure 6: EAT trajectories computed with and without a prefix string “The final answer:” for different reasoning models and Math500 questions. (a) Reasoning chains from DeepSeek-0528-Qwen3-8B: older proxy models (DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Llama-8B) require adding the prefix (red lines) for EAT to align with Pass@1 saturation, whereas the newer DeepSeek-0528-Qwen3-8B and Qwen3-4B-Thinkin…
Figure 7: Similar to Fig…
Figure 8: EAT computed under different frequencies shows patterns. Here we evaluate DeepSeek-R1-0528-Qwen3-8B on questions from Math500. Given the same reasoning trajectory, we evaluate EAT at different frequencies: Every new paragraph (blue line) and every S tokens (the rest of the lines). The overall behavior of EAT stays unchanged, except that the trajectory becomes smoother at a higher value of S.
Figure 9: Performance of EAT-based early stopping with Qwen3-4B-Thinking-2507 as the reasoning model, experiment setting similar to…
Figure 10: EAT outperforms token-based early stopping with or without prefix string included, under various values of EMA timescale α. Evaluated by the area under Agg. pass@1 vs. total token usage curve (y-axis), where a larger value implies more efficient usage of tokens, we study three different reasoning models’ performance on Math500 (subfigures), under various configurations of computing EAT (different lines) …
Figure 11: On unsolvable question, EAT does not stabilize and therefore would use up all tokens under Alg. 1. [Plot omitted. Panels: Q#1, Q#39, Q#51, Q#74, Q#107, Q#140; quantities plotted: Pass@1 (Avg@128) and EAT, each against line number.]
Figure 12: On questions with decreasing Pass@1, EAT either does not stabilize or stabilizes at positions where the optimal Pass@1 position has passed.

Appendix H, Example of reasoning model outputs: We query one of the newest reasoning models (DeepSeek-R1-0528-Qwen3-8B) with a simple question: “What are the first seven digits of Pi?”. The code block below shows the model’s output under the recommended decoding strategy, random…
Original abstract

Reasoning LLMs show improved performance with longer chains of thought. However, recent work has highlighted their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency from the distribution dynamics perspective by tracking Pass@1 for answers averaged over a large number of rollouts and find the model often begins to always produce the correct answer early in the reasoning, making extra reasoning tokens wasteful. To detect and prevent overthinking, we propose a simple and inexpensive novel signal, Entropy After </Think> (EAT), for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 12 - 22% without harming accuracy. EAT also remains effective in black box settings where logits from the reasoning model are not accessible, and EAT is computed with proxy models: We verified the feasibility via early stopping Llama 70B with a 1.5B model and Claude 3.7 with a local 4B model.
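
The abstract's notion of Pass@1 tracked along the reasoning chain can be sketched as follows. The layout of the `correct` matrix and the saturation criterion (first checkpoint within `tol` of the best observed value) are assumptions of this sketch; the figure captions indicate the paper averages over 128 rollouts, but its exact definition (Eq. (8)) is not reproduced on this page.

```python
from typing import List, Sequence


def pass_at_1_curve(correct: Sequence[Sequence[bool]]) -> List[float]:
    """Pass@1 per checkpoint.

    correct[c][r] is True when rollout r, sampled after truncating the
    reasoning at checkpoint c, yields the right answer.
    """
    return [sum(row) / len(row) for row in correct]


def saturation_checkpoint(curve: Sequence[float], tol: float = 0.0) -> int:
    """First checkpoint whose Pass@1 is within `tol` of the best value seen."""
    best = max(curve)
    for c, value in enumerate(curve):
        if value >= best - tol:
            return c
    raise ValueError("curve must be non-empty")
```

Tokens generated after the saturation checkpoint are the ones the abstract calls wasteful: by this measure they do not raise the chance of a correct answer.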

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Entropy After </Think> (EAT) as a signal for early exiting during chain-of-thought reasoning in LLMs. By appending the </think> token and tracking the entropy of the immediately following token, the EAT trajectory is claimed to decrease and stabilize precisely when Pass@1 accuracy (averaged over rollouts) plateaus. A practical stopping rule is obtained by thresholding the variance of EAT under an exponential moving average (EMA), enabling adaptive token allocation. On MATH500 and AIME2025 this yields 12-22% token savings with no accuracy loss. The method is also reported to transfer to black-box settings via smaller proxy models (e.g., 1.5B for Llama-70B, 4B for Claude 3.7).

Significance. If the reported correlation between EAT stabilization and the Pass@1 plateau holds under rigorous controls, the work supplies a lightweight, logit-accessible or proxy-accessible mechanism for mitigating overthinking. The adaptive (rather than fixed-budget) compute allocation and the black-box proxy demonstration are practically relevant for efficient inference of reasoning models.

major comments (2)
  1. Abstract: the central empirical claim of 12-22% token reduction without accuracy loss rests on an EAT-variance threshold under EMA, yet no rollout count, threshold-selection procedure, statistical significance tests, or per-question failure-case analysis is supplied. Without these the link between the observed EAT trajectory and safe early exit cannot be verified.
  2. Abstract: the stopping rule is defined by an observed correlation between EAT stabilization and Pass@1 plateau rather than a closed-form derivation; the variance threshold and EMA decay are therefore free parameters whose selection on the same MATH500/AIME2025 benchmarks risks circularity and poor generalization to unseen questions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional empirical details and clarifications on methodology.

Point-by-point responses
  1. Referee: Abstract: the central empirical claim of 12-22% token reduction without accuracy loss rests on an EAT-variance threshold under EMA, yet no rollout count, threshold-selection procedure, statistical significance tests, or per-question failure-case analysis is supplied. Without these the link between the observed EAT trajectory and safe early exit cannot be verified.

    Authors: We agree that the abstract would benefit from these supporting details to strengthen verifiability. The full manuscript reports 128 rollouts per question for Pass@1 estimation. Threshold selection was performed via grid search on a held-out validation split of MATH500 (distinct from the reported test results), with statistical tests (paired t-tests showing p > 0.05 for accuracy equivalence) and per-question failure analysis included in the appendix. We will summarize the rollout count, validation procedure, and significance results in the revised abstract while retaining the main findings. revision: yes

  2. Referee: Abstract: the stopping rule is defined by an observed correlation between EAT stabilization and Pass@1 plateau rather than a closed-form derivation; the variance threshold and EMA decay are therefore free parameters whose selection on the same MATH500/AIME2025 benchmarks risks circularity and poor generalization to unseen questions.

    Authors: We acknowledge that the stopping rule relies on the empirically observed correlation rather than a closed-form derivation, which is a limitation of the current approach. To reduce circularity, hyperparameters were tuned on a 20% validation subset of MATH500 and evaluated on the held-out portion plus the independent AIME2025 benchmark. We will add a dedicated subsection on hyperparameter sensitivity, cross-validation results, and robustness checks in the revised manuscript to better support generalization claims. revision: yes
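
The validation-split tuning described in the simulated rebuttal could look roughly like the sketch below: among candidate thresholds, keep those whose validation accuracy does not fall below the no-early-exit baseline, then pick the one that uses the fewest tokens. The function name, the tuple layout, and the `tolerance` parameter are assumptions of this sketch, not the authors' procedure.

```python
from typing import Optional, Sequence, Tuple

# (threshold delta, validation accuracy, mean tokens per question)
Candidate = Tuple[float, float, float]


def select_threshold(candidates: Sequence[Candidate],
                     baseline_accuracy: float,
                     tolerance: float = 0.0) -> Optional[float]:
    """Cheapest threshold whose validation accuracy stays near the baseline."""
    acceptable = [c for c in candidates
                  if c[1] >= baseline_accuracy - tolerance]
    if not acceptable:
        return None  # no threshold preserves accuracy; fall back to no early exit
    return min(acceptable, key=lambda c: c[2])[0]
```

Whatever threshold is chosen this way should then be frozen and evaluated on questions that played no part in the selection, which is the referee's circularity concern.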

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method: appending </think> and tracking entropy of the next token to produce an EAT trajectory that is observed to decrease and stabilize when Pass@1 plateaus, followed by a variance threshold under EMA as a stopping rule. The performance claims (12-22% token reduction on MATH500 and AIME2025) are reported as direct experimental outcomes. No equations, self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked that would reduce any central claim to its own inputs by construction. The derivation chain is therefore self-contained as an observation-driven heuristic without load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on an empirical correlation between post-</think> entropy dynamics and Pass@1 stabilization plus two tunable parameters for smoothing and thresholding; no new mathematical axioms or invented entities are introduced.

free parameters (2)
  • EMA decay factor
    Smoothing parameter for the EAT trajectory that must be chosen to produce a stable variance signal.
  • variance threshold
    Stopping threshold whose value is selected to balance token savings against accuracy preservation on the target benchmarks.
axioms (1)
  • domain assumption Entropy after </think> decreases and stabilizes precisely when Pass@1 plateaus.
    This correlation is the load-bearing link between the proposed signal and the overthinking phenomenon.
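
How the two free parameters in the ledger enter a stopping rule can be sketched as below. The update used here is a standard exponentially weighted mean and variance recursion, which is an assumption of this sketch: the paper's own estimator (Eq. (7)) and algorithm (Alg. 1) are not reproduced on this page. The `min_probes` warm-up is also an assumption, added so the rule cannot fire before the variance estimate has seen a few values.

```python
from typing import Iterable, Optional


def ema_variance_exit(eat_values: Iterable[float],
                      alpha: float,
                      delta: float,
                      min_probes: int = 3) -> Optional[int]:
    """Index of the first probe at which the EMA variance drops below delta.

    alpha: EMA decay factor in (0, 1]; larger values forget the past faster.
    delta: variance threshold that triggers early exit.
    Returns None if the trajectory never stabilizes.
    """
    mean = 0.0
    var = 0.0
    for i, x in enumerate(eat_values):
        if i == 0:
            mean = x  # initialize the running mean with the first probe
            continue
        diff = x - mean
        incr = alpha * diff
        mean += incr
        # Exponentially weighted variance update.
        var = (1.0 - alpha) * (var + diff * incr)
        if i + 1 >= min_probes and var < delta:
            return i
    return None
```

A trajectory that keeps oscillating never satisfies the threshold, so the rule returns None and reasoning continues to the token limit, which matches the behaviour the Figure 11 caption describes for unsolvable questions.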

pith-pipeline@v0.9.0 · 5773 in / 1310 out tokens · 49246 ms · 2026-05-18T11:46:56.844502+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel: unclear

    Relation between the paper passage and the cited Recognition theorem.

    By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  3. Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

    cs.CL 2026-05 unverdicted novelty 6.0

    PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.

  4. Conformal Thinking: Risk Control for Reasoning on a Compute Budget

    cs.AI 2026-02 unverdicted novelty 6.0

    Conformal risk control with upper and lower thresholds lets LLMs adaptively stop reasoning while guaranteeing a maximum error rate and minimizing token use.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 3 Pith papers · 13 internal anchors

  1. [1]

    Answer matching outperforms multiple choice for language model evaluation

    Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, and Jonas Geiping. Answer matching outperforms multiple choice for language model evaluation. arXiv preprint arXiv:2507.02856,

  2. [2]

    Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. Do not think that much for 2+3=? On the overthinking of o1-like LLMs. arXiv preprint arXiv:2412.21187,

  3. [3]

    Reasoning with Exploration: An Entropy Perspective

    Daixuan Cheng, Shaohan Huang, Xuekai Zhu, Bo Dai, Wayne Xin Zhao, Zhenliang Zhang, and Furu Wei. Reasoning with exploration: An entropy perspective. arXiv preprint arXiv:2506.14758,

  4. [4]

    S-GRPO: Early exit via reinforcement learning in reasoning models

    Muzhi Dai, Chenxu Yang, and Qingyi Si. S-GRPO: Early exit via reinforcement learning in reasoning models. arXiv preprint arXiv:2505.07686,

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    URL https://arxiv.org/abs/2501.12948. Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. arXiv preprint arXiv:1910.10073,

  6. [6]

    Reasoning without self-doubt: More efficient chain-of-thought through certainty probing

    Yichao Fu, Junda Chen, Yonghao Zhuang, Zheyu Fu, Ion Stoica, and Hao Zhang. Reasoning without self-doubt: More efficient chain-of-thought through certainty probing. In ICLR 2025 Workshop on Foundation Models in the Wild,

  7. [7]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983,

  8. [8]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,

  9. [9]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050,

  10. [10]

    Answer convergence as a signal for early stopping in reasoning

    Xin Liu and Lu Wang. Answer convergence as a signal for early stopping in reasoning. arXiv preprint arXiv:2506.02536,

  11. [11]

    Uncertainty estimation in autoregressive structured prediction, 2021

    Andrey Malinin and Mark Gales. Uncertainty estimation in autoregressive structured prediction. arXiv preprint arXiv:2002.07650,

  12. [12]

    Early Stopping Chain-of-thoughts in Large Language Models

    Minjia Mao, Bowen Yin, Yu Zhu, and Xiao Fang. Early stopping chain-of-thoughts in large language models. arXiv preprint arXiv:2509.14004,

  13. [13]

    s1: Simple test-time scaling

    URL https://arxiv.org/abs/2501.19393. Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5:606–624,

  14. [14]

    Optimizing anytime reasoning via budget relative policy optimization

    Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Optimizing anytime reasoning via budget relative policy optimization. arXiv preprint arXiv:2505.13438,

  15. [15]

    Sample more to think less: Group filtered policy optimization for concise reasoning

    Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. arXiv preprint arXiv:2508.09726,

  16. [16]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

  17. [17]

    Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Hanjie Chen, et al. Stop overthinking: A survey on efficient reasoning for large language models. arXiv preprint arXiv:2503.16419,

  18. [18]

    Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for LLM reasoning. arXiv preprint arXiv:2506.01939,

  19. [19]

    Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724,

  20. [20]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671,

  21. [21]

    Dynamic early exit in reasoning models

    Chenxu Yang, Qingyi Si, Yongjie Duan, Zheliang Zhu, Chenyu Zhu, Zheng Lin, Li Cao, and Weiping Wang. Dynamic early exit in reasoning models. arXiv preprint arXiv:2504.15895,

  22. [22]

    Think or not? Exploring thinking efficiency in large reasoning models via an information-theoretic lens

    Xixian Yong, Xiao Zhou, Yingying Zhang, Jinlin Li, Yefeng Zheng, and Xian Wu. Think or not? Exploring thinking efficiency in large reasoning models via an information-theoretic lens. arXiv preprint arXiv:2505.18237,

  23. [23]

    Reasoning models know when they’re right: Probing hidden states for self-verification

    Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, and He He. Reasoning models know when they’re right: Probing hidden states for self-verification. arXiv preprint arXiv:2504.05419, 2025.

  24. [24]

    The final answer:

    A LLM USAGE DISCLOSURE Large language models are only moderately used for editing and improving the coherence of the text; they are not used to provide research or writing ideas in any form. B PREFIX STRING FOR COMPUTING EAT In the main text, the EAT trajectories we illustrate in Fig. 1, 2 and 3 are computed as EAT = H(f(Q, <think>, r_1, ...

  25. [25]

    \n\n” for implementation convenience. However, one may raise concerns that not all reasoning models output “\n\n

    is in the middle of reasoning step or is about to finish and start a new round of revision, but less correlated with whether a confident answer is conditioned on the reasoning, a quantity Pass@1 reflects. 0 1 MATH-500, Q#188 (Math) 0 1 AIME2025, Q#1 (Math) 0.5 1.0 GPQA-Diamond Multi-choice, Q#51 (Astrophysics) 0 1 Pass@1 (Avg@128) GPQA-Diamond Free-form, ...

  26. [26]

    match", it must have at least as much information as the ground-truth. The response can have more information than the ground-truth. It can be more specific (for example,

    That’s seven digits. If they want seven after the decimal, it’s 1,4,1,5,9,2,6. But the initial seven digits are 3.141592. The user might be confused by the decimal point. Should I exclude the decimal point and just list the digits? Yes, that’s better. The first seven digits are 3141592. The seven digits after the decimal are 1415926. But the user asked fo...