Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Pith reviewed 2026-05-15 10:37 UTC · model grok-4.3
The pith
In extended-reasoning LLMs, zero-shot prompting reaches peak accuracy at moderate temperatures while chain-of-thought prompting performs best at the temperature extremes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using Grok-4.1 with extended reasoning on the AMO-Bench set of 39 hard mathematical problems, zero-shot prompting achieves its maximum accuracy of 59 percent at temperatures 0.4 and 0.7, whereas chain-of-thought prompting performs best at the boundary values of 0.0 and 1.0. The multiplicative benefit of extended reasoning over standard generation rises steadily from a factor of 6 at temperature 0.0 to a factor of 14.3 at temperature 1.0.
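The reported multiplier is simply the ratio of accuracy with extended reasoning to accuracy with standard generation at a fixed temperature. A minimal sketch, where the per-condition accuracy values are hypothetical, chosen only to reproduce the reported 6x and ~14.3x ratios (the paper's standard-generation baselines are not stated here):

```python
# Illustrative only: the per-condition accuracies below are hypothetical,
# chosen to reproduce the reported 6x (T=0.0) and ~14.3x (T=1.0) ratios.

def benefit_multiplier(acc_extended: float, acc_standard: float) -> float:
    """Ratio of extended-reasoning accuracy to standard-generation accuracy."""
    if acc_standard <= 0:
        raise ValueError("standard-generation accuracy must be positive")
    return acc_extended / acc_standard

# (extended, standard) accuracy pairs per temperature -- hypothetical values.
accuracies = {0.0: (0.48, 0.08), 1.0: (0.43, 0.03)}
for temp, (ext, std) in accuracies.items():
    print(f"T={temp}: {benefit_multiplier(ext, std):.1f}x extended-reasoning benefit")
```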
What carries the argument
The temperature-prompting interaction that determines when explicit chain-of-thought steps add more value than direct zero-shot answers under different levels of sampling randomness.
Load-bearing premise
The observed patterns of temperature and prompting performance on 39 math problems with one particular extended-reasoning model will hold for other models, larger problem collections, and non-mathematical domains.
What would settle it
Repeating the identical evaluation on a different extended-reasoning model or a substantially larger problem set and finding that zero-shot no longer peaks at moderate temperatures or that the extended-reasoning benefit ratio no longer increases with temperature.
Figures
Original abstract
Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
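The evaluation design described in the abstract (two prompting strategies crossed with four temperatures over a fixed problem set) can be sketched as a small grid harness. The prompt templates and the `query_model` stub below are illustrative placeholders, not the authors' actual code:

```python
# Sketch of the 2 prompting strategies x 4 temperatures evaluation grid.
# `query_model` and the prompt templates are hypothetical placeholders.

TEMPERATURES = [0.0, 0.4, 0.7, 1.0]
PROMPTS = {
    "zero-shot": "{problem}\nGive only the final answer.",
    "chain-of-thought": "{problem}\nLet's think step by step.",
}

def query_model(prompt: str, temperature: float) -> str:
    """Placeholder for an API call to the extended-reasoning model."""
    raise NotImplementedError

def evaluate(problems, is_correct, model=query_model):
    """Accuracy per (strategy, temperature) cell over a list of problems."""
    results = {}
    for name, template in PROMPTS.items():
        for temp in TEMPERATURES:
            answers = [model(template.format(problem=p), temp) for p in problems]
            correct = sum(is_correct(p, a) for p, a in zip(problems, answers))
            results[(name, temp)] = correct / len(problems)
    return results
```

With a real model client plugged in for `query_model`, each of the eight cells yields one accuracy estimate over the 39 problems.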
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the interaction of sampling temperature (0.0, 0.4, 0.7, 1.0) with zero-shot and chain-of-thought prompting in the extended-reasoning model Grok-4.1 on 39 AMO-Bench problems. It reports that zero-shot prompting reaches peak accuracy of 59% at moderate temperatures (T=0.4 and T=0.7), chain-of-thought performs best at the temperature extremes, and the performance multiplier from extended reasoning grows from 6x at T=0.0 to 14.3x at T=1.0, leading to the recommendation that temperature and prompting strategy should be optimized jointly rather than defaulting to T=0.
Significance. If the reported temperature-prompting interactions are robust, the findings would have immediate practical value for configuring extended-reasoning LLMs and would challenge the widespread practice of using deterministic sampling for reasoning tasks. The work supplies direct empirical measurements on a challenging benchmark and highlights a scaling trend in the benefit of extended reasoning, which could inform future prompt-engineering guidelines.
major comments (1)
- Experiments section: the central empirical claims (59% zero-shot accuracy at T=0.4/0.7, 6x-to-14.3x scaling of extended-reasoning benefit) rest on point estimates from only 39 problems with no reported standard errors, bootstrap intervals, or statistical tests. Because LLM outputs are stochastic (especially at T=1.0), the observed differences and interaction pattern may be noise; this directly weakens the recommendation to jointly optimize temperature and prompting.
minor comments (1)
- Abstract and Experiments: the problem-selection criteria for the 39 AMO-Bench items and the exact definition of 'extended reasoning' (number of tokens or steps) are not stated, making it difficult to assess reproducibility or domain coverage.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for statistical rigor in our empirical analysis. We address the concern below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: Experiments section: the central empirical claims (59% zero-shot accuracy at T=0.4/0.7, 6x-to-14.3x scaling of extended-reasoning benefit) rest on point estimates from only 39 problems with no reported standard errors, bootstrap intervals, or statistical tests. Because LLM outputs are stochastic (especially at T=1.0), the observed differences and interaction pattern may be noise; this directly weakens the recommendation to jointly optimize temperature and prompting.
Authors: We agree that uncertainty quantification is essential given the stochastic nature of LLM sampling. In the revised manuscript we will add 95% bootstrap confidence intervals (1,000 resamples of the 39 problems) for every accuracy and multiplier reported. We will also include McNemar tests for paired differences between zero-shot and chain-of-thought at each temperature, with p-values and effect sizes. These additions will allow readers to assess whether the observed temperature-prompting interaction remains statistically supported. Revision: yes.
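The analyses the rebuttal promises can be sketched directly: a percentile-bootstrap interval over the 39 per-problem outcomes and an exact McNemar test on paired zero-shot vs. chain-of-thought correctness. The 0/1 vectors below are synthetic placeholders, not the paper's data:

```python
# Sketch of the proposed uncertainty quantification. Synthetic data only.
import math
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI for mean accuracy over per-problem 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value from discordant paired 0/1 outcomes."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(n01, n10) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Synthetic example: 39 paired correctness indicators at one temperature.
rng = random.Random(42)
zs = [int(rng.random() < 0.59) for _ in range(39)]
cot = [int(rng.random() < 0.49) for _ in range(39)]
print("zero-shot 95% CI:", bootstrap_ci(zs))
print("McNemar p:", mcnemar_exact(zs, cot))
```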
Circularity Check
No circularity: purely empirical reporting of observed accuracies
Full rationale
The paper contains no derivations, equations, fitted parameters, or self-citations that reduce claims to inputs by construction. All central results (59% zero-shot peak at T=0.4/0.7, CoT performance at extremes, extended-reasoning benefit scaling from 6x to 14.3x) are direct point estimates from running Grok-4.1 on 39 fixed AMO-Bench problems under four temperature and two prompting conditions. No parameter is estimated from a subset and then relabeled as a prediction; no uniqueness theorem or ansatz is imported via self-citation; no known pattern is renamed as a new result. The evaluation is self-contained against external benchmarks (the benchmark problems themselves) and stands on observed outputs rather than constructed quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AMO-Bench constitutes a valid and challenging test of mathematical reasoning.
Reference graph
Works this paper leans on
- [1] Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang, Terry Yue Zhuo, and Taolue Chen. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Transactions on Software Engineering, 50(9):2437–2457, 2024.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [4] Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, et al. A survey of frontiers in LLM reasoning: Inference scaling, learning to reason, and agentic systems. arXiv preprint arXiv:2504.09037, 2025.
- [5] Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. LLM as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230, 2024.
- [6] Yiliu Tang, Jason Situ, Andrea Yaoyun Cui, Mengke Wu, and Yun Huang. LLM integration in extended reality: A comprehensive review of current trends, challenges, and future perspectives. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–24, 2025.
- [7] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- [8] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [9] Lennart Meincke, Ethan Mollick, Lilach Mollick, and Dan Shapiro. Prompting science report 2: The decreasing value of chain of thought in prompting. arXiv preprint arXiv:2506.07142, 2025.
- [10] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- [11] Yuheng Wu, Azalia Mirhoseini, and Thierry Tambe. On the role of temperature sampling in test-time scaling. arXiv preprint arXiv:2510.02611, 2025.
- [12] Jiarui Liu, Wenkai Li, Zhijing Jin, and Mona Diab. Automatic generation of model and data cards: A step towards responsible AI. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1975–1997, 2024.
- [13] Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024.
- [14] Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, and Rui Wang. Sampling-efficient test-time scaling: Self-estimating the best-of-n sampling in early decoding. arXiv preprint arXiv:2503.01422, 2025.
- [15] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [16] Weihua Du, Yiming Yang, and Sean Welleck. Optimizing temperature for language models with multi-sample inference. arXiv preprint arXiv:2502.05234, 2025.
- [17] xAI. Grok 4.1 model card. https://data.x.ai/2025-11-17-grok-4-1model-card.pdf, Nov 2025. Accessed: 2025-12-12.
- [19] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Jerry Tworek, Peter Wang, Xi Chen, Ethan Perez, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [20] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [21] Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, et al. AMO-Bench: Large language models still struggle in high school math competitions. arXiv preprint arXiv:2510.26768, 2025.