Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Pith reviewed 2026-05-15 10:37 UTC · model grok-4.3
The pith
In extended-reasoning LLMs, zero-shot prompting reaches peak accuracy at moderate temperatures while chain-of-thought prompting performs best at the temperature extremes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using Grok-4.1 with extended reasoning on the AMO-Bench set of 39 hard mathematical problems, zero-shot prompting achieves its maximum accuracy of 59 percent at temperatures 0.4 and 0.7, whereas chain-of-thought prompting performs best at the boundary values of 0.0 and 1.0. The multiplicative benefit of extended reasoning over standard generation rises steadily from a factor of 6 at temperature 0.0 to a factor of 14.3 at temperature 1.0.
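The reported multiplier is simply the ratio of accuracy with extended reasoning to accuracy with standard generation at a fixed temperature. A minimal sketch, where the per-condition accuracy values are hypothetical, chosen only to reproduce the reported 6x and ~14.3x ratios (the paper's standard-generation baselines are not stated here):

```python
# Illustrative only: the per-condition accuracies below are hypothetical,
# chosen to reproduce the reported 6x (T=0.0) and ~14.3x (T=1.0) ratios.

def benefit_multiplier(acc_extended: float, acc_standard: float) -> float:
    """Ratio of extended-reasoning accuracy to standard-generation accuracy."""
    if acc_standard <= 0:
        raise ValueError("standard-generation accuracy must be positive")
    return acc_extended / acc_standard

# (extended, standard) accuracy pairs per temperature -- hypothetical values.
accuracies = {0.0: (0.48, 0.08), 1.0: (0.43, 0.03)}
for temp, (ext, std) in accuracies.items():
    print(f"T={temp}: {benefit_multiplier(ext, std):.1f}x extended-reasoning benefit")
```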
What carries the argument
The temperature-prompting interaction that determines when explicit chain-of-thought steps add more value than direct zero-shot answers under different levels of sampling randomness.
Load-bearing premise
The observed patterns of temperature and prompting performance on 39 math problems with one particular extended-reasoning model will hold for other models, larger problem collections, and non-mathematical domains.
What would settle it
Repeating the identical evaluation on a different extended-reasoning model or a substantially larger problem set and finding that zero-shot no longer peaks at moderate temperatures or that the extended-reasoning benefit ratio no longer increases with temperature.
Figures
Original abstract
Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
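The evaluation design described in the abstract (two prompting strategies crossed with four temperatures over a fixed problem set) can be sketched as a small grid harness. The prompt templates and the `query_model` stub below are illustrative placeholders, not the authors' actual code:

```python
# Sketch of the 2 prompting strategies x 4 temperatures evaluation grid.
# `query_model` and the prompt templates are hypothetical placeholders.

TEMPERATURES = [0.0, 0.4, 0.7, 1.0]
PROMPTS = {
    "zero-shot": "{problem}\nGive only the final answer.",
    "chain-of-thought": "{problem}\nLet's think step by step.",
}

def query_model(prompt: str, temperature: float) -> str:
    """Placeholder for an API call to the extended-reasoning model."""
    raise NotImplementedError

def evaluate(problems, is_correct, model=query_model):
    """Accuracy per (strategy, temperature) cell over a list of problems."""
    results = {}
    for name, template in PROMPTS.items():
        for temp in TEMPERATURES:
            answers = [model(template.format(problem=p), temp) for p in problems]
            correct = sum(is_correct(p, a) for p, a in zip(problems, answers))
            results[(name, temp)] = correct / len(problems)
    return results
```

With a real model client plugged in for `query_model`, each of the eight cells yields one accuracy estimate over the 39 problems.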
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates the interaction of sampling temperature (0.0, 0.4, 0.7, 1.0) with zero-shot and chain-of-thought prompting in the extended-reasoning model Grok-4.1 on 39 AMO-Bench problems. It reports that zero-shot prompting reaches peak accuracy of 59% at moderate temperatures (T=0.4 and T=0.7), chain-of-thought performs best at the temperature extremes, and the performance multiplier from extended reasoning grows from 6x at T=0.0 to 14.3x at T=1.0, leading to the recommendation that temperature and prompting strategy should be optimized jointly rather than defaulting to T=0.
Significance. If the reported temperature-prompting interactions are robust, the findings would have immediate practical value for configuring extended-reasoning LLMs and would challenge the widespread practice of using deterministic sampling for reasoning tasks. The work supplies direct empirical measurements on a challenging benchmark and highlights a scaling trend in the benefit of extended reasoning, which could inform future prompt-engineering guidelines.
major comments (1)
- Experiments section: the central empirical claims (59% zero-shot accuracy at T=0.4/0.7, 6x-to-14.3x scaling of extended-reasoning benefit) rest on point estimates from only 39 problems with no reported standard errors, bootstrap intervals, or statistical tests. Because LLM outputs are stochastic (especially at T=1.0), the observed differences and interaction pattern may be noise; this directly weakens the recommendation to jointly optimize temperature and prompting.
minor comments (1)
- Abstract and Experiments: the problem-selection criteria for the 39 AMO-Bench items and the exact definition of 'extended reasoning' (number of tokens or steps) are not stated, making it difficult to assess reproducibility or domain coverage.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for statistical rigor in our empirical analysis. We address the concern below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: Experiments section: the central empirical claims (59% zero-shot accuracy at T=0.4/0.7, 6x-to-14.3x scaling of extended-reasoning benefit) rest on point estimates from only 39 problems with no reported standard errors, bootstrap intervals, or statistical tests. Because LLM outputs are stochastic (especially at T=1.0), the observed differences and interaction pattern may be noise; this directly weakens the recommendation to jointly optimize temperature and prompting.
Authors: We agree that uncertainty quantification is essential given the stochastic nature of LLM sampling. In the revised manuscript we will add 95% bootstrap confidence intervals (1,000 resamples of the 39 problems) for every accuracy and multiplier reported. We will also include McNemar tests for paired differences between zero-shot and chain-of-thought at each temperature, with p-values and effect sizes. These additions will allow readers to assess whether the observed temperature-prompting interaction remains statistically supported. Revision: yes.
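The analyses the rebuttal promises can be sketched directly: a percentile-bootstrap interval over the 39 per-problem outcomes and an exact McNemar test on paired zero-shot vs. chain-of-thought correctness. The 0/1 vectors below are synthetic placeholders, not the paper's data:

```python
# Sketch of the proposed uncertainty quantification. Synthetic data only.
import math
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """95% percentile-bootstrap CI for mean accuracy over per-problem 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value from discordant paired 0/1 outcomes."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(n01, n10) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Synthetic example: 39 paired correctness indicators at one temperature.
rng = random.Random(42)
zs = [int(rng.random() < 0.59) for _ in range(39)]
cot = [int(rng.random() < 0.49) for _ in range(39)]
print("zero-shot 95% CI:", bootstrap_ci(zs))
print("McNemar p:", mcnemar_exact(zs, cot))
```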
Circularity Check
No circularity: purely empirical reporting of observed accuracies
Full rationale
The paper contains no derivations, equations, fitted parameters, or self-citations that reduce claims to inputs by construction. All central results (59% zero-shot peak at T=0.4/0.7, CoT performance at extremes, extended-reasoning benefit scaling from 6x to 14.3x) are direct point estimates from running Grok-4.1 on 39 fixed AMO-Bench problems under four temperature and two prompting conditions. No parameter is estimated from a subset and then relabeled as a prediction; no uniqueness theorem or ansatz is imported via self-citation; no known pattern is renamed as a new result. The evaluation is self-contained against external benchmarks (the benchmark problems themselves) and stands on observed outputs rather than constructed quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: AMO-Bench constitutes a valid and challenging test of mathematical reasoning.
Reference graph
Works this paper leans on
- [1] Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang, Terry Yue Zhuo, and Taolue Chen. Chain-of-thought in neural code generation: From and for lightweight language models. IEEE Transactions on Software Engineering, 50(9):2437–2457, 2024.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [3] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [4] Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, et al. A survey of frontiers in LLM reasoning: Inference scaling, learning to reason, and agentic systems. arXiv preprint arXiv:2504.09037, 2025.
- [5] Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang, Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song, Man Lan, and Furu Wei. LLM as a mastermind: A survey of strategic reasoning with large language models. arXiv preprint arXiv:2404.01230, 2024.
- [6] Yiliu Tang, Jason Situ, Andrea Yaoyun Cui, Mengke Wu, and Yun Huang. LLM integration in extended reality: A comprehensive review of current trends, challenges, and future perspectives. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–24, 2025.
- [7] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
- [8] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- [9] Lennart Meincke, Ethan Mollick, Lilach Mollick, and Dan Shapiro. Prompting science report 2: The decreasing value of chain of thought in prompting. arXiv preprint arXiv:2506.07142, 2025.
- [10] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022.
- [11] Yuheng Wu, Azalia Mirhoseini, and Thierry Tambe. On the role of temperature sampling in test-time scaling. arXiv preprint arXiv:2510.02611, 2025.
- [12] Jiarui Liu, Wenkai Li, Zhijing Jin, and Mona Diab. Automatic generation of model and data cards: A step towards responsible AI. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1975–1997, 2024.
- [13] Matthew Renze. The effect of sampling temperature on problem solving in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 7346–7356, 2024.
- [14] Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, and Rui Wang. Sampling-efficient test-time scaling: Self-estimating the best-of-n sampling in early decoding. arXiv preprint arXiv:2503.01422, 2025.
- [15] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- [16] Weihua Du, Yiming Yang, and Sean Welleck. Optimizing temperature for language models with multi-sample inference. arXiv preprint arXiv:2502.05234, 2025.
- [17] xAI. Grok 4.1 model card. https://data.x.ai/2025-11-17-grok-4-1model-card.pdf, Nov 2025. Accessed: 2025-12-12.
- [19] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Jerry Tworek, Peter Wang, Xi Chen, Ethan Perez, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [20] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [21] Shengnan An, Xunliang Cai, Xuezhi Cao, Xiaoyu Li, Yehao Lin, Junlin Liu, Xinxuan Lv, Dan Ma, Xuanlin Wang, Ziwen Wang, et al. AMO-Bench: Large language models still struggle in high school math competitions. arXiv preprint arXiv:2510.26768, 2025.