Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
Pith reviewed 2026-05-08 12:10 UTC · model grok-4.3
The pith
Even at a nominal temperature of zero, large language models produce divergent outputs due to implementation-dependent perturbations that can be characterized as an effective background temperature.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the notion of background temperature T_bg as the effective temperature induced by an implementation-dependent perturbation process that is observed even when the nominal temperature T is set to zero. This T_bg relates to a stochastic perturbation governed by the specific inference environment I, and can be estimated through the equivalent temperature T_n(I) of an ideal reference system. The formalization and estimation protocol are demonstrated via experiments on LLMs from major providers.
What carries the argument
Background temperature T_bg, which captures the effective randomness from implementation perturbations at nominal T=0 and is estimated via the equivalent temperature T_n(I) in a reference system determined by the inference environment I.
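As a reading aid, the standard temperature-scaled softmax and one possible way to write the matching step can be sketched as follows. The logits z_i, the divergence D, the observed output distribution \hat{P}_I, and the argmin form are assumptions made for illustration only; the material summarised here does not state the paper's exact definition.

```latex
% Standard temperature-scaled softmax over logits z_i, valid for T > 0
p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

% Hypothetical matching form (an assumption, not the paper's stated definition):
% pick the reference temperature whose distribution P_T is closest to the
% distribution \hat{P}_I observed under inference environment I at nominal T = 0
T_n(I) = \arg\min_{T > 0} \; D\!\left( \hat{P}_I \,\middle\|\, P_T \right)
```

Under this reading, T_bg would be the quantity of interest and T_n(I) its estimate; whether the paper treats the two as identical is not settled by the text above.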
If this is right
- Outputs from LLMs at T=0 will still vary based on the specific inference setup used.
- Reproducibility requires controlling or measuring the background temperature induced by the environment.
- Evaluation of models must consider this hidden variability to avoid misleading comparisons.
- Deployment in production systems needs awareness of how different inference environments affect output stability.
- The proposed estimation protocol can be used to quantify and mitigate these effects.
Where Pith is reading between the lines
- Standardizing the measurement of background temperature could enable fairer benchmarks across different hardware and software stacks.
- This concept might extend to other generative AI models beyond LLMs where implementation noise affects outputs.
- Future work could explore ways to reduce T_bg through more deterministic computing practices or software fixes.
- The idea connects to broader issues in computational reproducibility in scientific computing.
Load-bearing premise
The divergence observed at T=0 can be accurately represented as an equivalent temperature T_n(I) within an ideal reference model controlled by the inference environment I.
What would settle it
Running the same input multiple times at T=0 on a fixed inference setup and finding that the output variability does not correspond to what the estimated T_n(I) would predict in the reference system.
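A minimal sketch of the first half of such a check, assuming the repeated T=0 completions for one prompt have already been collected as strings. The function name `variability_metrics` and the sample completions are hypothetical; the metrics (unique response rate, empirical entropy) are common choices, not necessarily the ones the paper uses.

```python
import math
from collections import Counter


def variability_metrics(outputs):
    """Summarise run-to-run variability of repeated completions for one prompt."""
    counts = Counter(outputs)
    n = len(outputs)
    probs = [c / n for c in counts.values()]
    # Shannon entropy (in nats) of the empirical distribution over distinct outputs
    entropy = -sum(p * math.log(p) for p in probs)
    return {
        "n_runs": n,
        "n_unique": len(counts),
        "unique_rate": len(counts) / n,
        "modal_share": max(counts.values()) / n,
        "entropy_nats": entropy,
    }


# Hypothetical completions from five repeated T=0 calls with the same prompt
runs = ["Paris", "Paris", "Paris", "paris.", "Paris"]
m = variability_metrics(runs)
print(m)
```

A fully deterministic setup would give `n_unique == 1` and zero entropy; anything above that is the variability the estimated T_n(I) would have to account for.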
Original abstract
Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
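Floating-point non-associativity, one of the sources named in the abstract, can be illustrated in a few lines. This is a toy sketch: the two-token vocabulary, the `greedy` helper, and the specific numbers are invented for illustration and do not come from the paper.

```python
# Floating-point addition is not associative: grouping changes the rounded result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def greedy(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy two-token vocabulary. Token 1's logit is the same three terms
# accumulated in two different orders; token 0's logit is fixed.
fixed = 0.6
token_order_a = greedy([fixed, left])
token_order_b = greedy([fixed, right])
print(token_order_a, token_order_b)  # 1 0
```

When two candidate tokens are nearly tied, a change in accumulation order (for example from a different batch size or kernel) can be enough to flip the greedy choice, after which the two generations diverge.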
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs produce divergent outputs even at nominal temperature T=0 due to implementation-dependent perturbations (batch-size variation, kernel non-invariance, floating-point non-associativity). It formalizes this by defining background temperature T_bg as the effective temperature induced by a stochastic perturbation process governed by the inference environment I, relates T_bg to an equivalent temperature T_n(I) in an ideal reference system, proposes an empirical protocol for estimating T_bg, presents pilot experiments across major LLM providers, and discusses implications for reproducibility, evaluation, and deployment.
Significance. If the distributional equivalence between implementation perturbations and temperature sampling holds and the estimation protocol is validated, the framework could provide a useful quantitative lens for hidden nondeterminism in LLM inference, supporting more reproducible research and reliable production systems. The pilot experiments add initial empirical grounding by showing the effect across providers.
major comments (2)
- [Abstract (definitions and relations)] The core modeling step—that implementation perturbations induce an output distribution equivalent to temperature scaling of logits before softmax—is load-bearing for the definition of T_bg and the protocol using T_n(I), yet the abstract supplies no distributional justification or comparison (e.g., via KL divergence, entropy matching, or moment analysis). Sources such as order-dependent rounding can produce structured, non-uniform effects that do not replicate the entropy-increasing action of temperature across the full vocabulary.
- [Abstract (empirical protocol and pilot experiments)] The empirical protocol for estimating T_n(I) is described at a high level but lacks concrete details on the matching procedure, metrics, or controls for confounding factors (e.g., how multiple runs at T=0 are compared to temperature sweeps in the reference system). Without these, it is impossible to assess whether the pilot experiments actually support the claimed equivalence.
minor comments (1)
- Clarify the precise mathematical relation between T_bg and T_n(I) upon first introduction; the current phrasing leaves open whether T_bg is defined as identical to T_n(I) or merely estimated by it.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our short note. The comments correctly note that the abstract is high-level and could better support the core claims with additional clarification. We address each major comment below and will revise the manuscript accordingly to improve transparency without altering the note's scope or conclusions.
-
Referee: The core modeling step—that implementation perturbations induce an output distribution equivalent to temperature scaling of logits before softmax—is load-bearing for the definition of T_bg and the protocol using T_n(I), yet the abstract supplies no distributional justification or comparison (e.g., via KL divergence, entropy matching, or moment analysis). Sources such as order-dependent rounding can produce structured, non-uniform effects that do not replicate the entropy-increasing action of temperature across the full vocabulary.
Authors: We agree the abstract does not detail distributional comparisons. The manuscript defines T_bg operationally via the inference environment I and relates it to T_n(I) through empirical matching of output statistics (e.g., entropy or diversity measures) rather than assuming exact equivalence for every perturbation source. Structured effects such as non-associativity are treated as contributing to net effective randomness. We will revise the abstract to state explicitly that equivalence is operational and metric-based (entropy or KL matching), not a claim of identical mechanisms across all implementation artifacts. This preserves the framework while addressing the concern.
Revision: yes
-
Referee: The empirical protocol for estimating T_n(I) is described at a high level but lacks concrete details on the matching procedure, metrics, or controls for confounding factors (e.g., how multiple runs at T=0 are compared to temperature sweeps in the reference system). Without these, it is impossible to assess whether the pilot experiments actually support the claimed equivalence.
Authors: The referee correctly observes that the abstract omits protocol specifics. The full manuscript outlines the protocol as comparing variability from repeated T=0 runs under I against temperature sweeps in a reference system, using metrics such as output entropy and unique response rates, with controls including fixed prompts and averaging across trials. We will expand the abstract with a concise description of the matching procedure and metrics, and augment the pilot experiments section with further details on controls and the reference implementation. These changes will make the empirical grounding more transparent.
Revision: yes
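The matching step discussed in this exchange might look roughly like the sketch below. It is not the paper's protocol: the Gaussian noise on logits stands in for the unknown implementation perturbation, the reference logits are made up, and all function names are hypothetical. It fits a temperature by grid search on KL divergence and reports the residual, which is one way to see how well (or badly) a temperature explains the observed variability.

```python
import math
import random


def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits (temperature > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def observed_distribution(logits, noise_scale, n_runs, rng):
    """Stand-in for repeated T=0 runs: greedy argmax over noisy logits.

    Gaussian noise is an assumption here, not the paper's perturbation model.
    """
    counts = [0] * len(logits)
    for _ in range(n_runs):
        noisy = [z + rng.gauss(0.0, noise_scale) for z in logits]
        counts[max(range(len(noisy)), key=noisy.__getitem__)] += 1
    return [c / n_runs for c in counts]


def fit_equivalent_temperature(p_obs, ref_logits, grid):
    """Grid search for the temperature whose softmax is closest to p_obs in KL."""
    best_t = min(grid, key=lambda t: kl(p_obs, softmax(ref_logits, t)))
    return best_t, kl(p_obs, softmax(ref_logits, best_t))


rng = random.Random(0)
ref_logits = [2.0, 1.8, 0.5]  # hypothetical next-token logits
p_obs = observed_distribution(ref_logits, noise_scale=0.1, n_runs=5000, rng=rng)
grid = [0.01 * k for k in range(1, 201)]  # temperature sweep from 0.01 to 2.00
t_hat, residual = fit_equivalent_temperature(p_obs, ref_logits, grid)
print(p_obs, t_hat, residual)
```

A large residual KL at the best-fitting temperature would support the referee's concern that some perturbations are structured in ways a single temperature cannot reproduce.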
Circularity Check
New definitional concept with empirical protocol; no reduction to inputs by construction
The paper introduces background temperature T_bg purely as a new formalization of observed nondeterminism at nominal T=0, defines its relation to an equivalent T_n(I) in a reference system, and outlines an empirical estimation protocol. No equations, derivations, or predictions are shown that reduce claimed results back to fitted parameters, self-citations, or ansatzes by construction. The contribution remains self-contained as a definitional and measurement framework without load-bearing circular steps.
Axiom & Free-Parameter Ledger
invented entities (1)
-
background temperature T_bg
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Llm stability: A detailed analysis with some surprises
Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, and Breck Baldwin. Non-determinism of “deterministic” llm settings. arXiv, 2408.04667,
-
[2]
Version 5, accessed 2025-09-15
-
[3]
SmolLM3: smol, multilingual, long-context reasoner. https://huggingface.co/blog/smollm3, 2025
Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raff...
-
[4]
Defeating nondeterminism in llm inference
Horace He and Thinking Machines Lab. Defeating nondeterminism in llm inference. Thinking Machines Lab blog, 2025. Accessed: 2025-09-15
-
[5]
Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics
-
[6]
Truthfulqa: Measuring how models mimic human falsehoods, 2021
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2021
-
[7]
Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,
Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of chatgpt in code generation. In arXiv preprint, volume 2308.02828, 2023. Accessed: 2025-09-15
-
[8]
S. Price and D. L. Cote. Document analysis with llms: Assessing performance, bias, and nondeterminism in decision making. In ICPRAM 2025: Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods, pages 207–214, 2025. ISBN: 978-989-758-730-6.
-
[9]
Nikita Ravi, Abhinav Goel, James C. Davis, and George K. Thiruvathukal. Improving the reproducibility of deep learning software: An initial investigation through a case study analysis. arXiv preprint, arXiv:2505.03165, 2025. Accessed: 2025-09-15
-
[10]
Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Oscar Hernandez, Mark Coletti, and Ada Sedova. Impacts of floating-point non-associativity on reproducibility for hpc and deep learning applications. arXiv preprint, arXiv:2408.05148, 2024. Accessed: 2025-09-15
-
[11]
The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism, 2024
Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. Evaluation of llms should not ignore non-determinism. arXiv, 2407.10457, 2024. Accessed: 2025-09-15.