arxiv: 2604.22411 · v1 · submitted 2026-04-24 · 💻 cs.AI · cs.CL · cs.LG

Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models

Pith reviewed 2026-05-08 12:10 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords background temperature · large language models · nondeterminism · reproducibility · inference environment · temperature sampling · LLM evaluation · hidden randomness

The pith

Even at a nominal temperature of zero, large language models produce divergent outputs due to implementation-dependent perturbations that can be characterized as an effective background temperature.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formalizes the hidden nondeterminism in LLMs that occurs even when temperature is set to zero for deterministic decoding. It introduces background temperature as the effective temperature caused by factors like batch size variation, kernel choices, and floating-point operations in the inference environment. The authors provide definitions linking this to stochastic perturbations and outline an empirical method to estimate it using an ideal reference system. Pilot experiments across major LLM providers illustrate the concept and its effects on output consistency. Understanding this allows for better control over reproducibility in model evaluations and deployments.

Core claim

We introduce the notion of background temperature T_bg as the effective temperature induced by an implementation-dependent perturbation process that is observed even when the nominal temperature T is set to zero. This T_bg relates to a stochastic perturbation governed by the specific inference environment I, and can be estimated through the equivalent temperature T_n(I) of an ideal reference system. The formalization and estimation protocol are demonstrated via experiments on LLMs from major providers.

What carries the argument

Background temperature T_bg, which captures the effective randomness from implementation perturbations at nominal T=0 and is estimated via the equivalent temperature T_n(I) in a reference system determined by the inference environment I.
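
To make the mechanism concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes the standard temperature-scaled softmax, the logit values are hypothetical, and `softmax_with_temperature` and `greedy_token` are illustrative names. It shows that greedy decoding (the T → 0 limit) can flip its choice under a tiny logit perturbation when two tokens are nearly tied, which is the kind of effect a background temperature is meant to summarise.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) for a strictly positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T = 0 is the greedy (argmax) limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_token(logits):
    """Index of the largest logit (the first index wins on exact ties)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical logits with a near-tie between tokens 0 and 1.
clean_logits = [2.0, 2.0 - 1e-7, 0.5]
# The same logits after a tiny perturbation of token 0 (an assumed stand-in
# for implementation-level noise).
perturbed_logits = [2.0 - 2e-7, 2.0 - 1e-7, 0.5]

print(greedy_token(clean_logits), greedy_token(perturbed_logits))

# Lower temperatures concentrate probability mass on the top token.
print(softmax_with_temperature([2.0, 1.0, 0.0], 1.0))
print(softmax_with_temperature([2.0, 1.0, 0.0], 0.1))
```

The perturbation here is deterministic and hand-picked; in a real inference stack its size and direction would depend on the environment I.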

If this is right

  • Outputs from LLMs at T=0 will still vary based on the specific inference setup used.
  • Reproducibility requires controlling or measuring the background temperature induced by the environment.
  • Evaluation of models must consider this hidden variability to avoid misleading comparisons.
  • Deployment in production systems needs awareness of how different inference environments affect output stability.
  • The proposed estimation protocol can be used to quantify and mitigate these effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardizing the measurement of background temperature could enable fairer benchmarks across different hardware and software stacks.
  • This concept might extend to other generative AI models beyond LLMs where implementation noise affects outputs.
  • Future work could explore ways to reduce T_bg through more deterministic computing practices or software fixes.
  • The idea connects to broader issues in computational reproducibility in scientific computing.

Load-bearing premise

The divergence observed at T=0 can be accurately represented as an equivalent temperature T_n(I) within an ideal reference model controlled by the inference environment I.

What would settle it

Running the same input multiple times at T=0 on a fixed inference setup and finding that the output variability does not correspond to what the estimated T_n(I) would predict in the reference system.

Figures

Figures reproduced from arXiv: 2604.22411 by Alberto Messina, Stefano Scotta.

Figure 1: Measuring protocol.
6.2 Controlling the Inference Environment I
Run repeated inference (e.g., M ≥ 50 per prompt) at T = 0 while varying I along axes known to influence nondeterminism:
  • Batch structure: batch size, e.g. ∈ {1, 2, 4, 8, 16, 32, …}; co-batching with other prompts vs. serial.
  • Concurrency/load: single request vs. many simultaneous requests.
  • Hardware/backends: GPU types, CPU vs. GPU, pr…
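
The protocol excerpt above can be sketched as a small harness. This is an illustration under stated assumptions, not the authors' code: `generate` is a hypothetical callable wrapping whatever inference stack is under test, the axis names and values are placeholders, and `fake_generate` is a stand-in so the sketch runs without a model.

```python
import itertools

def run_protocol(generate, prompts, axes, runs_per_prompt=50):
    """Collect repeated T = 0 outputs for every environment configuration.

    generate: hypothetical callable (prompt, temperature, **env) -> str
    axes:     dict mapping an environment axis name to the values to try
    Returns {environment_key: {prompt: [output, ...]}}.
    """
    names = sorted(axes)
    results = {}
    for values in itertools.product(*(axes[name] for name in names)):
        env = dict(zip(names, values))
        key = tuple(env.items())  # hashable description of this environment
        results[key] = {
            prompt: [generate(prompt, temperature=0.0, **env)
                     for _ in range(runs_per_prompt)]
            for prompt in prompts
        }
    return results

# Placeholder axes; a real study would use the axes relevant to its stack.
axes = {"batch_size": [1, 2, 4, 8], "concurrency": ["single", "many"]}

def fake_generate(prompt, temperature, batch_size, concurrency):
    """Deterministic stand-in for a real backend (assumption for the demo)."""
    return f"answer to {prompt!r}"

outputs = run_protocol(fake_generate, ["What is 2 + 2?"], axes, runs_per_prompt=50)
print(len(outputs))  # number of environment configurations explored
```

Keeping the raw outputs per configuration, rather than a summary statistic, leaves the choice of comparison metric open for a later step.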
Figure 2: Distribution of exact-match fractions obtained from the reference LLM answers. Top row (from left to right): histograms of the distributions f_0, f_{0.2} and f_1. Bottom row: kernel density estimates of the exact-match fraction for all sampled temperatures in Θ. Note that for T = 0 the density is drawn as a vertical line because all answers are identical, so the density is entirely concentrated…
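
A minimal sketch of how an exact-match fraction could be computed per prompt. The paper's precise definition is not reproduced on this page, so the code assumes one plausible reading: the share of repeated answers that equal the most frequent answer for that prompt. The function names and sample answers are hypothetical.

```python
from collections import Counter

def exact_match_fraction(answers):
    """Share of answers identical to the most frequent answer (assumed definition)."""
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def fraction_distribution(answers_by_prompt):
    """One exact-match fraction per prompt, in the dictionary's insertion order."""
    return [exact_match_fraction(answers) for answers in answers_by_prompt.values()]

# Made-up repeated answers for two prompts.
runs = {
    "prompt A": ["Paris"] * 10,           # fully consistent
    "prompt B": ["4", "4", "four", "4"],  # one divergent answer
}
print(fraction_distribution(runs))  # [1.0, 0.75]
```

Under this reading, a fully deterministic system gives a fraction of 1.0 for every prompt, which matches the caption's remark that the T = 0 reference density collapses to a single value.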
Figure 3: Discrete distribution g of the fraction of identical answers given by the LLM under test, gpt-4.1-nano, to the prompts in Π. The distribution is shown both as a histogram (y-axis on the left) and as a kernel density estimate (y-axis on the right).
In order to compare the discrete distributions of observations, f_T for T ∈ Θ and g, we chose to use the Kolmogorov–Smirnov (K-S) distance, which…
Figure 4: K-S distances between f_T and g at different T ∈ Θ for the tested model gpt-4.1-nano. (a) Plot of all the values. (b) Exact values for a sample of temperatures. The background temperature estimate for the tested model is 0.05.
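
The selection step these captions describe (pick the reference temperature whose distribution is closest to g in K-S distance) can be sketched as follows. This is an assumed reconstruction, not the authors' implementation: the two-sample K-S distance is taken as the largest absolute gap between empirical CDFs, and all numbers are toy values rather than the paper's data.

```python
import bisect

def ks_distance(sample_a, sample_b):
    """Two-sample K-S distance: max absolute gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    distance = 0.0
    for x in set(a) | set(b):  # the gap can only change at observed values
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance

def estimate_background_temperature(reference_by_temperature, observed):
    """Return (closest temperature, all distances) for the observed sample."""
    distances = {t: ks_distance(sample, observed)
                 for t, sample in reference_by_temperature.items()}
    best = min(distances, key=distances.get)
    return best, distances

# Toy exact-match fractions for a reference model at three temperatures.
reference = {
    0.0: [1.0, 1.0, 1.0, 1.0],
    0.05: [1.0, 0.9, 1.0, 0.8],
    0.2: [0.6, 0.5, 0.7, 0.4],
}
# Toy fractions for a model under test at nominal T = 0.
observed = [1.0, 0.9, 0.9, 0.8]

best_t, distances = estimate_background_temperature(reference, observed)
print(best_t, distances)
```

The estimate can only be as fine as the temperature grid, so a coarse grid caps the resolution of the result.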
Figure 5: Side-by-side histograms of the two empirical discrete distributions, g and f_{0.05}, the latter being the closest to g among {f_T : T ∈ Θ} in terms of K-S distance.
… Π used in Section 7.1), limiting the answers to 32 tokens, analogously to what we did for SmolLM3-3B in Section 7.1, for each T ∈ Θ̃ = Θ ∪ {1.05, 1.1, …, 1.45, 1.5}. So, hereinafter the set of reference LLMs is L = {smoll, llama}, where smoll and llama st…
Figure 6: Heatmap of the K-S distance between f_T and f̃_{T′} for T ∈ Θ and T′ ∈ Θ̃. In particular, the cell in row t and column t′ represents the K-S distance between f_t and f̃_{t′}. Cells with red borders correspond to K-S distances less than or equal to 0.15.
7.3 Estimating T_bg for other models
In this section, we present a couple of other experiments aimed at estimating T_bg for other LLMs accessed…
Original abstract

Even when decoding with temperature $T=0$, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of \emph{background temperature} $T_{\mathrm{bg}}$, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal $T=0$. We provide clean definitions, show how $T_{\mathrm{bg}}$ relates to a stochastic perturbation governed by the inference environment $I$, and propose an empirical protocol to estimate $T_{\mathrm{bg}}$ via the equivalent temperature $T_n(I)$ of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
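
Floating-point non-associativity, one of the sources named in the abstract, is easy to reproduce in isolation. The snippet below is a generic Python illustration using IEEE 754 double precision, not code from the paper or from the Thinking Machines Lab post: adding the same three numbers in a different order gives results that differ in the last bit.

```python
# Same three operands, two groupings.
left_first = (0.1 + 0.2) + 0.3
right_first = 0.1 + (0.2 + 0.3)

print(left_first == right_first)  # False
print(left_first, right_first)    # 0.6000000000000001 0.6
```

In a parallel reduction the grouping depends on how work is split across threads or batches, so the same mathematical sum can yield slightly different logits from run to run.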

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs produce divergent outputs even at nominal temperature T=0 due to implementation-dependent perturbations (batch-size variation, kernel non-invariance, floating-point non-associativity). It formalizes this by defining background temperature T_bg as the effective temperature induced by a stochastic perturbation process governed by the inference environment I, relates T_bg to an equivalent temperature T_n(I) in an ideal reference system, proposes an empirical protocol for estimating T_bg, presents pilot experiments across major LLM providers, and discusses implications for reproducibility, evaluation, and deployment.

Significance. If the distributional equivalence between implementation perturbations and temperature sampling holds and the estimation protocol is validated, the framework could provide a useful quantitative lens for hidden nondeterminism in LLM inference, supporting more reproducible research and reliable production systems. The pilot experiments add initial empirical grounding by showing the effect across providers.

major comments (2)
  1. [Abstract (definitions and relations)] The core modeling step—that implementation perturbations induce an output distribution equivalent to temperature scaling of logits before softmax—is load-bearing for the definition of T_bg and the protocol using T_n(I), yet the abstract supplies no distributional justification or comparison (e.g., via KL divergence, entropy matching, or moment analysis). Sources such as order-dependent rounding can produce structured, non-uniform effects that do not replicate the entropy-increasing action of temperature across the full vocabulary.
  2. [Abstract (empirical protocol and pilot experiments)] The empirical protocol for estimating T_n(I) is described at a high level but lacks concrete details on the matching procedure, metrics, or controls for confounding factors (e.g., how multiple runs at T=0 are compared to temperature sweeps in the reference system). Without these, it is impossible to assess whether the pilot experiments actually support the claimed equivalence.
minor comments (1)
  1. Clarify the precise mathematical relation between T_bg and T_n(I) upon first introduction; the current phrasing leaves open whether T_bg is defined as identical to T_n(I) or merely estimated by it.
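
The distributional checks the referee asks for could look like the sketch below. It is an illustration only: the logits, the observed token frequencies, and the candidate temperature grid are hypothetical, and matching by KL divergence is one of the options the referee lists rather than the paper's stated procedure.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(logits / temperature), computed stably; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; infinite if q is zero where p is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

# Hypothetical next-token logits with a near-tie between tokens 0 and 1.
logits = [2.0, 1.9, 0.0]
# Hypothetical token frequencies observed over repeated runs at nominal T = 0.
observed = [0.9, 0.1, 0.0]

candidates = [0.01, 0.02, 0.05, 0.1]
best_t = min(
    candidates,
    key=lambda t: kl_divergence(observed, softmax_with_temperature(logits, t)),
)
print(best_t, entropy(observed), entropy(softmax_with_temperature(logits, best_t)))
```

A small residual KL at the best temperature would support the equivalence for that prompt, while a large one would indicate the structured, non-temperature-like effects the first major comment warns about.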

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our short note. The comments correctly note that the abstract is high-level and could better support the core claims with additional clarification. We address each major comment below and will revise the manuscript accordingly to improve transparency without altering the note's scope or conclusions.

Point-by-point responses
  1. Referee: The core modeling step—that implementation perturbations induce an output distribution equivalent to temperature scaling of logits before softmax—is load-bearing for the definition of T_bg and the protocol using T_n(I), yet the abstract supplies no distributional justification or comparison (e.g., via KL divergence, entropy matching, or moment analysis). Sources such as order-dependent rounding can produce structured, non-uniform effects that do not replicate the entropy-increasing action of temperature across the full vocabulary.

    Authors: We agree the abstract does not detail distributional comparisons. The manuscript defines T_bg operationally via the inference environment I and relates it to T_n(I) through empirical matching of output statistics (e.g., entropy or diversity measures) rather than assuming exact equivalence for every perturbation source. Structured effects such as non-associativity are treated as contributing to net effective randomness. We will revise the abstract to state explicitly that equivalence is operational and metric-based (entropy or KL matching), not a claim of identical mechanisms across all implementation artifacts. This preserves the framework while addressing the concern.
    revision: yes

  2. Referee: The empirical protocol for estimating T_n(I) is described at a high level but lacks concrete details on the matching procedure, metrics, or controls for confounding factors (e.g., how multiple runs at T=0 are compared to temperature sweeps in the reference system). Without these, it is impossible to assess whether the pilot experiments actually support the claimed equivalence.

    Authors: The referee correctly observes that the abstract omits protocol specifics. The full manuscript outlines the protocol as comparing variability from repeated T=0 runs under I against temperature sweeps in a reference system, using metrics such as output entropy and unique response rates, with controls including fixed prompts and averaging across trials. We will expand the abstract with a concise description of the matching procedure and metrics, and augment the pilot experiments section with further details on controls and the reference implementation. These changes will make the empirical grounding more transparent.
    revision: yes
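
The two metrics named in this response can be sketched in a few lines. The definitions below are assumptions, since the page does not give the manuscript's formulas: the unique response rate is taken as distinct responses divided by total responses, and output entropy as the Shannon entropy (in bits) of the empirical response distribution. The sample responses are made up.

```python
import math
from collections import Counter

def unique_response_rate(responses):
    """Distinct responses divided by total responses (assumed definition)."""
    return len(set(responses)) / len(responses)

def response_entropy(responses):
    """Shannon entropy, in bits, of the empirical distribution over responses."""
    total = len(responses)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(responses).values())

# Four repeated runs at nominal T = 0; one differs only by a trailing period.
runs = ["Paris", "Paris", "Paris.", "Paris"]
print(unique_response_rate(runs))  # 0.5
print(response_entropy(runs))
```

Both metrics compare whole strings, so a whitespace or punctuation difference counts as a distinct response; whether to normalise outputs first is a design choice the protocol would need to state.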

Circularity Check

0 steps flagged

New definitional concept with empirical protocol; no reduction to inputs by construction

full rationale

The paper introduces background temperature T_bg purely as a new formalization of observed nondeterminism at nominal T=0, defines its relation to an equivalent T_n(I) in a reference system, and outlines an empirical estimation protocol. No equations, derivations, or predictions are shown that reduce claimed results back to fitted parameters, self-citations, or ansatzes by construction. The contribution remains self-contained as a definitional and measurement framework without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution rests on defining T_bg as an effective temperature from implementation perturbations and proposing an empirical estimation method; no free parameters, background axioms, or new physical entities are stated in the abstract.

invented entities (1)
  • background temperature T_bg · no independent evidence
    purpose: To quantify and characterize hidden randomness from implementation details even at nominal temperature zero
    Newly introduced quantity defined in terms of an equivalent temperature in a reference system

pith-pipeline@v0.9.0 · 5461 in / 1115 out tokens · 48082 ms · 2026-05-08T12:10:39.936495+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages

  1. [1]

    Llm stability: A detailed analysis with some surprises

    Berk Atil, Sarp Aykent, Alexa Chittams, Lisheng Fu, Rebecca J. Passonneau, Evan Radcliffe, Guru Rajan Rajagopal, Adam Sloan, Tomasz Tudrej, Ferhan Ture, Zhe Wu, Lixinyu Xu, and Breck Baldwin. Non-determinism of “deterministic” llm settings. arXiv, 2408.04667,

  2. [2]

    Version 5, accessed 2025-09-15

  3. [3]

    SmolLM3: smol, multilingual, long-context reasoner. https://huggingface.co/blog/smollm3, 2025

    Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, Carlos Miguel Patiño, Edward Beeching, Aymeric Roucher, Aksel Joonas Reedi, Quentin Gallouédec, Kashif Rasul, Nathan Habib, Clémentine Fourrier, Hynek Kydlicek, Guilherme Penedo, Hugo Larcher, Mathieu Morlon, Vaibhav Srivastav, Joshua Lochner, Xuan-Son Nguyen, Colin Raff...

  4. [4]

    Defeating nondeterminism in llm inference

    Horace He and Thinking Machines Lab. Defeating nondeterminism in llm inference. Thinking Machines Lab blog, 2025. Accessed: 2025-09-15

  5. [5]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics

  6. [6]

    Truthfulqa: Measuring how models mimic human falsehoods, 2021

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2021

  7. [7]

    Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,

    Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. An empirical study of the non-determinism of chatgpt in code generation. In arXiv preprint, volume 2308.02828, 2023. Accessed 2025-09-15

  8. [8]

    Document analysis with llms: Assessing performance, bias, and nondeterminism in decision making

    S. Price and D. L. Cote. Document analysis with llms: Assessing performance, bias, and nondeterminism in decision making. In ICPRAM 2025: Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods, pages 207–214, 2025. ISBN: 978-989-758-730-6.

  9. [9]

    Improving the reproducibility of deep learning software: An initial investigation through a case study analysis

    Nikita Ravi, Abhinav Goel, James C. Davis, and George K. Thiruvathukal. Improving the reproducibility of deep learning software: An initial investigation through a case study analysis. arXiv preprint, arXiv:2505.03165, 2025. Accessed: 2025-09-15

  10. [10]

    Impacts of floating-point non-associativity on reproducibility for hpc and deep learning applications

    Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Oscar Hernandez, Mark Coletti, and Ada Sedova. Impacts of floating-point non-associativity on reproducibility for hpc and deep learning applications. arXiv preprint, arXiv:2408.05148, 2024. Accessed: 2025-09-15

  11. [11]

    The good, the bad, and the greedy: Evaluation of llms should not ignore non-determinism, 2024

    Yifan Song, Guoyin Wang, Sujian Li, and Bill Yuchen Lin. Evaluation of llms should not ignore non-determinism. arXiv, 2407.10457, 2024. Accessed 2025-09-15.