arxiv: 2606.06238 · v1 · pith:WENSMHBWnew · submitted 2026-06-04 · 💻 cs.LG · cond-mat.stat-mech · hep-lat

Generative Criticality in Large Language Model Temperature Scaling

Pith reviewed 2026-06-28 02:36 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.stat-mech · hep-lat
keywords large language models · temperature scaling · critical phenomena · susceptibility · order parameter · intrinsic dimension · phase transition · token embeddings

The pith

Treating LLM token embeddings as spins on a chain reveals a critical temperature Tc marked by a susceptibility peak and output collapse to one direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models token embeddings produced by large language models as continuous spin variables arranged on a one-dimensional chain. It defines a susceptibility from the connected two-point correlator of these embeddings and an order parameter from their ensemble average, then scans the softmax temperature T. A sharp peak in susceptibility appears near a characteristic Tc, accompanied by power-law-like scaling, a rapid shift in the order parameter, and a collapse of outputs onto a single semantic direction below Tc. The intrinsic dimension computed by the TwoNN estimator reaches a minimum at the same point. These signatures appear consistently across model sizes from 0.6B to 32B parameters and across prompt categories. A reader might care because the framework supplies concrete statistical-mechanics observables for diagnosing when generated text becomes more ordered or diverse.
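To make the scanned control parameter concrete, here is a minimal sketch of temperature-scaled softmax, assuming the standard convention p_i ∝ exp(z_i / T). The logits are made-up values for illustration and the function names are hypothetical; nothing here is taken from the paper's code.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    """Return softmax(logits / T). Requires T > 0."""
    if T <= 0:
        raise ValueError("T must be positive; the T -> 0 limit is greedy decoding")
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

logits = np.array([2.0, 1.0, 0.5, -1.0])   # hypothetical next-token logits
for T in (0.5, 1.0, 1.4, 5.0):
    p = softmax_with_temperature(logits, T)
    print(f"T={T:>4}: p_max={p.max():.3f}, entropy={entropy(p):.3f}")
```

Low T concentrates probability on the top token and high T flattens the distribution, which is the ordered-versus-diverse axis the paper scans.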

Core claim

Defining a susceptibility from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field, we vary the softmax temperature T and observe a sharp susceptibility peak near a characteristic Tc with power-law-like scaling, a concurrent rapid change in the order parameter, and a collapse onto a single semantic direction below Tc. The intrinsic dimension estimated by the two nearest neighbor (TwoNN) method independently corroborates these findings, reaching a minimum near Tc. Results are robust across model scales (Qwen3: 0.6B–32B) and prompt categories. While the phenomenology closely resembles a continuous phase transition, the non-equilibrium nature of autoregressive generation warrants further investigation.

What carries the argument

Susceptibility constructed from the connected two-point correlator of token embeddings treated as continuous spin variables on a one-dimensional chain.
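One way such observables could be computed is sketched below. The definitions are assumptions, since this page does not reproduce the paper's formulas: the order parameter is taken as the norm of the embedding averaged over samples and chain positions, and the susceptibility as (1/L) Σ_{i,j} [⟨s_i·s_j⟩ − ⟨s_i⟩·⟨s_j⟩], with ⟨·⟩ the average over independent generations. The array layout (samples, positions, dimensions) and all function names are hypothetical.

```python
import numpy as np

def order_parameter(E):
    """Norm of the embedding field averaged over samples and chain positions."""
    return float(np.linalg.norm(E.mean(axis=(0, 1))))

def susceptibility(E):
    """(1/L) * sum_{i,j} [<s_i . s_j> - <s_i> . <s_j>] over the sample ensemble."""
    length = E.shape[1]
    S = E.sum(axis=1)                 # summed field per sample, shape (N, d)
    fluct = (S ** 2).sum(axis=1).mean() - (S.mean(axis=0) ** 2).sum()
    return float(fluct / length)

def susceptibility_explicit(E):
    """Same quantity via the explicit L x L connected correlator (cross-check)."""
    n_samples, length, _ = E.shape
    two_point = np.einsum("nid,njd->ij", E, E) / n_samples   # <s_i . s_j>
    mean_field = E.mean(axis=0)                               # <s_i>, shape (L, d)
    connected = two_point - mean_field @ mean_field.T
    return float(connected.sum() / length)

# Synthetic stand-in for embeddings: 200 samples, chain length 16, dimension 8.
rng = np.random.default_rng(0)
E = rng.normal(size=(200, 16, 8))
print("order parameter:", order_parameter(E))
print("susceptibility :", susceptibility(E))
print("explicit check :", susceptibility_explicit(E))
```

The first form uses the identity Σ_{i,j} s_i·s_j = |Σ_i s_i|², so it avoids building the L × L correlator; the explicit version is kept only as a consistency check.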

If this is right

  • Below Tc the generated embeddings collapse onto a single semantic direction.
  • The intrinsic dimension of the output distribution reaches a minimum near Tc.
  • The observed scaling of susceptibility near Tc is consistent with power-law behavior.
  • The same signatures appear across model scales from 0.6B to 32B parameters.
  • The framework supplies quantitative observables for the collective structure of LLM outputs.
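The power-law bullet above can be illustrated with a log-log fit. This is a sketch under stated assumptions: the form χ ≈ A·|T − Tc|^(−γ) is fitted on one side of a Tc that is held fixed, and the data are synthetic, generated with made-up parameters rather than the paper's measurements. The helper name `fit_power_law` is hypothetical.

```python
import numpy as np

def fit_power_law(T, chi, Tc):
    """Least-squares fit of log(chi) = log(A) - gamma * log|T - Tc| on one side of Tc."""
    t = np.abs(np.asarray(T, dtype=float) - Tc)
    slope, intercept = np.polyfit(np.log(t), np.log(chi), deg=1)
    return float(-slope), float(np.exp(intercept))    # (gamma, A)

# Synthetic data with made-up parameters, not the paper's measurements.
rng = np.random.default_rng(0)
Tc_true, gamma_true, A_true = 1.4, 0.1, 5.0
T = np.linspace(1.45, 3.0, 40)                         # points above Tc only
chi = A_true * (T - Tc_true) ** (-gamma_true)
chi = chi * np.exp(rng.normal(scale=0.01, size=T.size))  # small multiplicative noise

gamma_hat, A_hat = fit_power_law(T, chi, Tc_true)
print(f"gamma = {gamma_hat:.3f}, A = {A_hat:.3f}")
```

In practice Tc would also have to be estimated, for example by scanning candidate values and comparing fit quality, and the fitted exponent depends on the chosen fit window.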

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • An operating temperature near Tc could balance coherence and diversity in generation without external tuning.
  • The same correlator-based observables might be applied to other sampling methods such as nucleus sampling.
  • The noted non-equilibrium character implies the transition may be a dynamical crossover rather than a true thermodynamic phase transition.
  • Prompt engineering could shift the location of Tc by altering the effective interaction range along the token chain.

Load-bearing premise

Token embeddings generated by autoregressive sampling can be treated as continuous spin variables on a one-dimensional chain so that equilibrium statistical-mechanics quantities remain meaningful descriptors of the output distribution.

What would settle it

Repeating the full measurement protocol on a new family of models or with non-autoregressive sampling and finding no susceptibility peak at any temperature.

Figures

Figures reproduced from arXiv: 2606.06238 by Huajian Ruan, Jinyang Li, Lingxiao Wang, Xingyu Guo.

Figure 1: Susceptibility χ vs. temperature T for (a) different model scales using Wikipedia prompts (1000 samples, 280 tokens). All models exhibit a peak near Tc ≈ 1.4, with larger models showing higher peak susceptibility. (b) Different prompt categories generated by Qwen3-32B (1000 samples, 300 tokens).
Figure 2: Power-law fits of susceptibility near Tc for Qwen3-32B. (a) Full χ(T) curve with the fit region highlighted; (b) log-log plot for T > Tc; (c) log-log plot for T < Tc. Both sides yield a critical exponent γ ≈ 0.1, consistent with power-law divergence.
Figure 3: Stability analysis of the susceptibility signal. (a) Convergence with increasing ensemble …
Figure 4: Order parameter structure via PCA. (a) Projection onto the first two principal components …
Figure 5: Intrinsic dimension Id vs. temperature for five Qwen3 model scales. All models exhibit a minimum near Tc ≈ 1.3–1.5, after which Id rises sharply and saturates at high T, consistent with the onset of a disordered phase where the configuration space expands.
Figure 6: By evaluating the ratio of the second-nearest neighbor distance to the nearest neighbor …
Figure 7: Scaling-law tests at different sizes. Panels (a), (b), and (c) show exponent fitting for a …
Original abstract

We propose a statistical-field framework for text generated by large language models (LLMs), treating token embeddings as continuous spin variables on a one-dimensional chain. Defining a susceptibility from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field, we vary the \texttt{softmax} temperature $T$ and observe a sharp susceptibility peak near a characteristic $T_c$ with power-law-like scaling, a concurrent rapid change in the order parameter, and a collapse onto a single semantic direction below $T_c$. The intrinsic dimension estimated by the two nearest neighbor (TwoNN) method independently corroborates these findings, reaching a minimum near $T_c$. Results are robust across model scales (Qwen3: 0.6B--32B) and prompt categories. While the phenomenology closely resembles a continuous phase transition, the non-equilibrium nature of autoregressive generation warrants further investigation. Our framework provides quantitative tools for probing the collective statistical structure of LLM outputs and suggests connections between decoding strategies and critical phenomena.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a statistical-field framework treating LLM token embeddings as continuous spins on a 1D chain. Susceptibility is defined from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field; varying softmax temperature T yields a sharp susceptibility peak near a characteristic Tc with power-law-like scaling, a concurrent rapid change in the order parameter, semantic collapse below Tc, and a minimum in TwoNN intrinsic dimension near Tc. These features are reported as robust across Qwen3 model scales (0.6B–32B) and prompt categories, with the non-equilibrium character of autoregressive generation noted as requiring further study.

Significance. If the equilibrium interpretation can be substantiated, the work supplies quantitative, reproducible diagnostics (connected correlator, order parameter, TwoNN dimension) that link decoding temperature to collective phenomena and could inform temperature selection. The cross-scale robustness and independent corroboration by intrinsic dimension are concrete strengths. The central claim, however, rests on the applicability of equilibrium statistical mechanics to autoregressive output distributions.

major comments (2)
  1. [Abstract; §3 (framework definition)] The mapping of autoregressive token sequences to an equilibrium 1D Ising-like chain is assumed without demonstration that the joint distribution over embeddings satisfies fluctuation-dissipation or cluster properties required for the connected correlator to be interpreted as a susceptibility. This assumption is load-bearing for the claim that the observed peak constitutes criticality rather than a sequential-conditioning artifact (see skeptic concern and abstract caveat on non-equilibrium nature).
  2. [§4 (results and figures)] No information is supplied on how the connected two-point correlator is estimated from finite samples, whether bootstrap or jackknife error bars are computed, or what statistical test establishes the power-law scaling and the location of Tc. This directly affects the soundness of the peak and scaling claims.
minor comments (2)
  1. [§3] Clarify the precise centering and normalization used for the embedding field when defining the order parameter.
  2. [§4] Add a brief methods paragraph on the TwoNN implementation and its sensitivity to sample size.
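For the TwoNN request in minor comment 2, a compact NumPy sketch of the estimator is given here. It follows the published TwoNN recipe (ratio μ = r2/r1 of second- to first-nearest-neighbor distances, with −ln(1 − F(μ)) = d·ln μ fitted through the origin), but the discard fraction, the brute-force distance computation, and the synthetic test data are assumptions, not the authors' implementation.

```python
import numpy as np

def twonn_dimension(X, discard_fraction=0.1):
    """TwoNN intrinsic-dimension estimate from nearest-neighbor distance ratios."""
    if not 0.0 < discard_fraction < 1.0:
        raise ValueError("discard_fraction must lie strictly between 0 and 1")
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude self-distances
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]
    mask = r1 > 0                                    # drop exact duplicates
    mu = np.sort(r2[mask] / r1[mask])
    m = mu.size
    F = np.arange(1, m + 1) / m                      # empirical CDF of mu
    n_keep = int(m * (1.0 - discard_fraction))       # discard the largest ratios
    x = np.log(mu[:n_keep])
    y = -np.log(1.0 - F[:n_keep])
    return float((x * y).sum() / (x * x).sum())      # slope of a line through the origin

# Synthetic check: a 3-D uniform cloud rotated into 10 ambient dimensions.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))       # random orthogonal matrix
for n in (200, 1000):
    Z = np.hstack([rng.uniform(size=(n, 3)), np.zeros((n, 7))])
    print(f"n={n}: estimated dimension = {twonn_dimension(Z @ Q):.2f}")

d_hat = twonn_dimension(np.hstack([rng.uniform(size=(1000, 3)), np.zeros((1000, 7))]) @ Q)
```

Running the loop at several sample sizes is a simple way to probe the sample-size sensitivity the referee asks about; the pairwise-distance matrix scales as O(n²) in memory, so larger ensembles would need a neighbor-search structure instead.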

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the phenomenological scope of the framework while committing to added methodological detail.

Point-by-point responses
  1. Referee: [Abstract; §3 (framework definition)] The mapping of autoregressive token sequences to an equilibrium 1D Ising-like chain is assumed without demonstration that the joint distribution over embeddings satisfies fluctuation-dissipation or cluster properties required for the connected correlator to be interpreted as a susceptibility. This assumption is load-bearing for the claim that the observed peak constitutes criticality rather than a sequential-conditioning artifact (see skeptic concern and abstract caveat on non-equilibrium nature).

    Authors: We agree the equilibrium mapping is an analogy rather than a derived result. The abstract already states that 'the non-equilibrium nature of autoregressive generation warrants further investigation' and presents the features as phenomenological resemblance to criticality. The connected correlator is employed as a diagnostic of collective structure in the generated sequences. We will revise §3 to state more explicitly that fluctuation-dissipation relations are not demonstrated and that the peak may partly reflect sequential conditioning; the framework is offered as a practical tool rather than a claim of thermodynamic equilibrium. revision: partial

  2. Referee: [§4 (results and figures)] No information is supplied on how the connected two-point correlator is estimated from finite samples, whether bootstrap or jackknife error bars are computed, or what statistical test establishes the power-law scaling and the location of Tc. This directly affects the soundness of the peak and scaling claims.

    Authors: We accept that these procedural details are missing. In the revised manuscript we will add a dedicated paragraph in §4 describing: (i) the ensemble construction (multiple independent generations per prompt), (ii) the explicit estimator for the connected two-point correlator, (iii) bootstrap resampling for error bars on susceptibility and order-parameter curves, and (iv) the fitting protocol used to locate Tc and assess power-law scaling. revision: yes
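Item (iii) of the promised methods paragraph could look roughly like the sketch below: a nonparametric bootstrap that resamples whole generated sequences with replacement and recomputes the susceptibility each time. The susceptibility definition, the array layout (samples, positions, dimensions), and the synthetic data are assumptions for illustration, not the authors' protocol.

```python
import numpy as np

def susceptibility(E):
    """(1/L) * sum_{i,j} connected correlator, assuming E has shape (N, L, d)."""
    S = E.sum(axis=1)
    fluct = (S ** 2).sum(axis=1).mean() - (S.mean(axis=0) ** 2).sum()
    return float(fluct / E.shape[1])

def bootstrap_error(E, statistic, n_boot=200, seed=0):
    """Bootstrap mean and standard error of `statistic` over resampled sequences."""
    rng = np.random.default_rng(seed)
    n = E.shape[0]
    values = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample whole sequences with replacement
        values[b] = statistic(E[idx])
    return float(values.mean()), float(values.std(ddof=1))

# Synthetic stand-in for an ensemble of embedded generations.
rng = np.random.default_rng(42)
E = rng.normal(size=(300, 12, 4))
boot_mean, boot_err = bootstrap_error(E, susceptibility)
print(f"chi = {susceptibility(E):.3f}, bootstrap mean = {boot_mean:.3f} +/- {boot_err:.3f}")
```

Resampling at the level of whole sequences keeps the within-sequence correlations intact; if several generations share a prompt, resampling by prompt may be the more appropriate unit.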

Circularity Check

0 steps flagged

No circularity: direct measurement of defined observables under temperature variation

full rationale

The paper defines susceptibility via the connected two-point correlator of embedding vectors and an order parameter via the ensemble-averaged field, then directly computes these quantities while varying the softmax temperature T on sampled outputs. The reported peak near Tc, power-law-like scaling, order-parameter jump, and TwoNN dimension minimum are all empirical observations from this procedure, not quantities fitted to data and re-labeled as predictions. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the central claims; the framework is applied to generated sequences without reducing the target phenomenology to its own inputs by construction. The non-equilibrium caveat is explicitly noted, confirming the analysis remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on one central domain assumption; no free parameters or new entities are named in the abstract.

axioms (1)
  • domain assumption: Token embeddings can be treated as continuous spin variables on a one-dimensional chain.
    This premise is required to define the connected two-point correlator and susceptibility from the generated sequence.

pith-pipeline@v0.9.1-grok · 5716 in / 1340 out tokens · 49342 ms · 2026-06-28T02:36:12.369087+00:00 · methodology

