Improving Semantic Uncertainty Quantification in Language Model Question-Answering via Token-Level Temperature Scaling

Desi R. Ivanova; Philip H. S. Torr; Tim G. J. Rudner; Tom A. Lamb

arxiv: 2604.07172 · v1 · submitted 2026-04-08 · 💻 cs.LG

Pith reviewed 2026-05-10 17:44 UTC · model grok-4.3

classification 💻 cs.LG

keywords semantic uncertainty · temperature scaling · calibration · discrimination · language models · question answering · uncertainty quantification

The pith

Optimizing a single scalar temperature corrects miscalibration in semantic uncertainty for language model question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods for quantifying uncertainty in language models often overlook calibration in favor of discrimination, resulting in unreliable confidence scores. Fixed-temperature approaches lead to systematic miscalibration on question-answering tasks. The paper shows that optimizing one scalar temperature parameter offers an effective inductive bias, improving both calibration and discrimination metrics. This simple adjustment also enhances downstream entropy measures and surpasses more complex recalibration techniques. A reader cares because accurate uncertainty helps in deciding when to trust model answers in real-world use.

Core claim

By optimizing a single scalar temperature for token-level scaling, semantic confidence distributions in language models become better calibrated and more discriminative on question-answering tasks, outperforming fixed heuristics and expressive token-level methods while improving entropy.

What carries the argument

Token-level temperature scaling with a single optimized scalar that adjusts probability distributions to achieve semantic calibration.
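
As a rough illustration of the mechanism named above, the sketch below divides every token's logits by one shared temperature before the softmax, then sums the chosen tokens' log-probabilities into a sequence score. The function names, the toy logits, and the three-word vocabulary are assumptions made for illustration; the paper's exact formulation is not visible on this page.

```python
import math

def scaled_log_softmax(logits, temperature):
    """Log-probabilities of softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def sequence_log_prob(token_logits, token_ids, temperature):
    """Sum of temperature-scaled log-probs of the generated tokens."""
    return sum(
        scaled_log_softmax(logits, temperature)[tok]
        for logits, tok in zip(token_logits, token_ids)
    )

# Hypothetical two-token answer over a three-word vocabulary.
token_logits = [[4.0, 1.0, 0.0], [0.5, 3.0, 0.2]]
token_ids = [0, 1]  # the generated tokens (here, the top-scoring ones)

sharp = math.exp(sequence_log_prob(token_logits, token_ids, 0.5))
neutral = math.exp(sequence_log_prob(token_logits, token_ids, 1.0))
flat = math.exp(sequence_log_prob(token_logits, token_ids, 2.0))
print(sharp, neutral, flat)
```

Because the generated tokens in this toy case are the highest-scoring ones at each step, a temperature below 1 raises the sequence probability and a temperature above 1 lowers it. The same scalar is shared by every token position, which is what makes this a one-parameter adjustment.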

If this is right

Temperature scaling improves semantic calibration consistently across evaluated confidence measures.
It enhances discrimination and downstream entropy on QA tasks.
The method outperforms heuristic baselines and more complex recalibration approaches.
A single scalar suffices as a suitable inductive bias for this adjustment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests simpler global scaling may suffice for many semantic uncertainty tasks without per-token complexity.
The gains could extend to other text generation settings where reliable confidence is needed.
Optimal temperatures might transfer across models, allowing reuse without full re-optimization.

Load-bearing premise

A single scalar temperature provides a suitable inductive bias for semantic uncertainty in language models.

What would settle it

A new collection of question-answering tasks and confidence measures where optimizing the scalar temperature fails to reduce calibration error relative to fixed-temperature baselines.
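
A test of this kind needs a concrete calibration error. The sketch below uses a standard binned expected calibration error (ECE), which is one common choice; which calibration metrics the paper reports is not stated on this page, and the confidence values are made up for illustration rather than taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-size-weighted mean of |avg confidence - accuracy|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical data: 10 answers, 6 of them correct.
correct = [1] * 6 + [0] * 4
overconfident = [0.9] * 10  # stands in for a fixed-temperature baseline
matched = [0.6] * 10        # stands in for confidences after tuning

ece_overconfident = expected_calibration_error(overconfident, correct)
ece_matched = expected_calibration_error(matched, correct)
print(ece_overconfident, ece_matched)
```

The falsifying outcome described above would correspond to the tuned confidences failing to produce a lower value than the fixed-temperature ones on held-out questions.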

read the original abstract

Calibration is central to reliable semantic uncertainty quantification, yet prior work has largely focused on discrimination, neglecting calibration. As calibration and discrimination capture distinct aspects of uncertainty, focusing on discrimination alone yields an incomplete picture. We address this gap by systematically evaluating both aspects across a broad set of confidence measures. We show that current approaches, particularly fixed-temperature heuristics, produce systematically miscalibrated and poorly discriminative semantic confidence distributions. We demonstrate that optimising a single scalar temperature, which, we argue, provides a suitable inductive bias, is a surprisingly simple yet effective solution. Our exhaustive evaluation confirms that temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and more expressive token-level recalibration methods on question-answering tasks.
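
The abstract speaks of semantic confidence distributions and downstream entropy without defining them on this page. A minimal sketch, assuming the usual recipe of grouping sampled answers into meaning-equivalent clusters and summing their normalised sequence probabilities, might look as follows. The cluster assignments and probabilities are hypothetical, and the clustering step itself is taken as given.

```python
import math
from collections import defaultdict

def semantic_distribution(seq_log_probs, cluster_ids):
    """Normalise sequence probabilities and sum them within each cluster."""
    m = max(seq_log_probs)
    weights = [math.exp(lp - m) for lp in seq_log_probs]
    total = sum(weights)
    mass = defaultdict(float)
    for w, c in zip(weights, cluster_ids):
        mass[c] += w / total
    return dict(mass)

def entropy(dist):
    """Shannon entropy (in nats) of a cluster distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# Hypothetical: four sampled answers; the first two mean the same thing.
seq_log_probs = [math.log(0.4), math.log(0.3), math.log(0.2), math.log(0.1)]
cluster_ids = [0, 0, 1, 2]

dist = semantic_distribution(seq_log_probs, cluster_ids)
semantic_entropy = entropy(dist)
surface_entropy = entropy(semantic_distribution(seq_log_probs, [0, 1, 2, 3]))
print(dist, semantic_entropy, surface_entropy)
```

In this reading, the sequence log-probabilities would come from temperature-scaled token distributions, so the choice of temperature feeds directly into both the cluster confidences and the entropy computed from them.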

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A single learned scalar temperature improves semantic calibration and discrimination in LLM QA more than fixed-temperature heuristics or more expressive token-level recalibration methods, according to an evaluation described as broad but visible here only at the level of the abstract.

The paper's main takeaway is that tuning one temperature scalar gives better calibrated semantic uncertainty estimates than either fixed-temperature heuristics or more expressive token-level recalibration methods on question-answering tasks. The authors correctly note that prior work has mostly chased discrimination while ignoring whether the confidence scores line up with actual accuracy. They check both aspects across several confidence measures and report consistent gains in calibration, discrimination, and entropy after scaling. The finding that the single scalar beats the more expressive token-level methods is the part worth paying attention to, since the extra parameters do not help here. The evaluation spans multiple tasks and measures, which adds some weight to the results.

This is a straightforward extension of temperature scaling to semantic uncertainty rather than a new framework, and the empirical pattern holds up in the reported experiments. The inductive-bias argument for why a single scalar should work is thin and mostly explains the outcome after the fact instead of predicting it. Details on how the temperature was fit, whether the token-level baselines were tuned equally, and any statistical tests are not visible from the abstract, so post-hoc selection effects cannot be ruled out yet. No code or data artifacts are mentioned.

This is useful for people building or deploying language models where knowing when an answer is reliable matters, such as in safety or decision-support settings. Readers focused on calibration techniques for generative models will find the comparison practical.

The work shows clear separation of calibration from discrimination and has a reproducible experimental shape, so it deserves peer review. I would send it out, with reviewers asked to verify the optimization details and run additional controls on the token-level methods.

Referee Report

0 major / 1 minor

Summary. The paper claims that optimizing a single scalar temperature, which provides a suitable inductive bias, is a simple yet effective solution for semantic uncertainty quantification in language model question-answering. It argues that fixed-temperature heuristics produce miscalibrated and poorly discriminative confidence distributions, and demonstrates through exhaustive evaluation that temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy while outperforming both heuristic baselines and more expressive token-level recalibration methods.

Significance. If the empirical results hold under rigorous controls, this work is significant for showing that a minimal, single-parameter adjustment can outperform more complex recalibration techniques in improving the reliability of semantic uncertainty estimates in LLMs. It addresses an important gap by jointly evaluating calibration and discrimination rather than discrimination alone, and the emphasis on a parameter-efficient inductive bias could influence practical implementations of trustworthy QA systems.

minor comments (1)

[Abstract] The abstract refers to an 'exhaustive evaluation' and 'broad set of confidence measures' without quantifying the scope (e.g., number of tasks, models, or measures); adding this detail would improve immediate clarity for readers assessing the representativeness of the claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review, which accurately summarizes our contributions on the benefits of single-scalar temperature scaling for semantic uncertainty quantification. We are pleased that the referee highlights the value of jointly assessing calibration and discrimination, as well as the practical advantages of a parameter-efficient approach. Since the recommendation is for minor revision and no specific major comments were raised, we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity

The paper advances an empirical claim that optimizing a single scalar temperature improves semantic calibration, discrimination, and downstream entropy on QA tasks, outperforming fixed heuristics and token-level methods. This rests on exhaustive evaluation and data-driven fitting rather than any mathematical derivation chain. No self-definitional reductions, fitted inputs relabeled as predictions, load-bearing self-citations, or ansatzes smuggled via prior work are present; the inductive-bias argument is offered explicitly as an argument, not a tautological derivation. The work is self-contained against external benchmarks and does not reduce its central result to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that temperature scaling supplies an appropriate inductive bias for semantic confidence distributions and that calibration and discrimination are distinct and both worth optimizing.

free parameters (1)

scalar temperature
Single fitted parameter optimized to improve calibration of semantic confidence scores.
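
How the scalar is fitted is not visible on this page. As a sketch under stated assumptions, the block below follows the classic classifier recipe of choosing the temperature that minimises negative log-likelihood on held-out examples, using a plain grid search. The objective, the grid, and the toy logits are all assumptions and may differ from what the paper does.

```python
import math

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the labels under softmax(logits / T)."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total -= scaled[y] - log_norm
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the grid temperature with the lowest held-out NLL."""
    if grid is None:
        grid = [0.1 * k for k in range(1, 51)]  # candidate values 0.1 ... 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical overconfident model: large logit margins, yet 1 in 4 is wrong.
logit_rows = [[5.0, 0.0]] * 4
labels = [0, 0, 0, 1]

best_t = fit_temperature(logit_rows, labels)
print(best_t, nll(logit_rows, labels, 1.0), nll(logit_rows, labels, best_t))
```

With only one parameter to search, a grid is enough for a sketch; a one-dimensional optimiser would serve the same purpose. The fitted value lands well above 1 here because the toy model is overconfident.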

axioms (1)

domain assumption: Calibration and discrimination capture distinct aspects of uncertainty
Invoked to justify evaluating both and claiming prior work is incomplete.
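
The distinctness assumption can be illustrated with a small sketch: a strictly increasing rescaling of confidence scores leaves a ranking metric such as AUROC unchanged while moving the gap between mean confidence and accuracy. The scores, labels, and the rescaling are made up for illustration, and the single-bin calibration gap is a deliberately crude stand-in for the metrics a full evaluation would use.

```python
def auroc(scores, labels):
    """Probability that a correct answer outscores an incorrect one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_gap(scores, labels):
    """Absolute gap between mean confidence and accuracy (a one-bin ECE)."""
    return abs(sum(scores) / len(scores) - sum(labels) / len(labels))

# Hypothetical confidences for six answers, three of them correct.
labels = [1, 1, 0, 1, 0, 0]
overconfident = [0.99, 0.97, 0.96, 0.95, 0.93, 0.91]
# A strictly increasing map: the ordering is kept, the scale changes.
rescaled = [0.5 + 5.0 * (s - 0.95) for s in overconfident]

print(auroc(overconfident, labels), auroc(rescaled, labels))
print(calibration_gap(overconfident, labels), calibration_gap(rescaled, labels))
```

The ranking, and therefore the AUROC, is identical before and after, while the calibration gap shrinks. A temperature applied to token logits is a different operation, since it acts before token probabilities are aggregated into semantic confidences and so need not act as one monotone map across questions; that is presumably why the paper can report changes in discrimination as well.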

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We demonstrate that optimising a single scalar temperature, which, we argue, provides a suitable inductive bias, is a surprisingly simple yet effective solution.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: TS enforces a single global constraint across the entire generation. This global focus acts as a regulariser against overfitting to meaningless filler tokens

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
