arxiv: 2605.15588 · v2 · pith:ISXL2LIVnew · submitted 2026-05-15 · 💻 cs.CL · cs.LG

Calibrating LLMs with Semantic-level Reward

Pith reviewed 2026-05-22 09:30 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LLM calibration · semantic reward · reinforcement learning · uncertainty estimation · question answering · expected calibration error · AUROC · verbalized confidence

The pith

Semantic-level rewards train LLMs so that sampled answers agree when they are correct and diverge when they are wrong, which improves calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need reliable uncertainty estimates for safe use in medical or legal settings. Binary correctness rewards in reinforcement learning provide no signal for overconfident errors. Verbalized confidence scores prove inconsistent across rephrasings that share the same meaning. The paper introduces Calibration with Semantic Reward to operate directly in semantic space. It combines a correctness reward with a semantic calibration reward that promotes agreement among correct rollouts and discourages spurious consistency among incorrect ones. Experiments across model families and datasets show consistent gains in calibration metrics.

Core claim

CSR adds a semantic calibration reward to the standard correctness reward. The added reward encourages exploitation among correct rollouts by rewarding semantic agreement, and exploration among incorrect rollouts by discouraging spurious consistency. The paper reports that the resulting models reach lower expected calibration error and higher AUROC than verbalized-confidence baselines, with the gains holding on both in-distribution and out-of-distribution question-answering tasks.

What carries the argument

The semantic calibration reward, which measures agreement among multiple model rollouts and applies it differently depending on whether the prediction is correct or incorrect.
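The simulated rebuttal further down this page states one concrete form for this reward: average pairwise cosine similarity of rollout embeddings, shifted by 0.5 and multiplied by correctness. The sketch below is a minimal illustration of that description, not the paper's Eq. 5. It rests on two assumptions: agreement is computed per rollout as its mean similarity to the other rollouts, and correctness acts as a signed value (+1 correct, -1 incorrect), since a 0/1 indicator would zero out every incorrect rollout and could not penalize spurious agreement. The function names and the toy 2-D vectors are hypothetical; a real setup would take embeddings from a sentence encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity_to_others(embeddings, i):
    """Average cosine similarity of rollout i to every other rollout."""
    others = [e for j, e in enumerate(embeddings) if j != i]
    return sum(cosine(embeddings[i], e) for e in others) / len(others)


def semantic_calibration_reward(embeddings, is_correct, offset=0.5):
    """Hypothetical per-rollout reward: sign * (agreement - offset).

    ASSUMPTION: sign is +1 for a correct rollout and -1 for an incorrect one.
    Requires at least two rollouts.
    """
    rewards = []
    for i, ok in enumerate(is_correct):
        agreement = mean_similarity_to_others(embeddings, i)
        sign = 1.0 if ok else -1.0
        rewards.append(sign * (agreement - offset))
    return rewards


# Toy 2-D "embeddings": three rollouts agree, one points elsewhere.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_rewards = semantic_calibration_reward(
    toy_embeddings, is_correct=[True, True, True, False]
)

# Three identical but incorrect rollouts: agreement on an error is penalized.
spurious_rewards = semantic_calibration_reward(
    [[1.0, 0.0]] * 3, is_correct=[False, False, False]
)
```

In the toy group, each agreeing correct rollout receives about 0.17 and the lone dissenting incorrect rollout receives 0.5, while three incorrect rollouts that all agree each receive -0.5. Under these assumptions the sign flip is what turns agreement into a penalty on the incorrect side.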

If this is right

  • Calibration improvements appear on both in-distribution HotpotQA and out-of-distribution TriviaQA, MSMARCO, and NQ-Open.
  • Expected calibration error drops by up to 40 percent and AUROC rises by up to 31 percent compared with verbalized-confidence baselines.
  • The method works across three distinct model families without requiring a verbalized confidence output interface.
  • Calibration behavior generalizes robustly rather than overfitting to the training distribution.
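For readers less familiar with AUROC in this setting: it measures how well a confidence score ranks correct answers above incorrect ones. A minimal sketch follows, using the pairwise (Mann-Whitney) formulation with ties counted as half. The function name and the toy inputs are illustrative and are not taken from the paper.

```python
def auroc(confidences, correct):
    """Probability that a random correct answer outranks a random incorrect one.

    Ties between a correct and an incorrect answer count as half a win.
    """
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: one correct answer (0.4) is ranked below an incorrect one (0.6).
toy_auroc = auroc([0.9, 0.4, 0.6, 0.3], [True, True, False, False])
```

The toy call gives 0.75, because three of the four correct/incorrect pairs are ordered properly. The quadratic loop is fine for small evaluation sets; a rank-based implementation would be the usual choice at scale.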

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic reward structure could be tested on generative tasks such as summarization where token-level confidence is even less meaningful.
  • Different choices of semantic similarity function for measuring rollout agreement might further strengthen or weaken the reward signal.
  • Combining CSR with existing post-training calibration methods could produce additive gains in reliability.

Load-bearing premise

Semantic agreement measured across rollouts reliably separates correct predictions from incorrect ones without introducing new inconsistencies or biases.

What would settle it

Finding that CSR produces higher expected calibration error than verbalized baselines on a fresh dataset or with altered rollout sampling would show the central claim does not hold.
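A check of this kind comes down to computing ECE for both methods on the same questions. The sketch below uses the common equal-width binning definition with 10 bins; the bin count, the function name, and the toy inputs are assumptions, since this page does not state the paper's exact evaluation protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Bins are [k/n_bins, (k+1)/n_bins); a confidence of exactly 1.0 falls
    into the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, 1.0 if y else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: four answers at 0.95 confidence, three of them correct.
toy_ece = expected_calibration_error([0.95] * 4, [True, True, True, False])
```

Here all four answers land in one bin with mean confidence 0.95 and accuracy 0.75, so the ECE is 0.20. Running the same function on CSR confidences and on verbalized confidences for a fresh dataset would give the comparison described above.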

Figures

Figures reproduced from arXiv: 2605.15588 by Dongxia Wu, Fengfei Yu, Rose Yu, Ruijia Niu, Yian Ma.

Figure 1: Verbal confidence is not always stable. (a) For a single sampled answer, the decoded confidence is inherently …
Figure 2: Overview of CSR. For each question x, we draw K rollouts from policy model π_θ, score each rollout with a verifiable correctness reward r_RLVR and a semantic calibration reward r_Calibration (Eq. 5), and update the policy via group-relative advantages. CSR concentrates correct rollouts into a tight semantic cluster while keeping incorrect rollouts dispersed, such that semantic agreement among sampled outputs …
Figure 3: AUROC summary plots for Qwen across in-domain (HotpotQA) and three OOD datasets.
Figure 4: Ablation studies. (a) Reward components: RLVR alone improves accuracy but not ECE, the calibration …

Original abstract

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31% over verbalized-confidence baselines, with calibration behavior generalizing robustly across all four evaluation settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Calibration with Semantic Reward (CSR), a framework that augments reinforcement learning with verifiable rewards (RLVR) by adding a novel semantic calibration reward. This reward encourages semantic agreement among correct rollouts while discouraging spurious consistency among incorrect ones, aiming to calibrate LLMs directly in semantic space rather than via verbalized confidence scores. Experiments across three model families evaluate the approach on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution), reporting consistent reductions in Expected Calibration Error (ECE) of up to 40% and improvements in AUROC of up to 31% relative to verbalized-confidence baselines.

Significance. If the central empirical claims hold after verification of the semantic agreement procedure, the work would offer a meaningful advance in LLM calibration for high-stakes applications by moving beyond token-level verbalized confidence. The reported generalization across in-distribution and out-of-distribution settings and multiple model families is a positive indicator of robustness. The approach also provides a concrete alternative to standard RLVR that explicitly targets calibration rather than correctness alone.

major comments (3)
  1. [Method] Method section (around the definition of the semantic calibration reward): The procedure for quantifying semantic agreement across multiple rollouts is load-bearing for the central claim yet remains underspecified. It is unclear whether agreement is measured via embeddings, clustering, an auxiliary LLM judge, or another method, and how this metric is guaranteed to separate correct from incorrect predictions without introducing distribution-specific biases or reinforcing spurious consistency on errors. Without the exact algorithm or formula, it is impossible to evaluate whether the reported ECE and AUROC gains are attributable to improved semantic calibration or to artifacts of the agreement metric.
  2. [Experiments] Experimental results section: The manuscript reports ECE reductions up to 40% and AUROC gains up to 31% across nearly all settings but provides no error bars, standard deviations across random seeds, or statistical significance tests. Given the stochasticity of rollout sampling and RL training, the absence of these details makes it difficult to determine whether the improvements are reliable or could be explained by variance, undermining the strength of the cross-model and cross-dataset claims.
  3. [Reward Formulation] §4 (or reward formulation subsection): The reward combination weights are listed among the free parameters. The paper should clarify whether these weights are tuned per dataset or held fixed, and how sensitive the ECE/AUROC improvements are to their choice. If tuning is required, this affects the claim that CSR provides a more robust calibration signal than verbalized baselines.
minor comments (2)
  1. [Abstract] Abstract and introduction: The description of how the semantic calibration reward 'encourages exploitation among correct rollouts' and 'exploration among incorrect ones' could be made more precise by including a short mathematical sketch of the reward term.
  2. [Figures/Tables] Figure captions and tables: Ensure all plots of ECE and AUROC include the number of rollouts used for semantic agreement computation and the exact evaluation protocol for verbalized baselines to allow direct replication.
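Major comment 2 asks for seed-level variability and significance tests. As a rough picture of what such a test involves, the sketch below computes a paired t statistic over per-seed ECE values for two methods. The helper name is hypothetical, and the numbers are made up purely for illustration; they are not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_a, method_b):
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


# ILLUSTRATIVE, made-up ECE values for three seeds (not from the paper).
ece_csr = [0.13, 0.14, 0.15]
ece_verbalized = [0.20, 0.22, 0.21]

t_stat, dof = paired_t_statistic(ece_csr, ece_verbalized)
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30. Small seed counts therefore need large and consistent gaps to reach significance, which is part of why the referee's request matters.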

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major point below and describe the revisions we will make to improve clarity and strengthen the empirical claims.

  1. Referee: [Method] Method section (around the definition of the semantic calibration reward): The procedure for quantifying semantic agreement across multiple rollouts is load-bearing for the central claim yet remains underspecified. It is unclear whether agreement is measured via embeddings, clustering, an auxiliary LLM judge, or another method, and how this metric is guaranteed to separate correct from incorrect predictions without introducing distribution-specific biases or reinforcing spurious consistency on errors. Without the exact algorithm or formula, it is impossible to evaluate whether the reported ECE and AUROC gains are attributable to improved semantic calibration or to artifacts of the agreement metric.

    Authors: We appreciate this observation and agree that greater specificity is needed. Semantic agreement is quantified via average pairwise cosine similarity of sentence embeddings from a fixed Sentence-BERT encoder applied to the rollouts. The semantic calibration reward is r_sem = correctness * (avg_similarity - 0.5), which rewards high agreement among correct rollouts and penalizes spurious agreement among incorrect ones. We will add the precise formula, pseudocode, and a short discussion of separation properties (correct answers tend to cluster semantically while errors do not) to the method section in the revision.
    Revision: yes

  2. Referee: [Experiments] Experimental results section: The manuscript reports ECE reductions up to 40% and AUROC gains up to 31% across nearly all settings but provides no error bars, standard deviations across random seeds, or statistical significance tests. Given the stochasticity of rollout sampling and RL training, the absence of these details makes it difficult to determine whether the improvements are reliable or could be explained by variance, undermining the strength of the cross-model and cross-dataset claims.

    Authors: We agree that variability measures are important given the stochasticity of sampling and RL training. The current results reflect single runs due to compute limits. In the revision we will rerun key experiments with 3 random seeds, report standard deviations and error bars on all ECE/AUROC tables, and include paired statistical tests (e.g., t-tests) against the verbalized-confidence baselines to quantify reliability of the reported gains.
    Revision: yes

  3. Referee: [Reward Formulation] §4 (or reward formulation subsection): The reward combination weights are listed among the free parameters. The paper should clarify whether these weights are tuned per dataset or held fixed, and how sensitive the ECE/AUROC improvements are to their choice. If tuning is required, this affects the claim that CSR provides a more robust calibration signal than verbalized baselines.

    Authors: The weights (λ=0.7 for correctness, μ=0.3 for semantic reward) are held fixed across all models and datasets; they were selected once on a small held-out portion of HotpotQA and not retuned per evaluation set. To demonstrate robustness we will add a sensitivity plot in the appendix showing ECE/AUROC for λ in [0.5, 0.9], confirming that gains persist without per-dataset tuning. This supports rather than undermines the robustness claim.
    Revision: partial
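To make the weighting in response 3 concrete, the sketch below combines a correctness reward and a semantic reward with the stated λ=0.7 and μ=0.3, then turns the combined rewards into group-relative advantages as the Figure 2 caption describes. The normalization used here (subtract the group mean, divide by the group standard deviation) is an assumption in the style of GRPO; this page does not give the paper's exact formula. The function names and toy reward values are hypothetical.

```python
from statistics import mean, pstdev


def combined_rewards(r_correct, r_semantic, lam=0.7, mu=0.3):
    """Weighted sum of the two reward terms for each rollout in a group."""
    return [lam * rc + mu * rs for rc, rs in zip(r_correct, r_semantic)]


def group_relative_advantages(rewards, eps=1e-8):
    """ASSUMED normalization: (reward - group mean) / (group std + eps)."""
    center = mean(rewards)
    scale = pstdev(rewards)
    return [(r - center) / (scale + eps) for r in rewards]


# Toy group of four rollouts for one question (values are illustrative only).
r_correct = [1.0, 1.0, 0.0, 0.0]      # binary verifiable correctness
r_semantic = [0.2, 0.2, -0.3, 0.5]    # e.g. output of a semantic reward

rewards = combined_rewards(r_correct, r_semantic)
advantages = group_relative_advantages(rewards)
```

In this toy group the two correct rollouts end up with the largest advantages, and the incorrect rollout that agrees with others (semantic reward -0.3) ranks below the incorrect rollout that stays dispersed (semantic reward 0.5). Sweeping `lam` over [0.5, 0.9] with `mu = 1 - lam` would be one way to set up the sensitivity analysis the authors promise, though the pairing of λ and μ in that sweep is not specified here.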

Circularity Check

0 steps flagged

No significant circularity in the empirical CSR calibration framework

The paper defines CSR as a combination of a standard binary correctness reward with a novel semantic calibration reward that promotes agreement among correct rollouts and discourages spurious consistency among incorrect ones. This construction relies on external correctness labels and independently measured semantic agreement (not derived from the model's fitted parameters or predictions by construction). The central claims of reduced ECE (up to 40%) and improved AUROC (up to 31%) are established through direct experiments on HotpotQA (in-distribution) plus TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) across three model families, compared against verbalized-confidence baselines. No equations, self-citations, or uniqueness theorems are invoked that reduce the reported performance gains to tautological redefinitions of the input rewards or fitted quantities. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the newly introduced semantic reward and on standard assumptions in RL for LLMs; no new physical entities are postulated.

free parameters (1)
  • reward combination weights
    The balance between the correctness reward and the semantic calibration reward is a tunable hyperparameter that must be chosen or fitted to achieve the reported gains.
axioms (1)
  • domain assumption: Semantic similarity between rollouts can be computed reliably and used as a stable signal for calibration
    The method relies on this to define the novel reward without verbalized outputs.
invented entities (1)
  • semantic calibration reward (no independent evidence)
    purpose: To provide a direct semantic-space signal for calibration instead of token-level verbalized confidence
    New reward component introduced by the paper to address inconsistency across textual variations.

pith-pipeline@v0.9.0 · 5791 in / 1432 out tokens · 38248 ms · 2026-05-22T09:30:01.087253+00:00 · methodology
