
arxiv: 2507.16806 · v2 · pith:XRC2M3DMnew · submitted 2025-07-22 · 💻 cs.LG · cs.AI · cs.CL

Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Pith reviewed 2026-05-21 23:58 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords reinforcement learning · language models · calibration · Brier score · reasoning chains · uncertainty estimation · proper scoring rules

The pith

Augmenting binary rewards with Brier scores in RL training produces accurate and well-calibrated language model reasoners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Binary rewards in RL for LM reasoning improve accuracy but often worsen calibration, leading to more hallucinations on other tasks. RLCR adds a Brier score term to the reward so models generate both answers and numerical confidence estimates after reasoning. The paper proves that this composite reward, like any reward built on a bounded proper scoring rule, yields models whose predictions are accurate and well-calibrated. Experiments across datasets show RLCR improves calibration with no accuracy loss compared to standard RL, outperforms post-hoc confidence classifiers, and produces verbalized confidence that aids test-time scaling methods.

Core claim

By training with a composite reward of binary correctness plus Brier score on verbalized confidence, language models learn to generate reasoning chains that support both correct answers and accurate probability estimates, with theoretical guarantees from proper scoring rules and empirical gains in calibration metrics across in- and out-of-domain settings.

What carries the argument

The calibration-augmented reward function in RLCR, which scores both the correctness of the final answer and the accuracy of the accompanying confidence estimate using the Brier score.
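A minimal sketch of that reward, following the formula quoted in the Figure 2 text (binary correctness minus the squared gap between stated confidence and correctness). The function name `rlcr_reward` and the input validation are illustrative assumptions, not the authors' code.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Composite reward: correctness indicator minus a Brier penalty.

    `rlcr_reward` is a hypothetical name used for illustration only.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    c = 1.0 if is_correct else 0.0       # indicator that the answer is correct
    return c - (confidence - c) ** 2     # reward lies in [-1, 1]


# A confident wrong answer is punished hardest; a wrong answer with
# zero stated confidence receives reward 0.
print(rlcr_reward(True, 1.0), rlcr_reward(False, 1.0), rlcr_reward(False, 0.0))
```

Under this form, a correct answer always earns at least as much as an incorrect one: the smallest reward for a correct answer is 0 (confidence 0) and the largest reward for an incorrect one is also 0.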

If this is right

  • RLCR improves calibration metrics while maintaining accuracy on diverse QA datasets.
  • Verbalized confidence from RLCR models enables better accuracy and calibration through confidence-weighted scaling at test time.
  • The approach outperforms both standard RL and post-hoc trained confidence classifiers.
  • Ordinary RL training degrades calibration, whereas RLCR enhances it.
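The test-time scaling bullet can be sketched as follows. The Figure 4 caption compares vanilla majority vote, max-confidence selection, and confidence-weighted majority vote. The code below assumes each sample is an `(answer, confidence)` pair and that weighting means summing raw confidences per answer. The paper's exact weighting is not given on this page, so treat that choice and all function names as assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """Most frequent answer, ignoring confidence."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]


def max_confidence(samples):
    """Answer of the single most confident sample."""
    return max(samples, key=lambda s: s[1])[0]


def confidence_weighted_vote(samples):
    """Answer with the largest summed confidence (assumed weighting scheme)."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)


samples = [("42", 0.35), ("42", 0.30), ("17", 0.90)]
print(majority_vote(samples))             # "42": two of three samples agree
print(confidence_weighted_vote(samples))  # "17": 0.90 outweighs 0.35 + 0.30
```

All three functions expect a non-empty list and raise on an empty one.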

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This joint optimization could support safer use of reasoning models in settings where knowing when outputs are uncertain reduces risk.
  • The method might combine with other uncertainty techniques to handle different types of model uncertainty.
  • Scaling tests on longer or multi-step reasoning problems could check whether the calibration gains hold without extra stabilization.

Load-bearing premise

Language models can be trained via RL to produce meaningful numerical confidence estimates alongside reasoning chains, with the Brier score term optimized jointly without destabilizing policy gradients.

What would settle it

If RLCR models show no reduction in expected calibration error compared to standard RL models on out-of-domain tasks, the claim of improved calibration would be falsified.
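For readers who want to run that check, here is a rough sketch of expected calibration error with equal-width confidence bins. The bin count, the binning scheme, and the function name are assumptions; this page does not say how the paper bins its ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |average confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for q, c in zip(confidences, correct):
        idx = min(int(q * n_bins), n_bins - 1)  # q == 1.0 goes in the last bin
        bins[idx].append((q, float(c)))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(q for q, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four answers, all stated at 0.9 confidence, only one correct:
# the gap is |0.9 - 0.25| = 0.65.
print(expected_calibration_error([0.9] * 4, [1, 0, 0, 0]))
```

The comparison described above would compute this quantity for RLCR and standard-RL models on the same out-of-domain examples and look at the difference.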

Figures

Figures reproduced from arXiv: 2507.16806 by Idan Shenfeld, Isha Puri, Jacob Andreas, Leshem Choshen, Mehul Damani, Stewart Slocum, Yoon Kim.

Figure 1: (a) Sample chain-of-thought from a model trained with RLCR, using <think>, <answer>, <analysis>, and <confidence> tags. (b) On in-domain evaluation tasks, RLCR improves on standard reasoning training (RLVR) and even slightly outperforms a combination of RLVR and a dedicated classifier trained to predict RLVR correctness. (c) When evaluating generalization to novel tasks, RLCR improves both accuracy and ca…
Figure 2: (a) RLVR focuses solely on correctness, which can incentivize guessing. (b) RLCR uses a calibrated reward that jointly optimizes for correctness and calibration.
… reasoning chains that produce both answers and confidence estimates (as in Fig. 1a). They are then trained to optimize: $R_{\mathrm{RLCR}}(y, q, y^*) = \mathbb{1}_{y \equiv y^*} - (q - \mathbb{1}_{y \equiv y^*})^2$ (8). Intuitively, this reward incentivizes correctness but penalizes models when …
Figure 3: (a) Reward curves for RLCR (ours) and RLVR. Both correctness and calibration rewards improve under our method, demonstrating simultaneous gains in correctness and calibration. The Brier reward is shifted upward by 1 for clarity. (b) Completion lengths during training. The completion lengths of our method gradually increase during training as uncertainty reasoning improves.
4.1 EXPERIMENTAL SETUP Training D…
Figure 4: Test-time scaling curves. (a) Accuracy vs. Number of Samples (N). Accuracy improves for all methods with increasing compute. Confidence-weighted majority vote outperforms both vanilla majority vote and max-confidence, highlighting complementary benefits of combining voting with confidence scores. (b) Brier Scores vs. Ensemble Size (K). Here we evaluate the effect of applying test-time scaling to confidence es…
Figure 5: Brier scores (a) and ECE (b) of baseline / analysis classifiers on HotPotQA-Modified across three model sizes. Analysis classifiers outperform baselines at smaller sizes, suggesting that uncertainty CoT is essential for better calibration when capacity is limited.
Accuracy. Given N responses y1, y2, ..., yN and a reward model r(x, y), best-of-N selects the response with highest reward: ychosen = arg max {y1,.…
Figure 6: (a) Distribution of standard deviation in confidence across multiple uncertainty reasoning chains for the same solution/answer. Most samples exhibit low deviation, indicating that the model’s confidence estimates are self-consistent. (b) Swarm plot of confidence sums across 3 datasets. RLCR consistently remains closer to the ideal sum of 1. Nonetheless, overconfidence remains, suggesting room for further …
Original abstract

When language models (LMs) are trained via reinforcement learning (RL) to generate natural language "reasoning chains", their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or "hallucinate") in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score -- a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations -- outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models. Code, models, and further info is available at https://rl-calibration.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RLCR, an RL approach for training language models on reasoning tasks that augments standard binary correctness rewards with a Brier score term to jointly optimize accuracy and calibrated confidence estimates. It proves that any reward using a bounded proper scoring rule produces models whose predictions are both accurate and well-calibrated, demonstrates empirically that RLCR improves calibration with no accuracy loss on in- and out-of-domain tasks (outperforming ordinary RL and post-hoc classifiers), and shows that verbalized confidence can be used at test time for further gains via scaling methods.

Significance. If the central claims hold, the work is significant because it directly tackles the calibration degradation induced by binary rewards in RL for reasoning, a common issue leading to increased hallucinations. The theoretical result leverages standard properties of proper scoring rules to guarantee calibration under reward maximization, and the empirical results across diverse datasets provide evidence that explicit calibration optimization can yield more reliable reasoning models without sacrificing performance. Releasing code and models further strengthens the contribution for reproducibility.

major comments (2)
  1. [§4 (Theoretical Analysis)] The proof that bounded proper scoring rules (including the binary + Brier composite) yield accurate and well-calibrated models assumes exact expected-reward maximization. It does not address whether standard policy-gradient methods (e.g., REINFORCE) can reach this joint optimum given the differing scales and variances of the binary term and the (q − 1_{y≡y*})^2 Brier term, which is load-bearing for transferring the guarantee to practical LM training.
  2. [Experimental section (results on calibration and accuracy)] The reported improvements in calibration with no accuracy loss rest on the assumption that the Brier term can be optimized jointly without destabilizing updates. No ablations on reward scaling, normalization, or separate critics are described to mitigate gradient issues, weakening the claim that RLCR reliably outperforms ordinary RL in practice.
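To make the "differing scales and variances" in major comment 1 concrete, the sketch below computes the per-sample mean and variance of each reward term. It is an illustration, not an analysis from the paper, and it rests on a simplifying assumption: correctness is a Bernoulli(p) draw and the stated confidence q is held fixed. The function name is hypothetical.

```python
def reward_term_moments(p: float, q: float):
    """Mean and variance of the binary and Brier terms of the RLCR reward.

    Assumes correctness c ~ Bernoulli(p) and a fixed stated confidence q.
    """
    # Binary term: c itself.
    binary_mean = p
    binary_var = p * (1.0 - p)
    # Brier term: -(q - c)^2 takes the value `hit` with probability p
    # and the value `miss` with probability 1 - p.
    hit = -((q - 1.0) ** 2)
    miss = -(q ** 2)
    brier_mean = p * hit + (1.0 - p) * miss
    brier_var = p * (1.0 - p) * (hit - miss) ** 2  # two-point distribution
    return {"binary": (binary_mean, binary_var), "brier": (brier_mean, brier_var)}


print(reward_term_moments(0.7, 0.7))
```

Since `hit - miss` equals 2q − 1, the Brier term's variance in this setup is p(1 − p)(2q − 1)^2. That never exceeds the binary term's variance and is zero at q = 0.5. The sketch treats the two terms separately and ignores their covariance and any variation in q across samples, so it says nothing about the variance of actual policy-gradient estimates.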
minor comments (2)
  1. [Abstract] The abstract states that RLCR 'substantially improves calibration with no loss in accuracy' but should explicitly name the primary metrics (e.g., ECE, Brier score) and the number of datasets used in the one-paragraph summary.
  2. [Method] Notation for the composite reward function should be introduced earlier and used consistently when describing how the Brier score is added to the binary correctness signal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and describe the revisions we plan to make.

  1. Referee: [§4 (Theoretical Analysis)] The proof that bounded proper scoring rules (including the binary + Brier composite) yield accurate and well-calibrated models assumes exact expected-reward maximization. It does not address whether standard policy-gradient methods (e.g., REINFORCE) can reach this joint optimum given the differing scales and variances of the binary term and the (q − 1_{y≡y*})^2 Brier term, which is load-bearing for transferring the guarantee to practical LM training.

    Authors: We agree that the theoretical guarantee in §4 holds under exact expected-reward maximization, which is the natural setting for analyzing proper scoring rules. The proof establishes that any policy maximizing the composite reward (binary correctness plus bounded proper scoring rule) must be both accurate and calibrated. We do not claim that REINFORCE or other policy-gradient methods necessarily converge to this exact joint optimum, given the potential differences in scale and variance between the two reward components. Our empirical results across in-domain and out-of-domain tasks nevertheless show that RLCR consistently improves calibration without accuracy loss relative to standard RL, indicating that the practical optimization procedure is effective. In the revision we will add a brief discussion of this distinction between the exact optimum and the approximate optimization achieved by policy gradients, along with a note on the empirical evidence supporting transfer of the qualitative guarantee. revision: partial

  2. Referee: [Experimental section (results on calibration and accuracy)] The reported improvements in calibration with no accuracy loss rest on the assumption that the Brier term can be optimized jointly without destabilizing updates. No ablations on reward scaling, normalization, or separate critics are described to mitigate gradient issues, weakening the claim that RLCR reliably outperforms ordinary RL in practice.

    Authors: We acknowledge that the current experiments do not include explicit ablations on reward scaling, normalization, or separate critics, which would strengthen the robustness claims. In the experiments we combined the terms into a single scalar reward and observed stable training with the reported gains. To address this concern we will add ablations in the revised manuscript that vary the relative scaling of the Brier term, apply normalization to each component, and compare against a variant using separate value heads where feasible. These results will be included to demonstrate that the calibration improvements are not artifacts of a particular reward formulation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; calibration proof relies on external proper scoring rule properties


The paper's load-bearing theoretical step is the claim that a composite reward using a bounded proper scoring rule (binary correctness plus Brier score) produces accurate and calibrated models. This follows directly from the known mathematical property that proper scoring rules elicit truthful reporting of beliefs, which is a standard result in decision theory and is invoked as external background rather than derived or fitted inside the paper. No equations reduce a prediction to a fitted parameter by construction, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via self-citation. Empirical RL results and test-time scaling are presented as separate contributions without circular dependence on the proof. A minor self-citation for related RL reasoning methods may be present but is not load-bearing for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the mathematical property that proper scoring rules elicit calibrated predictions and on the assumption that this property transfers to RL policy optimization over language-model outputs.

axioms (1)
  • standard math Bounded proper scoring rules such as the Brier score incentivize calibrated probability estimates.
    Invoked in the proof that the combined reward yields well-calibrated models.
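For reference, the standard argument behind this axiom can be worked through for the Brier-based reward. This is a textbook derivation offered as a sketch, not the paper's proof. The symbol p (the true probability that a given answer is correct) is notation introduced here; q is the stated confidence.

```latex
\begin{align*}
\mathbb{E}[R(q)] &= p\,\bigl(1 - (q - 1)^2\bigr) + (1 - p)\,\bigl(0 - q^2\bigr) \\
\frac{d}{dq}\,\mathbb{E}[R(q)] &= -2p\,(q - 1) - 2(1 - p)\,q = 2\,(p - q) \\
\frac{d^2}{dq^2}\,\mathbb{E}[R(q)] &= -2 < 0
\end{align*}
```

The expected reward is strictly concave in q with its unique maximum at q = p, so reporting the true success probability is optimal. Substituting q = p gives an expected reward of p^2, which increases with p on [0, 1]. Under exact maximization, answers that are more likely to be correct are therefore preferred as well.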

pith-pipeline@v0.9.0 · 5856 in / 1180 out tokens · 51040 ms · 2026-05-21T23:58:14.364168+00:00 · methodology


Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 7.0

    Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...

  2. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 6.0

    Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.

  3. Process Supervision of Confidence Margin for Calibrated LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    RLCM trains LLMs with a margin-enhanced process reward that widens the gap between correct and incorrect reasoning steps, improving calibration on math, code, logic, and science tasks without hurting accuracy.

  4. Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement

    cs.LG 2026-04 unverdicted novelty 6.0

    RIC replaces single-pass label imitation with RL-driven iterative belief refinement, recovering cross-entropy optima while enabling adaptive halting via a value function.

  5. Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Unsupervised single-generation confidence calibration for reasoning LLMs via offline self-consistency proxy distillation outperforms baselines on math and QA tasks and improves selective prediction.

  6. Calibration-Aware Policy Optimization for Reasoning LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    CAPO improves LLM calibration by up to 15% while matching or exceeding GRPO accuracy through logistic AUC loss and noise masking, enabling better abstention and scaling performance.

  7. Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-03 unverdicted novelty 6.0

    DCPO decouples reasoning optimization from calibration in RLVR to fix overconfidence in LLMs without losing accuracy.

  8. Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

    cs.CL 2026-01 unverdicted novelty 6.0

    Neighbor-Consistency Belief (NCB) measures LLM belief robustness across conceptual neighborhoods, revealing that high-NCB facts resist contextual interference better, and Structure-Aware Training reduces brittleness b...

  9. AI Observability for Large Language Model Systems: A Multi-Layer Analysis of Monitoring Approaches from Confidence Calibration to Infrastructure Tracing

    cs.SE 2026-04 unverdicted novelty 3.0

    This analysis synthesizes recent LLM observability research into a five-layer framework and identifies the integration of model signals with infrastructure anomalies as the central open problem.
