arxiv: 2606.24281 · v1 · pith:XJAI36RVnew · submitted 2026-06-23 · 💻 cs.CL · cs.AI

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

Pith reviewed 2026-06-26 00:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords confidence calibration · reasoning language models · expected calibration error · pre-reasoning · post-reasoning · BigMathDigits · GPQA

The pith

Reasoning models achieve better-calibrated confidence when pre-thinking estimates are supervised with prompt-level success probability and post-answer estimates with answer correctness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that confidence estimates in language models change in meaning during reasoning. Before the model reasons, confidence should predict the chance of eventually solving the prompt correctly. After producing an answer, confidence should instead indicate whether that answer is right. CALIBER applies this distinction by eliciting both types of estimates and training each with the matching target. This produces lower calibration error on math reasoning and question answering benchmarks, with larger gains when the test data differs from training.

Core claim

CALIBER elicits both pre-reasoning and post-reasoning confidence estimates in language models. Pre-reasoning estimates are supervised by prompt-level success, that is, the chance that the model solves the prompt correctly, while post-reasoning estimates are supervised by whether the generated answer is correct. On BigMathDigits with the 7B model, this unified protocol reduces Expected Calibration Error (ECE) by 52.5% relative to the strongest single-confidence baseline and achieves the best Brier score and AUROC.
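The three metrics named above can be sketched in a few lines. This is a minimal illustration and not the paper's evaluation code: the equal-width 10-bin ECE, the tie handling in AUROC, and all function names are assumptions made here.

```python
from typing import Sequence


def expected_calibration_error(conf: Sequence[float], correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|.

    The bin count and equal-width binning are assumptions, not the paper's settings.
    """
    n = len(conf)
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, y in zip(conf, correct):
        i = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        conf_sum[i] += c
        hit_sum[i] += int(y)
        count[i] += 1
    # (count/n) * |hits/count - conf_sum/count| simplifies to |hits - conf_sum| / n
    return sum(abs(hit_sum[i] - conf_sum[i]) / n for i in range(n_bins) if count[i])


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct item receives higher confidence than an incorrect one.

    Ties count as 0.5. Quadratic in the sample size, which is fine for a sketch.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect item")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

ECE and Brier score reward confidences that match observed accuracy, whereas AUROC only measures how well confidence ranks correct answers above incorrect ones, which is why the three are reported together.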

What carries the argument

The position-target alignment mechanism, which matches the supervision target for each confidence estimate to the information state at the time it is elicited.
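A minimal sketch of how the two targets could be paired with the two confidence positions. It assumes that prompt-level success is estimated as the fraction of correct rollouts sampled for the same prompt and that each estimate is scored with a squared-error (Brier-style) loss; neither choice is confirmed by this page, and `Rollout`, `prompt_success_rate`, and `aligned_losses` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rollout:
    """One sampled attempt at a prompt (hypothetical container)."""
    pre_conf: float   # confidence stated before reasoning
    post_conf: float  # confidence stated after the answer
    correct: bool     # whether this rollout's final answer was graded correct


def prompt_success_rate(rollouts: List[Rollout]) -> float:
    """Assumed estimate of prompt-level success: share of correct rollouts."""
    return sum(r.correct for r in rollouts) / len(rollouts)


def aligned_losses(rollouts: List[Rollout]) -> Tuple[float, float]:
    """Score each confidence against the target matched to its position.

    Pre-reasoning confidence is compared with the prompt-level success rate;
    post-reasoning confidence is compared with that rollout's own correctness.
    """
    p_hat = prompt_success_rate(rollouts)
    n = len(rollouts)
    pre_loss = sum((r.pre_conf - p_hat) ** 2 for r in rollouts) / n
    post_loss = sum((r.post_conf - float(r.correct)) ** 2 for r in rollouts) / n
    return pre_loss, post_loss
```

For a prompt the model solves half the time, a pre-reasoning confidence of 0.5 incurs no penalty under this scheme, while the post-reasoning confidence is still pushed toward 1 on the correct rollouts and toward 0 on the incorrect ones.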

If this is right

  • CALIBER cuts Expected Calibration Error on in-distribution math (BigMathDigits, 7B model) by 52.5% relative to the strongest single-confidence baseline while staying within 2.1 points of the best accuracy.
  • It achieves the best ECE and Brier score on the out-of-distribution benchmarks GPQA and TriviaQA and remains competitive on SimpleQA.
  • In ablations, position-target alignment consistently reduces calibration error on all out-of-distribution benchmarks.
  • On the larger 30B model, it achieves the best ECE on BigMathDigits while remaining competitive in Brier score and AUROC.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results suggest that ignoring the change in available information during reasoning limits calibration in standard approaches.
  • Models may benefit from always producing two distinct confidence scores rather than one.
  • Deployment in settings with shifting data distributions could see reliability gains from this two-stage supervision.

Load-bearing premise

That the appropriate target for supervising confidence is prompt-level success before reasoning and answer-level correctness after reasoning.
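One way to write this premise down, as a sketch in notation assumed here rather than taken from the paper: let $x$ be a prompt, $y \sim \pi_\theta(\cdot \mid x)$ a sampled reasoning trace with its answer, and $z(x, y) \in \{0, 1\}$ the graded correctness of that answer.

```latex
\begin{align}
t_{\text{pre}}(x) &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ z(x, y) \right]
                   = \Pr\left( z = 1 \mid x \right), \\
t_{\text{post}}(x, y) &= z(x, y).
\end{align}
```

Under this reading the pre-reasoning target is a probability in $[0, 1]$ that depends only on the prompt, while the post-reasoning target is a binary label tied to one realized answer.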

What would settle it

Rerunning the same experiments with both estimates supervised by the same target (for example, always answer correctness) and finding that matched targets yield no reduction, or even an increase, in calibration error on BigMathDigits relative to that variant.

Figures

Figures reproduced from arXiv: 2606.24281 by Beyza Ermis, Conor Finlay, Joshua Kurien, Marzieh Fadaee, Saurabh Dash.

Figure 1: Confidence positions and calibration targets for the evaluated methods.
Figure 2: ECE–AUROC tradeoff on BigMathDigits. Lower ECE and higher AUROC are better.
Figure 3: Reliability diagrams for BigMathDigits, omitting bins with fewer than 10 samples. Curves closer to the …
Figure 4: Reliability diagrams on TriviaQA, omitting bins with fewer than 10 samples. Curves closer to the diagonal …
Figure 5: Average OOD ECE–AUROC tradeoff across GPQA, TriviaQA, and SimpleQA. Lower ECE and higher …
Figure 6: Reliability diagrams on GPQA. [Extracted plot text: panels (a) 7B model and (b) 30B model; axes are predicted confidence vs. empirical accuracy; methods shown are COREA-lite, RLCR, CALIBER, and DCPO-lite.]
Figure 7: Reliability diagrams on SimpleQA.
Figure 8: Pareto plots on TriviaQA. [Extracted plot text: panels (a) 7B model and (b) 30B model; axes are ECE (lower is better) vs. AUROC (higher is better); methods shown are CoCA-lite, CALIBER, RLCR, DCPO-lite, and COREA-lite.] …
Figure 9: Pareto plots on GPQA.
Figure 10: Pareto plots on SimpleQA.
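The reliability diagrams listed above plot empirical accuracy against predicted confidence per bin and omit bins with fewer than 10 samples. A minimal sketch of that binning step follows; the equal-width bins, the bin count, and the function name `reliability_bins` are assumptions, and only the 10-sample threshold comes from the captions.

```python
from typing import List, Sequence, Tuple


def reliability_bins(conf: Sequence[float], correct: Sequence[int],
                     n_bins: int = 10,
                     min_count: int = 10) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, empirical accuracy, count) for each kept bin.

    Bins are equal-width over [0, 1] (an assumption); bins holding fewer than
    min_count samples are dropped, as in the figure captions.
    """
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, y in zip(conf, correct):
        i = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        conf_sum[i] += c
        hit_sum[i] += int(y)
        count[i] += 1
    return [
        (conf_sum[i] / count[i], hit_sum[i] / count[i], count[i])
        for i in range(n_bins)
        if count[i] >= min_count
    ]
```

A perfectly calibrated model yields points on the diagonal, where mean confidence equals empirical accuracy in every kept bin.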
Original abstract

Reasoning language models are increasingly asked not only to answer difficult questions, but also to estimate their likelihood of success. Existing methods typically elicit confidence only once: either before thinking or after answering. We argue that confidence in reasoning models is state-dependent: before thinking, confidence should estimate the chance of the model correctly solving the prompt, while after thinking it should predict whether the realized answer is likely to be correct. This distinction determines the appropriate supervision target: prompt-level success should supervise confidence estimates made after seeing the prompt, while individual answer-level correctness should supervise confidence estimates made after answering. We introduce CALIBER (Calibration Before and After Reasoning), which elicits both estimates and supervises each with the target matched to its information state. Under this unified protocol, CALIBER reduces Expected Calibration Error (ECE) by 52.5% over the strongest single-confidence baseline on BigMathDigits for the 7B model, while achieving the best Brier score and AUROC, and remains within 2.1 points of the best accuracy. Further, on a larger 30B model, CALIBER achieves the best ECE on BigMathDigits while remaining competitive in Brier score and AUROC. Out of distribution, it achieves the best ECE and Brier score on GPQA and TriviaQA, and remains competitive on SimpleQA. Ablations further show that this position-target alignment is most beneficial under distribution shift where it consistently reduces calibration error across all out-of-distribution benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces CALIBER, which elicits separate pre-reasoning and post-reasoning confidence estimates in language models and supervises the former with prompt-level success probability and the latter with answer-level correctness. It reports that this matched-supervision protocol yields a 52.5% reduction in Expected Calibration Error over the strongest single-confidence baseline on BigMathDigits (7B model), best-in-class Brier score and AUROC, accuracy within 2.1 points of the best, and strong OOD performance on GPQA, TriviaQA, and SimpleQA, with ablations indicating the alignment is especially helpful under distribution shift.

Significance. If the reported gains prove robust, the work supplies a concrete, state-aware protocol for confidence elicitation that improves calibration without sacrificing accuracy and is particularly effective out-of-distribution. The explicit ablations linking the position-target alignment to OOD gains constitute a reproducible empirical contribution that future reasoning-model calibration studies can directly build upon.

minor comments (3)
  1. The abstract and §4 should explicitly state the number of random seeds and the precise data-split protocol used for the BigMathDigits, GPQA, TriviaQA, and SimpleQA evaluations so that the 52.5% ECE figure can be reproduced.
  2. Figure 3 (or the corresponding ablation table) would benefit from error bars or a statistical test comparing the aligned vs. misaligned supervision variants under each OOD shift.
  3. Notation for the two supervision targets (prompt-level success probability vs. answer-level correctness) should be introduced once in §3 with a short equation or pseudocode block to avoid repeated prose definitions later.
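The error bars requested in comment 2 could be obtained with a paired bootstrap over evaluation examples. The sketch below is one possible approach, not something the paper is known to use: the percentile interval, the 10-bin ECE, the default resample count, and the function names are all assumptions.

```python
import random
from typing import Sequence, Tuple


def ece(conf: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Equal-width binned ECE (bin count is an assumption)."""
    n = len(conf)
    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for c, y in zip(conf, correct):
        i = min(int(c * n_bins), n_bins - 1)
        conf_sum[i] += c
        hit_sum[i] += int(y)
        count[i] += 1
    return sum(abs(hit_sum[i] - conf_sum[i]) / n for i in range(n_bins) if count[i])


def paired_bootstrap_ece_diff(conf_a: Sequence[float], correct_a: Sequence[int],
                              conf_b: Sequence[float], correct_b: Sequence[int],
                              n_boot: int = 1000, alpha: float = 0.05,
                              seed: int = 0) -> Tuple[float, float]:
    """Percentile interval for ECE(a) - ECE(b) on the same evaluation examples.

    Each resample draws example indices with replacement and applies them to
    both variants, so per-example difficulty is shared between the two.
    """
    rng = random.Random(seed)
    n = len(conf_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ece_a = ece([conf_a[i] for i in idx], [correct_a[i] for i in idx])
        ece_b = ece([conf_b[i] for i in idx], [correct_b[i] for i in idx])
        diffs.append(ece_a - ece_b)
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

If the interval for the misaligned-minus-aligned difference excludes zero on a given benchmark, the ablation gap on that benchmark is unlikely to be resampling noise alone.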

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of CALIBER, the recognition of its empirical contributions on position-target alignment, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; empirical protocol with benchmark results

The paper introduces CALIBER as a protocol for eliciting and supervising two state-dependent confidence estimates (pre- and post-reasoning) with matched targets (prompt-level success probability vs. answer-level correctness). All reported gains are empirical performance metrics (ECE reductions, Brier scores, AUROC) measured on fixed external benchmarks such as BigMathDigits, GPQA, TriviaQA and SimpleQA. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements to a definitional identity or to a quantity computed from the same fitted values. The central claim therefore rests on observable benchmark outcomes rather than on any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are described; the contribution is an empirical training protocol.

pith-pipeline@v0.9.1-grok · 5809 in / 1144 out tokens · 25306 ms · 2026-06-26T00:17:31.769155+00:00 · methodology
