Confidence Calibration in Large Language Models

Daniel BenShushan; Don A. Moore; Jacob Bien; Noam Michael

Review: 3 major objections · 4 minor · cited by 1

Like people, large language models are overconfident on hard tasks and underconfident on easy ones.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.5

2026-07-13 13:22 UTC pith:7FYDCXWR


arXiv 2605.23909 v1 · pith:7FYDCXWR · submitted 2026-04-03 · cs.AI cs.LG


keywords: confidence calibration · large language models · hard-easy effect · overconfidence · LifeEval · expected calibration error · reasoning models

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper asks whether commercially available large language models know how sure they should be of their own answers. Across eleven models and six task families, the models report confidence that exceeds actual accuracy on average—about nine percentage points of overconfidence. That average, however, hides a strong hard-easy pattern: overconfidence is largest on difficult items, while easy items produce substantial underconfidence. To isolate difficulty from other confounds, the authors introduce LifeEval, a lifespan-prediction task whose difficulty can be dialed continuously using actuarial life tables. The same pattern appears there, and reasoning-oriented models produce more graded confidence reports than ordinary chat models. The practical stakes are straightforward: if a model cannot dial its stated certainty down when the problem gets harder, users cannot safely treat its confidence as a reliability signal.

Core claim

Current large language models are, on average, overconfident: stated confidence exceeds accuracy. This overconfidence is moderated by a hard-easy effect—overconfidence grows with task difficulty and flips into underconfidence on easy tasks. LifeEval, a new continuous-difficulty Bayesian-style task grounded in actuarial probabilities, isolates the effect while holding other task features fixed.

What carries the argument

LifeEval: a lifespan-prediction probe that varies sex, minimum age, and accuracy radius so that difficulty is defined as 1 minus the Maximum Achievable Score (the actuarial probability mass inside the optimal radius). Model point estimates and stated confidences are scored against the true conditional probabilities from Social Security period life tables.
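
To make the difficulty definition concrete, here is a minimal Python sketch of how a Maximum Achievable Score could be computed from a period life table. Everything in it is an assumption for illustration: the function names, the inclusive window of guess ± radius, the restriction of guesses to ages at or above the minimum age, and the toy Gompertz-style death probabilities, which stand in for the SSA tables and are not real actuarial data. The paper's own Eqs. 2-4 may differ in detail.

```python
from typing import Dict, List, Tuple

# Hypothetical annual death probabilities q[x] for ages 0..119.
# A toy Gompertz-like curve, NOT taken from the SSA period life tables.
TOY_Q: List[float] = [min(1.0, 0.0001 * 1.09 ** x) for x in range(120)]
TOY_Q[-1] = 1.0  # force the table to terminate


def death_age_distribution(q: List[float], min_age: int) -> Dict[int, float]:
    """P(age at death = x | alive at min_age) for every x >= min_age."""
    dist: Dict[int, float] = {}
    alive = 1.0  # probability of still being alive at the start of age x
    for x in range(min_age, len(q)):
        dist[x] = alive * q[x]
        alive *= 1.0 - q[x]
    return dist


def window_mass(dist: Dict[int, float], guess: int, radius: int) -> float:
    """Probability that the age at death falls within guess +/- radius."""
    return sum(p for x, p in dist.items() if abs(x - guess) <= radius)


def max_achievable_score(dist: Dict[int, float], radius: int) -> Tuple[int, float]:
    """Best point estimate and the probability mass it captures (the MAS)."""
    best = max(dist, key=lambda k: window_mass(dist, k, radius))
    return best, window_mass(dist, best, radius)


if __name__ == "__main__":
    dist = death_age_distribution(TOY_Q, min_age=30)
    for radius in (1, 5, 10, 20):
        guess, mas = max_achievable_score(dist, radius)
        print(f"radius={radius:2d} best guess={guess} "
              f"MAS={mas:.3f} difficulty={1 - mas:.3f}")
```

Under these assumptions, widening the radius can only raise the MAS, so difficulty (1 minus MAS) falls monotonically as the radius grows, which is the property the review relies on.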

Load-bearing premise

The claim that LifeEval cleanly measures difficulty rests on the premise that its actuarial ceiling is free of training-data contamination and other confounds, so that changes in radius truly isolate how models respond to hardness.

What would settle it

Re-run the same models on a LifeEval-style task whose ground-truth distribution is guaranteed never to have appeared in training; if the hard-easy slope of overconfidence versus Maximum Achievable Score disappears or reverses, the isolation claim fails.


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Solid preregistered multi-model demo of average overconfidence plus a classic hard-easy effect in LLMs, with a useful new continuous-difficulty probe (LifeEval); contamination is real but does not erase the pattern.


The punchline is straightforward: across eleven models and six tasks, LLMs show the same average overconfidence and hard-easy pattern that humans do, with stated confidence moving less than accuracy as difficulty changes. LifeEval is the real addition: an actuarially grounded, continuous-difficulty task that lets the authors hold content fixed while dialing radius, age, and sex.

What they do well is the design: preregistration, explicit ECE and overconfidence formulas, chain-of-thought plus JSON prompts held constant, and a full contamination audit in Appendix G that re-runs the key LifeEval plots on the no-evidence subset. Reasoning models look better calibrated than chat models and round their confidence to multiples of 5% less often; that comparison is clean and useful. The hard-easy regression coefficients stay positive even after the audit, and the other five datasets (BoolQ, SciQ, LSAT-AR, SAT-EN, HaluEval) already show the same qualitative pattern without any actuarial tables.

The soft spot is exactly the one the stress-test flags, but it is not fatal. The strongest reasoning models show high SSA-table contamination (50–71% strong evidence), and the clean subset shrinks their N to dozens of items. So LifeEval’s claim to isolate pure difficulty sensitivity is weaker for those models than the abstract implies; some of the overconfidence on low-MAS items may simply reflect a failure to compute the correct conditional probabilities. Still, the effect survives on the uncontaminated responses and on the non-LifeEval tasks, so the broader claim holds. Minor preregistration deviations and incomplete public artifacts are ordinary for this genre.

This is for people working on LLM reliability, uncertainty elicitation, or human-AI trust. The math and citation pattern look solid; no circularity. I would send it to peer review—worth a serious referee, even if they push for fuller code/data release and tighter contamination controls. Engage with it.

Referee Report

3 major / 4 minor

Summary. The paper reports a preregistered multi-model study of verbal confidence calibration for 11 LLMs (5 reasoning, 6 chat) on six English question sets. Aggregate results show overconfidence (mean stated confidence ~88% vs. accuracy ~79%), moderated by a hard-easy effect: overconfidence is largest on difficult items (LSAT-AR, small-radius LifeEval) while easy items (SciQ, SAT-EN, large-radius LifeEval) produce underconfidence. Reasoning models are better calibrated, less prone to 5% rounding, and more correlated with true probabilities. The novel LifeEval task elicits point estimates of age-at-death plus radius-specific confidence, scored against SSA actuarial probabilities (Eqs. 2-4) to yield a continuous exogenous difficulty measure (1-MAS). ECE (Eq. 9) and overconfidence (Eq. 10) are computed throughout; Appendix G audits SSA contamination and re-runs key plots on the clean subset.
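
The "5% rounding" tendency mentioned in the summary can be quantified as the share of stated confidences that land exactly on a multiple of five percentage points. The snippet below is a hypothetical sketch of such a measure; the function name and the sample values are made up for illustration and are not drawn from the paper's data or code.

```python
from typing import Sequence


def share_rounded_to_5(confidences_pct: Sequence[float], tol: float = 1e-9) -> float:
    """Fraction of confidences (in percent) that are multiples of 5."""
    hits = sum(
        1 for c in confidences_pct
        if abs(c / 5 - round(c / 5)) < tol
    )
    return hits / len(confidences_pct)


if __name__ == "__main__":
    # Hypothetical stated confidences, in percent.
    sample = [90, 95, 87.5, 72, 100]
    print(f"share on multiples of 5: {share_rounded_to_5(sample):.2f}")
```

A higher share suggests coarser, more rounded confidence reports; a lower share suggests more graded ones.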

Significance. If robust, the work is a clear contribution: it documents that contemporary LLMs exhibit the classic human hard-easy pattern, with direct implications for user trust, hallucination risk, and possible RLHF-induced overconfidence. Strengths that raise the paper above typical LLM evaluation studies include the preregistration, explicit formulas for ECE and overconfidence, systematic comparison of reasoning vs. chat models, the continuous-difficulty LifeEval construction grounded in public actuarial tables, and the contamination audit that re-analyzes the no-evidence subset. These elements make the central empirical claims falsifiable and relatively careful. The discussion of reflection-style debiasing and the three forms of overconfidence also usefully connects the AI results to the psychological literature.

major comments (3)

[Appendix G, Tables 4-5, Figure 8] Contamination rates for the strongest reasoning models are high (DeepSeek-R1 71.5%, Gemini-2.5-Pro 71%, GPT-o3 50.1% strong evidence). The no_evidence subset therefore shrinks to N=35-45 for those models. While hard-easy coefficients remain positive, the tiny samples and the selection process itself (clean responses are those that do not cite table values) leave open the possibility that overconfidence on low-MAS items simply reflects inability to compute the actuarial p(k,r|a,s) of Eq. 3 rather than the claimed noisy-confidence mechanism. This undercuts the claim that LifeEval isolates a pure, exogenous difficulty effect (Section 4.5). Either supply CIs/power calculations on the clean subset or substantially qualify the isolation language in the abstract and Discussion.
[Table 2, Section 7] The hard-easy regression coefficients (overconfidence on 1-MAS) are presented as point estimates without standard errors, confidence intervals, or any inferential statistics. Reasoning-model averages are modest (0.168) while chat-model averages are large (0.732); without uncertainty quantification it is impossible to judge whether the 'powerful' hard-easy effect is reliable for the better-calibrated models that matter most. Bootstrap or mixed-effects intervals are needed before the central claim can be taken as established.
[Appendix A, Section 4.5] LifeEval scoring was changed after preregistration from a binary 'within-radius' indicator to the continuous actuarial probability (Eq. 3). The continuous metric is preferable for calibration analysis, yet it is a material deviation on the paper's flagship task. Sensitivity of the hard-easy slopes and ECE numbers to the original binary scoring should be reported as a robustness check.
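
The uncertainty quantification requested in the second major comment could take the form of a percentile bootstrap over items. The sketch below is a minimal, dependency-free Python illustration on synthetic data; the function names, the sample size of 40 (chosen only to echo the small clean subsets), the true slope, and the noise level are all hypothetical and say nothing about the paper's actual results.

```python
import random
from typing import List, Sequence, Tuple


def ols_slope(x: Sequence[float], y: Sequence[float]) -> float:
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slope_ci(x: Sequence[float], y: Sequence[float],
                       n_boot: int = 2000, alpha: float = 0.05,
                       seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the slope, resampling items."""
    rng = random.Random(seed)
    n = len(x)
    slopes: List[float] = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):
            continue  # skip degenerate resamples with no spread in x
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic items: difficulty is 1 - MAS, overconfidence rises with it.
_rng = random.Random(42)
difficulty = [i / 39 for i in range(40)]
overconf = [0.5 * d - 0.1 + _rng.gauss(0, 0.05) for d in difficulty]

if __name__ == "__main__":
    slope = ols_slope(difficulty, overconf)
    lower, upper = bootstrap_slope_ci(difficulty, overconf)
    print(f"slope={slope:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

An interval that excludes zero would support a reliable hard-easy slope for a given model; a mixed-effects model would be the natural alternative when items are nested within models.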

minor comments (4)

[Introduction, p. 2] Typographical error: 'based on of quantitative elements'.
[Figures 2-4, 8-12] Figures 2-4 and 8-12 lack error bars or confidence bands; given the varying N across models and contamination strata, visual uncertainty would help readers.
[Table 1, Section 4] The HaluEval answer field is listed as N/A; a one-sentence clarification that the model only supplies confidence (not an answer) would improve readability.
[Section 4.5 / Table 2] The Maximum Achievable Score (MAS) notation is used in the LifeEval results before it is formally defined; move the definition earlier or add a short glossary.

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of stated confidence against external accuracy or actuarial ground truth, with no fitted parameters or self-referential definitions in the metrics.


The paper's central claims (average overconfidence of ~9%, moderated by a hard-easy effect) rest on direct, non-circular comparisons: stated confidence (prompted numerical probabilities, normalized if needed via Eq. 5) versus observed accuracy (Eq. 6) or actuarial success probability (Eqs. 1-4 from public SSA Period Life Tables). Overconfidence is defined as conf(Q) - acc(Q) (Eq. 10); ECE is the standard binned absolute difference (Eq. 9). LifeEval difficulty is 1 - MAS, where MAS is the maximum probability mass inside the optimal radius under the external life tables (Figure 5, Eq. 3); this is independent of any model output and is not estimated from the LLMs. The hard-easy regression simply correlates this exogenous difficulty with the already-computed overconfidence. No parameters are fitted to the evaluation data and then re-used as 'predictions'; no uniqueness theorems or ansatze are imported via self-citation; related-work citations are to external psychology and calibration literature. Contamination screening (Appendix G) is a post-hoc robustness check, not part of the metric definitions. The derivation chain is therefore self-contained against external benchmarks and exhibits zero circular reduction.
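
The two metrics described above can be sketched in a few lines of Python. This is an illustrative reading, not the paper's code: overconfidence is mean stated confidence minus mean score, and ECE is the item-weighted absolute gap between mean confidence and mean score within each bin. The binning (ten equal-width bins plus a separate bin for a confidence of exactly 1.0) follows the description in the free-parameter ledger; the function names and the toy inputs are assumptions.

```python
from typing import List, Sequence, Tuple


def overconfidence(conf: Sequence[float], score: Sequence[float]) -> float:
    """Mean stated confidence minus mean score (positive = overconfident)."""
    return sum(conf) / len(conf) - sum(score) / len(score)


def expected_calibration_error(conf: Sequence[float],
                               score: Sequence[float],
                               n_bins: int = 10) -> float:
    """Item-weighted mean |confidence - score| across confidence bins.

    Bins are [0, 0.1), ..., [0.9, 1.0) plus a dedicated bin for conf == 1.0.
    Confidences are assumed to lie in [0, 1].
    """
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins + 1)]
    for c, s in zip(conf, score):
        idx = n_bins if c >= 1.0 else int(c * n_bins)
        bins[idx].append((c, s))

    n = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        mean_score = sum(s for _, s in members) / len(members)
        ece += len(members) / n * abs(mean_conf - mean_score)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data: four items with stated confidence and 0/1 score.
    conf = [0.9, 0.9, 1.0, 0.6]
    score = [1, 0, 1, 1]
    print(f"overconfidence = {overconfidence(conf, score):+.2f}")
    print(f"ECE            = {expected_calibration_error(conf, score):.2f}")
```

The two numbers answer different questions: overconfidence is signed and can cancel across easy and hard items, whereas ECE is unsigned and accumulates miscalibration in either direction. For LifeEval the score would be the actuarial probability rather than a 0/1 outcome, which the same functions accept.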

Axiom & Free-Parameter Ledger

2 free parameters · 3 axioms · 1 invented entity

Empirical paper; load-bearing premises are standard statistical definitions plus the modeling choices that turn SSA tables into a difficulty continuum and that treat verbalized JSON confidence as the model's subjective probability.

free parameters (2)

ECE bin edges
Ten equal-width bins plus a dedicated 1.0 bin; choice affects reported ECE magnitude though not the qualitative hard-easy pattern.
LifeEval radii set {1,5,10,20}
Hand-chosen discrete difficulty levels; continuous difficulty is recovered via MAS but the discrete set still structures the main figures.

axioms (3)

[domain assumption] Stated numerical confidence in the JSON field is a faithful report of the model's subjective probability of correctness.
Core measurement assumption used for every ECE and overconfidence calculation (Section 5).
[domain assumption] SSA Period Life Tables supply the true conditional probabilities p(k,r|a,s) against which LifeEval confidence is scored.
Ground-truth definition for LifeEval accuracy and MAS (Eqs. 2-4).
[ad hoc to paper] Maximum Achievable Score (MAS) is a pure exogenous difficulty measure that models can detect.
Defines the continuous difficulty axis used for the hard-easy regression (Section 4.5).

invented entities (1)

LifeEval [independent evidence]
purpose: Provide continuous, actuarially grounded difficulty while holding task format fixed, enabling clean measurement of the hard-easy effect.
New benchmark constructed for this paper; independent evidence is the public SSA tables, but the specific question format and scoring are paper-specific.

pith-pipeline@v1.1.0-grok45 · 22813 in / 2324 out tokens · 22738 ms · 2026-07-13T13:22:22.096438+00:00 · methodology


Original abstract

We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study show that the current crop of LLMs are, like people, too sure they are right: confidence exceeds accuracy, on average. Importantly, however, this tendency is moderated by a powerful hard-easy effect, wherein overconfidence is greatest on difficult tests; by contrast, easy tests actually show substantial underconfidence. We develop LifeEval, a test for evaluating model calibration across levels of difficulty.

Figures

Figures reproduced from arXiv: 2605.23909 by Daniel BenShushan, Don A. Moore, Jacob Bien, Noam Michael.

**Figure 2.** Aggregate calibration plots for each question set, showing accuracy conditional on confidence.

**Figure 3.** Overconfidence by question set and model.

**Figure 4.** Overconfidence as a function of model and radius; ...

**Figure 5.** LifeEval allows for the monotonic decrease in task difficulty given age, sex, and radius. As the Maximum ...

**Figure 6.** Best age to guess as a function of minimum age, sex, and radii. We see that the optimal age is constant until a ...

**Figure 7.** Comparison between the calibration error of stated confidence versus token probability for models when ...

**Figure 8.** Overconfidence as a function of model and radius; difficulty decreases with larger accuracy radius. Specifically ...

**Figure 9.** Side by side plots of all models (rows) and all question sets (columns). GPT-4o, Llama-3.1-70B, and ...

**Figure 10.** Each plot displays the relationship between Stated Confidence and actual Score for various model families.

**Figure 11.** Accuracy for each model on all question sets. SAT-EN and SciQ had the highest performance while ...

**Figure 12.** ECE for each model on all question sets. Some question sets like LSAT-AR and HaluEval saw high ...


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Neuro-Symbolic AI for LEED compliance: Document-Centric Benchmarking, Deterministic Numeric Checking, and When Multimodal Hurts
cs.AI 2026-07 conditional novelty 6.0

A 4B local LLM hits 67.3% on LEED v4.1 credit screening; a deterministic numeric checker lifts EA-p2 from 50% to 100%, but the full neuro-symbolic pipeline trails at 61.6%.

Reference graph

Works this paper leans on

26 extracted references · 17 linked inside Pith · cited by 1 Pith paper

