
arXiv: 2503.18562 · v1 · submitted 2025-03-24 · cs.CL · cs.AI · cs.HC · cs.LG

Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

Pith reviewed 2026-05-22 23:15 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.HC · cs.LG
keywords Large Language Models · Confidence Elicitation · Gastroenterology · Uncertainty Quantification · Brier Score · Overconfidence · Model Calibration · Artificial Intelligence

The pith

Large language models consistently overestimate their certainty when answering gastroenterology board questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how well different large language models judge their own accuracy by asking them to report confidence levels on 300 gastroenterology board-style questions. The top models reach Brier scores of 0.15-0.2 and an AUROC of 0.6 (0.5 is chance level), so their confidence carries some signal, but they remain overconfident overall. Newer models answer more questions correctly, yet the overconfidence pattern holds across commercial, open-source, and quantized versions. This matters because poorly calibrated confidence is a barrier to safe use of these models in medical settings, where mistaken certainty could affect decisions.

Core claim

When large language models are prompted to give self-reported confidence along with answers to 300 gastroenterology board-style questions, the highest-performing ones (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieve Brier scores of 0.15-0.2 and AUROC of 0.6 while all models, regardless of type, display a consistent tendency toward overconfidence.

What carries the argument

Elicitation of self-reported confidence probabilities from LLMs on multiple-choice gastroenterology questions, evaluated for calibration with Brier score and AUROC.
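
As a rough illustration of the two metrics named above, the following sketch computes a Brier score and an AUROC from per-question confidences and correctness labels. The function names and the toy data are assumptions made for this example; the paper's own implementation is not described here. A lower Brier score is better, and an AUROC of 0.5 means confidence does not separate correct from incorrect answers at all.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0-1) and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences, correct):
    """Chance that a correct answer received higher confidence than an
    incorrect one, counting ties as half (pairwise definition of AUROC)."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not taken from the paper).
conf = [0.9, 0.8, 0.9, 0.6, 0.95]  # self-reported confidence per question
hit = [1, 1, 0, 0, 1]              # 1 = answer was correct, 0 = incorrect

print(brier_score(conf, hit))
print(auroc(conf, hit))
```

The pairwise AUROC is quadratic in the number of questions, which is acceptable for a 300-question benchmark; a rank-based formulation would be the usual choice for larger sets.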

If this is right

  • Newer models improve answer accuracy yet still exhibit the same overconfidence in self-reported certainty.
  • The calibration problem appears across commercial, open-source, and quantized model categories.
  • Uncertainty estimation remains a core barrier to safe deployment of LLMs in gastroenterology practice.
  • All tested models require additional mechanisms to align reported confidence with actual performance.
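
One simple way to quantify the gap between reported confidence and actual performance mentioned in the points above is to subtract accuracy from mean confidence, and to tabulate accuracy per confidence bin. This is a minimal sketch under assumed names and toy data; it is not the paper's analysis code, and the bin count is an arbitrary choice.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy.
    Positive values indicate overconfidence."""
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_conf - accuracy


def reliability_bins(confidences, correct, n_bins=5):
    """Group answers into equal-width confidence bins and return
    (mean confidence, accuracy, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    rows = []
    for items in bins:
        if items:
            mean_c = sum(c for c, _ in items) / len(items)
            acc = sum(y for _, y in items) / len(items)
            rows.append((mean_c, acc, len(items)))
    return rows


# Toy data, invented for illustration only.
conf = [0.9, 0.9, 0.8, 1.0]
hit = [1, 0, 1, 0]

print(overconfidence_gap(conf, hit))
print(reliability_bins(conf, hit))
```

A well-calibrated model would show mean confidence close to accuracy in every bin; a model that reports high confidence almost everywhere concentrates its answers in the top bins regardless of correctness.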

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training focused only on question accuracy may not teach models to recognize the boundaries of their knowledge in medical domains.
  • Physicians using these models would need independent checks rather than trusting self-reported confidence levels.
  • Similar overconfidence patterns could appear in other medical specialties that rely on board-style question formats.

Load-bearing premise

Self-reported confidence on these board-style questions serves as a valid proxy for the models' uncertainty in actual gastroenterology practice, and the 300 questions adequately represent the domain.

What would settle it

A direct comparison of the same models' self-reported confidence and accuracy on real clinical gastroenterology cases versus the board questions would show whether the overconfidence pattern holds outside standardized testing.

Figures

Figures reproduced from arXiv: 2503.18562 by Ali Soroush, Girish Nadkarni, Nariman Naderi, Peter Lewis, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf.

Figure 1: Summary illustration of pipeline for confidence score extraction from raw textual
Figure 2: Average accuracy versus average confidence scores for LLMs with more than 150
Figure 3: Left panel: Overall distribution of self-reported confidence scores and mean response accuracy (stars) for each model. Right panel: Distribution of self-reported confidence scores for each model stratified by response accuracy
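
Figure 1 refers to extracting confidence scores from raw model text. The sketch below shows one plausible way such a step could work; the response format ("Answer: B" and "Confidence: 85%"), the regular expressions, and the function name are all assumptions for illustration, since the paper's actual pipeline and prompt are not reproduced on this page.

```python
import re

# Hypothetical response format: "Answer: B" and "Confidence: 85%" (or 0.85).
ANSWER_RE = re.compile(r"answer\s*[:\-]?\s*\(?([A-E])\b", re.IGNORECASE)
CONF_RE = re.compile(r"confidence\s*[:\-]?\s*(\d+(?:\.\d+)?)\s*(%?)", re.IGNORECASE)


def parse_response(text):
    """Return (answer_letter, confidence in 0-1), or None if either is missing
    or the confidence falls outside the valid range."""
    ans = ANSWER_RE.search(text)
    conf = CONF_RE.search(text)
    if not ans or not conf:
        return None
    value = float(conf.group(1))
    # Treat values with a percent sign, or values above 1, as percentages.
    if conf.group(2) == "%" or value > 1:
        value /= 100
    if not 0 <= value <= 1:
        return None
    return ans.group(1).upper(), value


print(parse_response("Answer: B\nConfidence: 85%"))
print(parse_response("I am not sure about this one."))
```

Responses that fail to parse are returned as `None` rather than guessed, so that they can be counted and reported separately instead of silently skewing the calibration metrics.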

This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates self-reported confidence across commercial, open-source, and quantized LLMs (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, Qwen) on 300 gastroenterology board-style questions. Top models (GPT-o1 preview, GPT-4o, Claude-3.5-Sonnet) achieve Brier scores of 0.15-0.2 and AUROC of ~0.6; all models exhibit overconfidence. The work concludes that uncertainty estimation poses a significant challenge to safe LLM use in healthcare.

Significance. If the reported calibration metrics hold, the study supplies a concrete multi-model benchmark on gastroenterology MCQs, including quantized variants, and documents a consistent overconfidence pattern. This is a useful empirical contribution for the subfield of medical LLM evaluation. The broader claim about healthcare safety, however, rests on an untested transfer from closed-ended exam questions to clinical practice.

major comments (2)
  1. [Abstract and Conclusion] The central safety implication ('Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare') is load-bearing yet unsupported: all quantitative results (Brier 0.15-0.2, AUROC 0.6, overconfidence) derive exclusively from 300 closed-ended board MCQs with unambiguous public answers. No data, ablation, or discussion tests whether the same miscalibration appears under open-ended differentials, missing data, time pressure, or patient-specific factors.
  2. [Methods] Methods description (question selection, exact confidence-elicitation prompt, statistical procedure for Brier/AUROC, error estimation, and controls for output stochasticity) is absent from the reported results. Without these, the numeric claims cannot be verified or reproduced, directly affecting soundness of the headline metrics.
minor comments (1)
  1. [Abstract] Abstract should explicitly state the total number of models and families evaluated and note the board-question limitation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarification and strengthening. We address each major comment below and commit to revisions where appropriate.

  1. Referee: [Abstract and Conclusion] The central safety implication ('Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare') is load-bearing yet unsupported: all quantitative results (Brier 0.15-0.2, AUROC 0.6, overconfidence) derive exclusively from 300 closed-ended board MCQs with unambiguous public answers. No data, ablation, or discussion tests whether the same miscalibration appears under open-ended differentials, missing data, time pressure, or patient-specific factors.

    Authors: We agree that the quantitative findings are confined to closed-ended MCQs and that no direct evidence is presented for open-ended clinical scenarios. The safety statement represents an interpretive extension rather than a tested claim. In revision we will qualify the abstract and conclusion to state that the observed overconfidence in this controlled MCQ setting suggests uncertainty estimation remains challenging, while explicitly noting the absence of data on open-ended, time-pressured, or patient-specific contexts. A dedicated limitations paragraph will be added discussing generalizability.

    revision: yes

  2. Referee: [Methods] Methods description (question selection, exact confidence-elicitation prompt, statistical procedure for Brier/AUROC, error estimation, and controls for output stochasticity) is absent from the reported results. Without these, the numeric claims cannot be verified or reproduced, directly affecting soundness of the headline metrics.

    Authors: We acknowledge the methods section was insufficiently detailed. The revised manuscript will expand the Methods to include: (1) the exact source and curation process for the 300 gastroenterology board-style questions; (2) the verbatim prompt template used to elicit self-reported confidence; (3) the precise formulas and implementation for Brier score and AUROC; (4) the error-estimation procedure (including any bootstrapping or variance calculation); and (5) the protocol for controlling output stochasticity (number of runs, temperature settings, and aggregation method). Supplementary code and prompts will be provided for full reproducibility.

    revision: yes
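
The error-estimation step mentioned in item (4) could, for example, take the form of a percentile bootstrap over questions. The sketch below is one such approach under stated assumptions: the function names, resample count, seed, and toy data are hypothetical, and nothing here is confirmed as the authors' procedure.

```python
import random


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0-1) and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def bootstrap_ci(confidences, correct, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample questions with replacement,
    recompute the metric each time, and take the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([confidences[i] for i in idx],
                            [correct[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy data, invented for illustration only.
conf = [0.9, 0.8, 0.9, 0.6, 0.95, 0.7, 0.85, 0.9]
hit = [1, 1, 0, 0, 1, 1, 0, 1]

print(brier_score(conf, hit))
print(bootstrap_ci(conf, hit, brier_score))
```

Resampling whole questions keeps each confidence paired with its own outcome. The same helper can wrap any metric that takes the two lists, although a metric such as AUROC would need a guard for resamples that contain only correct or only incorrect answers.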

Circularity Check

0 steps flagged

Empirical benchmarking study; no derivations or self-referential reductions present.


The paper conducts an empirical evaluation of multiple LLMs on 300 board-style gastroenterology questions, reporting observed Brier scores, AUROC values, and overconfidence tendencies. No equations, parameter fits, uniqueness theorems, or ansatzes are defined or invoked. Results are direct measurements from the test set with no reduction to prior self-citations or constructed inputs. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no free parameters, axioms, or invented entities are described in the abstract.


