Self-Reported Confidence of Large Language Models in Gastroenterology: Analysis of Commercial, Open-Source, and Quantized Models

Ali Soroush; Girish Nadkarni; Nariman Naderi; Peter Lewis; Seyed Amir Ahmad Safavi-Naini; Thomas Savage; Zahra Atf

arXiv: 2503.18562 · v1 · submitted 2025-03-24 · 💻 cs.CL · cs.AI · cs.HC · cs.LG

Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf, Peter Lewis, Girish Nadkarni, Ali Soroush

Pith reviewed 2026-05-22 23:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.HC · cs.LG

keywords Large Language Models · Confidence Elicitation · Gastroenterology · Uncertainty Quantification · Brier Score · Overconfidence · Model Calibration · Artificial Intelligence

The pith

Large language models consistently overestimate their certainty when answering gastroenterology board questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests how well different large language models judge their own accuracy by asking them to report confidence levels on 300 gastroenterology board-style questions. The top models reach Brier scores of 0.15-0.2 and AUROC of 0.6, showing their confidence carries some signal but they remain overconfident overall. Newer models perform better on the questions themselves yet the overconfidence pattern holds across commercial, open-source, and quantized versions. A reader would care because poor uncertainty handling blocks safe use of these models in medical settings where mistaken certainty could affect decisions.

Core claim

When large language models are prompted to give self-reported confidence along with answers to 300 gastroenterology board-style questions, the highest-performing ones (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieve Brier scores of 0.15-0.2 and AUROC of 0.6 while all models, regardless of type, display a consistent tendency toward overconfidence.

What carries the argument

Elicitation of self-reported confidence probabilities from LLMs on multiple-choice gastroenterology questions, evaluated for calibration with Brier score and AUROC.
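The two metrics named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `brier_score` and `auroc` are chosen here, and it assumes each answer comes with a stated confidence already rescaled to the 0-1 range and a 0/1 correctness label.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0-1) and correctness (0 or 1).

    Lower is better; 0.0 means full confidence on every right answer
    and zero confidence on every wrong one.
    """
    if len(confidences) != len(correct) or len(correct) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def auroc(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a random correct answer got higher confidence than a
    random incorrect one (ties count as half). 0.5 means no signal."""
    pos = [c for c, y in zip(confidences, correct) if y == 1]
    neg = [c for c, y in zip(confidences, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration, not taken from the paper).
toy_conf = [0.9, 0.8, 0.9, 0.7, 0.95]
toy_correct = [1, 0, 1, 0, 1]
toy_brier = brier_score(toy_conf, toy_correct)
toy_auroc = auroc(toy_conf, toy_correct)
```

On this toy data the confidences rank every correct answer above every wrong one (AUROC 1.0), yet the Brier score is 0.2305 because the wrong answers still carry confidences of 0.7 and 0.8. The two metrics therefore answer different questions: AUROC measures ranking, while the Brier score also penalizes confidence that is too high in absolute terms.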

If this is right

Newer models improve answer accuracy yet still exhibit the same overconfidence in self-reported certainty.
The calibration problem appears across commercial, open-source, and quantized model categories.
Uncertainty estimation remains a core barrier to safe deployment of LLMs in gastroenterology practice.
All tested models require additional mechanisms to align reported confidence with actual performance.
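One simple way to put a number on "overconfidence" is the gap between mean stated confidence and observed accuracy, optionally broken down by confidence bin. The sketch below is an assumption about how such a check could look, not the paper's procedure; `overconfidence_gap`, `reliability_table`, and the equal-width binning are illustrative choices.

```python
from typing import List, Sequence, Tuple


def overconfidence_gap(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean stated confidence minus observed accuracy.

    Positive values mean the model claims more certainty than it earns.
    """
    n = len(correct)
    if n == 0 or n != len(confidences):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(confidences) / n - sum(correct) / n


def reliability_table(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Per-bin (mean confidence, accuracy, count) over equal-width bins.

    Empty bins are skipped. A well-calibrated model has mean confidence
    close to accuracy in every row.
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    rows = []
    for members in bins:
        if members:
            k = len(members)
            mean_conf = sum(c for c, _ in members) / k
            accuracy = sum(y for _, y in members) / k
            rows.append((mean_conf, accuracy, k))
    return rows
```

For example, four answers all reported at 0.9 confidence with only two correct give a gap of about 0.4 and a single table row of roughly (0.9, 0.5, 4).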

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training focused only on question accuracy may not teach models to recognize the boundaries of their knowledge in medical domains.
Physicians using these models would need independent checks rather than trusting self-reported confidence levels.
Similar overconfidence patterns could appear in other medical specialties that rely on board-style question formats.

Load-bearing premise

Self-reported confidence on these board-style questions serves as a valid proxy for the models' uncertainty in actual gastroenterology practice, and the 300 questions adequately represent the domain.

What would settle it

A direct comparison of the same models' self-reported confidence and accuracy on real clinical gastroenterology cases versus the board questions would show whether the overconfidence pattern holds outside standardized testing.

Figures

Figures reproduced from arXiv: 2503.18562 by Ali Soroush, Girish Nadkarni, Nariman Naderi, Peter Lewis, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Zahra Atf.

**Figure 2.** Average accuracy versus average confidence scores for LLMs with more than 150 ...

**Figure 3.** Left panel: Overall distribution of self-reported confidence scores and mean response accuracy (stars) for each model. Right panel: Distribution of self-reported confidence scores for each model stratified by response accuracy.

Original abstract

This study evaluated self-reported response certainty across several large language models (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, and Qwen) using 300 gastroenterology board-style questions. The highest-performing models (GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet) achieved Brier scores of 0.15-0.2 and AUROC of 0.6. Although newer models demonstrated improved performance, all exhibited a consistent tendency towards overconfidence. Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare. Keywords: Large Language Models; Confidence Elicitation; Artificial Intelligence; Gastroenterology; Uncertainty Quantification

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows consistent overconfidence in LLMs on 300 gastro board questions but offers no test of whether that pattern appears in actual clinical decisions.

The main thing to know is that top models like GPT-o1 preview, GPT-4o, and Claude-3.5-Sonnet still lean overconfident on these questions, landing Brier scores of 0.15-0.2 and AUROC around 0.6, with the pattern holding across commercial, open-source, and quantized variants. Newer models improve on raw accuracy but do not fix the calibration issue. That is the concrete empirical result here.

The work covers a useful range of model families on one medical subdomain and reports numeric outcomes rather than just qualitative claims, which is straightforward to build on. The consistent direction of the miscalibration across models is the clearest signal in the abstract.

The soft spots are more substantial. All measurements come from closed-ended board-style MCQs whose answers are fixed and publicly available. Real gastroenterology work involves open differentials, incomplete information, and patient context where uncertainty behaves differently, and nothing in the study checks whether the same models show comparable overconfidence on those inputs. The abstract also gives no information on how the 300 questions were selected, what statistical methods were used, or whether error bars or controls were applied, so the numbers are hard to assess for robustness. The safety implication in the abstract therefore rests on an untested transfer from exam performance to clinical use.

This paper is mainly for groups already running LLM evaluations in medicine who want another data point on calibration in one specialty. A reader focused on deployment safety would get a cautionary signal but not strong evidence that the problem generalizes. I would send it to peer review. The topic is practical and the basic observation is worth a closer look with tighter methods, even if the current version needs work on scope and reporting.

Referee Report

2 major / 1 minor

Summary. The paper evaluates self-reported confidence across commercial, open-source, and quantized LLMs (GPT, Claude, Llama, Phi, Mistral, Gemini, Gemma, Qwen) on 300 gastroenterology board-style questions. Top models (GPT-o1 preview, GPT-4o, Claude-3.5-Sonnet) achieve Brier scores of 0.15-0.2 and AUROC of ~0.6; all models exhibit overconfidence. The work concludes that uncertainty estimation poses a significant challenge to safe LLM use in healthcare.

Significance. If the reported calibration metrics hold, the study supplies a concrete multi-model benchmark on gastroenterology MCQs, including quantized variants, and documents a consistent overconfidence pattern. This is a useful empirical contribution for the subfield of medical LLM evaluation. The broader claim about healthcare safety, however, rests on an untested transfer from closed-ended exam questions to clinical practice.

major comments (2)

[Abstract and Conclusion] The central safety implication ('Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare') is load-bearing yet unsupported: all quantitative results (Brier 0.15-0.2, AUROC 0.6, overconfidence) derive exclusively from 300 closed-ended board MCQs with unambiguous public answers. No data, ablation, or discussion tests whether the same miscalibration appears under open-ended differentials, missing data, time pressure, or patient-specific factors.
[Methods] Methods description (question selection, exact confidence-elicitation prompt, statistical procedure for Brier/AUROC, error estimation, and controls for output stochasticity) is absent from the reported results. Without these, the numeric claims cannot be verified or reproduced, directly affecting soundness of the headline metrics.
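To make the request for stochasticity controls concrete, here is one hypothetical way repeated samplings of the same question could be collapsed into a single answer and confidence. Nothing here is described in the paper: the majority-vote rule, the `aggregate_runs` name, and the choice to average confidence only over runs that gave the winning answer are all assumptions for illustration.

```python
from collections import Counter
from statistics import mean
from typing import List, Tuple


def aggregate_runs(runs: List[Tuple[str, float]]) -> Tuple[str, float, float]:
    """Collapse repeated (answer, confidence) samples for one question.

    Returns the majority answer, the mean confidence over runs that gave
    that answer, and the fraction of runs that agreed with it. Ties are
    broken in favor of the answer seen first, which is how
    Counter.most_common orders equal counts.
    """
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(answer for answer, _ in runs)
    top_answer, top_count = counts.most_common(1)[0]
    top_conf = mean(c for a, c in runs if a == top_answer)
    return top_answer, top_conf, top_count / len(runs)


# Three hypothetical samplings of one multiple-choice question.
example = aggregate_runs([("B", 0.9), ("B", 0.8), ("C", 0.95)])
```

Reporting the agreement fraction alongside the self-reported confidence would also let a reader see whether run-to-run consistency tracks correctness better than the stated number does, which is the kind of control the comment asks about.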

minor comments (1)

[Abstract] Abstract should explicitly state the total number of models and families evaluated and note the board-question limitation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for clarification and strengthening. We address each major comment below and commit to revisions where appropriate.

Referee: [Abstract and Conclusion] The central safety implication ('Uncertainty estimation presents a significant challenge to the safe use of LLMs in healthcare') is load-bearing yet unsupported: all quantitative results (Brier 0.15-0.2, AUROC 0.6, overconfidence) derive exclusively from 300 closed-ended board MCQs with unambiguous public answers. No data, ablation, or discussion tests whether the same miscalibration appears under open-ended differentials, missing data, time pressure, or patient-specific factors.

Authors: We agree that the quantitative findings are confined to closed-ended MCQs and that no direct evidence is presented for open-ended clinical scenarios. The safety statement represents an interpretive extension rather than a tested claim. In revision we will qualify the abstract and conclusion to state that the observed overconfidence in this controlled MCQ setting suggests uncertainty estimation remains challenging, while explicitly noting the absence of data on open-ended, time-pressured, or patient-specific contexts. A dedicated limitations paragraph will be added discussing generalizability.

revision: yes

Referee: [Methods] Methods description (question selection, exact confidence-elicitation prompt, statistical procedure for Brier/AUROC, error estimation, and controls for output stochasticity) is absent from the reported results. Without these, the numeric claims cannot be verified or reproduced, directly affecting soundness of the headline metrics.

Authors: We acknowledge the methods section was insufficiently detailed. The revised manuscript will expand the Methods to include: (1) the exact source and curation process for the 300 gastroenterology board-style questions; (2) the verbatim prompt template used to elicit self-reported confidence; (3) the precise formulas and implementation for Brier score and AUROC; (4) the error-estimation procedure (including any bootstrapping or variance calculation); and (5) the protocol for controlling output stochasticity (number of runs, temperature settings, and aggregation method). Supplementary code and prompts will be provided for full reproducibility.

revision: yes
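As an illustration of the error-estimation item in this response, a percentile bootstrap over questions is one common way to attach an interval to a calibration metric. This is a sketch under assumptions, not the procedure the authors used or promised: `bootstrap_ci`, the 2000 resamples, the 95% level, and the fixed seed are illustrative defaults.

```python
import random
from typing import Callable, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between stated confidence (0-1) and correctness (0/1)."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


def bootstrap_ci(
    confidences: Sequence[float],
    correct: Sequence[int],
    metric: Callable[[Sequence[float], Sequence[int]], float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for `metric`, resampling questions
    with replacement. The seed is fixed so the interval is reproducible."""
    n = len(correct)
    if n == 0 or n != len(confidences):
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([confidences[i] for i in idx], [correct[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi
```

A metric that can be undefined on a resample, such as AUROC when a resample contains only correct answers, would need extra handling (for example, skipping or redrawing that resample) before being passed as `metric`.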

Circularity Check

0 steps flagged

Empirical benchmarking study; no derivations or self-referential reductions present.

The paper conducts an empirical evaluation of multiple LLMs on 300 board-style gastroenterology questions, reporting observed Brier scores, AUROC values, and overconfidence tendencies. No equations, parameter fits, uniqueness theorems, or ansatzes are defined or invoked. Results are direct measurements from the test set with no reduction to prior self-citations or constructed inputs. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5685 in / 1064 out tokens · 39110 ms · 2026-05-22T23:15:16.571459+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 5 internal anchors

[1] Shah, N. H., Entwistle, D. & Pfeffer, M. A. Creation and Adoption of Large Language Models in Medicine. JAMA 330, 866–869 (2023)
[2] Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, (2024)
[3] Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 1–8 (2025) doi:10.1038/s41591-024-03423-7
[4] Haltaufderheide, J. & Ranisch, R. The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs). Npj Digit. Med. 7, 1–11 (2024)
[5] McKenna, N. et al. Sources of Hallucination by Large Language Models on Inference Tasks. Preprint at https://doi.org/10.48550/arXiv.2305.14552 (2023)
[6] Xiong, M. et al. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. Preprint at https://doi.org/10.48550/arXiv.2306.13063 (2024)
[7] Fadeeva, E. et al. Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification. Preprint at https://doi.org/10.48550/arXiv.2403.04696 (2024)
[8] Li, K., Patel, O., Viégas, F., Pfister, H. & Wattenberg, M. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Preprint at https://doi.org/10.48550/arXiv.2306.03341 (2024)
[9] Azaria, A. & Mitchell, T. The Internal State of an LLM Knows When It's Lying. in Findings of the Association for Computational Linguistics: EMNLP 2023 (eds. Bouamor, H., Pino, J. & Bali, K.) 967–976 (Association for Computational Linguistics, Singapore, 2023). doi:10.18653/v1/2023.findings-emnlp.68
[10] Manakul, P., Liusie, A. & Gales, M. J. F. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2303.08896 (2023)
[11] Duan, J. et al. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds. Ku, L.-W., Martins, A. & Srikumar, V.) 5050–5063 (Association for Computational Linguistics, Bangko... (doi:10.18653/v1/2024.acl-long.276)
[12] Raj, H., Rosati, D. & Majumdar, S. Measuring Reliability of Large Language Models through Semantic Consistency. Preprint at https://doi.org/10.48550/arXiv.2211.05853 (2023)
[13] Wu, J., Yu, Y. & Zhou, H.-Y. Uncertainty Estimation of Large Language Models in Medical Question Answering. Preprint at https://doi.org/10.48550/arXiv.2407.08662 (2024)
[14] Pedapati, T., Dhurandhar, A., Ghosh, S., Dan, S. & Sattigeri, P. Large Language Model Confidence Estimation via Black-Box Access. Preprint at https://doi.org/10.48550/arXiv.2406.04370 (2024)
[15] Tsai, Y.-H. H., Talbott, W. & Zhang, J. Efficient Non-Parametric Uncertainty Quantification for Black-Box Large Language Models and Decision Planning. Preprint at https://doi.org/10.48550/arXiv.2402.00251 (2024)
[16] Tian, K. et al. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds. Bouamor, H., Pino, J. & Bali, K.) 5433–5442 (Association for Computational Linguistics, Singapore, 2023). doi:... (doi:10.18653/v1/2023.emnlp-main.330)
[18] Ni, S., Bi, K., Yu, L. & Guo, J. Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence? Preprint at https://doi.org/10.48550/arXiv.2408.09773 (2024)
[20] Savage, T. et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J. Am. Med. Inform. Assoc. JAMIA 32, 139–149 (2025)
[21] Safavi-Naini, S. A. A. et al. Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models. Preprint at https://doi.org/10.48550/arXiv.2409.00084 (2024)
[22] Omar, M., Agbareia, R., Glicksberg, B. S., Nadkarni, G. N. & Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. 2024.08.11.24311810 Preprint at https://doi.org/10.1101/2024.08.11.24311810 (2024)
[23] Vashurin, R. et al. Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. Preprint at https://doi.org/10.48550/arXiv.2406.15627 (2024)
[24] Yu, D. et al. Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models. Preprint at https://doi.org/10.48550/arXiv.2310.17567 (2023)
[25] Yang, D., Tsai, Y.-H. H. & Yamada, M. On Verbalized Confidence Scores for LLMs. Preprint at https://doi.org/10.48550/arXiv.2412.14737 (2024)
