Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology

Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Zhe Fei, Fabien Scalzo, Ira Kurtz · 2024

1 Pith paper cites this work. Polarity classification is still indexing.

citation-role summary

background 1

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
cs.CL · 2026-04-03 · unverdicted · none · ref 51