Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
background 1
citation-polarity summary
roles: background 1
polarities: background 1
representative citing papers
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Human-AI hybrids gain only 0.4 percentage points over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.
Chain-of-thought monitorability provides a promising but fragile method for AI safety oversight that developers should actively preserve.
citing papers explorer
- A General Language Assistant as a Laboratory for Alignment: Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.