Distinguishing the Knowable from the Unknowable with Language Models
We study the feasibility of identifying epistemic uncertainty (reflecting a lack of knowledge), as opposed to aleatoric uncertainty (reflecting entropy in the underlying distribution), in the outputs of large language models (LLMs) over free-form text. In the absence of ground-truth probabilities, we explore a setting where, in order to (approximately) disentangle a given LLM's uncertainty, a significantly larger model stands in as a proxy for the ground truth. We show that small linear probes trained on the embeddings of frozen, pretrained models accurately predict when larger models will be more confident at the token level and that probes trained on one text domain generalize to others. Going further, we propose a fully unsupervised method that achieves non-trivial accuracy on the same task. Taken together, we interpret these results as evidence that LLMs naturally contain internal representations of different types of uncertainty that could potentially be leveraged to devise more informative indicators of model confidence in diverse practical settings.
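The probing setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the embeddings are synthetic Gaussian vectors rather than activations of a real frozen model, the "large model entropy" is simulated from a hidden direction, the entropy threshold of 1.0 is arbitrary, and the names `train_linear_probe`, `probe_predict`, and `run_demo` are hypothetical. The probe itself is plain logistic regression fit by gradient descent.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        g = p - y                               # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b


def probe_predict(X, w, b):
    """Return 1 where the probe predicts the large model is confident."""
    return (X @ w + b > 0).astype(int)


def run_demo(seed=0, n=2000, d=16):
    """Train and evaluate a probe on synthetic data; returns held-out accuracy."""
    rng = np.random.default_rng(seed)

    # Stand-in for per-token embeddings of a frozen small model (synthetic).
    X = rng.normal(size=(n, d))

    # Assumption: a single hidden direction in embedding space determines
    # how confident the larger model is on each token, up to noise.
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    large_entropy = np.exp(-(X @ u) + 0.3 * rng.normal(size=n))

    # Label 1 = the large model is confident (entropy below an arbitrary
    # threshold), i.e. the small model's uncertainty here would be epistemic.
    y = (large_entropy < 1.0).astype(float)

    # Train on the first 1500 tokens, evaluate on the remaining 500.
    w, b = train_linear_probe(X[:1500], y[:1500])
    preds = probe_predict(X[1500:], w, b)
    return float((preds == y[1500:]).mean())


if __name__ == "__main__":
    print(f"held-out probe accuracy: {run_demo():.3f}")
```

In a real experiment, `X` would be replaced by hidden states extracted from the smaller model and `y` by labels derived from the larger model's token-level predictive entropy; how those labels are thresholded is a design choice that this sketch does not settle.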
Forward citations
Cited by 4 Pith papers
- Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier
  PROPEL amortizes solver evaluation with a trained activation probe to optimize task generators toward a target solve rate, raising the share of learnable tasks from ~10% to ~20% in coding and SWE experiments.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
  CLUES decomposes semantic uncertainty into separate ambiguity and instability scores for clinical Text-to-SQL, with instability via Schur complement, outperforming Kernel Language Entropy on failure prediction while e...
- Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal
  A new pipeline uses interpretability to characterize concepts in preference data and shape rewards via feature or data interventions during LM post-training.