Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.
Language Models (Mostly) Know What They Know
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
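The abstract's calibration claims can be made concrete with a small sketch. The function below computes expected calibration error (ECE) with equal-width bins over confidence scores such as a proposed answer's P(True). It is a generic illustration, not the paper's evaluation code: the function name, the bin count, and the toy inputs are all assumptions, and the paper may bin or weight differently.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; `confidences` stands in for scores
    like P(True), and `correct` marks whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sample count, summed confidence, number correct.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for p, ok in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence {p} is outside [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += p
        hits[b] += int(ok)

    total = len(confidences)
    ece = 0.0
    for n, s, h in zip(counts, conf_sums, hits):
        if n == 0:
            continue  # empty bins contribute nothing
        ece += (n / total) * abs(h / n - s / n)
    return ece


# Toy data (made up): four answers at 0.9 confidence, two at 0.1.
toy_conf = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
toy_correct = [True, True, True, False, False, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
```

A well-calibrated model drives this value toward zero: among answers it scores at 0.9, about 90% should be correct. In the toy data the 0.9 bin is only 75% accurate and the 0.1 bin is 0% accurate, so both bins contribute a gap.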
representative citing papers
Current benchmarks overlook abstention competence in agents due to compliance bias; a new three-gap taxonomy and metrics (Safety Rate, Usability Rate, Informed Refusal Rate) demonstrate tunable safety-usability tradeoffs in preliminary tests across five model families.
A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.
Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.
Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.
SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.
MortarBench benchmark shows LLMs achieve ≤77.1% accuracy on loan origination; CRIT calibration raises accuracy to 80.5% and reduces bias.
A framework elicits discriminative MRF statistics from an LLM and closes the model via maximum entropy to enable zero-shot active feature acquisition, outperforming baselines on IBD patient data especially for hardest cases.
Operadic consistency is a new per-question signal that correlates strongly with accuracy (r 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.
CalBrief is a new diagnostic benchmark showing that explicit four-way strength calibration makes LLMs over-conservative mainly due to label-space expansion, while structured organization improves gap reasoning.
MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.
Behavior Forecasters trained on LRM trajectories outperform larger models in predicting repeatability and input sensitivity at low cost.
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
OpenHalDet creates a standardized benchmark and open codebase for comparing hallucination detectors across diverse LLM generation scenarios and access settings.
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
CHARM is a framework that detects cascading hallucinations in agentic RAG at an 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.
LLM rerankers can internally predict ranking quality via self-consistency of sampled outputs, matching SOTA external QPP while direct confidence is overconfident; supervised token-efficient methods improve calibration.
The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.
Dynamic reputation updates per objective-expert pair plus a three-arm counterfactual gate improve robustness over fixed LLM priors on synthetic tests and molecule benchmarks, but raw LLM confidence is not reliably helpful.
Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.
citing papers explorer
- MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support
- Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
- HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench
- High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models