super hub Canonical reference

Language Models (Mostly) Know What They Know

Amanda Askell, Dawn Drain, Ethan Perez, Saurav Kadavath, Tom Conerly, Tom Henighan · 2022 · cs.CL · arXiv 2207.05221

Canonical reference. 74% of citing Pith papers cite this work as background.

308 Pith papers citing it

Background 74% of classified citations

open full Pith review browse 308 citing papers more from Amanda Askell arXiv PDF

abstract

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 34 method 3 baseline 2

citation-polarity summary

background 29 support 3 use method 3 baseline 2 unclear 2

claims ledger

abstract We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at sel

authors

Amanda Askell Dawn Drain Ethan Perez Saurav Kadavath Tom Conerly Tom Henighan

co-cited works

representative citing papers

Reported Confidence in LLMs Tracks Commitment More Than Correctness

cs.LG · 2026-06-28 · unverdicted · novelty 8.0

Verbal confidence in LLMs tracks future commit/abstain decisions more than answer correctness, while log-probabilities track correctness.

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

cs.AI · 2026-06-01 · conditional · novelty 8.0

Current benchmarks overlook abstention competence in agents due to compliance bias; a new three-gap taxonomy and metrics (Safety Rate, Usability Rate, Informed Refusal Rate) demonstrate tunable safety-usability tradeoffs in preliminary tests across five model families.

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

cs.AI · 2026-05-17 · unverdicted · novelty 8.0

A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.

Pretraining Exposure Explains Popularity Judgments in Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

Self-Calibrating Language Models via Test-Time Discriminative Distillation

cs.CL · 2026-03-18 · unverdicted · novelty 8.0

SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.

An Empirical Study of Security Calibration in Large Language Models for Code

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Empirical evaluation of three LLMs finds prevalent overconfidence in insecure code generation, with security calibration outperforming functional calibration but both degrading in repository-level settings.

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

Reclaim evaluation shows lossy memory in language models is never better than empty memory across eight models, with a source-first policy restoring correctability at fixed budget.

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

SPOT-E uses entropy shaping on answer predictions with low-entropy anchors to optimize visual spotlights at test time via GRPO for better VLM performance on evidence-intensive tasks.

MortarBench: Evaluating Mortgage Loan Origination Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MortarBench benchmark shows LLMs achieve ≤77.1% accuracy on loan origination; CRIT calibration raises accuracy to 80.5% and reduces bias.

Zero-Shot Active Feature Acquisition via LLM-Elicitation

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

A framework elicits discriminative MRF statistics from an LLM and closes the model via maximum entropy to enable zero-shot active feature acquisition, outperforming baselines on IBD patient data especially for hardest cases.

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

cs.CL · 2026-06-11 · unverdicted · novelty 7.0

Operadic consistency is a new per-question signal that correlates strongly with accuracy (r 0.86-0.94) across four multi-hop QA datasets and improves selective prediction over CoT-SC baselines.

CalBrief: A Pilot Diagnostic Benchmark for Evidence-Calibrated Scientific Briefing with Large Language Models

cs.DL · 2026-06-11 · conditional · novelty 7.0

CalBrief is a new diagnostic benchmark showing that explicit four-way strength calibration makes LLMs over-conservative mainly due to label-space expansion, while structured organization improves gap reasoning.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

Forecasting Future Behavior as a Learning Task

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Behavior Forecasters trained on LRM trajectories outperform larger models in predicting repeatability and input sensitivity at low cost.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

OpenHalDet creates a standardized benchmark and open codebase for comparing hallucination detectors across diverse LLM generation scenarios and access settings.

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces CHARM framework that detects cascading hallucinations in agentic RAG at 89.4% rate with 5.3% false positives and reduces error propagation by 82.1% on multi-hop QA benchmarks.

Can LLM Rerankers Predict Their Own Ranking Performance?

cs.IR · 2026-06-02 · unverdicted · novelty 7.0

LLM rerankers can internally predict ranking quality via self-consistency of sampled outputs, matching SOTA external QPP while direct confidence is overconfident; supervised token-efficient methods improve calibration.

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

Dynamic reputation updates per objective-expert pair plus a three-arm counterfactual gate improve robustness over fixed LLM priors on synthetic tests and molecule benchmarks, but raw LLM confidence is not reliably helpful.

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Frontier VLMs overconfidently answer spatial questions under occlusion (~30% accuracy) and perspective ambiguity (<10% accuracy) instead of abstaining, and often fail to select helpful additional views.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

LLMs struggle to associate epistemic markers with stable internal confidence levels across distributions, even under model-centric interpretations, while maintaining somewhat consistent marker rankings.

citing papers explorer

Showing 4 of 4 citing papers after filters.

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support cs.LG · 2026-04-21 · unreviewed · ref 28 · 2 links · internal anchor
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards cs.LG · 2026-03-10 · unreviewed · ref 10 · internal anchor
HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench cs.LG · 2026-01-28 · unreviewed · ref 12 · internal anchor
High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models cs.CV · 2025-12-26 · unreviewed · ref 16 · internal anchor

Language Models (Mostly) Know What They Know

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer