Language Models (Mostly) Know What They Know
Canonical reference. 74% of citing Pith papers cite this work as background.
abstract
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.
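The abstract's notion of calibration for P(True) can be made concrete with a small sketch. Expected calibration error (ECE) with equal-width bins is one common way to quantify calibration; whether it matches the exact metric or binning used in the paper is an assumption here. The function name `expected_calibration_error`, the inputs `p_true` and `correct`, and the default of 10 bins are all illustrative choices, not taken from the source.

```python
from typing import Sequence


def expected_calibration_error(
    p_true: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    p_true  : hypothetical self-evaluation probabilities in [0, 1]
    correct : whether each proposed answer was actually correct
    """
    if len(p_true) != len(correct):
        raise ValueError("p_true and correct must have the same length")
    n = len(p_true)
    if n == 0:
        raise ValueError("need at least one example")

    # Assign each example to an equal-width confidence bin;
    # p == 1.0 is clamped into the last bin.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(p_true, correct):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, c))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Made-up numbers purely for illustration, not results from the paper.
    probs = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
    labels = [True, True, True, False, False, False]
    print(expected_calibration_error(probs, labels))
```

A value near zero means the stated probabilities track observed accuracy within each bin; the result depends on the number of bins, so comparisons should hold `n_bins` fixed.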
citing papers explorer
- Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.
- Self-Calibrating Language Models via Test-Time Discriminative Distillation
SECL reduces expected calibration error in language models by 56-78% via test-time discriminative distillation from the model's own P(True) signal, adapting on only 6-26% of inputs.
- Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
New benchmark evaluates three frontier deep research agents on 42 SME prompts with verifiers and rubrics, reporting low acceptance rates of 9.5-21.4% and agent-specific failure modes.
- Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
- LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
- Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Model-adaptive tool necessity shows 26-54% mismatch with actual tool calls across LLMs, driven by nearly orthogonal hidden-state signals for cognition versus action.
- Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.
- Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
- TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
- Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a training-free surrogate framework that outperforms baselines.
- Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
- Task-Aware Calibration: Provably Optimal Decoding in LLMs
Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.
- The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs
The Metacognitive Probe identifies large within-model gaps in LLM confidence behavior, including a 47-point dissociation in Gemini 2.5 Flash between strong task calibration and weak difficulty prediction.
- RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.
- EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium
EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.
- LLM Agents Already Know When to Call Tools -- Even Without Reasoning
LLM agents encode tool necessity in pre-generation hidden states with high linear decodability (AUROC 0.89-0.96); Probe&Prefill uses this to reduce tool calls 48% with 1.7% accuracy loss.
- Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.
- Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
- Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
- Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization
Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.
- AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
- Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems
Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.
- The First Token Knows: Single-Decode Confidence for Hallucination Detection
First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) across three 7-8B models and two benchmarks.
- Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
- SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.
- Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models
Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.
- Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI
Introduces Defensibility Index, Ambiguity Index, and Probabilistic Defensibility Signal to evaluate AI moderation decisions by logical derivability from explicit rules rather than agreement with historical labels, with validation on 193k+ Reddit cases showing 33-46.6 pp metric gaps and a Governance
- Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
- Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report
Validity indices adapted from clinical assessment classify four frontier LLMs as construct-level invalid on metacognitive probes, with valid models showing positive item-sensitive confidence (r=.18) while invalid ones show the opposite (r=-.20).
- Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents
LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
- UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
UsefulBench is a new benchmark dataset that separates relevance from usefulness in information retrieval, revealing that similarity-based systems and current LLMs fall short on decision-useful content.
- The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
The Metacognitive Monitoring Battery applied to 20 LLMs identifies three self-monitoring profiles, shows inverted accuracy and sensitivity ranks, and finds retrospective and prospective regulation largely dissociable.
- Calibrated Confidence Estimation for Tabular Question Answering
Tabular QA LLMs are overconfident, but Multi-Format Agreement using Markdown/HTML/JSON/CSV variants improves AUROC to 0.80 and cuts calibration error by 44-63% at lower cost than sampling.
- Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.
- Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment
Bounded agents induce capacity-derived semantic spaces via quotient POMDPs, with a structural phase transition making intent-preserving communication impossible below a critical rate determined by quotient mismatch.
- Unified Multimodal Uncertain Inference
Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.
- Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation
Hypothesis Graph Refinement represents frontier predictions as revisable hypothesis nodes and applies verification-driven cascade correction to prune erroneous subgraphs, achieving 72.41% success and 56.22% SPL on GOAT-Bench.
- BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
- Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.
- ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation
ScrapeGraphAI-100k releases 93,695 real telemetry examples pairing web page content with prompts, schemas, and LLM responses to support training and benchmarking of schema-constrained generation.
- Explaining Sources of Uncertainty in Automated Fact-Checking
CLUE generates natural language explanations of model uncertainty in fact-checking by unsupervised identification of claim-evidence and inter-evidence conflicts and agreements, followed by prompting and attention steering.
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Larger LLMs hallucinate more often despite having the correct concept available because instruction tuning causes probability mass to disperse across alternative surface forms instead of concentrating on one.
- PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
PromptNCE frames LLM conditional probability estimation as contrastive prompting augmented with an OTHER category, recovering true P(y|x) and achieving up to 0.82 Spearman correlation with human-derived PMI on three datasets.
- Reading Calibrated Uncertainty from Language Model Trajectories
Geometric features from per-layer MLP update trajectories fed to a sparse linear probe outperform maximum softmax probability for uncertainty quantification under selective abstention, with gains up to 21 AURC points.
- Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.
- Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
SCA framework applies Information Bottleneck to assign step-level confidence in black-box LLM reasoning traces, flagging errors and boosting self-correction success by up to 13.5% on math and QA tasks.
- PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization
PAIR combines a hidden-state probe with an attention correction to deliver robust step-level rewards for GRPO-based optimization of multi-turn LLM agents, achieving high AUROC on contaminated trajectories at low cost.
- CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models
CyberCorrect applies cybernetic control theory to LLM self-correction, reporting 79.8% accuracy on a new 440-task benchmark with 6.2-point gains and 41% less over-correction.