pith. machine review for the scientific record.

arxiv: 2207.05221 · v4 · submitted 2022-07-11 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Language Models (Mostly) Know What They Know

Pith reviewed 2026-05-10 15:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords language models · self-evaluation · calibration · honesty · P(True) · P(IK) · uncertainty estimation

The pith

Language models can assess whether their proposed answers are likely correct by outputting a P(True) probability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models can judge the truth of their own statements and foresee which questions they will handle accurately. Larger models prove well-calibrated when answering multiple-choice or true/false items presented in suitable formats. For open-ended questions, the models first generate candidate answers and then estimate the probability that a given answer is true, yielding usable calibration that improves with scale. Models can also be trained to directly predict the probability that they know the answer to a question; this works well and transfers partially to unseen tasks, although calibration suffers on new tasks. These results matter because they open a route toward language models that can flag their own uncertainty instead of confidently stating falsehoods.

Core claim

Larger models are well-calibrated on diverse multiple-choice and true/false questions when given in the right format. On open-ended tasks, models first sample answers and then assign a P(True) value to the correctness of each; this self-evaluation shows encouraging calibration and continues to improve with model size. Performance rises further when models review many of their own samples before scoring one. Separately, models trained to output P(IK), the probability that they know the answer to a question without seeing any candidate, perform well, generalize partially across tasks, and adjust their P(IK) upward when given relevant context or solution hints.
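The two-step procedure described above (sample answers, then score one of them, optionally after showing the model its other samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Sampler` and `Scorer` interfaces, the prompt wording, and the "(A)"/"(B)" option tokens are all hypothetical stand-ins for whatever model interface and template are actually used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   Sampler: (question, n) -> n sampled candidate answers
#   Scorer:  prompt -> probabilities assigned to possible next tokens
Sampler = Callable[[str, int], List[str]]
Scorer = Callable[[str], Dict[str, float]]


def build_prompt(question: str, proposed: str, others: List[str]) -> str:
    """Assemble an illustrative self-evaluation prompt.

    `others` holds additional samples shown to the model before it scores
    `proposed`; pass an empty list to score the answer in isolation.
    """
    lines = [f"Question: {question}"]
    if others:
        lines.append("Other sampled answers: " + "; ".join(others))
    lines.append(f"Proposed answer: {proposed}")
    lines.append("Is the proposed answer (A) True or (B) False?")
    lines.append("The proposed answer is:")
    return "\n".join(lines)


def p_true(prompt: str, scorer: Scorer) -> float:
    """Normalize the probability mass over the two option tokens."""
    probs = scorer(prompt)
    a = probs.get("(A)", 0.0)
    b = probs.get("(B)", 0.0)
    if a + b == 0.0:
        raise ValueError("scorer put no mass on either option token")
    return a / (a + b)


def self_evaluate(
    question: str, sampler: Sampler, scorer: Scorer, n_samples: int = 5
) -> List[Tuple[str, float]]:
    """Sample candidates, then score each one given the other samples."""
    candidates = sampler(question, n_samples)
    scored = []
    for i, cand in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        prompt = build_prompt(question, cand, others)
        scored.append((cand, p_true(prompt, scorer)))
    return scored
```

Renormalizing over the two option tokens is a design choice made here for simplicity; a real setup might instead read the raw probability of a single option token.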

What carries the argument

Two quantities: P(True), the probability a model assigns to one of its own proposed answers being correct, and P(IK), a probability the model is trained to output that indicates whether it knows the answer to a question, without reference to any specific proposal.

If this is right

  • Self-evaluation accuracy increases when the model is allowed to consider multiple samples before scoring any one of them.
  • P(IK) values rise appropriately when the model receives relevant source material in its context.
  • P(IK) values also rise when the model is given hints that point toward the solution of math word problems.
  • Models trained to predict P(IK) achieve partial generalization from one task to another.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting or training approach could be used to let models decline to answer when their internal probability falls below a chosen threshold.
  • Downstream systems might filter or rerank model outputs by requiring a minimum P(True) before accepting a generation.
  • Further experiments could check whether P(IK) remains informative after the model is fine-tuned on objectives other than next-token prediction.
  • The method supplies a concrete signal that could be monitored during training to encourage more consistent self-assessment.
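The first two extensions above (declining to answer below a threshold, and filtering or reranking by a minimum P(True)) could look roughly like this. It is a sketch only: the function names, the `(answer, probability)` pair representation, and the threshold values are illustrative assumptions, and nothing here comes from the paper itself.

```python
from typing import List, Optional, Tuple

# (answer text, self-assessed probability that the answer is correct)
ScoredAnswer = Tuple[str, float]


def answer_or_abstain(
    scored: List[ScoredAnswer], threshold: float
) -> Optional[str]:
    """Return the highest-scoring answer, or None to signal abstention."""
    if not scored:
        return None
    best_answer, best_p = max(scored, key=lambda pair: pair[1])
    return best_answer if best_p >= threshold else None


def filter_and_rerank(
    scored: List[ScoredAnswer], min_p: float
) -> List[ScoredAnswer]:
    """Keep answers with probability >= min_p, most confident first."""
    kept = [pair for pair in scored if pair[1] >= min_p]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

How useful such a gate is depends on how well calibrated the probabilities are on the target task, so any threshold would need to be chosen on held-out data.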

Load-bearing premise

The output probabilities reflect an internal sense of knowledge rather than surface patterns copied from the prompt format and training data.

What would settle it

A new collection of questions where the model's assigned P(True) values show no correlation with actual correctness or where calibration stops improving with scale.
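Both halves of that test can be made concrete with standard metrics. The sketch below computes an equal-width-bin expected calibration error and a Pearson correlation between assigned probabilities and 0/1 correctness; the choice of these two metrics, the bin count, and the function names are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[list] = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(1 for _, y in items if y) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def pearson_correlation(
    confidences: Sequence[float], correct: Sequence[bool]
) -> float:
    """Correlation between probabilities and 0/1 correctness.

    Returns 0.0 when either input has zero variance, where the
    correlation is undefined.
    """
    xs = list(confidences)
    ys = [1.0 if y else 0.0 for y in correct]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / (vx * vy) ** 0.5
```

A correlation near zero on a fresh question set, or a calibration error that stops shrinking as model size grows, would be the kind of evidence described above.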

Original abstract

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that larger language models are well-calibrated on multiple-choice and true/false questions when presented in appropriate prompt formats. It extends this to open-ended generation by having models first sample answers and then estimate P(True) for correctness, reporting encouraging calibration, scaling with model size, and further gains from considering multiple samples. Models can also be trained to directly predict P(IK) (probability they know the answer to a question), with partial cross-task generalization; these P(IK) estimates increase appropriately when relevant context or hints are provided.

Significance. If the results hold under appropriate controls, the work provides useful empirical evidence on scaling of self-evaluation in LMs and demonstrates that explicit training for P(IK) prediction can yield partial generalization and context sensitivity. Strengths include the breadth of tasks examined and the observation that performance improves with multiple samples and added hints. This contributes to the broader goal of training more honest models, though the significance hinges on whether the prompted scalar probabilities capture genuine epistemic assessment.

major comments (2)
  1. [§3.2] §3.2 (P(True) on open-ended tasks): The reported calibration and scaling results rely on fixed prompt templates for eliciting P(True). No ablations on prompt paraphrasing, format randomization, or semantically equivalent but surface-altered instructions are presented. This is load-bearing for the central self-knowledge claim, as the skeptic concern (that outputs may reflect learned statistical associations with prompt phrasing rather than internal epistemic state) remains unaddressed by the current evidence.
  2. [Sections 2 and 4] Methods and experimental details (Sections 2 and 4): Full specification of data splits, number of independent runs, statistical tests for scaling trends, and exact training procedures for the P(IK) predictors is not provided. This affects assessment of the partial generalization results and the claim that P(IK) increases with relevant context, as it is unclear whether task selection or post-hoc choices influence the reported outcomes.
minor comments (2)
  1. [Abstract] The abstract introduces P(True) and P(IK) without a brief parenthetical definition; adding one sentence would improve accessibility for readers unfamiliar with the notation.
  2. A consolidated table listing all tasks, datasets, and prompt formats used across experiments would aid reproducibility and allow readers to assess the diversity claim more directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each of the major comments below and have revised the paper to incorporate additional details and robustness checks where feasible.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (P(True) on open-ended tasks): The reported calibration and scaling results rely on fixed prompt templates for eliciting P(True). No ablations on prompt paraphrasing, format randomization, or semantically equivalent but surface-altered instructions are presented. This is load-bearing for the central self-knowledge claim, as the skeptic concern (that outputs may reflect learned statistical associations with prompt phrasing rather than internal epistemic state) remains unaddressed by the current evidence.

    Authors: We acknowledge that reliance on fixed prompt templates leaves open the possibility that results could partly reflect surface-level associations rather than deeper epistemic assessment. While we chose templates that performed well in preliminary checks and observed consistent scaling across diverse tasks, we agree this is an important robustness concern. In the revised manuscript we will add an ablation examining a subset of tasks under paraphrased and semantically equivalent prompt variants, to test whether calibration and scaling trends are preserved. revision: yes

  2. Referee: [Sections 2 and 4] Methods and experimental details (Sections 2 and 4): Full specification of data splits, number of independent runs, statistical tests for scaling trends, and exact training procedures for the P(IK) predictors is not provided. This affects assessment of the partial generalization results and the claim that P(IK) increases with relevant context, as it is unclear whether task selection or post-hoc choices influence the reported outcomes.

    Authors: We agree that the current manuscript lacks sufficient methodological detail for full reproducibility and evaluation of the reported trends. The revised version will expand Sections 2 and 4 to include complete specifications of data splits, the number of independent runs performed, any statistical tests used to evaluate scaling, and the precise training procedures, hyperparameters, and implementation details for the P(IK) predictors. revision: yes
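The prompt-paraphrase ablation discussed in the first exchange could be summarized with a per-template score and a spread across templates. The sketch below uses the Brier score for this; the metric choice, the record format, and the template names used in any call are illustrative assumptions, and no results are implied.

```python
from typing import Dict, List, Tuple

# Each record pairs a predicted probability with whether the answer was correct.
Record = Tuple[float, bool]


def brier_score(records: List[Record]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    if not records:
        raise ValueError("need at least one record")
    return sum((p - (1.0 if ok else 0.0)) ** 2 for p, ok in records) / len(records)


def template_sensitivity(
    results: Dict[str, List[Record]]
) -> Tuple[Dict[str, float], float]:
    """Return the Brier score per prompt template and the max-min spread."""
    if not results:
        raise ValueError("need at least one template")
    scores = {name: brier_score(recs) for name, recs in results.items()}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread
```

A small spread across semantically equivalent templates would suggest the probabilities are not tied to one particular phrasing; a large spread would support the referee's concern.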

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements against held-out correctness

full rationale

The paper reports direct empirical measurements of model calibration, scaling, and generalization for prompted P(True) and P(IK) outputs compared to ground-truth correctness on held-out questions. No mathematical derivations, equations, or first-principles claims are present that could reduce to self-definitional inputs, fitted parameters renamed as predictions, or self-citation chains. All performance numbers are falsifiable against external labels and benchmarks independent of the prompting procedure itself. Minor self-citations (if any) are not load-bearing for the central empirical observations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of the transformer training paradigm and the validity of prompted probability outputs as epistemic estimates. No new physical or mathematical entities are introduced.

axioms (1)
  • domain assumption: Prompted next-token probabilities can be interpreted as the model's estimate of answer correctness.
    Invoked when converting model outputs to P(True) and P(IK) scores.

pith-pipeline@v0.9.0 · 5676 in / 1117 out tokens · 28500 ms · 2026-05-10T15:37:54.996703+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pretraining Exposure Explains Popularity Judgments in Large Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    LLM popularity judgments align more closely with pretraining data exposure counts than with Wikipedia popularity, with stronger effects in pairwise comparisons and larger models.

  2. Architecture Determines Observability of Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.

  3. Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

    cs.CL 2026-05 unverdicted novelty 7.0

    Hallucination is detected as a transport-cost excursion in hidden-state trajectories, localized via contrastive PCA in a teacher model and distilled to a BiLSTM student.

  4. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  5. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

    cs.AI 2026-05 unverdicted novelty 7.0

    TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...

  6. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  7. Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

    cs.CL 2026-05 unverdicted novelty 7.0

    BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...

  8. Task-Aware Calibration: Provably Optimal Decoding in LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    Task calibration aligns LLM distributions in latent task spaces to make MBR decoding provably optimal and improve generation quality.

  9. The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

    cs.AI 2026-05 unverdicted novelty 7.0

    The Metacognitive Probe identifies large within-model gaps in LLM confidence behavior, including a 47-point dissociation in Gemini 2.5 Flash between strong task calibration and weak difficulty prediction.

  10. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 7.0

    RubricRefine improves tool-use agent reliability to 0.86 on M3ToolEval by generating rubrics for pre-execution contract checking and iterative repair, outperforming baselines at 2.6X lower latency while showing no gai...

  11. EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

    cs.AI 2026-05 unverdicted novelty 7.0

    EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to ad...

  12. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  13. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion from terminal commitment in embodied agents, exposing up to 19.7 pp gaps in benchmark success despite comparable execution across 20 models.

  14. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    VIGIL decouples world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps in B for models with similar W across 20 systems on 1000 episodes.

  15. Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

    cs.HC 2026-05 accept novelty 7.0

    LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

  16. Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?

    cs.CL 2026-05 unverdicted novelty 7.0

    Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.

  17. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 7.0

    Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...

  18. Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

    cs.AI 2026-05 unverdicted novelty 7.0

    Joint Consistency casts test-time aggregation as Ising-type energy minimization with pairwise LLM-judge interactions, subsuming voting methods and outperforming baselines across reasoning tasks.

  19. AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

    astro-ph.IM 2026-05 unverdicted novelty 7.0

    AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

  20. Partial Evidence Bench: Benchmarking Authorization-Limited Evidence in Agentic Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Partial Evidence Bench is a deterministic benchmark that measures agent correctness, completeness awareness, gap-report quality, and unsafe overclaiming in authorization-constrained evidence environments.

  21. The First Token Knows: Single-Decode Confidence for Hallucination Detection

    cs.CL 2026-05 unverdicted novelty 7.0

    First-token normalized entropy (phi_first) from one greedy decode reaches mean AUROC 0.820 for hallucination detection, matching or exceeding semantic self-consistency (0.793) and surface self-consistency (0.791) acro...

  22. Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.

  23. SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass

    cs.IT 2026-05 unverdicted novelty 7.0

    SENECA uses a novel self-consistent missing mass calculation to improve discrete entropy estimates in small-sample regimes and outperforms alternatives in numerical tests.

  24. Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- a...

  25. Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

    cs.AI 2026-04 unverdicted novelty 7.0

    Introduces Defensibility Index, Ambiguity Index, and Probabilistic Defensibility Signal to evaluate AI moderation decisions by logical derivability from explicit rules rather than agreement with historical labels, wit...

  26. Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

    cs.LG 2026-04 unverdicted novelty 7.0

    Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.

  27. Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report

    cs.CL 2026-04 conditional novelty 7.0

    Validity indices adapted from clinical assessment classify four frontier LLMs as construct-level invalid on metacognitive probes, with valid models showing positive item-sensitive confidence (r=.18) while invalid ones...

  28. Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.

  29. UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    UsefulBench is a new benchmark dataset that separates relevance from usefulness in information retrieval, revealing that similarity-based systems and current LLMs fall short on decision-useful content.

  30. The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

    cs.CL 2026-04 unverdicted novelty 7.0

    The Metacognitive Monitoring Battery applied to 20 LLMs identifies three self-monitoring profiles, shows inverted accuracy and sensitivity ranks, and finds retrospective and prospective regulation largely dissociable.

  31. Calibrated Confidence Estimation for Tabular Question Answering

    cs.CL 2026-04 unverdicted novelty 7.0

    Tabular QA LLMs are overconfident, but Multi-Format Agreement using Markdown/HTML/JSON/CSV variants improves AUROC to 0.80 and cuts calibration error by 44-63% at lower cost than sampling.

  32. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  33. Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment

    cs.IT 2026-04 unverdicted novelty 7.0

    Bounded agents induce capacity-derived semantic spaces via quotient POMDPs, with a structural phase transition making intent-preserving communication impossible below a critical rate determined by quotient mismatch.

  34. Unified Multimodal Uncertain Inference

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces UMUI task for fine-grained multimodal probabilistic inference and CLUE calibration method, where a 3B model matches larger baselines.

  35. Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation

    cs.CV 2026-04 unverdicted novelty 7.0

    Hypothesis Graph Refinement represents frontier predictions as revisable hypothesis nodes and applies verification-driven cascade correction to prune erroneous subgraphs, achieving 72.41% success and 56.22% SPL on GOAT-Bench.

  36. BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

    cs.CL 2026-04 unverdicted novelty 7.0

    BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

  37. Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing

    cs.CL 2026-03 conditional novelty 7.0

    Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.

  38. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  39. LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

    cs.LG 2026-05 conditional novelty 6.0

    A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.

  40. When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

    cs.CL 2026-05 conditional novelty 6.0

    Conflicting biomedical evidence triggers order-dependent prediction flips in RAG LLMs, and a new abstention score combining confidence with conflict detection raises selective accuracy by 7-33 points in the hardest co...

  41. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

    cs.AI 2026-05 unverdicted novelty 6.0

    LLMs show a knowing-doing gap in tool use: they often recognize when tools are needed via internal states but fail to translate that into actual tool calls, with mismatches of 26-54% on arithmetic and factual tasks.

  42. LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

    stat.ML 2026-05 unverdicted novelty 6.0

    Response entropy in LLMs rises with missing context on SQuAD while sampling-based confidence stays high, supporting the multiple imputation criterion and introducing a diagnostic for uncertainty reduction by context level.

  43. ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ORCE decouples answer generation from confidence estimation in LLMs and applies rank-based reinforcement learning on sampled completions to better align verbalized confidence with actual correctness likelihood.

  44. Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

    cs.AI 2026-05 unverdicted novelty 6.0

    Semantic Reward Collapse compresses different epistemic issues into unified rewards in preference optimization, risking loss of calibrated uncertainty, with Constitutional Reward Stratification proposed as a domain-st...

  45. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

    cs.AI 2026-05 unverdicted novelty 6.0

    CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.

  46. VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

    cs.LG 2026-05 unverdicted novelty 6.0

    VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models...

  47. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  48. Interpretability Can Be Actionable

    cs.LG 2026-05 conditional novelty 6.0

    Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

  49. Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

    cs.CV 2026-05 unverdicted novelty 6.0

    Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.

  50. RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement

    cs.LG 2026-05 unverdicted novelty 6.0

    RubricRefine raises average tool-use reliability to 0.86 on M3ToolEval across seven models by scoring candidate code against generated contract rubrics before execution, beating prior inference-time methods at 2.6X lo...

  51. Hidden Error Awareness in Chain-of-Thought Reasoning: The Signal Is Diagnostic, Not Causal

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs detect CoT reasoning errors in hidden states with 0.95 AUROC but cannot use this awareness to correct them via steering, patching, or self-correction, indicating the signal is diagnostic not causal.

  52. A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.

  53. Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    VIGIL separates world-state completion (W) from benchmark success (B) requiring correct terminal reports, showing up to 19.7 pp gaps between models with similar execution on 1000 episodes across 20 systems.

  54. An Interpretable and Scalable Framework for Evaluating Large Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

  55. Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport

    cs.LG 2026-05 unverdicted novelty 6.0

    Conditional optimal transport is used to turn raw PRM outputs into monotonic quantile functions that improve calibration and downstream Best-of-N performance on MATH-500 and AIME.

  56. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  57. A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

    cs.AI 2026-05 unverdicted novelty 6.0

    Hygieia is a new AI agent system that integrates phenotypes, genetics, and records to achieve superior rare disease diagnosis and gene prioritization with confidence scores.

  58. A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

    cs.AI 2026-05 unverdicted novelty 6.0

    Hygieia is a router-based multi-modal AI system that outperforms physicians in rare disease diagnosis benchmarks and assists with real-world medical records.

  59. LoopTrap: Termination Poisoning Attacks on LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

  60. Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation

    cs.CL 2026-05 unverdicted novelty 6.0

    DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.

Reference graph

Works this paper leans on

292 extracted references · 200 canonical work pages · cited by 126 Pith papers · 42 internal anchors

  1. [1]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Ganguli, Deep and Lovitt, Liane and Kernion, Jackson and Askell, Amanda and Bai, Yuntao and Kadavath, Saurav and Mann, Ben and Perez, Ethan and Schiefer, Nicholas and Ndousse, Kamal and Jones, Andy and Bowman, Sam and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Elhage, Nelson and El-Showk, Sheer and Fort, Stanislav and Dodds, Zac Ha...

  2. [2]

    Self-critiquing models for assisting human evaluators

    Saunders, William and Yeh, Catherine and Wu, Jeff and Bills, Steven and Ouyang, Long and Ward, Jonathan and Leike, Jan. 2022. doi:10.48550/ARXIV.2206.05802

  3. [3]

    Training Language Models with Language Feedback

  4. [4]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Ahn, Michael and Brohan, Anthony and Brown, Noah and Chebotar, Yevgen and Cortes, Omar and David, Byron and Finn, Chelsea and Gopalakrishnan, Keerthana and Hausman, Karol and Herzog, Alex and Ho, Daniel and Hsu, Jasmine and Ibarz, Julian and Ichter, Brian and Irpan, Alex and Jang, Eric and Ruano, Rosario Jauregui and Jeffrey, Kyle and Jesmonth, Sally and ...

  5. [5]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    2017.

  6. [6]

    Red Teaming Language Models with Language Models

    Perez, Ethan and Huang, Saffron and Song, Francis and Cai, Trevor and Ring, Roman and Aslanides, John and Glaese, Amelia and McAleese, Nat and Irving, Geoffrey. 2022. doi:10.48550/ARXIV.2202.03286

  7. [7]

    LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks

    Dinh, Tuan and Zeng, Yuchen and Zhang, Ruisu and Lin, Ziqian and Gira, Michael and Rajput, Shashank and Sohn, Jy-yong and Papailiopoulos, Dimitris and Lee, Kangwook. 2022. doi:10.48550/ARXIV.2206.06565

  8. [8]

    Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings

    Varshney, Neeraj and Mishra, Swaroop and Baral, Chitta. 2022. doi:10.48550/ARXIV.2203.00211

  9. [9]

    On Calibration of Modern Neural Networks

    Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q. 2017. doi:10.48550/ARXIV.1706.04599

  10. [10]

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles

    Lakshminarayanan, Balaji and Pritzel, Alexander and Blundell, Charles. 2016. doi:10.48550/ARXIV.1612.01474

  11. [11]

    Hybrid Models with Deep and Invertible Features

    Nalisnick, Eric and Matsukawa, Akihiro and Teh, Yee Whye and Gorur, Dilan and Lakshminarayanan, Balaji. 2019. doi:10.48550/ARXIV.1902.02767

  12. [12]

    Contrastive Training for Improved Out-of-Distribution Detection

    Winkens, Jim and Bunel, Rudy and Roy, Abhijit Guha and Stanforth, Robert and Natarajan, Vivek and Ledsam, Joseph R. and MacWilliams, Patricia and Kohli, Pushmeet and Karthikesalingam, Alan and Kohl, Simon and Cemgil, Taylan and Eslami, S. M. Ali and Ronneberger, Olaf...

  13. [13]

    Hybrid Models for Open Set Recognition

    Zhang, Hongjie and Li, Ang and Guo, Jie and Guo, Yanwen. 2020. doi:10.48550/ARXIV.2003.12506

  14. [14]

    Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness

    Liu, Jeremiah Zhe and Lin, Zi and Padhy, Shreyas and Tran, Dustin and Bedrax-Weiss, Tania and Lakshminarayanan, Balaji. 2020. doi:10.48550/ARXIV.2006.10108

  15. [15]

    Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

    Liang, Shiyu and Li, Yixuan and Srikant, R. 2017. doi:10.48550/ARXIV.1706.02690

  16. [16]

    Adversarial vulnerability of powerful near out-of-distribution detection

    2022.

  17. [17]

    A Simple and Effective Baseline for Out-of-Distribution Detection using Abstention

    2021.

  18. [18]

    Deep Anomaly Detection with Outlier Exposure

    Hendrycks, Dan and Mazeika, Mantas and Dietterich, Thomas. 2018. doi:10.48550/ARXIV.1812.04606

  19. [19]

    A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks

    Lee, Kimin and Lee, Kibok and Lee, Honglak and Shin, Jinwoo. 2018. doi:10.48550/ARXIV.1807.03888

  20. [20]

    A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks

    Hendrycks, Dan and Gimpel, Kevin. 2016. doi:10.48550/ARXIV.1610.02136

  21. [21]

    Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

    Nguyen, Anh and Yosinski, Jason and Clune, Jeff. 2014. doi:10.48550/ARXIV.1412.1897

  22. [22]

    OpenAI

    Fine-. OpenAI. 2019.

  23. [23]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    ArXiv.

  24. [24]

    Exploring the Limits of Out-of-Distribution Detection

    2021.

  25. [25]

    2021

    Johannes Welbl and Amelia Glaese and Jonathan Uesato and Sumanth Dathathri and John Mellor and Lisa Anne Hendricks and Kirsty Anderson and Pushmeet Kohli and Ben Coppin and Po-Sen Huang. 2021.

  26. [26]

    A General Language Assistant as a Laboratory for Alignment

    2021.

  27. [27]

    Effect of scale on catastrophic forgetting in neural networks

    International Conference on Learning Representations.

  28. [28]

    Training language models to follow instructions with human feedback

    arXiv preprint arXiv:2203.02155.

  29. [29]

    LaMDA: Language Models for Dialog Applications

    Romal Thoppilan and Daniel De Freitas and Jamie Hall and Noam Shazeer and Apoorv Kulshreshtha and Heng. 2022. arXiv:2201.08239

  30. [30]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud and Arthur Mensch and Jordan Hoffmann and Trevor Cai and Eliza Rutherford and Katie Millican and George van den Driessche and Jean. 2021. arXiv:2112.04426

  31. [31]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano and Jacob Hilton and Suchir Balaji and Jeff Wu and Long Ouyang and Christina Kim and Christopher Hesse and Shantanu Jain and Vineet Kosaraju and William Saunders and Xu Jiang and Karl Cobbe and Tyna Eloundou and Gretchen Krueger and Kevin Button and Matthew Knight and Benjamin Chess and John Schulman. CoRR. 2021...

  32. [32]

    Scaling Scaling Laws with Board Games

    2021.

  33. [33]

    When Combating Hype, Proceed with Caution

    2021.

  34. [34]

    Generating Long Sequences with Sparse Transformers

    2019.

  35. [35]

    Evaluating Large Language Models Trained on Code

    2021.

  36. [36]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    2021.

  37. [37]

    Mitigating harm in language models with conditional-likelihood filtration

    2021.

  38. [38]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    2020.

  39. [39]

    Concrete Problems in AI Safety

    2016.

  40. [40]

    Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics

    2020.

  41. [41]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    2020.

  42. [42]

    Unsolved Problems in ML Safety

    2021.

  43. [43]

    Training Verifiers to Solve Math Word Problems

    2021.

  44. [44]

    Aligning AI With Shared Human Values

    2021.

  45. [45]

    arXiv:2105.14111

    Jack Koch and Lauro Langosco and Jacob Pfau and James Le and Lee Sharkey. CoRR. 2021.

  46. [46]

    The effects of reward misspecification: Mapping and mitigating misaligned models

    Alexander Pan and Kush Bhatia and Jacob Steinhardt. CoRR. 2022. arXiv:2201.03544

  47. [47]

    Training Compute-Optimal Large Language Models

    Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and Buchatskaya, Elena and Cai, Trevor and Rutherford, Eliza and Casas, Diego de Las and Hendricks, Lisa Anne and Welbl, Johannes and Clark, Aidan and Hennigan, Tom and Noland, Eric and Millican, Katie and Driessche, George van den and Damoc, Bogdan and Guy, Aurelia and Osindero, Simon and Simony...

  48. [48]

    Decision Transformer: Reinforcement Learning via Sequence Modeling

    2021.

  49. [49]

    Patrick S. H. Lewis and Ethan Perez and Aleksandra Piktus and Fabio Petroni and Vladimir Karpukhin and Naman Goyal and Heinrich K. Retrieval-Augmented Generation for Knowledge-Intensive. CoRR. 2020. arXiv:2005.11401

  50. [50]

    doi:10.48550/arXiv.2002.08909

    Kelvin Guu and Kenton Lee and Zora Tung and Panupong Pasupat and Ming. CoRR. 2020. arXiv:2002.08909

  51. [51]

    Ethical Challenges in Data-Driven Dialogue Systems

    Peter Henderson and Koustuv Sinha and Nicolas Angelard. 2017. arXiv:1711.09050

  52. [52]

    Dexperts: Decoding-time controlled text generation with experts and anti-experts

    Alisa Liu and Maarten Sap and Ximing Lu and Swabha Swayamdipta and Chandra Bhagavatula and Noah A. Smith and Yejin Choi. CoRR. 2021. arXiv:2105.03023

  53. [54]

    Delphi: Towards Machine Ethics and Norms

    2021.

  54. [55]

    Supervising strong learners by amplifying weak experts

    2018.

  55. [56]

    AI safety via debate

    2018.

  56. [57]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    2021.

  57. [58]

    Finetuned Language Models Are Zero-Shot Learners

    2021.

  58. [59]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    2021.

  59. [60]

    Learning to summarize from human feedback

    2020.

  60. [61]

    Generative Adversarial Imitation Learning

    2016.

  61. [62]

    Language GANs Falling Short

    2020.

  62. [63]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    2019.

  63. [64]

    Ethical and social risks of harm from Language Models

    Laura Weidinger and John Mellor and Maribeth Rauh and Conor Griffin and Jonathan Uesato and Po. 2021. arXiv:2112.04359

  64. [65]

    ISBN 978-1-4503-8309-7

    Bender, Emily M. and Gebru, Timnit and McMillan-Major, Angelina and Shmitchell, Shmargaret. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021. doi:10.1145/3442188.3445922

  65. [66]

    Scaling laws for transfer

    Danny Hernandez and Jared Kaplan and Tom Henighan and Sam McCandlish. CoRR. 2021. arXiv:2102.01293

  66. [67]

    Tillet, Philippe and Kung, H. T. and Cox, David. Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 2019.

  67. [68]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Advances in Neural Information Processing Systems 32. 2019.

  68. [69]

    Teaching Language Models to Support Answers with Verified Quotes

    Menick, Jacob and Trebacz, Maja and Mikulik, Vladimir and Aslanides, John and Song, Francis and Chadwick, Martin and Glaese, Mia and Young, Susannah and Campbell-Gillingham, Lucy and Irving, Geoffrey and McAleese, Nat. 2022. doi:10.48550/ARXIV.2203.1...

  69. [70]

    arXiv preprint arXiv:2202.07785

    Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and DasSarma, Nova and Henighan, Tom and Jones, Andy and Joseph, Nicholas and Kernion, Jackson and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Drain, Dawn and Elhage, Nelson and Showk, Sheer El and Fort, Stanislav and Hatfield-Dodds, Zac and Johnston, Scott and Krave...

  70. [71]

    Recipes for safety in open-domain chatbots

    arXiv preprint arXiv:2010.07079.

  71. [72]

    Phasic policy gradient

    International Conference on Machine Learning. 2021.

  72. [74]

    BBQ: A hand-built bias benchmark for question answering

    Alicia Parrish and Angelica Chen and Nikita Nangia and Vishakh Padmakumar and Jason Phang and Jana Thompson and Phu Mon Htut and Samuel R. Bowman. CoRR. 2021. arXiv:2110.08193

  73. [75]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power and Yuri Burda and Harrison Edwards and Igor Babuschkin and Vedant Misra. CoRR. 2022. arXiv:2201.02177

  74. [76]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W. Rae and Sebastian Borgeaud and Trevor Cai and Katie Millican and Jordan Hoffmann and H. Francis Song and John Aslanides and Sarah Henderson and Roman Ring and Susannah Young and Eliza Rutherford and Tom Hennigan and Jacob Menick and Albin Cassirer and Richard Powell and George van den Driessche and Lisa Anne Hendricks and Maribeth Rauh and Po. Sca...

  75. [77]

    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Bhojanapalli, Srinadh and Chakrabarti, Ayan and Glasner, Daniel and Li, Daliang and Unterthiner, Thomas and Veit, Andreas. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.

  76. [78]

    Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

    EMNLP.

  77. [79]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    ArXiv.

  78. [80]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    2017.

  79. [81]

    Imitating Interactive Intelligence

    2021.

  80. [82]

    Artificial Intelligence, Values, and Alignment

    Gabriel, Iason. Minds and Machines. doi:10.1007/s11023-020-09539-2

Showing first 80 references.