Truthful AI: Developing and governing AI that does not lie

Truthful AI: Developing and governing AI that does not lie · 2021 · arXiv 2110.06674

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Teaching Models to Express Their Uncertainty in Words

cs.CL · 2022-05-28 · unverdicted · novelty 8.0

GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.

DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

The DECK taxonomy partitions LLM hallucinations into four detectability regimes using consistency and confidence axes, mapping each to scorer families and identifying a universal blind spot for output-level uncertainty quantification on knowledge-gap inputs.

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · novelty 7.0

Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
