Truthful AI: Developing and governing AI that does not lie

Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders · 2021 · arXiv 2110.06674

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Teaching Models to Express Their Uncertainty in Words

cs.CL · 2022-05-28 · unverdicted · novelty 8.0

GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · novelty 7.0

Mini-Mafia supplies an analytical model, logit(p) = v*(m-d), for the mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with a 76.6% Brier-score improvement over random.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

Showing 4 of 4 citing papers.

Discovering Latent Knowledge in Language Models Without Supervision · cs.CL · 2022-12-07 · conditional · none · ref 9
Teaching Models to Express Their Uncertainty in Words · cs.CL · 2022-05-28 · unverdicted · none · ref 7
Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia · cs.AI · 2025-09-27 · unverdicted · none · ref 7
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 196
