Truthful AI: Developing and governing AI that does not lie

4 Pith papers cite this work. Polarity classification is still indexing.

fields: cs.CL (3), cs.AI (1)

years: 2025 (1), 2022 (3)

representative citing papers

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

cs.AI · 2025-09-27 · unverdicted · novelty 7.0

Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

citing papers explorer

  • Discovering Latent Knowledge in Language Models Without Supervision cs.CL · 2022-12-07 · conditional · none · ref 9

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

  • Teaching Models to Express Their Uncertainty in Words cs.CL · 2022-05-28 · unverdicted · none · ref 7

    GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.

  • Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia cs.AI · 2025-09-27 · unverdicted · none · ref 7

    Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 196

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
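
The Mini-Mafia entry above quotes the relation logit(p) = v*(m-d) and a Brier-score improvement over random. As a minimal sketch of what those two quantities involve, the Python below inverts the logit to recover a win probability and computes a relative Brier-score improvement. The function names are hypothetical, the parameters v, m, and d are treated as plain numbers because the listing does not define them, and the "random" baseline is assumed here to be a constant 0.5 forecast; none of this is taken from the paper's own code.

```python
import math


def win_probability(v: float, m: float, d: float) -> float:
    """Invert logit(p) = v * (m - d) to get p via the logistic function."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_improvement(probs, outcomes, baseline=0.5):
    """Relative Brier improvement over a constant forecast.

    Assumption: "random" means always predicting `baseline` (0.5 here).
    """
    reference = brier_score([baseline] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference


# Illustrative values only, not figures from the paper.
p = win_probability(v=1.0, m=0.5, d=0.5)  # m == d gives p = 0.5
```

With m equal to d the logit is zero, so the sketch returns an even win probability; a perfect forecaster scores an improvement of 1.0 and a constant 0.5 forecaster scores 0.0 under this assumed baseline.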