Truthful AI: Developing and governing AI that does not lie
4 Pith papers cite this work.
citing papers explorer
- Discovering Latent Knowledge in Language Models Without Supervision
  An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.
- Teaching Models to Express Their Uncertainty in Words
  GPT-3 can learn to express well-calibrated uncertainty about its answers using natural language phrases rather than logits.
- Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
  Mini-Mafia supplies an analytical model logit(p) = v*(m-d) for mafia win probability in LLM role interactions and uses Bayesian inference to estimate per-model parameters that predict tournament results with 76.6% Brier-score improvement over random.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.