Inference-time intervention: Eliciting truthful answers from a language model

Li, Kenneth et al. · 2023 · arXiv 2511.02593

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

Open-weight LLMs show no output bias on matched mortgage applications differing only by racially-associated names, yet retain and amplify demographic representations that steering interventions can causally activate to produce near-complete asymmetric decision reversals.

citing papers explorer

Showing 1 of 1 citing paper.

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions · cs.AI · 2026-05-12 · unverdicted · none · ref 6