Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
citing papers explorer
-
Do Linear Probes Generalize Better in Persona Coordinates?
Probes on persona principal components from contrastive prompts generalize better than raw activation probes for harmful behaviors across 10 datasets.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.