arXiv:1501.01332 [stat]
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification
CausalDetox identifies a minimal set of attention heads causally linked to toxicity using the Probability of Necessity and Sufficiency (PNS), then applies targeted inference-time steering or fine-tuning to reduce toxic generation while preserving fluency and speeding up head selection.