Toxicity in language models is disproportionately encoded in early MLP layers; it can be localized via activation differentials and then suppressed at inference time without gradient descent.
3 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (3)
Representative citing papers
Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.
DExperts reaches 100% safety on explicit toxicity benchmarks but only 98.5% on implicit hate speech from ToxiGen while imposing a 10x latency increase on GPT-2.
citing papers explorer
- Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
  Toxicity in language models is disproportionately encoded in early MLP layers; it can be localized via activation differentials and then suppressed at inference time without gradient descent.
- Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
  Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.
- Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
  DExperts reaches 100% safety on explicit toxicity benchmarks but only 98.5% on implicit hate speech from ToxiGen while imposing a 10x latency increase on GPT-2.