Title resolution pending

Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten Sap · 2024 · arXiv 2405.09373

3 Pith papers cite this work. Polarity classification is still indexing.


Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

cs.CL · 2026-05-27 · unverdicted · novelty 5.0

Toxicity in language models is disproportionately encoded in early MLP layers and can be localized via activation differentials then suppressed at inference time without gradient descent.

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

cs.LG · 2026-05-28 · unverdicted · novelty 4.0

Opir introduces efficient multi-task encoder models trained on a 996-category safety taxonomy that match or exceed larger baselines on most safety benchmarks while using under 100M parameters for edge variants.

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

cs.CL · 2026-05-13 · conditional · novelty 2.0 · 2 refs

DExperts reaches 100% safety on explicit toxicity benchmarks but only 98.5% on implicit hate speech from ToxiGen while imposing a 10x latency increase on GPT-2.

citing papers explorer


Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
cs.CL · 2026-05-27 · unverdicted · none · ref 17
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
cs.LG · 2026-05-28 · unverdicted · none · ref 20
Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
cs.CL · 2026-05-13 · conditional · none · ref 22 · 2 links
