From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models

Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza Ermis · 2024 · arXiv 2403.03893

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: support 1

representative citing papers

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

cs.AI · 2026-06-05 · unverdicted · novelty 5.0

LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

cs.CL · 2026-05-27 · unverdicted · novelty 5.0

Toxicity in language models is disproportionately encoded in early MLP layers and can be localized via activation differentials then suppressed at inference time without gradient descent.

LLM Harms: A Taxonomy and Discussion

cs.CY · 2025-12-05

citing papers explorer

LLM Harms: A Taxonomy and Discussion · cs.CY · 2025-12-05 · unreviewed · ref 85
