Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
Pith reviewed 2026-05-19 13:34 UTC · model grok-4.3
The pith
DExperts nearly eliminates explicit toxicity in language models but proves less effective against implicit hate speech.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This replication study evaluates DExperts on GPT-2 models using RealToxicityPrompts for baseline explicit toxicity and the ToxiGen dataset for implicit hate speech. DExperts achieves 100% safety rates on explicit toxicity benchmarks by adjusting token probabilities during decoding. However, safety rates drop to 98.5% when tested against adversarial implicit hate speech, while generation latency increases from 0.2 seconds to 2.0 seconds per output. These findings point to a gap in robustness between handling straightforward toxic prompts and more subtle, adversarial ones.
What carries the argument
Decoding-time Experts (DExperts), a technique that uses separate expert and anti-expert models to modify the probability distribution of the next token at each step of generation to reduce toxicity without retraining the base model.
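The steering step described above can be sketched in a few lines. In the original DExperts formulation, the base model's next-token logits are shifted by a scaled difference between the expert's and the anti-expert's logits before the softmax is applied. The three-word vocabulary, the logit values, and the names `dexperts_logits` and `alpha` are illustrative assumptions, not details taken from this replication.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dexperts_logits(base, expert, anti_expert, alpha=2.0):
    """Shift base logits toward the expert and away from the anti-expert.

    alpha is a steering strength; alpha = 0 leaves the base model unchanged.
    """
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]

# Hypothetical next-token logits over a toy vocabulary (made-up values).
vocab = ["kind", "rude", "neutral"]
base = [1.0, 1.2, 0.8]
expert = [1.5, 0.2, 1.0]        # non-toxic expert prefers "kind"
anti_expert = [0.3, 2.0, 0.9]   # toxic anti-expert prefers "rude"

p_base = softmax(base)
p_steered = softmax(dexperts_logits(base, expert, anti_expert, alpha=2.0))

for tok, pb, ps in zip(vocab, p_base, p_steered):
    print(f"{tok:8s} base={pb:.3f} steered={ps:.3f}")
```

In this toy case the probability of the token favored by the anti-expert falls sharply after steering. Every decoding step needs logits from three models rather than one, so some added latency is to be expected; the sketch does not model that cost.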
If this is right
- DExperts can be applied to achieve near-complete mitigation of explicit toxic outputs from LLMs.
- It remains vulnerable to adversarial examples designed to elicit implicit hate speech.
- The technique incurs a significant computational overhead that may limit its use in latency-sensitive applications.
- Comprehensive safety assessments require testing against both explicit and implicit forms of toxicity.
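The latency overhead noted in the third point can be measured with a small timing harness. The sketch below is a minimal example under stated assumptions: `fake_base_generate` and `fake_steered_generate` are hypothetical stand-ins that only sleep, and the 0.2 s and 2.0 s figures are the only values taken from the study.

```python
import time
from statistics import mean

def mean_latency(generate, prompts, repeats=3):
    """Average wall-clock seconds per call of generate(prompt)."""
    timings = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)
            timings.append(time.perf_counter() - start)
    return mean(timings)

def slowdown(baseline_seconds, steered_seconds):
    """Ratio of steered latency to baseline latency."""
    return steered_seconds / baseline_seconds

# Hypothetical stand-ins for real generation calls (they only sleep).
def fake_base_generate(prompt):
    time.sleep(0.001)
    return prompt + " ..."

def fake_steered_generate(prompt):
    time.sleep(0.005)
    return prompt + " ..."

prompts = ["example prompt one", "example prompt two"]
base_s = mean_latency(fake_base_generate, prompts)
steered_s = mean_latency(fake_steered_generate, prompts)
print(f"measured slowdown on stand-ins: {slowdown(base_s, steered_s):.1f}x")

# Figures reported in the study: 0.2 s -> 2.0 s per generation.
reported = slowdown(0.2, 2.0)
print(f"reported slowdown: {reported:.0f}x")
```

With real models, the same harness would also need fixed generation lengths, warm-up runs, and identical hardware for both systems before the ratio could be compared with the reported figure.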
Where Pith is reading between the lines
- Combining DExperts with other safety methods might close the gap on implicit toxicity.
- The observed latency penalty suggests exploring lighter-weight alternatives for production systems.
- New benchmarks focused on diverse implicit hate patterns could better evaluate true robustness.
- These results imply that claims of safety based only on explicit tests may overstate real-world performance.
Load-bearing premise
The ToxiGen dataset serves as a general test for implicit hate speech that reveals true brittleness rather than being specially crafted to expose weaknesses in DExperts.
What would settle it
Evaluating DExperts on an independent set of implicit hate speech prompts not derived from or similar to ToxiGen and observing safety rates remaining at or near 100% would indicate that the brittleness finding may not hold broadly.
Original abstract
Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to toxic degeneration, where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments, necessitating effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a 10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate-speech patterns without incurring prohibitive computational costs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is a replication study evaluating DExperts, an inference-time toxicity mitigation technique for LLMs. It proceeds in three phases: (1) baseline toxicity measurement on RealToxicityPrompts using standard GPT-2, (2) application of DExperts to mitigate explicit toxicity, and (3) stress-testing against implicit hate speech on the adversarial ToxiGen dataset. The central empirical claims are that DExperts reaches 100% safety on explicit benchmarks but drops to 98.5% on ToxiGen, while incurring a 10x latency penalty (0.2s to 2.0s per generation). The work concludes by highlighting a robustness gap between explicit and implicit toxicity mitigation.
Significance. If the reported safety rates and latency figures are supported by proper controls, the study adds to AI safety literature by providing a systematic replication that isolates the limitations of decoding-time expert steering on adversarial implicit content. The use of established benchmarks (RealToxicityPrompts and ToxiGen) and the explicit quantification of the utility-safety-compute trade-off are constructive contributions that could inform deployment decisions.
Major comments (1)
- [Stress-testing phase] The brittleness claim rests on the safety rate dropping from 100% (explicit benchmarks) to 98.5% on ToxiGen. No baseline safety rate for the unmodified GPT-2 model on the same ToxiGen prompts is reported or compared. Without this control measurement, it remains unclear whether the 98.5% figure indicates a genuine robustness gap or whether the ToxiGen prompts simply do not elicit toxicity from the base model.
Minor comments (2)
- [Abstract and results] The abstract and results sections would be strengthened by including the precise definition of 'safety rate' (e.g., whether it is the fraction of generations below a toxicity threshold and which classifier is used).
- A summary table comparing safety rates, latency, and any utility metrics across the three phases and both models would improve readability and allow direct visual assessment of the claimed trade-off.
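One possible reading of "safety rate", following the referee's own example, is the fraction of generations whose toxicity score falls below a threshold. The sketch below assumes that reading. The 0.5 threshold, the scores, and the function name are hypothetical; the classifier and cutoff actually used in the study are not stated in the material above.

```python
def safety_rate(toxicity_scores, threshold=0.5):
    """Fraction of generations scored strictly below the toxicity threshold."""
    if not toxicity_scores:
        raise ValueError("need at least one scored generation")
    safe = sum(1 for score in toxicity_scores if score < threshold)
    return safe / len(toxicity_scores)

# Made-up classifier scores: 197 generations well below the threshold, 3 above.
scores = [0.05] * 197 + [0.80] * 3
rate = safety_rate(scores)
print(f"safety rate: {rate:.1%}")
```

The example data are chosen only so that the arithmetic is easy to follow (197 of 200 safe). Note that the result depends on both the threshold and the classifier, which is why the referee asks for them to be stated.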
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review of our replication study on DExperts. The feedback highlights an important methodological point regarding the stress-testing phase, which we address below. We will incorporate the suggested control in the revised manuscript to strengthen the empirical claims.
Point-by-point responses
- Referee: [Stress-testing phase] The brittleness claim rests on the safety rate dropping from 100% (explicit benchmarks) to 98.5% on ToxiGen. No baseline safety rate for the unmodified GPT-2 model on the same ToxiGen prompts is reported or compared. Without this control measurement, it remains unclear whether the 98.5% figure indicates a genuine robustness gap or whether the ToxiGen prompts simply do not elicit toxicity from the base model.
  Authors: We agree that reporting the baseline toxicity rate of the unmodified GPT-2 model on the ToxiGen prompts is essential for rigorously interpreting the 98.5% safety figure achieved by DExperts. This control measurement would clarify whether the observed drop reflects a limitation in the method's robustness to implicit hate speech or simply the adversarial nature of the ToxiGen dataset itself. In the revised manuscript, we will add this baseline evaluation using the identical ToxiGen prompt set and report the corresponding safety rate for the base model. This addition will provide a direct comparison and better support our conclusion about the robustness gap between explicit and implicit toxicity mitigation.
  Revision: yes
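The control promised in this response amounts to scoring both systems on the identical prompt set and comparing the outcomes prompt by prompt. The following is a minimal sketch, assuming per-prompt toxicity scores are already available. All scores, the 0.5 threshold, and the names below are hypothetical placeholders, not results from the paper.

```python
def is_safe(score, threshold=0.5):
    """A generation counts as safe if its toxicity score is below the threshold."""
    return score < threshold

def paired_comparison(base_scores, steered_scores, threshold=0.5):
    """Compare base and steered outputs generated from the same prompts."""
    if len(base_scores) != len(steered_scores):
        raise ValueError("both systems must be scored on the same prompts")
    counts = {"both_safe": 0, "fixed": 0, "broken": 0, "both_toxic": 0}
    for b, s in zip(base_scores, steered_scores):
        base_ok, steered_ok = is_safe(b, threshold), is_safe(s, threshold)
        if base_ok and steered_ok:
            counts["both_safe"] += 1
        elif steered_ok:
            counts["fixed"] += 1       # toxic without steering, safe with it
        elif base_ok:
            counts["broken"] += 1      # safe without steering, toxic with it
        else:
            counts["both_toxic"] += 1  # steering did not help
    n = len(base_scores)
    base_rate = (counts["both_safe"] + counts["broken"]) / n
    steered_rate = (counts["both_safe"] + counts["fixed"]) / n
    return counts, base_rate, steered_rate

# Made-up scores for five prompts, one pair of generations per prompt.
base_scores = [0.1, 0.7, 0.9, 0.2, 0.6]
steered_scores = [0.1, 0.2, 0.6, 0.1, 0.3]
counts, base_rate, steered_rate = paired_comparison(base_scores, steered_scores)
print(counts)
print(f"base safety rate: {base_rate:.0%}, steered safety rate: {steered_rate:.0%}")
```

If the base model's rate on ToxiGen prompts turned out to be close to 98.5% already, the steered figure would say little about DExperts. A large "fixed" count alongside a non-trivial "both_toxic" count would be more consistent with the robustness-gap reading.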
Circularity Check
No circularity: all results are direct empirical measurements on external benchmarks
The paper is a replication study structured around three phases of direct measurement: baseline toxicity on RealToxicityPrompts with GPT-2, evaluation of DExperts on explicit toxicity, and stress-testing on the independent ToxiGen adversarial dataset. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the reported results. The 100% and 98.5% safety rates, latency figures, and robustness-gap conclusion are presented as outcomes of these measurements against fixed external benchmarks, with no reduction to self-definitional inputs or prior author work by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.