arxiv: 2605.14087 · v2 · pith:HWCARECCnew · submitted 2026-05-13 · 💻 cs.CL · cs.LG

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Pith reviewed 2026-05-19 13:34 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords: toxicity, DExperts, LLMs, replication, implicit hate speech, safety, mitigation, latency

The pith

DExperts nearly eliminates explicit toxicity in language models but proves less effective against implicit hate speech.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates the evaluation of DExperts, an inference-time method for steering language model outputs away from toxic content. It measures performance first on standard explicit toxicity prompts with GPT-2 models and then on adversarial implicit hate speech datasets. The results show strong performance on explicit cases but a noticeable drop in safety for implicit ones, along with a substantial increase in processing time. A sympathetic reader would care because this highlights practical limitations in deploying safe AI systems, especially when facing cleverly worded harmful inputs that current fixes may miss.

Core claim

This replication study evaluates DExperts on GPT-2 models using RealToxicityPrompts for baseline explicit toxicity and the ToxiGen dataset for implicit hate speech. DExperts achieves 100% safety rates on explicit toxicity benchmarks by adjusting token probabilities during decoding. However, safety rates drop to 98.5% when tested against adversarial implicit hate speech, while generation latency increases from 0.2 seconds to 2.0 seconds per output. These findings point to a gap in robustness between handling straightforward toxic prompts and more subtle, adversarial ones.
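
The two headline quantities in this claim can be reproduced with simple arithmetic. The sketch below assumes that "safety rate" means the fraction of generations whose toxicity score falls below a cutoff; the page does not state the paper's definition, so the 0.5 threshold, the function names, and the sample scores are all hypothetical.

```python
def safety_rate(toxicity_scores, threshold=0.5):
    """Fraction of generations scored below the toxicity threshold.

    The 0.5 cutoff is an assumption, not a value taken from the paper.
    """
    if not toxicity_scores:
        raise ValueError("need at least one score")
    safe = sum(1 for score in toxicity_scores if score < threshold)
    return safe / len(toxicity_scores)


def latency_penalty(baseline_seconds, mitigated_seconds):
    """Multiplicative slowdown of mitigated generation over the baseline."""
    return mitigated_seconds / baseline_seconds


# Hypothetical scores: 197 clearly safe generations and 3 flagged ones.
scores = [0.05] * 197 + [0.80] * 3
rate = safety_rate(scores)

# Latency figures quoted in the core claim: 0.2 s baseline, 2.0 s with DExperts.
penalty = latency_penalty(0.2, 2.0)

print(f"safety rate: {rate:.1%}, latency penalty: {penalty:.0f}x")
```

Read this way, a 98.5% safety rate corresponds to 15 unsafe generations per 1,000 prompts, which is why the absolute size of the evaluation set matters when judging the gap.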

What carries the argument

Decoding-time Experts (DExperts) is a technique that uses separate expert and anti-expert models to modify the next-token probability distribution at each generation step, reducing toxicity without retraining the base model.
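
A minimal sketch of that adjustment, following the logit-combination rule published in the original DExperts paper (Liu et al., 2021): the base logits are shifted by a scaled difference between expert and anti-expert logits, then renormalized. The three-token vocabulary, the logit values, and the choice of alpha below are illustrative assumptions, not settings from this replication.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dexperts_probs(base_logits, expert_logits, anti_logits, alpha=2.0):
    """Next-token distribution from softmax(z + alpha * (z_expert - z_anti)).

    alpha controls steering strength; alpha = 0 recovers the base model.
    """
    steered = [
        b + alpha * (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_logits)
    ]
    return softmax(steered)


# Toy three-token vocabulary (hypothetical): index 2 stands in for a toxic token.
vocab = ["kind", "neutral", "toxic"]
base = [1.0, 1.0, 1.0]      # base model is indifferent
expert = [2.0, 1.0, -1.0]   # non-toxic expert disfavors the toxic token
anti = [-1.0, 1.0, 2.0]     # anti-expert favors the toxic token

plain = softmax(base)
steered = dexperts_probs(base, expert, anti, alpha=2.0)

for token, before, after in zip(vocab, plain, steered):
    print(f"{token:8s} base={before:.3f} steered={after:.3f}")
```

Because the rule needs logits from three models at every step, each generated token costs roughly three forward passes in a naive implementation, which is consistent with the method carrying a latency cost.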

If this is right

  • DExperts can be applied to achieve near-complete mitigation of explicit toxic outputs from LLMs.
  • It demonstrates vulnerability to adversarial examples designed to elicit implicit hate speech.
  • The technique incurs a significant computational overhead that may limit its use in latency-sensitive applications.
  • Comprehensive safety assessments require testing against both explicit and implicit forms of toxicity.
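
The overhead point can be checked with a small timing harness. The sketch below is an assumption about how such a measurement might be set up, not the authors' procedure: `base_generate` and `steered_generate` are hypothetical stand-ins that sleep instead of running a model, and the injectable `clock` parameter exists only to keep the harness testable.

```python
import time


def mean_latency(generate, prompts, clock=time.perf_counter):
    """Average wall-clock seconds per call of generate over the prompts."""
    if not prompts:
        raise ValueError("need at least one prompt")
    start = clock()
    for prompt in prompts:
        generate(prompt)
    return (clock() - start) / len(prompts)


def base_generate(prompt):
    """Hypothetical stand-in for unmitigated generation."""
    time.sleep(0.002)
    return prompt + " [base continuation]"


def steered_generate(prompt):
    """Hypothetical stand-in for generation with decoding-time steering."""
    time.sleep(0.05)
    return prompt + " [steered continuation]"


prompts = ["hypothetical prompt A", "hypothetical prompt B", "hypothetical prompt C"]
base_latency = mean_latency(base_generate, prompts)
steered_latency = mean_latency(steered_generate, prompts)
slowdown = steered_latency / base_latency

print(f"base {base_latency:.4f}s, steered {steered_latency:.4f}s, slowdown {slowdown:.1f}x")
```

In a real comparison both paths would need the same hardware, batch size, and output length, since any of these can dominate the measured ratio.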

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining DExperts with other safety methods might close the gap on implicit toxicity.
  • The observed latency penalty suggests exploring lighter-weight alternatives for production systems.
  • New benchmarks focused on diverse implicit hate patterns could better evaluate true robustness.
  • These results imply that claims of safety based only on explicit tests may overstate real-world performance.

Load-bearing premise

The ToxiGen dataset serves as a general test for implicit hate speech that reveals true brittleness rather than being specially crafted to expose weaknesses in DExperts.

What would settle it

Evaluating DExperts on an independent set of implicit hate speech prompts not derived from or similar to ToxiGen and observing safety rates remaining at or near 100% would indicate that the brittleness finding may not hold broadly.
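
Deciding whether an observed rate is "at or near 100%" depends on how many prompts were evaluated. One common way to quantify that uncertainty is a Wilson score interval for a binomial proportion; the sketch below uses hypothetical counts (1,000 prompts per condition), because the page does not report the paper's sample sizes.

```python
import math


def wilson_interval(safe, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0 or not 0 <= safe <= total:
        raise ValueError("require 0 <= safe <= total and total > 0")
    p = safe / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical counts, chosen only to mirror the reported percentages.
perfect = wilson_interval(1000, 1000)   # 100% observed
observed = wilson_interval(985, 1000)   # 98.5% observed

print(f"100.0% observed: [{perfect[0]:.4f}, {perfect[1]:.4f}]")
print(f" 98.5% observed: [{observed[0]:.4f}, {observed[1]:.4f}]")
```

Even a perfect observed score leaves a lower bound below 100%, and smaller evaluation sets widen both intervals, so a settling experiment would need to report its prompt count alongside the rate.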

Figures

Figures reproduced from arXiv: 2605.14087 by Akshaj Satishkumar, Archit Rathod, Mokshit Surana.

Figure 1: Baseline Toxicity Distribution from Phase 1. The …
Figure 2: Baseline Safety Success Rate showing that 95.8% …
Figure 3: Shift in Toxicity Distribution with DExperts Mit…
Figure 4: CDF comparison between baseline and DExperts, …
Figure 5: Trade-offs in DExperts mitigation: (a) Perfect safety achievement on RealToxicityPrompts with 100% safe generations, …
Figure 7: The Robustness Gap: Violin plot comparison be…
Figure 10: CDF curves showing the robustness gap. Phase 2 …
Figure 11: Computational overhead comparison showing his…

Original abstract

Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to toxic degeneration where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments, necessitating effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a 10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate-speech patterns without incurring prohibitive computational costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper is a replication study evaluating DExperts, an inference-time toxicity mitigation technique for LLMs. It proceeds in three phases: (1) baseline toxicity measurement on RealToxicityPrompts using standard GPT-2, (2) application of DExperts to mitigate explicit toxicity, and (3) stress-testing against implicit hate speech on the adversarial ToxiGen dataset. The central empirical claims are that DExperts reaches 100% safety on explicit benchmarks but drops to 98.5% on ToxiGen, while incurring a 10x latency penalty (0.2s to 2.0s per generation). The work concludes by highlighting a robustness gap between explicit and implicit toxicity mitigation.

Significance. If the reported safety rates and latency figures are supported by proper controls, the study adds to AI safety literature by providing a systematic replication that isolates the limitations of decoding-time expert steering on adversarial implicit content. The use of established benchmarks (RealToxicityPrompts and ToxiGen) and the explicit quantification of the utility-safety-compute trade-off are constructive contributions that could inform deployment decisions.

major comments (1)
  1. [Stress-testing phase] Stress-testing phase: The brittleness claim rests on the safety rate dropping from 100% (explicit benchmarks) to 98.5% on ToxiGen. No baseline safety rate for the unmodified GPT-2 model on the same ToxiGen prompts is reported or compared. Without this control measurement, it remains unclear whether the 98.5% figure indicates a genuine robustness gap or whether the ToxiGen prompts simply do not elicit toxicity from the base model.
minor comments (2)
  1. [Abstract and results] The abstract and results sections would be strengthened by including the precise definition of 'safety rate' (e.g., whether it is the fraction of generations below a toxicity threshold and which classifier is used).
  2. A summary table comparing safety rates, latency, and any utility metrics across the three phases and both models would improve readability and allow direct visual assessment of the claimed trade-off.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive review of our replication study on DExperts. The feedback highlights an important methodological point regarding the stress-testing phase, which we address below. We will incorporate the suggested control in the revised manuscript to strengthen the empirical claims.

Point-by-point responses
  1. Referee: [Stress-testing phase] Stress-testing phase: The brittleness claim rests on the safety rate dropping from 100% (explicit benchmarks) to 98.5% on ToxiGen. No baseline safety rate for the unmodified GPT-2 model on the same ToxiGen prompts is reported or compared. Without this control measurement, it remains unclear whether the 98.5% figure indicates a genuine robustness gap or whether the ToxiGen prompts simply do not elicit toxicity from the base model.

    Authors: We agree that reporting the baseline toxicity rate of the unmodified GPT-2 model on the ToxiGen prompts is essential for rigorously interpreting the 98.5% safety figure achieved by DExperts. This control measurement would clarify whether the observed drop reflects a limitation in the method's robustness to implicit hate speech or simply the adversarial nature of the ToxiGen dataset itself. In the revised manuscript, we will add this baseline evaluation using the identical ToxiGen prompt set and report the corresponding safety rate for the base model. This addition will provide a direct comparison and better support our conclusion about the robustness gap between explicit and implicit toxicity mitigation. revision: yes
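
The control discussed in this exchange amounts to scoring base-model and mitigated outputs on the identical prompt set and comparing unsafe fractions. The sketch below illustrates that comparison under stated assumptions: the 0.5 toxicity threshold, the function names, and every score are hypothetical and are not results from the paper.

```python
def unsafe_fraction(scores, threshold=0.5):
    """Fraction of generations at or above the (assumed) toxicity threshold."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s >= threshold) / len(scores)


def mitigation_effect(base_scores, mitigated_scores, threshold=0.5):
    """Compare unsafe fractions for the same prompts with and without mitigation."""
    if len(base_scores) != len(mitigated_scores):
        raise ValueError("both runs must cover the same prompt set")
    base_unsafe = unsafe_fraction(base_scores, threshold)
    mitigated_unsafe = unsafe_fraction(mitigated_scores, threshold)
    # Relative reduction is undefined when the base model is never unsafe.
    reduction = None if base_unsafe == 0 else 1 - mitigated_unsafe / base_unsafe
    return {
        "base_safety": 1 - base_unsafe,
        "mitigated_safety": 1 - mitigated_unsafe,
        "relative_reduction": reduction,
    }


# Scenario A (hypothetical): the base model is unsafe on 20 of 200 prompts.
effective = mitigation_effect([0.9] * 20 + [0.1] * 180, [0.9] * 3 + [0.1] * 197)

# Scenario B (hypothetical): the base model is already unsafe on only 3 of 200.
no_gain = mitigation_effect([0.9] * 3 + [0.1] * 197, [0.9] * 3 + [0.1] * 197)

print(effective)
print(no_gain)
```

Both scenarios end at a 98.5% mitigated safety rate, yet only the first shows the method doing any work, which is the distinction the missing baseline would resolve.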

Circularity Check

0 steps flagged

No circularity: all results are direct empirical measurements on external benchmarks

The paper is a replication study structured around three phases of direct measurement: baseline toxicity on RealToxicityPrompts with GPT-2, evaluation of DExperts on explicit toxicity, and stress-testing on the independent ToxiGen adversarial dataset. No equations, derivations, fitted parameters renamed as predictions, or self-citations appear as load-bearing steps in the reported results. The 100% and 98.5% safety rates, latency figures, and robustness-gap conclusion are presented as outcomes of these measurements against fixed external benchmarks, with no reduction to self-definitional inputs or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This empirical replication relies on the validity of prior toxicity benchmarks and the faithful re-implementation of DExperts; no new mathematical axioms, free parameters, or invented entities are introduced.
