arxiv: 2507.05660 · v3 · pith:NH4KH3L5new · submitted 2025-07-08 · 💻 cs.CR · cs.AI · cs.CL

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Pith reviewed 2026-05-22 12:13 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL
keywords toxicity mitigation · LLM fine-tuning · safety defense · Direct Preference Optimization · healing data · conversational AI · adversarial resilience

The pith

Optimus mitigates toxicity during fine-tuning of conversational AI even when using highly biased toxicity classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Optimus, a defense framework that reduces the risk of toxic behavior when large language models are customized on untrusted datasets. It repurposes the existing safety alignment in LLMs as a training-free way to classify toxic content, then uses synthetic healing data together with Direct Preference Optimization (DPO) to realign the model. A reader would care because fine-tuning LLMs for specific conversational uses is popular but can introduce harmful outputs, and Optimus aims to make this process safer without requiring perfect detection tools or sacrificing conversational utility. The reported evaluations show it outperforms the prior defense StarDSS and holds up against attacks designed to bypass it.

Core claim

Optimus combines two components: a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and a dual-strategy alignment process that uses synthetic healing data and Direct Preference Optimization (DPO) to steer models toward safety. The paper reports that this mitigates toxicity even when toxicity classifiers suffer up to 85% degradation in recall, outperforms StarDSS, and resists adaptive adversarial and jailbreak attacks.
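For readers unfamiliar with DPO, a minimal sketch of the per-example loss may help. It follows the standard formulation from Rafailov et al. (reference [59]) and is not the authors' training code. The log-probability inputs are made-up values, and the default β = 0.3 is borrowed from the Figure 6 excerpt reproduced on this page.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.3):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative inputs only: the policy prefers the safe (chosen) response
# more strongly than the reference model does, so the loss drops below log(2).
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and reference agree exactly, the loss is log(2). It shrinks as the policy raises the safe response's likelihood relative to the toxic one.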

What carries the argument

The training-free toxicity classification scheme repurposing safety alignment of LLMs, paired with synthetic healing data and DPO for dual-strategy alignment.
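The idea of reusing safety alignment as a classifier can be sketched as follows: hand a sample to a safety-aligned model and treat a refusal as a toxicity flag. This is an illustration under stated assumptions, not the paper's implementation. The prompt wording, the refusal phrase list, and the `toy_generate` stand-in are all hypothetical; the paper's actual prompt (Figure 3) is not reproduced on this page.

```python
from typing import Callable

# ASSUMPTION: illustrative refusal phrases, not taken from the paper.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry", "i won't")


def looks_like_refusal(reply: str) -> bool:
    """Return True if the reply contains a common refusal phrase."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def flag_toxic(context: str, response: str, generate: Callable[[str], str]) -> bool:
    """Flag a context-response pair as toxic when the aligned model refuses it."""
    # ASSUMPTION: hypothetical prompt wording chosen for this sketch.
    prompt = (
        "Repeat the following conversation exactly.\n"
        f"Context: {context}\nResponse: {response}"
    )
    return looks_like_refusal(generate(prompt))


def toy_generate(prompt: str) -> str:
    """Stand-in for a safety-aligned LLM: refuses when it sees a placeholder token."""
    if "<slur>" in prompt:
        return "I'm sorry, but I can't repeat that."
    return prompt


print(flag_toxic("How was your day?", "Fine, thanks.", toy_generate))     # False
print(flag_toxic("How was your day?", "Shut up, <slur>.", toy_generate))  # True
```

Phrase matching is a crude refusal detector. A benign sample that happens to contain "I can't" would be flagged here, so a real pipeline would need a more careful check.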

If this is right

  • Customizing LLMs on untrusted data can be done with lower risk of injecting toxic behaviors.
  • Defense performance does not depend on having highly accurate toxicity classifiers.
  • Models maintain their conversational utility after the safety alignment process.
  • Protection extends to resisting post-fine-tuning adversarial and jailbreak attempts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that built-in safety mechanisms in LLMs may contain extractable signals useful for other safety tasks without retraining.
  • Similar repurposing techniques could be tested for mitigating other unwanted behaviors like bias or hallucinations in fine-tuned models.
  • Deployments of fine-tuned conversational AI in sensitive areas might become more feasible with such lightweight defenses.

Load-bearing premise

The safety alignment already present in commodity LLMs can be directly repurposed as a reliable, training-free toxicity classifier that remains effective across different base models and datasets.

What would settle it

Testing Optimus on a new base model and dataset where the repurposed safety alignment fails to identify most toxic samples in the fine-tuning data, then checking if toxicity in model outputs remains low after the alignment step.

Figures

Figures reproduced from arXiv: 2507.05660 by Aravind Cheruvu, Bimal Viswanath, Daphne Yao, Murtuza Jadliwala, Nicholas Ka-Shing Kong, Shravya Kanchi, Sifat Muhammad Abdullah.

Figure 1: Mitigating toxicity using TuneShield.
Adjacent text: the LLM can be untrustworthy and contain problematic conversations or toxic language. Prior work has rigorously studied how an attack that poisons the training dataset with toxic language can controllably inject toxicity into a chatbot [78], i.e., the chatbot learns to produce toxic responses. This will cause real harm to its users, especially for deployments that exp…

Figure 2: Architecture of TuneShield framework.
Adjacent text: (1) Constantly evolving base models and fine-tuning strategies. As foundation models advance, their capacity, architecture, and training objectives evolve, leading to changes in fine-tuning strategies [27]. This makes designing a safety framework that can work for a large variety of base models and fine-tuning approaches challenging. (2) Mitigating toxicity while pr…

Figure 3: Prompt for the Refusal approach.

Figure 4: Instruction for contextual healing.
Adjacent text: Step 2: Model fine-tuning. We fine-tune the base model on the training dataset that was updated with the healing data. In this updated training dataset, any context-response pair flagged as toxic in the original dataset is replaced by a 'healed' context-response pair. Step 3: Model alignment using DPO. Step 2 can mitigate toxicity in some cases, but we find that a biase…

Figure 5: F1-scores for toxic class in the offensive and …

Figure 6: Instruction for contextual healing.
Adjacent text: D. TuneShield: Defense Evaluation. Model alignment using DPO. We fine-tune the LLaMA-2 model using DPO with the same set of hyperparameters but with a LR of 5e-6 and β = 0.3. We fine-tune for 2 and 3 epochs respectively for the offensive and specialized categories. E. Adversarial Robustness: Dialog-based Learning. We use the same hyperparameters from Weeks et al. [78] and fine…

Figure 7: Prompt template to generate adversarial response.

Figure 10: Refusal classifier prompt with manually-designed …

Figure 11: Refusal classifier prompt with optimization-based …

Figure 9: Healing data generation prompt with jailbreak attack …
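
Step 2 in the text next to Figure 4 swaps every flagged context-response pair for a healed pair before fine-tuning. A minimal sketch of that dataset update follows, assuming hypothetical `is_toxic` and `heal` callables; in the paper these roles are played by the toxicity classifier and an LLM-driven healing prompt, neither of which is reproduced here.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (context, response)


def build_healed_dataset(
    pairs: List[Pair],
    is_toxic: Callable[[str, str], bool],
    heal: Callable[[str, str], Pair],
) -> Tuple[List[Pair], int]:
    """Replace each flagged pair with its healed version; return the new data and the swap count."""
    healed: List[Pair] = []
    replaced = 0
    for context, response in pairs:
        if is_toxic(context, response):
            healed.append(heal(context, response))
            replaced += 1
        else:
            healed.append((context, response))
    return healed, replaced


# Toy stand-ins (hypothetical): a keyword check and a fixed polite rewrite.
def toy_is_toxic(context: str, response: str) -> bool:
    return "<slur>" in response


def toy_heal(context: str, response: str) -> Pair:
    return context, "I'd rather keep this conversation respectful."


data = [("Hi there", "Hello!"), ("Hi there", "Go away, <slur>.")]
healed_data, num_replaced = build_healed_dataset(data, toy_is_toxic, toy_heal)
print(num_replaced, healed_data)
```

Returning the swap count alongside the data makes it easy to see how many samples a weak classifier actually caught, which matters when recall is degraded.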
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Optimus, a defense framework to mitigate toxicity risks when fine-tuning conversational LLMs on untrusted data. It proposes a training-free toxicity classifier that repurposes the safety alignment already present in commodity LLMs, generates synthetic healing data from this signal, and applies Direct Preference Optimization (DPO) to steer the model toward safer behavior while preserving utility. Key empirical claims include effective toxicity mitigation even under classifiers with up to 85% recall degradation, outperformance versus the prior StarDSS baseline, and resilience to adaptive adversarial and jailbreak attacks. The work provides open-source code and datasets.

Significance. If the central robustness claims hold under the reported conditions, the work is significant for practical LLM fine-tuning pipelines, where high-quality toxicity detectors are often unavailable or biased. The training-free reuse of existing safety alignments is an efficient approach that avoids additional training overhead for detection. Explicit open-sourcing of code and datasets is a strength that supports reproducibility and follow-on work in the field.

major comments (1)
  1. [§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.
minor comments (2)
  1. [§3.2] The abstract and §3.2 use the term 'healing data' without a concise formal definition or pseudocode for its generation process; a short algorithm box would improve clarity.
  2. [Table 3] Table 3 (or equivalent results table) reports performance numbers but does not include standard error or statistical significance tests across the multiple runs; adding these would strengthen the comparison to StarDSS.
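
The second minor comment asks for standard errors and significance tests across runs. One way to compute them, using only the Python standard library, is sketched below. The function names are illustrative, and the scores are made-up placeholders rather than results from the paper.

```python
import math
import statistics


def mean_and_stderr(scores):
    """Return the sample mean and the standard error of the mean (needs at least 2 runs)."""
    n = len(scores)
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(n)


def welch_t(a, b):
    """Welch's t statistic for two independent sets of runs with unequal variances.

    Raises ZeroDivisionError if both sets have zero variance.
    """
    mean_a, se_a = mean_and_stderr(a)
    mean_b, se_b = mean_and_stderr(b)
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Placeholder toxicity rates (%) over three hypothetical runs per defense.
defense_a_runs = [2.1, 2.4, 1.9]
defense_b_runs = [4.8, 5.3, 5.0]
print(mean_and_stderr(defense_a_runs))
print(welch_t(defense_a_runs, defense_b_runs))
```

Turning the t statistic into a p-value requires the Welch-Satterthwaite degrees of freedom and a t distribution, which a statistics package such as SciPy provides.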

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully considered the major comment on the experimental evaluation and provide a point-by-point response below. We agree that the requested additions will improve clarity and will incorporate them in the revised version.

  1. Referee: [§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.

    Authors: We agree that quantifying the number of toxic instances detected at each recall degradation level will provide valuable transparency into the strength of the training-free signal under bias. In the revised manuscript, we will add a table in §4 reporting, for each degradation level (including up to 85%), the exact count of toxic examples identified by the repurposed LLM classifier and the number subsequently used to synthesize the healing data. This will allow readers to directly assess how the volume of the safety signal changes with classifier quality. Regarding the DPO-only baseline, we acknowledge that an explicit comparison isolating the contribution of the classifier-generated healing data is useful to rule out generic alignment effects. We will add this baseline by running DPO on the fine-tuning dataset without any classifier-derived preferences or healing data, and include the results alongside the full Optimus pipeline in the updated experiments. These revisions will directly address the load-bearing nature of the robustness claim.

    revision: yes
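
The table the authors promise amounts to simple arithmetic once a degradation convention is fixed. The sketch below assumes that "degradation" is relative to a baseline recall, which this page does not confirm; the function name and every number are placeholders, not figures from the paper.

```python
def detected_counts(num_toxic, base_recall, degradations):
    """Return (degradation, effective recall, detected count) for each degradation level."""
    rows = []
    for d in degradations:
        # ASSUMPTION: degradation is relative, i.e. recall = base_recall * (1 - d).
        recall = base_recall * (1.0 - d)
        rows.append((d, recall, round(num_toxic * recall)))
    return rows


# Placeholder inputs: 1000 toxic samples and a baseline recall of 0.8.
for d, recall, count in detected_counts(1000, 0.8, [0.0, 0.25, 0.5, 0.85]):
    print(f"degradation={d:.2f} recall={recall:.2f} detected={count}")
```

Under this convention the classifier at 85% degradation still flags a small fraction of the toxic samples, which is the quantity the referee wants reported alongside the mitigation results.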

Circularity Check

0 steps flagged

No circularity: empirical evaluation of training-free classifier and DPO alignment on external benchmarks

full rationale

The paper presents an empirical defense framework relying on repurposing LLM safety alignment as a toxicity signal, generating healing data, and applying DPO. All central claims are supported by experimental results comparing against baselines like StarDSS, testing under controlled recall degradation, and evaluating resilience to attacks. No equations, derivations, or self-referential fitting are described; the method is not shown to reduce to its inputs by construction. The approach is self-contained against external datasets and benchmarks, with no load-bearing self-citation chains or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central approach depends on the effectiveness of existing LLM safety alignments for classification and on the utility of synthetic healing data plus DPO for steering behavior.

axioms (1)
  • domain assumption Safety alignments in commodity LLMs can be repurposed for accurate toxicity classification without additional training.
    This underpins the training-free classification scheme.

pith-pipeline@v0.9.0 · 5739 in / 1195 out tokens · 48957 ms · 2026-05-22T12:13:08.907419+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reasoning Structure Matters for Safety Alignment of Reasoning Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    https://platform.openai.com/docs/guides/moderation

    OpenAI moderation API. https://platform.openai.com/docs/guides/moderation

  2. [2]

    https://platform.openai.com/docs/api-reference/moderations/object

    OpenAI moderation API categories. https://platform.openai.com/docs/api-reference/moderations/object

  3. [3]

    https://perspectiveapi.com/

    Perspective API. https://perspectiveapi.com/

  4. [4]

    The CRINGE Loss: Learning what language not to model

    Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The CRINGE Loss: Learning what language not to model. In Proc. of ACL , 2023

  5. [5]

    https://aws.amazon.com/bedrock/, 2024

  6. [6]

    Revisiting Contextual Toxicity Detection in Conversations

    Atijit Anuchitanukul, Julia Ive, and Lucia Specia. Revisiting Contextual Toxicity Detection in Conversations. Proc. of CoRR abs/2111.12447 , 2021

  7. [7]

    https://azure.microsoft.com/en-us/products/ai-services/openai-service, 2024

  8. [8]

    Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts

    Ashutosh Baheti, Maarten Sap, Alan Ritter, and Mark Riedl. Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts. In Proc. of EMNLP , 2021

  9. [9]

    Assessing political prudence of open-domain chatbots

    Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, and Pascale Fung. Assessing political prudence of open-domain chatbots. In Proc. of SIGDIAL, 2021

  10. [11]

    Language Models are Few-Shot Learners

    Tom Brown et al. Language Models are Few-Shot Learners. In Proc. of NIPS, 2020

  11. [12]

    Poisoning Web-Scale Training Datasets is Practical

    N. Carlini, M. Jagielski, C. Choquette-Choo, D. Paleka, W. Pearce, H. Anderson, A. Terzis, K. Thomas, and F. Tramèr. Poisoning Web-Scale Training Datasets is Practical. In Proc. of IEEE S&P, 2024

  12. [13]

    Are aligned neural networks adversarially aligned?

    Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? In Proc. of NIPS , 2023

  13. [14]

    I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language

    Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya, and Michael Granitzer. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language. In Proc. of LREC, 2020

  14. [15]

    Introducing ChatGPT https://openai.com/blog/chatgpt/, 2022

  15. [16]

    Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots

    Bocheng Chen, Guangjing Wang, Hanqing Guo, Yuanda Wang, and Qiben Yan. Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots. In Proc. of RAID , 2023

  16. [17]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    Wei-Lin Chiang et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023

  17. [18]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung et al. Scaling Instruction-Finetuned Language Models. Proc. of CoRR abs/2210.11416 , 2022

  18. [19]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Proc. of CoRR abs/2305.14314, 2023

  19. [20]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of ACL , 2019

  20. [21]

    SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems

    Emily Dinan, Gavin Abercrombie, Ari Bergman, Shannon L. Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems. In Proc. of ACL, 2022

  21. [22]

    Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack

    Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. In Proc. of EMNLP , 2019

  22. [23]

    A Survey on Automatic Detection of Hate Speech in Text

    Paula Fortuna and Sérgio Nunes. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR) , 2018

  23. [24]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    Samuel Gehman, S Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proc. of EMNLP , 2020

  24. [25]

    Chatbot AI makes racist judgements on the basis of dialect

    Elizabeth Gibney. Chatbot AI makes racist judgements on the basis of dialect. https://www.nature.com/articles/d41586-024-00779-1, 2024

  25. [26]

    Demystifying Prompts in Language Models via Perplexity Estimation

    Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying Prompts in Language Models via Perplexity Estimation. Proc. of CoRR abs/2212.04037 , 2022

  26. [27]

    Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Proc. of CoRR abs/2403.14608, 2024

  27. [28]

    Detoxify

    Laura Hanu. Detoxify. https://github.com/unitaryai/detoxify, 2021

  28. [29]

    You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content

    Xinlei He, Savvas Zannettou, Yun Shen, and Yang Zhang. You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content. Proc. of CoRR abs/2308.05596, 2023

  29. [30]

    https://huggingface.co/, 2024

  30. [31]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. Proc. of CoRR abs/2106.09685, 2021

  31. [32]

    Toxicity detection for free

    Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, and David Wagner. Toxicity Detection for Free. Proc. of CoRR abs/2405.18822 , 2024

  32. [33]

    GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

    L Huang, Z Ye, J Qin, L Lin, and X Liang. GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. In Proc. of EMNLP, 2020

  33. [34]

    OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

    Srinivas Iyer et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. Proc. of CoRR abs/2212.12017, 2022

  34. [35]

    A systematic review of Hate Speech automatic detection using Natural Language Processing

    Md Saroar Jahan and Mourad Oussalah. A systematic review of Hate Speech automatic detection using Natural Language Processing. Neurocomputing, 2023

  35. [36]

    Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

    Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proc. of AAAI , 34:8018–8025, 2020

  36. [37]

    Constructing interval variables via faceted rasch measurement and multitask deep learning: a hate speech application

    Chris J Kennedy, Geoff Bacon, Alexander Sahn, and Claudia von Vacano. Constructing interval variables via faceted rasch measurement and multitask deep learning: a hate speech application. Proc. of CoRR abs/2009.10277, 2020

  37. [38]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation. In Proc. of CoRR abs/1909.05858 , 2019

  38. [39]

    Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents

    San Kim and Gary Geunbae Lee. Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents. In Proc. of CoRR abs/2405.12900 , 2024

  39. [40]

    Understanding the Effects of RLHF on LLM Generalisation and Diversity

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the Effects of RLHF on LLM Generalisation and Diversity. In Proc. of ICLR, 2024

  40. [41]

    GeDi: Generative Discriminator Guided Sequence Generation

    Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative Discriminator Guided Sequence Generation. In Proc. of EMNLP Findings, 2021

  41. [42]

    Amplegcg-plus: A strong generative model of adversarial suffixes to jailbreak llms with higher success rates in fewer attempts

    Vishal Kumar, Zeyi Liao, Jaylen Jones, and Huan Sun. Amplegcg-plus: A strong generative model of adversarial suffixes to jailbreak llms with higher success rates in fewer attempts. CoRR abs/2410.22143, 2024

  42. [43]

    ‘No, Alexa, no!’: designing child-safe AI and protecting children from the risks of the ‘empathy gap’ in large language models

    Nomisha Kurian. ‘No, Alexa, no!’: designing child-safe AI and protecting children from the risks of the ‘empathy gap’ in large language models. Taylor & Francis Learning, Media and Technology, 2024

  43. [44]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Mike Lewis et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proc. of ACL , 2020

  44. [45]

    BERT-ATTACK: Adversarial Attack Against BERT Using BERT

    Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proc. of EMNLP , 2020

  45. [46]

    DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset

    Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proc. of IJCNLP , 2017

  46. [47]

    Focal Loss for Dense Object Detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. In Proc. of CoRR abs/1708.02002, 2017

  47. [48]

    DEXPERTS: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

    Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A Smith, and Yejin Choi. DEXPERTS: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proc. of ACL, 2021

  48. [49]

    Automatic and universal prompt injection attacks against large language models

    Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. CoRR abs/2403.04957, 2024

  49. [50]

    Formalizing and Benchmarking Prompt Injection Attacks and Defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In Proc. of USENIX Security , 2024

  50. [51]

    Using In-Context Learning to Improve Dialogue Safety

    Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Z. Hakkani-Tür. Using In-Context Learning to Improve Dialogue Safety. In Proc. of EMNLP Findings, 2023

  51. [52]

    Prompt shields in azure ai content safety

    Microsoft. Prompt shields in azure ai content safety. https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection, October 2024. Accessed: 2025-04-23

  52. [53]

    https://platform.openai.com/docs/guides/fine-tuning, 2024

  53. [54]

    Training language models to follow instructions with human feedback

    Long Ouyang et al. Training language models to follow instructions with human feedback. In Proc. of NIPS , 2022

  54. [55]

    Toxicity Detection: Does Context Really Matter?

    John Pavlopoulos, Jeffrey Scott Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. Toxicity Detection: Does Context Really Matter? In Proc. of ACL, 2020

  55. [56]

    Respectful or Toxic? Using Zero-Shot Learning with Language Models to Detect Hate Speech

    Flor Miriam Plaza-Del-Arco, Debora Nozza, and Dirk Hovy. Respectful or Toxic? Using Zero-Shot Learning with Language Models to Detect Hate Speech. In Proc. of WOAH, 2023

  56. [57]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! Proc. of CoRR abs/2310.03693, 2023

  57. [58]

    Language Model Detoxification in Dialogue with Contextualized Stance Control

    Jingu Qian and Xifeng Yan. Language Model Detoxification in Dialogue with Contextualized Stance Control. In Proc. of EMNLP Findings, 2023

  58. [59]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proc. of NeurIPS, 2023

  59. [60]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 2020

  60. [61]

    DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. In Proc. of SIGKDD, 2020

  61. [62]

    Recipes for Building an Open-Domain Chatbot

    Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. Recipes for Building an Open-Domain Chatbot. In Proc. of ACL , 2021

  62. [63]

    Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

    Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. In Proc. of ACL, 2021

  63. [64]

    A Survey on Hate Speech Detection using Natural Language Processing

    Anna Schmidt and Michael Wiegand. A Survey on Hate Speech Detection using Natural Language Processing. In Proc. of ACL SocialNLP, 2017

  64. [65]

    AI chatbot blamed for Belgian man’s suicide

    Mark Sellman. AI chatbot blamed for Belgian man’s suicide. https://www.thetimes.com/business-money/technology/article/ai-chatbot-blamed-for-belgian-mans-suicide-zcjzlztcc, 2023

  65. [66]

    BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

    Kurt Shuster et al. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. Proc. of CoRR abs/2208.03188, 2022

  66. [67]

    Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

    Wai Man Si, M Backes, J Blackburn, E De Cristofaro, G Stringhini, S Zannettou, and Y Zhang. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. In Proc. of the ACM CCS , 2022

  67. [68]

    A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

    Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C Park. A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit. In Proc. of CoNLL, 2021

  68. [69]

    Conceptnet 5.5: An open multilingual graph of general knowledge

    R Speer et al. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proc of AAAI , 2017

  69. [70]

    On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark

    Hao Sun, Guangxuan Xu, Deng Jiawen, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang. On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark. In Proc. of ACL Findings , 2021

  70. [71]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. Proc. of CoRR abs/2307.09288 , 2023

  71. [72]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. Proc. of CoRR abs/2302.13971 , 2023

  72. [73]

    Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models

    N. Vishwamitra, K. Guo, F. Romit, I. Ondracek, L. Cheng, Z. Zhao, and H. Hu. Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models. In Proc. of IEEE S&P , 2024

  73. [74]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl, 2020

  74. [75]

    DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

    Boxin Wang et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proc. of NIPS , 2024

  75. [76]

    Adversarial glue: A multi-task benchmark for robustness evaluation of language models

    Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and Bo Li. Adversarial glue: A multi-task benchmark for robustness evaluation of language models. Proc. of CoRR abs/2111.02840 , 2021

  76. [77]

    Understanding Abuse: A Typology of Abusive Language Detection Subtasks

    Z Waseem, T Davidson, D Warmsley, and I Weber. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proc. of ALW Workshop, 2017

  77. [78]

    A First Look at Toxicity Injection Attacks on Open-domain Chatbots

    Connor Weeks, Aravind Cheruvu, Sifat Muhammad Abdullah, Shravya Kanchi, Danfeng (Daphne) Yao, and Bimal Viswanath. A First Look at Toxicity Injection Attacks on Open-domain Chatbots. In Proc. of ACSAC, 2023

  78. [79]

    Context Sensitivity Estimation in Toxicity Detection

    Alexandros Xenos, John Pavlopoulos, and Ion Androutsopoulos. Context Sensitivity Estimation in Toxicity Detection. In Proc. of WOAH, 2021

  79. [80]

    Toxicity Detection can be Sensitive to the Conversational Context

    Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, and Leo Laugier. Toxicity Detection can be Sensitive to the Conversational Context. In Proc. of WOAH, 2021

  80. [81]

    Assessing Dialogue Systems with Distribution Distances

    Jiannan Xiang, Yahui Liu, Deng Cai, Huayang Li, Defu Lian, and Lemao Liu. Assessing Dialogue Systems with Distribution Distances. In Proc. of ACL Findings , 2021

Showing first 80 references.