Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Aravind Cheruvu; Bimal Viswanath; Daphne Yao; Murtuza Jadliwala; Nicholas Ka-Shing Kong; Shravya Kanchi; Sifat Muhammad Abdullah

arXiv: 2507.05660 · v3 · pith:NH4KH3L5new · submitted 2025-07-08 · 💻 cs.CR · cs.AI · cs.CL

Pith reviewed 2026-05-22 12:13 UTC · model grok-4.3

keywords: toxicity mitigation · LLM fine-tuning · safety defense · Direct Preference Optimization · healing data · conversational AI · adversarial resilience

The pith

Optimus mitigates toxicity during fine-tuning of conversational AI even when using highly biased toxicity classifiers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Optimus as a defense framework that reduces the risk of toxic behaviors when customizing large language models on untrusted datasets. It achieves this by repurposing the existing safety alignment in LLMs as a training-free way to classify toxic content and then uses synthetic healing data together with direct preference optimization to realign the model. A reader would care because fine-tuning LLMs for specific conversations is popular but can introduce harmful outputs, and Optimus aims to make this process safer without needing perfect detection tools or losing the model's usefulness in conversation. The evaluations show it works better than prior defenses and holds up against attacks designed to bypass it.

Core claim

Optimus combines two components: a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and a dual-strategy alignment process that uses synthetic healing data and Direct Preference Optimization (DPO) to steer models toward safety. The paper reports that this mitigates toxicity even when toxicity classifiers suffer up to 85% degradation in recall, that it outperforms StarDSS, and that it is resilient to adaptive adversarial and jailbreak attacks.

What carries the argument

The training-free toxicity classification scheme repurposing safety alignment of LLMs, paired with synthetic healing data and DPO for dual-strategy alignment.
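As a rough illustration of the mechanism named above, the sketch below flags a context-response pair as toxic when an aligned model refuses to process it, then swaps flagged pairs for healed ones. Everything concrete here is an assumption made for illustration: the prompt wording, the `REFUSAL_MARKERS` list, and the `toy_generate` and `toy_heal` stand-ins are hypothetical and are not taken from the paper, whose actual prompts appear in its Figures 3 and 4.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (context, response)

# Hypothetical refusal phrases; a real system would need a more careful check.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_refusal(reply: str) -> bool:
    """Return True when the model's reply contains a refusal phrase."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def flag_toxic(pair: Pair, generate: Callable[[str], str]) -> bool:
    """Zero-shot check: ask an aligned LLM to repeat the pair; a refusal flags it."""
    context, response = pair
    # Assumed prompt wording, not the paper's prompt.
    prompt = (
        "Repeat the following conversation verbatim.\n"
        f"Context: {context}\nResponse: {response}"
    )
    return is_refusal(generate(prompt))


def heal_dataset(
    pairs: List[Pair],
    generate: Callable[[str], str],
    heal: Callable[[Pair], Pair],
) -> List[Pair]:
    """Replace every flagged pair with a healed pair; keep the rest unchanged."""
    return [heal(p) if flag_toxic(p, generate) else p for p in pairs]


def toy_generate(prompt: str) -> str:
    """Stand-in for an aligned LLM: refuses when a placeholder bad word appears."""
    if "BADWORD" in prompt:
        return "I'm sorry, I can't help with that."
    return prompt


def toy_heal(pair: Pair) -> Pair:
    """Stand-in healer: keeps the context and substitutes a benign response."""
    context, _ = pair
    return (context, "Let's keep this conversation respectful.")
```

With the toy stand-ins, `heal_dataset([("hi", "hello"), ("you are", "BADWORD")], toy_generate, toy_heal)` leaves the first pair untouched and replaces the response of the second. No classifier is trained at any point; the only signal is whether the aligned model refuses.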

If this is right

Customizing LLMs on untrusted data can be done with lower risk of injecting toxic behaviors.
Defense performance does not depend on having highly accurate toxicity classifiers.
Models maintain their conversational utility after the safety alignment process.
Protection extends to resisting post-fine-tuning adversarial and jailbreak attempts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that built-in safety mechanisms in LLMs may contain extractable signals useful for other safety tasks without retraining.
Similar repurposing techniques could be tested for mitigating other unwanted behaviors like bias or hallucinations in fine-tuned models.
Deployments of fine-tuned conversational AI in sensitive areas might become more feasible with such lightweight defenses.

Load-bearing premise

The safety alignment already present in commodity LLMs can be directly repurposed as a reliable, training-free toxicity classifier that remains effective across different base models and datasets.

What would settle it

Testing Optimus on a new base model and dataset where the repurposed safety alignment fails to identify most toxic samples in the fine-tuning data, then checking if toxicity in model outputs remains low after the alignment step.

Figures

Figures reproduced from arXiv: 2507.05660 by Aravind Cheruvu, Bimal Viswanath, Daphne Yao, Murtuza Jadliwala, Nicholas Ka-Shing Kong, Shravya Kanchi, Sifat Muhammad Abdullah.

**Figure 1.** Mitigating toxicity using TuneShield. … the LLM can be untrustworthy and contain problematic conversations or toxic language. Prior work has rigorously studied how an attack that poisons the training dataset with toxic language can controllably inject toxicity into a chatbot [78], i.e., the chatbot learns to produce toxic responses. This will cause real harm to its users, especially for deployments that exp…

**Figure 2.** Architecture of TuneShield framework. (1) Constantly evolving base models and fine-tuning strategies. As foundation models advance, their capacity, architecture, and training objectives evolve, leading to changes in fine-tuning strategies [27]. This makes it challenging to design a safety framework that works for a large variety of base models and fine-tuning approaches. (2) Mitigating toxicity while pr…

**Figure 3.** Prompt for the Refusal approach.

**Figure 4.** Instruction for contextual healing. Step 2: Model fine-tuning. We fine-tune the base model on the training dataset that was updated with the healing data. In this updated training dataset, any context-response pair flagged as toxic in the original dataset is replaced by a ‘healed’ context-response pair. Step 3: Model alignment using DPO. Step 2 can mitigate toxicity in some cases, but we find that a biase…

**Figure 5.** F1-scores for toxic class in the offensive and …

**Figure 6.** Instruction for contextual healing. D. TuneShield: Defense Evaluation. Model alignment using DPO. We fine-tune the LLaMA-2 model using DPO with the same set of hyperparameters but with a LR of 5e-6 and β = 0.3. We fine-tune for 2 and 3 epochs respectively for the offensive and specialized categories. E. Adversarial Robustness: Dialog-based Learning. We use the same hyperparameters from Weeks et al. [78] and fine…

**Figure 7.** Prompt template to generate adversarial response.

**Figure 9.** Healing data generation prompt with jailbreak attack

**Figure 10.** Refusal classifier prompt with manually-designed …

**Figure 11.** Refusal classifier prompt with optimization-based …

Original abstract

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
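For readers unfamiliar with DPO, the per-example loss from Rafailov et al. [59] can be written in a few lines. The sketch below takes summed log-probabilities of a preferred and a dispreferred response under the policy and under a frozen reference model. The default `beta = 0.3` is the value quoted in the text excerpt next to Figure 6; treating the healed response as preferred and the toxic one as dispreferred is an assumption of this sketch, since this page does not spell out how the paper builds its preference pairs.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.3,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the preferred response relative to the reference, and rises above it when the shift goes the other way.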

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Optimus repurposes existing LLM safety alignment as a training-free toxicity detector, then adds synthetic healing data and DPO, but the robustness story needs ablations to show that the detector actually matters.

The main point is that Optimus turns the safety tuning already baked into commodity LLMs into a no-training toxicity filter, generates synthetic healing examples from it, and runs DPO to steer the model away from toxic outputs during fine-tuning on untrusted data. It claims this still works when the filter is badly biased, with recall drops up to 85 percent, and that it beats StarDSS while resisting adaptive attacks and jailbreaks.

Referee Report

1 major / 2 minor

Summary. The paper introduces Optimus, a defense framework to mitigate toxicity risks when fine-tuning conversational LLMs on untrusted data. It proposes a training-free toxicity classifier that repurposes the safety alignment already present in commodity LLMs, generates synthetic healing data from this signal, and applies Direct Preference Optimization (DPO) to steer the model toward safer behavior while preserving utility. Key empirical claims include effective toxicity mitigation even under classifiers with up to 85% recall degradation, outperformance versus the prior StarDSS baseline, and resilience to adaptive adversarial and jailbreak attacks. The work provides open-source code and datasets.

Significance. If the central robustness claims hold under the reported conditions, the work is significant for practical LLM fine-tuning pipelines, where high-quality toxicity detectors are often unavailable or biased. The training-free reuse of existing safety alignments is an efficient approach that avoids additional training overhead for detection. Explicit open-sourcing of code and datasets is a strength that supports reproducibility and follow-on work in the field.

major comments (1)

[§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.
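The count the referee asks for is simple to produce once per-sample labels and classifier predictions are available. The sketch below is a hypothetical illustration: `degrade_recall` simulates a biased classifier by discarding a fixed fraction of true detections, which may differ from how the paper induces degradation, and both function names are invented for this example.

```python
import random
from typing import List, Tuple


def recall_and_detected(labels: List[int], preds: List[int]) -> Tuple[float, int]:
    """Return (recall, number of toxic samples detected); label 1 means toxic."""
    detected = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    total_toxic = sum(labels)
    recall = detected / total_toxic if total_toxic else 0.0
    return recall, detected


def degrade_recall(
    labels: List[int], preds: List[int], drop_fraction: float, seed: int = 0
) -> List[int]:
    """Flip a fraction of true-positive predictions to 0 to mimic a biased classifier."""
    rng = random.Random(seed)
    tp_idx = [i for i, (y, p) in enumerate(zip(labels, preds)) if y == 1 and p == 1]
    n_drop = round(drop_fraction * len(tp_idx))
    dropped = set(rng.sample(tp_idx, n_drop))
    return [0 if i in dropped else p for i, p in enumerate(preds)]
```

On a made-up dataset of 100 toxic and 100 clean samples that a perfect classifier labels correctly, `degrade_recall(labels, preds, 0.85)` followed by `recall_and_detected` reports 15 detected toxic samples, a recall of 0.15. Under this reading of "85% degradation", only those 15 samples would be candidates for healing, which is why the referee wants the real counts reported.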

minor comments (2)

[§3.2] The abstract and §3.2 use the term 'healing data' without a concise formal definition or pseudocode for its generation process; a short algorithm box would improve clarity.
[Table 3] Table 3 (or equivalent results table) reports performance numbers but does not include standard error or statistical significance tests across the multiple runs; adding these would strengthen the comparison to StarDSS.
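The statistics requested in the second minor comment need only the per-run scores. A minimal sketch using Python's standard library follows; the function names are illustrative, and any scores passed in would have to come from the authors' repeated runs, which this page does not list.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over independent runs (needs n >= 2)."""
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.mean(scores), sem


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error
```

Reporting each table cell as mean ± standard error, plus a test statistic for the Optimus versus StarDSS comparison, would let readers judge whether the gap exceeds run-to-run noise. Converting the t statistic to a p-value requires the Welch-Satterthwaite degrees of freedom and a t-distribution CDF, which this sketch leaves out.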

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We have carefully considered the major comment on the experimental evaluation and provide a point-by-point response below. We agree that the requested additions will improve clarity and will incorporate them in the revised version.

Referee: [§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.

Authors: We agree that quantifying the number of toxic instances detected at each recall degradation level will provide valuable transparency into the strength of the training-free signal under bias. In the revised manuscript, we will add a table in §4 reporting, for each degradation level (including up to 85%), the exact count of toxic examples identified by the repurposed LLM classifier and the number subsequently used to synthesize the healing data. This will allow readers to directly assess how the volume of the safety signal changes with classifier quality. Regarding the DPO-only baseline, we acknowledge that an explicit comparison isolating the contribution of the classifier-generated healing data is useful to rule out generic alignment effects. We will add this baseline by running DPO on the fine-tuning dataset without any classifier-derived preferences or healing data, and include the results alongside the full Optimus pipeline in the updated experiments. These revisions will directly address the load-bearing nature of the robustness claim.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation of training-free classifier and DPO alignment on external benchmarks

The paper presents an empirical defense framework relying on repurposing LLM safety alignment as a toxicity signal, generating healing data, and applying DPO. All central claims are supported by experimental results comparing against baselines like StarDSS, testing under controlled recall degradation, and evaluating resilience to attacks. No equations, derivations, or self-referential fitting are described; the method is not shown to reduce to its inputs by construction. The approach is self-contained against external datasets and benchmarks, with no load-bearing self-citation chains or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central approach depends on the effectiveness of existing LLM safety alignments for classification and on the utility of synthetic healing data plus DPO for steering behavior.

axioms (1)

domain assumption: Safety alignments in commodity LLMs can be repurposed for accurate toxicity classification without additional training.
This underpins the training-free classification scheme.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic 'healing data' with Direct Preference Optimization (DPO)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We propose a zero-shot prompting approach that leverages the instruction following and safety alignment abilities of LLMs. We call this Refusal approach.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning Structure Matters for Safety Alignment of Reasoning Models
cs.AI · 2026-04 · unverdicted · novelty 6.0

Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1] OpenAI moderation API. https://platform.openai.com/docs/guides/moderation
[2] OpenAI moderation API categories. https://platform.openai.com/docs/api-reference/moderations/object
[3] Perspective API. https://perspectiveapi.com/
[4] Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The CRINGE Loss: Learning what language not to model. In Proc. of ACL, 2023
[5] https://aws.amazon.com/bedrock/, 2024
[6] Atijit Anuchitanukul, Julia Ive, and Lucia Specia. Revisiting Contextual Toxicity Detection in Conversations. Proc. of CoRR abs/2111.12447, 2021
[7] https://azure.microsoft.com/en-us/products/ai-services/openai-service, 2024
[8] Ashutosh Baheti, Maarten Sap, Alan Ritter, and Mark Riedl. Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts. In Proc. of EMNLP, 2021
[9] Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, and Pascale Fung. Assessing political prudence of open-domain chatbots. In Proc. of SIGDIAL, 2021
[11] Tom Brown et al. Language Models are Few-Shot Learners. In Proc. of NIPS, 2020
[12] N. Carlini, M. Jagielski, C. Choquette-Choo, D. Paleka, W. Pearce, H. Anderson, A. Terzis, K. Thomas, and F. Tramèr. Poisoning Web-Scale Training Datasets is Practical. In Proc. of IEEE S&P, 2024
[13] Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Florian Tramèr, and Ludwig Schmidt. Are aligned neural networks adversarially aligned? In Proc. of NIPS, 2023
[14] Tommaso Caselli, Valerio Basile, Jelena Mitrović, Inga Kartoziya, and Michael Granitzer. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language. In Proc. of LREC, 2020
[15] Introducing ChatGPT. https://openai.com/blog/chatgpt/, 2022
[16] Bocheng Chen, Guangjing Wang, Hanqing Guo, Yuanda Wang, and Qiben Yan. Understanding Multi-Turn Toxic Behaviors in Open-Domain Chatbots. In Proc. of RAID, 2023
[17] Wei-Lin Chiang et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023
[18] Hyung Won Chung et al. Scaling Instruction-Finetuned Language Models. Proc. of CoRR abs/2210.11416, 2022
[19] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. Proc. of CoRR abs/2305.14314, 2023
[20] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of ACL, 2019
[21] Emily Dinan, Gavin Abercrombie, Ari Bergman, Shannon L. Spruit, Dirk Hovy, Y-Lan Boureau, and Verena Rieser. SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems. In Proc. of ACL, 2022
[22] Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. In Proc. of EMNLP, 2019
[23] Paula Fortuna and Sérgio Nunes. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys (CSUR), 2018
[24] Samuel Gehman, S Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. In Proc. of EMNLP, 2020
[25] Elizabeth Gibney. Chatbot AI makes racist judgements on the basis of dialect. https://www.nature.com/articles/d41586-024-00779-1, 2024
[26] Hila Gonen, Srini Iyer, Terra Blevins, Noah A Smith, and Luke Zettlemoyer. Demystifying Prompts in Language Models via Perplexity Estimation. Proc. of CoRR abs/2212.04037, 2022
[27] Zeyu Han, Chao Gao, Jinyang Liu, Sai Qian Zhang, et al. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey. Proc. of CoRR abs/2403.14608, 2024
[28] Laura Hanu. Detoxify. https://github.com/unitaryai/detoxify, 2021
[29] Xinlei He, Savvas Zannettou, Yun Shen, and Yang Zhang. You Only Prompt Once: On the Capabilities of Prompt Learning on Large Language Models to Tackle Toxic Content. Proc. of CoRR abs/2308.05596, 2023
[30] https://huggingface.co/, 2024
[31] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. Proc. of CoRR abs/2106.09685, 2021
[32] Zhanhao Hu, Julien Piet, Geng Zhao, Jiantao Jiao, and David Wagner. Toxicity Detection for Free. Proc. of CoRR abs/2405.18822, 2024
[33] L Huang, Z Ye, J Qin, L Lin, and X Liang. GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems. In Proc. of EMNLP, 2020
[34] Srinivas Iyer et al. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. Proc. of CoRR abs/2212.12017, 2022
[35] Md Saroar Jahan and Mourad Oussalah. A systematic review of Hate Speech automatic detection using Natural Language Processing. Neurocomputing, 2023
[36] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proc. of AAAI, 34:8018–8025, 2020
[37] Chris J Kennedy, Geoff Bacon, Alexander Sahn, and Claudia von Vacano. Constructing interval variables via faceted rasch measurement and multitask deep learning: a hate speech application. Proc. of CoRR abs/2009.10277, 2020
[38] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A Conditional Transformer Language Model for Controllable Generation. Proc. of CoRR abs/1909.05858, 2019
[39] San Kim and Gary Geunbae Lee. Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents. Proc. of CoRR abs/2405.12900, 2024
[40] Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. Understanding the Effects of RLHF on LLM Generalisation and Diversity. In Proc. of ICLR, 2024
[41] Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty, Richard Socher, and Nazneen Fatema Rajani. GeDi: Generative Discriminator Guided Sequence Generation. In Proc. of EMNLP Findings, 2021
[42] Vishal Kumar, Zeyi Liao, Jaylen Jones, and Huan Sun. Amplegcg-plus: A strong generative model of adversarial suffixes to jailbreak llms with higher success rates in fewer attempts. Proc. of CoRR abs/2410.22143, 2024
[43] Nomisha Kurian. ‘No, Alexa, no!’: designing child-safe AI and protecting children from the risks of the ‘empathy gap’ in large language models. Taylor & Francis Learning, Media and Technology, 2024
[44] Mike Lewis et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proc. of ACL, 2020
[45] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proc. of EMNLP, 2020
[46] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proc. of IJCNLP, 2017
[47] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal Loss for Dense Object Detection. Proc. of CoRR abs/1708.02002, 2017
[48] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A Smith, and Yejin Choi. DEXPERTS: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proc. of ACL, 2021
[49] Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, and Chaowei Xiao. Automatic and universal prompt injection attacks against large language models. Proc. of CoRR abs/2403.04957, 2024
[50] Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and Benchmarking Prompt Injection Attacks and Defenses. In Proc. of USENIX Security, 2024
[51] Nicholas Meade, Spandana Gella, Devamanyu Hazarika, Prakhar Gupta, Di Jin, Siva Reddy, Yang Liu, and Dilek Z. Hakkani-Tür. Using In-Context Learning to Improve Dialogue Safety. In Proc. of EMNLP Findings, 2023
[52] Microsoft. Prompt shields in Azure AI Content Safety. https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection, October 2024. Accessed: 2025-04-23
[53] https://platform.openai.com/docs/guides/fine-tuning, 2024
[54] Long Ouyang et al. Training language models to follow instructions with human feedback. In Proc. of NIPS, 2022
[55] John Pavlopoulos, Jeffrey Scott Sorensen, Lucas Dixon, Nithum Thain, and Ion Androutsopoulos. Toxicity Detection: Does Context Really Matter? In Proc. of ACL, 2020
[56] Flor Miriam Plaza-Del-Arco, Debora Nozza, and Dirk Hovy. Respectful or Toxic? Using Zero-Shot Learning with Language Models to Detect Hate Speech. In Proc. of WOAH, 2023
[57] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! Proc. of CoRR abs/2310.03693, 2023
[58] Jingu Qian and Xifeng Yan. Language Model Detoxification in Dialogue with Contextualized Stance Control. In Proc. of EMNLP Findings, 2023
[59] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proc. of NeurIPS, 2023
[60] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 2020
[61]

DeepSpeed: System Optimizations Enable Training Deep Learning Mod- els with Over 100 Billion Parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System Optimizations Enable Training Deep Learning Mod- els with Over 100 Billion Parameters. In Proc. of SIGKDD , 2020

work page 2020
[62]

Recipes for Building an Open-Domain Chatbot

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. Recipes for Building an Open-Domain Chatbot. In Proc. of ACL , 2021

[63]

Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP

Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. In Proc. of ACL, 2021

[64]

A Survey on Hate Speech Detection using Natural Language Processing

Anna Schmidt and Michael Wiegand. A Survey on Hate Speech Detection using Natural Language Processing. In Proc. of ACL SocialNLP, 2017

[65]

AI chatbot blamed for Belgian man’s suicide

Mark Sellman. AI chatbot blamed for Belgian man’s suicide. https://www.thetimes.com/business-money/technology/article/ai-chatbot-blamed-for-belgian-mans-suicide-zcjzlztcc, 2023

[66]

BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage

Kurt Shuster et al. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. Proc. of CoRR abs/2208.03188, 2022

[67]

Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Wai Man Si, M Backes, J Blackburn, E De Cristofaro, G Stringhini, S Zannettou, and Y Zhang. Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots. In Proc. of the ACM CCS, 2022

[68]

A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song, Soo Hyun Ryu, Huije Lee, and Jong C Park. A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit. In Proc. of CoNLL, 2021

[69]

Conceptnet 5.5: An open multilingual graph of general knowledge

R Speer et al. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proc. of AAAI, 2017

[70]

On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark

Hao Sun, Guangxuan Xu, Deng Jiawen, Jiale Cheng, Chujie Zheng, Hao Zhou, Nanyun Peng, Xiaoyan Zhu, and Minlie Huang. On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark. In Proc. of ACL Findings, 2021

[71]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. Proc. of CoRR abs/2307.09288, 2023

[72]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron et al. LLaMA: Open and Efficient Foundation Language Models. Proc. of CoRR abs/2302.13971, 2023

[73]

Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models

N. Vishwamitra, K. Guo, F. Romit, I. Ondracek, L. Cheng, Z. Zhao, and H. Hu. Moderating New Waves of Online Hate with Chain-of-Thought Reasoning in Large Language Models. In Proc. of IEEE S&P, 2024

[74]

TRL: Transformer Reinforcement Learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl, 2020

[75]

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Boxin Wang et al. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In Proc. of NIPS, 2024

[76]

Adversarial glue: A multi-task benchmark for robustness evaluation of language models

Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng, Jianfeng Gao, Ahmed Hassan Awadallah, and Bo Li. Adversarial glue: A multi-task benchmark for robustness evaluation of language models. Proc. of CoRR abs/2111.02840, 2021

[77]

Understanding Abuse: A Typology of Abusive Language Detection Subtasks

Z Waseem, T Davidson, D Warmsley, and I Weber. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proc. of ALW Workshop, 2017

[78]

A First Look at Toxicity Injection Attacks on Open-domain Chatbots

Connor Weeks, Aravind Cheruvu, Sifat Muhammad Abdullah, Shravya Kanchi, Danfeng (Daphne) Yao, and Bimal Viswanath. A First Look at Toxicity Injection Attacks on Open-domain Chatbots. In Proc. of ACSAC, 2023

[79]

Context Sensitivity Estimation in Toxicity Detection

Alexandros Xenos, John Pavlopoulos, and Ion Androutsopoulos. Context Sensitivity Estimation in Toxicity Detection. In Proc. of WOAH, 2021

[80]

Toxicity Detection can be Sensitive to the Conversational Context

Alexandros Xenos, John Pavlopoulos, Ion Androutsopoulos, Lucas Dixon, Jeffrey Sorensen, and Leo Laugier. Toxicity Detection can be Sensitive to the Conversational Context. In Proc. of WOAH, 2021

[81]

Assessing Dialogue Systems with Distribution Distances

Jiannan Xiang, Yahui Liu, Deng Cai, Huayang Li, Defu Lian, and Lemao Liu. Assessing Dialogue Systems with Distribution Distances. In Proc. of ACL Findings, 2021


